I recently discovered the Super Data Science podcast hosted by Kirill Eremenko. What I like about this podcast series is that it is applied data science. You can talk all day about theory, theorems, proofs, and mathematical details and assumptions. Even if you could master every technical detail underlying 'data science' you would have only scratched the surface. What distinguishes data science from the academic disciplines of statistics, computer science, or machine learning is application to solve a problem for business or society. It's not theory for theory's sake. There are huge gaps between theory and application that can easily stump a team of PhDs or experienced practitioners (see also applied econometrics). Podcasts like this can help bridge the gap.
Episode 014 featured Greg Poppe, who is Sr Vice President for risk management at an auto lending firm. They discussed how data science is leveraged in loan approvals and rate setting, among other things.
The general modeling approach that Greg discussed is very similar to work that I have done before in student risk modeling in higher education (see here and here).
"So think of it like -- you know, I would have a hard time telling you with any high degree of certainty, “This loan will pay. This loan will pay. But this loan won’t.” However, if you give me a portfolio of a hundred loans, I should be able to say “15 aren’t going to pay. I don’t know which 15, but 15 won’t.” And then if you give me another portfolio that’s say riskier, I should be able to measure that risk and say “This is a riskier pool. 25 aren’t going to pay. And again, I don’t know which 25, but I’m estimating 25.” And that’s how we measure our accuracy. So it’s not so much on a loan-by-loan basis. It’s “If we just select a random sample, how many did not pay, and what was our expectation of that?” And if they’re very close, we consider our models to be accurate."
A toy example in R that seems very similar can be found here (Predictive Modeling and Custom Reporting in R).
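The portfolio-level accuracy check Greg describes can also be sketched in a few lines of Python. This is only an illustration on simulated data, not his firm's method: the pool names, the probability ranges, and the helper functions simulate_pool and expected_vs_actual are all made up for the example, and the outcomes are drawn from the predicted probabilities themselves, so the "model" is well calibrated by construction. The point is just the comparison: sum the predicted default probabilities in a pool and compare that to the count of loans that actually did not pay.

```python
import random


def simulate_pool(n_loans, low, high, rng):
    """Simulate one pool of loans (hypothetical data, not real lending data).

    Each loan gets a predicted default probability drawn uniformly from
    [low, high]; its outcome is then drawn from that same probability.
    """
    probs = [rng.uniform(low, high) for _ in range(n_loans)]
    defaults = [1 if rng.random() < p else 0 for p in probs]
    return probs, defaults


def expected_vs_actual(probs, defaults):
    """Return (expected number of defaults, actual number of defaults)."""
    return sum(probs), sum(defaults)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumed probability ranges: roughly 15% and 25% average default risk.
pools = {"standard": (0.05, 0.25), "riskier": (0.15, 0.35)}

results = {}
for name, (low, high) in pools.items():
    probs, defaults = simulate_pool(1000, low, high, rng)
    expected, actual = expected_vs_actual(probs, defaults)
    results[name] = (expected, actual)
    print(f"{name}: expected {expected:.1f} defaults, observed {actual}")
```

We still can't say which loans in each pool will default, but if the expected and observed counts are close pool after pool, the model is doing what Greg needs it to do.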
So at a basic level they are just using predictive models to get a score, using cutoffs to define different pools of risk, and making approvals, declines, and interest rate decisions based on this. He doesn't discuss the specifics of the model testing, but to me the key here sounds a lot like calibration (see Is the ROC curve a good metric for model calibration?). The types of models they use for this are where it gets very interesting. As Kirill says, the whole podcast is worth listening to for this very point. For their credit scoring models they use regression, even though they could get improved performance from other algorithms like decision trees or ensembles. Why?
"so primarily in the credit decisioning models, we use regression models. And the reason why—well, there’s quite a few. One is it’s very computationally easy. It’s easy to explain, it’s easy for people to understand but it’s also not a black box in the sense that a lot of models can be, and what we need to do is we need to provide a continuity to a dealership because they can adjust the parameters of the application and that will adjust the risk accordingly…..If we were to go with a CART model or any other decision tree model, if the first break point or the first cut point in that model is down payment and they go from one side to the other, it can throw it down a completely separate set of decision logic and they can get very strange approvals. From a data science perspective and from an analytics perspective, that may be more accurate but it’s not sellable, it’s not marketable to the dealership."
Yes, a huge gap just filled, and well worth repeating. It's interesting that in a different scenario you could go the other way around. For instance, in my work in higher education student risk modeling we went with decision trees instead of regression, but based on a similar line of reasoning. Our end users, however, were not going to be tweaking parameters, but getting sign off and buy in required that they understand more about what the model was doing. The explicit nature of the splits and decision logic of the trees was easier to explain and understand for people without statistical training than regression models or neural networks were.
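To make Greg's continuity point concrete, here is a small Python sketch comparing how a regression score and a tree score react when a dealer nudges the down payment across a cut point. Everything in it is hypothetical: the coefficients, the 10% cut point, and the leaf risk values are invented for illustration, and a single-split tree is a simplification of the multi-level decision logic he describes.

```python
import math


def logistic_risk(down_payment_pct):
    """Hypothetical logistic regression score.

    Risk falls smoothly as down payment rises. The intercept and slope
    are made-up numbers, not estimates from any real model.
    """
    intercept, slope = -0.5, -0.08
    z = intercept + slope * down_payment_pct
    return 1.0 / (1.0 + math.exp(-z))


def tree_risk(down_payment_pct):
    """Hypothetical one-split tree: risk depends only on the side of the cut."""
    return 0.30 if down_payment_pct < 10.0 else 0.12


for pct in (9.0, 9.9, 10.0, 11.0):
    print(f"down payment {pct:4.1f}%: "
          f"regression {logistic_risk(pct):.3f}, tree {tree_risk(pct):.2f}")

# Change in predicted risk from a tiny move across the cut point.
logistic_jump = abs(logistic_risk(9.9) - logistic_risk(10.0))
tree_jump = abs(tree_risk(9.9) - tree_risk(10.0))
print(f"regression change: {logistic_jump:.4f}, tree change: {tree_jump:.2f}")
```

The regression score barely moves for a small change in down payment, while the tree score jumps from one leaf to the other. That jump is what makes the tree hard to sell to a dealership, and at the same time the explicit split is what made trees easy to explain to our end users.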
If you have been a practitioner for a while you might think that of course every data scientist knows there is a tradeoff between accuracy, complexity, and functional practicality. I agree, but it still can't be emphasized enough. And more time should be spent on applied examples like this vs the waste we see in social media discussions about who is or isn't a fake data scientist. The real data scientists are too busy working in the gaps between theory and practice to care. To be continued....