In his well-known paper, Leo Breiman discusses the 'cultural' differences between algorithmic (machine learning) approaches and traditional methods related to inferential statistics. Recently, I discussed how important it is to understand these kinds of distinctions when thinking about how current automated machine learning tools can be leveraged in the data science space.
In his paper, Breiman states:
"Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems."
On the other hand, Susan Athey's work highlights the fact that no one has developed the asymptotic theory necessary to adequately address causal questions using methods from machine learning (i.e., how does a given machine learning algorithm fit into the context of the Rubin Causal Model/potential outcomes framework?).
Dr. Athey is working to bridge some of this gap, but it's very complicated. I think a lot can also be done simply by understanding and communicating the differences between inferential and causal questions on the one hand and machine learning/predictive modeling questions on the other. When should each be used for a given business problem? What methods does this entail?
In an MIT Data Made to Matter podcast, economist Joseph Doyle discusses his paper investigating the relationship between more aggressive (and expensive) treatments by hospitals and improved outcomes for Medicare patients. Using this as an example, I hope to broadly illustrate some of these differences by looking at this problem through all three lenses.
Statistical Inference
Suppose we just want to know if there is a significant relationship between aggressive treatments 'A' and health outcomes (mortality) 'M.' We might estimate a regression equation (similar to one of the models in the paper) such as:
M = b0 + b1*A + b2*X + e where X is a vector of relevant controls.
We would be very careful about the nature of our data, the correct functional form, and getting our standard errors right in order to make valid inferences about our estimate 'b1' of the relationship between aggressive treatments A and mortality M. A lot of this is traditionally taught in econometrics, biostatistics, and epidemiology (things like heteroskedasticity, multicollinearity, and distributional assumptions related to the error terms).
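To make the inferential workflow concrete, here is a minimal sketch in Python on simulated data (not Doyle's data or specification). The variable names, the coefficient values, and the single control are all assumptions made for illustration. It fits the regression above by ordinary least squares and computes heteroskedasticity-robust (HC1) standard errors, one of the corrections an analyst might reach for when the error variance is not constant.

```python
import numpy as np

def ols_robust(y, X):
    """OLS coefficients with HC1 heteroskedasticity-robust standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Sandwich covariance: (X'X)^-1 X' diag(e^2) X (X'X)^-1, scaled by n/(n-k)
    meat = (X * resid[:, None] ** 2).T @ X
    cov = XtX_inv @ meat @ XtX_inv * n / (n - k)
    return beta, np.sqrt(np.diag(cov))

# Simulated data; every number below is an illustrative assumption.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)              # one observed control (stands in for X)
a = 0.5 * x + rng.normal(size=n)    # treatment intensity A
# Error spread grows with |a|, so the errors are heteroskedastic.
m = 1.0 - 0.3 * a + 0.8 * x + rng.normal(size=n) * (1 + np.abs(a))

X = np.column_stack([np.ones(n), a, x])   # columns: intercept, A, control
beta, se = ols_robust(m, X)
t_b1 = beta[1] / se[1]
print(f"b1 = {beta[1]:.3f}, robust SE = {se[1]:.3f}, t = {t_b1:.2f}")
```

In this simulation A is generated independently of the error term, so b1 can be read as the true effect. That is exactly the assumption that fails in the hospital setting discussed next.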
Causal Inference
Suppose we wanted to know if the estimate b1 in the equation above is causal. In their paper, Doyle and his coauthors discuss some of the challenges:
"A major issue that arises when comparing hospitals is that they may treat different types of patients. For example, greater treatment levels may be chosen for populations in worse health. At the individual level, higher spending is strongly associated with higher mortality rates, even after risk adjustment, which is consistent with more care provided to patients in (unobservably) worse health. At the hospital level, long-term investments in capital and labor may reflect the underlying health of the population as well. Differences in unobservable characteristics may therefore bias results toward finding no effect of greater spending."
One of the points he is making is that even if we control for everything we typically measure in these studies (captured by X above), there are unobservable patient characteristics that bias our estimate of b1. Recall that methods like regression and matching (two flavors of identification strategies based on selection on observables) achieve identification by assuming that, conditional on observed characteristics (X), selection bias disappears. We want to make conditional-on-X comparisons of the outcome Y (M in the model above) that mimic as closely as possible the experimental benchmark of random assignment (see more on matching estimators here).
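In potential-outcomes notation, the selection-on-observables assumption can be written as follows. The symbols are introduced here for illustration and simplify treatment to a binary indicator D (for example, high versus low treatment intensity), with Y_1 and Y_0 the outcomes a patient would have with and without treatment. The implication also requires overlap, meaning both treated and untreated patients exist at each value of X.

```latex
(Y_{0}, Y_{1}) \perp D \mid X
\quad\Longrightarrow\quad
E[Y \mid X, D = 1] - E[Y \mid X, D = 0] = E[Y_{1} - Y_{0} \mid X]
```

When the independence condition on the left fails because of unobserved patient health, the observed difference on the right no longer equals the average treatment effect at X.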
However, if there are important characteristics related to selection that we don't observe and can't include in X, then in order to make valid causal statements about our results, we need a method that identifies treatment effects within a selection on 'un'-observables framework (examples include difference-in-differences, fixed effects, and instrumental variables).
In their paper, Doyle and his coauthors used ambulance referral patterns as an instrument for hospital choice to make causal statements about A.
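The logic of an instrumental variables estimator can be sketched with a small simulation. This is not the specification from Doyle's paper: the binary instrument z, the unobserved severity u, and all coefficients are hypothetical, and the instrument is valid by construction (it shifts A but affects M only through A). The sketch compares naive OLS with a manual two-stage least squares estimate.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000
u = rng.normal(size=n)                          # unobserved patient severity
z = rng.integers(0, 2, size=n).astype(float)    # hypothetical as-good-as-random instrument
a = 1.0 * z + 1.0 * u + rng.normal(size=n)      # sicker patients receive more care
true_b1 = -0.5                                  # assumed causal effect of A on M
m = 2.0 + true_b1 * a + 1.5 * u + rng.normal(size=n)

def ols(y, X):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)

# Naive OLS of M on A: biased because u drives both A and M.
b_ols = ols(m, np.column_stack([const, a]))[1]

# Stage 1: keep only the variation in A that is explained by the instrument.
Z = np.column_stack([const, z])
a_hat = Z @ ols(a, Z)

# Stage 2: regress M on the fitted values from stage 1.
# (Point estimate only; standard errors from this naive second stage would be wrong.)
b_iv = ols(m, np.column_stack([const, a_hat]))[1]

print(f"true effect = {true_b1}, OLS = {b_ols:.3f}, 2SLS = {b_iv:.3f}")
```

In this setup OLS is pulled upward, because patients in worse health receive more care and also die at higher rates, while the two-stage estimate lands close to the assumed effect. In practice one would use a dedicated IV routine that also reports correct standard errors and first-stage diagnostics.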
Machine Learning/Predictive Modeling
Suppose we just want to predict mortality by hospital to support some policy or operational objective where the primary need is accurate predictions. A number of algorithmic methods might be used, including logistic regression, decision trees, random forests, and neural networks. Based on the mixed findings in the literature, a machine learning algorithm may not exploit 'A' at all, even though Doyle finds a significant causal effect based on his instrumental variables estimator. The point is that, in many cases, a black box algorithm that includes or excludes treatment intensity as a predictor doesn't really care about the significance of this relationship or its causal mechanism, as long as at the end of the day the algorithm predicts well out of sample and maintains reliability and usefulness in application over time.
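A small simulation can illustrate how the predictive criterion differs from the causal one. Everything here is assumed for illustration (the data-generating process, the coefficients, and the use of a plain linear model in place of a more flexible algorithm). The only question asked is whether including A lowers holdout error, regardless of what its coefficient means.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000
u = rng.normal(size=n)                              # severity, never seen by the modeler
a = u + rng.normal(size=n)                          # treatment intensity tracks severity
m = 2.0 - 0.5 * a + 2.0 * u + rng.normal(size=n)    # assumed causal effect of A is -0.5

# Simple holdout split: first half for training, second half for testing.
idx = np.arange(n)
train, test = idx < n // 2, idx >= n // 2

def holdout_mse(features):
    """Fit a linear model on the training half; return test-half MSE and coefficients."""
    X = np.column_stack([np.ones(n)] + features)
    coef = np.linalg.lstsq(X[train], m[train], rcond=None)[0]
    pred = X[test] @ coef
    return float(np.mean((m[test] - pred) ** 2)), coef

mse_without_a, _ = holdout_mse([])      # intercept-only baseline
mse_with_a, coef = holdout_mse([a])     # add A as a predictor

print(f"holdout MSE without A: {mse_without_a:.3f}")
print(f"holdout MSE with A:    {mse_with_a:.3f}")
print(f"fitted coefficient on A: {coef[1]:.3f}")
```

In this simulated setup A improves out-of-sample prediction because it proxies for unobserved severity, yet its fitted coefficient has the opposite sign from the assumed causal effect. A purely predictive workflow would keep or drop A based on the holdout error alone.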
If we wanted to know whether the relationship between intensity of care 'A' and mortality 'M' was statistically significant or causal, we would not rely on machine learning methods, at least not on anything available off the shelf today, pending further work by researchers like Susan Athey. We would develop the appropriate causal or inferential model designed to answer the particular question at hand. In fact, as Susan Athey points out in a past Quora commentary, models used for causal inference could possibly give worse predictions:
"Techniques like instrumental variables seek to use only some of the information that is in the data – the “clean” or “exogenous” or “experiment-like” variation in price—sacrificing predictive accuracy in the current environment to learn about a more fundamental relationship that will help make decisions...This type of model has not received almost any attention in ML."
The point is, for the data scientist caught in the middle of so much disruption related to tools like automated machine learning, as well as technologies producing and leveraging large amounts of data, it is important to focus on business understanding and to match the method to what the business is actually trying to achieve. The ability to understand the differences in tools and methodologies related to statistical inference, causal inference, and machine learning, and to explain those differences to stakeholders, will be important to prevent 'straight jacket' thinking about solutions to complex problems.
Doyle, Joseph, et al. “Measuring Returns to Hospital Care: Evidence from Ambulance Referral Patterns.” Journal of Political Economy 123.1 (2015): 170–214. PMC. Web. 11 July 2017.
Bogard, Matt. “A Guide to Quasi-Experimental Designs.” (2013). Available at: http://works.bepress.com/matt_bogard/24/