Sunday, March 18, 2018

Will there be a credibility revolution in data science and AI?

Summary: Understanding where AI and automation will be most disruptive to data scientists in the near term depends on understanding the methodological differences between explaining and predicting. It will require the ability to ask a different kind of question than machine learning algorithms can answer off the shelf today. At the end of the day we have to think about how the world adjusts along multiple margins. This may put a premium on soft skills. At its heart, causal inference is not about data and algorithms. As Judea Pearl says, it's about something that is not in the data to begin with. A credibility revolution in AI that embraces causal inference will only get us as far as good theory can take us.

There is a lot of enthusiasm about the disruptive role of automation and AI in data science. Products like H2O.ai and DataRobot offer tools to automate or fast-track many aspects of the data science workstream. Combined with other AI-based applications and machine learning, large language models (LLMs) will likely help us free up resources from routine tasks, gather information, and have more time to focus and ask better questions. If this trajectory continues, what will the work of the future data scientist look like?

Many have already pointed out the very difficult task of automating the soft skills possessed by data scientists. In a previous LinkedIn post I discussed this in the trading space where automation and AI could create substantial disruptions for both data scientists and traders. Here I quoted Matthew Hoyle:

"Strategies have a short shelf life-what is valuable is the ability and energy to look at new and interesting things and put it all together with a sense of business development and desire to explore"

Understanding disruption from AI and automation will also depend largely on making a distinction between explaining and predicting.

Once armed with predictions, businesses will start to ask questions about 'why'. This goes beyond prediction, beyond visualizing the patterns and relationships coming out of black box prediction algorithms, and beyond simply summarizing and parroting pre-existing information. Decision makers will need to know which decisions or factors are moving the needle on revenue, customer satisfaction and engagement, or efficiency. Essentially they will want to ask questions related to causality [i.e. Compared to what? What makes a difference?]. There is a significant difference between understanding what drivers correlate with or 'predict' the outcome of interest and what actually makes a difference in the outcome. What they will be asking for is a different paradigm, or a credibility revolution in AI and data science.
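To make the gap between 'predicts' and 'makes a difference' concrete, here is a minimal simulation sketch. Everything in it is hypothetical and chosen only for illustration: a hidden 'loyalty' trait drives both who receives a promotion and how much revenue they generate, and the true effect of the promotion is set to 1.0. A naive comparison of promoted vs. non-promoted customers picks up the loyalty difference along with the promotion effect, while random assignment does not.

```python
import random
from statistics import mean

def promo_effect_estimate(randomized, n=20000, true_effect=1.0, seed=0):
    """Difference in mean revenue between promoted and non-promoted customers.

    All variable names and parameter values are hypothetical.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        loyalty = rng.gauss(0, 1)  # hidden confounder, never observed by the analyst
        if randomized:
            promo = rng.random() < 0.5              # coin flip: unrelated to loyalty
        else:
            promo = loyalty + rng.gauss(0, 1) > 0   # loyal customers get the promo more often
        revenue = 3.0 * loyalty + true_effect * int(promo) + rng.gauss(0, 1)
        (treated if promo else control).append(revenue)
    return mean(treated) - mean(control)

naive = promo_effect_estimate(randomized=False)
experimental = promo_effect_estimate(randomized=True)

print(f"naive comparison:   {naive:.2f}")
print(f"randomized design:  {experimental:.2f}")
print("true effect:        1.00")
```

In this setup promotion status is a perfectly good predictor of revenue, but the naive comparison badly overstates what the promotion actually does. That is the 'compared to what?' question in miniature.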

What do we mean by a credibility revolution?

Economist Jayson Lusk puts it well:

"Fortunately economics (at least applied microeconomics) has undergone a bit of credibility revolution.  If you attend a research seminar in virtually any economi(cs) department these days, you're almost certain to hear questions like, "what is your identification strategy?" or "how did you deal with endogeneity or selection?"  In short, the question is: how do we know the effects you're reporting are causal effects and not just correlations."

Healthcare Economist Austin Frakt has a similar take:

"A “research design” is a characterization of the logic that connects the data to the causal inferences the researcher asserts they support. It is essentially an argument as to why someone ought to believe the results. It addresses all reasonable concerns pertaining to such issues as selection bias, reverse causation, and omitted variables bias. In the case of a randomized controlled trial with no significant contamination of or attrition from treatment or control group there is little room for doubt about the causal effects of treatment so there’s hardly any argument necessary. But in the case of a natural experiment or an observational study causal inferences must be supported with substantial justification of how they are identified. Essentially one must explain how a random experiment effectively exists where no one explicitly created one."

How are these questions and differences unlike your typical machine learning application? Susan Athey does a great job explaining in a Quora response about how causal inference is different from off the shelf machine learning methods (the kind being automated today):

"Sendhil Mullainathan (Harvard) and Jon Kleinberg with a number of coauthors have argued that there is a set of problems where off-the-shelf ML methods for prediction are the key part of important policy and decision problems.  They use examples like deciding whether to do a hip replacement operation for an elderly patient; if you can predict based on their individual characteristics that they will die within a year, then you should not do the operation...Despite these fascinating examples, in general ML prediction models are built on a premise that is fundamentally at odds with a lot of social science work on causal inference. The foundation of supervised ML methods is that model selection (cross-validation) is carried out to optimize goodness of fit on a test sample. A model is good if and only if it predicts well. Yet, a cornerstone of introductory econometrics is that prediction is not causal inference.....Techniques like instrumental variables seek to use only some of the information that is in the data – the “clean” or “exogenous” or “experiment-like” variation in price—sacrificing predictive accuracy in the current environment to learn about a more fundamental relationship that will help make decisions...This type of model has not received almost any attention in ML."
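The trade-off Athey describes can be illustrated with a small simulation. This is only a sketch under assumed, made-up numbers: an unobserved demand shock pushes both price and quantity, a hypothetical cost shifter serves as the instrument, and the true effect of price on quantity is set to -2.0. The ordinary least squares line fits the data better, but the simple instrumental variables (Wald) estimate, which uses only the cost-driven variation in price, is the one that gets close to the true effect.

```python
import random

def cov(xs, ys):
    """Sample covariance of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

rng = random.Random(42)
true_slope = -2.0          # assumed causal effect of price on quantity
cost, price, quantity = [], [], []
for _ in range(50000):
    z = rng.gauss(0, 1)    # cost shifter: moves price, has no direct effect on quantity
    u = rng.gauss(0, 1)    # unobserved demand shock: moves both price and quantity
    p = z + u + rng.gauss(0, 0.5)
    q = true_slope * p + 3.0 * u + rng.gauss(0, 1)
    cost.append(z)
    price.append(p)
    quantity.append(q)

ols_slope = cov(price, quantity) / cov(price, price)   # best linear predictor
iv_slope = cov(cost, quantity) / cov(cost, price)      # uses only cost-driven price variation

def mse(slope):
    """In-sample mean squared error of the line with this slope (intercept fit to the means)."""
    intercept = sum(quantity) / len(quantity) - slope * sum(price) / len(price)
    return sum((q - intercept - slope * p) ** 2 for p, q in zip(price, quantity)) / len(price)

print(f"OLS slope: {ols_slope:.2f}  (MSE {mse(ols_slope):.2f})")
print(f"IV slope:  {iv_slope:.2f}  (MSE {mse(iv_slope):.2f})")
print(f"true slope: {true_slope:.2f}")
```

A model selection routine that only rewards goodness of fit would pick the OLS line every time. Whether the cost shifter is a valid instrument is not something the data can certify; it has to be argued.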

Developing an identification strategy, as Jayson Lusk discussed above, and all that goes along with it (finding natural experiments or valid instruments, or navigating the garden of forking paths related to propensity score matching or a number of other quasi-experimental methods) involves careful considerations and decisions that must be made and defended in ways that would be very challenging to automate. Even when humans do this there is rarely a single best approach to these problems. They are far from routine. Just ask anyone who has been through peer review or given a talk at an economics seminar or conference.

The kinds of skills that will be useful in this space would be similar to those of the econometrician or epidemiologist or any quantitative decision maker comfortable with the norms and practices that have evolved out of the credibility revolution. As data science thought leader Eugene Dubossarsky puts it:

“the most elite skills…the things that I find in the most elite data scientists are the sorts of things econometricians these days have…bayesian statistics…inferring causality” 

No one has a crystal ball. This is not to say that the current advances in automation are falling short on creating value [just look again at the growing interest in LLMs]. They should no doubt create value like any other form of capital complementing the labor and soft skills of the data scientist. And as mentioned above, they could free up more resources to focus on causal questions that previously may have gone unanswered. I discussed this type of synergy in a related post written before LLMs, when the hype was mostly focused on 'big data':

 "correlations or 'flags' from big data might not 'identify' causal effects, but they are useful for prediction and might point us in directions where we can more rigorously investigate causal relationships if interested" 

If automation of causal inference is possible, it will require a different approach than what we have seen so far. We might look to the pioneering work that Susan Athey is doing converging machine learning and causal inference:

"I’m also working on developing statistical theory for some of the most widely used and successful estimators, like random forests, and adapting them so that they can be used to predict an individual’s treatment effects as a function of their characteristics. For example, I can tell you for a particular individual, given their characteristics, how they would respond to a price change, using a method adapted from regression trees or random forests. This will come with a confidence interval as well." 
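As a rough intuition for what 'treatment effects as a function of characteristics' means, here is a deliberately simplified sketch. It is not Athey's method: it replaces the adapted random forest with a single hand-picked split on one hypothetical customer segment, and it assumes the price change was randomly assigned. The segment names, effect sizes, and noise level are all invented for illustration.

```python
import random
from math import sqrt
from statistics import mean, variance

def effect_with_ci(treated, control):
    """Difference in means with an approximate 95% confidence interval."""
    diff = mean(treated) - mean(control)
    se = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    return diff, (diff - 1.96 * se, diff + 1.96 * se)

# Hypothetical true responses to a price increase, by customer segment.
true_effects = {"price_sensitive": -3.0, "price_insensitive": -0.5}

rng = random.Random(7)
outcomes = {seg: {"treated": [], "control": []} for seg in true_effects}
for _ in range(20000):
    segment = rng.choice(sorted(true_effects))
    got_increase = rng.random() < 0.5          # randomized price change
    purchases = 10.0 + true_effects[segment] * int(got_increase) + rng.gauss(0, 2)
    outcomes[segment]["treated" if got_increase else "control"].append(purchases)

estimates = {
    seg: effect_with_ci(arms["treated"], arms["control"])
    for seg, arms in outcomes.items()
}
for seg, (diff, (lo, hi)) in estimates.items():
    print(f"{seg}: effect {diff:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The methods Athey describes go much further, choosing the splits from the data and supplying valid confidence intervals for them, but the output has the same shape: an effect estimate and an interval that depend on who the individual is.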

Or we might look to ongoing research related to causal structure discovery (CSD), as discussed recently in Nature.

At the end of the day, however, when thinking about the disruption of AI and automation we have to think about how the world adjusts along multiple margins. This may put a premium on soft skills. At its heart, causal inference is not about data and algorithms. Instead, as Judea Pearl says, it's about something that is not in the data to begin with. It is a way of thinking. As noted before from the 10th edition of Heyne, Boettke, and Prychitko's The Economic Way of Thinking:

"We can observe facts, but it takes a theory to explain the causes. It takes a theory to weed out the irrelevant facts from the relevant ones....Our observations of the world are in fact drenched with theory, which is why we can usually make sense out of the buzzing confusion that assaults our eyes and ears. Actually we observe only a small fraction of what we "know," a hint here and a suggestion there. The rest we fill in from the theories we hold: small and broad, vague and precise..."

A credibility revolution in AI that embraces causality will only get us as far as good theory can take us. 

[This post was updated on June 9, 2023]

Additional References:

Shen, X., Ma, S., Vemuri, P. et al. Challenges and Opportunities with Causal Discovery Algorithms: Application to Alzheimer’s Pathophysiology. Sci Rep 10, 2975 (2020).

From 'What If?' To 'What Next?' : Causal Inference and Machine Learning for Intelligent Decision Making

Susan Athey on Machine Learning, Big Data, and Causation 

Machine Learning and Econometrics (Susan Athey, Guido Imbens) 

Related Posts:

Why Data Science Needs Economics

To Explain or Predict

Culture War: Classical Statistics vs. Machine Learning

HARK! - flawed studies in nutrition call for credibility revolution -or- HARKing in nutrition research

Econometrics, Math, and Machine Learning

Big Data: Don't Throw the Baby Out with the Bathwater

Big Data: Causality and Local Expertise Are Key in Agronomic Applications

The Use of Knowledge in a Big Data Society II: Thick Data 

The Use of Knowledge in a Big Data Society 

Big Data, Deep Learning, and SQL

Economists as Data Scientists