Sunday, March 18, 2018

Will there be a credibility revolution in data science and AI?

Summary: Understanding where AI and automation will be most disruptive to data scientists in the near term requires understanding the methodological differences between explaining and predicting, and between machine learning and causal inference. The work that remains hard to automate will require the ability to ask a different kind of question than machine learning algorithms are capable of answering off the shelf today.

There is a lot of enthusiasm about the disruptive role of automation and AI in data science. Products like H2O.ai and DataRobot offer tools to automate or fast-track many aspects of the data science work stream. If this trajectory continues, what will the work of the future data scientist look like?

Many have already pointed out how difficult it is to automate the soft skills possessed by data scientists. In a previous LinkedIn post I discussed this in the trading space, where automation and AI could create substantial disruptions for both data scientists and traders. There I quoted Matthew Hoyle:

"Strategies have a short shelf life-what is valuable is the ability and energy to look at new and interesting things and put it all together with a sense of business development and desire to explore"

My conclusion: They are talking about bringing a portfolio of useful and practical skills together to do a better job than was possible before open source platforms and computing power became so widespread. I think that is the future.

So the future is about rebalancing the data scientist's portfolio of skills. However, in the near term I think the disruption from AI and automation in data science will do more than increase the emphasis on soft skills. In fact, a significant portion of 'hard skills' will see an increase in demand precisely because they are difficult to automate.

Understanding this will depend largely on making a distinction between explaining and predicting. Much of what appears to be at the forefront of automation involves tasks supporting supervised and unsupervised machine learning algorithms, as well as other prediction and forecasting tools like time series analysis.

Once armed with predictions, businesses will start to ask questions about 'why'. This will transcend prediction or any of the visualizations of the patterns and relationships coming out of black box algorithms. They will want to know what decisions or factors are moving the needle on revenue or customer satisfaction and engagement or improved efficiencies. Essentially they will want to ask questions related to causality, which requires a completely different paradigm for data analysis than questions of prediction. And they will want scientifically formulated answers that are convincing vs. mere reports about rates of change or correlations. There is a significant difference between understanding what drivers correlate with or 'predict' the outcome of interest and what is actually driving the outcome. What they will be asking for is a credibility revolution in data science.

What do we mean by a credibility revolution?

Economist Jayson Lusk puts it well:

"Fortunately economics (at least applied microeconomics) has undergone a bit of credibility revolution.  If you attend a research seminar in virtually any economi(cs) department these days, you're almost certain to hear questions like, "what is your identification strategy?" or "how did you deal with endogeneity or selection?"  In short, the question is: how do we know the effects you're reporting are causal effects and not just correlations."

Healthcare Economist Austin Frakt has a similar take:

"A “research design” is a characterization of the logic that connects the data to the causal inferences the researcher asserts they support. It is essentially an argument as to why someone ought to believe the results. It addresses all reasonable concerns pertaining to such issues as selection bias, reverse causation, and omitted variables bias. In the case of a randomized controlled trial with no significant contamination of or attrition from treatment or control group there is little room for doubt about the causal effects of treatment so there’s hardly any argument necessary. But in the case of a natural experiment or an observational study causal inferences must be supported with substantial justification of how they are identified. Essentially one must explain how a random experiment effectively exists where no one explicitly created one."

How are these questions and differences unlike your typical machine learning application? Susan Athey does a great job explaining in a Quora response about how causal inference is different from off the shelf machine learning methods (the kind being automated today):

"Sendhil Mullainathan (Harvard) and Jon Kleinberg with a number of coauthors have argued that there is a set of problems where off-the-shelf ML methods for prediction are the key part of important policy and decision problems.  They use examples like deciding whether to do a hip replacement operation for an elderly patient; if you can predict based on their individual characteristics that they will die within a year, then you should not do the operation...Despite these fascinating examples, in general ML prediction models are built on a premise that is fundamentally at odds with a lot of social science work on causal inference. The foundation of supervised ML methods is that model selection (cross-validation) is carried out to optimize goodness of fit on a test sample. A model is good if and only if it predicts well. Yet, a cornerstone of introductory econometrics is that prediction is not causal inference.....Techniques like instrumental variables seek to use only some of the information that is in the data – the “clean” or “exogenous” or “experiment-like” variation in price—sacrificing predictive accuracy in the current environment to learn about a more fundamental relationship that will help make decisions...This type of model has not received almost any attention in ML."
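To make the instrumental variables point concrete, here is a minimal simulated sketch. Everything in it is an assumption for illustration and not from Athey's response: the data generating process, the true price effect of -1, the hypothetical cost shifter used as an instrument, and the function names. Ordinary least squares fits the observed data better, yet it recovers the wrong price effect because price is correlated with an unobserved demand shock. The simple IV (Wald) estimator uses only the variation in price induced by the instrument, fits worse, and recovers the causal effect.

```python
import numpy as np


def simulate_demand(n=20000, seed=0):
    """Simulate prices and quantities with an unobserved demand shock.

    Hypothetical setup: the true causal effect of price on quantity is -1.
    """
    rng = np.random.default_rng(seed)
    cost_shifter = rng.normal(size=n)   # instrument: moves price, not demand
    demand_shock = rng.normal(size=n)   # unobserved confounder
    price = cost_shifter + demand_shock + rng.normal(size=n)
    quantity = -1.0 * price + 2.0 * demand_shock + rng.normal(size=n)
    return cost_shifter, price, quantity


def ols_slope(x, y):
    """Slope from a simple regression of y on x (the best linear predictor)."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


def iv_slope(z, x, y):
    """Simple IV estimator: cov(z, y) / cov(z, x)."""
    zc = z - z.mean()
    return float(zc @ (y - y.mean()) / (zc @ (x - x.mean())))


def fit_mse(x, y, slope):
    """In-sample mean squared error of the line with the given slope."""
    intercept = y.mean() - slope * x.mean()
    return float(np.mean((y - intercept - slope * x) ** 2))


if __name__ == "__main__":
    z, price, quantity = simulate_demand()
    ols = ols_slope(price, quantity)
    iv = iv_slope(z, price, quantity)
    print(f"OLS slope: {ols:.2f}, MSE: {fit_mse(price, quantity, ols):.2f}")
    print(f"IV slope:  {iv:.2f}, MSE: {fit_mse(price, quantity, iv):.2f}")
```

In this setup a model selected purely on fit would prefer the OLS line, which is exactly the tension between prediction and causal inference described in the quote. The sketch also assumes the instrument is valid by construction; in real data that is the part that has to be argued and defended.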

Developing an identification strategy, as Jayson Lusk discussed above, and all that goes along with that (finding natural experiments or valid instruments, or navigating the garden of forking paths related to propensity score matching or a number of other quasi-experimental methods) involves careful considerations and decisions to be made and defended in ways that would be very challenging to automate. Even when humans do this there is rarely a single best approach to these problems. They are far from routine. Just ask anyone who has been through peer review or given a talk at an economics seminar or conference.

The kinds of skills required to work in this space would be similar to those of the econometrician or epidemiologist or any quantitative researcher who has been culturally immersed in the social norms and practices that have evolved out of the credibility revolution. As data science thought leader Eugene Dubossarsky puts it:

“the most elite skills…the things that I find in the most elite data scientists are the sorts of things econometricians these days have…bayesian statistics…inferring causality” 

No one has a crystal ball. This is not to say that the current advances in automation are falling short on creating value. They should no doubt create value like any other form of capital complementing the labor and soft skills of the data scientist. And they could free up more resources to focus on causal questions that previously may not have been answered. I discussed this complementarity previously in a related post:

 "correlations or 'flags' from big data might not 'identify' causal effects, but they are useful for prediction and might point us in directions where we can more rigorously investigate causal relationships if interested" 

However, if automation in this space is possible, it will require a different approach than what we have seen so far. We might look to the pioneering work that Susan Athey is doing to bring machine learning and causal inference together. It will require thinking in terms of potential outcomes, endogeneity, and counterfactuals, which requires the ability to ask a different kind of question than machine learning algorithms are capable of answering off the shelf today.
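As a rough illustration of what thinking in terms of potential outcomes looks like, the simulated sketch below gives every unit two outcomes, one with treatment and one without, and compares two ways of assigning treatment. All of it is assumed for illustration: the true effect of 2, the unobserved "ability" variable that drives self-selection, and the function names. Only in a simulation do we get to see both potential outcomes; in real data the counterfactual is always missing, which is why the research design matters.

```python
import numpy as np


def simulate_potential_outcomes(n=50000, seed=1):
    """Create both potential outcomes for each unit (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    ability = rng.normal(size=n)        # unobserved driver of the outcome
    y0 = ability + rng.normal(size=n)   # outcome if untreated
    y1 = y0 + 2.0                       # outcome if treated; true effect is 2
    return ability, y0, y1


def naive_difference(treated, y0, y1):
    """Difference in observed means between treated and untreated units."""
    observed = np.where(treated, y1, y0)  # only one outcome is ever observed
    return float(observed[treated].mean() - observed[~treated].mean())


def compare_designs(n=50000, seed=1):
    """Compare self-selected treatment with randomized assignment."""
    ability, y0, y1 = simulate_potential_outcomes(n, seed)
    rng = np.random.default_rng(seed + 1)
    # Self-selection: higher-ability units are more likely to take treatment.
    self_selected = (ability + rng.normal(size=n)) > 0
    # Randomization: assignment is independent of the potential outcomes.
    randomized = rng.random(n) < 0.5
    return {
        "true_effect": float(np.mean(y1 - y0)),
        "self_selected": naive_difference(self_selected, y0, y1),
        "randomized": naive_difference(randomized, y0, y1),
    }


if __name__ == "__main__":
    for name, value in compare_designs().items():
        print(f"{name}: {value:.2f}")
```

Under self-selection the naive comparison overstates the effect because treated units would have done better even without treatment. A predictive model trained on that data would happily use treatment status as a strong signal without ever flagging the problem.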

Additional References:

From 'What If?' To 'What Next?' : Causal Inference and Machine Learning for Intelligent Decision Making https://sites.google.com/view/causalnips2017

Susan Athey on Machine Learning, Big Data, and Causation http://www.econtalk.org/archives/2016/09/susan_athey_on.html 

Machine Learning and Econometrics (Susan Athey, Guido Imbens) https://www.aeaweb.org/conference/cont-ed/2018-webcasts 

Related Posts:

Why Data Science Needs Economics
http://econometricsense.blogspot.com/2016/10/why-data-science-needs-economics.html

To Explain or Predict
http://econometricsense.blogspot.com/2015/03/to-explain-or-predict.html

Culture War: Classical Statistics vs. Machine Learning
http://econometricsense.blogspot.com/2011/01/classical-statistics-vs-machine.html

HARK! - flawed studies in nutrition call for credibility revolution -or- HARKing in nutrition research
http://econometricsense.blogspot.com/2017/12/hark-flawed-studies-in-nutrition-call.html

Econometrics, Math, and Machine Learning
http://econometricsense.blogspot.com/2015/09/econometrics-math-and-machine.html

Big Data: Don't Throw the Baby Out with the Bathwater
http://econometricsense.blogspot.com/2014/05/big-data-dont-throw-baby-out-with.html

Big Data: Causality and Local Expertise Are Key in Agronomic Applications
http://econometricsense.blogspot.com/2014/05/big-data-think-global-act-local-when-it.html

The Use of Knowledge in a Big Data Society II: Thick Data
https://www.linkedin.com/pulse/use-knowledge-big-data-society-ii-thick-matt-bogard/ 

The Use of Knowledge in a Big Data Society
https://www.linkedin.com/pulse/use-knowledge-big-data-society-matt-bogard/ 

Big Data, Deep Learning, and SQL
https://www.linkedin.com/pulse/deep-learning-regressionand-sql-matt-bogard/

Economists as Data Scientists
http://econometricsense.blogspot.com/2012/10/economists-as-data-scientists.html 
