Many have already pointed out the very difficult task of automating the soft skills possessed by data scientists. In a previous LinkedIn post I discussed this in the trading space where automation and AI could create substantial disruptions for both data scientists and traders. Here I quoted Matthew Hoyle:
"Strategies have a short shelf life-what is valuable is the ability and energy to look at new and interesting things and put it all together with a sense of business development and desire to explore"
Understanding disruption from AI and automation will also depend largely on making a distinction between explaining and predicting. Much of what appears to be at the forefront of automation involves tasks supporting supervised and unsupervised machine learning algorithms as well as other prediction and forecasting tools like time series analysis.
Once armed with predictions, businesses will start to ask questions about 'why'. This will transcend prediction or any of the visualizations of the patterns and relationships coming out of black box algorithms. They will want to know what decisions or factors are moving the needle on revenue or customer satisfaction and engagement or improved efficiencies. Essentially they will want to ask questions related to causality, which requires a completely different paradigm for data analysis than questions of prediction. And they will want scientifically formulated answers that are convincing vs. mere reports about rates of change or correlations. There is a significant difference between understanding what drivers correlate with or 'predict' the outcome of interest and what is actually driving the outcome. What they will be asking for is a credibility revolution in data science.
What do we mean by a credibility revolution?
Economist Jayson Lusk puts it well:
"Fortunately economics (at least applied microeconomics) has undergone a bit of credibility revolution. If you attend a research seminar in virtually any economi(cs) department these days, you're almost certain to hear questions like, "what is your identification strategy?" or "how did you deal with endogeneity or selection?" In short, the question is: how do we know the effects you're reporting are causal effects and not just correlations."
Healthcare Economist Austin Frakt has a similar take:
"A “research design” is a characterization of the logic that connects the data to the causal inferences the researcher asserts they support. It is essentially an argument as to why someone ought to believe the results. It addresses all reasonable concerns pertaining to such issues as selection bias, reverse causation, and omitted variables bias. In the case of a randomized controlled trial with no significant contamination of or attrition from treatment or control group there is little room for doubt about the causal effects of treatment so there’s hardly any argument necessary. But in the case of a natural experiment or an observational study causal inferences must be supported with substantial justification of how they are identified. Essentially one must explain how a random experiment effectively exists where no one explicitly created one."
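The selection bias Frakt mentions can be made concrete with a small simulation. This is a minimal sketch under assumptions of my own, not anything from the quoted sources: the "engagement" trait, the effect sizes, and the function names are all hypothetical. It compares a naive difference in means when people select themselves into treatment with the same comparison when treatment is assigned at random.

```python
import random
import statistics as st


def diff_in_means(treated, outcome):
    """Naive estimate: mean outcome of the treated minus mean outcome of the controls."""
    t = [y for d, y in zip(treated, outcome) if d]
    c = [y for d, y in zip(treated, outcome) if not d]
    return st.fmean(t) - st.fmean(c)


def simulate_program(randomized, n=20000, true_effect=1.0, seed=1):
    """Simulate a hypothetical program whose true causal effect is `true_effect`.

    `engagement` is an unobserved trait that raises the outcome on its own.
    Under self-selection it also makes people more likely to enroll.
    """
    rng = random.Random(seed)
    treated, outcome = [], []
    for _ in range(n):
        engagement = rng.gauss(0, 1)
        if randomized:
            d = rng.random() < 0.5                    # coin-flip assignment
        else:
            d = engagement + rng.gauss(0, 1) > 0      # engaged people opt in
        y = true_effect * d + 2.0 * engagement + rng.gauss(0, 1)
        treated.append(d)
        outcome.append(y)
    return treated, outcome


if __name__ == "__main__":
    for label, randomized in [("self-selected", False), ("randomized", True)]:
        d, y = simulate_program(randomized)
        print(label, round(diff_in_means(d, y), 2))
```

In this setup the self-selected comparison overstates the effect because it mixes the program's impact with the pre-existing advantage of engaged participants, while the randomized comparison lands near the true value. A research design is the argument for why an observational comparison should be read like the second case rather than the first.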
How do these questions differ from a typical machine learning application? Susan Athey does a great job explaining, in a Quora response, how causal inference is different from off-the-shelf machine learning methods (the kind being automated today):
"Sendhil Mullainathan (Harvard) and Jon Kleinberg with a number of coauthors have argued that there is a set of problems where off-the-shelf ML methods for prediction are the key part of important policy and decision problems. They use examples like deciding whether to do a hip replacement operation for an elderly patient; if you can predict based on their individual characteristics that they will die within a year, then you should not do the operation...Despite these fascinating examples, in general ML prediction models are built on a premise that is fundamentally at odds with a lot of social science work on causal inference. The foundation of supervised ML methods is that model selection (cross-validation) is carried out to optimize goodness of fit on a test sample. A model is good if and only if it predicts well. Yet, a cornerstone of introductory econometrics is that prediction is not causal inference.....Techniques like instrumental variables seek to use only some of the information that is in the data – the “clean” or “exogenous” or “experiment-like” variation in price—sacrificing predictive accuracy in the current environment to learn about a more fundamental relationship that will help make decisions...This type of model has not received almost any attention in ML."
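Athey's point about instrumental variables trading predictive fit for a more fundamental relationship can be sketched with simulated data. Everything below is an illustrative assumption rather than her example: a hypothetical cost shifter `z` moves price but does not affect demand directly, and an unobserved demand shock `u` moves both price and quantity. The simple Wald/IV ratio cov(z, q) / cov(z, p) is used in place of a full two-stage least squares routine.

```python
import random
import statistics as st


def cov(x, y):
    """Sample covariance of two equal-length lists."""
    mx, my = st.fmean(x), st.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def simulate_market(n=20000, true_effect=-2.0, seed=0):
    """Hypothetical market: price is endogenous because it responds to demand shocks."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]    # cost shifter (the instrument)
    u = [rng.gauss(0, 1) for _ in range(n)]    # unobserved demand shock
    price = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
    quantity = [true_effect * p + 3.0 * ui + rng.gauss(0, 1)
                for p, ui in zip(price, u)]
    return z, price, quantity


def ols_slope(x, y):
    """Best linear predictor slope: uses all variation in x."""
    return cov(x, y) / cov(x, x)


def iv_slope(z, x, y):
    """IV (Wald) slope: uses only the variation in x that is driven by z."""
    return cov(z, y) / cov(z, x)


def mse(slope, x, y):
    """In-sample mean squared error of a line with the given slope through the means."""
    mx, my = st.fmean(x), st.fmean(y)
    return st.fmean([(yi - my - slope * (xi - mx)) ** 2 for xi, yi in zip(x, y)])


if __name__ == "__main__":
    z, price, quantity = simulate_market()
    b_ols, b_iv = ols_slope(price, quantity), iv_slope(z, price, quantity)
    print("OLS slope:", round(b_ols, 2), "MSE:", round(mse(b_ols, price, quantity), 2))
    print("IV slope: ", round(b_iv, 2), "MSE:", round(mse(b_iv, price, quantity), 2))
```

By construction OLS gives the better in-sample fit, since it minimizes squared error, yet in this simulation its slope is biased toward zero. The IV slope fits worse but recovers the price effect that would actually govern the response to a deliberate price change, which is the tradeoff the quote describes.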
Developing an identification strategy, as Jayson Lusk discussed above, and all that goes along with it (finding natural experiments or valid instruments, or navigating the garden of forking paths related to propensity score matching or a number of other quasi-experimental methods) involves careful considerations and decisions that must be made and defended in ways that would be very challenging to automate. Even when humans do this there is rarely a single best approach to these problems. They are far from routine. Just ask anyone who has been through peer review or given a talk at an economics seminar or conference.
The kinds of skills required to work in this space would be similar to those of the econometrician or epidemiologist or any quantitative researcher who has been culturally immersed in the social norms and practices that have evolved out of the credibility revolution. As data science thought leader Eugene Dubossarsky puts it:
“the most elite skills…the things that I find in the most elite data scientists are the sorts of things econometricians these days have…bayesian statistics…inferring causality”
No one has a crystal ball. This is not to say that the current advances in automation are falling short on creating value. They should no doubt create value like any other form of capital complementing the labor and soft skills of the data scientist. And they could free up more resources to focus on more causal questions that previously may not have been answered. I discussed this synergy previously in a related post:
"correlations or 'flags' from big data might not 'identify' causal effects, but they are useful for prediction and might point us in directions where we can more rigorously investigate causal relationships if interested"
However, if automation in this space is possible, it will require a different approach than what we have seen so far. We might look to the pioneering work that Susan Athey is doing to bring machine learning and causal inference together:
"I’m also working on developing statistical theory for some of the most widely used and successful estimators, like random forests, and adapting them so that they can be used to predict an individual’s treatment effects as a function of their characteristics. For example, I can tell you for a particular individual, given their characteristics, how they would respond to a price change, using a method adapted from regression trees or random forests. This will come with a confidence interval as well."
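To give a feel for what "an individual's treatment effect as a function of their characteristics, with a confidence interval" means, here is a deliberately simple stand-in. It is not Athey's method: instead of an adapted random forest it estimates the effect of a randomized price increase separately within two hypothetical customer segments and attaches a normal-approximation 95% interval. The segment names, effect sizes, and function names are assumptions for illustration only.

```python
import math
import random
import statistics as st

# Hypothetical true effects of a price increase on units purchased, by segment.
TRUE_EFFECT = {"price_sensitive": -3.0, "loyal": -0.5}


def simulate_experiment(n=20000, seed=2):
    """Randomized price increase applied to customers from two segments."""
    rng = random.Random(seed)
    segments = list(TRUE_EFFECT)
    rows = []
    for _ in range(n):
        segment = rng.choice(segments)
        price_increase = rng.random() < 0.5           # randomized treatment
        units = 10.0 + TRUE_EFFECT[segment] * price_increase + rng.gauss(0, 1)
        rows.append((segment, price_increase, units))
    return rows


def segment_effect(rows, segment):
    """Difference in means within one segment, with a 95% confidence interval."""
    t = [y for s, d, y in rows if s == segment and d]
    c = [y for s, d, y in rows if s == segment and not d]
    effect = st.fmean(t) - st.fmean(c)
    se = math.sqrt(st.variance(t) / len(t) + st.variance(c) / len(c))
    return effect, (effect - 1.96 * se, effect + 1.96 * se)


if __name__ == "__main__":
    rows = simulate_experiment()
    for segment in TRUE_EFFECT:
        effect, (lo, hi) = segment_effect(rows, segment)
        print(f"{segment}: {effect:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

With only two known segments this split is trivial to do by hand. The appeal of the forest-based approach described in the quote is that it searches for such splits across many characteristics while still supporting valid intervals, which is the part that is hard to get right.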
From 'What If?' To 'What Next?' : Causal Inference and Machine Learning for Intelligent Decision Making https://sites.google.com/view/causalnips2017
Susan Athey on Machine Learning, Big Data, and Causation http://www.econtalk.org/archives/2016/09/susan_athey_on.html
Machine Learning and Econometrics (Susan Athey, Guido Imbens) https://www.aeaweb.org/conference/cont-ed/2018-webcasts
Why Data Science Needs Economics
To Explain or Predict
Culture War: Classical Statistics vs. Machine Learning: http://econometricsense.blogspot.com/2011/01/classical-statistics-vs-machine.html
HARK! - flawed studies in nutrition call for credibility revolution -or- HARKing in nutrition research http://econometricsense.blogspot.com/2017/12/hark-flawed-studies-in-nutrition-call.html
Econometrics, Math, and Machine Learning
Big Data: Don't Throw the Baby Out with the Bathwater
Big Data: Causality and Local Expertise Are Key in Agronomic Applications
The Use of Knowledge in a Big Data Society II: Thick Data
The Use of Knowledge in a Big Data Society
Big Data, Deep Learning, and SQL
Economists as Data Scientists