Friday, December 21, 2018

Thinking About Confidence Intervals: Horseshoes and Hand Grenades

In a previous post, Confidence Intervals: Fad or Fashion, I wrote about Dave Giles' post on interpreting confidence intervals. A primary focus of these discussions was how confidence intervals are often misinterpreted. For instance, the two statements below are common mischaracterizations of CIs:

1) There's a 95% probability that the true value of the regression coefficient lies in the interval [a,b].
2) This interval includes the true value of the regression coefficient 95% of the time.

You can read the previous post or Dave's post for more details. But in re-reading Dave's post recently, one statement got me thinking:

"So, the first interpretation I gave for the confidence interval in the opening paragraph above is clearly wrong. The correct probability there is not 95% - it's either zero or 100%! The second interpretation is also wrong. "This interval" doesn't include the true value 95% of the time. Instead, 95% of such intervals will cover the true value."

I like the way he put that: '95% of such intervals', distinguishing this from a particular observed/calculated confidence interval. I think someone trained to think about CIs in the incorrect probabilistic way may have trouble getting at this. So how might we think about CIs in a way that is still useful, but doesn't trip us up with incorrect probability statements?

My favorite statistics text is DeGroot's Probability and Statistics. The 4th edition is very careful about explaining confidence intervals:

"Once we compute the observed values of a and b, the observed interval (a,b) is not so easy to interpret....Before observing the data we can be 95% confident that the random interval (A,B) will contain mu, but after observing the data, the safest interpretation is that (a,b) is simply the observed value of the random interval (A,B)"

While DeGroot is careful, it still may not be very intuitive. However, Principles and Procedures of Statistics: A Biometrical Approach (Steel, Torrie, and Dickey) presents a more intuitive explanation.

"since mu will either be or not be in the interval, that is P=0 or 1, the probability will actually be a measure of confidence we placed in the procedure that led to the statement. This is like throwing a ring at a fixed post; the ring doesn't land in the same position or even catch on the post every time. However we are able to say that we can circle the post 9 times out of 10, or whatever the value should be for the measure of our confidence in our proficiency."

The ring tossing analogy seems to work pretty well. I'll customize it by using horseshoes instead. Yes, 95 out of 100 times you might throw a ringer (in the game of horseshoes, that is when the horseshoe circles the peg or stake when you toss it). You know this before you toss it. And to use Dave Giles' language, *before* calculating a confidence interval we know that 95% of such intervals will cover the population parameter of interest. And, after we toss the shoe, it either circles the peg or not; that is a 1 or a 0 in terms of probability. Similarly, *after* computing a confidence interval, the true mean or population parameter of interest is covered or not, with a probability of 0 or 100%.
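A quick simulation can make the horseshoe analogy concrete. The sketch below is only an illustration and is not from any of the texts quoted above: it assumes normally distributed data with a known standard deviation (so a simple z-interval applies), and the function name, sample size, and true mean are all made up for the example. Each simulated interval either covers the true mean or it doesn't, and the 95% shows up only as a property of the whole collection of tosses.

```python
import random
from statistics import NormalDist, mean

def coverage_rate(true_mu=10.0, sigma=2.0, n=30, reps=2000, level=0.95, seed=1):
    """Share of simulated z-intervals that cover true_mu (hypothetical setup)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    half_width = z * sigma / n ** 0.5            # assumption: sigma is known
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        xbar = mean(sample)
        # Once computed, this particular interval either covers true_mu or not.
        if xbar - half_width <= true_mu <= xbar + half_width:
            hits += 1
    return hits / reps

rate = coverage_rate()
print(f"{rate:.3f} of the simulated intervals covered the true mean")
```

Calling coverage_rate(reps=1) returns either 0.0 or 1.0, which is the 'after the toss' situation for a single computed interval.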

This isn't perfect, but thinking of confidence intervals this way at least keeps us honest about making probability statements.

Going back to my previous post, I still like the description of confidence intervals Angrist and Pischke provide in Mastering 'Metrics, that is, 'describing a set of parameter values consistent with our data.'

For instance, if we run the regression:

y = b0 + b1X + e  to estimate y = B0 + B1X + e

and get our parameter estimate b1 with a 95% confidence interval like (1.2,1.8), we can say that our sample data is consistent with any population that has a B1 taking a value that falls in the interval. That implies there are a number of populations that our data would be consistent with. Narrower intervals imply very similar populations, very similar values of B1, and speak to more precision in our estimate of B1.
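To see the 'set of parameter values consistent with our data' idea in action, here is a minimal sketch that fits y = b0 + b1X + e by least squares on simulated data and reports an interval for the slope. Everything in it is an assumption for illustration: the true coefficients, the noise level, the function name, and the use of a normal critical value in place of the t critical value (reasonable only when n is not small).

```python
import random
from statistics import NormalDist, mean

def slope_interval(n, true_b0=2.0, true_b1=1.5, noise_sd=3.0, level=0.95, seed=7):
    """Fit y = b0 + b1*x + e by least squares; return (b1, lower, upper)."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [true_b0 + true_b1 * xi + rng.gauss(0, noise_sd) for xi in x]
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = (rss / (n - 2) / sxx) ** 0.5         # usual OLS standard error of the slope
    z = NormalDist().inv_cdf(0.5 + level / 2)    # normal approximation to the t value
    return b1, b1 - z * se_b1, b1 + z * se_b1

for n in (50, 5000):
    b1, lo, hi = slope_interval(n)
    print(f"n={n}: b1={b1:.3f}, interval=({lo:.3f}, {hi:.3f}), width={hi - lo:.3f}")
```

The larger sample gives a much narrower interval, that is, a smaller set of population slopes the data are consistent with.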

I really can't make an analogy for hand grenades. It just gave me a title with a ring to it.

See also:
Interpreting Confidence Intervals
Bayesian Statistics Confidence Intervals and Regularization
Overconfident Confidence Intervals

Saturday, October 20, 2018

Power and Sample Size Analysis in Applied Econometrics

In applied work in econometrics I've done a limited amount of power and sample size analysis. Recently I was thinking about a conversation from an episode of the EconTalk podcast with Russ Roberts and John Ioannidis where the topic of power came up:

“though I was trained as a Ph.D., got a Ph.D. in economics at the U. of Chicago, I never heard that phrase, 'power,' applied to a statistical analysis. What we did--and I think what most economists, many economists, still do, is: we had a data set; we had something we wanted to discover and test or examine or explore, depending on the nature of the problem.”

That rings familiar to me. In eight years of attending talks and seminars in applied economics, what stands out are discussions of identification, endogeneity, standard errors etc. Not power or sample size. So I went back and looked at all of my copies of econometrics textbooks. These are well known and have been commonly used by master's and PhD students in economics: Econometric Analysis by Greene, Econometric Analysis of Cross Section and Panel Data by Wooldridge, A Course in Econometrics by Goldberger, A Guide to Econometrics by Kennedy, and Using Econometrics by Studenmund. I even threw in Mastering 'Metrics and Mostly Harmless Econometrics by Angrist and Pischke.

While Wooldridge did discuss clustering and stratified sampling, most of the emphasis was placed on getting the correct standard errors and appropriate weighting. From my previous years of referencing these texts, as well as a cursory review again of the index and chapters of each one I could not find any treatment of power or sample size calculations.

So I thought, maybe this is something covered in prerequisite courses. Going back to the undergraduate level in economics I recall very little about this. Checking a popular text, Statistics for Business and Economics by Anderson, Sweeney, Williams, Camm, and Cochran, I did find a basic example in relation to power and sample sizes for a t-test. What about a graduate level prerequisite for econometrics? In my first year of graduate school I took a graduate level course in mathematical statistics (this was a course doing business under a research methods title) that used DeGroot's text Probability and Statistics. It definitely covered a lot about the concept of power in theory, but placed no emphasis on the various calculations for sample size. The one textbook I own with treatment of this is Principles and Procedures of Statistics: A Biometrical Approach by Steel, Torrie, and Dickey. But that does not count because that was the text used in my experimental design course in graduate school, which is not part of a standard econometrics curriculum.

I've come to the conclusion that power and sample size analysis may not be widely emphasized in graduate econometrics training across the board in all programs. It's not something missed in a lecture a decade ago. Similar to advanced specialized topics like spatial econometrics, details related to power and sample size analysis, survey design, stratified random sampling etc. are likely covered depending on one's specialty in the field and the program.

However, it is evident that some economists do this kind of work.

For instance, here is an example from a paper with food economist Jayson Lusk:

"However, there are many economic problems where sample size directly affects a benefit or loss function. In these cases, sample size is an endogenous variable that should be considered jointly with other choice variables in an optimization problem. In this article we introduce an economic approach to sample size determination utilizing a Bayesian decision theoretic framework."

Healthcare economist Austin Frakt is another example.

So why do we care about power and sample size and what is 'power'?

Jim Manzi, author of Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society, offers the following analogy in an EconTalk podcast:

“Well, the power in a statistical experiment, and I often use this analogy, is sort of like the magnification power on the microscope you probably used in high school biology. It has on the side, 4x, 8x, 16x, which is how many times it can increase the apparent size of a physical object. And the metaphor I'd use is, if I try and use a child's microscope to carefully observe a section of a leaf looking for an insect that's a little smaller than an ant, and I don't observe the ant, I can reliably say: I don't see the insect, and therefore there is no bug there. If I use that exact same microscope to try and find on that exact same piece of leaf, not a bug but a tiny microbe that's, you know, smaller than a speck of dust, I'll look at it and I'll say: it's all kind of fuzzy, I see a lot of squiggly things; I think that little squiggle might be something or it might not. I don't see the microbe, but I can't reliably say that therefore there is no microbe there, because trying to zoom in closer and closer to look for something that small, all I see is a bunch of fuzz. So my failure to see the microbe is a statement about the precision of my instrument, not about whether there's really a microbe on the leaf.”

So, if we have a sample that is ‘not sufficiently powered’ it is possible that we could fail to find a relationship between treatment and outcome, even if one actually exists. Equivalently, our estimated regression coefficient may not be statistically significant when a relationship actually does exist. Increasing sample size is one primary way to increase power in an experiment. So the question becomes how large does ‘n’ have to be to have a sample sufficiently powered to detect the effect of a treatment on an outcome (at some stated level of significance)?
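As a rough illustration of what such a calculation looks like, the sketch below uses the standard normal-approximation formula for comparing two group means with equal variances, n per group = 2(z_{1-alpha/2} + z_{power})^2 sigma^2 / delta^2. This is a textbook approximation, not something taken from the sources discussed here, and the function names and the example effect size are my own assumptions. A real study would typically use a t-based or simulation-based calculation tailored to the actual design.

```python
from math import ceil, sqrt
from statistics import NormalDist

_norm = NormalDist()

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per arm to detect a mean difference delta (two-sided z-test)."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    z_beta = _norm.inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

def approx_power(n, delta, sigma, alpha=0.05):
    """Approximate power with n per arm, ignoring the negligible opposite tail."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    se = sigma * sqrt(2 / n)                 # standard error of the difference in means
    return _norm.cdf(abs(delta) / se - z_alpha)

# Hypothetical example: detect a difference of half a standard deviation.
n = n_per_group(delta=0.5, sigma=1.0)
print(n, round(approx_power(n, 0.5, 1.0), 3))
print(round(approx_power(20, 0.5, 1.0), 3))  # same effect, only 20 per arm
```

Under these assumptions a half-standard-deviation effect needs about 63 observations per group for 80% power at the 5% level, while 20 per group would miss a real effect of that size more often than it finds it. That is Manzi's child's microscope.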

So how do you do these calculations? If you can't find examples in your econometrics textbook (if you do find one let me know!) there are plenty of texts in the biostatistics genre that probably cover this. Principles and Procedures of Statistics: A Biometrical Approach by Steel, Torrie, and Dickey is one example that I started with. Cochran, W. (1977), Sampling Techniques, 3rd ed., is another often cited source.

See also: Andrew Gelman on EconTalk discussing "what does not kill my statistical significance makes it stronger"

Sunday, July 29, 2018

Performance of Machine Learning Models on Time Series Data

In the past few years there has been an increased interest among economists in machine learning. For more discussion see here, here, here, here, here, here, here, and here. See also Mindy Mallory's recent post here.

While some folks like Susan Athey are beginning to develop the theory to understand how machine learning can contribute to causal inference, it has carved out a niche in the area of prediction. But what about time series analysis and forecasting?

That is a question taken up by authors this past March in an interesting paper (Statistical and Machine Learning forecasting methods: Concerns and ways forward). They took a good look at the performance of popular machine learning algorithms relative to traditional statistical time series approaches. The authors found that traditional approaches, including exponential smoothing and econometric time series approaches, outperformed algorithmic approaches from machine learning across a number of model specifications, algorithms, and time series data sources.

Below are some interesting excerpts and takeaways from the paper:

When I think of time series methods, I think of things like cointegration, stationarity, autocorrelation, seasonality, auto-regressive conditional heteroskedasticity etc. (I recommend Mindy Mallory's posts on time series here.)

Hearing so much about the ability of some machine learning approaches (like deep learning) to mimic feature engineering, I wondered how well algorithmic approaches would handle these issues in time series applications. The authors looked at some of the previous literature in relation to this:

"In contrast to sophisticated time series forecasting methods, where achieving stationarity in both the mean and variance is considered essential, the literature of ML is divided with some studies claiming that ML methods are capable of effectively modelling any type of data pattern and can therefore be applied to the original data [62]. Other studies however, have concluded the opposite, claiming that without appropriate preprocessing, ML methods may become unstable and yield suboptimal results [28]."

One thing about this paper, as I read it, is that it does not take an adversarial or luddite tone toward machine learning methods in favor of more traditional approaches. While they found challenges related to predictive accuracy, they seemed to proactively look deeper to understand why ML algorithms performed the way they did and how to make ML approaches better at time series.

One of the challenges with ML, even with cross-validation, was overfitting and the confusion of signals, patterns, and noise in the data:

"An additional concern could be the extent of randomness in the series and the ability of ML models to distinguish the patterns from the noise of the data, avoiding over-fitting....A possible reason for the improved accuracy of the ARIMA models is that their parameterization is done through the minimization of the AIC criterion, which avoids over-fitting by considering both goodness of fit and model complexity."

They also point to instances where ML methods may offer advantages:

"even though M3 might be representative of the reality when it comes to business applications, the findings may be different if nonlinear components are present, or if the data is being dominated by other factors. In such cases, the highly flexible ML methods could offer significant advantage over statistical ones"

It was interesting that basic exponential smoothing approaches outperformed much more complicated ML methods:

"the only thing exponential smoothing methods do is smoothen the most recent errors exponentially and then extrapolate the latest pattern in order to forecast. Given their ability to learn, ML methods should do better than simple benchmarks, like exponential smoothing."

However, the authors note it is often the case that smoothing methods can offer advantages over more complex econometric time series approaches as well (i.e. ARIMA, VAR, GARCH etc.).
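To show just how little machinery is involved, here is a minimal sketch of simple exponential smoothing written in its error-correction form. It is not the implementation used in the paper: the function name and toy series are made up, the level is initialized at the first observation (one of several common conventions), and there is no trend or seasonal component.

```python
def simple_exp_smoothing(series, alpha):
    """Return the smoothed levels; the last level is the flat forecast."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]                   # assumption: start at the first observation
    levels = [level]
    for y in series[1:]:
        error = y - level               # one-step-ahead forecast error
        level = level + alpha * error   # correct the level by a fraction of the error
        levels.append(level)
    return levels

demand = [10, 12, 14]                   # toy data
levels = simple_exp_smoothing(demand, alpha=0.5)
print(levels)        # [10, 11.0, 12.5]
print(levels[-1])    # forecast for every future period: 12.5
```

The whole method is one update rule with a single tuning parameter, which is part of why its strong showing against far more flexible learners is so striking.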

Toward the end of the paper the authors go on to discuss in detail the differences between the domains where we have seen a lot of success in machine learning (speech and image recognition, games, self-driving cars, etc.) and time series and forecasting applications.

In table 10 of the paper, they drill into some of these specific differences and discuss structural instabilities related to time series data, how the 'rules' change and how forecasts themselves can influence future values, and how this kind of noise might be hard for ML algorithms to capture.

This paper is definitely worth going through again and one to keep in mind if you are about to embark on an applied forecasting project.


Makridakis S, Spiliotis E, Assimakopoulos V (2018) Statistical and Machine Learning forecasting methods: Concerns and ways forward. PLoS ONE 13(3): e0194889.

See also Paul Cuckoo's LinkedIn post on this paper.

Sunday, July 15, 2018

The Credibility Revolution(s) in Econometrics and Epidemiology

I've written before about the credibility revolution in economics. It also seems that in parallel with econometrics, epidemiology has its own revolution to speak of. In The Deconstruction of Paradoxes in Epidemiology, Miquel Porta writes:

"If a “revolution” in our field or area of knowledge was ongoing, would we feel it and recognize it? And if so, how?...The “revolution” is partly founded on complex mathematics, and concepts as “counterfactuals,” as well as on attractive “causal diagrams” like Directed Acyclic Graphs (DAGs). Causal diagrams are a simple way to encode our subject-matter knowledge, and our assumptions, about the qualitative causal structure of a problem. Causal diagrams also encode information about potential associations between the variables in the causal network. DAGs must be drawn following rules much more strict than the informal, heuristic graphs that we all use intuitively. Amazingly, but not surprisingly, the new approaches provide insights that are beyond most methods in current use......The possible existence of a “revolution” might also be assessed in recent and new terms as collider, M-bias, causal diagram, backdoor (biasing path), instrumental variable, negative controls, inverse probability weighting, identifiability, transportability, positivity, ignorability, collapsibility, exchangeable, g-estimation, marginal structural models, risk set, immortal time bias, Mendelian randomization, nonmonotonic, counterfactual outcome, potential outcome, sample space, or false discovery rate."

There is a lot said there. Most economists find themselves at home in relation to discussions involving most of this including anything related to potential outcomes and counterfactuals and the methods like those mentioned in the last paragraph. However, what might seem to make the revolution in epidemiology different from econometrics (at least for some applied economists) is the emphasis on directed acyclic graphs (DAGs).

Over at the Causal Analysis in Theory and Practice blog, in a post titled "Are economists smarter than epidemiologists (comments on Imbens' recent paper)", they discuss comments by Guido Imbens from a Statistical Science paper (worth a read):

"In observational studies in social science, both these assumptions tend to be controversial. In this relatively simple setting, I do not see the causal graphs as adding much to either the understanding of the problem, or to the analyses."

The blog post is quite critical of this stance:

"Can economists do in their heads what epidemiologists observe in their graphs? Can they, for instance, identify the testable implications of their own assumptions? Can they decide whether the IV assumptions (i.e., exogeneity and exclusion) are satisfied in their own models of reality? Of course the can’t; such decisions are intractable to the graph-less mind....Or, are problems in economics different from those in epidemiology? I have examined the structure of typical problems in the two fields, the number of variables involved, the types of data available, and the nature of the research questions. The problems are strikingly similar."

Being trained in both biostatistics and econometrics, I encountered the credibility revolution and causal analysis mostly through seminars and talks on applied econometrics.  As economist Jayson Lusk puts it:

"if you attend a research seminar in virtually any economics department these days, you're almost certain to hear questions like, "what is your identification strategy?" or "how did you deal with endogeneity or selection?"  In short, the question is: how do we know the effects you're reporting are causal effects and not just correlations."

The first applications I encountered utilizing DAGs were either from economist Marc Bellemare, in one of his papers on lagged explanatory variables, or from a Statistics in Medicine paper authored by Davey Smith et al. featuring Mendelian randomization.

See also:

How is it that SEMs subsume potential outcomes? 
Mediators and moderators

Thursday, May 24, 2018

Statistical Inference vs. Causal Inference vs. Machine Learning: A motivating example

In his well known paper, Leo Breiman discusses the 'cultural' differences between algorithmic (machine learning) approaches and traditional methods related to inferential statistics. Recently, I discussed how important understanding these kinds of distinctions is when it comes to understanding how current automated machine learning tools can be leveraged in the data science space.

In his paper Leo Breiman states:

"Approaching problems by looking for a data model imposes an apriori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems."

On the other hand, Susan Athey's work highlights the fact that no one has developed the asymptotic theory necessary to adequately address causal questions using methods from machine learning (i.e. how does a given machine learning algorithm fit into the context of the Rubin Causal Model/potential outcomes framework?)

Dr. Athey is working to bridge some of this gap, but it's very complicated. I think there is a lot that can also be done just by understanding and communicating about the differences between inferential and causal questions vs. machine learning/predictive modeling questions. When should each be used for a given business problem? What methods does this entail?

In an MIT Data Made to Matter podcast, economist Joseph Doyle discusses his paper investigating the relationship between more aggressive (and expensive) treatments by hospitals and improved outcomes for Medicare patients. Using this as an example, I hope to broadly illustrate some of these differences by looking at this problem through all three lenses.

Statistical Inference

Suppose we just want to know if there is a significant relationship between aggressive treatments 'A' and health outcomes (mortality) 'M.' We might estimate a regression equation (similar to one of the models in the paper) such as:

M = b0 + b1*A + b2*X + e where X is a vector of relevant controls.

We would be very careful about the nature of our data, correct functional form, and getting our standard errors correct to make valid inferences about our estimate 'b1' of the relationship between aggressive treatments A and mortality M. A lot of this is traditionally taught in econometrics, biostatistics, and epidemiology (things like heteroskedasticity, multicollinearity, distributional assumptions related to the error terms etc.)

Causal Inference

Suppose we wanted to know if the estimate b1 in the equation above is causal. In Doyle's paper they discuss some of the challenges:

"A major issue that arises when comparing hospitals is that they may treat different types of patients. For example, greater treatment levels may be chosen for populations in worse health. At the individual level, higher spending is strongly associated with higher mortality rates, even after risk adjustment, which is consistent with more care provided to patients in (unobservably) worse health. At the hospital level, long-term investments in capital and labor may reflect the underlying health of the population as well. Differences in unobservable characteristics may therefore bias results toward finding no effect of greater spending."

One of the points he is making is that even if we control for everything we typically measure in these studies (captured by X above), there are unobservable patient characteristics that bias our estimate of b1. Recall that methods like regression and matching (two flavors of identification strategies based on selection on observables) achieve identification by assuming that, conditional on observed characteristics (X), selection bias disappears. We want to make conditional-on-X comparisons of Y (or M in the model above) that mimic as much as possible the experimental benchmark of random assignment (see more on matching estimators here).

However, if there are important characteristics related to selection that we don't observe and can't include in X, then in order to make valid causal statements about our results, we need a method that identifies treatment effects within a selection on 'un'-observables framework (examples include difference-in-differences, fixed effects, and instrumental variables).

In Doyle's paper, they used ambulance referral patterns as an instrument for hospital choice to make causal statements about A.
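A small simulation can show why the instrument matters. The sketch below is a stylized toy, not Doyle's model or data: the variable names, coefficients, and the binary instrument are all hypothetical, mortality is treated as a continuous index, and the simple ratio (Wald) form of the instrumental variables estimator is used rather than the specification in the paper. An unobserved severity term drives both treatment intensity and mortality, so the naive regression slope picks up selection, while the instrument isolates variation in treatment that is unrelated to severity.

```python
import random
from statistics import mean

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    abar, bbar = mean(a), mean(b)
    return sum((x - abar) * (y - bbar) for x, y in zip(a, b)) / (len(a) - 1)

def simulate(n=20000, true_effect=-0.5, seed=3):
    """Toy data: instrument Z, treatment intensity A, mortality index M."""
    rng = random.Random(seed)
    Z, A, M = [], [], []
    for _ in range(n):
        z = rng.choice((0, 1))               # instrument, as good as randomly assigned
        u = rng.gauss(0, 1)                  # unobserved severity (never in X)
        a = 1.5 * z + u + rng.gauss(0, 1)    # sicker patients get more treatment
        m = true_effect * a + 2.0 * u + rng.gauss(0, 1)  # sicker patients fare worse
        Z.append(z)
        A.append(a)
        M.append(m)
    return Z, A, M

Z, A, M = simulate()
ols = cov(A, M) / cov(A, A)   # naive slope of M on A, contaminated by severity
iv = cov(Z, M) / cov(Z, A)    # instrumental variables (Wald) estimate
print(f"true effect -0.5, OLS {ols:.2f}, IV {iv:.2f}")
```

In this setup the naive slope comes out positive (more treatment appears to go with higher mortality, much like the individual-level association described in the quote above), while the IV estimate lands near the true negative effect built into the simulation.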

Machine Learning/Predictive Modeling

Suppose we just want to predict mortality by hospital to support some policy or operational objective where the primary need is accurate predictions. A number of algorithmic methods might be exploited including logistic regression, decision trees, random forests, neural networks etc. Based on the mixed findings in the literature, a machine learning algorithm may not exploit 'A' at all even though Doyle finds a significant causal effect based on his instrumental variables estimator. The point is, in many cases a black box algorithm that includes or excludes treatment intensity as a predictor doesn't really care about the significance of this relationship or its causal mechanism, as long as at the end of the day the algorithm predicts well out of sample and maintains reliability and usefulness in application over time.


If we wanted to know whether the relationship between intensity of care 'A' and mortality 'M' was statistically significant or causal, we would not rely on machine learning methods, at least not on anything available off the shelf today, pending further work by researchers like Susan Athey. We would develop the appropriate causal or inferential model designed to answer the particular question at hand. In fact, as Susan Athey points out in a past Quora commentary, models used for causal inference could possibly give worse predictions:

"Techniques like instrumental variables seek to use only some of the information that is in the data – the “clean” or “exogenous” or “experiment-like” variation in price—sacrificing predictive accuracy in the current environment to learn about a more fundamental relationship that will help make decisions...This type of model has not received almost any attention in ML."

The point is, for the data scientist caught in the middle of so much disruption related to tools like automated machine learning, as well as technologies producing and leveraging large amounts of data, it is important to focus on business understanding and to map the appropriate method to the goal at hand. The ability to understand the differences in tools and methodologies related to statistical inference, causal inference, and machine learning, and to explain those differences to stakeholders, will be important to prevent 'straight jacket' thinking about solutions to complex problems.


Doyle, Joseph et al. “Measuring Returns to Hospital Care: Evidence from Ambulance Referral Patterns.” Journal of Political Economy 123.1 (2015): 170–214. PMC. Web. 11 July 2017.

Matt Bogard. "A Guide to Quasi-Experimental Designs" (2013)
Available at:

Tuesday, April 17, 2018

He who must not be named....or can we say 'causal'?

Recall that in the Harry Potter series, the wizard community refused to say the name of 'Voldemort', and it got to the point where they almost stopped teaching and practicing magic (at least officially, as mandated by the Ministry of Magic). In the research community, by refusing to use the term 'causal' when and where appropriate, are we discouraging researchers from asking interesting questions and putting forth the effort required to implement the kind of rigorous causal inferential methods necessary to push forward the frontiers of science? Could we somehow be putting a damper on teaching and practicing economagic...I mean, you know, the mostly harmless kind? Will the credibility revolution be lost?

In a May 2018 article in the American Journal of Public Health (by Miguel Hernán of the Departments of Epidemiology and Biostatistics, Harvard School of Public Health) there is an important discussion about the somewhat tiring mantra 'correlation is not causation' and the disservice to scientific advancement that it can lead to in the absence of critical thinking about research objectives and designs. Some people might think this is ironic, since the phrase is often invoked as a means to point out fallacious conclusions that have been uncritically based on mere correlations found in the data. However, the pendulum can swing too far in the other direction, causing as much harm.

I highly recommend reading this article! It is available ungated and will be one of those you hold onto for a while. See the reference section below.

Key to the discussion are important distinctions between questions of association, prediction, and causality. Below are some spoilers:

While it is wrong to assume causality based on association or correlation alone, refusing to recognize a causal approach in the analysis because of growing cultural 'norms' is not good either....and should stop:

"The resulting ambiguity impedes a frank discussion about methodology because the methods used to estimate causal effects are not the same as those used to estimate associations...We need to stop treating “causal” as a dirty word that respectable investigators do not say in public or put in print. It is true that observational studies cannot definitely prove causation, but this statement misses the point"

All that glitters isn't gold, as the author notes on randomized controlled trials:

"Interestingly, the same is true of randomized trials. All we can estimate from randomized trials data are associations; we just feel more confident giving a causal interpretation to the association between treatment assignment and outcome because of the expected lack of confounding that physical randomization entails. However, the association measures from randomized trials cannot be given a free pass. Although randomization eliminates systematic confounding, even a perfect randomized trial only provides probabilistic bounds on “random confounding”—as reflected in the confidence interval of the association measure—and many randomized trials are far from perfect."

There are important distinctions between analysis and methodological approaches when asking questions related to prediction and association vs causality. Saying a bit more, this is not just about model interpretation. We are familiar with discussions about challenges related to interpreting predictive models derived from complicated black box algorithms, but causality hinges on much more than just the ability to interpret the impact of features on an outcome. Also note that while we are seeing applications of AI and automated feature engineering and algorithm selection, models optimized to predict well may not explain well at all. In fact, a causal model may perform worse in out of sample predictions of the 'target' while giving the most rigorous estimate of causal effects:

"In associational or predictive models, we do not try to endow the parameter estimates with a causal interpretation because we are not trying to adjust for confounding of the effect of every variable in the model. Confounding is a causal concept that does not apply to associations...By contrast, in a causal analysis, we need to think carefully about what variables can be confounders so that the parameter estimates for treatment or exposure can be causally interpreted. Automatic variable selection procedures may work for prediction, but not necessarily for causal inference. Selection algorithms that do not incorporate sufficient subject matter knowledge may select variables that introduce bias in the effect estimate, and ignoring the causal structure of the problem may lead to apparent paradoxes."

It all comes down to being a question of identification....or why AI has a long way to go in the causal space...or as Angrist and Pischke would put it....if applied econometrics were easy theorists would do it:

"Associational inference (prediction)or causal inference (counterfactual prediction)? The answer to this question has deep implications for (1) how we design the observational analysis to emulate a particular target trial and (2) how we choose confounding adjustment variables. Each causal question corresponds to a different target trial, may require adjustment for a different set of confounders, and is amenable to different types of sensitivity analyses. It then makes sense to publish separate articles for various causal questions based on the same data."

I really liked how they phrased 'prediction' in terms of distinctly being associational or prospective vs. counterfactual. Also, what a nice way to think about 'identification' being about how we emulate a particular trial and handle confounding/selection bias/endogeneity.


Miguel A. Hernán, “The C-Word: Scientific Euphemisms Do Not Improve Causal Inference From Observational Data”, American Journal of Public Health 108, no. 5 (May 1, 2018): pp. 616-619.

See also:

Will there be a credibility revolution in data science and AI?

To Explain or Predict?

Sunday, March 18, 2018

Will there be a credibility revolution in data science and AI?

Summary: Understanding where AI and automation are going to be the most disruptive to data scientists in the near term relates to understanding methodological differences between explaining and predicting, between machine learning and causal inference. It will require the ability to ask a different kind of question than machine learning algorithms are capable of answering off of the shelf today.

There is a lot of enthusiasm about the disruptive role of automation and AI in data science. Products like H2O.ai and DataRobot offer tools to automate or fast track many aspects of the data science work stream. If this trajectory continues, what will the work of the future data scientist look like?

Many have already pointed out the very difficult task of automating the soft skills possessed by data scientists. In a previous LinkedIn post I discussed this in the trading space where automation and AI could create substantial disruptions for both data scientists and traders. Here I quoted Matthew Hoyle:

"Strategies have a short shelf life-what is valuable is the ability and energy to look at new and interesting things and put it all together with a sense of business development and desire to explore"

My conclusion: They are talking about bringing a portfolio of useful and practical skills together to do a better job than was possible before open source platforms and computing power became so widespread. I think that is the future.

So the future is about rebalancing the data scientist's portfolio of skills. However, in the near term I think the disruption from AI and automation in data science will do more than increase the emphasis on soft skills. In fact there will remain a significant portion of 'hard skills' that will see an increase in demand because of the difficulty of automation.

Understanding this will depend largely on making a distinction between explaining and predicting. Much of what appears to be at the forefront of automation involves tasks supporting supervised and unsupervised machine learning algorithms as well as other prediction and forecasting tools like time series analysis.

Once armed with predictions, businesses will start to ask questions about 'why'. This will transcend prediction or any of the visualizations of the patterns and relationships coming out of black box algorithms. They will want to know what decisions or factors are moving the needle on revenue or customer satisfaction and engagement or improved efficiencies. Essentially they will want to ask questions related to causality, which requires a completely different paradigm for data analysis than questions of prediction. And they will want scientifically formulated answers that are convincing vs. mere reports about rates of change or correlations. There is a significant difference between understanding what drivers correlate with or 'predict' the outcome of interest and what is actually driving the outcome. What they will be asking for is a credibility revolution in data science.

What do we mean by a credibility revolution?

Economist Jayson Lusk puts it well:

"Fortunately economics (at least applied microeconomics) has undergone a bit of credibility revolution.  If you attend a research seminar in virtually any economi(cs) department these days, you're almost certain to hear questions like, "what is your identification strategy?" or "how did you deal with endogeneity or selection?"  In short, the question is: how do we know the effects you're reporting are causal effects and not just correlations."

Healthcare Economist Austin Frakt has a similar take:

"A “research design” is a characterization of the logic that connects the data to the causal inferences the researcher asserts they support. It is essentially an argument as to why someone ought to believe the results. It addresses all reasonable concerns pertaining to such issues as selection bias, reverse causation, and omitted variables bias. In the case of a randomized controlled trial with no significant contamination of or attrition from treatment or control group there is little room for doubt about the causal effects of treatment so there’s hardly any argument necessary. But in the case of a natural experiment or an observational study causal inferences must be supported with substantial justification of how they are identified. Essentially one must explain how a random experiment effectively exists where no one explicitly created one."

How are these questions and differences unlike your typical machine learning application? Susan Athey does a great job explaining in a Quora response about how causal inference is different from off the shelf machine learning methods (the kind being automated today):

"Sendhil Mullainathan (Harvard) and Jon Kleinberg with a number of coauthors have argued that there is a set of problems where off-the-shelf ML methods for prediction are the key part of important policy and decision problems.  They use examples like deciding whether to do a hip replacement operation for an elderly patient; if you can predict based on their individual characteristics that they will die within a year, then you should not do the operation...Despite these fascinating examples, in general ML prediction models are built on a premise that is fundamentally at odds with a lot of social science work on causal inference. The foundation of supervised ML methods is that model selection (cross-validation) is carried out to optimize goodness of fit on a test sample. A model is good if and only if it predicts well. Yet, a cornerstone of introductory econometrics is that prediction is not causal inference.....Techniques like instrumental variables seek to use only some of the information that is in the data – the “clean” or “exogenous” or “experiment-like” variation in price—sacrificing predictive accuracy in the current environment to learn about a more fundamental relationship that will help make decisions...This type of model has not received almost any attention in ML."
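Athey's point about instrumental variables trading predictive accuracy for a more fundamental relationship can be sketched with a toy simulation. The setup below is entirely my own assumption and not from her Quora response: an unobserved demand shock `u` that moves both price and quantity, a hypothetical cost shifter `z` that serves as the instrument, and a true price effect of -1.0. The IV estimate here is the simple ratio of covariances for the one-instrument case, not a full two-stage least squares routine.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
true_effect = -1.0  # assumed causal effect of price on quantity (illustrative)

z = rng.normal(size=n)              # hypothetical cost shifter: the instrument
u = rng.normal(size=n)              # unobserved demand shock: the confounder
price = z + u + rng.normal(size=n)  # price responds to both cost and demand
quantity = true_effect * price + 2.0 * u + rng.normal(size=n)

train, test = slice(0, n // 2), slice(n // 2, n)

def slope_ols(x, y):
    """Simple regression slope: cov(x, y) / var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def slope_iv(z, x, y):
    """Single-instrument IV slope: cov(z, y) / cov(z, x)."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

def holdout_mse(slope, x_tr, y_tr, x_te, y_te):
    """Prediction error on held-out data, intercept taken from training means."""
    intercept = y_tr.mean() - slope * x_tr.mean()
    return float(np.mean((y_te - intercept - slope * x_te) ** 2))

beta_ols = slope_ols(price[train], quantity[train])
beta_iv = slope_iv(z[train], price[train], quantity[train])

mse_ols = holdout_mse(beta_ols, price[train], quantity[train],
                      price[test], quantity[test])
mse_iv = holdout_mse(beta_iv, price[train], quantity[train],
                     price[test], quantity[test])

print("OLS slope:", round(beta_ols, 3), " holdout MSE:", round(mse_ols, 3))
print("IV slope: ", round(beta_iv, 3), " holdout MSE:", round(mse_iv, 3))
```

Under these assumptions the OLS slope is pulled well away from -1.0 because it absorbs the demand shock, while the IV slope should sit close to it. On the holdout sample the ordering flips: OLS predicts quantity better. A procedure that selects models purely on test-sample fit would pick the wrong answer to the causal question.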

Developing an identification strategy, as Jayson Lusk discussed above, and all that goes along with that (finding natural experiments or valid instruments, or navigating the garden of forking paths related to propensity score matching or a number of other quasi-experimental methods) involves careful considerations and decisions to be made and defended in ways that would be very challenging to automate. Even when humans do this there is rarely a single best approach to these problems. They are far from routine. Just ask anyone that has been through peer review or given a talk at an economics seminar or conference.

The kinds of skills required to work in this space would be similar to those of the econometrician or epidemiologist or any quantitative researcher that has been culturally immersed in the social norms and practices that have evolved out of the credibility revolution. As data science thought leader Eugene Dubossarsky puts it:

“the most elite skills…the things that I find in the most elite data scientists are the sorts of things econometricians these days have…bayesian statistics…inferring causality” 

No one has a crystal ball. This is not to say that the current advances in automation are falling short on creating value. They should no doubt create value like any other form of capital complementing the labor and soft skills of the data scientist. And they could free up more resources to focus on more causal questions that previously may not have been answered. I discussed this complementarity previously in a related post:

 "correlations or 'flags' from big data might not 'identify' causal effects, but they are useful for prediction and might point us in directions where we can more rigorously investigate causal relationships if interested" 

However, if automation in this space is possible, it will require a different approach than what we have seen so far. We might look to the pioneering work that Susan Athey is doing converging machine learning and causal inference. It will require thinking in terms of potential outcomes, endogeneity, and counterfactuals, which requires the ability to ask a different kind of question than machine learning algorithms are capable of answering off the shelf today.

Additional References:

From 'What If?' To 'What Next?' : Causal Inference and Machine Learning for Intelligent Decision Making

Susan Athey on Machine Learning, Big Data, and Causation 

Machine Learning and Econometrics (Susan Athey, Guido Imbens) 

Related Posts:

Why Data Science Needs Economics

To Explain or Predict

Culture War: Classical Statistics vs. Machine Learning

HARK! - flawed studies in nutrition call for credibility revolution -or- HARKing in nutrition research

Econometrics, Math, and Machine Learning

Big Data: Don't Throw the Baby Out with the Bathwater

Big Data: Causality and Local Expertise Are Key in Agronomic Applications

The Use of Knowledge in a Big Data Society II: Thick Data 

The Use of Knowledge in a Big Data Society 

Big Data, Deep Learning, and SQL

Economists as Data Scientists