Tuesday, April 17, 2018

He who must not be named....or can we say 'causal'?

Recall that in the Harry Potter series the wizarding community refused to say the name 'Voldemort', and it got to the point where they almost stopped teaching and practicing magic (at least officially, as mandated by the Ministry of Magic). In the research community, by refusing to use the term 'causal' when and where appropriate, are we discouraging researchers from asking interesting questions and from putting forth the effort required to implement the kind of rigorous causal inferential methods necessary to push forward the frontiers of science? Could we somehow be putting a damper on teaching and practicing economagic...I mean econometrics...you know, the mostly harmless kind? Will the credibility revolution be lost?

In a May 2018 article in the American Journal of Public Health, Miguel Hernan (of the Departments of Epidemiology and Biostatistics, Harvard School of Public Health) offers an important discussion of the somewhat tiring mantra 'correlation is not causation' and the disservice to scientific advancement it can lead to in the absence of critical thinking about research objectives and designs. Some people might think this is ironic, since the phrase is often invoked to point out fallacious conclusions that have been uncritically based on mere correlations found in the data. However, the pendulum can swing too far in the other direction and cause just as much harm.

I highly recommend reading this article! It is available ungated and will be one of those you hold onto for a while. See the reference section below.

Key to the discussion are important distinctions between questions of association, prediction, and causality. Below are some spoilers:

While it is wrong to assume causality based on association or correlation alone, refusing to acknowledge a causal approach in an analysis because of growing cultural 'norms' is not good either...and should stop:

"The resulting ambiguity impedes a frank discussion about methodology because the methods used to estimate causal effects are not the same as those used to estimate associations...We need to stop treating “causal” as a dirty word that respectable investigators do not say in public or put in print. It is true that observational studies cannot definitely prove causation, but this statement misses the point"

All that glitters isn't gold, as the author notes about randomized controlled trials:

"Interestingly, the same is true of randomized trials. All we can estimate from randomized trials data are associations; we just feel more confident giving a causal interpretation to the association between treatment assignment and outcome because of the expected lack of confounding that physical randomization entails. However, the association measures from randomized trials cannot be given a free pass. Although randomization eliminates systematic confounding, even a perfect randomized trial only provides probabilistic bounds on “random confounding”—as reflected in the confidence interval of the association measure—and many randomized trials are far from perfect."
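To make 'random confounding' a little more concrete, here is a minimal simulation sketch. It is not from the article; the variable names, sample sizes, and effect sizes are all made-up assumptions for illustration. Many small trials are perfectly randomized, yet in any single trial a prognostic factor can be unbalanced across arms by chance, and that chance imbalance drives most of the estimate's error.

```python
import numpy as np

rng = np.random.default_rng(1)
true_effect = 2.0      # assumed treatment effect (illustrative)
n_per_trial = 40       # small trial, half treated
n_trials = 5_000       # number of simulated trials

estimates = np.empty(n_trials)
imbalances = np.empty(n_trials)
for i in range(n_trials):
    # Prognostic factor that affects the outcome but not the assignment.
    risk = rng.normal(size=n_per_trial)
    # Physical randomization: exactly half of the units are treated.
    treated = rng.permutation(n_per_trial) < n_per_trial // 2
    outcome = true_effect * treated + 3.0 * risk + rng.normal(size=n_per_trial)
    # Difference in mean outcomes between arms (the association measure).
    estimates[i] = outcome[treated].mean() - outcome[~treated].mean()
    # Chance imbalance in the prognostic factor between arms.
    imbalances[i] = risk[treated].mean() - risk[~treated].mean()

mean_estimate = estimates.mean()   # centered on the true effect across trials
spread = estimates.std()           # trial-to-trial variation of the estimate
corr = np.corrcoef(imbalances, estimates)[0, 1]

print(f"mean estimate across trials: {mean_estimate:.2f}")
print(f"std of estimates:            {spread:.2f}")
print(f"corr(imbalance, estimate):   {corr:.2f}")
```

Across trials the estimates average out to the assumed effect, so there is no systematic confounding, but any one trial can miss by a wide margin, and the miss tracks the chance imbalance in the prognostic factor. That trial-to-trial spread is what the confidence interval is meant to bound.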

There are important distinctions between the analyses and methodological approaches suited to questions of prediction and association and those suited to questions of causality. This is not just about model interpretation. We are familiar with discussions about the challenges of interpreting predictive models derived from complicated black box algorithms, but causality hinges on much more than the ability to interpret the impact of features on an outcome. Also note that while we are seeing more applications of AI and automated feature engineering and algorithm selection, models optimized to predict well may not explain well at all. In fact, a causal model may perform worse in out-of-sample predictions of the 'target' while giving the most rigorous estimate of causal effects:

"In associational or predictive models, we do not try to endow the parameter estimates with a causal interpretation because we are not trying to adjust for confounding of the effect of every variable in the model. Confounding is a causal concept that does not apply to associations...By contrast, in a causal analysis, we need to think carefully about what variables can be confounders so that the parameter estimates for treatment or exposure can be causally interpreted. Automatic variable selection procedures may work for prediction, but not necessarily for causal inference. Selection algorithms that do not incorporate sufficient subject matter knowledge may select variables that introduce bias in the effect estimate, and ignoring the causal structure of the problem may lead to apparent paradoxes."
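A small simulation can illustrate how a variable that helps prediction can wreck a causal estimate. This is only a sketch under assumed conditions, not an example from the article: the data-generating process, the variable names, and the helper functions `fit_ols` and `r_squared` are all hypothetical. Here `collider` is caused by both the treatment and the outcome, so a selection procedure chasing predictive fit would happily include it.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def r_squared(beta, X, y):
    """R^2 of the fitted coefficients on the given data."""
    X1 = np.column_stack([np.ones(len(y)), X])
    resid = y - X1 @ beta
    return 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)

rng = np.random.default_rng(0)
n = 100_000
true_effect = 1.0  # assumed causal effect of treatment on outcome

treatment = rng.normal(size=n)
outcome = true_effect * treatment + rng.normal(size=n)
# A collider: a common effect of both treatment and outcome.
collider = treatment + outcome + rng.normal(size=n)

train, test = slice(0, n // 2), slice(n // 2, n)
X_causal = treatment                                   # treatment only
X_pred = np.column_stack([treatment, collider])        # treatment + collider

beta_causal = fit_ols(X_causal[train], outcome[train])
beta_pred = fit_ols(X_pred[train], outcome[train])

# Out-of-sample fit on the held-out half.
r2_causal = r_squared(beta_causal, X_causal[test], outcome[test])
r2_pred = r_squared(beta_pred, X_pred[test], outcome[test])

print(f"causal model:     treatment coef = {beta_causal[1]:.2f}, test R^2 = {r2_causal:.2f}")
print(f"predictive model: treatment coef = {beta_pred[1]:.2f}, test R^2 = {r2_pred:.2f}")
```

In this setup the model that includes the collider predicts the held-out outcome better, yet its treatment coefficient is pushed toward zero, while the simpler model recovers the assumed effect with a worse fit. Nothing in the data alone tells the algorithm which variable is a collider; that takes knowledge of the causal structure.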

It all comes down to being a question of identification....or why AI has a long way to go in the causal space...or as Angrist and Pischke would put it....if applied econometrics were easy theorists would do it:

"Associational inference (prediction) or causal inference (counterfactual prediction)? The answer to this question has deep implications for (1) how we design the observational analysis to emulate a particular target trial and (2) how we choose confounding adjustment variables. Each causal question corresponds to a different target trial, may require adjustment for a different set of confounders, and is amenable to different types of sensitivity analyses. It then makes sense to publish separate articles for various causal questions based on the same data."

I really liked how the author framed 'prediction' as being distinctly either associational (prospective) or counterfactual. It is also a nice way to think about 'identification': it comes down to how we emulate a particular trial and handle confounding, selection bias, and endogeneity.


Reference:

Miguel A. Hernán, “The C-Word: Scientific Euphemisms Do Not Improve Causal Inference From Observational Data”, American Journal of Public Health 108, no. 5 (May 1, 2018): pp. 616-619.

See also:

Will there be a credibility revolution in data science and AI?

To Explain or Predict?