Wednesday, May 15, 2019

Causal Invariance and Machine Learning

In an EconTalk podcast with Cathy O'Neil, Russ Roberts discusses her book Weapons of Math Destruction and some of the unintended negative consequences of certain machine learning applications in society. One of the problems with these algorithms and the features they leverage is that they are based on correlational relationships that may not be causal. As Russ states:

"Because there could be a correlation that's not causal. And I think that's the distinction that machine learning is unable to make--even though "it fit the data really well," it's really good for predicting what happened in the past, it may not be good for predicting what happens in the future because those correlations may not be sustained."

This echoes a theme in a recent blog post by Paul Hunermund:

“All of the cutting-edge machine learning tools—you know, the ones you’ve heard about, like neural nets, random forests, support vector machines, and so on—remain purely correlational, and can therefore not discern whether the rooster’s crow causes the sunrise, or the other way round”

I've made similar analogies before myself and still think this makes a lot of sense.

However, a talk at the International Conference on Learning Representations definitely made me stop and think about the kind of progress that has been made in the last decade and the direction research is headed. The talk was titled 'Learning Representations Using Causal Invariance' (you can watch it here: https://www.facebook.com/iclr.cc/videos/534780673594799/).

Abstract:

"Learning algorithms often capture spurious correlations present in the training data distribution instead of addressing the task of interest. Such spurious correlations occur because the data collection process is subject to uncontrolled confounding biases. Suppose however that we have access to multiple datasets exemplifying the same concept but whose distributions exhibit different biases. Can we learn something that is common across all these distributions, while ignoring the spurious ways in which they differ? This can be achieved by projecting the data into a representation space that satisfy a causal invariance criterion. This idea differs in important ways from previous work on statistical robustness or adversarial objectives. Similar to recent work on invariant feature selection, this is about discovering the actual mechanism underlying the data instead of modeling its superficial statistics."

This is pretty advanced machine learning and I am not an expert in this area by any means. My interpretation is that these are ways of learning from multiple environments that avoid overfitting to any single environment, so that predictions are robust to whatever spurious correlations a given environment happens to contain. It has a flavor of causality because the presenter argues that invariance is a common thread underpinning the work of both Rubin and Pearl. It potentially offers powerful predictions/extrapolations while avoiding some of the pitfalls/biases of non-causal machine learning methods.

Going back to Paul Hunermund's post, I might draw a dangerous parallel (because I'm still trying to fully grasp the talk), but here goes. If we used invariant learning to predict when or if the sun will rise, the algorithm would leverage those environments where the sun rises even if the rooster does not crow, as well as instances where the rooster crows but the sun fails to rise. As a result, the relationships that are merely correlational (like the sun rising when the rooster crows) will drop out, and only the more causal variables, whose relationship to the outcome is invariant to the environment, will enter the model. If this analogy is on track, this is a very exciting advancement!
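
A toy sketch may help make the analogy concrete. To be clear, this is not the method from the talk. It is a deliberately simple stand-in that I am assuming for illustration: simulate a few environments, fit an ordinary least squares regression in each one for every subset of features, and keep the subset whose coefficients change the least across environments. The data-generating process, the function names (make_env, coefficient_spread, select_invariant_subset) and all of the numbers are hypothetical. The "causal" feature drives the outcome the same way everywhere, while the "spurious" feature (the rooster) merely tracks the outcome with a strength that differs by environment.

```python
import itertools
import numpy as np

def make_env(n, spurious_strength, rng):
    """Simulate one environment (hypothetical data-generating process).

    y depends on x_causal with the same coefficient (2.0) everywhere.
    x_spurious only tracks y, with an environment-specific strength.
    """
    x_causal = rng.normal(size=n)
    y = 2.0 * x_causal + rng.normal(scale=0.5, size=n)
    x_spurious = spurious_strength * y + rng.normal(scale=0.5, size=n)
    return np.column_stack([x_causal, x_spurious]), y

def ols(X, y):
    """Least-squares coefficients (no intercept; the data are zero-mean)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def coefficient_spread(envs, subset):
    """Largest standard deviation, across environments, of any coefficient."""
    cols = list(subset)
    coefs = np.array([ols(X[:, cols], y) for X, y in envs])
    return coefs.std(axis=0).max()

def select_invariant_subset(envs, n_features):
    """Pick the non-empty feature subset with the most stable coefficients."""
    subsets = [s for k in range(1, n_features + 1)
               for s in itertools.combinations(range(n_features), k)]
    return min(subsets, key=lambda s: coefficient_spread(envs, s))

rng = np.random.default_rng(0)

# Three training environments; the spurious link is positive in all of them
# but differs in strength.
envs = [make_env(2000, s, rng) for s in (1.0, 0.5, 0.2)]
best = select_invariant_subset(envs, n_features=2)

# Compare a pooled regression on all features with one restricted
# to the selected subset.
X_pool = np.vstack([X for X, _ in envs])
y_pool = np.concatenate([y for _, y in envs])
pooled_coef = ols(X_pool, y_pool)
invariant_coef = ols(X_pool[:, list(best)], y_pool)

# A new environment where the spurious link is reversed
# (the rooster now crows when the sun does not rise).
X_new, y_new = make_env(2000, -2.0, rng)
mse_pooled = np.mean((X_new @ pooled_coef - y_new) ** 2)
mse_invariant = np.mean((X_new[:, list(best)] @ invariant_coef - y_new) ** 2)

print("selected feature indices:", best)
print("pooled coefficients:", pooled_coef)
print("invariant coefficients:", invariant_coef)
print("new-environment MSE (pooled vs invariant):", mse_pooled, mse_invariant)
```

In this setup the pooled regression puts weight on the spurious feature because it helped in every training environment, and that weight hurts once the relationship flips. The subset with stable coefficients contains only the causal feature, so its predictions carry over to the new environment. Real invariant learning methods are far more sophisticated than an exhaustive search over subsets, but the intuition of rewarding what stays the same across environments is the part this sketch tries to capture.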

Putting this into the context of predictive modeling/machine learning and causal inference, however, these methods create value by giving better answers (less biased and more robust to confounding) to questions that sit on the first rung of Judea Pearl’s ladder of causation (see the intro of The Book of Why). Invariant regression is still machine learning and as such does not appear to offer any means to make statistical inferences. That said, Susan Athey is doing really cool stuff in this area.

While invariant regression seems to share the invariance properties associated with causal mechanisms emphasized in Rosenbaum and Rubin’s potential outcomes framework and Pearl’s DAGs and ‘do’ operator, it still doesn’t appear to let us reach the third rung of Pearl’s ladder of causation, where counterfactual questions are answered. And it sounds dangerously close to the idea he criticizes in his book that "the data themselves will guide us to the right answers whenever causal questions come up" and allow us to skip the "hard step of constructing or acquiring a causal model."

I’m not sure that is the intention of the method or the talk. Still, it's an exciting advancement to be able to build a model with feature selection mechanisms that have more of a causal than a merely correlational flavor.