Monday, September 30, 2019

Wicked Problems and The Role of Expertise and AI in Data Science

In 2018, an article in Science characterized the challenge of pesticide resistance as a wicked problem:

“If we are to address this recalcitrant issue of pesticide resistance, we must treat it as a “wicked problem,” in the sense that there are social, economic, and biological uncertainties and complexities interacting in ways that decrease incentives for actions aimed at mitigation.”

In graduate school, I worked on this same problem, attempting to model the social and economic systems with game theory and behavioral economics and to capture the biological complexities with population genetics.

Wicked vs. Kind Environments

In data science, we also have 'wicked' learning environments in which we try to train our models. In the EconTalk podcast Mastery, Specialization, and Range, David Epstein discusses wicked and kind learning environments with Russ Roberts:

"The way that chess works makes it what's called a kind learning environment. So, these are terms used by psychologist Robin Hogarth. And what a kind learning environment is, is one where patterns recur; ideally a situation is constrained--so, a chessboard with very rigid rules and a literal board is very constrained; and, importantly, every time you do something you get feedback where you totally see the consequences. The consequences are completely immediate and accurate. And you adjust accordingly. And in these kinds of kind learning environments, if you are cognitively engaged you get better just by doing the activity."

"On the opposite end of the spectrum are wicked learning environments. And this is a spectrum, from kind to wicked. Wicked learning environments: often some information is hidden. Even when it isn't, feedback may be delayed. It may be infrequent. It may be nonexistent. And it may be partly accurate, or inaccurate in many of the cases. So, the most wicked learning environments will reinforce the wrong types of behavior."

As discussed in the podcast, many problems fall along a spectrum ranging from very kind environments like chess to more wicked environments like self-driving cars or medical diagnosis. What do experts have to offer where AI/ML falls short? The type of environment determines, to a great extent, the scope of disruption we can expect from AI applications.

The Role of Human Expertise

In Thinking Fast and Slow, Kahneman discusses two conditions for acquiring skill:

1) an environment that is sufficiently regular to be predictable
2) an opportunity to learn these regularities through prolonged practice

This sounds a lot like the 'kind' environments discussed above. Drawing on research by Robin Hogarth, Kahneman makes a similar distinction, describing 'wicked' environments as those in which experts are likely to learn the wrong lessons from experience. The problem is that in wicked environments, experts often default to heuristics, which can lead to wrong conclusions. Even when aware of these biases, experts are often nudged in the wrong direction by social norms. Kahneman gives an example involving physicians:

"Generally it is considered a weakness and a sign of vulnerability for clinicians to appear unsure. Confidence is valued over uncertainty and there is a prevailing censure against disclosing uncertainty to patients...acting on pretended knowledge is often the preferred solution."

This likely explains many of the mistakes and much of the low-value care that plague healthcare delivery, as well as dissatisfaction with both the quality and cost of healthcare. How many of us want our physicians to pretend to know what they are talking about? On the other hand, how many people are willing to accept an answer from their physician along the lines of "let me look this up and get back to you later"?

One advantage AI may have over experts in kind environments is, as Kahneman puts it, the opportunity to learn through prolonged practice. Machine learning can, so to speak, handle many more training examples than any human could.

Even in kind environments, an expert may swing and miss on cases where the correct decision is like a pitch straight over the plate. One reason Kahneman discusses in Thinking Fast and Slow is the idea of 'ego depletion': the idea that mental energy can become exhausted after significant exertion. As self-control breaks down, it's easy to default to heuristics and biases that can lead to decisions that look like careless mistakes. This would certainly apply to physicians, given the number of stories we hear about burnout in the profession.

The solution seems to be what polymath economist Tyler Cowen suggested several years ago in the EconTalk discussion of his book Average is Over with Russ Roberts:

"I would stress much more that humans can always complement robots. I'm not saying every human will be good at this. That's a big part of the problem. But a large number of humans will work very effectively with robots and become far more productive, and this will be one of the driving forces behind that inequality."

Imagine a clinical situation where a physician's 'ego' is substantially depleted by a difficult case. They could then lean on AI to prevent mistakes in the more routine decisions that follow. Or, by leveraging AI tools, a clinician could conserve mental energy throughout the day so that they are less likely to default to heuristics when they encounter more complex issues. How this synergy materializes is uncertain, but it will continue to involve substantial expertise on the part of many professionals. Together, human expertise and AI may have the greatest chance of tackling the most wicked problems.


Wicked evolution: Can we address the sociobiological dilemma of pesticide resistance? | Science

Thinking Fast and Slow. Daniel Kahneman. 2011

EconTalk: David Epstein on Mastery, Specialization, and Range

EconTalk: Tyler Cowen on Inequality, the Future, and Average is Over

Wednesday, May 15, 2019

Causal Invariance and Machine Learning

In an EconTalk podcast with Cathy O'Neil, Russ Roberts discusses her book Weapons of Math Destruction and some of the unintended negative consequences of certain machine learning applications in society. One problem with these algorithms and the features they leverage is that they are based on correlational relationships that may not be causal. As Russ states:

"Because there could be a correlation that's not causal. And I think that's the distinction that machine learning is unable to make--even though "it fit the data really well," it's really good for predicting what happened in the past, it may not be good for predicting what happens in the future because those correlations may not be sustained."

This echoes a theme in a recent blog post by Paul Hunermund:

“All of the cutting-edge machine learning tools—you know, the ones you’ve heard about, like neural nets, random forests, support vector machines, and so on—remain purely correlational, and can therefore not discern whether the rooster’s crow causes the sunrise, or the other way round”

I've made similar analogies myself and still think this makes a lot of sense.
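Russ's point about correlations failing to hold is easy to demonstrate with a toy sketch (entirely made-up data, my own illustration):

```python
# Toy illustration: a purely correlational "model" fits the past
# perfectly but breaks when the correlation is not sustained.

# Training environment: the rooster always crows just before sunrise,
# so "crow" is perfectly correlated with (but does not cause) sunrise.
train = [{"crow": 1, "sunrise": 1} for _ in range(100)]

def predict(obs):
    # Correlational rule learned from the training data:
    # predict sunrise whenever the rooster crows.
    return obs["crow"]

train_acc = sum(predict(o) == o["sunrise"] for o in train) / len(train)

# New environment: the rooster is gone, but the sun still rises.
test = [{"crow": 0, "sunrise": 1} for _ in range(100)]
test_acc = sum(predict(o) == o["sunrise"] for o in test) / len(test)

print(train_acc, test_acc)  # 1.0 0.0
```

The rule "fit the data really well" in the past, yet is useless the moment the non-causal correlation breaks.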

However, a talk at the International Conference on Learning Representations definitely made me stop and think about the progress that has been made in the last decade and the direction research is headed. The talk was titled 'Learning Representations Using Causal Invariance'. From the abstract:
"Learning algorithms often capture spurious correlations present in the training data distribution instead of addressing the task of interest. Such spurious correlations occur because the data collection process is subject to uncontrolled confounding biases. Suppose however that we have access to multiple datasets exemplifying the same concept but whose distributions exhibit different biases. Can we learn something that is common across all these distributions, while ignoring the spurious ways in which they differ? This can be achieved by projecting the data into a representation space that satisfy a causal invariance criterion. This idea differs in important ways from previous work on statistical robustness or adversarial objectives. Similar to recent work on invariant feature selection, this is about discovering the actual mechanism underlying the data instead of modeling its superficial statistics."

This is pretty advanced machine learning, and I am not an expert in this area by any means. My interpretation is that this represents a way of learning from multiple environments that prevents overfitting to any single environment, such that predictions are robust to spurious correlations found in any given environment. It has a flavor of causality because the presenter argues that invariance is a common thread underpinning the work of both Rubin and Pearl. It potentially offers powerful predictions and extrapolations while avoiding some of the pitfalls and biases of non-causal machine learning methods.

Going back to Paul Hunermund's post, I might draw a dangerous parallel (because I'm still trying to fully grasp the talk), but here goes. If we used invariant learning to predict when or if the sun will rise, the algorithm would leverage environments where the sun rises even when the rooster does not crow, as well as instances where the rooster crows but the sun fails to rise. As a result, features that are merely correlational (like the sun rising when the rooster crows) will drop out, and only the more causal variables will enter the model, which will be invariant to the environment. If this analogy is on track, this is a very exciting advancement!
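To make the rooster analogy slightly more concrete, here is a toy sketch of the invariance idea. This is my own illustration (not the method from the talk, which learns representations rather than selecting raw features): keep only features whose predictive relationship with the outcome holds in *every* environment.

```python
# Feature-selection sketch of invariance across environments.

def accuracy(feature, data):
    # Accuracy of the naive rule "predict y = value of this feature".
    return sum(obs[feature] == obs["y"] for obs in data) / len(data)

# Environment 1: the rooster crows at sunrise; environment 2: no rooster.
# "rotation" (the earth turning toward the sun) predicts sunrise in both.
env1 = [{"rotation": 1, "crow": 1, "y": 1}] * 50 + \
       [{"rotation": 0, "crow": 0, "y": 0}] * 50
env2 = [{"rotation": 1, "crow": 0, "y": 1}] * 50 + \
       [{"rotation": 0, "crow": 0, "y": 0}] * 50

# Keep features that predict well in every environment, not just one.
features = ["rotation", "crow"]
invariant = [f for f in features
             if all(accuracy(f, env) > 0.9 for env in (env1, env2))]

print(invariant)  # ['rotation'] -- the spurious 'crow' feature drops out
```

In environment 2 the rooster's crow stops predicting sunrise, so it fails the invariance screen, while the causal variable survives in both environments.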

Putting this into the context of predictive modeling/machine learning and causal inference, however, these methods create value by giving better answers (less biased, more robust to confounding) to questions that sit on the first rung of Judea Pearl's ladder of causation (see the intro of The Book of Why). Invariant regression is still machine learning and as such does not appear to offer any means of making statistical inferences, although Susan Athey is doing really interesting work in that area.

While invariant regression seems to share the invariance properties associated with causal mechanisms emphasized in Rosenbaum and Rubin's potential outcomes framework and in Pearl's DAGs and 'do' operator, it still doesn't appear to let us reach the third rung of Pearl's ladder of causation, the rung that allows us to answer counterfactual questions. And it sounds dangerously close to an idea he criticizes in his book: that "the data themselves will guide us to the right answers whenever causal questions come up" and allow us to skip the "hard step of constructing or acquiring a causal model."

I’m not sure that is the intention of the method or the talk. Still, it's an exciting advancement to be able to build a model with feature selection mechanisms that have more of a causal, rather than merely correlational, flavor.

Tuesday, April 23, 2019

Synthetic Controls - An Example with Toy Data

Abadie, Diamond, and Hainmueller (2010) introduced the method of synthetic controls as an alternative to difference-in-differences for evaluating the effectiveness of a tobacco control program in California.

A very good summary of how this method works is given by Bret Zeldow and Laura Hatfield on their website:

"The idea behind synthetic control is that a weighted combination of control units can form a closer match to the treated group than any one (or several) control unit (Abadie, Diamond, and Hainmueller (2010)). The weights are chosen to minimize the distance between treated and control on a set of matching variables, which can include covariates and pre-treatment outcomes. The post-period outcomes for the synthetic control are calculated by taking a weighted average of the control groups’ outcomes. Many authors have extended synthetic control work recently (Kreif et al. 2016; Xu 2017; Ferman, Pinto, and Possebom 2017; Kaul et al. 2015)"

Bouttell and Lewsey (2018) provide a nice survey and introduction to the method related to public health interventions.

For a very nice tour of the math and example R code see this post at The Samuelson Condition blog.

A Toy Example:

See below for some completely made-up data for this oversimplified example. Let's assume that we have some intervention in Kentucky in 1995 that impacts some outcome 'Y' in the years 1996, 1997, and 1998; perhaps we are trying to improve the percentage of restaurants with smoke-free policies.

We want to compare KY to a synthetic control built from a pool of states including TN, IN, and CA, using values of covariates and predictors measured prior to the intervention (X1, X2, X3) as well as pre-period values of Y.

Using the Synth package in R and the data below, the weights for constructing a synthetic control from TN, IN, and CA, with KY as the treatment group, are:


We can think of the synthetic control heuristically as being approximately 2.1% TN, 4.4% CA, and 93.6% IN. If you look at the data, you can see that these weights make intuitive sense: I created the toy data so that IN looked a lot more like KY than the other states did.

As an additional smell test, I constructed a synthetic control using only CA and TN by changing one line of R code to reflect only these two states:

controls.identifier = c(2,3), # these states are part of our control pool which will be weighted

I get the following different set of weights:


This makes sense because I made up data for CA that is quite a bit different from KY. It should contribute very little as a control unit in calculating a synthetic KY. (In fact, maybe it should not be used at all.)

The package allows us to plot the trend in outcome Y for the pre- and post-periods. But with this small toy data set, we could roughly calculate the synthetic (counterfactual) values for KY by hand in Excel and get the same plot.

For instance, the observed value for KY in 1998 is .51, while the counterfactual value created by the synthetic control is the weighted combination of outcomes for TN, CA, and IN: .021*.41 + .044*.95 + .936*.46 = .48097.
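The hand calculation can be checked with a few lines (weights and 1998 outcomes copied from the toy example):

```python
# Reproducing the hand calculation for KY's 1998 synthetic control value.
weights = [0.021, 0.044, 0.936]      # TN, CA, IN weights from Synth
outcomes_1998 = [0.41, 0.95, 0.46]   # TN, CA, IN observed 1998 outcomes

synthetic_ky = sum(w * y for w, y in zip(weights, outcomes_1998))
gap = 0.51 - synthetic_ky            # observed KY minus synthetic KY

print(round(synthetic_ky, 5), round(gap, 5))  # 0.48097 0.02903
```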

Using this small data set, with only three states in the donor pool, the results are not ideal, but we can see that the synthetic control roughly tracks KY's trend in the pre-period, followed by a very noticeable divergence in the post-period.

The difference between .51 and .48097, the 'gap' between KY and its synthetic control, represents the counterfactual impact of the program in KY. Placebo tests can be run and visualized by treating each state in the donor pool as a 'placebo treatment' and constructing synthetic controls from the remaining states. This produces a distribution of gaps that characterizes the uncertainty in our estimate of the treatment effect based on the KY vs. synthetic KY comparison.

The code excerpt below is an example of how we would designate CA to be our placebo treatment and use the remaining states to create its synthetic control. This could be iterated across all of the remaining controls.

treatment.identifier = 3, # indicates our 'placebo' treatment group
controls.identifier = c(1,2,4), # these states are part of our control pool which will be weighted
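The structure of that placebo loop can be sketched as follows. This is a hypothetical illustration: the `make_synthetic()` helper is a stand-in for the Synth fitting calls (here it just takes a naive equal-weighted average rather than fitted weights), and the outcome values are the toy 1998 numbers from above.

```python
# Placebo-test loop sketch: treat each state in turn as a fake 'treated'
# unit, build its synthetic control from the remaining states, and record
# the post-period gap.

observed = {"KY": 0.51, "TN": 0.41, "CA": 0.95, "IN": 0.46}

def make_synthetic(donors):
    # Placeholder for the Synth package's fitted weighted average:
    # here, a naive equal-weighted average of the donor outcomes.
    return sum(observed[d] for d in donors) / len(donors)

gaps = {}
for placebo in observed:
    donors = [s for s in observed if s != placebo]   # leave one out
    gaps[placebo] = observed[placebo] - make_synthetic(donors)

# If the real treated unit's gap is extreme relative to the placebo
# gaps, that is evidence the program, not chance, drove the divergence.
print(gaps)
```

Plotting all of these gap trajectories together is how the distribution of placebo effects is usually visualized.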

R Code:


Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association 105: 493–505. doi:10.1198/jasa.2009.ap08746.

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2011. “Synth: An R Package for Synthetic Control Methods in Comparative Case Studies.” Journal of Statistical Software.

Bouttell J, Craig P, Lewsey J, et al. 2018. “Synthetic Control Methodology as a Tool for Evaluating Population-Level Health Interventions.” J Epidemiol Community Health 72:673–678.

More public policy analysis: synthetic control in under an hour