Wednesday, February 12, 2020

Randomized Encouragement: When noncompliance may be a feature and not a bug

Many times in a randomized controlled trial (RCT), issues related to non-compliance arise. Some subjects assigned to the treatment fail to comply, while in other cases subjects who were supposed to be in the control group actually receive treatment. Other times we may have a new intervention (maybe a mobile app or some other kind of product, service, or employer or government benefit) that, by law, contract, or nature, can be accessed by everyone in our population of interest. We know that if we let nature take its course, users, adopters, or engagers are very likely going to be a self-selected group that differs from others in a number of important ways. In a situation like this it could be very hard to know if observed outcomes from the new intervention are related to the treatment itself or explained by other characteristics of those who choose to engage.

In a 2008 article in the American Journal of Public Health, alternatives to randomized controlled trials are discussed, and for situations like this the authors discuss randomized encouragement:

 "participants may be randomly assigned to an opportunity or an encouragement to receive a specific treatment, but allowed to choose whether to receive the treatment."

In this scenario, less than full compliance is the norm, a feature and not a bug. The idea is to roll out access in conjunction with randomized encouragement. A randomized nudge.

For example, in Developing a Digital Marketplace for Family Planning: Pilot Randomized Encouragement Trial (Green et al., 2018), randomized encouragement was used to study the impact of a digital health intervention related to family planning:

“women with an unmet need for family planning in Western Kenya were randomized to receive an encouragement to try an automated investigational digital health intervention that promoted the uptake of family planning”

If you have a user base or population already using a mobile app, you could randomize encouragement to use new features through the app. In other instances, you could randomize encouragement to use a new product, feature, or treatment through text messaging. Traditionally this has been done through mailers or phone calls.

While treatment assignment or encouragement is random, non-compliance or the choice to engage or not engage is not! How exactly do we analyze results from a randomized encouragement trial in a way that allows us to infer causal effects?  While common approaches include intent-to-treat (ITT) or maybe even per-protocol analysis, treatment effects for a randomized encouragement trial can also be estimated based on complier average causal effects or CACE.

CACEs compare outcomes for individuals in the treatment group who complied with treatment (engaged as a result of encouragement) with individuals in the control group who would have complied if given the opportunity to do so. This is key. If you think this sounds a lot like local average treatment effects in an instrumental variables framework, this is exactly what we are talking about.

Angrist and Pischke (2015) discuss how instrumental variables can be used in the context of a randomized controlled trial (RCT) with non-compliance issues:

 "Instrumental variable methods allow us to capture the causal effect of treatment on the treated in spite of the nonrandom compliance decisions made by participants in experiments....Use of randomly assigned intent to treat as an instrumental variable for treatment delivered eliminates this source of selection bias." 

Instrumental variable analysis gives us an estimate of local average treatment effects (LATE), which are the same as CACE. In simplest terms, LATE is the average treatment effect for the sub-population of compliers in an RCT, or the compliers or engagers in a randomized encouragement design.
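
As a rough illustration of the estimators discussed above, here is a minimal R sketch using simulated data. Everything in it is an assumption made for illustration: the variable names, the sample size, a true effect of 2, and one-sided noncompliance where nobody engages without encouragement. None of it comes from the studies cited. It compares the ITT estimate, a naive comparison of engagers vs. non-engagers, and the CACE/LATE computed as the ITT divided by the difference in uptake (the Wald form of the instrumental variables estimator).

```r
# Simulated randomized encouragement design (all values are hypothetical)
set.seed(42)
n <- 20000

z <- rbinom(n, 1, 0.5)          # randomized encouragement (the instrument)
u <- rnorm(n)                   # unobserved trait driving both uptake and outcome
complier <- as.integer(u > 0)   # assumed: only these people engage when encouraged
d <- z * complier               # engagement; no one engages without encouragement

true_effect <- 2                # assumed effect of engagement on the outcome
y <- 1 + true_effect * d + 1.5 * u + rnorm(n)

# Intent-to-treat: effect of being encouraged, regardless of engagement
itt <- mean(y[z == 1]) - mean(y[z == 0])

# Effect of encouragement on engagement (share of compliers)
uptake <- mean(d[z == 1]) - mean(d[z == 0])

# CACE / LATE: ITT scaled by the difference in uptake
cace <- itt / uptake

# Naive comparison of engagers vs. non-engagers (subject to selection bias)
naive <- mean(y[d == 1]) - mean(y[d == 0])

print(c(itt = itt, uptake = uptake, cace = cace, naive = naive))
```

Because engagement in this simulation is driven by the same unobserved trait that drives the outcome, the naive comparison should overstate the effect, while the ratio should land close to the simulated effect among compliers. The ratio is only interpretable this way if encouragement affects the outcome solely through engagement and never discourages anyone from engaging.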

There are obviously some assumptions involved and more technical details. Please see the references and other links below to read more about the mechanics, assumptions, and details involved as well as some toy examples.


Mastering 'Metrics: The Path from Cause to Effect Joshua D. Angrist and Jörn-Steffen Pischke. 2015.

Connell A. M. (2009). Employing complier average causal effect analytic methods to examine effects of randomized encouragement trials. The American journal of drug and alcohol abuse, 35(4), 253–259. doi:10.1080/00952990903005882

Green EP, Augustine A, Naanyu V, Hess AK, Kiwinda L
Developing a Digital Marketplace for Family Planning: Pilot Randomized Encouragement Trial
J Med Internet Res 2018;20(7):e10756

Stephen G. West, Naihua Duan, Willo Pequegnat, Paul Gaist, Don C. Des Jarlais, David Holtgrave, José Szapocznik, Martin Fishbein, Bruce Rapkin, Michael Clatts, and Patricia Dolan Mullen, 2008:
Alternatives to the Randomized Controlled Trial
American Journal of Public Health 98, 1359–1366.

See also: 

Intent to Treat, Instrumental Variables and LATE Made Simple(er) 

Instrumental Variables and LATE 

Instrumental Variables vs. Intent to Treat 

Instrumental Explanations of Instrumental Variables

A Toy Instrumental Variable Application

Other posts on instrumental variables...

Monday, December 16, 2019

Some Recommended Podcasts and Episodes on AI and Machine Learning

Something I have been interested in for some time now is both the convergence of big data and genomics and the convergence of causal inference and machine learning.

I am a big fan of the Talking Biotech Podcast which allows me to keep up with some of the latest issues and research in biotechnology and medicine. A recent episode related to AI and machine learning covered a lot of topics that resonated with me. 

There was excellent discussion on the human element involved in this work, the importance of data prep/feature engineering (the 80% of work that has to happen before the ML/AI can do its job), and the challenges of non-standard 'omics' data. The episode also covered the potential biases that researchers and developers can inadvertently introduce in this process, and much more, including applications of machine learning and AI in this space and the best ways to stay up to speed on fast-changing technologies without having to be a heads-down programmer.

I've been in a data science role since 2008 and have transitioned from SAS to R to python. I've been able to keep up within the domain of causal inference to the extent possible, but I keep up with broader trends I am interested in via podcasts like Talking Biotech. Below is a curated list of my favorites related to data science with a few of my favorite episodes highlighted.

1) Casual Inference - This is my new favorite podcast by two biostatisticians covering epidemiology/biostatistics/causal inference - and keeping it casual.

Fairness in Machine Learning with Sherri Rose | Episode 03 -

This episode was the inspiration for my post: When Wicked Problems Meet Biased Data.

#093 Evolutionary Programming - 

#266 - Can we trust scientific discoveries made using machine learning

How social science research can inform the design of AI systems 

#37 Causality and potential outcomes with Irineo Cabreros -  

Andrew Gelman - Social Science, Small Samples, and the Garden of Forking Paths 
James Heckman - Facts, Evidence, and the State of Econometrics

Wednesday, December 11, 2019

When Wicked Problems Meet Biased Data

In "Dissecting racial bias in an algorithm used to manage the health of populations" (Science, Vol 366 25 Oct. 2019) the authors discuss inherent racial bias in widely adopted algorithms in healthcare. In a nutshell these algorithms use predicted cost as a proxy for health status. Unfortunately, in healthcare, costs can proxy for other things as well:

"Black patients generate lesser medical expenses, conditional on health, even when we account for specific comorbidities. As a result, accurate prediction of costs necessarily means being racially biased on health."

So what happened? How can it be mitigated? What can be done going forward?

 In data science, there are some popular frameworks for solving problems. One widely known approach is the CRISP-DM framework. Alternatively, in The Analytics Lifecycle Toolkit a similar process is proposed:

(1) - Problem Framing
(2) - Data Sense Making
(3) - Analytics Product Development
(4) - Results Activation

The wrong turn in Albuquerque here may have been at the corner of problem framing and data understanding or data sense making.

The authors state:

"Identifying patients who will derive the greatest benefit from these programs is a challenging causal inference problem that requires estimation of individual treatment effects. To solve this problem health systems make a key assumption: Those with the greatest care needs will benefit the most from the program. Under this assumption, the targeting problem becomes a pure prediction public policy problem."

The distinction between 'predicting' and 'explaining' has been made in the literature by multiple authors in the last two decades. Substituting one for the other has important implications. To quote Galit Shmueli:

"My thesis is that statistical modeling, from the early stages of study design and data collection to data usage and reporting, takes a different path and leads to different results, depending on whether the goal is predictive or explanatory."

Almost a decade before, Leo Breiman struggled to get statisticians to think outside the box when solving problems by considering multiple approaches:

"Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems. The best available solution to a data problem might be a data model; then again it might be an algorithmic model. The data and the problem guide the solution. To solve a wider range of data problems, a larger set of tools is needed."

A number of data scientists and researchers today may not be cognizant of the differences between predictive and explanatory modeling and statistical inference. It may not be clear to them how that impacts their work. This could be related to background, training, or the kinds of problems they have worked on given their experience. It is also important that we don't compartmentalize so much that we miss opportunities to approach our problem from a number of different angles (Leo Breiman's 'straight jacket'). This is perhaps what happened in the Science article: once the problem was framed as a predictive modeling problem, other modes of thinking may have shut down, even if developers were aware of all of these distinctions.

The takeaway is that we think differently when doing statistical inference/explaining vs. predicting or doing machine learning. Substituting one for the other changes the way we approach the problem (the things we care about, the things we consider vs. discount, etc.), and this in turn impacts the data preparation, modeling, and interpretation.

For instance, in the Science article, after framing the problem as a predictive modeling problem, a pivotal focus became the 'labels' or target for prediction.

"The dilemma of which label to choose relates to a growing literature on 'problem formulation' in data science: the task of turning an often amorphous concept we wish to predict into a concrete variable that can be predicted in a given dataset."

As noted in the paper, 'labels are often measured with errors that reflect structural inequalities.'

Addressing the issue with label choice can come with a number of challenges briefly alluded to in the article:

1) deep understanding of the domain - i.e. subject matter expertise
2) identification and extraction of relevant data - i.e. data engineering
3) capacity to iterate and experiment - i.e. statistical programming, simulation, and interdisciplinary collaboration

Data science problems in healthcare are wicked problems defined by interacting complexities with social, economic, and biological dimensions that transcend simply fitting a model to data. Expertise in a number of disciplines is required.

Bias in Risk Adjustment

In the Science article, the specific example was in relation to predictive models targeting patients for disease management programs. However, there are a number of other predictive modeling applications where these same issues can be prevalent in the healthcare space.

In Fair Regression for Health Care Spending, Sherri Rose and Anna Zink discuss these challenges in relation to popular regression based risk adjustment applications. Aligning with the analytics lifecycle discussed above, they point out there are several places where issues of bias can be addressed including pre-processing, model fitting, and post processing stages of analysis. In this article they focus largely on the modeling stage leveraging a number of constrained and penalized regression algorithms designed to optimize fairness. This work looks really promising, but the authors point out a number of challenges related to scalability and optimizing fairness across a number of metrics or groups.

Toward Causal AI and ML

Previously I referenced Galit Shmueli's research that discussed how differently we approach and think about predictive vs explanatory modeling. In the Book of Why, Judea Pearl discusses causal inferential thinking:

"Causal Analysis is emphatically not just about data; in causal analysis we must incorporate some understanding of the process that produces the data and then we get something that was not in the data to begin with." 

There is currently a lot of work fusing machine learning and causal inference that could help not only blur these distinctions but create more robust learning algorithms. Examples include Susan Athey's work with causal forests, Leon Bottou's work related to causal invariance, and Elias Bareinboim's work on the data fusion problem. This work, along with the fair regression work mentioned before, will help inform the next generation of predictive modeling, machine learning, and causal inference models in the healthcare space, which hopefully will represent a marked improvement over what is possible today.

However, we can't wait half a decade or more while the theory is developed and adopted by practitioners. In the Science article, the authors found alternative metrics for targeting disease management programs besides total costs that calibrate much more fairly across groups. Bridging the gap in other areas will require a combination of awareness of these issues and creativity throughout the analytics product lifecycle. As the authors conclude:

"careful choice can allow us to enjoy the benefits of algorithmic predictions while minimizing the risks."

References and Additional Reading:

This paper was recently discussed on the Casual Inference podcast.

Measures of Racism, Sexism, Heterosexism, and Gender Binarism for Health Equity Research: From Structural Injustice to Embodied Harm—an Ecosocial Analysis. Nancy Krieger

Annual Review of Public Health 2020 41:1

Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 1, 206–215 (2019) doi:10.1038/s42256-019-0048-x

Breiman, Leo. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statist. Sci. 16 (2001), no. 3, 199--231. doi:10.1214/ss/1009213726.

Shmueli, G., "To Explain or To Predict?", Statistical Science, vol. 25, issue 3, pp. 289-310, 2010.

Fair Regression for Health Care Spending. Anna Zink, Sherri Rose. arXiv:1901.10566v2 [stat.AP]

Monday, September 30, 2019

Wicked Problems and The Role of Expertise and AI in Data Science

In 2018, an article in Science characterized the challenge of pesticide resistance as a wicked problem:

“If we are to address this recalcitrant issue of pesticide resistance, we must treat it as a “wicked problem,” in the sense that there are social, economic, and biological uncertainties and complexities interacting in ways that decrease incentives for actions aimed at mitigation.”

In graduate school, I worked on this same problem, attempting to model the social and economic systems with game theory and behavioral economics and capturing biological complexities leveraging population genetics. 

Wicked vs. Kind Environments

In data science, we also have 'wicked' learning environments in which we try to train our models. In the EconTalk podcast with Russ Roberts, Mastery, Specialization, and Range, David Epstein discusses wicked and kind learning environments:

"The way that chess works makes it what's called a kind learning environment. So, these are terms used by psychologist Robin Hogarth. And what a kind learning environment is, is one where patterns recur; ideally a situation is constrained--so, a chessboard with very rigid rules and a literal board is very constrained; and, importantly, every time you do something you get feedback that is totally see the consequences. The consequences are completely immediate and accurate. And you adjust accordingly. And in these kinds of kind learning environments, if you are cognitively engaged you get better just by doing the activity."

"On the opposite end of the spectrum are wicked learning environments. And this is a spectrum, from kind to wicked. Wicked learning environments: often some information is hidden. Even when it isn't, feedback may be delayed. It may be infrequent. It may be nonexistent. And it maybe be partly accurate, or inaccurate in many of the cases. So, the most wicked learning environments will reinforce the wrong types of behavior."

As discussed in the podcast, many problems fall somewhere on a spectrum ranging from very kind environments like chess to more complex environments like self-driving cars or medical diagnosis. What do experts have to offer where AI/ML falls short? The type of environment determines to a great extent the scope of disruption we might be able to expect from AI applications.

The Role of Human Expertise

In Thinking, Fast and Slow, Kahneman discusses two conditions for acquiring skill:

1) an environment that is sufficiently regular to be predictable
2) an opportunity to learn these regularities through prolonged practice

This sounds a lot like the 'kind' environments discussed above. Based on research by Robin Hogarth, Kahneman also makes these distinctions, describing 'wicked' environments as those in which people with expertise are likely to learn the wrong lessons from experience. The problem is that in wicked environments, experts often default to heuristics, which can lead to wrong conclusions. Even if they are aware of these biases, social norms often nudge experts in the wrong direction. Kahneman gives an example involving physicians:

"Generally it is considered a weakness and a sign of vulnerability for clinicians to appear unsure. Confidence is valued over uncertainty and there is a prevailing censure against disclosing uncertainty to patients...acting on pretended knowledge is often the preferred solution."

This likely explains many of the mistakes and low value care that are problematic with healthcare delivery as well as dissatisfaction with both the quality and costs of healthcare. How many of us want our physicians to pretend to know what they are talking about? On the other hand, how many people are willing to accept an answer from their physician that rhymes with "let me look this up and get back to you later." 

One advantage AI may have over experts in kind environments is, as Kahneman puts it, the opportunity to learn through prolonged practice. Machine learning can, so to speak, handle many more training examples than a human.

Even in kind environments, an expert may swing and miss when dealing with cases where the correct decision is like a pitch straight over the plate. One reason Kahneman discusses in Thinking, Fast and Slow is the idea of 'ego depletion'. This is related to the idea that mental energy can become exhausted after significant exertion. As self-control breaks down, it's easy to default to heuristics and biases that can lead to decisions that look like careless mistakes. This would certainly apply to physicians given the number of stories we hear about burnout in the profession.

The solution seems to be what polymath economist Tyler Cowen suggested several years ago in the EconTalk podcast discussion he had with Russ Roberts about his book Average is Over:

"I would stress much more that humans can always complement robots. I'm not saying every human will be good at this. That's a big part of the problem. But a large number of humans will work very effectively with robots and become far more productive, and this will be one of the driving forces behind that inequality."

Imagine the clinical situation where a physician's 'ego' is substantially depleted from a difficult case. They could then lean on AI to prevent mistakes in the more routine decisions that follow. Or perhaps, by leveraging AI tools, a clinician could conserve additional mental energy throughout the day so that they are less likely to default to heuristics when they encounter more complex issues. The way this synergy materializes is uncertain, but it will certainly continue to involve substantial expertise on the part of many professionals going forward. Together, human expertise and AI might have the greatest chance of tackling the most wicked problems.


Wicked evolution: Can we address the sociobiological dilemma of pesticide resistance? | Science

Thinking, Fast and Slow. Daniel Kahneman. 2011

EconTalk: David Epstein on Mastery, Specialization, and Range

EconTalk: Tyler Cowen on Inequality, the Future, and Average is Over

Wednesday, May 15, 2019

Causal Invariance and Machine Learning

In an EconTalk podcast with Cathy O'Neil, Russ Roberts discusses her book Weapons of Math Destruction and some of the unintentional negative consequences of certain machine learning applications in society. One of the problems with these algorithms and the features they leverage is that they are based on correlational relationships that may not be causal. As Russ states:

"Because there could be a correlation that's not causal. And I think that's the distinction that machine learning is unable to make--even though "it fit the data really well," it's really good for predicting what happened in the past, it may not be good for predicting what happens in the future because those correlations may not be sustained."

This echoes a theme in a recent blog post by Paul Hunermund:

“All of the cutting-edge machine learning tools—you know, the ones you’ve heard about, like neural nets, random forests, support vector machines, and so on—remain purely correlational, and can therefore not discern whether the rooster’s crow causes the sunrise, or the other way round”

I've made similar analogies before myself and still think this makes a lot of sense.

However, a talk at the International Conference on Learning Representations definitely made me stop and think about the kind of progress that has been made in the last decade and the direction research is headed. The talk was titled 'Learning Representations Using Causal Invariance':


"Learning algorithms often capture spurious correlations present in the training data distribution instead of addressing the task of interest. Such spurious correlations occur because the data collection process is subject to uncontrolled confounding biases. Suppose however that we have access to multiple datasets exemplifying the same concept but whose distributions exhibit different biases. Can we learn something that is common across all these distributions, while ignoring the spurious ways in which they differ? This can be achieved by projecting the data into a representation space that satisfy a causal invariance criterion. This idea differs in important ways from previous work on statistical robustness or adversarial objectives. Similar to recent work on invariant feature selection, this is about discovering the actual mechanism underlying the data instead of modeling its superficial statistics."

This is pretty advanced machine learning and I am not an expert in this area by any means. The way I want to interpret this is that this represents ways of learning from multiple environments that prevent overfitting in any single environment such that predictions are robust to any spurious correlation you might find in any given environment. It has a flavor of causality because the presenter argues that invariance is a common thread underpinning both the works of Rubin and Pearl. It potentially offers powerful predictions/extrapolations while avoiding some of the pitfalls/biases of non-causal machine learning methods.

Going back to Paul Hunermund's post I might draw a dangerous parallel (because I'm still trying to fully grasp the talk) but here goes. If we used invariant learning to predict when or if the sun will rise, the algorithm would leverage those environments where the sun rises even if the rooster does not crow, as well as instances where the rooster crows, but the sun fails to rise. As a result, the biases that are merely correlational (like the sun rising when the rooster crows) will drop out and only the more causal variables will enter the model – which will be invariant to the environment. If this analogy is on track this is a very exciting advancement!
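
To make the rooster analogy a little more concrete, here is a small R simulation. It is only a sketch of the intuition, not the method presented in the talk, and every name and number in it is an assumption made for illustration. A feature that actually drives the outcome keeps roughly the same regression slope in every environment, while a feature that is merely a consequence of the outcome (like the rooster crowing) has a slope that changes when the environment changes.

```r
# Two hypothetical environments; only the spurious relationship differs
set.seed(1)

simulate_env <- function(n, a) {
  x_cause <- rnorm(n)                 # feature that causes the outcome
  y <- 2 * x_cause + rnorm(n)         # same mechanism in every environment
  x_spurious <- a * y + rnorm(n)      # feature caused BY the outcome; 'a' varies
  data.frame(y, x_cause, x_spurious)
}

env1 <- simulate_env(5000, a = 1)     # e.g. roosters that crow at sunrise
env2 <- simulate_env(5000, a = -1)    # e.g. an environment where the link flips

# Slope from a simple regression of y on a single feature
slope <- function(df, feature) {
  unname(coef(lm(reformulate(feature, response = "y"), data = df))[2])
}

slopes <- data.frame(
  feature = c("x_cause", "x_spurious"),
  env1 = c(slope(env1, "x_cause"), slope(env1, "x_spurious")),
  env2 = c(slope(env2, "x_cause"), slope(env2, "x_spurious"))
)
print(slopes)
```

A learner that only keeps relationships that are stable across both environments would retain x_cause and discard x_spurious, which is the flavor of invariance described above.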

Putting this into the context of predictive modeling/machine learning and causal inference, however, these methods create value by giving better answers (less biased and more robust to confounding) to questions, or by solving problems, that sit on the first rung of Judea Pearl’s ladder of causation (see the intro of The Book of Why). Invariant regression is still machine learning and as such does not appear to offer any means to make statistical inferences. At the same time, however, Susan Athey is doing really cool stuff in this area.

While invariant regression seems to share the invariance properties associated with causal mechanisms emphasized in Rosenbaum and Rubin’s potential outcomes framework and Pearl’s DAGs and ‘do’ operator, it still doesn’t appear to allow us to reach the 3rd rung in Pearl’s ladder of causation which allows us to answer counterfactual questions. And it sounds dangerously close to the idea he criticises in his book that "the data themselves will guide us to the right answers whenever causal questions come up" and allow us to skip the "hard step of constructing or acquiring a causal model."

I’m not sure that is the intention of the method or the talk. Still, it's an exciting advancement to be able to build a model with feature selection mechanisms that have more of a causal vs. merely correlational flavor.

Tuesday, April 23, 2019

Synthetic Controls - An Example with Toy Data

Abadie, Diamond, and Hainmueller (2010) introduce the method of synthetic controls as an alternative to difference-in-differences for evaluating the effectiveness of a tobacco control program in California.

A very good summary of how this method works is given by Bret Zeldow and Laura Hatfield:

"The idea behind synthetic control is that a weighted combination of control units can form a closer match to the treated group than than any one (or several) control unit (Abadie, Diamond, and Hainmueller (2010)). The weights are chosen to minimize the distance between treated and control on a set of matching variables, which can include covariates and pre-treatment outcomes. The post-period outcomes for the synthetic control are calculated by taking a weighted average of the control groups’ outcomes. Many authors have extended synthetic control work recently (Kreif et al. 2016; Xu 2017; Ferman, Pinto, and Possebom 2017; Kaul et al. 2015)"

Bouttell and Lewsey (2018) provide a nice survey and introduction to the method related to public health interventions.

For a very nice tour of the math and example R code see this post at The Samuelson Condition blog.

A Toy Example:

See below for some completely made up data for this oversimplified example. Let's assume that we have some intervention in Kentucky in 1995 that impacts some outcome 'Y' in years 1996, 1997, and 1998. Maybe we are trying to improve the percentage of restaurants with smoke-free policies.

Perhaps we want to consider comparing KY to a synthetic control based on the pool of states including TN, IN, CA and values of covariates and predictors measured prior to the intervention (X1,X2,X3) as well as pre-period values of Y.

Using the package Synth in R and the data below, the weights used for constructing the synthetic control from states TN, IN, and CA, with KY as the treatment group, are:


We could think of the synthetic control heuristically as being approximately 2.1% TN, 4.4% CA, and 93.6% IN. If you look at the data, you can see that these weights make intuitive sense. I created the toy data so that IN looked a lot more like KY than the other states.

As an additional smell test, I can construct a synthetic control using only CA and TN by changing one line of R code to reflect only these two states:

controls.identifier = c(2,3), # these states are part of our control pool which will be weighted

I get the following different set of weights:


This makes sense because I made up data for CA that really is quite a bit different from KY. It should contribute very little as a control unit used to calculate a synthetic KY (in fact, maybe it should not be used at all).

The package allows us to plot the trend in outcome Y for the pre and post period. But we could roughly calculate the synthetic (counterfactual) values for KY by hand in Excel and get the same plot using this small toy data set.

For instance, the value for KY in 1998 is .51, but the counterfactual value created by the synthetic control is the weighted combination of outcomes for TN, CA, and IN, or .021*.41 + .044*.95 + .936*.46 = .48097.
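
The same hand calculation can be written as a few lines of base R. This is just a sketch of the arithmetic using the rounded weights and 1998 outcomes quoted in this post (the variable names are my own), not output from the Synth package.

```r
# Rounded weights quoted in the post (they sum to 1.001 because of rounding)
w <- c(TN = 0.021, CA = 0.044, IN = 0.936)

# 1998 outcomes for the donor states, as used in the hand calculation
y_1998 <- c(TN = 0.41, CA = 0.95, IN = 0.46)

# Synthetic (counterfactual) KY is the weighted sum of donor outcomes
synthetic_ky <- sum(w * y_1998[names(w)])

# Gap between observed KY (.51) and its synthetic control
gap <- 0.51 - synthetic_ky

print(c(synthetic_ky = synthetic_ky, gap = gap))
```

Repeating this for each year gives the synthetic KY series that can be plotted against the observed KY values.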

Using this small data set with only 3 states being part of the donor pool these results are not perfectly ideal, but we can see roughly that the synthetic control tracks KY's trend in the pre-period and we get a very noticeable divergence in the post period.

The difference between .51 and .48097, or the 'gap' between KY and its synthetic control, represents the estimated impact of the program in KY in 1998. Placebo tests can be run and visualized by using each state from the donor pool as a 'placebo treatment' and constructing synthetic controls from the remaining states. This can be used to produce a distribution of gaps that characterizes the uncertainty in our estimate of the treatment effect based on the comparison of KY with its synthetic control (KY*).

The code excerpt below is an example of how we would designate CA to be our placebo treatment and use the remaining states to create its synthetic control. This could be iterated across all of the remaining controls.

treatment.identifier = 3, # indicates our 'placebo' treatment group
controls.identifier = c(1,2,4), # these states are part of our control pool which will be weighted
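
One way to sketch that iteration in base R is to build the pair of identifiers for each placebo run up front. The unit numbering below is partly an assumption: the excerpts in this post imply TN = 2 and CA = 3, while KY = 1 and IN = 4 are my guess at the remaining IDs. The object names are hypothetical, and the step that actually fits each synthetic control with Synth is left out.

```r
# Assumed unit numbering (KY = 1 and IN = 4 are guesses; TN = 2, CA = 3 follow the excerpts)
unit_ids <- c(KY = 1, TN = 2, CA = 3, IN = 4)

# States from the original donor pool take turns as the placebo 'treatment'
donor_states <- c("TN", "CA", "IN")

placebo_runs <- sapply(donor_states, function(s) {
  list(
    treatment.identifier = unname(unit_ids[s]),
    # every other unit forms the control pool for this placebo run
    controls.identifier = unname(unit_ids[names(unit_ids) != s])
  )
}, simplify = FALSE)

# Identifiers for the CA placebo run
print(placebo_runs[["CA"]])
```

Each element of placebo_runs holds the two values that would replace the treatment.identifier and controls.identifier lines for that placebo run.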

R Code:


Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association 105: 493–505. doi:10.1198/jasa.2009.ap08746.

Alberto Abadie, Alexis Diamond, Jens Hainmueller
Synth: An R Package for Synthetic Control Methods in Comparative Case Studies
Journal of Statistical Software. 2011

Bouttell J, Craig P, Lewsey J, et al Synthetic control methodology as a tool for evaluating population-level health interventions J Epidemiol Community Health 2018;72:673-678.

More public policy analysis: synthetic control in under an hour