Tuesday, April 23, 2019

Synthetic Controls - An Example with Toy Data

Abadie, Diamond, and Hainmueller (2010) introduce the method of synthetic controls as an alternative to difference-in-differences, using it to evaluate the effectiveness of a tobacco control program in California.

A very good summary of how this method works is given by Bret Zeldow and Laura Hatfield at the healthpolicydatascience.org website:

"The idea behind synthetic control is that a weighted combination of control units can form a closer match to the treated group than any one (or several) control unit (Abadie, Diamond, and Hainmueller (2010)). The weights are chosen to minimize the distance between treated and control on a set of matching variables, which can include covariates and pre-treatment outcomes. The post-period outcomes for the synthetic control are calculated by taking a weighted average of the control groups’ outcomes. Many authors have extended synthetic control work recently (Kreif et al. 2016; Xu 2017; Ferman, Pinto, and Possebom 2017; Kaul et al. 2015)."
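The weight selection described in the quote can be written compactly. The notation below follows the usual presentation in Abadie, Diamond, and Hainmueller (2010) and is a summary rather than a full derivation: unit 1 is the treated unit, units 2 through J+1 are the controls, X_1 is the vector of pre-intervention matching variables for the treated unit, X_0 is the matrix of the same variables for the controls (these symbols are not the toy covariates X1, X2, X3 used later in this post), and V is a positive semidefinite matrix that sets the relative importance of each matching variable.

```latex
W^{*} = \arg\min_{W} \; (X_1 - X_0 W)^{\top} V \, (X_1 - X_0 W)
\quad \text{subject to} \quad w_j \ge 0, \qquad \sum_{j=2}^{J+1} w_j = 1

% Synthetic (counterfactual) outcome and estimated effect in period t:
\hat{Y}_{1t}^{N} = \sum_{j=2}^{J+1} w_j^{*} \, Y_{jt},
\qquad
\hat{\alpha}_{1t} = Y_{1t} - \hat{Y}_{1t}^{N}
```

The constraints keep the synthetic control inside the convex hull of the control units, which is why the weights can be read as shares.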

Bouttell et al. (2018) provide a nice survey and introduction to the method as applied to public health interventions.

For a very nice tour of the math and example R code, see the post "More public policy analysis: synthetic control in under an hour" at The Samuelson Condition blog (linked in the references).


A Toy Example:

The completely made-up data for this oversimplified example appear at the end of the post. Let's assume that we have some intervention in Kentucky in 1995 that impacts some outcome 'Y' in 1996, 1997, and 1998; maybe we are trying to increase the percentage of restaurants with smoke-free policies.

Suppose we want to compare KY to a synthetic control built from a donor pool of TN, IN, and CA, using covariates and predictors measured prior to the intervention (X1, X2, X3) as well as pre-period values of Y.

Using the Synth package in R and the data below, with KY as the treatment group and TN, IN, and CA as the donor pool, the weights used to construct the synthetic control are:


w.weights   unit.names
0.021       TN
0.044       CA
0.936       IN

We can think of the synthetic control heuristically as approximately 2.1% TN, 4.4% CA, and 93.6% IN. If you look at the data, you can see that these weights make intuitive sense: I created the toy data so that IN looked a lot more like KY than the other states.
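A rough way to check that intuition is to measure how far each donor's pre-period outcome sits from KY's. The sketch below uses only Y for 1990 through 1995 (assuming those years are the pre-period) and ignores X1, X2, and X3, so it is a sanity check on the toy data rather than a reproduction of what Synth optimizes. The names y_pre and rmse_to_ky are just illustrative.

```r
# Pre-intervention outcome Y (1990-1995) for each state, copied from the toy data.
y_pre <- cbind(
  KY = c(0.45, 0.45, 0.46, 0.48, 0.48, 0.48),
  TN = c(0.45, 0.45, 0.44, 0.45, 0.44, 0.43),
  CA = c(0.89, 0.90, 0.90, 0.92, 0.93, 0.93),
  IN = c(0.43, 0.44, 0.42, 0.46, 0.45, 0.46)
)

# Root mean squared difference between KY and each donor over the pre-period.
donors <- c("TN", "CA", "IN")
rmse_to_ky <- sapply(donors, function(s) {
  sqrt(mean((y_pre[, "KY"] - y_pre[, s])^2))
})

# Smallest distance first: IN, then TN, with CA far behind.
sort(rmse_to_ky)
```

On the outcome alone, IN is the closest donor and TN is a fairly near second, so the very heavy weight on IN also reflects the covariates that Synth matches on.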

As an additional smell test, I constructed a synthetic control using only TN and CA, changing one line of R code to reflect only these two states:

controls.identifier = c(2,3), # these states are part of our control pool which will be weighted

I get the following different set of weights:


w.weights   unit.names   unit.numbers
0.998       TN           2
0.002       CA           3


This makes sense because I made up data for CA that really are quite a bit different from KY. CA should contribute very little as a control unit used to calculate a synthetic KY (in fact, maybe it should not be used at all).

The package allows us to plot the trend in outcome Y for the pre- and post-periods. But with a toy data set this small, we could also roughly calculate the synthetic (counterfactual) values for KY by hand in Excel and get the same plot.

For instance, the value for KY in 1998 is .51, but the counterfactual value created by the synthetic control is the weighted combination of outcomes for TN, CA, and IN: .021*.41 + .044*.95 + .936*.46 = .48097.
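The same hand calculation can be done for every year at once in base R instead of Excel. This is a small sketch that uses the rounded weights reported earlier (they sum to 1.001 because of rounding) and the Y values from the toy data; the object names Y0, w, and synthetic_ky are just illustrative.

```r
years <- 1990:1998

# Donor outcomes from the toy data; columns are ordered to match the weights.
Y0 <- cbind(
  TN = c(0.45, 0.45, 0.44, 0.45, 0.44, 0.43, 0.42, 0.40, 0.41),
  CA = c(0.89, 0.90, 0.90, 0.92, 0.93, 0.93, 0.94, 0.94, 0.95),
  IN = c(0.43, 0.44, 0.42, 0.46, 0.45, 0.46, 0.47, 0.45, 0.46)
)
ky <- c(0.45, 0.45, 0.46, 0.48, 0.48, 0.48, 0.49, 0.50, 0.51)

# Rounded weights reported by Synth for the three-state donor pool.
w <- c(TN = 0.021, CA = 0.044, IN = 0.936)

# Synthetic KY is the weighted combination of donor outcomes in each year.
synthetic_ky <- as.vector(Y0 %*% w)
names(synthetic_ky) <- years

synthetic_ky["1998"]  # 0.48097, matching the hand calculation

# Side-by-side view of KY, synthetic KY, and their difference by year.
data.frame(year = years, KY = ky, synthetic = synthetic_ky,
           difference = ky - synthetic_ky, row.names = NULL)
```

Plotting the KY and synthetic columns against year gives a rough version of the trend plot produced by the package.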

With only three states in the donor pool, the results from this small data set are not ideal, but we can see roughly that the synthetic control tracks KY's trend in the pre-period and that we get a very noticeable divergence in the post-period.


The difference between .51 and .48097 (about .029), or 'gap' between KY and its synthetic control, represents the estimated impact of the program in KY in 1998. Placebo tests can be run and visualized by using each state from the donor pool as a 'placebo treatment' and constructing its synthetic control from the remaining states. This produces a distribution of gaps that characterizes the uncertainty in our estimate of the treatment effect based on the comparison of KY with its synthetic control (KY*).

The code excerpt below is an example of how we would designate CA to be our placebo treatment and use the remaining states to create its synthetic control. This could be iterated across all of the remaining controls.


treatment.identifier = 3, # indicates our 'placebo' treatment group
controls.identifier = c(1,2,4), # these states are part of our control pool which will be weighted
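To show the shape of that iteration without depending on the Synth package, here is a minimal base R sketch. It is a simplified stand-in, not the Synth optimizer: the hypothetical helper fit_weights matches on pre-period Y only (assumed to be 1990 through 1995, with no covariates) and uses a softmax parameterization so the weights are non-negative and sum to one. The weights it finds will therefore differ from the Synth results reported above.

```r
# Toy outcome data from the post: rows are years, columns are states.
years <- 1990:1998
Y <- cbind(
  KY = c(0.45, 0.45, 0.46, 0.48, 0.48, 0.48, 0.49, 0.50, 0.51),
  TN = c(0.45, 0.45, 0.44, 0.45, 0.44, 0.43, 0.42, 0.40, 0.41),
  CA = c(0.89, 0.90, 0.90, 0.92, 0.93, 0.93, 0.94, 0.94, 0.95),
  IN = c(0.43, 0.44, 0.42, 0.46, 0.45, 0.46, 0.47, 0.45, 0.46)
)
rownames(Y) <- years
pre  <- years <= 1995   # ASSUMPTION: pre-period is 1990-1995
post <- years >= 1996

# ASSUMPTION: simplified weight fit on pre-period Y only (not Synth's method).
fit_weights <- function(y_treated, Y_donors) {
  softmax <- function(theta) {
    e <- exp(theta - max(theta))
    e / sum(e)
  }
  loss <- function(theta) {
    sum((y_treated - as.vector(Y_donors %*% softmax(theta)))^2)
  }
  opt <- optim(rep(0, ncol(Y_donors)), loss, method = "BFGS")
  w <- softmax(opt$par)
  names(w) <- colnames(Y_donors)
  w
}

# Treat each state in turn as the (placebo) treated unit; all others are donors.
placebo_gaps <- sapply(colnames(Y), function(unit) {
  donors <- setdiff(colnames(Y), unit)
  w <- fit_weights(Y[pre, unit], Y[pre, donors, drop = FALSE])
  synthetic <- as.vector(Y[, donors, drop = FALSE] %*% w)
  Y[, unit] - synthetic   # gap between the unit and its synthetic control
})
rownames(placebo_gaps) <- years

# Post-period gaps: the KY column is the actual estimate, the rest are placebos.
round(placebo_gaps[post, ], 3)
```

The CA column illustrates a known limitation: CA lies far outside the range of the other states, so no convex combination of them can track it, and its placebo gap mostly reflects poor pre-period fit rather than anything about the intervention.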


R Code: https://gist.github.com/BioSciEconomist/6eb824527c03e12372667fb8861299bd

References:

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association 105: 493–505. doi:10.1198/jasa.2009.ap08746.

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2011. “Synth: An R Package for Synthetic Control Methods in Comparative Case Studies.” Journal of Statistical Software.

Bouttell J, Craig P, Lewsey J, et al. 2018. “Synthetic Control Methodology as a Tool for Evaluating Population-Level Health Interventions.” Journal of Epidemiology and Community Health 72: 673–678.

The Samuelson Condition. 2016. “More Public Policy Analysis: Synthetic Control in Under an Hour.” https://thesamuelsoncondition.com/2016/04/29/more-public-policy-analysis-synthetic-control-in-under-an-hour/comment-page-1/


Data:


ID  year  state  Y     X1      X2  X3
1   1990  KY     0.45  50000   25  10
1   1991  KY     0.45  51000   26  10
1   1992  KY     0.46  52000   27  10
1   1993  KY     0.48  52000   28  10
1   1994  KY     0.48  52000   28  10
1   1995  KY     0.48  53000   27  15
1   1996  KY     0.49  53000   24  15
1   1997  KY     0.5   54000   24  15
1   1998  KY     0.51  55000   23  15
2   1990  TN     0.45  52000   23  12
2   1991  TN     0.45  51000   23  12
2   1992  TN     0.44  53000   24  12
2   1993  TN     0.45  51000   26  12
2   1994  TN     0.44  52000   25  12
2   1995  TN     0.43  54000   26  14
2   1996  TN     0.42  54000   25  14
2   1997  TN     0.4   55000   26  14
2   1998  TN     0.41  56000   25  14
3   1990  CA     0.89  102000  10  20
3   1991  CA     0.9   102500  11  20
3   1992  CA     0.9   103000  13  20
3   1993  CA     0.92  103500  12  20
3   1994  CA     0.93  104000  11  20
3   1995  CA     0.93  104000  12  25
3   1996  CA     0.94  104500  14  25
3   1997  CA     0.94  105000  12  25
3   1998  CA     0.95  105000  10  25
4   1990  IN     0.43  52000   25  10
4   1991  IN     0.44  52000   26  10
4   1992  IN     0.42  53000   26  10
4   1993  IN     0.46  53500   27  10
4   1994  IN     0.45  53500   28  10
4   1995  IN     0.46  54000   26  12
4   1996  IN     0.47  54000   26  12
4   1997  IN     0.45  54500   25  12
4   1998  IN     0.46  55000   24  12