## Tuesday, April 23, 2019

### Synthetic Controls - An Example with Toy Data

Abadie, Diamond, and Hainmueller (2010) present the method of synthetic controls as an alternative to difference-in-differences, using it to evaluate the effectiveness of a tobacco control program in California.

A very good summary of how this method works is given by Bret Zeldow and Laura Hatfield at the healthpolicydatascience.org website:

"The idea behind synthetic control is that a weighted combination of control units can form a closer match to the treated group than any one (or several) control unit (Abadie, Diamond, and Hainmueller (2010)). The weights are chosen to minimize the distance between treated and control on a set of matching variables, which can include covariates and pre-treatment outcomes. The post-period outcomes for the synthetic control are calculated by taking a weighted average of the control groups’ outcomes. Many authors have extended synthetic control work recently (Kreif et al. 2016; Xu 2017; Ferman, Pinto, and Possebom 2017; Kaul et al. 2015)."

Bouttell et al. (2018) provide a nice survey and introduction to the method as it relates to public health interventions.

For a very nice tour of the math and example R code, see the post at The Samuelson Condition blog listed in the references below.
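As a brief sketch of that math, in the notation of Abadie, Diamond, and Hainmueller (2010): let unit 1 be the treated unit and units 2 through J+1 the donor pool, let X_1 hold the matching variables for the treated unit and X_0 the same variables for the donors (one column per donor), and let V be a positive semidefinite matrix that weights the matching variables. The weights W = (w_2, ..., w_{J+1})' are then chosen as

$$
W^{*} = \arg\min_{W} \; (X_1 - X_0 W)' \, V \, (X_1 - X_0 W)
\quad \text{subject to} \quad w_j \ge 0, \;\; \sum_{j=2}^{J+1} w_j = 1 ,
$$

and the synthetic (counterfactual) outcome and estimated effect in period t are

$$
\hat{Y}_{1t}^{N} = \sum_{j=2}^{J+1} w_j^{*} \, Y_{jt},
\qquad
\hat{\alpha}_{1t} = Y_{1t} - \hat{Y}_{1t}^{N} .
$$

The constraints keep the synthetic control a convex combination of the donors, which is why the weights can be read as percentages.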

A Toy Example:

The data for this oversimplified example (listed at the end of this post) are completely made up. Let's assume that we have some intervention in Kentucky in 1995 that impacts some outcome 'Y' in the years 1996, 1997, and 1998. Maybe we are trying to improve the percentage of restaurants with smoke-free policies.

Perhaps we want to compare KY to a synthetic control built from a donor pool of TN, IN, and CA, using covariates and predictors measured prior to the intervention (X1, X2, X3) as well as pre-period values of Y.

Using the Synth package in R and the data below, with KY as the treatment group and TN, IN, and CA as the donor pool, the weights used to construct the synthetic control are:

| w.weights | unit.names |
|-----------|------------|
| 0.021     | TN         |
| 0.044     | CA         |
| 0.936     | IN         |

We could think of the synthetic control heuristically as being approximately 2.1% TN, 4.4% CA, and 93.6% IN. If you look at the data, you can see that these weights make intuitive sense: I created the toy data so that IN looks a lot more like KY than the other states do.

As an additional smell test, I constructed a synthetic control using only TN and CA, changing one line of R code to reflect only these two states:

    controls.identifier = c(2,3), # these states are part of our control pool which will be weighted

This gives a different set of weights:

| w.weights | unit.names | unit.numbers |
|-----------|------------|--------------|
| 0.998     | TN         | 2            |
| 0.002     | CA         | 3            |

This makes sense because I made up data for CA that really is quite different from KY. It should contribute very little as a control unit when calculating a synthetic KY (in fact, maybe it should not be used at all).
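One rough way to check the two sets of weights reported above is to compare how closely each weighted combination tracks KY's outcome before the intervention. The sketch below uses base R only and takes the reported (rounded) weights as given. It assumes the pre-period is 1990 to 1995, and `rmspe` is a helper name made up for this illustration. Synth itself matches on the predictors as well as on Y, so this is a sanity check, not a reproduction of its fitting criterion.

```r
# Outcome Y by year for each state, copied from the toy data in this post.
years <- 1990:1998
Y <- cbind(
  KY = c(0.45, 0.45, 0.46, 0.48, 0.48, 0.48, 0.49, 0.50, 0.51),
  TN = c(0.45, 0.45, 0.44, 0.45, 0.44, 0.43, 0.42, 0.40, 0.41),
  CA = c(0.89, 0.90, 0.90, 0.92, 0.93, 0.93, 0.94, 0.94, 0.95),
  IN = c(0.43, 0.44, 0.42, 0.46, 0.45, 0.46, 0.47, 0.45, 0.46)
)
rownames(Y) <- years

# Assumption: years up to and including 1995 count as the pre-period.
pre <- years <= 1995

# Root mean squared prediction error of a weighted donor combination
# against observed KY over the pre-period (hypothetical helper).
rmspe <- function(w) {
  synthetic <- as.vector(Y[pre, names(w), drop = FALSE] %*% w)
  sqrt(mean((Y[pre, "KY"] - synthetic)^2))
}

rmspe_all_donors <- rmspe(c(TN = 0.021, CA = 0.044, IN = 0.936))
rmspe_tn_ca_only <- rmspe(c(TN = 0.998, CA = 0.002))

round(c(all_donors = rmspe_all_donors, tn_ca_only = rmspe_tn_ca_only), 4)
```

With these numbers the pre-period error is about 0.009 when IN is in the donor pool and about 0.029 when only TN and CA are available, which is consistent with IN being the closest match to KY.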

The package allows us to plot the trend in outcome Y for the pre- and post-periods. With this small toy data set, we could also roughly calculate the synthetic (counterfactual) values for KY by hand in Excel and get the same plot.

For instance, the observed value for KY in 1998 is .51, but the counterfactual value created by the synthetic control is the weighted combination of the outcomes for TN, CA, and IN: .021*.41 + .044*.95 + .936*.46 = .48097.
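The same hand calculation can be done for every year at once. This is a minimal base R sketch that takes the rounded weights reported by Synth as given and applies them to the outcome column of the toy data. The object names (`Y`, `w`, `synthetic_ky`, `gap`) are chosen here for illustration and do not come from the linked gist.

```r
# Outcome Y by year for each state, copied from the toy data in this post.
years <- 1990:1998
Y <- cbind(
  KY = c(0.45, 0.45, 0.46, 0.48, 0.48, 0.48, 0.49, 0.50, 0.51),
  TN = c(0.45, 0.45, 0.44, 0.45, 0.44, 0.43, 0.42, 0.40, 0.41),
  CA = c(0.89, 0.90, 0.90, 0.92, 0.93, 0.93, 0.94, 0.94, 0.95),
  IN = c(0.43, 0.44, 0.42, 0.46, 0.45, 0.46, 0.47, 0.45, 0.46)
)
rownames(Y) <- years

# Donor weights as reported above (rounded to three decimals).
w <- c(TN = 0.021, CA = 0.044, IN = 0.936)

# Synthetic KY is the weighted combination of donor outcomes in each year.
synthetic_ky <- as.vector(Y[, names(w)] %*% w)
names(synthetic_ky) <- years

# Gap between observed KY and its synthetic control.
gap <- Y[, "KY"] - synthetic_ky

round(cbind(KY = Y[, "KY"], synthetic = synthetic_ky, gap = gap), 5)
```

The 1998 row reproduces the .48097 computed by hand. With these rounded weights the gap is close to zero in 1996 and about .029 in both 1997 and 1998.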

With only 3 states in the donor pool, these results are not ideal, but we can see roughly that the synthetic control tracks KY's trend in the pre-period and that there is a very noticeable divergence in the post-period.

The difference between .51 and .48097 (about .029), or 'gap' between KY and its synthetic control, represents the estimated impact of the program in KY in 1998. Placebo tests can be run and visualized by treating each state in the donor pool as a 'placebo treatment' and constructing its synthetic control from the remaining states. This produces a distribution of gaps that characterizes the uncertainty in our estimate of the treatment effect based on the comparison of KY with its synthetic control (KY*).

The code excerpt below is an example of how we would designate CA to be our placebo treatment and use the remaining states to create its synthetic control. This could be iterated across all of the remaining controls.

    treatment.identifier = 3, # indicates our 'placebo' treatment group
    controls.identifier = c(1,2,4), # these states are part of our control pool which will be weighted
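To show the shape of that iteration without depending on the package, here is a rough base R sketch of the placebo loop. It is not Synth's algorithm: as a stand-in, the hypothetical helper `fit_weights` picks nonnegative weights that sum to one by minimizing the pre-period squared error in Y only, ignoring X1, X2, and X3, so its weights will differ from the ones reported in this post. The pre-period of 1990 to 1995 is also an assumption.

```r
# Outcome Y by year for each state, copied from the toy data in this post.
years <- 1990:1998
Y <- cbind(
  KY = c(0.45, 0.45, 0.46, 0.48, 0.48, 0.48, 0.49, 0.50, 0.51),
  TN = c(0.45, 0.45, 0.44, 0.45, 0.44, 0.43, 0.42, 0.40, 0.41),
  CA = c(0.89, 0.90, 0.90, 0.92, 0.93, 0.93, 0.94, 0.94, 0.95),
  IN = c(0.43, 0.44, 0.42, 0.46, 0.45, 0.46, 0.47, 0.45, 0.46)
)
rownames(Y) <- years
pre <- years <= 1995  # assumption: pre-period is 1990 to 1995

# Map unconstrained parameters to weights that are >= 0 and sum to 1.
softmax <- function(theta) {
  z <- exp(theta - max(theta))  # subtract the max for numerical stability
  z / sum(z)
}

# Simplified stand-in for Synth: fit weights on pre-period outcomes only.
fit_weights <- function(treated, donors) {
  loss <- function(theta) {
    w <- softmax(theta)
    sum((Y[pre, treated] - Y[pre, donors, drop = FALSE] %*% w)^2)
  }
  fit <- optim(rep(0, length(donors)), loss, method = "BFGS")
  setNames(softmax(fit$par), donors)
}

# Treat each state in turn as the (placebo) treated unit and build its
# synthetic control from the remaining states.
placebo_gaps <- sapply(colnames(Y), function(unit) {
  donors <- setdiff(colnames(Y), unit)
  w <- fit_weights(unit, donors)
  Y[, unit] - as.vector(Y[, donors] %*% w)
})

# Post-period gaps: the KY column is the actual comparison, the rest are placebos.
round(placebo_gaps[!pre, ], 3)
```

Because CA's outcomes lie far above those of every other state, no convex combination of the others can reproduce it, so its placebo gap is large in every year. Abadie, Diamond, and Hainmueller (2010) handle this kind of case by discarding placebo units with a poor pre-period fit before comparing gaps.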

R Code: https://gist.github.com/BioSciEconomist/6eb824527c03e12372667fb8861299bd

References:

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association 105: 493–505. doi:10.1198/jasa.2009.ap08746.

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2011. “Synth: An R Package for Synthetic Control Methods in Comparative Case Studies.” Journal of Statistical Software.

Bouttell J, Craig P, Lewsey J, et al. 2018. “Synthetic control methodology as a tool for evaluating population-level health interventions.” J Epidemiol Community Health 72: 673-678.

More public policy analysis: synthetic control in under an hour
https://thesamuelsoncondition.com/2016/04/29/more-public-policy-analysis-synthetic-control-in-under-an-hour/comment-page-1/

Data:

 ID year state Y X1 X2 X3 1 1990 KY 0.45 50000 25 10 1 1991 KY 0.45 51000 26 10 1 1992 KY 0.46 52000 27 10 1 1993 KY 0.48 52000 28 10 1 1994 KY 0.48 52000 28 10 1 1995 KY 0.48 53000 27 15 1 1996 KY 0.49 53000 24 15 1 1997 KY 0.5 54000 24 15 1 1998 KY 0.51 55000 23 15 2 1990 TN 0.45 52000 23 12 2 1991 TN 0.45 51000 23 12 2 1992 TN 0.44 53000 24 12 2 1993 TN 0.45 51000 26 12 2 1994 TN 0.44 52000 25 12 2 1995 TN 0.43 54000 26 14 2 1996 TN 0.42 54000 25 14 2 1997 TN 0.4 55000 26 14 2 1998 TN 0.41 56000 25 14 3 1990 CA 0.89 102000 10 20 3 1991 CA 0.9 102500 11 20 3 1992 CA 0.9 103000 13 20 3 1993 CA 0.92 103500 12 20 3 1994 CA 0.93 104000 11 20 3 1995 CA 0.93 104000 12 25 3 1996 CA 0.94 104500 14 25 3 1997 CA 0.94 105000 12 25 3 1998 CA 0.95 105000 10 25 4 1990 IN 0.43 52000 25 10 4 1991 IN 0.44 52000 26 10 4 1992 IN 0.42 53000 26 10 4 1993 IN 0.46 53500 27 10 4 1994 IN 0.45 53500 28 10 4 1995 IN 0.46 54000 26 12 4 1996 IN 0.47 54000 26 12 4 1997 IN 0.45 54500 25 12 4 1998 IN 0.46 55000 24 12