Econometric Sense: Synthetic Controls - An Example with Toy Data

Abadie, Diamond, and Hainmueller (2010) introduce the method of synthetic controls as an alternative to difference-in-differences, applying it to evaluate the effectiveness of a tobacco control program in California.

A very good summary of how this method works is given by Bret Zeldow and Laura Hatfield at the healthpolicydatascience.org website:

"The idea behind synthetic control is that a weighted combination of control units can form a closer match to the treated group than any one (or several) control unit (Abadie, Diamond, and Hainmueller (2010)). The weights are chosen to minimize the distance between treated and control on a set of matching variables, which can include covariates and pre-treatment outcomes. The post-period outcomes for the synthetic control are calculated by taking a weighted average of the control groups’ outcomes. Many authors have extended synthetic control work recently (Kreif et al. 2016; Xu 2017; Ferman, Pinto, and Possebom 2017; Kaul et al. 2015)."

Bouttell et al. (2018) provide a nice survey and introduction to the method as applied to public health interventions.

For a very nice tour of the math and example R code, see the post "More public policy analysis: synthetic control in under an hour" at The Samuelson Condition blog (linked in the references).

A Toy Example:

The data at the end of this post are completely made up for this oversimplified example. Let's assume that we have some intervention in Kentucky in 1995 that affects some outcome 'Y' in 1996, 1997, and 1998; maybe we are trying to increase the percentage of restaurants with smoke-free policies.

Perhaps we want to compare KY to a synthetic control built from a donor pool of TN, IN, and CA, using the values of covariates and predictors measured prior to the intervention (X1, X2, X3) as well as pre-period values of Y.

Using the Synth package in R and the data below, the weights for constructing the synthetic control from TN, IN, and CA, with KY as the treated unit, are:

w.weights	unit.names
0.021	TN
0.044	CA
0.936	IN

Heuristically, we could think of the synthetic control as approximately 2.1% TN, 4.4% CA, and 93.6% IN. If you look at the data, you can see that these weights make intuitive sense: I created the toy data so that IN looked a lot more like KY than the other states did.
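To get a feel for why IN dominates, here is a minimal sketch in Python (not the R code used for this post, and not the Synth algorithm). It assumes 1990-1995 is the pre-period and matches on pre-period Y only, ignoring X1-X3, using a brute-force grid search over non-negative weights that sum to one. The function names `sse` and `grid_search` are made up for this illustration. Because Synth also matches on the covariates and weights the predictors, the numbers here will not reproduce the weights reported above, but on this toy data most of the weight should still land on IN and only a little on CA.

```python
# Pre-period outcomes Y for 1990-1995, copied from the Data table.
# Assumption: 1990-1995 is the pre-period, since effects start in 1996.
KY = [0.45, 0.45, 0.46, 0.48, 0.48, 0.48]
DONORS = {
    "TN": [0.45, 0.45, 0.44, 0.45, 0.44, 0.43],
    "CA": [0.89, 0.90, 0.90, 0.92, 0.93, 0.93],
    "IN": [0.43, 0.44, 0.42, 0.46, 0.45, 0.46],
}


def sse(weights):
    """Sum of squared pre-period gaps between KY and the weighted donors."""
    total = 0.0
    for t, ky in enumerate(KY):
        synthetic = sum(w * DONORS[state][t] for state, w in weights.items())
        total += (ky - synthetic) ** 2
    return total


def grid_search(steps=100):
    """Try every non-negative weight vector on a 1/steps grid that sums to 1."""
    best_loss, best_weights = None, None
    for i in range(steps + 1):          # weight on TN, in units of 1/steps
        for j in range(steps + 1 - i):  # weight on CA, in units of 1/steps
            candidate = {
                "TN": i / steps,
                "CA": j / steps,
                "IN": (steps - i - j) / steps,
            }
            loss = sse(candidate)
            if best_loss is None or loss < best_loss:
                best_loss, best_weights = loss, candidate
    return best_weights


weights = grid_search()
print(weights)
```

A grid search is crude, but with three donors it only has to check a few thousand candidates, and it makes the constraint (weights between 0 and 1 that sum to 1) easy to see.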

As an additional smell test, suppose I construct a synthetic control using only TN and CA, changing one line of R code to reflect only these two states:

controls.identifier = c(2,3), # these states are part of our control pool which will be weighted

I get the following different set of weights:

w.weights	unit.names	unit.numbers
0.998	TN	2
0.002	CA	3

This makes sense because I made up data for CA that really is quite a bit different from KY. It should contribute very little as a control unit used to calculate a synthetic KY. (In fact, maybe it should not be used at all.)
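The two-donor case is simple enough to check by hand. The following is a sketch in Python under the same simplifying assumptions as before: match on pre-period Y only (1990-1995 assumed) and ignore X1-X3. With two donors the synthetic control is w*TN + (1-w)*CA, so the least-squares w has a closed form that only needs to be clipped to the interval [0, 1]. On the toy data this gives a weight of about 0.95 on TN. That is lower than the 0.998 from Synth, which also matches on the covariates, but it tells the same story: CA contributes very little.

```python
# Pre-period outcomes Y for 1990-1995, copied from the Data table.
# Assumption: 1990-1995 is the pre-period, since effects start in 1996.
KY = [0.45, 0.45, 0.46, 0.48, 0.48, 0.48]
TN = [0.45, 0.45, 0.44, 0.45, 0.44, 0.43]
CA = [0.89, 0.90, 0.90, 0.92, 0.93, 0.93]

# Minimizing sum((KY - w*TN - (1-w)*CA)^2) is the same as regressing
# (KY - CA) on (TN - CA) with no intercept, so w = num / den.
num = sum((ky - ca) * (tn - ca) for ky, tn, ca in zip(KY, TN, CA))
den = sum((tn - ca) ** 2 for tn, ca in zip(TN, CA))

# Clip to [0, 1] so both weights stay non-negative and sum to 1.
w_tn = min(1.0, max(0.0, num / den))
w_ca = 1.0 - w_tn

print(round(w_tn, 3), round(w_ca, 3))
```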

The package allows us to plot the trend in outcome Y for the pre- and post-periods. But with this small toy data set we could also calculate the synthetic (counterfactual) values for KY by hand in Excel and get roughly the same plot.

For instance, the value for KY in 1998 is .51, but the counterfactual value created by the synthetic control is the weighted combination of the 1998 outcomes for TN, CA, and IN, or .021*.41 + .044*.95 + .936*.46 = .48097.

With only three states in the donor pool, the fit in this small data set is far from ideal, but we can see that the synthetic control roughly tracks KY's trend in the pre-period and diverges noticeably in the post-period.

The difference between .51 and .48097 (about .029), the 'gap' between KY and its synthetic control, represents the estimated impact of the program in KY in 1998. Placebo tests can be run and visualized by treating each state in the donor pool as a 'placebo treatment' and constructing its synthetic control from the remaining states. This produces a distribution of gaps that characterizes the uncertainty in our estimate of the treatment effect based on the comparison of KY with its synthetic control, KY*.
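The same by-hand calculation can be repeated for every year. The sketch below is in Python rather than R or Excel. It takes the weights reported by Synth above as given and applies them to the Y column of the Data table. The helper name `synthetic_series` is made up for this example.

```python
# Weights reported by Synth for the three-state donor pool (from the text).
WEIGHTS = {"TN": 0.021, "CA": 0.044, "IN": 0.936}

YEARS = list(range(1990, 1999))

# Outcome Y for each state, 1990-1998, copied from the Data table.
Y = {
    "KY": [0.45, 0.45, 0.46, 0.48, 0.48, 0.48, 0.49, 0.50, 0.51],
    "TN": [0.45, 0.45, 0.44, 0.45, 0.44, 0.43, 0.42, 0.40, 0.41],
    "CA": [0.89, 0.90, 0.90, 0.92, 0.93, 0.93, 0.94, 0.94, 0.95],
    "IN": [0.43, 0.44, 0.42, 0.46, 0.45, 0.46, 0.47, 0.45, 0.46],
}


def synthetic_series(weights, outcomes, n_years):
    """Weighted average of the donor outcomes, one value per year."""
    return [
        sum(w * outcomes[state][t] for state, w in weights.items())
        for t in range(n_years)
    ]


synthetic_ky = synthetic_series(WEIGHTS, Y, len(YEARS))

# Print actual KY next to synthetic KY for each year.
for year, actual, synthetic in zip(YEARS, Y["KY"], synthetic_ky):
    print(year, actual, round(synthetic, 5))
```

The last row is the 1998 value worked out above; the other rows are what would be plotted against KY's actual trend.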

The code excerpt below shows how we would designate CA as the placebo treatment and use the remaining states to create its synthetic control. This could be iterated across all of the remaining controls.

treatment.identifier = 3, # indicates our 'placebo' treatment group
controls.identifier = c(1,2,4), # these states are part of our control pool which will be weighted
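To make the iteration concrete, here is a rough sketch of the placebo loop in Python. It is not the Synth workflow. The assumptions are that 1990-1995 is the pre-period, that weights are fit on pre-period Y only (X1-X3 are ignored), and that a simple grid search stands in for Synth's optimizer. The names `fit_weights` and `post_gaps` are hypothetical helpers written for this illustration. Each state takes a turn as the "treated" unit, with the other three states as its donor pool.

```python
# Outcome Y for each state, 1990-1998, copied from the Data table.
Y = {
    "KY": [0.45, 0.45, 0.46, 0.48, 0.48, 0.48, 0.49, 0.50, 0.51],
    "TN": [0.45, 0.45, 0.44, 0.45, 0.44, 0.43, 0.42, 0.40, 0.41],
    "CA": [0.89, 0.90, 0.90, 0.92, 0.93, 0.93, 0.94, 0.94, 0.95],
    "IN": [0.43, 0.44, 0.42, 0.46, 0.45, 0.46, 0.47, 0.45, 0.46],
}
N_YEARS = 9
PRE = 6  # assumption: the first six years (1990-1995) are the pre-period


def fit_weights(target, donors, steps=100):
    """Grid search over three donor weights, matching on pre-period Y only."""
    a, b, c = donors
    best_loss, best_w = None, None
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            w = (i / steps, j / steps, (steps - i - j) / steps)
            loss = sum(
                (Y[target][t] - w[0] * Y[a][t] - w[1] * Y[b][t] - w[2] * Y[c][t]) ** 2
                for t in range(PRE)
            )
            if best_loss is None or loss < best_loss:
                best_loss, best_w = loss, w
    return dict(zip(donors, best_w))


def post_gaps(target):
    """Post-period gaps (actual minus synthetic) with `target` as the treated unit."""
    donors = [state for state in Y if state != target]
    weights = fit_weights(target, donors)
    return [
        Y[target][t] - sum(weights[s] * Y[s][t] for s in donors)
        for t in range(PRE, N_YEARS)
    ]


# KY is the real treated unit; TN, CA, and IN are the placebo runs.
gaps = {unit: post_gaps(unit) for unit in Y}
for unit, unit_gaps in gaps.items():
    print(unit, [round(g, 3) for g in unit_gaps])
```

With a donor pool this small the placebo gaps are only illustrative. CA in particular cannot be reproduced by any weighted average of the other three states, so its gap is large regardless of any treatment, which is one reason poorly fitting placebo units are often set aside when comparing gaps.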

R Code: https://gist.github.com/BioSciEconomist/6eb824527c03e12372667fb8861299bd

References:

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association 105: 493–505. doi:10.1198/jasa.2009.ap08746.

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2011. “Synth: An R Package for Synthetic Control Methods in Comparative Case Studies.” Journal of Statistical Software.

Bouttell J, Craig P, Lewsey J, et al. 2018. “Synthetic control methodology as a tool for evaluating population-level health interventions.” J Epidemiol Community Health 72: 673–678.

More public policy analysis: synthetic control in under an hour
https://thesamuelsoncondition.com/2016/04/29/more-public-policy-analysis-synthetic-control-in-under-an-hour/comment-page-1/

Data:

ID	year	state	Y	X1	X2	X3
1	1990	KY	0.45	50000	25	10
1	1991	KY	0.45	51000	26	10
1	1992	KY	0.46	52000	27	10
1	1993	KY	0.48	52000	28	10
1	1994	KY	0.48	52000	28	10
1	1995	KY	0.48	53000	27	15
1	1996	KY	0.49	53000	24	15
1	1997	KY	0.5	54000	24	15
1	1998	KY	0.51	55000	23	15
2	1990	TN	0.45	52000	23	12
2	1991	TN	0.45	51000	23	12
2	1992	TN	0.44	53000	24	12
2	1993	TN	0.45	51000	26	12
2	1994	TN	0.44	52000	25	12
2	1995	TN	0.43	54000	26	14
2	1996	TN	0.42	54000	25	14
2	1997	TN	0.4	55000	26	14
2	1998	TN	0.41	56000	25	14
3	1990	CA	0.89	102000	10	20
3	1991	CA	0.9	102500	11	20
3	1992	CA	0.9	103000	13	20
3	1993	CA	0.92	103500	12	20
3	1994	CA	0.93	104000	11	20
3	1995	CA	0.93	104000	12	25
3	1996	CA	0.94	104500	14	25
3	1997	CA	0.94	105000	12	25
3	1998	CA	0.95	105000	10	25
4	1990	IN	0.43	52000	25	10
4	1991	IN	0.44	52000	26	10
4	1992	IN	0.42	53000	26	10
4	1993	IN	0.46	53500	27	10
4	1994	IN	0.45	53500	28	10
4	1995	IN	0.46	54000	26	12
4	1996	IN	0.47	54000	26	12
4	1997	IN	0.45	54500	25	12
4	1998	IN	0.46	55000	24	12

Tuesday, April 23, 2019
