Matching
δATE = E[y1i | Xi,Di=1] - E[y0i | Xi,Di=0] = ATE
This gives us the average difference in mean outcomes between treatment and control. It identifies the ATE when potential outcomes are independent of treatment status, (y1i,y0i ⊥ Di), as in a randomized controlled experiment.
We represent the matching estimator empirically by:
Σx δx P(Xi=x)
where δx is the difference in mean outcome values between treatment and control units at a particular value of X, that is, for a particular combination of covariates. Here conditional independence (y1i,y0i ⊥ Di | Xi) is assumed, so identification is achieved through a selection on observables framework.
The cell-level differences δx are weighted by the distribution of the covariates via the term P(Xi=x).
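As a rough illustration of the weighted sum above, here is a minimal R sketch for a single discrete covariate. The helper matching_estimate() and the small data set are hypothetical (they are not from any package or from the example later in this post), and the sketch assumes that cells lacking either treated or control units are simply dropped.

```r
# Sketch of the matching estimator sum_x delta_x * P(X = x) for a discrete covariate.
# matching_estimate() is a hypothetical helper written for this illustration.
matching_estimate <- function(y, d, x) {
  # keep only covariate cells that contain both treated and control units
  cells <- intersect(unique(x[d == 1]), unique(x[d == 0]))
  keep  <- x %in% cells
  y <- y[keep]; d <- d[keep]; x <- x[keep]

  # delta_x: treated-minus-control difference in mean outcomes within each cell
  delta_x <- tapply(y[d == 1], x[d == 1], mean) - tapply(y[d == 0], x[d == 0], mean)
  # P(X = x): share of the retained observations falling in each cell
  p_x <- as.numeric(table(x)) / length(x)

  sum(delta_x * p_x)
}

# made-up data: cell x = 1 has four units, x = 2 has two, x = 3 has no treated unit
y_demo <- c(5, 7, 3, 5, 10, 9, 4)
d_demo <- c(1, 1, 0, 0, 1, 0, 0)
x_demo <- c(1, 1, 1, 1, 2, 2, 3)

est_demo <- matching_estimate(y_demo, d_demo, x_demo)
print(est_demo)  # delta_1 = 2 with weight 4/6, delta_2 = 1 with weight 2/6
```

The cell x = 3 contributes nothing because it has no treated unit to compare against, which is the sense in which matching gives some observations no weight at all.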
Regression
We can represent a regression parameter using the basic formula taught to most undergraduates:
Single Variable: β = cov(y,D)/v(D)
Multivariable: βk = cov(y,D*)/v(D*)
where D* is the residual from a regression of D on all other covariates. More generally, [E(X'X)]^-1 E(X'y) is a vector whose kth element is cov(y,x*)/v(x*), where x* is the residual from a regression of that particular x on all other covariates.
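The residual-based formula can be checked numerically. The sketch below uses the demo data from the R code at the end of this post; the variable names d_star, beta_resid and beta_lm are my own labels for the illustration.

```r
# demo data (same values as in the R code at the end of the post)
x <- c(4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9)
d <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y <- c(6, 7, 8, 8, 9, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# D*: the part of treatment status not explained by the covariate
d_star <- resid(lm(d ~ x))

# coefficient on d computed two ways
beta_resid <- cov(y, d_star) / var(d_star)        # cov(y, D*) / v(D*)
beta_lm    <- unname(coef(lm(y ~ x + d))["d"])    # multivariable regression

print(c(beta_resid = beta_resid, beta_lm = beta_lm))  # both are about 0.75
```

The two numbers agree up to floating point error, which is all the "multivariable" formula claims.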
We can then represent the estimated treatment effect from regression as:
δR = cov(y,D*)/v(D*) = E[(Di - E[Di|Xi]) E[yi | Di,Xi]] / E[(Di - E[Di|Xi])^2], assuming (y1i,y0i ⊥ Di | Xi) and a regression that is saturated in Xi, so that D* = Di - E[Di|Xi]
Again regression and matching rely on similar identification strategies based on selection on observables/conditional independence.
Let E[yi | Di,Xi] = E[yi | Di=0,Xi] + δx Di
Then with more algebra we get: δR = cov(y,D*)/v(D*) = E[σ^2D(Xi) δx]/ E[σ^2D(Xi)]
where σ^2D(Xi) = E[(Di - E[Di|Xi])^2 | Xi] is the conditional variance of treatment D given X. For a binary treatment this equals P(Di=1|Xi)(1 - P(Di=1|Xi)).
While the algebra is cumbersome and notation heavy, the point is that the familiar regression estimate cov(y,D*)/v(D*) is equivalent (using expectations) to the term E[σ^2D(Xi) δx]/ E[σ^2D(Xi)]. This term weights our covariate-specific treatment-control differences δx by the conditional variance of D.
Hence, regression gives us a variance-weighted average treatment effect, whereas matching gives us a covariate-distribution-weighted average treatment effect.
So what does this mean in practical terms? Angrist and Pischke explain that regression puts more weight on covariate cells where the conditional variance of treatment status is greatest, that is, where there are roughly equal numbers of treated and control units. They state that the difference between the two estimators matters little when δx varies little across covariate combinations.
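To make the two weighting schemes concrete, here is a minimal sketch with hypothetical two-cell data (made up for this illustration, not the example used later in the post). Both cells are the same size, but cell A is half treated with an effect of 1, while cell B is only one-tenth treated with an effect of 3. Outcomes are noiseless so the arithmetic is easy to follow.

```r
# Hypothetical data: a discrete covariate with two cells of 10 units each
cell  <- rep(c("A", "B"), each = 10)
d     <- c(rep(c(1, 0), each = 5),   # cell A: 5 of 10 treated
           1, rep(0, 9))             # cell B: 1 of 10 treated
delta <- ifelse(cell == "A", 1, 3)   # cell-specific treatment effects
y     <- ifelse(cell == "A", 2, 5) + delta * d

# cell-level ingredients
delta_x <- tapply(y[d == 1], cell[d == 1], mean) -
           tapply(y[d == 0], cell[d == 0], mean)                 # delta_x
p_x     <- as.numeric(table(cell)) / length(cell)                # P(X = x)
var_x   <- tapply(d, cell, function(z) mean(z) * (1 - mean(z)))  # sigma^2_D(x)

# matching: weight by the covariate distribution
matching_est <- sum(delta_x * p_x)

# regression: weight by the conditional variance of treatment
var_weighted   <- sum(var_x * p_x * delta_x) / sum(var_x * p_x)
regression_est <- unname(coef(lm(y ~ d + factor(cell)))["d"])

print(c(matching = matching_est, variance_weighted = var_weighted,
        regression = regression_est))  # 2, then about 1.53 twice
```

The regression coefficient lands closer to cell A's effect because that cell has the larger conditional variance of treatment (0.25 versus 0.09), even though both cells hold the same share of the data.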
In his post The cardinal sin of matching, Chris Blattman puts it this way:
"For causal inference, the most important difference between regression and matching is what observations count the most. A regression tries to minimize the squared errors, so observations on the margins get a lot of weight. Matching puts the emphasis on observations that have similar X’s, and so those observations on the margin might get no weight at all....Matching might make sense if there are observations in your data that have no business being compared to one another, and in that way produce a better estimate"
Below is a very simple contrived example. Suppose our data looks like this (the values are listed in the R code at the end of this post):
We can see that those in the treatment group tend to have higher outcome values so a straight comparison between treatment and controls will overestimate treatment effects due to selection bias:
E[Yi|Di=1] - E[Yi|Di=0] = E[Y1i-Y0i|Di=1] + {E[Y0i|Di=1] - E[Y0i|Di=0]}
The first term on the right is the average treatment effect on the treated and the term in braces is the selection bias.
In these data the straight comparison of means gives 3.78. However, if we estimate differences based on an exact matching scheme, we get a much smaller estimate of .67. If we run a regression using all of the data we get .75. If we consider 3.78 to be biased upward, then both matching and regression have reduced it substantially, and depending on the application the difference between .67 and .75 may not be of great consequence. Of course, if we run the regression on only the matched observations, we get exactly the matching estimate of .67 (see R code below). This is not so different from the method of trimming based on propensity scores suggested in Angrist and Pischke.
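The four numbers in this example can be reproduced with a short sketch like the one below. It assumes that "exact matching" here means keeping only the values of x that appear in both the treatment and control groups; the names naive, matched, matched_diff, reg_all and reg_matched are my own labels.

```r
# demo data (same values as in the R code at the end of the post)
x <- c(4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9)
d <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y <- c(6, 7, 8, 8, 9, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# straight comparison of treated and control means
naive <- mean(y[d == 1]) - mean(y[d == 0])

# exact matching on x: keep values of x observed in both groups (4 through 9)
matched      <- x %in% intersect(x[d == 1], x[d == 0])
matched_diff <- mean(y[matched & d == 1]) - mean(y[matched & d == 0])

# regression on all of the data, and on the matched observations only
reg_all     <- unname(coef(lm(y ~ x + d))["d"])
reg_matched <- unname(coef(lm(y ~ x + d, subset = matched))["d"])

print(round(c(naive = naive, matched = matched_diff,
              reg_all = reg_all, reg_matched = reg_matched), 2))
# naive 3.78, matched 0.67, reg_all 0.75, reg_matched 0.67
```

In the matched sample every value of x has exactly one treated and one control unit, so x and d are uncorrelated and the regression coefficient on d collapses to the simple difference in means.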
Both methods rely on the same assumptions for identification, so no one can argue for the superiority of one method over the other with regard to identification of causal effects.
Matching has the advantage of being nonparametric, alleviating concerns about functional form. However, there are lots of considerations to work through in matching (e.g. 1:1 vs. 1:many matching, optimal caliper width, the variance/bias tradeoff, kernel selection, etc.). While all of these possibilities might lead to better estimates, I wonder if they don't sometimes lead to a garden of forking paths.
See also:
For a neater set of notes related to this post, see:
Matt Bogard. "Regression and Matching (3).pdf" Econometrics, Statistics, Financial Data Modeling (2017). Available at: http://works.bepress.com/matt_bogard/37/
Using R MatchIt for Propensity Score Matching
R Code:
# generate demo data
x <- c(4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9)          # covariate
d <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)             # treatment indicator
y <- c(6, 7, 8, 8, 9, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9, 10)        # outcome
summary(lm(y ~ x + d)) # regression controlling for x