Sunday, December 30, 2012

Causal Inference in A Nutshell

We often hear the platitude that correlation does not imply causation. It is certainly true that pundits, and even some researchers, can be reckless in interpreting correlations that are sometimes spurious. However, there are times, particularly with well-designed experiments, when we do infer causation. For example, we may infer from well-designed drug trials that a certain drug actually improves the health of its users. In the social sciences we seldom get well-designed experiments with randomized treatments. However, through the use of 'quasi-experimental' methodologies, we can get much closer to the ideal experiment than we otherwise would. This article will not explicitly focus on the derivation of estimators or their properties but will simply supply motivation for thinking about causal inference in the econometric sense.

For more details see also: Empirical Work in The Social Sciences; Mixed, Fixed, and Random Effects Models; Linear Models; Interaction Models; Instrumental Variables; Instrumental Variables and Selection Bias; Difference-in-Difference Estimators  & Propensity Score Matching in Higher Education Research

Ordinary Least Squares (OLS):
y = b0 + b1x + e
y = b0 + b1x1 + b2x2 + e   2-variable case
y = bx + e,  b = (x’x)^(-1) x’y   vectorized multivariable case

Ordinary Least Squares (OLS) provides the best linear approximation to the population conditional expectation function (CEF), f(x) = E(y|x), even if the CEF is non-linear. The usefulness of OLS as an empirical tool for assessing the essential features of causal relationships therefore does not hinge on the CEF being linear. The least squares regression equation is causal to the extent that the population CEF is causal.
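As a rough illustration of the best-linear-approximation point, the sketch below fits a least squares line to a deliberately non-linear CEF, E(y|x) = x^2, on five made-up points. The data and the helper name `ols_fit` are assumptions introduced here for illustration; the helper uses the single-regressor formula b1 = Cov(x,y)/Var(x).

```python
def ols_fit(x, y):
    """Return (b0, b1) for the least squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b1 = Cov(x, y) / Var(x); the 1/n factors cancel in the ratio.
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = cov_xy / var_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data: the true CEF is non-linear, E(y|x) = x**2.
x = [0, 1, 2, 3, 4]
y = [xi ** 2 for xi in x]

b0, b1 = ols_fit(x, y)
print(f"best linear approximation: y = {b0:.1f} + {b1:.1f} x")
```

On these points the fitted line is y = -2 + 4x. It is not the CEF itself, but it is the line that minimizes the squared distance to it.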

Interaction Models:  y = b0 + b1x + b2z +b3xz + e

The relationship between x and y is conditional on z. Ex: If z is binary, the marginal effect of x on y can be expressed as follows:
∂y/∂x = b1+b3z   for  z= 1
∂y/∂x = b1   for z= 0
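The two marginal effects can be checked numerically. The sketch below uses made-up coefficient values (an assumption, not estimates from any data) and compares the analytic expression b1 + b3z with the change in the predicted value for a one-unit change in x.

```python
def predict(x, z, b0, b1, b2, b3):
    """Systematic part of the interaction model y = b0 + b1*x + b2*z + b3*x*z."""
    return b0 + b1 * x + b2 * z + b3 * x * z


def marginal_effect_of_x(z, b1, b3):
    """dy/dx = b1 + b3*z, which reduces to b1 when z = 0."""
    return b1 + b3 * z


# Hypothetical coefficients, chosen only for illustration.
coef = dict(b0=1.0, b1=2.0, b2=0.5, b3=-1.5)

for z in (0, 1):
    analytic = marginal_effect_of_x(z, coef["b1"], coef["b3"])
    # Change in the prediction when x moves from 0 to 1, holding z fixed.
    numeric = predict(1, z, **coef) - predict(0, z, **coef)
    print(f"z={z}: analytic={analytic}, numeric={numeric}")
```

With these values the effect of x is 2.0 when z = 0 and 0.5 when z = 1, so b1 alone describes the effect of x only for the z = 0 group.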

Selection Bias:

ci = choice/selection/treatment
Y0i = baseline potential outcome
Y1i = potential treatment outcome

E[Yi|ci=1] - E[Yi|ci=0] = E[Y1i-Y0i|ci=1] + {E[Y0i|ci=1] - E[Y0i|ci=0]}

Observed effect = treatment effect on the treated + {selection bias}

If the baseline potential outcomes ‘Y0i’ of those who are treated non-randomly or who self-select (ci=1) differ from the baseline potential outcomes ‘Y0i’ of those who are not treated or do not self-select (ci=0), then the term {E[Y0i|ci=1] - E[Y0i|ci=0]} can take a positive or negative value, creating selection bias. When we calculate the observed difference between treated and untreated groups, E[Yi|ci=1] - E[Yi|ci=0], selection bias becomes confounded with the actual treatment effect E[Y1i-Y0i|ci=1].
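The decomposition can be verified on a toy population in which both potential outcomes are known for everyone, which is never the case in real data. All numbers below are invented for illustration.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical population: (c, y0, y1) = treatment status and both potential outcomes.
# In real data only y1 is observed for the treated and only y0 for the untreated.
population = [
    (1, 5.0, 7.0),
    (1, 6.0, 9.0),
    (0, 2.0, 3.0),
    (0, 4.0, 4.0),
]

treated = [p for p in population if p[0] == 1]
untreated = [p for p in population if p[0] == 0]

# Observed difference: E[Y|c=1] - E[Y|c=0]
observed_diff = mean([y1 for _, _, y1 in treated]) - mean([y0 for _, y0, _ in untreated])

# Treatment effect on the treated: E[Y1 - Y0 | c=1]
att = mean([y1 - y0 for _, y0, y1 in treated])

# Selection bias: E[Y0|c=1] - E[Y0|c=0]
selection_bias = mean([y0 for _, y0, _ in treated]) - mean([y0 for _, y0, _ in untreated])

print(observed_diff, att, selection_bias)
```

Here the observed difference is 5.0, of which only 2.5 is the treatment effect on the treated. The remaining 2.5 arises because the treated group would have had higher outcomes even without treatment.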

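Conditional Independence Assumption (CIA):
Conditional on observed covariates ‘xi’, the potential outcomes are independent of treatment status ‘ci’, so that
E[Yi|xi,ci=1] - E[Yi|xi,ci=0] = E[Y1i-Y0i|xi]
With comparisons made conditional on ‘x’, selection bias disappears.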

Matching Estimators: make comparisons across groups with similar ‘matched’ covariate values.
E[Y1i-Y0i|ci=1] = Σx δx P(xi=x|ci=1),  where δx = E[Yi|xi=x,ci=1] - E[Yi|xi=x,ci=0]
The matching estimator calculates an average of the within-cell differences between groups (δx), weighted by the distribution of covariates among the treated, P(xi=x|ci=1).
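A minimal sketch of this weighted average for a single discrete covariate follows. The records and the function name `matching_att` are hypothetical, and the sketch assumes every covariate cell contains both treated and untreated observations.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


# Hypothetical records: (x, c, y) = covariate cell, treatment status, outcome.
records = [
    (0, 1, 10.0), (0, 1, 12.0), (0, 0, 8.0), (0, 0, 10.0), (0, 0, 9.0),
    (1, 1, 20.0), (1, 0, 15.0), (1, 0, 17.0),
]


def matching_att(records):
    """Sum over x of delta_x * P(x | c=1), assuming each cell has both groups."""
    cells = defaultdict(lambda: {0: [], 1: []})
    for x, c, y in records:
        cells[x][c].append(y)
    n_treated = sum(len(groups[1]) for groups in cells.values())
    att = 0.0
    for groups in cells.values():
        delta_x = mean(groups[1]) - mean(groups[0])   # within-cell difference
        weight = len(groups[1]) / n_treated           # P(x | c=1)
        att += delta_x * weight
    return att


att = matching_att(records)
naive = (mean([y for _, c, y in records if c == 1])
         - mean([y for _, c, y in records if c == 0]))
print(f"matching estimate: {att:.3f}, unadjusted difference: {naive:.3f}")
```

In this example the within-cell differences are 2 and 4, and two thirds of the treated fall in the first cell, so the matching estimate is 8/3. The unadjusted difference in means is 2.2.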
Regression Estimators: OLS provides a matching estimator based on a variance-weighted average of treatment effects:  b = (x’x)^(-1) x’y, or Cov(x,y)/Var(x) in the single-regressor case.
Propensity score matching works similarly to covariate matching, except that comparisons are made based on propensity scores rather than specific covariate values. OLS with proper controls (the same covariates used in matching, or the same x’s used to generate the propensity scores) provides a robust matching estimator that gives very similar results to explicit matching estimators and propensity score matching estimators.
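To make the variance-weighting claim concrete, the sketch below computes the OLS coefficient on a treatment dummy when the regression also includes a dummy for each covariate cell, and compares it with an explicit variance-weighted average of the within-cell differences. The data are hypothetical. As a computational shortcut (not something the text above prescribes), the OLS coefficient is obtained through the Frisch-Waugh result by demeaning within cells rather than by inverting x’x.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


# Hypothetical records: (x, c, y) = covariate cell, treatment status, outcome.
records = [
    (0, 1, 10.0), (0, 1, 12.0), (0, 0, 8.0), (0, 0, 10.0), (0, 0, 9.0),
    (1, 1, 20.0), (1, 0, 15.0), (1, 0, 17.0),
]


def ols_treatment_coef(records):
    """Coefficient on c from regressing y on c and a dummy for every x cell.

    Frisch-Waugh: demean c and y within each cell, then regress
    demeaned y on demeaned c.
    """
    cells = defaultdict(list)
    for x, c, y in records:
        cells[x].append((c, y))
    num = den = 0.0
    for obs in cells.values():
        c_bar = mean([c for c, _ in obs])
        y_bar = mean([y for _, y in obs])
        num += sum((c - c_bar) * (y - y_bar) for c, y in obs)
        den += sum((c - c_bar) ** 2 for c, _ in obs)
    return num / den


def variance_weighted_average(records):
    """Average of delta_x with weights n_x * p_x * (1 - p_x), p_x = P(c=1 | x)."""
    cells = defaultdict(lambda: {0: [], 1: []})
    for x, c, y in records:
        cells[x][c].append(y)
    num = den = 0.0
    for groups in cells.values():
        n_x = len(groups[0]) + len(groups[1])
        p_x = len(groups[1]) / n_x
        delta_x = mean(groups[1]) - mean(groups[0])
        weight = n_x * p_x * (1 - p_x)
        num += weight * delta_x
        den += weight
    return num / den


b_ols = ols_treatment_coef(records)
b_vw = variance_weighted_average(records)
print(f"OLS with cell dummies: {b_ols:.4f}, variance-weighted average: {b_vw:.4f}")
```

Both numbers equal 19/7 (about 2.714) here. That is close to, but not identical to, a matching estimate weighted by the covariate distribution of the treated, because the regression gives more weight to cells where treatment status varies most.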
Fixed Effects and Heterogeneity:
Mixed Models:  y = bx + zα + e;  bx = fixed effects (FE);  zα + e = random effects (RE)
Heterogeneity: unobserved individual effects
FE: captures individual effects by shifting the regression equation with a dummy variable ‘d’ for each individual
y = bx + dα + e
RE: assumes that individual effects are randomly distributed across individuals and models them as random intercepts; the RE model is a special case of the general mixed model
y = bx  + α + e 
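A small sketch of why unobserved individual effects matter: in the made-up two-person panel below, each individual follows y = 2x + α with a different intercept α. The pooled slope ignores the intercepts, while the within (demeaned) slope removes them. Demeaning within individuals is used here as a stand-in for including the dummy variables explicitly, since the two give the same slope estimate. The panel and helper names are assumptions for illustration.

```python
def slope(x, y):
    """Single-regressor OLS slope, Cov(x, y) / Var(x)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den


def demean(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


# Hypothetical panel: individual -> (x over time, y over time).
# Both follow y = 2*x + alpha, with alpha = 1 for A and alpha = -6 for B.
panel = {
    "A": ([1, 2, 3], [3, 5, 7]),
    "B": ([4, 5, 6], [2, 4, 6]),
}

# Pooled OLS: stack everyone and ignore the individual effects.
pooled_x = [xi for x, _ in panel.values() for xi in x]
pooled_y = [yi for _, y in panel.values() for yi in y]
b_pooled = slope(pooled_x, pooled_y)

# Fixed effects (within) estimator: demean x and y for each individual first.
within_x = [v for x, _ in panel.values() for v in demean(x)]
within_y = [v for _, y in panel.values() for v in demean(y)]
b_within = slope(within_x, within_y)

print(f"pooled slope: {b_pooled:.2f}, within (FE) slope: {b_within:.2f}")
```

The pooled slope is 0.2 because the individual with larger x values also has a much lower intercept. The within slope recovers the common value of 2.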
Instrumental Variables: 
Suppose you are trying to assess the treatment effect of ‘s’, but there is some omitted factor ‘A’ that is not accounted for, or there is some component ‘A’ related to ‘s’ that is not observable or measurable. In this sense we say that there is measurement error in ‘s’.
y = b0 + b1s + b2A + e   full regression
y = b0 + b1s + e   short regression given observable data on ‘s’, omitting ‘A’; the error term ‘e’ now absorbs b2A
To the extent that ‘A’ is correlated with ‘s’, we have omitted variable bias in our estimate of b1, the treatment effect. IV techniques attempt to estimate b1 using only the ‘quasi-experimental’ portion of the variation in ‘s’ that is unrelated to ‘A’. If we could observe ‘A’ we would just include it in the regression, or if we could properly measure ‘s’ we would not require IVs.
IV estimation requires that we find some variable ‘z’ that is correlated with ‘s’ but uncorrelated with the error term ‘e’ of the short regression (which contains the omitted ‘A’); that is, Cov(z,e) = 0. The only relationship between ‘z’ and the outcome ‘y’ should be through ‘s’, so we have z→s→y and
E(y|z) = b E(s|z) + E(e|z);  given E(e|z) = 0,  bIV = Cov(z,y)/Cov(z,s) = (z’s)^(-1) z’y
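IV estimates are 2-stage regression estimates:
1.  s* = bz   (first stage: regress ‘s’ on ‘z’ and keep the fitted values s*)
2.  y = bs*   (second stage: regress ‘y’ on s*; the coefficient on s* is bIV)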
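The two stages can be traced on a tiny, noise-free example. Everything below is an assumed construction for illustration: ‘z’ and ‘A’ are set up to be uncorrelated with each other, ‘s’ depends on both, and the true effect of ‘s’ on ‘y’ is 2.

```python
def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Sum of cross-deviations; the 1/n factor cancels in the ratios below."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))


def ols(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    b1 = cov(x, y) / cov(x, x)
    return mean(y) - b1 * mean(x), b1


# Hypothetical data: instrument z and unobserved factor A in a balanced design,
# so that z is uncorrelated with A by construction.
z = [0, 0, 1, 1]
A = [0, 1, 0, 1]
s = [zi + ai for zi, ai in zip(z, A)]           # treatment depends on z and A
y = [2 * si + 3 * ai for si, ai in zip(s, A)]   # full model: b1 = 2, b2 = 3

# Short regression of y on s, omitting A: biased because A moves with s.
_, b_ols = ols(s, y)

# IV as a ratio of covariances.
b_iv = cov(z, y) / cov(z, s)

# The same estimate in two stages.
a1, g1 = ols(z, s)                     # stage 1: regress s on z
s_hat = [a1 + g1 * zi for zi in z]     # fitted values s*
_, b_2sls = ols(s_hat, y)              # stage 2: regress y on s*

print(f"short OLS: {b_ols}, IV ratio: {b_iv}, 2SLS: {b_2sls}")
```

The short regression gives 3.5, overstating the effect because ‘A’ raises both ‘s’ and ‘y’. The covariance ratio and the two-stage procedure both give 2. In practice the second-stage standard errors need adjustment, which this sketch does not attempt.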
Difference in Difference:
DD estimators assume that in the absence of treatment the difference between control and treatment groups would be constant or ‘fixed’ over time. DD estimators are a special type of fixed effects estimator.
(A-B): Differences between groups pre-treatment represent the ‘normal’ difference between groups.

(A’-B) = total post-treatment difference = normal difference (A-B) + treatment effect (A’-A)
DD estimates compare the difference in group averages for ‘y’ pre-treatment to the difference in group averages post-treatment. The larger the change in that difference post-treatment, the larger the treatment effect.

This can also be represented in the regression context with interactions, where t is a time indicator for the pre- and post-treatment periods and x is an indicator for the treatment and control groups. In the pre-treatment period (t = 0) there is no treatment, so the terms involving t equal 0. The parameter b3 is our difference-in-difference estimator.

y = b0 + b1x + b2t+b3xt + e 
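Because this regression has one parameter for each of the four group-by-period cells, its OLS coefficients can be read directly off the cell means. The sketch below does that for invented data, with x = 1 for the treatment group and t = 1 for the post-treatment period (both codings are assumptions).

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical outcomes keyed by (x, t): x = 1 treatment group, t = 1 post-treatment.
cells = {
    (0, 0): [9.0, 11.0],    # control, pre
    (0, 1): [11.0, 13.0],   # control, post
    (1, 0): [13.0, 15.0],   # treatment, pre
    (1, 1): [18.0, 20.0],   # treatment, post
}
m = {key: mean(values) for key, values in cells.items()}

b0 = m[(0, 0)]                                          # control group, pre-treatment
b1 = m[(1, 0)] - m[(0, 0)]                              # 'normal' pre-treatment group gap
b2 = m[(0, 1)] - m[(0, 0)]                              # time trend in the control group
b3 = (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])  # difference in differences

print(f"b0={b0}, b1={b1}, b2={b2}, b3={b3}")
```

Here the treatment group gains 5 and the control group gains 2 over the same period, so b3 = 3. That number is the treatment effect only if the two groups would have moved in parallel without treatment.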

