I have previously discussed instrumental variables (here and here) from a somewhat technical standpoint, but now I’d like to present a very basic example with a toy data set that demonstrates how IV estimation works in practice. The data set below is fabricated for demonstration purposes. The idea is to develop intuition about the mechanics of IV estimators, so we won’t concern ourselves with getting the appropriate standard errors in this example.
Suppose an institution has a summer camp designed to prepare high school students for their first year of college, and we want to assess the impact of the camp on first-year retention. The most basic regression model to assess the impact of CAMP might be specified as follows:
Y = β0 + β1 CAMP + β2 X + e   (1)
where Y = first-year retention (see here and here for a thorough and unapologetic discussion of linear probability models)
CAMP = an indicator for camp attendance
X = a vector of controls
Let’s simplify the discussion and exclude controls from the
analysis for now. That leaves us with:
Y = β0 + β1 CAMP + e   (2)
The causal effect of interest, the treatment effect of CAMP, is the estimate of β1 in the regression above. Using the toy data we get an estimate of the treatment effect β1 = .68 (i.e., CAMP attendance is associated with a 68 percentage point higher retention rate compared to students who don’t attend). But what if CAMP attendance is voluntary? If attendance is voluntary, it could be that students who choose to attend also have a high propensity to succeed due to unmeasured factors (social capital, innate ability, ambition, etc.). If that is the case, our observed estimate of β1 could overstate the actual impact of CAMP on retention. If we knew of a variable that captures the omitted factors related both to the choice to attend CAMP and to a greater likelihood of retaining (social capital, innate ability, ambition, etc.), call it INDEX, we would include it and estimate the following:
Y = β0 + β1 CAMP + β2 INDEX + e   (3)
We would get the estimate β1 = 0.3636, which would be closer to the true effect of CAMP. So, omitted variable bias in equation (2) is causing us to overestimate the effect of CAMP. One way to characterize the selection bias problem is through the potential outcomes framework that I have discussed before, but this time let’s characterize it in terms of the regression specification above. By omitting INDEX, information about INDEX gets sucked up into the error term. To the extent that INDEX is correlated with CAMP, CAMP then becomes correlated with the error term ‘e’. This correlation with the error term violates the classical regression assumptions and leads to biased estimates of β1, which we see in the higher value (.68) we get above when we omit INDEX. (For more technical terminology than ‘getting sucked up into the error term’ see my discussion of unobserved heterogeneity and endogeneity.)
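The omitted variable bias story is easy to reproduce with a quick simulation. The sketch below is my own fabricated data, not the toy data set from this post, and it uses Python (rather than the R code at the end of the post) so it stands alone: an unobserved INDEX drives both CAMP attendance and retention, OLS omitting INDEX overstates an assumed true effect of 0.4, and controlling for INDEX recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Unobserved propensity to succeed (social capital, ability, ambition, ...)
INDEX = rng.normal(size=n)

# Voluntary attendance: students with a high INDEX are more likely to attend
CAMP = (INDEX + rng.normal(size=n) > 0).astype(float)

# Assumed true treatment effect of CAMP is 0.4; INDEX also raises the outcome
Y = 0.4 * CAMP + 0.5 * INDEX + rng.normal(scale=0.5, size=n)

def ols(y, *xs):
    """OLS with an intercept; returns the slope coefficients."""
    X = np.column_stack([np.ones_like(y)] + list(xs))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

b_biased = ols(Y, CAMP)[0]         # omits INDEX -> overstated effect
b_full   = ols(Y, CAMP, INDEX)[0]  # controls for INDEX -> near the true 0.4

print(b_biased, b_full)
```

With this seed the regression omitting INDEX lands well above 0.4, while the regression that includes INDEX comes out close to 0.4, mirroring the .68 vs. .3636 contrast in the toy data.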
So the question becomes: how do we tease out the true effect of CAMP when this omitted variable INDEX, which we can’t possibly measure, is biasing our estimate? Techniques using what are referred to as instrumental variables help us do this.
Let’s suppose we find some variable we hadn’t thought of
called Z. Suppose that Z tends to be correlated with our variable of interest
CAMP. For the most part, where Z = 1, CAMP = 1.
But we also notice (or argue) that Z tends to be unrelated to all of
those omitted factors like innate ability and ambition that comprise the
variable INDEX that we wish we had. The technique of instrumental variables
looks at changes in a variable like Z, and relates them to changes in our
variable of interest CAMP, and then relates those changes to the outcome of
interest, retention. Since Z is unrelated to INDEX, the changes in CAMP that are related to Z are likely to be less correlated with INDEX (and hence less correlated with the error term ‘e’). A very non-technical way to think about this is that we are taking Z and going through CAMP to get to Y, bringing with us only those aspects of CAMP that are unrelated to INDEX. Z is like a filter that picks up only the variation in CAMP that we are interested in (what we may refer to as ‘quasi-experimental variation’) and filters out the noise from INDEX. Z is technically related to Y only through CAMP.
Z → CAMP → Y   (4)
If we can do this, then our estimate of the effects of CAMP on Y will be unbiased by the omitted
effects of INDEX. So how do we do this in practice?
We can do this through a series of regressions. To relate changes in Z to changes in CAMP we estimate:
CAMP = β0 + β1 Z + e   (5)
Notice that in (5), the fitted values pick up only the variation in CAMP that is related to Z and leave all of the variation in CAMP related to INDEX in the residual term. You can think of this as the filtering process. Then, to relate changes in Z to changes in our target Y, we estimate:
Y = β0 + β2 Z + e   (6)
Our instrumental variable estimator then becomes:
βIV = β2 / β1 = [(Z'Z)^-1 Z'Y] / [(Z'Z)^-1 Z'CAMP] = COV(Y,Z) / COV(CAMP,Z)   (7)
The last term in (7) indicates that βIV relates the variation in Y associated with our ‘instrument’ Z to the variation in CAMP associated with Z. Put another way, it relates the variation in CAMP that is unrelated to INDEX, the ‘quasi-experimental variation,’ to Y. However you interpret βIV, we can see that it teases out only the variation in CAMP that is unrelated to INDEX and relates it to Y, giving us an estimate of the treatment effect of CAMP that is less biased than the standard regression in (2). In fact, if we compute βIV as in (7), we get βIV = .3898. Notice this is much closer to what we think is the true estimate of β1 that we would get from regression (3) if we had information about INDEX and could include it in the model specification.
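To make the mechanics of (7) concrete, here is a small simulation, again with my own fabricated data in Python rather than the post’s toy data set. The instrument Z is generated independently of INDEX but shifts CAMP, so the covariance ratio COV(Y,Z)/COV(CAMP,Z) recovers an assumed true effect of 0.4 even though the naive OLS slope is biased upward.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

INDEX = rng.normal(size=n)                     # unobserved confounder
Z = rng.integers(0, 2, size=n).astype(float)   # instrument, independent of INDEX

# Attendance depends on both the instrument and the unobserved INDEX
CAMP = (Z + 0.5 * INDEX + rng.normal(scale=0.5, size=n) > 0.5).astype(float)

# Assumed true treatment effect of CAMP is 0.4; INDEX also raises the outcome
Y = 0.4 * CAMP + 0.5 * INDEX + rng.normal(scale=0.5, size=n)

# Naive OLS slope: cov(Y, CAMP) / var(CAMP) -- biased by INDEX
b_ols = np.cov(Y, CAMP)[0, 1] / np.var(CAMP, ddof=1)

# IV (Wald) estimator from equation (7): cov(Y, Z) / cov(CAMP, Z)
b_iv = np.cov(Y, Z)[0, 1] / np.cov(CAMP, Z)[0, 1]

print(round(b_ols, 3), round(b_iv, 3))
```

The OLS slope comes out well above the true 0.4, while the covariance ratio lands close to it, just as the toy data’s .68 vs. .3898 comparison suggests.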
Application in Higher Education
In their paper 'Using Instrumental Variables to Account for Selection Effects in Research on First-Year Programs,' Pike, Hansen, and Lin account for self-selection using instrumental variables to get an unbiased measure of the impact of first-year programs. As I discussed previously, in a normal multivariable regression specification, after including various controls, they find a positive, significant relationship between first-year programs and student success (measured by GPA). However, after including the instruments in the regression (correcting for selection bias), this relationship goes away. In the paper they state:
"If, as the results of this study suggest,
traditional evaluation methods can overstate (either positively or negatively)
the magnitude of program effects in the face of self-selection, then evaluation
research may be providing decision makers with inaccurate information. In
addition to providing an incomplete accounting for external audiences,
inaccurate information about program effectiveness can lead to the
misallocation of scarce institutional resources."
Addendum: Estimating IV via 2SLS
The example above derives βIV as discussed in a previous post. But you can also get βIV by substitution via two stage least squares (as discussed here):

CAMP = β0 + β1 Z + e   (5)
RET = β0 + βIV CAMPest + e   (8)

where RET is first-year retention (the Y above) and CAMPest is the vector of fitted values from (5).
As discussed above, the first regression captures only the variation in CAMP related to Z, leaving all of the variation in CAMP related to INDEX in the residual term. As Angrist and Pischke state, the second regression estimates βIV and retains only the quasi-experimental variation in CAMP generated by the instrument Z. As discussed in their book Mostly Harmless Econometrics, most IV estimates are derived using packages like SAS, Stata, or R rather than the explicit implementation of the methods illustrated above. Care should be taken to derive the correct standard errors, which are not the ones you will get from the intermediate results of any of the regressions depicted above.
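As a sanity check on the substitution logic, the sketch below (my own fabricated Python data, not the post’s toy data set) runs both stages with intercepts. With a single instrument and a single endogenous regressor, the 2SLS estimate equals the covariance (Wald) ratio in (7) exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

INDEX = rng.normal(size=n)
Z = rng.integers(0, 2, size=n).astype(float)
CAMP = (Z + 0.5 * INDEX + rng.normal(scale=0.5, size=n) > 0.5).astype(float)
RET = 0.4 * CAMP + 0.5 * INDEX + rng.normal(scale=0.5, size=n)

def fit(y, x):
    """OLS of y on a single regressor x with an intercept; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# First stage: CAMP = b0 + b1 Z + e  ->  fitted values CAMPest
b0, b1 = fit(CAMP, Z)
CAMP_est = b0 + b1 * Z

# Second stage: RET = b0 + bIV CAMPest + e
b_2sls = fit(RET, CAMP_est)[1]

# Covariance (Wald) ratio from (7) for comparison
b_wald = np.cov(RET, Z)[0, 1] / np.cov(CAMP, Z)[0, 1]

print(b_2sls, b_wald)
```

The two numbers agree to machine precision, which is why the two derivations in this post give the same answer on the toy data. (As noted above, neither stage’s printed standard errors are the correct ones for inference.)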
References:
Angrist and Pischke, Mostly Harmless Econometrics, 2009
Angrist, Imbens, and Rubin, 'Identification of Causal Effects Using Instrumental Variables,' Journal of the American Statistical Association, Vol. 91, No. 434 (Jun., 1996), pp. 444-455
Pike, Hansen, and Lin, 'Using Instrumental Variables to Account for Selection Effects in Research on First-Year Programs,' Research in Higher Education, Vol. 52, No. 2, pp. 194-214
R Code for the Examples Above:
# ------------------------------------------------------------------
# | PROGRAM NAME: R_INSTRUMENTAL_VAR
# | DATE: 6/17/13
# | CREATED BY: MATT BOGARD
# | PROJECT FILE: P:\BLOG\STATISTICS
# |----------------------------------------------------------------
# | PURPOSE: BASIC EXAMPLE OF IV
# |------------------------------------------------------------------

setwd('P:\\BLOG\\STATISTICS')

CAMP <- read.table("mplan.txt", header = TRUE)

summary(lm(CAMP$RET ~ CAMP$CAMP + CAMP$INDEX)) # true B1
summary(lm(CAMP$RET ~ CAMP$CAMP))              # biased OLS estimate for B1

cor(CAMP$CAMP, CAMP$Z)  # is the treatment correlated with the instrument?
cor(CAMP$INDEX, CAMP$Z) # is the instrument correlated with the omitted variable?

cov(CAMP$RET, CAMP$Z) / var(CAMP$Z)  # reduced form: Y = b2 Z
cov(CAMP$CAMP, CAMP$Z) / var(CAMP$Z) # first stage: CAMP = b1 Z

# empirical estimate of B1(IV)
(cov(CAMP$RET, CAMP$Z) / var(CAMP$Z)) / (cov(CAMP$CAMP, CAMP$Z) / var(CAMP$Z)) # B1(IV)

# or
cov(CAMP$RET, CAMP$Z) / cov(CAMP$CAMP, CAMP$Z)

# two stage regression with IV substitution
# first stage: fitted values CAMPest = b0 + b1 Z
CAMP_IV <- predict(lm(CAMP$CAMP ~ CAMP$Z)) # vector of fitted values for CAMP

# second stage: Y = b0 + B1(IV) CAMPest
lm(CAMP$RET ~ CAMP_IV)
Dear Prof. Bogard,
This post was really helpful in understanding IV estimation.
It is a nice post.
To understand it fully, I ran all the regressions as you suggested.
My coefficients for models (2) and (3) are the same as you have listed above.
In models (5) and (6), why shouldn't we include an intercept? I think since we are interested in finding the impact of Z on CAMP and RET respectively, we should not include an intercept.
So when I ran regressions (5) and (6) without intercepts, I got beta 1 = 0.545455 and beta 2 = 0.636364.
Following model (7), if we divide beta 2 / beta 1, we get 1.166.
But you found .3839.
(Note, i ran all regression in EViews)
But when I included intercept terms in models (5) and (6), my beta 1 and beta 2 were 0.263158 and 0.526316 respectively (with respective intercept terms 0.282297 and 0.526316).
Now, following model (7), if we divide beta 2 / beta 1, we get 0.3839, which is what you have reported.
But the problem here is that my R-squared values are .07 and .01, which is really a problem.
Further, when I tried to get the IV estimate from 2SLS (model 8), I got exactly 1.166, which is what I found when I extracted the IV estimate from models (5) and (6).
And my R-squared looks terrible: -0.703003.
I know my question is bit lengthy.
So kindly help.
Thanking you in advance.
-------
skp
I think all of your results are correct. The problem is that my notation in the post was a little sloppy: I forgot to include intercepts in the notation, but should have. All of my regressions in R include intercepts, and you should get the same results in EViews by including an intercept. As you suggest, we are not interested in interpreting the intercept, but it is necessary to get the correct estimates of the treatment effects we are after. I apologize for the confusion and will correct the post as soon as time permits. Keep in mind, the data is synthesized to give a feel for IV estimates, so don't pay much attention to R-squared or p-values in this case; I'm only trying to show the mechanics of IV. Thanks for reading my blog and pointing out the issue with my notation!
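To see numerically why the intercepts matter, here is a quick illustration (my own fabricated data in Python, not the data from the post or from EViews): when the outcome has a nonzero baseline, the ratio of no-intercept slopes of Y on Z and CAMP on Z misses an assumed true effect of 0.4, while the with-intercept slopes reduce to the covariance ratio in (7) and recover it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

INDEX = rng.normal(size=n)
Z = rng.integers(0, 2, size=n).astype(float)
CAMP = (Z + 0.5 * INDEX + rng.normal(scale=0.5, size=n) > 0.5).astype(float)

# Baseline outcome level of 0.3 plus an assumed true CAMP effect of 0.4
Y = 0.3 + 0.4 * CAMP + 0.5 * INDEX + rng.normal(scale=0.5, size=n)

# No-intercept regression slope of y on z is sum(z*y) / sum(z*z)
ratio_no_int = (Z @ Y / (Z @ Z)) / (Z @ CAMP / (Z @ Z))

# With intercepts, the slope ratio reduces to cov(Y,Z) / cov(CAMP,Z)
ratio_int = np.cov(Y, Z)[0, 1] / np.cov(CAMP, Z)[0, 1]

print(round(ratio_no_int, 3), round(ratio_int, 3))
```

The no-intercept ratio is pulled away from 0.4 by the baseline level, while the covariance-based (with-intercept) ratio stays close to it, which matches the 1.166 vs. 0.3839 pattern reported in the comment above.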