I have previously discussed instrumental variables (here and here) from a somewhat technical standpoint, but now I’d like to present a very basic example with a toy data set that demonstrates how IV estimation works in practice. The data set below is fabricated for demonstration purposes. The idea is to develop intuition about the mechanics of IV estimators, so we won’t concern ourselves with getting the appropriate standard errors in this example.

Suppose an institution has a summer camp designed to prepare high school students for their first year of college, and we want to assess the impact of the camp on first-year retention. The most basic regression model to assess the impact of ‘CAMP’ might be specified as follows:

Y = β_0 + β_1 CAMP + β_2 X + e   (1)
where Y = first-year retention (see here and here for a thorough and apologetic discussion of linear probability models)

CAMP = an indicator for camp attendance

X = a vector of controls

Let’s simplify the discussion and exclude controls from the analysis for now. That leaves us with:

Y = β_0 + β_1 CAMP + e   (2)
The causal effect of interest, or the treatment effect of CAMP, is our regression estimate β_1 in the regression above. If we use the data above, we get an estimate of the treatment effect β_1 = 0.68 (i.e., CAMP attendance is associated with a 68 percentage point higher level of retention compared to students who don’t attend). But what if CAMP attendance is voluntary? If attendance is voluntary, then it could be that students who choose to attend also have a high propensity to succeed due to unmeasured factors (social capital, innate ability, ambition, etc.). If this is the case, our observed estimate of β_1 could overstate the actual impact of CAMP on retention. If we knew about a variable that captures the omitted factors that may be related to both the choice of attending CAMP and having a greater likelihood of retaining (like social capital, innate ability, ambition, etc.), let’s just call it INDEX, we would include it and estimate the following:
Y = β_0 + β_1 CAMP + β_2 INDEX + e   (3)

We would get the estimate β_1 = 0.3636, which would be closer to the true effect of CAMP. So, omitted variable bias in equation (2) is causing us to overestimate the effect of CAMP. One way to characterize the selection bias problem is through the potential outcomes framework that I have discussed before, but this time let’s characterize this problem in terms of the regression specification above. By omitting INDEX, information about INDEX gets sucked up into the error term. When this happens, to the extent that INDEX is correlated with CAMP, CAMP becomes correlated with the error term ‘e’. This correlation with the error term is a violation of the classical regression assumptions and leads to biased estimates of β_1, which we notice in the higher value (0.68) that we get above when we omit INDEX. (For more technical terminology than *‘getting sucked up into the error term’* see my discussion about unobserved heterogeneity and endogeneity.)
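The mechanics of this omitted variable bias are easy to reproduce with simulated data. The sketch below is in Python rather than R (the R code for the actual toy data set appears at the end of the post), and the data-generating values (a true CAMP effect of 0.36, an INDEX effect of 0.3) are made up for illustration: an unmeasured INDEX drives both CAMP attendance and the outcome, so the short regression overstates the effect of CAMP.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Unmeasured INDEX (ability, ambition, etc.) drives both selection and outcome
index = rng.normal(size=n)
camp = (index + rng.normal(size=n) > 0).astype(float)  # self-selection into CAMP
y = 0.2 + 0.36 * camp + 0.3 * index + 0.1 * rng.normal(size=n)

def ols(y, *xs):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones_like(y)] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(y, camp)         # INDEX omitted: CAMP coefficient absorbs the selection
b_long = ols(y, camp, index)   # INDEX included: CAMP coefficient near the true 0.36
print(round(b_short[1], 2), round(b_long[1], 2))
```

With INDEX included, the CAMP coefficient lands near the true 0.36; with INDEX omitted, it is inflated well above that, which is exactly the pattern in equations (2) and (3).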
So the question becomes: how do we tease out the true effect of CAMP when this omitted variable INDEX, which we can’t possibly measure, is biasing our estimate? Techniques using what are referred to as instrumental variables will help us do this.

Let’s suppose we find some variable we hadn’t thought of
called Z. Suppose that Z tends to be correlated with our variable of interest
CAMP. For the most part, where Z = 1, CAMP = 1.
But we also notice (or argue) that Z tends to be unrelated to all of
those omitted factors like innate ability and ambition that comprise the
variable INDEX that we wish we had. The technique of instrumental variables
looks at changes in a variable like Z, and relates them to changes in our
variable of interest CAMP, and then relates those changes to the outcome of
interest, retention. Since Z is
unrelated to INDEX, then those changes in CAMP that are related to Z are likely
to be less correlated with INDEX (and hence less correlated with the error term
‘e’). A very non-technical way to think about this is that we are taking Z and
going through CAMP to get to Y, and bringing with us only those aspects of CAMP
that are unrelated to INDEX. Z is like a filter that picks up only the variation in CAMP (what we may refer to as ‘quasi-experimental variation’) that we are interested in and filters out the noise from INDEX. Z is technically related to Y only through CAMP.

Z → CAMP → Y   (4)
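Before relying on Z, it is worth checking both claims in this chain: Z should be correlated with CAMP (relevance), and unrelated to the omitted factors (the exclusion argument, which in real applications can only be argued, not tested, because INDEX is unobserved). In the simulated sketch below (Python, with made-up data-generating values) we can check both directly, since we constructed INDEX ourselves:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

index = rng.normal(size=n)                      # unmeasured confounder
z = rng.binomial(1, 0.5, size=n).astype(float)  # candidate instrument
camp = (0.8 * z + index + rng.normal(size=n) > 0.4).astype(float)

rel = np.corrcoef(camp, z)[0, 1]    # relevance: Z moves CAMP, noticeably nonzero
excl = np.corrcoef(index, z)[0, 1]  # Z unrelated to INDEX, approximately zero
print(round(rel, 3), round(excl, 3))
```

This mirrors the two `cor()` checks in the R code at the end of the post.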

If we can do this, then our estimate of the effects of CAMP on Y will be unbiased by the omitted
effects of INDEX. So how do we do this in practice?

We can do this through a series of regressions. To relate changes in Z to changes in CAMP, we estimate:

CAMP = β_0 + β_1 Z + e   (5)
Notice in (5), the fitted values pick up only the variation in CAMP that is related to Z, leaving all of the variation in CAMP related to INDEX in the residual term. You can think of this as the filtering process. Then, to relate changes in Z to changes in our target Y, we estimate:
Y = β_0 + β_2 Z + e   (6)
Our instrumental variable estimator then becomes:

β_IV = β_2 / β_1, or (Z'Z)^{-1}Z'Y / (Z'Z)^{-1}Z'CAMP, or COV(Y,Z) / COV(CAMP,Z)   (7)
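Equation (7) can be verified on simulated data. The sketch below is in Python (the R code for the actual toy data set appears at the end of the post), with made-up data-generating values: a true CAMP effect of 0.36 and an INDEX effect of 0.3. It computes the naive OLS slope and the ratio COV(Y,Z)/COV(CAMP,Z):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

index = rng.normal(size=n)                      # unmeasured confounder
z = rng.binomial(1, 0.5, size=n).astype(float)  # instrument, unrelated to INDEX
camp = (0.8 * z + index + rng.normal(size=n) > 0.4).astype(float)
y = 0.2 + 0.36 * camp + 0.3 * index + 0.1 * rng.normal(size=n)

# Naive OLS slope: biased upward by the omitted INDEX
b_ols = np.cov(camp, y)[0, 1] / np.var(camp, ddof=1)

# IV (Wald) estimator from equation (7): COV(Y,Z) / COV(CAMP,Z)
b_iv = np.cov(y, z)[0, 1] / np.cov(camp, z)[0, 1]

print(round(b_ols, 2), round(b_iv, 2))
```

The ratio recovers an estimate near the true 0.36 even though INDEX never enters the calculation, while the naive slope stays well above it.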
The last term in (7) indicates that β_IV represents the proportion of total variation in CAMP that is related to our *‘instrument’* Z that is also related to Y. Or, the total proportion of variation in CAMP unrelated to INDEX that is related to Y. Or, the total proportion of *‘quasi-experimental variation’* in CAMP related to Y. Regardless of how you want to interpret β_IV, we can see that it teases out only the variation in CAMP that is unrelated to INDEX and relates it to Y, giving us an estimate of the treatment effect of CAMP that is less biased than the standard regression in (2). In fact, if we compute β_IV as in (7), we get β_IV = 0.3898. Notice this is much closer to what we think is the true estimate of β_1 that we would get from regression (3) if we had information about INDEX and could include it in the model specification.

**Application in Higher Education**

In their paper **'Using Instrumental Variables to Account for Selection Effects in Research on First Year Programs'**, Pike, Hansen, and Lin account for self-selection using instrumental variables to get an unbiased measure of the impact of first-year programs. As I discussed previously, in a normal multivariable regression specification, after including various controls, they find a positive, significant relationship between first-year programs and student success (measured by GPA). However, by including the instruments in the regression (correcting for selection bias), this relationship goes away. In the paper they state:

*"If, as the results of this study suggest, traditional evaluation methods can overstate (either positively or negatively) the magnitude of program effects in the face of self-selection, then evaluation research may be providing decision makers with inaccurate information. In addition to providing an incomplete accounting for external audiences, inaccurate information about program effectiveness can lead to the misallocation of scarce institutional resources."*

**Addendum: Estimating IV via 2SLS**

The example above derives β_IV as discussed in a previous post. But you can also get β_IV by substitution via two stage least squares (as discussed here):

CAMP = β_0 + β_1 Z + e   (5)

RET = β_0 + β_IV CAMP^est + e   (8)
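The two stages in (5) and (8) can be sketched on the same kind of simulated data (Python here; the R version using `predict()` appears at the end of the post, and the data-generating values, including the true effect of 0.36, are made up for illustration). The first regression produces fitted values of CAMP from Z, and the second regresses retention on those fitted values:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

index = rng.normal(size=n)                      # unmeasured confounder
z = rng.binomial(1, 0.5, size=n).astype(float)  # instrument
camp = (0.8 * z + index + rng.normal(size=n) > 0.4).astype(float)
ret = 0.2 + 0.36 * camp + 0.3 * index + 0.1 * rng.normal(size=n)

def fit(y, x):
    """Intercept and slope from a simple OLS regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# First stage (5): CAMP on Z, keep the fitted values CAMP^est
a0, a1 = fit(camp, z)
camp_hat = a0 + a1 * z

# Second stage (8): RET on fitted CAMP; the slope is the 2SLS estimate
b0, b_iv = fit(ret, camp_hat)
print(round(b_iv, 2))
```

With a single binary instrument, this 2SLS slope is numerically identical to the ratio in (7); note again that the second-stage standard errors printed by a regression routine are not the correct 2SLS standard errors.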
As discussed above, the first regression gets only the variation in CAMP related to Z, and leaves all of the variation in CAMP related to INDEX in the residual term. As Angrist and Pischke state, the second regression estimates β_IV and retains only the quasi-experimental variation in CAMP generated by the instrument Z. As discussed in their book Mostly Harmless Econometrics, most IV estimates are derived using packages like SAS, Stata, or R rather than the explicit implementation of the methods illustrated above. Caution should be used to derive the correct standard errors, which are not the ones you will get in the intermediate results from any of the regressions depicted above.

**References:**

Angrist, J. D., and J.-S. Pischke. Mostly Harmless Econometrics. Princeton University Press, 2009.

Angrist, J. D., G. W. Imbens, and D. B. Rubin. "Identification of Causal Effects Using Instrumental Variables." Journal of the American Statistical Association, Vol. 91, No. 434 (Jun., 1996), pp. 444-455.

Pike, G. R., M. J. Hansen, and C.-H. Lin. "Using Instrumental Variables to Account for Selection Effects in Research on First-Year Programs." Research in Higher Education, Vol. 52, No. 2, pp. 194-214.


**R Code for the Examples Above:**

# ------------------------------------------------------------------
# | PROGRAM NAME: R_INSTRUMENTAL_VAR
# | DATE: 6/17/13
# | CREATED BY: MATT BOGARD
# | PROJECT FILE: P:\BLOG\STATISTICS
# |------------------------------------------------------------------
# | PURPOSE: BASIC EXAMPLE OF IV
# |------------------------------------------------------------------

setwd('P:\\BLOG\\STATISTICS')

CAMP <- read.table("mplan.txt", header = TRUE)

summary(lm(CAMP$RET ~ CAMP$CAMP + CAMP$INDEX)) # true B1
summary(lm(CAMP$RET ~ CAMP$CAMP))              # biased OLS estimate for B1

cor(CAMP$CAMP, CAMP$Z)  # is the treatment correlated with the instrument?
cor(CAMP$INDEX, CAMP$Z) # is the instrument correlated with the omitted variable?

cov(CAMP$RET, CAMP$Z) / var(CAMP$Z)  # y = bZ
cov(CAMP$CAMP, CAMP$Z) / var(CAMP$Z) # x = bZ

# empirical estimate of B1(IV)
(cov(CAMP$RET, CAMP$Z) / var(CAMP$Z)) / (cov(CAMP$CAMP, CAMP$Z) / var(CAMP$Z)) # B1(IV)
# or
cov(CAMP$RET, CAMP$Z) / cov(CAMP$CAMP, CAMP$Z)

# two stage regression with IV substitution
# x^ = b1Z
CAMP_IV <- predict(lm(CAMP$CAMP ~ CAMP$Z)) # produce a vector of estimates for x
# y = b1 x^
lm(CAMP$RET ~ CAMP_IV)

Dear Prof. Bogard,

This post was really helpful in understanding the IV estimation.

It is a nice post.

To understand it fully, I ran all the regressions as you have suggested.

My coefficients for models (2) and (3) are the same as you have listed above.

In models (5) and (6), why should we not include an intercept? I think since we are interested in finding the impact of Z on CAMP and RET respectively, we should not include an intercept.

So when I ran regressions (5) and (6), I got beta 1 = 0.545455 and beta 2 = 0.636364.

Following model (7) if we divide beta 2/beta 1, we get 1.166

But you found out .3839

(Note: I ran all regressions in EViews.)

But when I included intercept terms in models (5) and (6), my beta 1 and beta 2 were 0.263158 and 0.526316 respectively (with respective intercept terms 0.282297 and 0.526316).

Now, following model (7), if we divide beta 2 / beta 1, we get 0.3839, which is what you have reported.

But here my R-squared values are .07 and .01, which is really a problem.

Further, when I tried to get IV estimates from 2SLS (model 8), I got an IV estimate of exactly 1.166, which is what I found when I tried to extract the IV estimate from models (5) and (6).

And my R-squared looks terrible: -0.703003.

I know my question is bit lengthy.

So kindly help.

Thanking you in advance.

-------

skp

I think all of your results are correct. The problem is that my notation in the post was a little sloppy: I forgot to include intercepts in the notation, but should have. All of my regressions in R include intercepts, and you should get the same results in EViews by including an intercept. As you suggest, we are not interested in interpreting the intercept, but it is necessary to get the correct estimates for the treatment effects we are after. I apologize for the confusion, and will correct the post as soon as time permits. Keep in mind, the data is synthesized to give a feel for IV estimates, so don't pay so much attention to R-squared or p-values in this case; I'm only trying to show the mechanics of IVs. Thanks for reading my blog and pointing out the issue with my notation!
