Wednesday, June 19, 2013

A Toy Instrumental Variable Application

I have previously discussed instrumental variables (here and here) from a somewhat technical standpoint, but now I’d like to present a very basic example with a toy data set that demonstrates how IV estimation works in practice. The data set (read in by the R code at the end of the post) is fabricated for demonstration purposes. The idea is to develop intuition about the mechanics of IV estimators, so we won’t concern ourselves with getting the appropriate standard errors in this example.

Suppose an institution has a summer camp designed to prepare high school students for their first year of college, and we want to assess the impact of the camp on first-year retention. The most basic regression model to assess the impact of ‘CAMP’ might be specified as follows:

Y =β0 + β1 CAMP + β2X + e            (1)

where Y = first-year retention (see here and here for a thorough and apologetic discussion of linear probability models)

CAMP = an indicator for camp attendance 

X = a vector of controls

Let’s simplify the discussion and exclude controls from the analysis for now. That leaves us with:

Y =β0 + β1 CAMP + e        (2)

The causal effect of interest, or the treatment effect of CAMP, is the regression estimate β1 in the regression above. Using the toy data, we get an estimate of the treatment effect β1 = .68 (i.e., CAMP attendance is associated with a retention rate 68 percentage points higher than that of students who don’t attend). But what if CAMP attendance is voluntary? If attendance is voluntary, then it could be that students who choose to attend also have a high propensity to succeed due to unmeasured factors (social capital, innate ability, ambition, etc.). If this is the case, our observed estimate for β1 could overstate the actual impact of CAMP on retention. If we knew about a variable that captures the omitted factors that may be related to both the choice to attend CAMP and a greater likelihood of retaining (social capital, innate ability, ambition, etc.), let’s call it INDEX, we would include it and estimate the following:
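Since the toy data set itself isn’t shown in the post, here is a quick R sketch that simulates data in the same spirit (the variable construction and numbers are my own assumptions, not the post’s mplan.txt, and retention is treated as continuous for simplicity). It reproduces the qualitative result that omitting INDEX biases the estimate of β1 upward:

```r
# A minimal sketch with simulated data (an assumption -- the post's mplan.txt
# data set is not shown; only the qualitative result matters here: omitting
# INDEX biases the CAMP coefficient upward).
set.seed(123)
n <- 10000
INDEX <- rnorm(n)                          # unmeasured propensity to succeed
Z     <- rbinom(n, 1, 0.5)                 # instrument, independent of INDEX
CAMP  <- as.numeric(0.8*Z + 0.5*INDEX + rnorm(n) > 0.5)  # self-selection into camp
RET   <- 0.3*CAMP + 0.4*INDEX + rnorm(n, sd = 0.3)       # true effect = 0.3

b_naive <- coef(lm(RET ~ CAMP))["CAMP"]          # biased upward by omitted INDEX
b_full  <- coef(lm(RET ~ CAMP + INDEX))["CAMP"]  # close to the true 0.3
```

Here the naive estimate lands well above the true effect, while controlling for INDEX recovers it, mirroring the .68 vs. .3636 contrast in the post.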

Y =β0 + β1 CAMP +  β2 INDEX + e                                (3)

We would get the estimate β1 = 0.3636, which would be closer to the true effect of CAMP. So omitted variable bias in equation (2) is causing us to overestimate the effect of CAMP. One way to characterize the selection bias problem is through the potential outcomes framework that I have discussed before, but this time let’s characterize it in terms of the regression specification above. By omitting INDEX, information about INDEX gets sucked up into the error term. When this happens, to the extent that INDEX is correlated with CAMP, CAMP becomes correlated with the error term ‘e.’ This correlation with the error term is a violation of the classical regression assumptions and leads to biased estimates of β1, which we notice in the higher value (.68) that we get above when we omit INDEX. (For more technical terminology than ‘getting sucked up into the error term,’ see my discussion of unobserved heterogeneity and endogeneity.)

So the question becomes: how do we tease out the true effect of CAMP when the omitted variable INDEX, which we can’t possibly measure, is biasing our estimate? Techniques using what are referred to as instrumental variables will help us do this.

Let’s suppose we find some variable we hadn’t thought of called Z. Suppose that Z tends to be correlated with our variable of interest CAMP: for the most part, where Z = 1, CAMP = 1. But we also notice (or argue) that Z tends to be unrelated to all of those omitted factors, like innate ability and ambition, that comprise the variable INDEX we wish we had. The technique of instrumental variables looks at changes in a variable like Z, relates them to changes in our variable of interest CAMP, and then relates those changes to the outcome of interest, retention. Since Z is unrelated to INDEX, the changes in CAMP that are related to Z are likely to be less correlated with INDEX (and hence less correlated with the error term ‘e’). A very non-technical way to think about this is that we are taking Z and going through CAMP to get to Y, bringing with us only those aspects of CAMP that are unrelated to INDEX. Z is like a filter that picks up only the variation in CAMP that we are interested in (what we may refer to as ‘quasi-experimental’ variation) and filters out the noise from INDEX. Z is technically related to Y only through CAMP.

Z →CAMP→Y     (4)

If we can do this, then our estimate of the effect of CAMP on Y will not be biased by the omitted effects of INDEX. So how do we do this in practice?
We can do this through a series of regressions. To relate changes in Z to changes in CAMP we estimate:

CAMP = β0 + β1 Z + e                (5)

Notice in (5), β1 picks up only the variation in CAMP that is related to Z and leaves all of the variation in CAMP related to INDEX in the residual term. You can think of this as the filtering process. Then, to relate changes in Z to changes in our target Y we estimate:

Y = β0 + β2 Z + e          (6)

Our instrumental variable estimator then becomes:

βIV = β2 / β1  =  [(Z’Z)^-1 Z’Y] / [(Z’Z)^-1 Z’CAMP]  =  COV(Y,Z)/COV(CAMP,Z)   (7)

The last term in (7) indicates that βIV represents the proportion of the variation in CAMP related to our ‘instrument’ Z that is also related to Y; or, the proportion of variation in CAMP unrelated to INDEX that is related to Y; or, the proportion of ‘quasi-experimental’ variation in CAMP related to Y. Regardless of how you want to interpret βIV, we can see that it teases out only the variation in CAMP that is unrelated to INDEX and relates it to Y, giving us an estimate of the treatment effect of CAMP that is less biased than a standard regression like (2). In fact, if we compute βIV as in (7), we get βIV = .3898. Notice this is much closer to what we think is the true estimate of β1, the one we would get from regression (3) if we had information about INDEX and could include it in the model specification.
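To make (7) concrete, here is an R sketch (again with simulated data as a stand-in for mplan.txt, so the data-generating process is my own assumption) verifying that the ratio of the reduced-form slope to the first-stage slope equals the covariance ratio, and that it recovers the treatment effect even though INDEX is omitted:

```r
# Numerical check of (7): the slope ratio b2/b1 equals the covariance ratio,
# and the IV estimate recovers the true effect despite the omitted INDEX.
# (Simulated data -- an assumption, since the post's mplan.txt is not shown.)
set.seed(123)
n <- 10000
INDEX <- rnorm(n)                                        # unmeasured confounder
Z     <- rbinom(n, 1, 0.5)                               # instrument
CAMP  <- as.numeric(0.8*Z + 0.5*INDEX + rnorm(n) > 0.5)  # self-selected treatment
RET   <- 0.3*CAMP + 0.4*INDEX + rnorm(n, sd = 0.3)       # true effect = 0.3

b1   <- coef(lm(CAMP ~ Z))["Z"]   # first stage slope, eq. (5)
b2   <- coef(lm(RET ~ Z))["Z"]    # reduced form slope, eq. (6)
b_iv <- b2 / b1                   # ratio estimator, eq. (7)
# algebraically identical to the covariance form in (7):
b_iv_cov <- cov(RET, Z) / cov(CAMP, Z)
```

The two expressions agree to machine precision, and both land near the true effect rather than the inflated naive estimate.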

Application in Higher Education

In their paper ‘Using Instrumental Variables to Account for Selection Effects in Research on First-Year Programs,’ Pike, Hansen, and Lin account for self-selection using instrumental variables to get an unbiased measure of the impact of first-year programs. As I discussed previously, in a normal multivariable regression specification, after including various controls, they find a positive, significant relationship between first-year programs and student success (measured by GPA). However, after including the instruments in the regression (correcting for selection bias), this relationship goes away. In the paper they state:

"If, as the results of this study suggest, traditional evaluation methods can overstate (either positively or negatively) the magnitude of program effects in the face of self-selection, then evaluation research may be providing decision makers with inaccurate information. In addition to providing an incomplete accounting for external audiences, inaccurate information about program effectiveness can lead to the misallocation of scarce institutional resources."

Addendum: Estimating IV via 2SLS
The example above derives βIV as discussed in a previous post. But you can also get βIV by substitution via two-stage least squares (as discussed here):

CAMP = β0 + β1 Z + e                (5)
RET =  β0 + βIV CAMPest + e    (8)

As discussed above, the first regression gets only the variation in CAMP related to Z and leaves all of the variation in CAMP related to INDEX in the residual term. As Angrist and Pischke state, the second regression estimates βIV and retains only the quasi-experimental variation in CAMP generated by the instrument Z. As discussed in their book Mostly Harmless Econometrics, most IV estimates are derived using packages like SAS, Stata, or R vs. explicit implementation of the methods illustrated above. Caution should be used to derive the correct standard errors, which are not the ones you will get in the intermediate results from any of the regressions depicted above.
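The substitution step in (5) and (8) can be sketched in a few lines of R (simulated data again stands in for mplan.txt, an assumption of this sketch). With one instrument and one regressor, 2SLS by substitution reproduces the ratio estimator in (7) exactly:

```r
# 2SLS by substitution, eqs. (5) and (8): regress RET on the fitted values of
# CAMP from the first stage. In the just-identified single-regressor case this
# equals the covariance-ratio estimator in (7) exactly.
# (Simulated data -- an assumption, since the post's mplan.txt is not shown.)
set.seed(123)
n <- 10000
INDEX <- rnorm(n)                                        # unmeasured confounder
Z     <- rbinom(n, 1, 0.5)                               # instrument
CAMP  <- as.numeric(0.8*Z + 0.5*INDEX + rnorm(n) > 0.5)  # self-selected treatment
RET   <- 0.3*CAMP + 0.4*INDEX + rnorm(n, sd = 0.3)       # true effect = 0.3

CAMP_hat <- fitted(lm(CAMP ~ Z))                  # first stage, eq. (5)
b_2sls   <- coef(lm(RET ~ CAMP_hat))["CAMP_hat"]  # second stage, eq. (8)
b_wald   <- cov(RET, Z) / cov(CAMP, Z)            # ratio estimator, eq. (7)
```

In practice you would use a packaged routine such as ivreg() from the AER package (ivreg(RET ~ CAMP | Z)), which also reports the correct 2SLS standard errors; the standard errors from the manual second-stage lm() above are not valid, as noted in the text.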


Angrist, J. D. and J.-S. Pischke. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press, 2009.

Angrist, J. D., G. W. Imbens, and D. B. Rubin. “Identification of Causal Effects Using Instrumental Variables.” Journal of the American Statistical Association, Vol. 91, No. 434 (Jun. 1996), pp. 444-455.

Pike, G. R., M. J. Hansen, and C.-H. Lin. “Using Instrumental Variables to Account for Selection Effects in Research on First-Year Programs.” Research in Higher Education, Vol. 52, No. 2, pp. 194-214.

R Code for the Examples Above:

#  DATE: 6/17/13
CAMP <- read.table("mplan.txt", header = TRUE)
summary(lm(CAMP$RET ~ CAMP$CAMP + CAMP$INDEX)) # 'true' B1, controlling for INDEX
summary(lm(CAMP$RET ~ CAMP$CAMP))              # biased OLS estimate for B1
cor(CAMP$CAMP, CAMP$Z)  # is the treatment correlated with the instrument?
cor(CAMP$INDEX, CAMP$Z) # is the instrument correlated with the omitted variable?
cov(CAMP$RET, CAMP$Z)/var(CAMP$Z)  # reduced form slope b2: Y = b0 + b2*Z
cov(CAMP$CAMP, CAMP$Z)/var(CAMP$Z) # first stage slope b1: CAMP = b0 + b1*Z
# empirical estimate of B1(IV) = b2/b1
(cov(CAMP$RET, CAMP$Z)/var(CAMP$Z))/(cov(CAMP$CAMP, CAMP$Z)/var(CAMP$Z))
# or: two stage regression with IV substitution
CAMP_IV <- predict(lm(CAMP$CAMP ~ CAMP$Z)) # first stage fitted values of CAMP
summary(lm(CAMP$RET ~ CAMP_IV))            # second stage: coefficient is B1(IV)


  1. Dear Prof. Bogard,

    This post was really helpful in understanding the IV estimation.
    It is a nice post.

    To understand it fully, I ran all the regressions as you suggested.

    My coefficients for models (2) and (3) are the same as you have listed above.

    In models (5) and (6), why shouldn't we include an intercept? I think since we are interested in finding the impact of Z on CAMP and RET respectively, we should not include an intercept.

    So when I ran regressions (5) and (6) without intercepts, I got beta 1 = 0.545455 and beta 2 = 0.636364.

    Following model (7), if we divide beta 2 by beta 1, we get 1.166.

    But you found out .3839

    (Note: I ran all regressions in EViews.)

    But when I included intercept terms in models (5) and (6), my beta 1 and beta 2 were 0.263158 and 0.526316 respectively (and the respective intercept terms were 0.282297 and 0.526316).

    Now, following model (7), if we divide beta 2 by beta 1, we get 0.3839, which is what you have reported.

    But here the problem is that my R-squared values are .07 and .01, which is really a problem.

    Further, when I tried to get the IV estimate from 2SLS (model 8), I got exactly 1.166, which is what I found when I extracted the IV estimate from models (5) and (6).

    And my R-squared looks terrible: -0.703003.

    I know my question is a bit lengthy, so kindly help.

    Thanking you in advance.

  2. I think all of your results are correct. The problem is that my notation in the post was a little sloppy. I forgot to include intercepts in the notation, but should have. All of my regressions in R include intercepts, and you should get the same results in EViews by including an intercept. As you suggest, we are not interested in interpreting the intercept, but it is necessary to get the correct estimates for the treatment effects we are after. I apologize for the confusion, and will correct the post as soon as time permits. Keep in mind, the data is synthesized to get a feel for IV estimates, so don't pay so much attention to R-square or p-values in this case. I'm only trying to show the mechanics of IVs. Thanks for reading my blog and pointing out the issue with my notation!