Tuesday, February 13, 2018

Intuition for Random Effects

Previously I wrote a post, based on course notes from J. Blumenstock, that attempted to provide some intuition for how fixed effects estimators can account for unobserved heterogeneity (individual specific effects).

Recently someone asked if I could provide a similarly motivating and intuitive example regarding random effects. Although I was not able to come up with a new example, I can definitely discuss random effects in the same context of the previous example. But first a little (less intuitive) background.


To recap, the purpose of both fixed and random effects estimators is to model treatment effects in the face of unobserved individual specific effects.

y_{it} = b x_{it} + α_i + u_{it}   (1)

In the model above the individual specific effect is represented by α_i. In terms of estimation, the difference between fixed and random effects depends on how we choose to model this term. With fixed effects it can be captured through dummy variable estimation (which creates different intercepts or shifts capturing the individual specific effects) or by transforming the data, subtracting group means from the individual observations within each group. In random effects models, individual specific effects are captured by a composite error term (α_i + u_{it}), which assumes that the individual intercepts are drawn from a random distribution of possible intercepts. The random component α_i of the error term captures the individual specific effects in a different way from fixed effects models.

As noted in another post, Fixed, Mixed, and Random Effects, the random effects model is estimated using Generalized Least Squares (GLS) :

β_GLS = (X'Ω^{-1}X)^{-1}(X'Ω^{-1}Y), where Ω = I ⊗ Σ    (2)

where Σ is the variance (covariance matrix) of the composite error α_i + u_{it}. If Σ is unknown, it is estimated, producing a feasible generalized least squares estimate β_FGLS.

Intuition for Random Effects

In my post Intuition for Fixed Effects I noted: 

"Essentially using a dummy variable in a regression for each city (or group, or type to generalize beyond this example) holds constant or 'fixes' the effects across cities that we can't directly measure or observe. Controlling for these differences removes the 'cross-sectional' variation related to unobserved heterogeneity (like tastes, preferences, other unobserved individual specific effects). The remaining variation, or 'within' variation can then be used to 'identify' the causal relationships we are interested in."

Let's look at the toy data I used in that example.

The crude ellipses in the plots above (motivated by the example given in Kennedy, 2008) indicate the data for each city and the 'within' variation exploited by fixed effects models (which allowed us to identify the correct price/quantity relationships expected in the previous post). The differences between the ellipses represent 'between' variation. As Kennedy discusses, random effects models differ from fixed effects models in that they are able to exploit both 'within' and 'between' variation, producing an estimate that is a weighted average of both kinds of variation (via Σ in equation 2 above). OLS, on the other hand, exploits both kinds of variation as an unweighted average.
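The weighting idea can be checked numerically. The sketch below is only an illustration under assumed values: the three 'cities', their effects, and the true slope of -1 are made up for this example and are not the toy data from the earlier post. It computes the 'within' slope, the 'between' slope, and the pooled OLS slope, and confirms that pooled OLS combines the two using only the raw shares of within and between variation in x as weights.

```r
# Hypothetical toy panel: 3 cities, 20 observations each (all values are assumptions)
set.seed(1)
city  <- rep(1:3, each = 20)
alpha <- c(10, 20, 30)[city]            # unobserved city-specific effect
x     <- alpha / 5 + rnorm(60)          # x is correlated with alpha
y     <- -1 * x + alpha + rnorm(60)     # true within-city slope is -1

x_bar <- ave(x, city)                   # city means, repeated for each observation
y_bar <- ave(y, city)

# 'within' variation: deviations from city means (the fixed effects transformation)
s_xx_w   <- sum((x - x_bar)^2)
b_within <- sum((x - x_bar) * (y - y_bar)) / s_xx_w

# 'between' variation: deviations of city means from the overall mean
s_xx_b    <- sum((x_bar - mean(x))^2)
b_between <- sum((x_bar - mean(x)) * (y_bar - mean(y))) / s_xx_b

# pooled OLS ignores the city structure entirely
b_pooled <- unname(coef(lm(y ~ x))["x"])

# pooled OLS = average of the two slopes, weighted by raw shares of variation in x
b_check <- (s_xx_w * b_within + s_xx_b * b_between) / (s_xx_w + s_xx_b)

print(c(within = b_within, between = b_between, pooled = b_pooled, check = b_check))
```

Because x is correlated with the city effects in this made-up data, the between slope (and therefore the pooled slope) is pulled away from the within slope, which is the situation where fixed effects is the safer choice.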

More Details 

As Kennedy discusses, both FE and RE can be viewed as running OLS on different transformations of the data.

For fixed effects: "this transformation consists of subtracting from each observation the average of the values within its ellipse"

For random effects: "the EGLS (or FGLS above) calculation is done by finding a transformation of the data that creates a spherical variance-covariance matrix and then performing OLS on the transformed data."
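For a balanced panel, one common way to write the random effects transformation is as 'quasi-demeaning': subtract a fraction θ of the group mean from each observation, where θ = 1 - sqrt(σ_u^2 / (σ_u^2 + T σ_α^2)). The sketch below uses made-up data and treats the variance components as known rather than estimated, so it illustrates the transformation and is not a full FGLS routine. The function name quasi_demean_slope is hypothetical. Setting θ = 0 reproduces pooled OLS and θ = 1 reproduces the fixed effects (within) estimator.

```r
# Hypothetical balanced panel: 50 groups, 5 periods (all values are assumptions)
set.seed(2)
n_groups <- 50
t_obs    <- 5
id       <- rep(seq_len(n_groups), each = t_obs)
sigma_alpha <- 2                         # sd of the individual effects (assumed known)
sigma_u     <- 1                         # sd of the idiosyncratic error (assumed known)
alpha <- rnorm(n_groups, sd = sigma_alpha)[id]
x <- rnorm(n_groups * t_obs)             # uncorrelated with alpha (the RE assumption)
y <- 1 + 0.5 * x + alpha + rnorm(n_groups * t_obs, sd = sigma_u)

# OLS on quasi-demeaned data: subtract theta times the group mean
quasi_demean_slope <- function(theta) {
  y_t <- y - theta * ave(y, id)
  x_t <- x - theta * ave(x, id)
  unname(coef(lm(y_t ~ x_t))["x_t"])
}

theta_re <- 1 - sqrt(sigma_u^2 / (sigma_u^2 + t_obs * sigma_alpha^2))

b_pooled <- quasi_demean_slope(0)         # theta = 0: pooled OLS
b_re     <- quasi_demean_slope(theta_re)  # 0 < theta < 1: random effects
b_fe     <- quasi_demean_slope(1)         # theta = 1: fixed effects (within)

print(c(pooled = b_pooled, random = b_re, fixed = b_fe, theta = theta_re))
```

In practice the variance components are estimated from the data, which is what makes the procedure 'feasible' GLS.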

As Kennedy notes, the increased information used by RE makes it a more efficient estimator, but correlation between 'x' and the error term creates bias, i.e. RE assumes that α_i is uncorrelated with (orthogonal to) the regressors. Angrist and Pischke (2009) note (footnote, p. 223) that they prefer FE because the gains in efficiency are likely to be modest while the finite sample properties of RE may be worse. As noted on p. 243, an important assumption for identification in FE is that the most important omitted variables are time invariant (because only time-invariant factors get differenced out). Angrist and Pischke also have a nice discussion on pages 244-245 of the choice between FE and lagged dependent variable models.


A Guide to Econometrics. Peter Kennedy. 6th Edition. 2008
Mostly Harmless Econometrics. Angrist and Pischke. 2009

See also: ‘Metrics Monday: Fixed Effects, Random Effects, and (Lack of) External Validity (Marc Bellemare).

Marc notes: 

"Nowadays, in the wake of the Credibility Revolution, what we teach students is: “You should use RE when your variable of interest is orthogonal to the error term; if there is any doubt and you think your variable of interest is not orthogonal to the error term, use FE.” And since the variable can be argued to be orthogonal pretty much only in cases where it is randomly assigned in the context of an experiment, experimental work is pretty much the only time the RE estimator should be used."

Friday, February 2, 2018

Deep Learning vs. Logistic Regression, ROC vs. Calibration, Explaining vs. Predicting

Frank Harrell writes Is Medicine Mesmerized by Machine Learning? Some time ago I wrote about predictive modeling and the differences between what the ROC curve may tell us and how well a model 'calibrates.'

There I quoted from the journal Circulation:

'When the goal of a predictive model is to categorize individuals into risk strata, the assessment of such models should be based on how well they achieve this aim...The use of a single, somewhat insensitive, measure of model fit such as the c statistic can erroneously eliminate important clinical risk predictors for consideration in scoring algorithms'

Not too long ago Dr. Harrell shared the following tweet related to this:

I have seen hundreds of ROC curves in the past few years.  I've yet to see one that provided any insight whatsoever.  They reverse the roles of X and Y and invite dichotomization.  Authors seem to think they're obligatory.  Let's get rid of 'em. @f2harrell 8:42 AM - 1 Jan 2018

In his Statistical Thinking post above, Dr. Harrell writes:

"Like many applications of ML where few statistical principles are incorporated into the algorithm, the result is a failure to make accurate predictions on the absolute risk scale. The calibration curve is far from the line of identity as shown below...The gain in c-index from ML over simpler approaches has been more than offset by worse calibration accuracy than the other approaches achieved."

i.e. depending on the goal, better ROC scores don't necessarily mean better models.
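To make the distinction between discrimination and calibration concrete, here is a small simulated sketch (the data, the risk model, and the helper names auc and brier are all assumptions for illustration). Two sets of predictions rank patients identically, so they have the same c statistic (AUC), but one of them is badly miscalibrated on the absolute risk scale.

```r
set.seed(3)
n      <- 5000
x      <- rnorm(n)
p_true <- plogis(-1 + 1.5 * x)           # assumed true risk for each individual
y      <- rbinom(n, 1, p_true)           # observed binary outcome

# c statistic (AUC) computed from ranks, i.e. the Mann-Whitney form
auc <- function(score, y) {
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Brier score: mean squared error on the probability scale
brier <- function(p, y) mean((p - y)^2)

p_good <- p_true                         # well calibrated predictions
p_bad  <- p_true^3                       # same ordering, systematically too low

print(c(auc_good = auc(p_good, y), auc_bad = auc(p_bad, y)))
print(c(brier_good = brier(p_good, y), brier_bad = brier(p_bad, y)))
print(c(observed_rate = mean(y), mean_good = mean(p_good), mean_bad = mean(p_bad)))
```

Any monotone distortion of the predicted probabilities leaves the ROC curve unchanged, which is why a good c statistic says nothing by itself about accuracy on the absolute risk scale.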

But this post was about more than discrimination and calibration. It was discussing the logistic regression approach taken in Exceptional Mortality Prediction by Risk Scores from Common Laboratory Tests  vs the deep learning approach used in Improving Palliative Care with Deep Learning.

"One additional point: the ML deep learning algorithm is a black box, not provided by Avati et al, and apparently not usable by others. And the algorithm is so complex (especially with its extreme usage of procedure codes) that one can’t be certain that it didn’t use proxies for private insurance coverage, raising a possible ethics flag. In general, any bias that exists in the health system may be represented in the EHR, and an EHR-wide ML algorithm has a chance of perpetuating that bias in future medical decisions. On a separate note, I would favor using comprehensive comorbidity indexes and severity of disease measures over doing a free-range exploration of ICD-9 codes."

This kind of pushes back against the idea that deep neural nets can effectively bypass feature engineering, or at least raises cautions in specific contexts.

Actually, he is not as critical of the authors of this paper as he is about what he considers undue accolades it has received.

This ties back to my post on LinkedIn a couple weeks ago, Deep Learning, Regression, and SQL. 

See also:

To Explain or Predict
Big Data: Causality and Local Expertise Are Key in Agronomic Applications


Feature Engineering for Deep Learning
In Deep Learning, Architecture Engineering is the New Feature Engineering

Sunday, December 31, 2017

HARK! - flawed studies in nutrition call for credibility revolution -or- HARKing in nutrition research

There was a nice piece over at the Genetic Literacy Project I read just recently: Why so many scientific studies are flawed and poorly understood. They gave a fairly intuitive example of false positives in research using coin flips. I like this because I used the specific example of flipping a coin 5 times in a row to demonstrate basic probability concepts in some of the stats classes I used to teach. Their example might make a nice extension:

"In Table 1 we present ten 61-toss sequences. The sequences were computer generated using a fair 50:50 coin. We have marked where there are runs of five or more heads one after the other. In all but three of the sequences, there is a run of at least five heads. Thus, a sequence of five heads has a probability of 0.5^5 = 0.03125 (i.e., less than 0.05) of occurring. Note that there are 57 opportunities in a sequence of 61 tosses for five consecutive heads to occur. We can conclude that although a sequence of five consecutive heads is relatively rare taken alone, it is not rare to see at least one sequence of five heads in 61 tosses of a coin."

In other words, a 5 head run in a sequence of 61 tosses (as evidence against a null hypothesis of p(head) = .5 i.e. a fair coin) is their analogy for a false positive in research. Particularly they relate this to nutrition research where it is popular to use large survey questionnaires that consist of a large number of questions:
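The coin example is easy to simulate. The sketch below (the function name has_run and the number of replications are my own choices, not from the article) compares the probability of five heads in five tosses with the simulated share of 61-toss sequences that contain at least one run of five or more heads.

```r
set.seed(4)

# TRUE if a sequence of fair coin tosses contains a run of heads of the given length
has_run <- function(n_tosses = 61, run_length = 5) {
  tosses <- rbinom(n_tosses, 1, 0.5)     # 1 = heads, 0 = tails
  runs   <- rle(tosses)                  # lengths of consecutive identical tosses
  any(runs$lengths[runs$values == 1] >= run_length)
}

p_single <- 0.5^5                                  # five heads in exactly five tosses
p_any    <- mean(replicate(10000, has_run()))      # at least one such run in 61 tosses

print(c(five_heads_in_five_tosses = p_single, run_somewhere_in_61_tosses = p_any))
```

The simulated share is many times larger than 0.03125, which is the point of the analogy: ask enough questions and something that looks 'significant' will usually turn up.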

"asking lots of questions and doing weak statistical testing is part of what is wrong with the self-reinforcing publish/grants business model. Just ask a lot of questions, get false-positives, and make a plausible story for the food causing a health effect with a p-value less than 0.05"

It is their 'hypothesis' that this approach, in conjunction with a questionable practice referred to as 'HARKing' (hypothesizing after the results are known), is one reason we see so many conflicting headlines about what we should and should not eat, or about the benefits or harms of certain foods and diets. There is some damage done in terms of people's trust in science as a result. They conclude:

"Curiously, editors and peer-reviewers of research articles have not recognized and ended this statistical malpractice, so it will fall to government funding agencies to cut off support for studies with flawed design, and to universities to stop rewarding the publication of bad research. We are not optimistic."

More on HARKing.....

A good article related to HARKing is a paper written by Norbert L. Kerr. He defines HARKing specifically as the practice of proposing one hypothesis (or set of hypotheses) but later changing the research question *after* the data are examined, then presenting the results *as if* the new hypothesis were the original. He does distinguish this from a more intentional exercise in scientific induction, inferring some relation or principle post hoc from a pattern of data, which is more like exploratory data analysis.

I discussed exploratory studies and issues related to multiple testing in a previous post:  Econometrics, Multiple Testing, and Researcher Degrees of Freedom. 

To borrow a quote from this post- "At the same time, we do not want demands of statistical purity to strait-jacket our science. The most valuable statistical analyses often arise only after an iterative process involving the data" (see, e.g., Tukey, 1980, and Box, 1997).

To say the least, careful consideration of tradeoffs should be made in the way research is conducted, and as the post discusses in more detail, the garden of forking paths involved.

I am not sure to what extent the credibility revolution has impacted nutrition studies, but the lessons apply here.


HARKing: Hypothesizing After the Results are Known
Norbert L. Kerr
Personality and Social Psychology Review
Vol 2, Issue 3, pp. 196 - 217
First Published August 1, 1998

Thursday, August 24, 2017

Granger Causality

"Granger causality is a standard linear technique for determining whether one time series is useful in forecasting another." (Irwin and Sanders, 2011).

A series 'granger' causes another series if it consistently predicts it. If series X granger causes Y, while we can't be certain that this relationship is causal in any rigorous way, we might be fairly certain that Y doesn't cause X.


Y_t = B0 + B1*Y_{t-1} + ... + Bp*Y_{t-p} + A1*X_{t-1} + ... + Ap*X_{t-p} + E_t

If we reject the hypothesis that all the 'A' coefficients jointly = 0, then 'X' Granger causes 'Y'.

X_t = B0 + B1*X_{t-1} + ... + Bp*X_{t-p} + A1*Y_{t-1} + ... + Ap*Y_{t-p} + E_t

If we reject the hypothesis that all the 'A' coefficients jointly = 0, then 'Y' Granger causes 'X'.
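The two regressions above can be sketched in base R as a pair of nested models compared with an F test. Everything here is an assumption for illustration: the simulated series (in which lagged X really does help predict Y), the lag length of 2, and the helper name granger_p. Dedicated packages exist for this test, but the manual version shows what is being tested.

```r
set.seed(7)
n <- 200
x <- numeric(n)
y <- numeric(n)
for (t in 2:n) {
  x[t] <- 0.5 * x[t - 1] + rnorm(1)                    # X depends only on its own past
  y[t] <- 0.3 * y[t - 1] + 0.8 * x[t - 1] + rnorm(1)   # Y depends on lagged X
}

# p-value for H0: all coefficients on the lagged predictor are jointly zero
granger_p <- function(target, predictor, p = 2) {
  tl <- embed(target, p + 1)             # column 1 = value at t, columns 2.. = lags 1..p
  pl <- embed(predictor, p + 1)
  yt         <- tl[, 1]
  own_lags   <- tl[, -1, drop = FALSE]
  other_lags <- pl[, -1, drop = FALSE]
  restricted   <- lm(yt ~ own_lags)                # own lags only
  unrestricted <- lm(yt ~ own_lags + other_lags)   # add lags of the other series
  anova(restricted, unrestricted)[2, "Pr(>F)"]     # joint F test on the added lags
}

p_x_to_y <- granger_p(y, x)   # does X Granger cause Y?
p_y_to_x <- granger_p(x, y)   # does Y Granger cause X?
print(c(x_causes_y = p_x_to_y, y_causes_x = p_y_to_x))
```

A small p-value in the first test and a large one in the second is the pattern the text describes: X helps forecast Y, but not the other way around.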


Below are some applications where granger causality methods were used to test the impacts of index funds on commodity market price and volatility.

The Impact of Index Funds in Commodity Futures Markets:A Systems Approach
The Journal of Alternative Investments
Summer 2011, Vol. 14, No. 1: pp. 40-49

Irwin, S. H. and D. R. Sanders (2010), “The Impact of Index and Swap Funds on Commodity Futures Markets: Preliminary Results”, OECD Food, Agriculture and Fisheries Working Papers, No. 27, OECD Publishing. doi: 10.1787/5kmd40wl1t5f-en

Index Trading and Agricultural Commodity Prices:
A Panel Granger Causality Analysis
Gunther Capelle-Blancard and Dramane Coulibaly
CEPII, WP No 2011 – 28


Using Econometrics: A Practical Guide (6th Edition) A.H. Studenmund. 2011

Monday, August 7, 2017

Confidence Intervals: Fad or Fashion

Confidence intervals seem to be the fad among some in pop stats/data science/analytics. Whenever there is mention of p-hacking, or the ills of publication standards, or the pitfalls of null hypothesis significance testing, CIs almost always seem to be the popular solution.

There are some attractive features of CIs. This paper provides some alternative views of CIs, discusses some strengths and weaknesses, and ultimately proposes that they are on balance superior to p-values and hypothesis testing. CIs can bring more information to the table in terms of effect sizes for a given sample; however, some of the statements made in this article need to be read with caution. I just wonder how much of the fascination with CIs is the result of confusing a Bayesian interpretation with a frequentist application, or of just sloppy misinterpretation. I completely disagree that they are more straightforward for students (compared to interpreting hypothesis tests and p-values, as the article claims).

Dave Giles gives a very good review starting with the very basics of what is a parameter vs. an estimator vs. an estimate, sampling distributions etc. After reviewing the concepts key to understanding CIs he points out two very common interpretations of CIs that are clearly wrong:

1) There's a 95% probability that the true value of the regression coefficient lies in the interval [a,b].
2) This interval includes the true value of the regression coefficient 95% of the time.

"we really should talk about the (random) intervals "covering" the (fixed) value of the parameter. If, as some people do, we talk about the parameter "falling in the interval", it sounds as if it's the parameter that's random and the interval that's fixed. Not so!"

In Robust misinterpretation of confidence intervals, the authors take on the idea that confidence intervals offer a panacea for interpretation issues related to null hypothesis significance testing (NHST):

"Confidence intervals (CIs) have frequently been proposed as a more useful alternative to NHST, and their use is strongly encouraged in the APA Manual...Our findings suggest that many researchers do not know the correct interpretation of a CI....As is the case with p-values, CIs do not allow one to make probability statements about parameters or hypotheses."

The authors present evidence about this misunderstanding by presenting subjects with a number of false statements regarding confidence intervals (including the two above pointed out by Dave Giles) and noting the frequency of incorrect affirmations about their truth.
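The 'covering' language Dave Giles recommends is easiest to see in a simulation. In the sketch below (the true mean, sample size, and number of replications are arbitrary assumptions) the parameter stays fixed while the interval changes from sample to sample, and the 95% refers to the long-run share of those random intervals that cover the fixed value.

```r
set.seed(5)
mu <- 10                                  # the fixed (assumed) true parameter

covers <- replicate(5000, {
  x  <- rnorm(30, mean = mu, sd = 2)      # a fresh sample each time
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= mu && mu <= ci[2]              # did this particular interval cover mu?
})

coverage <- mean(covers)                  # share of intervals that covered mu
print(coverage)
```

Any single realized interval either covers the parameter or it does not; the probability statement belongs to the procedure, not to the one interval in hand.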

In Osteoarthritis and Cartilage, authors write:

"In spite of frequent discussions of misuse and misunderstanding of probability values (P-values) they still appear in most scientific publications, and the disadvantages of erroneous and simplistic P-value interpretations grow with the number of scientific publications."

They raise a number of issues related to both p-values and confidence intervals (multiplicity of testing, the focus on effect sizes, etc.) and they point out some informative differences between using p-values vs. using standard errors to produce 'error bars.' However, in trying to clarify the advantages of confidence intervals they step really close to what might be considered an erroneous and simplistic interpretation:

"the great advantage with confidence intervals is that they do show what effects are likely to exist in the population. Values excluded from the confidence interval are thus not likely to exist in the population. "

Maybe I am being picky, but if we are going to be picky about interpreting p-values then the same goes for CIs. It sounds a lot like they are talking about 'a parameter falling into an interval' or the 'probability of a parameter falling into an interval' as Dave cautions against. They seem careful enough in their language using the term 'likely' vs. making strong probability statements, so maybe they are making a more heuristic interpretation that while useful may not be the most correct.

In Mastering 'Metrics, Angrist and Pischke give a great interpretation of confidence intervals that, in my opinion, doesn't lend itself as easily to abusive probability interpretations:

"By describing a set of parameter values consistent with our data, confidence intervals provide a compact summary of the information these data contain about the population from which they were sampled"

I think the authors in Osteoarthritis and Cartilage could have stated their case better if they had said:

"The great advantage of confidence intervals is that they describe what effects in the population are consistent with our sample data. Our sample data is not consistent with population effects excluded from the confidence interval."

Both hypothesis testing and confidence intervals are statements about the compatibility of our observable sample data with population characteristics of interest. The ASA released a set of clarifications on statements on p-values. Number 2 states that "P-values do not measure the probability that the studied hypothesis is true." Nor does a confidence interval (again see Ranstam, 2014).

Venturing into the risky practice of making imperfect analogies, take this loosely from the perspective of criminal investigations. We might think of confidence intervals as narrowing the range of suspects based on observed evidence, without providing specific probabilities related to the guilt or innocence of any particular suspect. Better evidence narrows the list, just as better evidence in our sample data (less noise) will narrow the confidence interval.

I see no harm in CIs and more good if they draw more attention to practical/clinical significance of effect sizes. But I think the temptation to incorrectly represent CIs can be just as strong as the temptation to speak boldly of 'significant' findings following an exercise in p-hacking or in the face of meaningless effect sizes. Maybe some sins are greater than others and proponents feel more comfortable with misinterpretations/overinterpretations of CIs than they do with misinterpretations/overinterpretations of p-values.

Or as Briggs concludes about this issue:

"Since no frequentist can interpret a confidence interval in any but in a logical probability or Bayesian way, it would be best to admit it and abandon frequentism"

Eduard Brandstätter (Johannes Kepler Universität Linz), Confidence Intervals as an Alternative to Significance Testing, Methods of Psychological Research Online, 1999, Vol. 4, No. 2, Pabst Science Publishers.

J. Ranstam, Why the P-value culture is bad and confidence intervals a better alternative, Osteoarthritis and Cartilage, Volume 20, Issue 8, 2012, Pages 805-808, ISSN 1063-4584, http://dx.doi.org/10.1016/j.joca.2012.04.001 (http://www.sciencedirect.com/science/article/pii/S1063458412007789)

Robust misinterpretation of confidence intervals
Rink Hoekstra & Richard D. Morey & Jeffrey N. Rouder &
Eric-Jan Wagenmakers Psychon Bull Rev
DOI 10.3758/s13423-013-0572-3 2014

Friday, July 21, 2017

Regression as a variance based weighted average treatment effect

In Mostly Harmless Econometrics Angrist and Pischke discuss regression in the context of matching. Specifically they show that regression provides a variance-based weighted average of covariate specific differences in outcomes between treatment and control groups. Matching gives us a weighted average difference in treatment and control outcomes, weighted by the empirical distribution of covariates (see more here). I wanted to roughly sketch this logic out below.


δ_ATE = E[y_{1i} | X_i, D_i = 1] - E[y_{0i} | X_i, D_i = 0] = ATE

This gives us the average difference in mean outcomes for treatment and control, (y_{1i}, y_{0i} ⊥ D_i), i.e. in a randomized controlled experiment potential outcomes are independent of treatment status.

We represent the matching estimator empirically by:

Σ_x δ_x P(X_i = x), where δ_x is the difference in mean outcome values between treatment and control units at a particular value of X, or the difference in outcome for a particular combination of covariates, (y_1, y_0 ⊥ D_i | x_i), i.e. conditional independence is assumed, hence identification is achieved through a selection on observables framework.

Average differences δ_x are weighted by the distribution of covariates via the term P(X_i = x).


We can represent a regression parameter using the basic formula taught to most undergraduates:

Single Variable: β = cov(y,D)/v(D)
Multivariable:  βk = cov(y,D*)/v(D*)

where D* = residual from a regression of D on all other covariates, and
E(X'X)^{-1}E(X'y) is a vector with kth element cov(y,x*)/v(x*), where x* is the residual from a regression of that particular 'x' on all other covariates.

We can then represent the estimated treatment effect from regression as:

δ_R = cov(y,D*)/v(D*) = E[(D_i - E[D_i|X_i]) E[y_i | D_i, X_i]] / E[(D_i - E[D_i|X_i])^2], assuming (y_1, y_0 ⊥ D_i | x_i)

Again regression and matching rely on similar identification strategies based on selection on observables/conditional independence.

Let E[y_i | D_i, X_i] = E[y_i | D_i = 0, X_i] + δ_x D_i

Then with more algebra we get: δ_R = cov(y,D*)/v(D*) = E[σ^2_D(X_i) δ_x] / E[σ^2_D(X_i)]

where σ^2_D(X_i) is the conditional variance of treatment D given X, or E[(D_i - E[D_i|X_i])^2 | X_i].

While the algebra is cumbersome and notation heavy, we can see that the way most people are familiar with viewing a regression estimate, cov(y,D*)/v(D*), is equivalent to the term (using expectations) E[σ^2_D(X_i) δ_x] / E[σ^2_D(X_i)], and we can see that this term contains the product of the conditional variance of D and our covariate specific differences between treatment and controls, δ_x.

Hence, regression gives us a variance based weighted average treatment effect, whereas matching provides a distribution weighted average treatment effect.

So what does this mean in practical terms? Angrist and Pischke explain that regression puts more weight on covariate cells where the conditional variance of treatment status is the greatest, i.e. where there are roughly equal numbers of treated and control units. They state that the differences matter little when the variation of δ_x is minimal across covariate combinations.
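The two weighting schemes can be compared directly when the covariate is discrete. The sketch below uses made-up data with two covariate cells (cell labels, sample sizes, and effect sizes are all assumptions): in cell 'a' half the units are treated, in cell 'b' only 10% are. A regression of y on treatment and a cell dummy reproduces the variance-weighted average of the cell-specific differences, while weighting the same differences by cell size gives the matching-style estimate.

```r
set.seed(123)
cell  <- rep(c("a", "b"), each = 100)
d     <- c(rep(c(1, 0), each = 50),            # cell a: 50 treated, 50 controls
           rep(c(1, 0), times = c(10, 90)))    # cell b: 10 treated, 90 controls
delta <- ifelse(cell == "a", 1, 3)             # assumed cell-specific treatment effects
y     <- 2 + (cell == "b") + delta * d + rnorm(200)

# covariate-specific differences in means (delta_x), in cell order a, b
delta_x <- as.numeric(tapply(y[d == 1], cell[d == 1], mean) -
                      tapply(y[d == 0], cell[d == 0], mean))
n_x <- as.numeric(table(cell))                 # cell sizes
p_x <- as.numeric(tapply(d, cell, mean))       # share treated in each cell

# regression weights: cell size times conditional variance of treatment, p(1 - p)
w_reg        <- n_x * p_x * (1 - p_x)
weighted_est <- sum(w_reg * delta_x) / sum(w_reg)

# matching-style weights: the empirical distribution of the covariate
match_est <- sum(n_x * delta_x) / sum(n_x)

# regression controlling for the (saturated) covariate
reg_est <- unname(coef(lm(y ~ d + factor(cell)))["d"])

print(c(regression = reg_est, variance_weighted = weighted_est, matching = match_est))
```

Regression leans toward the effect in cell 'a', where treatment status varies the most, so the two estimates diverge here only because the assumed effects differ across cells.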

In his post The cardinal sin of matching, Chris Blattman puts it this way:

"For causal inference, the most important difference between regression and matching is what observations count the most. A regression tries to minimize the squared errors, so observations on the margins get a lot of weight. Matching puts the emphasis on observations that have similar X’s, and so those observations on the margin might get no weight at all....Matching might make sense if there are observations in your data that have no business being compared to one another, and in that way produce a better estimate" 

Below is a very simple contrived example. Suppose our data looks like the demo data generated in the R code at the end of this post.

We can see that those in the treatment group tend to have higher outcome values, so a straight comparison between treatment and controls will overestimate treatment effects due to selection bias:

 E[Y_i | d_i = 1] - E[Y_i | d_i = 0] = E[Y_{1i} - Y_{0i} | d_i = 1] + {E[Y_{0i} | d_i = 1] - E[Y_{0i} | d_i = 0]}

However, if we estimate differences based on an exact matching scheme, we get a much smaller estimate of .67, compared with the raw difference in means of 3.78. If we run a regression using all of the data we get .75. If we consider 3.78 to be biased upward then both matching and regression have substantially reduced it, and depending on the application the difference between .67 and .75 may not be of great consequence. Of course if we run the regression including only the matched observations, we get exactly the same result as matching (see R code below). This is not so different from the method of trimming based on propensity scores suggested in Angrist and Pischke.

Both methods rely on the same assumptions for identification, so no one can argue superiority of one method over the other with regard to identification of causal effects.

Matching has the advantage of being nonparametric, alleviating concerns with functional form. However, there are lots of considerations to work through in matching (i.e. 1:1, 1:many, optimal caliper width, variance/bias tradeoff and kernel selection etc.). While all of these possibilities might lead to better estimates, I wonder if they don't sometimes lead to a garden of forking paths.

See also: 

For a neater set of notes related to this post, see:

Matt Bogard. "Regression and Matching (3).pdf" Econometrics, Statistics, Financial Data Modeling (2017). Available at: http://works.bepress.com/matt_bogard/37/

Using R MatchIt for Propensity Score Matching

R Code:

# generate demo data
x <- c(4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9)
d <- c(1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0)
y <- c(6,7,8,8,9,11,12,13,14,2,3,4,5,6,7,8,9,10)

summary(lm(y~x+d)) # regression controlling for x
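
The other numbers quoted in the post (the raw difference, the exact matching estimate, and the regression on matched observations only) can be reproduced with a few more lines. This is a sketch of one way to do it in base R: the exact matching here simply keeps the values of x that appear in both groups and averages the cell differences, which is my assumption about how the .67 figure was obtained. The data are repeated so the block runs on its own.

```r
# demo data (same values as above)
x <- c(4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9)
d <- c(1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0)
y <- c(6,7,8,8,9,11,12,13,14,2,3,4,5,6,7,8,9,10)

# raw difference in means between treated and controls
raw_diff <- mean(y[d == 1]) - mean(y[d == 0])

# exact matching on x: keep only x values observed in both groups
common    <- intersect(x[d == 1], x[d == 0])
cell_diff <- sapply(common, function(v) {
  mean(y[d == 1 & x == v]) - mean(y[d == 0 & x == v])
})
match_est <- mean(cell_diff)             # each matched cell has one treated, one control

# regression on all data vs. regression on matched observations only
reg_all     <- unname(coef(lm(y ~ x + d))["d"])
matched     <- x %in% common
reg_matched <- unname(coef(lm(y ~ x + d, subset = matched))["d"])

print(round(c(raw = raw_diff, matching = match_est,
              regression = reg_all, regression_matched = reg_matched), 2))
```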

Wednesday, July 12, 2017

Instrumental Variables and LATE

Often in program evaluation we are interested in estimating the average treatment effect (ATE).  This is in theory the effect of treatment on a randomly selected person from the population. This can be estimated in the context of a randomized controlled trial (RCT) by a comparison of means between treated and untreated participants.

However, sometimes in a randomized experiment, some members selected for treatment may not actually receive treatment (if participation is voluntary, for example the Medicaid expansion in Oregon). In this case, sometimes researchers will compare differences in outcome between those selected for treatment vs those assigned to control groups. This analysis, as assigned or as randomized, is referred to as an intent-to-treat analysis (ITT). With perfect compliance, ITT = ATE.

As discussed previously, using treatment assignment as an instrumental variable (IV) is another approach to estimating treatment effects. The resulting estimate is referred to as a local average treatment effect (LATE).

What is LATE and how does it give us an unbiased estimate of causal effects?

In simplest terms, LATE is the ATE for the sub-population of compliers in an RCT (or other natural experiment where an instrument is used).

In a randomized controlled trial you can characterize participants as follows: (see this reference from egap.org for a really great primer on this)

Never Takers: those that refuse treatment regardless of treatment/control assignment.

Always Takers: those that get the treatment even if they are assigned to the control group.

Defiers: those that get the treatment when assigned to the control group and do not receive treatment when assigned to the treatment group. (These people violate an IV assumption referred to as monotonicity.)

Compliers: those that comply, i.e. receive treatment if assigned to the treatment group but do not receive treatment when assigned to the control group.

The outcomes for never takers are the same regardless of treatment assignment and in effect cancel out in an IV analysis. As discussed by Angrist and Pischke in Mastering 'Metrics, the always takers are prime suspects for creating bias in non-compliance scenarios. These folks are typically the more motivated participants and likely would have higher potential outcomes or potentially have a greater benefit from treatment than other participants.  The compliers are characterized as participants that receive treatment only as a result of random assignment. The estimated treatment effect for these folks is often very desirable and in an IV framework can give us an unbiased causal estimate of the treatment effect. This is what is referred to as a local average treatment effect or LATE.

How do we estimate LATE with IVs?

One way LATE estimates are often described is as dividing the ITT effect by the share of compliers. This can also be done in a regression context. Let D be an indicator equal to 1 if treatment is received vs. 0, and let Z be our indicator (0,1) for the original randomization i.e. our instrumental variable. We first regress:

D = β0 + β1 Z + e  

This captures all of the variation in our treatment that is related to our instrument Z, or random assignment. This is 'quasi-experimental' variation. It is also an estimate of the rate of compliance. β1 only picks up the variation in treatment D that is related to Z and leaves all of the variation and unobservable factors related to self selection (i.e. bias) in the residual term.  You can think of this as the filtering process.  We can represent this as: COV(D,Z)/V(Z). 

Then, to relate changes in Z to changes in our target Y we estimate β2, or COV(Y,Z)/V(Z):

Y = β0 + β2 Z + e

Our instrumental variable estimator then becomes:

βIV = β2 / β1, or (Z'Z)^{-1}Z'Y / (Z'Z)^{-1}Z'D, or COV(Y,Z)/COV(D,Z)

The last term gives us the total proportion of ‘quasi-experimental variation’ in D related to Y. We can also view this through a 2SLS modeling strategy:

Stage 1: Regress D on Z to get D* or D = β0 + β1 Z + e 

Stage 2: Regress Y on D*  or  Y = β0 + βIV D* + e

 As described in Mostly Harmless Econometrics, "Intuitively, conditional on covariates, 2SLS retains only the variation in s [D  in our example above] that is generated by quasi-experimental variation- that is generated by the instrument z"

Regardless of how you want to interpret βIV, we can see that it teases out only that variation in  our treatment D that is unrelated to selection bias and relates it to Y giving us an estimate for the treatment effect of D that is less biased.
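The equivalence of these views can be checked on simulated data. The sketch below is built on assumptions of my own: a population of compliers, always takers, and never takers in made-up proportions, a constant treatment effect of 2, and always takers with a higher baseline outcome. It computes the ITT divided by the compliance rate, the covariance ratio, and a manual two-stage estimate, and compares them with the naive treated vs. untreated difference.

```r
set.seed(6)
n    <- 10000
type <- sample(c("complier", "always", "never"), n, replace = TRUE,
               prob = c(0.6, 0.2, 0.2))
z    <- rbinom(n, 1, 0.5)                               # random assignment (instrument)
d    <- ifelse(type == "always", 1,
        ifelse(type == "never", 0, z))                  # treatment actually received
y    <- 1 + 1.5 * (type == "always") + 2 * d + rnorm(n) # always takers have higher baseline

# naive comparison by treatment received (contaminated by self selection)
naive <- mean(y[d == 1]) - mean(y[d == 0])

# ITT effect and first stage (compliance rate)
itt        <- unname(coef(lm(y ~ z))["z"])   # beta2: effect of assignment on outcome
compliance <- unname(coef(lm(d ~ z))["z"])   # beta1: effect of assignment on take-up

iv_ratio <- itt / compliance                 # beta2 / beta1
iv_cov   <- cov(y, z) / cov(d, z)            # COV(Y,Z) / COV(D,Z)

# manual 2SLS: regress Y on the fitted values from the first stage
d_hat   <- fitted(lm(d ~ z))
iv_2sls <- unname(coef(lm(y ~ d_hat))["d_hat"])

print(c(naive = naive, itt = itt, compliance = compliance,
        iv_ratio = iv_ratio, iv_cov = iv_cov, iv_2sls = iv_2sls))
```

The three IV calculations agree with each other, while the naive comparison is pushed upward by the always takers. Note that the standard errors from the manual second stage are not valid; a dedicated IV routine should be used for inference.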

The causal path can be represented as:

Z →D→Y   

There are lots of other ways to think about how to interpret IVs. Ultimately they provide us with an estimate of the LATE, which can be interpreted as an average causal effect of treatment for those participants in a study whose enrollment status is determined completely by Z (the treatment assignment), i.e. the compliers, and this is often a very relevant effect of interest.

Marc Bellemare has some really good posts related to this see here, here, and here.