## Tuesday, June 18, 2013

### Unobserved Heterogeneity and Endogeneity

Let's suppose we estimate the following:

Y = β0 + β1 X1 + e            (1)

When we estimate a regression such as (1) above and leave out an important variable such as X2, our estimate of β1 can become biased and inconsistent. In fact, to the extent that X1 and X2 are correlated with each other, X1 becomes correlated with the error term, violating a basic assumption of regression. The omitted information in X2 is referred to in econometrics as ‘unobserved heterogeneity.’ Heterogeneity is simply variation across individual units of observation, and since we can’t observe this variation or heterogeneity as it relates to X2, we have unobserved heterogeneity. Correlation between an explanatory variable and the error term is referred to as endogeneity. So in econometrics, when we have an omitted variable (as is often the case in causal inference and with selection bias), we say we have endogeneity caused by unobserved heterogeneity.
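To make this concrete, here is a minimal simulation sketch in Python with NumPy. The coefficient values, the 0.8 link between X1 and X2, and the sample size are all assumptions chosen for illustration; they are not from any real data set.

```python
import numpy as np

# Illustrative (assumed) true parameters: Y = b0 + b1*X1 + b2*X2 + u
b0, b1, b2 = 1.0, 2.0, 3.0
n = 100_000

rng = np.random.default_rng(0)
x2 = rng.normal(size=n)                # the variable we will "forget"
x1 = 0.8 * x2 + rng.normal(size=n)     # X1 is correlated with X2
u = rng.normal(size=n)                 # well-behaved noise
y = b0 + b1 * x1 + b2 * x2 + u

# Short regression: Y on a constant and X1 only (X2 omitted)
X_short = np.column_stack([np.ones(n), x1])
short_slope = np.linalg.lstsq(X_short, y, rcond=None)[0][1]

# Long regression: Y on a constant, X1 and X2 (correctly specified)
X_long = np.column_stack([np.ones(n), x1, x2])
long_slope = np.linalg.lstsq(X_long, y, rcond=None)[0][1]

# In the short regression the error term is e = b2*X2 + u,
# so X1 is correlated with it.
e = b2 * x2 + u
corr_x1_e = np.corrcoef(x1, e)[0, 1]

print("slope on X1, X2 omitted :", short_slope)
print("slope on X1, X2 included:", long_slope)
print("corr(X1, error)         :", corr_x1_e)
```

Under these assumed values, the short regression's slope should land well above the true β1 = 2, while the long regression should recover something close to 2.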

How do we characterize the impacts of this on our estimate of β1?

We know from basic econometrics that our estimate of β is

b = (X’X)⁻¹X’Y or, for the slope in a regression with a single explanatory variable, COV(Y,X)/VAR(X)           (2)
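As a quick check that the two expressions in (2) agree for the slope when there is one explanatory variable plus an intercept, here is a small NumPy sketch. The data are simulated and the coefficients 0.5 and 1.5 are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
x = rng.normal(size=n)
y = 0.5 + 1.5 * x + rng.normal(size=n)   # assumed toy data

# Matrix form: b = (X'X)^(-1) X'Y, with a column of ones for the intercept.
# Solving the normal equations avoids forming the inverse explicitly.
X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)
slope_matrix = b[1]

# Covariance form: COV(Y, X) / VAR(X), using the same ddof in both pieces
slope_cov = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(slope_matrix, slope_cov)
```

The two numbers should match up to floating-point error, since the covariance form is just the matrix formula written out for the one-regressor case.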

If we substitute Y = β0 + β1 X1 + e into (2), writing X for X1, we get:

COV(β0 + β1 X + e, X)/VAR(X) =
COV(β0,X)/VAR(X) + COV(β1 X,X)/VAR(X) + COV(e,X)/VAR(X)                         (3)

= 0 + β1 VAR(X)/VAR(X) + COV(e,X)/VAR(X)        (4)

= β1 + COV(e,X)/VAR(X)                                (5)

We can see from (5) that if we leave out a variable in (1), i.e. we have unobserved heterogeneity, then the covariance between X and the error term will not be zero, and our estimate of β1 will be biased by the term COV(e,X)/VAR(X). If (1) were correctly specified, then COV(e,X) would be zero, the term COV(e,X)/VAR(X) would drop out, and we would get an unbiased estimate of β1.
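The decomposition in (5) can also be checked numerically. The sketch below assumes a hypothetical omitted variable X2 with coefficient 3 and a 0.8 link to X; these numbers are illustrative only. Because we simulate the data, we can construct the error term e of the misspecified model directly, which is not possible with real data.

```python
import numpy as np

# Assumed true model: Y = b0 + b1*X + b2*X2 + u, with X2 left out of (1)
b0, b1, b2 = 1.0, 2.0, 3.0
n = 100_000

rng = np.random.default_rng(2)
x2 = rng.normal(size=n)
x = 0.8 * x2 + rng.normal(size=n)
u = rng.normal(size=n)
y = b0 + b1 * x + b2 * x2 + u

# Error term of the misspecified model (1): everything not explained by b0 + b1*X
e = y - b0 - b1 * x

var_x = np.var(x, ddof=1)
slope_hat = np.cov(y, x)[0, 1] / var_x    # left-hand side: COV(Y,X)/VAR(X)
bias_term = np.cov(e, x)[0, 1] / var_x    # COV(e,X)/VAR(X) from (5)

print("estimated slope :", slope_hat)
print("b1 + bias term  :", b1 + bias_term)
print("bias term       :", bias_term)
```

The estimated slope and β1 plus the bias term should agree up to floating-point error, because (5) is an algebraic identity that also holds for sample covariances. With these assumed parameters the bias term should be close to 3 × 0.8 / 1.64, or roughly 1.46.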

1. Thank you for this. Have you also discussed difference in differences estimation especially the variety that takes on multiple time periods for a recurring treatment (as opposed to the common which involves only two periods)?

2. good stuff

however, there is a typo in the first sentence: "When we estimate a regression such as (1) above and leave out an important variable such as X2 then our estimate of β1 can become unbiased and inconsistent."

should be "our estimate of β1 can become BIASED and inconsistent"

1. YES! THANK YOU! It should say BIASED. I need to correct that.

3. thank you for the reply. i really like your post; it helps clarify the difficult jargon in an intuitive way.

i have a quick, related question. is the phrase "correlated unobservables" referring to the same phenomenon as "unobserved heterogeneity"?

thanks again