Let's suppose we estimate the following:
Y = β0 + β1 X1 + e (1)
When we estimate a regression such as (1) above and leave out an important variable such as X2, our estimate of β1 can become biased and inconsistent. In fact, to the extent that X1 and X2 are correlated, X1 becomes correlated with the error term, violating a basic assumption of regression. The omitted information in X2 is referred to in econometrics as ‘unobserved heterogeneity.’ Heterogeneity is simply variation across individual units of observation, and since we can’t observe this variation or heterogeneity as it relates to X2, we have unobserved heterogeneity. Correlation between an explanatory variable and the error term is referred to as endogeneity. So in econometrics, when we have an omitted variable (as is often the case in problems of causal inference and selection bias) we say we have endogeneity caused by unobserved heterogeneity.
How do we characterize the impact of this on our estimate of β1?
We know from basic econometrics that our estimate of β is
b = (X’X)^-1 X’Y or, for the slope in the single-regressor case, COV(Y,X)/VAR(X) (2)
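The equivalence of the two forms in (2) can be checked numerically. The following is a minimal sketch on simulated data; the sample size, seed, and coefficient values (2.0 and 3.0) are arbitrary assumptions chosen for illustration, not values from the discussion above.

```python
import numpy as np

# Simulated data; all numeric values here are illustrative assumptions.
rng = np.random.default_rng(0)
n = 1_000
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)

# Matrix form: b = (X'X)^-1 X'Y, with a column of ones for the intercept.
X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)

# Covariance form for the slope: COV(Y, X) / VAR(X).
# np.cov uses ddof=1 by default, so ddof=1 is used for the variance too.
slope_cov = np.cov(y, x)[0, 1] / np.var(x, ddof=1)

print("slope from (X'X)^-1 X'Y:", b[1])
print("slope from COV/VAR:     ", slope_cov)
```

The two printed slopes should agree up to floating-point error, since with an intercept in the model the matrix formula reduces to the covariance ratio for a single regressor.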
If we substitute Y = β0 + β1 X + e into (2), writing X for X1, we get:
COV(β0 + β1 X + e, X)/VAR(X) =
COV(β0,X)/VAR(X) + COV(β1 X,X)/VAR(X) + COV(e,X)/VAR(X) (3)
= 0 + β1 VAR(X)/VAR(X) + COV(e,X)/VAR(X) (4)
= β1 + COV(e,X)/VAR(X) (5)
We can see from (5) that if we leave out of (1) a relevant variable that is correlated with X, i.e. we have unobserved heterogeneity, then the resulting correlation between X and the error term will not be zero, and our estimate of β1 will be biased by the term COV(e,X)/VAR(X). If (1) were correctly specified, the term COV(e,X)/VAR(X) would drop out and we would get an unbiased estimate of β1.
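The result in (5) can be illustrated with a small simulation. This is only a sketch under assumed values: the true model is taken to be Y = β0 + β1 X1 + β2 X2 + u, with β0 = 1, β1 = 2, β2 = 1.5, and X1 constructed as 0.8·X2 plus noise so that the two regressors are correlated. None of these numbers come from the text above; they are hypothetical choices. Under these assumptions the population bias term is 1.5 × 0.8 / 1.64, roughly 0.73.

```python
import numpy as np

# All parameter values below are illustrative assumptions.
rng = np.random.default_rng(42)
n = 100_000
beta0, beta1, beta2 = 1.0, 2.0, 1.5

x2 = rng.normal(size=n)              # the variable we will omit
x1 = 0.8 * x2 + rng.normal(size=n)   # X1 is correlated with X2
u = rng.normal(size=n)               # noise unrelated to X1 and X2
y = beta0 + beta1 * x1 + beta2 * x2 + u


def slope(dep, reg):
    """Single-regressor slope: COV(dep, reg) / VAR(reg)."""
    return np.cov(dep, reg)[0, 1] / np.var(reg, ddof=1)


# Misspecified regression (1): Y on X1 only.
b1_short = slope(y, x1)

# The error term of (1) absorbs the omitted variable: e = beta2 * X2 + u.
e = beta2 * x2 + u
bias_term = slope(e, x1)             # COV(e, X) / VAR(X) from (5)

# Correctly specified regression: Y on X1 and X2.
X_full = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

print("true beta1:              ", beta1)
print("estimate omitting X2:    ", b1_short)
print("beta1 + COV(e,X)/VAR(X): ", beta1 + bias_term)
print("estimate including X2:   ", b_full[1])
```

In this setup the estimate that omits X2 matches β1 + COV(e,X)/VAR(X) in the sample, while the regression that includes X2 should return a value close to the assumed β1.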