Tuesday, March 22, 2016

Identification and Common Trend Assumptions in Difference-in-Differences for Linear vs GLM Models


In a previous post I discussed the conclusion from Lechner’s paper 'The Estimation of Causal Effects by Difference-in-Difference Methods' that difference-in-difference models in a nonlinear or GLM context fail to satisfy the common trend assumption, and therefore fail to identify treatment effects in a selection-on-unobservables context.

In that post I noted that Lechner points out (quite rigorously in the context of the potential outcomes framework):

"We start with a “natural” nonlinear model with a linear index structure which is transformed by a link function, G(·), to yield the conditional expectation of the potential outcome.....The common trend assumption relies on differencing out specific terms of the unobservable potential outcome, which does not happen in this nonlinear specification... Whereas the linear specification requires the group specific differences to be time constant, the nonlinear specification requires them to be absent. Of course, this property of this nonlinear specification removes the attractive feature that DiD allows for some selection on unobservable group and individual specific differences. Thus, we conclude that estimating a DiD model with the standard specification of a nonlinear model would usually lead to an inconsistent estimator if the standard common trend assumption is upheld. In other words, if the standard DiD assumptions hold, this nonlinear model does not exploit them (it will usually violate them). Therefore, estimation based on this model does not identify the causal effect”

I wanted to review at a high level exactly how he gets to this result. But I wanted to simplify this as much as possible and start with some basic concepts. Starting with a basic regression model, the population conditional expectation function, or conditional mean of Y given X can be written as:

Regression and Expected Value Notation:

E[Y|X] = β0 + β1 X (1)

and we estimate this with the regression on observed data:

y = b0 + b1X + e (2)

Where b1 is our estimate of the population parameter of interest β1.

If E[b1] = β1 then we say our estimator is unbiased.
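As a rough illustration of what unbiasedness means, the sketch below simulates many samples from a population where the true parameters are known and checks that the OLS slope averages out close to β1. The parameter values, sample size, and error distribution are assumptions chosen only for this example.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

BETA0, BETA1 = 1.0, 2.0  # assumed "true" population parameters
N, REPS = 50, 2000       # sample size and number of simulated samples


def ols_slope(x, y):
    """Return the OLS slope b1 from regressing y on x with an intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


slopes = []
for _ in range(REPS):
    x = [random.uniform(0, 10) for _ in range(N)]
    y = [BETA0 + BETA1 * xi + random.gauss(0, 1) for xi in x]
    slopes.append(ols_slope(x, y))

# Individual estimates vary from sample to sample; their average approximates E[b1]
mean_b1 = sum(slopes) / REPS
print(f"average b1 over {REPS} samples: {mean_b1:.4f} (true beta1 = {BETA1})")
```

Any single b1 will miss β1 by some amount; unbiasedness is a statement about the average over repeated samples.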

Potential Outcomes Notation:

When it comes to experimental designs, we are interested in knowing counterfactuals, that is what value of an outcome would a treatment or program participant have in absence of treatment (the baseline potential outcome) vs. if they participated or were treated? If we specify these 'potential outcomes' as follows:

Y0= baseline potential outcome
Y1= potential treatment outcome

We can characterize the treatment effect as:

E[Y1-Y0] or the difference in potential treated vs baseline outcomes. This is referred to as the average treatment effect or ATE. Sometimes we are interested in, or some models estimate, the average treatment effect on the treated or ATT: E[Y1-Y0 | d = 1]

where d is an indicator for treatment (d = 1) vs control or untreated (d =0).
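To make the notation concrete, here is a small sketch with a made-up table of four units in which both potential outcomes are (unrealistically) known. It takes the treatment effect as treated minus baseline, Y1 - Y0, and compares the ATE, the ATT, and the naive treated-vs-control comparison of observed outcomes. All numbers are hypothetical.

```python
# Each tuple is (y0, y1, d): both potential outcomes plus treatment status.
# In real data only one of y0 / y1 is ever observed for a given unit.
units = [
    (10.0, 12.0, 1),
    (8.0, 11.0, 1),
    (9.0, 9.0, 0),
    (7.0, 8.0, 0),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Average treatment effect over everyone, and over the treated only
ate = mean(y1 - y0 for y0, y1, d in units)
att = mean(y1 - y0 for y0, y1, d in units if d == 1)

# What observed data alone would give: treated reveal y1, controls reveal y0
naive = (mean(y1 for y0, y1, d in units if d == 1)
         - mean(y0 for y0, y1, d in units if d == 0))

# Difference in baseline outcomes between groups (selection bias)
selection_bias = (mean(y0 for y0, y1, d in units if d == 1)
                  - mean(y0 for y0, y1, d in units if d == 0))

print(f"ATE = {ate}, ATT = {att}, naive difference = {naive}")
print(f"naive - ATT = {naive - att} = selection bias = {selection_bias}")
```

In this toy table the naive comparison overstates the ATT by exactly the baseline difference between the groups, which is the kind of term difference-in-difference methods try to remove.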

Difference-in-Difference Analysis:

Difference-in-difference (DD) estimators assume that in absence of treatment the difference between control (B) and treatment (A) groups would be constant or ‘fixed’ over time. Treatment effects in DD estimators are derived by subtracting differences between pre and post values within treatment and control groups, and then taking a difference in differences between treatment and control groups. The unobservable effects that are constant or fixed over time 'difference out', allowing us to identify treatment effects controlling for these unobservable characteristics without explicitly measuring them. This characterizes what is referred to as a 'selection on unobservables' framework.


  This can also be estimated using linear regression with an interaction term:

y = b0 + b1 d + b2 t + b3 d*t+ e (3)

where d indicates treatment (d = 1 vs d = 0), t indicates the post-treatment period (t = 1 vs t = 0), and the estimated coefficient (b3) on the time by treatment interaction term gives us our estimate of treatment effects.
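With only the two dummies and their interaction on the right-hand side, the regression is saturated, so the OLS coefficients can be read directly off the four group-by-period cell means. The sketch below uses hypothetical outcome values to show that b3 is exactly the difference in differences.

```python
# Hypothetical outcomes keyed by (d, t): d = treatment group, t = post period
cells = {
    (0, 0): [4.0, 5.0, 6.0],     # control, pre
    (0, 1): [5.0, 6.0, 7.0],     # control, post
    (1, 0): [6.0, 7.0, 8.0],     # treated, pre
    (1, 1): [10.0, 11.0, 12.0],  # treated, post
}

means = {key: sum(vals) / len(vals) for key, vals in cells.items()}

# In a saturated 2x2 model OLS reproduces the cell means, so:
b0 = means[(0, 0)]                  # control group, pre period
b1 = means[(1, 0)] - means[(0, 0)]  # pre-period group difference
b2 = means[(0, 1)] - means[(0, 0)]  # time change for controls
b3 = ((means[(1, 1)] - means[(1, 0)])
      - (means[(0, 1)] - means[(0, 0)]))  # difference in differences

print(f"b0 = {b0}, b1 = {b1}, b2 = {b2}, b3 (DD estimate) = {b3}")
```

Here the fixed group difference (b1) and the common time change (b2) are both netted out of b3.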


Lechner and Potential Outcomes Framework:

In an attempt to present the issues with GLM DD models depicted in Lechner (2010) using the simplest notation possible (abusing notation slightly and perhaps at a cost of precision), we can depict the framework for difference-in-difference analysis using expectations:

DID = [E(Y1|D=1)-E( Y0|D=1)] -[E(Y1|D=0)-E(Y0|D=0)] (4)


DID = [pre/post differences for treatment group] – [pre/post differences for control group]

where Y represents the observed outcome values subscripted by pre (0) and post (1) periods

We can represent potential outcomes in the regression framework as follows:

E(Yt1|D) = α + tδ1 + dγ “potential outcome if treated” (5)

E(Yt0|D) = α + tδ0 + dγ “potential baseline outcome” (6)

ATET: E(Yt1 - Yt0|D=1) = t(δ1 - δ0), so that in the post period (t = 1) θ1 = δ1 - δ0 = δ    (7)

“difference-in-difference of potential outcomes across time if treated”

We can estimate δ with a regression on observed data of the form:

y = b0 + b1 d + b2 t + b3 d*t+ e (3')

where b3 is our estimator for δ.

Common Trend Assumption:
Difference-in-difference (DD) estimators assume that in absence of treatment the difference between control (B) and treatment (A) groups would be constant or ‘fixed’ over time. This can be represented geometrically in a linear modeling context by 'parallel trends' in outcome levels between treatment and control groups in absence of a treatment:


As depicted above, BB represents the trend in outcome Y for a control group. AA represents the counterfactual trend, or parallel or common trend for the treatment group that would occur in absence of treatment. The distance A'A represents a departure from the parallel trend in response to treatment, and would be our DD treatment effect or the value b3 our estimator for δ.

The common trend assumption following Lechner, can be expressed in terms of potential outcomes:

E(Y10|D=1)-E(Y00|D=1) = α + δ0 + γ - α – γ = δ0 (8)

E(Y10|D=0)-E(Y00|D=0) = α + δ0 - α = δ0 (9)

i.e. the pre and post period differences in baseline outcomes are the same (δ0) regardless of whether individuals are assigned to the treatment group (D=1) or control group (D=0).

Nonlinear Models:

In a GLM framework, with a specific link function G(.) a DD framework can be expressed in terms of potential outcomes as follows:

E(Yt1|D) = G(α + tδ1 + dγ) “potential outcome if treated” (10)

E(Yt0|D) = G(α + tδ0 + dγ) “potential baseline outcome” (11)

DID can be estimated by regression on observed outcomes:

E(y|d,t) = G(b0 + b1 d + b2 t + b3 d*t) (12)

Common Trend Assumption:

E(Y10|D=1)-E(Y00|D=1) = G(α + δ0 + γ) - G(α + γ) (13)

E(Y10|D=0)-E(Y00|D=0) = G(α + δ0 ) - G(α ) (14)

It turns out that in a GLM framework, for the common trend assumption to hold, group specific differences must be zero, or γ = 0. Equating (13) and (14) requires G(α + δ0 + γ) - G(α + γ) = G(α + δ0) - G(α), and because G is nonlinear this will generally hold only when γ = 0 (or in special cases such as δ0 = 0). The common trend assumption relies on differencing out specific terms of the unobservable potential outcome, i.e. the group and individual specific effects we are trying to control for in the selection on unobservables scenario, but in a GLM scenario we have to assume that these effects are zero or absent. In essence, the attractive feature of DD models, the ability to control for unobservable effects, does not carry over to DD models in a GLM scenario.
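A quick numerical check of (8)-(9) against (13)-(14) may help. The sketch below assumes a logistic link for G and arbitrary illustrative values for α, δ0 and γ (none of these come from Lechner's paper), and compares the baseline pre/post change for the treatment group with that for the control group.

```python
import math


def logistic(z):
    """Inverse logit link, standing in for G(.)."""
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    """Identity link, i.e. the linear model."""
    return z


def trend_gap(link, alpha, delta0, gamma):
    """Baseline pre/post change for treated minus that for controls."""
    treated = link(alpha + delta0 + gamma) - link(alpha + gamma)
    control = link(alpha + delta0) - link(alpha)
    return treated - control


ALPHA, DELTA0 = -1.0, 0.5  # assumed values, for illustration only

for gamma in (0.0, 1.0):
    linear_gap = trend_gap(identity, ALPHA, DELTA0, gamma)
    glm_gap = trend_gap(logistic, ALPHA, DELTA0, gamma)
    print(f"gamma = {gamma}: linear gap = {linear_gap:.4f}, "
          f"logistic gap = {glm_gap:.4f}")
```

With the identity link the gap is zero whatever the value of γ, because γ differences out. With the logistic link the gap is zero when γ = 0 but not when γ = 1, so a fixed group difference is enough to break common trends on the outcome scale.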

References:
The Estimation of Causal Effects by Difference-in-Difference Methods
By Michael Lechner Foundations and Trends in Econometrics
Vol. 4, No. 3 (2010) 165–224  


Program Evaluation and the
Difference-in-Difference Estimator
Course Notes
Education Policy and Program Evaluation
Vanderbilt University
October 4, 2008

Difference in Difference Models, Course Notes
ECON 47950: Methods for Inferring Causal Relationships in Economics
William N. Evans
University of Notre Dame
Spring 2008
 

Friday, March 11, 2016

Marginal Effects vs Odds Ratios

Models of binary dependent variables often are estimated using logistic regression or probit models, but the estimated coefficients (or exponentiated coefficients expressed as odds ratios) are often difficult to interpret from a practical standpoint. Empirical economic research often reports ‘marginal effects’, which are more intuitive but often more difficult to obtain from popular statistical software. The most straightforward way to obtain marginal effects is from estimation of linear probability models. This paper uses a toy data set to demonstrate the calculation of odds ratios and marginal effects from logistic regression using SAS and R, while comparing them to the results from a standard linear probability model.

Suppose we have a data set that looks at program participation (for some program or product or service of interest) by age and we want to know the influence of age on the decision to participate. Our data may look something like the excerpt below:

participation     age
1                       25
1                       26
1                       27
1                       28
1                       29
1                       30
0                       31
1                       32
1                       33
0                       34

Theoretically, this might call for logistic regression for modeling a dichotomous outcome like participation, so we could use SAS or R to get the following results:

                   Estimate Std. Error z value Pr(>|z|) 
(Intercept)  5.92972    2.34258   2.531   0.0114 *
age            -0.14099    0.05656  -2.493   0.0127 *

                    OR               2.5 %       97.5 %
(Intercept)   376.049897 6.2769262 7.864410e+04
age              0.868502 0.7641126 9.595017e-01

While the estimated coefficients from logistic regression are not easily interpretable (they represent the change in the log of odds of participation for a given change in age), odds ratios might provide a better summary of the effects of age on participation (odds ratios are derived from exponentiation of the estimated coefficients from logistic regression; see also: The Calculation and Interpretation of Odds Ratios) and may be somewhat more meaningful. We can see the odds ratio associated with age is .8685, which implies that for every year increase in age the odds of participation change by about (.8685-1)*100 = -13.15%, i.e. they are about 13.15% lower. You tell me what this means if this is the way you think about the likelihood of outcomes in everyday life!
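The arithmetic behind the odds ratio can be sketched in a few lines. The coefficient is the rounded age estimate reported above; the two baseline probabilities are assumptions picked only to show that the same odds ratio implies different changes in probability depending on where you start.

```python
import math

beta_age = -0.14099  # reported (rounded) logit coefficient on age

odds_ratio = math.exp(beta_age)
pct_change_in_odds = (odds_ratio - 1) * 100
print(f"odds ratio = {odds_ratio:.4f}, "
      f"change in odds = {pct_change_in_odds:.2f}%")


def prob_after_one_year(p):
    """Apply the odds ratio to a baseline probability p."""
    new_odds = (p / (1 - p)) * odds_ratio
    return new_odds / (1 + new_odds)


# Hypothetical baseline probabilities, for illustration only
for p in (0.5, 0.9):
    change = prob_after_one_year(p) - p
    print(f"baseline p = {p}: change in probability = {change:.4f}")
```

The change in probability is larger in magnitude at a baseline of 0.5 than at 0.9, which is part of why a single odds ratio is hard to translate into everyday terms.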

Marginal effects are an alternative metric that can be used to describe the impact of age on participation. Marginal effects can be described as the change in outcome as a function of the change in the treatment (or independent variable of interest) holding all other variables in the model constant. In linear regression, the estimated regression coefficients are marginal effects and are more easily interpreted (more on this later). Marginal effects can be output easily from Stata, but they are not directly available in SAS or R. There are, however, some ad hoc ways of getting them, which I will demonstrate here (there are also some packages in R available to assist with this). I am basing most of this directly on two very good blog posts on the topic:

https://statcompute.wordpress.com/2012/09/30/marginal-effects-on-binary-outcome/ 
https://diffuseprior.wordpress.com/2012/04/23/probitlogit-marginal-effects-in-r-2/ 

One approach is to use PROC QLIM and request output of marginal effects. This computes a marginal effect for each observation’s value of x in the data set (because marginal effects may not be constant across the range of explanatory variables). Taking the average of this result gives an estimated ‘sample average estimate of marginal effect’: -.0258

This tells us that for every year increase in age the probability of participation decreases on average by about 2.6 percentage points.  For most people, for practical purposes, this is probably a more useful interpretation of the relationship between age and participation compared to odds ratios.  We can calculate this more directly (following the code from the blog post by WenSui Liu) using output from logistic regression and the data step in SAS. Basically for each observation in the data set calculate:

MARGIN_AGE = EXP(XB) / ((1 + EXP(XB)) ** 2) * (-0.1410);

Where -.1410 is the estimated coefficient on age from the original logistic regression model. We can run the same analysis in R, either replicating the results from the data step above, or using the mfx function defined by Alan Fernihough referenced in the diffuseprior blog post mentioned above or the paper referenced below.
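For readers without SAS, the same per-observation calculation can be sketched in Python. This is only an illustration: it plugs in the rounded logit estimates reported above and uses just the ten ages from the data excerpt, so the average will not reproduce the -.0258 obtained from the full data set.

```python
import math

B0, B_AGE = 5.92972, -0.14099  # reported (rounded) logit estimates
ages = [25, 26, 27, 28, 29, 30, 31, 32, 33, 34]  # excerpt only


def marginal_effect(age):
    """Same formula as the SAS data step: exp(xb) / (1 + exp(xb))^2 * b."""
    xb = B0 + B_AGE * age
    return math.exp(xb) / ((1 + math.exp(xb)) ** 2) * B_AGE


effects = [marginal_effect(age) for age in ages]
for age, effect in zip(ages, effects):
    print(f"age {age}: marginal effect = {effect:.4f}")

# Sample average of the individual marginal effects (for the excerpt)
avg_effect = sum(effects) / len(effects)
print(f"average marginal effect (excerpt only) = {avg_effect:.4f}")
```

Note that exp(xb) / (1 + exp(xb))^2 is just p(1 - p), where p is the predicted probability, so the marginal effect differs from observation to observation even though the coefficient does not.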

The paper notes that this function gives results similar to the mfx function in Stata. We get almost the same result we got from SAS above, and the function additionally provides bootstrapped standard errors:

marginal.effects   standard.error
      -0.0258330        0.6687069

Marginal Effects from Linear Probability Models

Earlier I mentioned that you could estimate marginal effects directly from the estimated coefficients from a linear probability model. While in some circles LPMs are not viewed favorably, they have a strong following among applied econometricians (see references for more on this). As Angrist and Pischke state in their very popular book Mostly Harmless Econometrics:

"While a nonlinear model may fit the CEF (population conditional expectation function) for LDVs (limited dependent variables) more closely than a linear model, when it comes to marginal effects, this probably matters little"

Using SAS or R we can get the following results from estimating a LPM for this data:

 Coefficients:
                   Estimate    Std. Error  t value    Pr(>|t|)  
(Intercept)  1.700260   0.378572   4.491     0.000111 ***
dat1$age    -0.028699   0.009362  -3.065   0.004775 **

 You can see that the estimate from the linear probability model above gives us a marginal effect  (-.028699) almost identical to the previous estimates derived from logistic regression, as is often the case, and as indicated by Angrist and Pischke.
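A linear probability model is just OLS with the 0/1 outcome on the left-hand side, so the slope can be computed by hand. The sketch below uses only the ten-row excerpt shown earlier, which is an assumption made for illustration; it will therefore not reproduce the -.028699 estimated from the full data.

```python
# Ten-row excerpt of the toy data (not the full data set)
participation = [1, 1, 1, 1, 1, 1, 0, 1, 1, 0]
ages = [25, 26, 27, 28, 29, 30, 31, 32, 33, 34]

n = len(ages)
mean_age = sum(ages) / n
mean_y = sum(participation) / n

# Simple-regression OLS formulas: slope = Sxy / Sxx
sxy = sum((a - mean_age) * (y - mean_y) for a, y in zip(ages, participation))
sxx = sum((a - mean_age) ** 2 for a in ages)

slope = sxy / sxx
intercept = mean_y - slope * mean_age

# In an LPM the slope is read directly as the change in probability per year
print(f"intercept = {intercept:.4f}, slope on age = {slope:.4f}")
```

The slope needs no further transformation: it is already the estimated change in the probability of participation for a one-year increase in age.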

In the SAS ETS example cited in the references below, a distinction is made between calculating sample average marginal effects (which were discussed above) vs. calculating marginal effects at the mean:

“To evaluate the "average" or "overall" marginal effect, two approaches are frequently used. One approach is to compute the marginal effect at the sample means of the data. The other approach is to compute marginal effect at each observation and then to calculate the sample average of individual marginal effects to obtain the overall marginal effect. For large sample sizes, both the approaches yield similar results. However for smaller samples, averaging the individual marginal effects is preferred (Greene 1997, p. 876)”
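The two approaches in the quote can be compared side by side. As a rough sketch, the code below again assumes the rounded logit estimates reported earlier and only the ten ages from the data excerpt, so neither number should be read as a result for the full data.

```python
import math

B0, B_AGE = 5.92972, -0.14099  # reported (rounded) logit estimates
ages = [25, 26, 27, 28, 29, 30, 31, 32, 33, 34]  # excerpt only


def logit_density(xb):
    """p * (1 - p) evaluated at the linear index xb."""
    p = 1.0 / (1.0 + math.exp(-xb))
    return p * (1.0 - p)


# Approach 1: marginal effect evaluated at the sample mean of age
mean_age = sum(ages) / len(ages)
effect_at_mean = logit_density(B0 + B_AGE * mean_age) * B_AGE

# Approach 2: average of the marginal effects computed at each observation
average_effect = sum(
    logit_density(B0 + B_AGE * age) * B_AGE for age in ages
) / len(ages)

print(f"marginal effect at the mean = {effect_at_mean:.4f}")
print(f"average marginal effect     = {average_effect:.4f}")
```

The two numbers are close but not identical, because the logistic density is nonlinear in age and averaging before or after applying it gives different answers.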


For a step by step review of the SAS and R code presented above as well as an additional example with multiple variables see:

Matt Bogard. "Comparing Odds Ratios and Marginal Effects from Logistic Regression and Linear Probability Models" Staff Paper (2016)
Available at: http://works.bepress.com/matt_bogard/30/ 

References: 

Simple logit and probit marginal effects in R.  https://ideas.repec.org/p/ucn/wpaper/201122.html


SAS/ETS Web Examples Computing Marginal Effects for Discrete Dependent Variable Models. http://support.sas.com/rnd/app/examples/ets/margeff/ 

Linear Regression and Analysis of Variance with a Binary Dependent Variable (from EconomicSense, by Matt Bogard).

Angrist, Joshua D. & Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press. NJ. 2008.

Probit better than LPM? http://www.mostlyharmlesseconometrics.com/2012/07/probit-better-than-lpm/ 

Love It or Logit. By Marc Bellemare. marcfbellemare.com/wordpress/9024

R Data Analysis Examples: Logit Regression. From http://www.ats.ucla.edu/stat/r/dae/logit.htm   (accessed March 4,2016).

Greene, W. H. (1997), Econometric Analysis, Third edition, Prentice Hall, 339–350.

Wednesday, March 9, 2016

What's the difference between difference-in-difference models in a linear vs nonlinear context?

A while back I discussed a powerful methodology for identification of causal effects from both a selection on observables and unobservables context, namely combining propensity score matching and difference-in-differences. 

But recently I ran across a tweet from Felix Bethke (https://twitter.com/F_Bethke) sharing a blog post by Tom Pepinsky related to plug and play models. At the risk of oversimplifying, the take away was that we can't just take a methodology like DID used in a standard linear regression context and necessarily 'plug it into'  a non-linear context and get the same results. (often we see arguments going the other way around, we can't use linear models in a non-linear context but that is a different battle for another day).  I highly recommend Tom's post for more details and he links to a number of papers that clarify the issues in a very technical sense.

In a linear difference-in-difference (DID) analysis, identification of causal effects hinges on a common trend assumption and interpretation of the estimated regression coefficient on the time x treatment interaction term.

y = b0 + b1 x + b2 t+b3 x*t + e

In Tom's post, and some of the papers, specific attention is given to how the interpretation of the interaction term (and our estimated treatment effect or b3 in a specification like above) changes in a logit or probit context, where it is something quite different from the causal effect of interest.
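A small numerical sketch shows how far apart the two can be. It assumes a logit model with made-up coefficients (none of these values come from the cited papers) and compares the sign of b3 with the difference in differences of predicted probabilities.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical logit coefficients on the intercept, x, t, and x*t
B0, B1, B2, B3 = 1.0, 1.0, 1.0, 0.2


def prob(x, t):
    """Predicted probability from the logit index with an interaction."""
    return logistic(B0 + B1 * x + B2 * t + B3 * x * t)


# Difference in differences on the probability scale
cross_diff = (prob(1, 1) - prob(1, 0)) - (prob(0, 1) - prob(0, 0))

print(f"interaction coefficient b3 = {B3}")
print(f"difference in differences of probabilities = {cross_diff:.4f}")
```

In this example b3 is positive while the difference in differences on the probability scale is negative, so reading b3 as the treatment effect would get even the sign wrong.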

I was specifically interested in knowing whether this is an issue just for probit and logit models, or for other nonlinear models, like GLM models in general. For instance, in the healthcare economics literature, it's very common to use probit or logit models in a two part modeling context where the second part of a two part model is a GLM model with a log link and gamma distribution. And I have seen some papers using difference-in-differences across the board with these models.

I took a look at a couple of papers and it appears that these issues are a concern for any GLM model.

In a Health Services Research paper, Karaca-Mandic et al discuss these issues and in the abstract imply that this would apply to log transformed models often used in healthcare economics:

"We discuss the motivation for including interaction terms in multivariate analyses. We then explain how the straightforward interpretation of interaction terms in linear models changes in nonlinear models, using graphs and equations. We extend the basic results from logit and probit to difference‐in‐differences models, models with higher powers of explanatory variables, other nonlinear models (including log transformation and ordered models), and panel data models."

After pointing out several issues, they state:

"It is important to understand that the issues about interaction terms discussed here apply to all nonlinear models, including log transformation models"

More specifically, what are these issues, at least at a high level? Recall, difference-in-difference models are a special case of fixed effects panel data models, where unobserved differences and individual specific effects essentially cancel out providing clean identification of causal effects.  For this to work in the DID framework, a common trends assumption is required.  In the referenced paper below, Lechner points out (quite rigorously in the context of the potential outcomes framework):

"We start with a “natural” nonlinear model with a linear index structure which is transformed by a link function, G(·), to yield the conditional expectation of the potential outcome.....The common trend assumption relies on differencing out specific terms of the unobservable potential outcome, which does not happen in this nonlinear specification... Whereas the linear specification requires the group specific differences to be time constant, the nonlinear specification requires them to be absent. Of course, this property of this nonlinear specification removes the attractive feature that DiD allows for some selection on unobservable group and individual specific differences. Thus, we conclude that estimating a DiD model with the standard specification of a nonlinear model would usually lead to an inconsistent estimator if the standard common trend assumption is upheld. In other words, if the standard DiD assumptions hold, this nonlinear model does not exploit them (it will usually violate them). Therefore, estimation based on this model does not identify the causal effect "

Because Lechner demonstrates that this applies to any GLM specification/link function, this seems to strike a blow to using DID in the context of a lot of the modeling approaches used in healthcare economics or any other field relying on similar GLM specifications.

So as Angrist and Pischke might ask, what is an applied guy to do? One approach even in the context of skewed distributions with high mass points (as is common in the healthcare econometrics space) is to specify a linear model. For count outcomes (utilization like ER visits or hospital admissions are often dichotomized and modeled by logit or probit models) you can just use a linear probability model. For skewed distributions with heavy mass points, dichotomization with a LPM may also be an attractive alternative.

References:

Special thanks to tweets and additional input from Tom Pepinsky and Marc Bellemare.

Interaction Terms in Nonlinear Models
Pinar Karaca-Mandic, Edward C. Norton, and Bryan Dowd
HSR: Health Services Research 47:1, Part I (February 2012)

The Estimation of Causal Effects by Difference-in-Difference Methods
By Michael Lechner Foundations and Trends in Econometrics
Vol. 4, No. 3 (2010) 165–224

Tuesday, March 8, 2016

Applied Econometrics in One Lesson

When it comes to the challenging problems of causal inference (all the issues we encounter that create the gaps between textbook and applied econometrics) I think the best advice I have seen as an applied researcher comes from Marc Bellemare:

Do Both!!

Which seems to be a big takeaway from Angrist and Pischke's  Mostly Harmless Econometrics:

"So what's an applied guy to do? One answer, as always, is to check the robustness of your findings using alternative identifying assumptions. That means that you would like to find broadly similar results using plausible alternative models" 

That's applied econometrics in one lesson. That's the credibility revolution in practice.

Saturday, March 5, 2016

Machine Learning and Econometrics

Not long ago Tyler Cowen blogged at Marginal Revolution about a Quora post by Susan Athey discussing the impact of machine learning on econometrics, flavors of machine learning, and differences in the emphasis placed on tools and methodologies traditional in each field. The differences often hinge on whether one's intention is to explain or predict,  or if one is interested in causal inference vs analytics. I really liked the point about instrumental variables made in the snippet below:

"Yet, a cornerstone of introductory econometrics is that prediction is not causal inference, and indeed a classic economic example is that in many economic datasets, price and quantity are positively correlated.  Firms set prices higher in high-income cities where consumers buy more; they raise prices in anticipation of times of peak demand. A large body of econometric research seeks to REDUCE the goodness of fit of a model in order to estimate the causal effect of, say, changing prices. If prices and quantities are positively correlated in the data, any model that estimates the true causal effect (quantity goes down if you change price) will not do as good a job fitting the data….Techniques like instrumental variables seek to use only some of the information that is in the data – the “clean” or “exogenous” or “experiment-like” variation in price—sacrificing predictive accuracy in the current environment to learn about a more fundamental relationship that will help make decisions about changing price. This type of model has not received almost any attention in ML."

Tyler also points to a wealth of resources by Susan Athey here. And check out the mini-course she taught with Guido Imbens via NBER.

The differences and synergies between tools used in econometrics and machine learning are something I have been interested in for a long time and have blogged about several times in the past. Kenneth Sanford and Hal Varian have also been writing about this. See related content below.

Related Content and Further Reading

Economists as Data Scientists http://econometricsense.blogspot.com/2012/10/economists-as-data-scientists.html

Econometrics, Math, and Machine Learning….what? http://econometricsense.blogspot.com/2015/09/econometrics-math-and-machine.html 

"Mathematical Themes in Economics, Machine Learning, and Bioinformatics" (2010)
Available at: http://works.bepress.com/matt_bogard/7/ 

Notes to 'Support' an Understanding of Support Vector Machines  http://econometricsense.blogspot.com/2012/05/notes-to-support-understanding-of.html

Culture War: Classical Statistics vs. Machine Learning http://econometricsense.blogspot.com/2011/01/classical-statistics-vs-machine.html

Analytics vs Causal Inference http://econometricsense.blogspot.com/2014/01/analytics-vs-causal-inference.html

Big Data: Don’t throw the baby out with the bath water http://econometricsense.blogspot.com/2014/05/big-data-dont-throw-baby-out-with.html

To Explain or Predict http://econometricsense.blogspot.com/2015/03/to-explain-or-predict.html 

Big Data: Causality and Local Expertise Are Key in Agronomic Applications http://econometricsense.blogspot.com/2014/05/big-data-think-global-act-local-when-it.html


Big Data:  New Tricks for Econometrics
Hal R. Varian
June 2013
Revised:  April 14, 2014
http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf

Is machine learning trending with economists? (Kenneth Sanford)  http://blogs.sas.com/content/subconsciousmusings/2015/06/05/is-machine-learning-trending-with-economists/