Wednesday, May 25, 2016

Divide by 4 Rule for Marginal Effects

Previously I wrote about the practical differences between marginal effects and odds ratios with regard to logistic regression.

Recently, I ran across a tweet from Michael Grogan linking to one of his posts using logistic regression to model dividend probabilities. This really got me interested:

"Moreover, to obtain a measure in probability terms – one could divide the logit coefficient by 4 to obtain an estimate of the probability of a dividend increase. In this case, the logit coefficient of 0.8919 divided by 4 gives us a result of 22.29%, which means that for every 1 year increase in consecutive dividend increases, the probability of an increase in the dividend this year rises by 22.29%."

I had never heard of this 'divide by 4' shortcut for getting to marginal effects. While you can get those in STATA, R, or SAS with a little work, I think this trick would be very handy if, for instance, you are reading someone else's paper or results and just want a ballpark on marginal effects (instead of interpreting odds ratios).

I did some additional investigation on this and ran across Stephen Turner's Getting Genetics Done blog post related to this, where he goes a little deeper into the mathematics behind this:

"The slope of this curve (1st derivative of the logistic curve) is maximized at a+ßx=0, where it takes on the value:
ße0/(1+e0

=ß(1)/(1+1)²

=ß/4

So you can take the logistic regression coefficients (not including the intercept) and divide them by 4 to get an upper bound of the predictive difference in probability of the outcome y=1 per unit increase in x."
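
A quick numerical check in R (the logistic density peaks at 1/4, which is where the 4 comes from; 0.8919 is the coefficient from the quoted post):

dlogis(0)           # 0.25, the maximum of the logistic density
0.8919 * dlogis(0)  # ~0.223, matching the 22.29% figure quoted above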


Stephen points to Andrew Gelman, who may be the originator of this, citing the text Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill. There is some pushback in the comments to Stephen's post, but I still think this is a nice shortcut for an on-the-fly interpretation of reported results.

If you go back to the output from my previous post on marginal effects, the estimated logistic regression coefficients were:

                   Estimate Std. Error z value Pr(>|z|) 
(Intercept)  5.92972    2.34258   2.531   0.0114 *
age            -0.14099    0.05656  -2.493   0.0127 *


And if you apply the divide by 4 rule you get:

-0.14099 / 4 = -0.0352

While this is not a proof, or even a simulation, it is close to the minimum of the fitted marginal effects for this data (for a negative coefficient, the divide by 4 value bounds the marginal effects in magnitude; see the full R program below):

summary(meffx)
   Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.03525 -0.03262 -0.02697 -0.02583 -0.02030 -0.01071
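
As a quick sanity check, here is a minimal sketch (assuming the mylogit model fitted in the full program below) comparing the divide by 4 bound to the largest fitted marginal effect:

b_age <- coef(mylogit)["age"]                 # -0.14099, the logit coefficient on age
b_age / 4                                     # divide by 4 bound: ~ -0.0352
p <- dlogis(predict(mylogit, type = "link"))  # logistic density at each fitted index
min(p * b_age)                                # most negative fitted marginal effect, ~ -0.03525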


R Code:

 #------------------------------------------------------------------
 # PROGRAM NAME: MEFF and Odds Ratios
 # DATE: 3/3/16
 # CREATED BY: MATT BOGARD
 # PROJECT FILE:                       
 #------------------------------------------------------------------
 # PURPOSE: GENERATE MARGINAL EFFECTS FOR LOGISTIC REGRESSION AND COMPARE TO:
 # ODDS RATIOS / RESULTS FROM R
 #
 # REFERENCES: https://statcompute.wordpress.com/2012/09/30/marginal-effects-on-binary-outcome/
 #             https://diffuseprior.wordpress.com/2012/04/23/probitlogit-marginal-effects-in-r-2/
 #
 # UCD CENTRE FOR ECONOMIC RESEARCH 
 #  WORKING PAPER SERIES 
 # 2011 
 # Simple Logit and Probit Marginal Effects in R 
 # Alan Fernihough, University College Dublin 
 # WP11/22 
 # October 2011 
 # http://www.ucd.ie/t4cms/WP11_22.pdf
 #------------------------------------------------------------------;
 
 
#-----------------------------------------------------
#   generate data for continuous explanatory variable
#----------------------------------------------------
 
 
Input = ("participate age
1 25
1 26
1 27
1 28
1 29
1 30
0 31
1 32
1 33
0 34
1 35
1 36
1 37
1 38
0 39
0 40
0 41
1 42
1 43
0 44
0 45
1 46
1 47
0 48
0 49
0 50
0 51
0 52
1 53
0 54
")
 
dat1 <-  read.table(textConnection(Input),header=TRUE)
 
summary(dat1) # summary stats
 
#### run logistic regression model
 
mylogit <- glm(participate ~ age, data = dat1, family = "binomial")
summary(mylogit)
 
exp(cbind(OR = coef(mylogit), confint(mylogit))) # get odds ratios
 
#-------------------------------------------------
# marginal effects calculations
#-------------------------------------------------
 
 
#--------------------------------------------------------------------
# mfx function for marginal effects from a glm model
#
# from: https://diffuseprior.wordpress.com/2012/04/23/probitlogit-marginal-effects-in-r-2/  
# based on:
# UCD CENTRE FOR ECONOMIC RESEARCH 
# WORKING PAPER SERIES 
# 2011 
# Simple Logit and Probit Marginal Effects in R 
# Alan Fernihough, University College Dublin 
# WP11/22 
# October 2011 
# http://www.ucd.ie/t4cms/WP11_22.pdf
#---------------------------------------------------------------------
 
mfx <- function(x, sims = 1000){
  set.seed(1984)
  # average density of the link distribution evaluated at the fitted linear index
  pdf <- ifelse(as.character(x$call)[3] == "binomial(link = \"probit\")",
                mean(dnorm(predict(x, type = "link"))),
                mean(dlogis(predict(x, type = "link"))))
  pdfsd <- ifelse(as.character(x$call)[3] == "binomial(link = \"probit\")",
                  sd(dnorm(predict(x, type = "link"))),
                  sd(dlogis(predict(x, type = "link"))))
  # sample average marginal effects: average density times the coefficients
  marginal.effects <- pdf*coef(x)
  # simulate coefficients and density to approximate standard errors
  sim <- matrix(rep(NA, sims*length(coef(x))), nrow = sims)
  for(i in 1:length(coef(x))){
    sim[,i] <- rnorm(sims, coef(x)[i], diag(vcov(x)^0.5)[i])
  }
  pdfsim <- rnorm(sims, pdf, pdfsd)
  sim.se <- pdfsim*sim
  res <- cbind(marginal.effects, sd(sim.se))
  colnames(res)[2] <- "standard.error"
  # drop the intercept row if present
  ifelse(names(x$coefficients[1]) == "(Intercept)",
         return(res[2:nrow(res),]), return(res))
}
 
# marginal effects from logit
mfx(mylogit)
 
 
### code it yourself for marginal effects at the mean
 
summary(dat1)
 
b0 <-  5.92972   # estimated intercept from logit
b1 <-  -0.14099  # estimated b from logit
 
xvar <- 39.5   # reference value (i.e. mean) for explanatory variable
d <- .0001     # incremental change in x
 
xbi <- (xvar + d)*b1 + b0
xbj <- (xvar - d)*b1 + b0
meff <- ((exp(xbi)/(1+exp(xbi)))-(exp(xbj)/(1+exp(xbj))))/(d*2) ;
print(meff)
 
 
### a different perhaps easier formulation for me at the mean
 
XB <- xvar*b1 + b0 # this could be expanded for multiple b's or x's
meffx <- (exp(XB)/((1+exp(XB))^2))*b1
print(meffx)
 
 
### averaging the meff for the whole data set
 
dat1$XB <- dat1$age*b1 + b0
 
meffx <- (exp(dat1$XB)/((1+exp(dat1$XB))^2))*b1
summary(meffx) # get mean
 
#### marginal effects from linear model
 
lpm <- lm(dat1$participate~dat1$age)
summary(lpm)
 
#---------------------------------------------------
#
#
# multivariable case
#
#
#---------------------------------------------------
 
dat2 <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
 
head(dat2) 
 
summary(dat2) # summary stats
 
#### run logistic regression model
 
mylogit <- glm(admit ~ gre + gpa, data = dat2, family = "binomial")
summary(mylogit)
 
exp(cbind(OR = coef(mylogit), confint(mylogit))) # get odds ratios
 
# marginal effects from logit
mfx(mylogit)
 
 
### code it yourself for marginal effects at the mean
 
summary(dat2)
 
b0 <-  -4.949378    # estimated intercept from logit
b1 <-  0.002691     # estimated b for gre
b2 <-   0.754687    # estimated b for gpa
 
x1 <- 587    # reference value (i.e. mean) for gre
x2 <- 3.39   # reference value (i.e. mean) for gpa
d <- .0001   # incremental change in x
 
# meff at means for gre
xbi <- (x1 + d)*b1 + b2*x2 + b0
xbj <- (x1 - d)*b1 + b2*x2 + b0
meff <- ((exp(xbi)/(1+exp(xbi)))-(exp(xbj)/(1+exp(xbj))))/(d*2) ;
print(meff)
 
# meff at means for gpa
xbi <- (x2 + d)*b2 + b1*x1 + b0
xbj <- (x2 - d)*b2 + b1*x1 + b0
meff <- ((exp(xbi)/(1+exp(xbi)))-(exp(xbj)/(1+exp(xbj))))/(d*2) ;
print(meff)
 
 
### a different perhaps easier formulation for me at the mean
 
XB <- x1*b1 +x2*b2 + b0 # this could be expanded for multiple b's or x's
 
# meff at means for gre
meffx <- (exp(XB)/((1+exp(XB))^2))*b1
print(meffx)
 
# meff at means for gpa
meffx <- (exp(XB)/((1+exp(XB))^2))*b2
print(meffx)
 
### averaging the meff for the whole data set
 
dat2$XB <- dat2$gre*b1 + dat2$gpa*b2 + b0
 
# sample avg meff for gre
meffx <- (exp(dat2$XB)/((1+exp(dat2$XB))^2))*b1
summary(meffx) # get mean
 
# sample avg meff for gpa
meffx <- (exp(dat2$XB)/((1+exp(dat2$XB))^2))*b2
summary(meffx) # get mean
 
#### marginal effects from linear model
 
lpm <- lm(admit ~ gre + gpa, data = dat2)
summary(lpm)

Tuesday, May 24, 2016

Data Scientists vs Algorithms vs Real Solutions to Real Problems

A few weeks ago there was a short tweet in my tweetstream that kindled some thoughts.

"All people are biased, that's why we need algorithms!" "All algorithms are biased, that's why we need people!" via @pmarca

And a retweet/reply by Diego Kuonen:

"Algorithms are aids to thinking and NOT replacements for it"

This got me thinking about a lot of the work I have been doing, past job interviews, conversations I have had with headhunters and 'data science' recruiters, and a number of discussions (or sometimes arguments) about what defines data science, about 'unicorns' and 'fake' data scientists. I ran across a couple of interesting perspectives related to this some years back:

What sets data scientists apart from other data workers, including data analysts, is their ability to create logic behind the data that leads to business decisions. "Data scientists extract data, formulate models and apply quantitative analysis in a proactive manner" -Laura Kelley, Vice President, Modis.

"They can suck data out of a server log, a telecom billing file, or the alternator on a locomotive, and figure out what the heck is going on with it. They create new products and services for customers. They can also interface with carbon-based lifeforms — senior executives, product managers, CTOs, and CIOs. You need them." - Can You Live Without a Data Scientist, Harvard Business Review.

I have seen numerous variations on Drew Conway's data science Venn Diagram, but I think Drew still nails it down pretty well. If you can write code, understand statistics, and can apply those skills based on your specific subject matter expertise, to me that meets the threshold of the minimum skills most employers might require for you to do value added work. But these tweets raise the question: for many business problems, do we need algorithms at all, and what kind of people do we really need?

Absolutely there are differences in the skillsets required for machine learning vs traditional statistical inference, and I know there are definitely instances where knowing how to set up a Hadoop cluster can be valuable for certain problems. Maybe you do need a complex algorithm to power a killer app or recommender system.

I think part of the hype and snobbery around the terms data science and data scientist might stem from the fact that they are used in so many different contexts, and mean so many things to so many people, that there is fear the true meaning will be lost, along with one's relevance in this space as a data scientist. It might be better to forget about semantics and just concentrate on the ends we are trying to achieve.

I think a vast majority of businesses really need insights driven by people with subject matter expertise and the ability to clean, extract, analyze, visualize, and, probably most importantly, communicate. Sometimes the business need requires prediction, other times inference. Many times you may not need a complicated algorithm or experimental design at all, not as much as you need someone to make sense of the nasty transactional data your business is producing and summarize it all with something maybe as simple as cross tabs. Sometimes you might need a PhD computer scientist or engineer that can meet the strictest of data science thresholds, but lots of times what you really need is a statistician, econometrician, biometrician, or just a good MBA or business analyst that understands predictive modeling, causal inference, and the basics of a left join. This is why one recommendation may be to pursue 'data scientists' outside of some of the traditional academic disciplines associated with data science.

There was a similar discussion a couple of years ago on the SAS Voices Blog in a post by Tonya Balan titled "Data scientist or statistician: What's in a name?"

"Obviously, there are a lot of opinions about the terminology.  Here’s my perspective:  The rise and fall of these titles points to the rapid pace of change in our industry.  However, there is one thing that remains constant.  As companies collect more and more data, it is imperative that they have individuals who are skilled at extracting information from that data in a way that is relevant to the business, scientifically accurate, and that can be used to drive better decisions, faster.  Whether they call themselves statisticians, data scientists, or SAS users, their end goal is the same."

And really good insight from Evan Stubbs (in the comment section):

"Personally, I think it's just the latest attempt to clarify what's actually an extremely difficult role. I don't think it's synonymous with any existing title (including statistician, mathematician, analyst, data miner, and so on) but on the same note, I don't think it's where we'll finally end up.

From my perspective, the market is trying to create a shorthand signal that describes someone with:

* An applied (rather than theoretical) focus
* A broader (rather than narrow) set of technical skills
* A focus on outcomes over analysis
* A belief in creating processes rather than doing independent activities
* An ability to communicate along with an awareness of organisational psychology, persuasion, and influence
* An emphasis on recommendations over insight

Existing roles and titles don't necessarily identify those characteristics.

While "Data Scientist" is the latest attempt to create an all-encompassing title, I don't think it'll last. On one hand, it's very generic. On the other, it still implies a technical focus - much of the value of these people stems from their ability to communicate, interface with the business, be creative, and drive outcomes. "Data Scientist", to me at least, carries very research-heavy connotations, something that dilutes the applied and (often) political nature of the field. "


One thing is for certain, recruiters might have a better shot at placing candidates for their clients if role descriptions would just say what they mean and leave the fighting over who's a real data scientist to the LinkedIn discussion boards and their tweetstream.

See also:

Economists as Data Scientists

Why Study Agricultural and Applied Economics?

Tuesday, March 22, 2016

Identification and Common Trend Assumptions in Difference-in-Differences for Linear vs GLM Models


In a previous post I discussed the conclusion from Lechner’s paper 'The Estimation of Causal Effects by Difference-in-Difference Methods', that difference-in-difference models in a non-linear or GLM context failed to meet the common trend assumptions, and therefore failed to identify treatment effects from a selection on unobservables context.

In that paper I noted that Lechner points out (quite rigorously in the context of the potential outcomes framework):

"We start with a “natural” nonlinear model with a linear index structure which is transformed by a link function, G(·), to yield the conditional expectation of the potential outcome.....The common trend assumption relies on differencing out specific terms of the unobservable potential outcome, which does not happen in this nonlinear specification... Whereas the linear specification requires the group specific differences to be time constant, the nonlinear specification requires them to be absent. Of course, this property of this nonlinear specification removes the attractive feature that DiD allows for some selection on unobservable group and individual specific differences. Thus, we conclude that estimating a DiD model with the standard specification of a nonlinear model would usually lead to an inconsistent estimator if the standard common trend assumption is upheld. In other words, if the standard DiD assumptions hold, this nonlinear model does not exploit them (it will usually violate them). Therefore, estimation based on this model does not identify the causal effect”

I wanted to review at a high level exactly how he gets to this result. But I wanted to simplify this as much as possible and start with some basic concepts. Starting with a basic regression model, the population conditional expectation function, or conditional mean of Y given X can be written as:

Regression and Expected Value Notation:

E[Y|X] = β0 + β1 X (1)

and we estimate this with the regression on observed data:

y = b0 + b1X + e (2)

Where b1 is our estimate of the population parameter of interest β1.

If E[b1] = β1 then we say our estimator is unbiased.

Potential Outcomes Notation:

When it comes to experimental designs, we are interested in knowing counterfactuals; that is, what value of an outcome would a treatment or program participant have in absence of treatment (the baseline potential outcome) vs. if they participated or were treated? We can specify these 'potential outcomes' as follows:

Y0= baseline potential outcome
Y1= potential treatment outcome

We can characterize the treatment effect as:

E[Y1 - Y0], or the expected difference between potential treated and baseline outcomes. This is referred to as the average treatment effect, or ATE. Sometimes we are interested in, or some models estimate, the average treatment effect on the treated, or ATT: E[Y1 - Y0 | d = 1]

where d is an indicator for treatment (d = 1) vs control or untreated (d =0).

Difference-in-Difference Analysis:

Difference-in-difference (DD) estimators assume that in absence of treatment the difference between control (B) and treatment (A) groups would be constant or ‘fixed’ over time. Treatment effects in DD estimators are derived by subtracting differences between pre and post values within treatment and control groups, and then taking a difference in differences between treatment and control groups. The unobservable effects that are constant or fixed over time 'difference out', allowing us to identify treatment effects controlling for these unobservable characteristics without explicitly measuring them. This characterizes what is referred to as a 'selection on unobservables' framework.


  This can also be estimated using linear regression with an interaction term:

y = b0 + b1 d + b2 t + b3 d*t+ e (3)

where d indicates treatment (d=1 vs d = 0) and the estimated coefficient (b3 ) on the time by treatment interaction term gives us our estimate of treatment effects. 
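
To make the mechanics concrete, here is a minimal simulated sketch in R (all parameter values are hypothetical, chosen only for illustration):

# simulated 2x2 DID: group gap = 1 (differences out), common trend = 0.5,
# and a true treatment effect of 2
set.seed(123)
cells <- expand.grid(d = 0:1, t = 0:1)
df <- cells[rep(1:4, each = 500), ]
df$y <- 1*df$d + 0.5*df$t + 2*df$d*df$t + rnorm(nrow(df))
coef(lm(y ~ d*t, data = df))["d:t"]  # recovers the treatment effect, ~2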


Lechner and Potential Outcomes Framework:

In an attempt to present the issues with GLM DD models depicted in Lechner (2010) using the simplest notation possible (abusing notation slightly and perhaps at a cost of precision), we can depict the framework for difference-in-difference analysis using expectations:

DID = [E(Y1|D=1) - E(Y0|D=1)] - [E(Y1|D=0) - E(Y0|D=0)] (4)


DID = [pre/post differences for treatment group] – [pre/post differences for control group]

where Y represents the observed outcome values, subscripted by pre (0) and post (1) periods.

We can represent potential outcomes in the regression framework as follows:

E(Yt1|D) = α + tδ1 + dγ “potential outcome if treated” (5)

E(Yt0|D) = α + tδ0 + dγ “potential baseline outcome” (6)

ATET: E(Y11 - Y10|D=1) = θ1 = δ1 - δ0 ≡ δ (7)

“difference-in-difference of potential outcomes across time if treated”

We can estimate δ with a regression on observed data of the form:

y = b0 + b1 d + b2 t + b3 d*t+ e (3')

where b3 is our estimator for δ.

Common Trend Assumption:
Difference-in-difference (DD) estimators assume that in absence of treatment the difference between control (B) and treatment (A) groups would be constant or ‘fixed’ over time. This can be represented geometrically in a linear modeling context by 'parallel trends' in outcome levels between treatment and control groups in absence of a treatment:


As depicted above, BB represents the trend in outcome Y for a control group. AA represents the counterfactual trend, or parallel or common trend for the treatment group that would occur in absence of treatment. The distance A'A represents a departure from the parallel trend in response to treatment, and would be our DD treatment effect or the value b3 our estimator for δ.

The common trend assumption, following Lechner, can be expressed in terms of potential outcomes:

E(Y10|D=1) - E(Y00|D=1) = α + δ0 + γ - α - γ = δ0 (8)

E(Y10|D=0) - E(Y00|D=0) = α + δ0 - α = δ0 (9)

i.e., the pre to post difference in baseline outcomes is the same (δ0) regardless of whether individuals are assigned to the treatment group (D=1) or the control group (D=0).

Nonlinear Models:

In a GLM framework, with a specific link function G(.) a DD framework can be expressed in terms of potential outcomes as follows:

E(Yt1|D) = G(α + tδ1 + dγ) “potential outcome if treated” (10)

E(Yt0|D) = G(α + tδ0 + dγ) “potential baseline outcome” (11)

DID can be estimated by regression on observed outcomes:

E(y|d,t) = G(b0 + b1 d + b2 t + b3 d*t) (12)

Common Trend Assumption:

E(Y10|D=1) - E(Y00|D=1) = G(α + δ0 + γ) - G(α + γ) (13)

E(Y10|D=0) - E(Y00|D=0) = G(α + δ0) - G(α) (14)

It turns out that in a GLM framework, for the common trend assumption to hold, group specific differences must be zero, i.e. γ = 0. The common trend assumption relies on differencing out specific terms of the unobservable potential outcome, the individual specific effects we are trying to control for in the selection on unobservables scenario, but in a GLM setting we have to assume those effects are absent. In essence, the attractive feature of DD models, controlling for unobservable effects, is lost in a GLM setting.
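
A small numerical illustration of the contrast in R (the parameter values are made up): with the identity link the group effect γ differences out, but with a logistic link it does not unless γ = 0:

a <- 0.2; d0 <- 0.5; g <- 1.0       # hypothetical intercept, trend, and group effect
# identity link: pre/post baseline change is d0 for both groups
(a + d0 + g) - (a + g)              # treated group: 0.5
(a + d0) - a                        # control group: 0.5
# logistic link: the changes differ, so the common trend fails
plogis(a + d0 + g) - plogis(a + g)  # treated group: ~0.077
plogis(a + d0) - plogis(a)          # control group: ~0.118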
References: 

Lechner, Michael. The Estimation of Causal Effects by Difference-in-Difference Methods. Foundations and Trends in Econometrics, Vol. 4, No. 3 (2010) 165–224.

Program Evaluation and the Difference-in-Difference Estimator. Course Notes, Education Policy and Program Evaluation, Vanderbilt University. October 4, 2008.

Evans, William N. Difference in Difference Models. Course Notes, ECON 47950: Methods for Inferring Causal Relationships in Economics, University of Notre Dame. Spring 2008.
 

Friday, March 11, 2016

Marginal Effects vs Odds Ratios

Models of binary dependent variables often are estimated using logistic regression or probit models, but the estimated coefficients (or exponentiated coefficients expressed as odds ratios) are often difficult to interpret from a practical standpoint. Empirical economic research often reports ‘marginal effects’, which are more intuitive but often more difficult to obtain from popular statistical software. The most straightforward way to obtain marginal effects is from estimation of linear probability models. This paper uses a toy data set to demonstrate the calculation of odds ratios and marginal effects from logistic regression using SAS and R, while comparing them to the results from a standard linear probability model.

Suppose we have a data set that looks at program participation (for some program or product or service of interest) by age and we want to know the influence of age on the decision to participate. Our data may look something like the excerpt below:

participation     age
1                       25
1                       26
1                       27
1                       28
1                       29
1                       30
0                       31
1                       32
1                       33
0                       34

Theoretically, this might call for logistic regression to model a dichotomous outcome like participation, so we could use SAS or R to get the following results:

                   Estimate Std. Error z value Pr(>|z|) 
(Intercept)  5.92972    2.34258   2.531   0.0114 *
age            -0.14099    0.05656  -2.493   0.0127 *

                    OR               2.5 %       97.5 %
(Intercept)   376.049897 6.2769262 7.864410e+04
age              0.868502 0.7641126 9.595017e-01

While the estimated coefficients from logistic regression are not easily interpretable (they represent the change in the log odds of participation for a given change in age), odds ratios might provide a better summary of the effects of age on participation (odds ratios are derived from exponentiation of the estimated coefficients from logistic regression; see also: The Calculation and Interpretation of Odds Ratios) and may be somewhat more meaningful. We can see the odds ratio associated with age is .8685, which implies that for every year increase in age the odds of participation are about (.8685 - 1)*100 = -13.15%, or 13.15% lower. You tell me what this means if this is the way you think about the likelihood of outcomes in everyday life!
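
As a quick check of that arithmetic in R, using the age coefficient reported above:

exp(-0.14099)              # odds ratio, ~0.8685
(exp(-0.14099) - 1) * 100  # ~ -13.15% change in the odds of participation per year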

Marginal effects are an alternative metric that can be used to describe the impact of age on participation. Marginal effects can be described as the change in the outcome as a function of a change in the treatment (or independent variable of interest), holding all other variables in the model constant. In linear regression, the estimated regression coefficients are marginal effects and are more easily interpreted (more on this later). Marginal effects can be output easily from STATA; however, they are not directly available in SAS or R, though there are some ad hoc ways of getting them, which I will demonstrate here (there are also some packages in R to assist with this). I am basing most of this directly on two very good blog posts on the topic:

https://statcompute.wordpress.com/2012/09/30/marginal-effects-on-binary-outcome/ 
https://diffuseprior.wordpress.com/2012/04/23/probitlogit-marginal-effects-in-r-2/ 

One approach is to use PROC QLIM and request output of marginal effects. This computes a marginal effect for each observation's value of x in the data set (because marginal effects may not be constant across the range of explanatory variables). Taking the average of this result gives an estimated 'sample average marginal effect': -.0258

This tells us that for every year increase in age the probability of participation decreases on average by about 2.6 percentage points. For most people, for practical purposes, this is probably a more useful interpretation of the relationship between age and participation compared to odds ratios. We can calculate this more directly (following the code from the blog post by WenSui Liu) using output from logistic regression and the data step in SAS. Basically, for each observation in the data set calculate:

MARGIN_AGE = EXP(XB) / ((1 + EXP(XB)) ** 2) * (-0.1410);

Where -.1410 is the estimated coefficient on age from the original logistic regression model. We can run the same analysis in R, either replicating the results from the data step above, or using the mfx function defined by Alan Fernihough referenced in the diffuseprior blog post mentioned above or the paper referenced below.
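
As a minimal sketch, the same sample average calculation in R (assuming the mylogit model of participation on age fitted as in the previous post):

xb <- predict(mylogit, type = "link")                   # fitted linear index for each observation
mean(exp(xb) / (1 + exp(xb))^2 * coef(mylogit)["age"])  # ~ -0.0258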

The paper notes that this function gives similar results to the mfx function in STATA. And we get almost the same results we got from SAS above but additionally provides bootstrapped standard errors :

marginal.effects   standard.error
      -0.0258330        0.6687069

Marginal Effects from Linear Probability Models

Earlier I mentioned that you could estimate marginal effects directly from the estimated coefficients from a linear probability model. While in some circles LPMs are not viewed favorably, they have a strong following among applied econometricians (see references for more on this). As Angrist and Pischke state in their very popular book Mostly Harmless Econometrics:

"While a nonlinear model may fit the CEF (population conditional expectation function) for LDVs (limited dependent variables) more closely than a linear model, when it comes to marginal effects, this probably matters little"

Using SAS or R we can get the following results from estimating a LPM for this data:

 Coefficients:
                   Estimate    Std. Error  t value    Pr(>|t|)  
(Intercept)  1.700260   0.378572   4.491     0.000111 ***
dat1$age    -0.028699   0.009362  -3.065   0.004775 **

You can see that the estimate from the linear probability model above gives us a marginal effect (-.028699) quite close to the previous estimates derived from logistic regression, as is often the case, and as indicated by Angrist and Pischke.

In the SAS ETS example cited in the references below, a distinction is made between calculating sample average marginal effects (which were discussed above) vs. calculating marginal effects at the mean:

“To evaluate the "average" or "overall" marginal effect, two approaches are frequently used. One approach is to compute the marginal effect at the sample means of the data. The other approach is to compute marginal effect at each observation and then to calculate the sample average of individual marginal effects to obtain the overall marginal effect. For large sample sizes, both the approaches yield similar results. However for smaller samples, averaging the individual marginal effects is preferred (Greene 1997, p. 876)”


For a step by step review of the SAS and R code presented above as well as an additional example with multiple variables see:

Matt Bogard. "Comparing Odds Ratios and Marginal Effects from Logistic Regression and Linear Probability Models" Staff Paper (2016)
Available at: http://works.bepress.com/matt_bogard/30/ 

References: 

Simple logit and probit marginal effects in R.  https://ideas.repec.org/p/ucn/wpaper/201122.html


SAS/ETS Web Examples Computing Marginal Effects for Discrete Dependent Variable Models. http://support.sas.com/rnd/app/examples/ets/margeff/ 

Linear Regression and Analysis of Variance with a Binary Dependent Variable (from EconomicSense, by Matt Bogard).

Angrist, Joshua D. & Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press. NJ. 2008.

Probit better than LPM? http://www.mostlyharmlesseconometrics.com/2012/07/probit-better-than-lpm/ 

Love It or Logit. By Marc Bellemare. marcfbellemare.com/wordpress/9024

R Data Analysis Examples: Logit Regression. From http://www.ats.ucla.edu/stat/r/dae/logit.htm   (accessed March 4,2016).

Greene, W. H. (1997), Econometric Analysis, Third edition, Prentice Hall, 339–350.

Wednesday, March 9, 2016

What's the difference between difference-in-difference models in a linear vs nonlinear context?

A while back I discussed a powerful methodology for identification of causal effects from both a selection on observables and unobservables context, namely combining propensity score matching and difference-in-differences. 

But recently I ran across a tweet from Felix Bethke (https://twitter.com/F_Bethke) sharing a blog post by Tom Pepinsky related to plug and play models. At the risk of oversimplifying, the takeaway was that we can't just take a methodology like DID used in a standard linear regression context, 'plug it into' a non-linear context, and necessarily get the same results. (Often we see the argument going the other way around, that we can't use linear models in a non-linear context, but that is a different battle for another day.) I highly recommend Tom's post for more details; he links to a number of papers that clarify the issues in a very technical sense.

In a linear difference-in-difference (DID) analysis, identification of causal effects hinges on a common trend assumption and the interpretation of the estimated regression coefficient on the time x treatment interaction term:

y = b0 + b1 x + b2 t+b3 x*t + e

In Tom's post, and in some of the papers, specific attention is given to how the interpretation of the interaction term (our estimated treatment effect, b3, in a specification like the one above) changes in a logit or probit context, where it becomes something quite different from the causal effect of interest.

I was specifically interested in knowing: is this an issue just for probit and logit models, or for other nonlinear models, like GLM models in general? For instance, in the healthcare economics literature, it's very common to use probit or logit models in a two part modeling context where the second part of the two part model is a GLM with a log link and gamma distribution. And I have seen some papers using difference-in-differences across the board with these models.

I took a look at a couple of papers and it appears that these issues are a concern for any GLM model.

In a Health Services Research paper, Karaca-Mandic et al. discuss these issues and in the abstract imply that this applies to the log transformed models often used in healthcare economics:

"We discuss the motivation for including interaction terms in multivariate analyses. We then explain how the straightforward interpretation of interaction terms in linear models changes in nonlinear models, using graphs and equations. We extend the basic results from logit and probit to difference‐in‐differences models, models with higher powers of explanatory variables, other nonlinear models (including log transformation and ordered models), and panel data models."

After pointing out several issues, they state:

"It is important to understand that the issues about interaction terms discussed here apply to all nonlinear models, including log transformation models"

More specifically, what are these issues, at least at a high level? Recall, difference-in-difference models are a special case of fixed effects panel data models, where unobserved differences and individual specific effects essentially cancel out, providing clean identification of causal effects. For this to work in the DID framework, a common trend assumption is required. In the paper referenced below, Lechner points out (quite rigorously, in the context of the potential outcomes framework):

"We start with a “natural” nonlinear model with a linear index structure which is transformed by a link function, G(·), to yield the conditional expectation of the potential outcome.....The common trend assumption relies on differencing out specific terms of the unobservable potential outcome, which does not happen in this nonlinear specification... Whereas the linear specification requires the group specific differences to be time constant, the nonlinear specification requires them to be absent. Of course, this property of this nonlinear specification removes the attractive feature that DiD allows for some selection on unobservable group and individual specific differences. Thus, we conclude that estimating a DiD model with the standard specification of a nonlinear model would usually lead to an inconsistent estimator if the standard common trend assumption is upheld. In other words, if the standard DiD assumptions hold, this nonlinear model does not exploit them (it will usually violate them). Therefore, estimation based on this model does not identify the causal effect "

Because they demonstrate that this applies to any GLM specification/link function, this seems to strike a blow to using DID in the context of a lot of the modeling approaches used in healthcare economics or any other field relying on similar GLM specifications.

So, as Angrist and Pischke might ask, what is an applied guy to do? One approach, even in the context of skewed distributions with high mass points (as is common in the healthcare econometrics space), is to specify a linear model. For dichotomous outcomes (utilization measures like ER visits or hospital admissions are often dichotomized and modeled by logit or probit models) you can just use a linear probability model. For skewed distributions with heavy mass points, dichotomization combined with an LPM may also be an attractive alternative.

References:

Special thanks to tweets and additional input from Tom Pepinsky and Marc Bellemare.

Interaction Terms in Nonlinear Models
Pinar Karaca-Mandic, Edward C. Norton, and Bryan Dowd
HSR: Health Services Research 47:1, Part I (February 2012)

The Estimation of Causal Effects by Difference-in-Difference Methods
By Michael Lechner Foundations and Trends in Econometrics
Vol. 4, No. 3 (2010) 165–224

Tuesday, March 8, 2016

Applied Econometrics in One Lesson

When it comes to the challenging problems of causal inference (all the issues we encounter that create the gaps between textbook and applied econometrics) I think the best advice I have seen as an applied researcher comes from Marc Bellemare:

Do Both!!

Which seems to be a big takeaway from Angrist and Pischke's  Mostly Harmless Econometrics:

"So what's an applied guy to do? One answer, as always, is to check the robustness of your findings using alternative identifying assumptions. That means that you would like to find broadly similar results using plausible alternative models" 

That's applied econometrics in one lesson. That's the credibility revolution in practice.

Saturday, March 5, 2016

Machine Learning and Econometrics

Not long ago Tyler Cowen blogged at Marginal Revolution about a Quora post by Susan Athey discussing the impact of machine learning on econometrics, flavors of machine learning, and differences in the emphasis placed on tools and methodologies traditional in each field. The differences often hinge on whether one's intention is to explain or predict,  or if one is interested in causal inference vs analytics. I really liked the point about instrumental variables made in the snippet below:

"Yet, a cornerstone of introductory econometrics is that prediction is not causal inference, and indeed a classic economic example is that in many economic datasets, price and quantity are positively correlated.  Firms set prices higher in high-income cities where consumers buy more; they raise prices in anticipation of times of peak demand. A large body of econometric research seeks to REDUCE the goodness of fit of a model in order to estimate the causal effect of, say, changing prices. If prices and quantities are positively correlated in the data, any model that estimates the true causal effect (quantity goes down if you change price) will not do as good a job fitting the data….Techniques like instrumental variables seek to use only some of the information that is in the data – the “clean” or “exogenous” or “experiment-like” variation in price—sacrificing predictive accuracy in the current environment to learn about a more fundamental relationship that will help make decisions about changing price. This type of model has not received almost any attention in ML."

Tyler also points to a wealth of resources by Susan Athey here. And check out the mini-course she taught with Guido Imbens via NBER.

The differences and synergies between the tools used in econometrics and machine learning are something I have been interested in for a long time and have blogged about several times in the past. Kenneth Sanford and Hal Varian have been writing about this as well. See related content below.

Related Content and Further Reading

Economists as Data Scientists http://econometricsense.blogspot.com/2012/10/economists-as-data-scientists.html

Econometrics, Math, and Machine Learning….what? http://econometricsense.blogspot.com/2015/09/econometrics-math-and-machine.html 

"Mathematical Themes in Economics, Machine Learning, and Bioinformatics" (2010)
Available at: http://works.bepress.com/matt_bogard/7/ 

Notes to 'Support' an Understanding of Support Vector Machines  http://econometricsense.blogspot.com/2012/05/notes-to-support-understanding-of.html

Culture War: Classical Statistics vs. Machine Learning http://econometricsense.blogspot.com/2011/01/classical-statistics-vs-machine.html

Analytics vs Causal Inference http://econometricsense.blogspot.com/2014/01/analytics-vs-causal-inference.html

Big Data: Don’t throw the baby out with the bath water http://econometricsense.blogspot.com/2014/05/big-data-dont-throw-baby-out-with.html

To Explain or Predict http://econometricsense.blogspot.com/2015/03/to-explain-or-predict.html 

Big Data: Causality and Local Expertise Are Key in Agronomic Applications http://econometricsense.blogspot.com/2014/05/big-data-think-global-act-local-when-it.html


Big Data:  New Tricks for Econometrics
Hal R. Varian
June 2013
Revised:  April 14, 2014
http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf

Is machine learning trending with economists? (Kenneth Sanford)  http://blogs.sas.com/content/subconsciousmusings/2015/06/05/is-machine-learning-trending-with-economists/