Sunday, August 7, 2016

The State of Applied Econometrics: Imbens and Athey on Causality, Machine Learning, and Econometrics

I recently ran across:

The State of Applied Econometrics - Causality and Policy Evaluation
Susan Athey, Guido Imbens 

A nice read, although I skipped directly to the section on machine learning. A few interesting causality/machine learning comments.

They discussed some known issues related to estimating propensity scores using various machine learning algorithms in terms of the sensitivity of results, especially for propensity scores close to 0 or 1. They discuss trimming weights as one possible approach, which I have heard before in Angrist and Pischke and other work (see below). In fact, in a working paper where I employed gradient boosting to estimate propensity scores for IPTW regression, I trimmed weights. However, I did not trim them for the stratified matching estimator that I also used. I wish I still had the data because I would like to see the impact on my previous results.
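As a rough sketch of what that trimming looks like in practice (simulated data; the .10/.90 cutoffs and variable names are illustrative, not the code from the working paper):

```r
# Hypothetical sketch: trim extreme propensity scores before IPTW
# ps = estimated propensity score, treat = 0/1 treatment indicator
set.seed(123)
treat <- rbinom(200, 1, 0.4)
ps <- plogis(rnorm(200, mean = ifelse(treat == 1, 0.5, -0.5), sd = 1.5))

# IPTW weights: 1/ps for treated, 1/(1-ps) for controls
w <- ifelse(treat == 1, 1/ps, 1/(1 - ps))

# screen out observations with ps > .90 or ps < .10
keep <- ps >= 0.10 & ps <= 0.90
summary(w[keep])  # trimmed weights are bounded above by 1/.10 = 10
```

Trimming caps the influence any single observation can have on the weighted estimate, at the cost of changing the population the estimate applies to.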

Another interesting application discussed in this paper was a two (or 3?) stage LASSO estimation (they actually have a great overall discussion of penalized regression and regularization in machine learning) where they mention first running LASSO to select variables related to the outcome of interest, second running LASSO to select for variables related to selection, and finally running OLS to estimate a causal model that includes the selected variables from the previous LASSO methods.
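This is essentially the post-double-selection idea often associated with Belloni, Chernozhukov, and Hansen. A minimal sketch in R, assuming the glmnet package is installed and using simulated data (variable names and effect sizes are mine, purely for illustration):

```r
# Sketch of double-selection LASSO followed by OLS on the union of
# selected controls; simulated data, illustrative only
library(glmnet)
set.seed(42)
n <- 500; p <- 20
X <- matrix(rnorm(n * p), n, p)
d <- as.numeric(X[, 1] + X[, 2] + rnorm(n) > 0)  # treatment depends on X1, X2
y <- 1 + 2 * d + X[, 2] + X[, 3] + rnorm(n)      # outcome depends on d, X2, X3

sel <- function(fit) which(coef(fit, s = "lambda.min")[-1] != 0)
s1 <- sel(cv.glmnet(X, y))                       # LASSO for the outcome
s2 <- sel(cv.glmnet(X, d, family = "binomial"))  # LASSO for selection
controls <- union(s1, s2)                        # union of selected variables

ols <- lm(y ~ d + X[, controls, drop = FALSE])   # final OLS causal model
coef(ols)["d"]                                   # estimate of the effect of d
```

Selecting on both equations guards against dropping a confounder that predicts selection strongly but the outcome only weakly (or vice versa).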

The paper covers a range of other topics including decision trees, random forests, distinctions between traditional econometrics and machine learning, instrumental variables, etc.

Some Additional Notes and References:

Multiple Algorithms (CART/Logistic Regression/Boosting/Random Forests) with PS weights and trimming:

Following Angrist and Pischke, I present results for regressions utilizing data that has been 'screened' by eliminating observations where ps > .90 or ps < .10, using the R 'MatchIt' package.

Estimating the Causal Effect of Advising Contacts on Fall to Spring Retention Using Propensity Score Matching and Inverse Probability of Treatment Weighted Regression

Matt Bogard, Western Kentucky University


In the fall of 2011 academic advising and residence life staff working for a southeastern university utilized a newly implemented advising software system to identify students based on attrition risk. Advising contacts, appointments, and support services were prioritized based on this new system and information regarding the characteristics of these interactions was captured in an automated format. It was the goal of this study to investigate the impact of this advising initiative on fall to spring retention rates. It is a challenge on college campuses to evaluate interventions that are often independent and decentralized across many university offices and organizations. In this study propensity score methods were utilized to address issues related to selection bias. The findings indicate that advising contacts associated with the utilization of the new software had statistically significant impacts on fall to spring retention for first year students on the order of a 3.26 point improvement over comparable students that were not contacted.

Suggested Citation

Matt Bogard. 2013. "Estimating the Causal Effect of Advising Contacts on Fall to Spring Retention Using Propensity Score Matching and Inverse Probability of Treatment Weighted Regression" The SelectedWorks of Matt Bogard
Available at:

Friday, July 29, 2016

Heckman...what the heck?

A while back I was presenting some research I did that involved propensity score matching, and I was asked why I did not utilize a Heckman model. My response was that I viewed my selection issues from the context of the Rubin causal model and a selection on observables framework. And truthfully, I was not that familiar with Heckman. It is interesting that in Angrist and Pischke's Mostly Harmless Econometrics, Heckman is given scant attention. However, here are some of the basics:

Some statistical pre-requisites:
Incidental truncation: we do not observe y due to the effect of another variable z. This results in a truncated distribution of y:

f(y|z > a) = f(y,z)/Prob(z > a)  (1)

This is a ratio of a density to a probability. Working through the expectation of a truncated normal variable yields the ratio of the standard normal density to its cumulative distribution function, φ/Φ, referred to as the inverse Mill’s ratio or selection hazard, denoted λ below.

Of major interest is the expected value of a truncated normal variable:

E(y|z > a) = µ + ρσλ  (2)

where λ is the inverse Mill’s ratio, ρ the correlation between y and z, and σ the standard deviation.

Application: The Heckman model is often used in contexts of truncation, incidental truncation, or selection, where we only observe an outcome conditional on a decision to participate or self-select into a program or treatment. A popular example is the observation of wages only for people who choose to work, or outcomes for people who choose to participate in a job training or coaching program.

Estimation: Estimation is a two step process.

Step 1: Selection Equation

Z = wγ + µ  (3)

For each observation compute λ_hat = φ(wγ)/Φ(wγ)  (4) from the probit estimates of the selection equation.

Step 2: Outcome Equation

y|z > 0 = xβ + βλ λ_hat + v  (5) where βλ = ρσ (note the similarity to (2))

One way to think about λ is in terms of correlation between the treatment variable and the error term, in the context of omitted variable bias and endogeneity:

Y= a + xc + bs  + e  (6)

where s = selection or treatment indicator

If the conditional independence or selection on observables assumption does not hold, i.e. there are factors related to selection not controlled for by x, then we have omitted variable bias and correlation between 's' and the error term 'e'. This results in endogeneity and biased estimates of the treatment effect 'b'.

If we characterize the relationship between e and s as λ = E(e | s,x)  (7)

then the Heckman model consists of deriving an estimate of λ and including it in a regression as previously illustrated:

Y= a + xc + bs  + hλ + e  (8)

As stated (paraphrasing somewhat) in Briggs (2004):

"The Heckman model goes from specifying a selection model to getting an estimate for the bias term E(e | s,x) by estimating the expected value of a truncated normal random variable. This estimate is known in the literature as the Mills ratio or hazard function, and can be expressed as the ration of the standard normal density function to the cumulative distribution."

The Heckman model is powerful because it handles selection bias from both a selection on observables and unobservables context. There are however a number of assumptions involved that could limit its use. For more details I recommend the article by Briggs in the references below.
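The two-step procedure above can be sketched on simulated data (illustrative only; the data-generating values are made up, and in practice the sampleSelection package's heckit function handles this, including corrected standard errors):

```r
# Hedged sketch of Heckman's two-step estimator on simulated data
set.seed(101)
n <- 2000
w <- rnorm(n)                          # selection covariate
x <- w + rnorm(n, sd = 0.6)            # outcome covariate, correlated with w
u <- rnorm(n)                          # selection equation error
v <- 0.5*u + rnorm(n, sd = 0.5)        # outcome error, correlated with u
z <- as.numeric(w + u > 0)             # participation indicator
y <- ifelse(z == 1, 1 + 2*x + v, NA)   # y observed only for participants

# Step 1: probit selection equation, then the inverse Mill's ratio (eq 4)
probit <- glm(z ~ w, family = binomial(link = "probit"))
wg <- predict(probit, type = "link")
lambda <- dnorm(wg)/pnorm(wg)

# Step 2: outcome equation on the selected sample including lambda (eq 5)
naive <- lm(y ~ x, subset = z == 1)           # ignores selection
step2 <- lm(y ~ x + lambda, subset = z == 1)  # selection-corrected
coef(step2)
```

Here the naive regression understates the true slope on x because selection is correlated with x; including λ recovers it.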


 Journal of Educational and Behavioral Statistics
 Winter 2004, Vol. 29, No. 4, pp. 397-420
 Causal Inference and the Heckman Model
 Derek C. Briggs

 Selection Bias - What You Don't Know Can Hurt Your Bottom Line. Gaétan Veilleux, Valen. Casualty Actuarial Society presentation.

Wednesday, May 25, 2016

Divide by 4 Rule for Marginal Effects

Previously I wrote about the practical differences between marginal effects and odds ratios with regard to logistic regression.

Recently, I ran across a tweet from Michael Grogan linking to one of his posts using logistic regression to model dividend probabilities. This really got me interested:

"Moreover, to obtain a measure in probability terms – one could divide the logit coefficient by 4 to obtain an estimate of the probability of a dividend increase. In this case, the logit coefficient of 0.8919 divided by 4 gives us a result of 22.29%, which means that for every 1 year increase in consecutive dividend increases, the probability of an increase in the dividend this year rises by 22.29%."

I had never heard of this 'divide by 4' shortcut to get to marginal effects. While you can get those in Stata, R, or SAS with a little work, I think this trick would be very handy, for instance, if you are reading someone else’s paper/results and just want a ballpark on marginal effects (instead of interpreting odds ratios).

I did some additional investigation on this and ran across Stephen Turner's Getting Genetics Done blog post related to this, where he goes a little deeper into the mathematics behind this:

"The slope of this curve (1st derivative of the logistic curve) is maximized at a+ßx=0, where it takes on the value:



So you can take the logistic regression coefficients (not including the intercept) and divide them by 4 to get an upper bound of the predictive difference in probability of the outcome y=1 per unit increase in x."

Stephen points to Andrew Gelman, who may be the originator of this, citing the text Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman, Jennifer Hill. There is some pushback in the comments to Stephen's post, but I still think this is a nice shortcut for an on the fly interpretation of reported results.
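The arithmetic behind the rule is easy to verify in R: the logistic density peaks at exactly 1/4, so a coefficient's largest possible marginal effect is b/4:

```r
# The logistic pdf attains its maximum of 1/4 at zero, so the largest
# marginal effect implied by a logit coefficient b is b * 1/4
dlogis(0)          # logistic density at zero = 0.25
b <- -0.14099      # the age coefficient from the logit below
b * dlogis(0)      # the divide-by-4 bound, about -0.0352
```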

If you go back to the output from my previous post on marginal effects, the estimated logistic regression coefficients were:

            Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.92972    2.34258   2.531   0.0114 *
age         -0.14099    0.05656  -2.493   0.0127 *

And if you apply the divide by 4 rule you get:

-0.14099 / 4 = -.0352

While this is not a proof, or even a simulation, it is close to the minimum for this data (which, since these marginal effects are negative, is the bound in magnitude; see full R program below):

   Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.03525 -0.03262 -0.02697 -0.02583 -0.02030 -0.01071

R Code:

 # PROGRAM NAME: MEFF and Odds Ratios
 # DATE: 3/3/16
 # PROJECT FILE:
 # Based on:
 # Simple Logit and Probit Marginal Effects in R
 # Alan Fernihough, University College Dublin
 # WP11/22, October 2011
#   generate data for continuous explanatory variable
Input = ("participate age
1 25
1 26
1 27
1 28
1 29
1 30
0 31
1 32
1 33
0 34
1 35
1 36
1 37
1 38
0 39
0 40
0 41
1 42
1 43
0 44
0 45
1 46
1 47
0 48
0 49
0 50
0 51
0 52
1 53
0 54")
dat1 <-  read.table(textConnection(Input),header=TRUE)
summary(dat1) # summary stats
#### run logistic regression model
mylogit <- glm(participate ~ age, data = dat1, family = "binomial")
exp(cbind(OR = coef(mylogit), confint(mylogit))) # get odds ratios
# marginal effects calculations
# mfx function for marginal effects from a glm model
# from:
# based on:
# Simple Logit and Probit Marginal Effects in R
# Alan Fernihough, University College Dublin
# WP11/22, October 2011
mfx <- function(x,sims=1000){
# average density of the link function over the sample (probit vs logit)
pdf <- ifelse(as.character(x$call)[3]=="binomial(link = \"probit\")",
mean(dnorm(predict(x, type = "link"))),
mean(dlogis(predict(x, type = "link"))))
pdfsd <- ifelse(as.character(x$call)[3]=="binomial(link = \"probit\")",
sd(dnorm(predict(x, type = "link"))),
sd(dlogis(predict(x, type = "link"))))
# average marginal effects = average density * coefficients
marginal.effects <- pdf*coef(x)
# simulate coefficient and density draws to approximate standard errors
sim <- matrix(rep(NA,sims*length(coef(x))), nrow=sims)
for(i in 1:length(coef(x))){
sim[,i] <- rnorm(sims,coef(x)[i],diag(vcov(x)^0.5)[i])
}
pdfsim <- rnorm(sims,pdf,pdfsd)
sim.se <- pdfsim*sim
res <- cbind(marginal.effects, standard.error = apply(sim.se,2,sd))
res
}
# marginal effects from logit
### code it yourself for marginal effects at the mean
b0 <-  5.92972   # estimated intercept from logit
b1 <-  -0.14099  # estimated b from logit
xvar <- 39.5   # reference value (i.e. mean) for explanatory variable
d <- .0001     # incremental change in x
xbi <- (xvar + d)*b1 + b0
xbj <- (xvar - d)*b1 + b0
meff <- ((exp(xbi)/(1+exp(xbi)))-(exp(xbj)/(1+exp(xbj))))/(d*2) ;
### a different perhaps easier formulation for me at the mean
XB <- xvar*b1 + b0 # this could be expanded for multiple b's or x's
meffx <- (exp(XB)/((1+exp(XB))^2))*b1
### averaging the meff for the whole data set
dat1$XB <- dat1$age*b1 + b0
meffx <- (exp(dat1$XB)/((1+exp(dat1$XB))^2))*b1
summary(meffx) # get mean
#### marginal effects from linear model
lpm <- lm(dat1$participate~dat1$age)
# multivariable case
dat2 <- read.csv("")
summary(dat2) # summary stats
#### run logistic regression model
mylogit <- glm(admit ~ gre + gpa, data = dat2, family = "binomial")
exp(cbind(OR = coef(mylogit), confint(mylogit))) # get odds ratios
# marginal effects from logit
### code it yourself for marginal effects at the mean
b0 <-  -4.949378    # estimated intercept from logit
b1 <-  0.002691     # estimated b for gre
b2 <-   0.754687    # estimated b for gpa
x1 <- 587    # reference value (i.e. mean) for gre
x2 <- 3.39   # reference value (i.e. mean) for gpa
d <- .0001   # incremental change in x
# meff at means for gre
xbi <- (x1 + d)*b1 + b2*x2 + b0
xbj <- (x1 - d)*b1 + b2*x2 + b0
meff <- ((exp(xbi)/(1+exp(xbi)))-(exp(xbj)/(1+exp(xbj))))/(d*2) ;
# meff at means for gpa
xbi <- (x2 + d)*b2 + b1*x1 + b0
xbj <- (x2 - d)*b2 + b1*x1 + b0
meff <- ((exp(xbi)/(1+exp(xbi)))-(exp(xbj)/(1+exp(xbj))))/(d*2) ;
### a different perhaps easier formulation for me at the mean
XB <- x1*b1 +x2*b2 + b0 # this could be expanded for multiple b's or x's
# meff at means for gre
meffx <- (exp(XB)/((1+exp(XB))^2))*b1
# meff at means for gpa
meffx <- (exp(XB)/((1+exp(XB))^2))*b2
### averaging the meff for the whole data set
dat2$XB <- dat2$gre*b1 + dat2$gpa*b2 + b0
# sample avg meff for gre
meffx <- (exp(dat2$XB)/((1+exp(dat2$XB))^2))*b1
summary(meffx) # get mean
# sample avg meff for gpa
meffx <- (exp(dat2$XB)/((1+exp(dat2$XB))^2))*b2
summary(meffx) # get mean
#### marginal effects from linear model
lpm <- lm(admit ~ gre + gpa, data = dat2)

Tuesday, May 24, 2016

Data Scientists vs Algorithms vs Real Solutions to Real Problems

A few weeks ago there was a short tweet in my tweetstream that kindled some thoughts.

"All people are biased, that's why we need algorithms!" "All algorithms are biased, that's why we need people!" via @pmarca

And a retweet/reply by Diego Kuonen:

"Algorithms are aids to thinking and NOT replacements for it"

This got me thinking about  a lot of work that I have been doing,  past job interviews and conversations I have had with headhunters and 'data science' recruiters, as well as a number of discussions or sometimes arguments about what defines data science and about 'unicorns' and 'fake' data scientists. I ran across a couple interesting perspectives related to this some years back:

What sets data scientists apart from other data workers, including data analysts, is their ability to create logic behind the data that leads to business decisions. "Data scientists extract data, formulate models and apply quantitative analysis in a proactive manner" -Laura Kelley, Vice President, Modis.

"They can suck data out of a server log, a telecom billing file, or the alternator on a locomotive, and figure out what the heck is going on with it. They create new products and services for customers. They can also interface with carbon-based lifeforms — senior executives, product managers, CTOs, and CIOs. You need them." - Can You Live Without a Data Scientist, Harvard Business Review.

I have seen numerous variations on Drew Conway's data science Venn diagram, but I think Drew still nails it down pretty well. If you can write code, understand statistics, and can apply those skills based on your specific subject matter expertise, to me that meets the threshold of the minimum skills most employers might require for you to do value-added work. But these tweets raise the question: for many business problems, do we need algorithms at all, or what kind of people do we really need?

Absolutely there are differences in skillsets required for machine learning vs traditional statistical inference, and I know there are definitely instances where knowing how to set up a Hadoop cluster can be valuable for certain problems. Maybe you do need a complex algorithm to power a killer app or recommender system.

I think part of the hype and snobbery around the terms data science and data scientist might stem from the fact that they are used in so many different contexts and they mean so many things to so many people that there is fear that the true meaning will be lost along with one's relevance in this space as a data scientist. It might be better to forget about semantics and just concentrate on the ends that we are trying to achieve.

I think a vast majority of businesses really need insights driven by people with subject matter expertise and the ability to clean, extract, analyze, visualize, and probably most importantly, communicate. Sometimes the business need requires prediction, other times inference. Many times you may not need a complicated algorithm or experimental design at all, not as much as you need someone to make sense of the nasty transactional data your business is producing and summarize it all with something maybe as simple as cross tabs. Sometimes you might need a PhD computer scientist or Engineer that can meet the strictest of data science thresholds, but lots of times what you really may need is a statistician, econometrician, biometrician, or just a good MBA or business analyst that understands predictive modeling, causal inference, and the basics of a left join.
This is why one recommendation may be to pursue 'data scientists' outside of some of the traditional academic disciplines associated with data science.

There was a similar discussion a couple years ago on the SAS Voices Blog in a post by Tonya Balan titled "Data scientist or statistician: What's in a name?"

"Obviously, there are a lot of opinions about the terminology.  Here’s my perspective:  The rise and fall of these titles points to the rapid pace of change in our industry.  However, there is one thing that remains constant.  As companies collect more and more data, it is imperative that they have individuals who are skilled at extracting information from that data in a way that is relevant to the business, scientifically accurate, and that can be used to drive better decisions, faster.  Whether they call themselves statisticians, data scientists, or SAS users, their end goal is the same."

And really good insight from Evan Stubbs (in the comment section):

"Personally, I think it's just the latest attempt to clarify what's actually an extremely difficult role. I don't think it's synonymous with any existing title (including statistician, mathematician, analyst, data miner, and so on) but on the same note, I don't think it's where we'll finally end up.

From my perspective, the market is trying to create a shorthand signal that describes someone with:

* An applied (rather than theoretical) focus
* A broader (rather than narrow) set of technical skills
* A focus on outcomes over analysis
* A belief in creating processes rather than doing independent activities
* An ability to communicate along with an awareness of organisational psychology, persuasion, and influence
* An emphasis on recommendations over insight

Existing roles and titles don't necessarily identify those characteristics.

While "Data Scientist" is the latest attempt to create an all-encompassing title, I don't think it'll last. On one hand, it's very generic. On the other, it still implies a technical focus - much of the value of these people stems from their ability to communicate, interface with the business, be creative, and drive outcomes. "Data Scientist", to me at least, carries very research-heavy connotations, something that dilutes the applied and (often) political nature of the field. "

One thing is for certain, recruiters might have a better shot at placing candidates for their clients if role descriptions would just say what they mean and leave the fighting over who's a real data scientist to the LinkedIn discussion boards and their tweetstream.

See also:

Economists as Data Scientists

Why Study Agricultural and Applied Economics?

Tuesday, March 22, 2016

Identification and Common Trend Assumptions in Difference-in-Differences for Linear vs GLM Models

In a previous post I discussed the conclusion from Lechner’s paper 'The Estimation of Causal Effects by Difference-in-Difference Methods', that difference-in-difference models in a non-linear or GLM context failed to meet the common trend assumptions, and therefore failed to identify treatment effects from a selection on unobservables context.

In that post I noted that Lechner points out (quite rigorously in the context of the potential outcomes framework):

"We start with a “natural” nonlinear model with a linear index structure which is transformed by a link function, G(·), to yield the conditional expectation of the potential outcome.....The common trend assumption relies on differencing out specific terms of the unobservable potential outcome, which does not happen in this nonlinear specification... Whereas the linear specification requires the group specific differences to be time constant, the nonlinear specification requires them to be absent. Of course, this property of this nonlinear specification removes the attractive feature that DiD allows for some selection on unobservable group and individual specific differences. Thus, we conclude that estimating a DiD model with the standard specification of a nonlinear model would usually lead to an inconsistent estimator if the standard common trend assumption is upheld. In other words, if the standard DiD assumptions hold, this nonlinear model does not exploit them (it will usually violate them). Therefore, estimation based on this model does not identify the causal effect”

I wanted to review at a high level exactly how he gets to this result. But I wanted to simplify this as much as possible and start with some basic concepts. Starting with a basic regression model, the population conditional expectation function, or conditional mean of Y given X can be written as:

Regression and Expected Value Notation:

E[Y|X] = β0 + β1 X (1)

and we estimate this with the regression on observed data:

y = b0 + b1X + e (2)

Where b1 is our estimate of the population parameter of interest β1.

If E[b1] = β1, then we say our estimator is unbiased.
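Unbiasedness is a statement about the average of b1 over repeated samples, which a quick simulation can illustrate (parameter values below are arbitrary):

```r
# Simulate many samples from y = 1 + 3x + e and average the OLS slope
set.seed(321)
b1_draws <- replicate(2000, {
  x <- rnorm(100)
  y <- 1 + 3*x + rnorm(100)   # true beta1 = 3
  coef(lm(y ~ x))[2]
})
mean(b1_draws)                # close to the true value of 3
```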

Potential Outcomes Notation:

When it comes to experimental designs, we are interested in knowing counterfactuals, that is what value of an outcome would a treatment or program participant have in absence of treatment (the baseline potential outcome) vs. if they participated or were treated? If we specify these 'potential outcomes' as follows:

Y0= baseline potential outcome
Y1= potential treatment outcome

We can characterize the treatment effect as:

E[Y1-Y0], the difference between potential treated and baseline outcomes. This is referred to as the average treatment effect or ATE. Sometimes we are interested in, or some models estimate, the average treatment effect on the treated or ATT: E[Y1-Y0 | d = 1]

where d is an indicator for treatment (d = 1) vs control or untreated (d =0).
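A toy numerical illustration of these quantities (simulated potential outcomes; the selection mechanism and effect sizes are made up):

```r
# Simulated potential outcomes with selection related to the baseline outcome
set.seed(99)
n <- 10000
y0 <- rnorm(n, mean = 10)             # baseline potential outcome
y1 <- y0 + 2 + rnorm(n, sd = 0.5)     # potential treatment outcome, effect = 2
d <- rbinom(n, 1, plogis(y0 - 10))    # higher-y0 units more likely treated

ate <- mean(y1 - y0)                  # average treatment effect
att <- mean((y1 - y0)[d == 1])        # average treatment effect on the treated
naive <- mean(y1[d == 1]) - mean(y0[d == 0])  # simple observed comparison
c(ate = ate, att = att, naive = naive)
```

Because treatment here is correlated with the baseline outcome, the naive observed comparison overstates the true effect; this is exactly the selection bias the methods in these posts are designed to address.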

Difference-in-Difference Analysis:

Difference-in-difference (DD) estimators assume that in absence of treatment the difference between control (B) and treatment (A) groups would be constant or ‘fixed’ over time. Treatment effects in DD estimators are derived by subtracting differences between pre and post values within treatment and control groups, and then taking a difference in differences between treatment and control groups. The unobservable effects that are constant or fixed over time ‘difference out’, allowing us to identify treatment effects while controlling for these unobservable characteristics without explicitly measuring them. This characterizes what is referred to as a ‘selection on unobservables’ framework.

  This can also be estimated using linear regression with an interaction term:

y = b0 + b1 d + b2 t + b3 d*t+ e (3)

where d indicates treatment (d=1 vs d = 0) and the estimated coefficient (b3 ) on the time by treatment interaction term gives us our estimate of treatment effects. 
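A quick simulated example (made-up effect sizes, purely illustrative) shows the interaction coefficient recovering the treatment effect:

```r
# Simulated DID: group effect 1, time trend 0.5, true treatment effect 2
set.seed(7)
d <- rep(0:1, each = 1000)           # control vs treatment group
t <- rep(0:1, times = 1000)          # pre vs post period
y <- 3 + 1*d + 0.5*t + 2*d*t + rnorm(2000)
did <- lm(y ~ d + t + d:t)
coef(did)["d:t"]                     # close to the true effect of 2
```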

Lechner and Potential Outcomes Framework:

In an attempt to present the issues with GLM DD models depicted in Lechner (2010) using the simplest notation possible (abusing notation slightly and perhaps at a cost of precision), we can depict the framework for difference-in-difference analysis using expectations:

DID = [E(Y1|D=1) - E(Y0|D=1)] - [E(Y1|D=0) - E(Y0|D=0)] (4)

DID = [pre/post differences for treatment group] – [pre/post differences for control group]

where Y represents the observed outcome values, subscripted by pre (0) and post (1) periods

We can represent potential outcomes in the regression framework as follows:

E(Yt1|D) = α + tδ1 + dγ “potential outcome if treated” (5)

E(Yt0|D) = α + tδ0 + dγ “potential baseline outcome” (6)

ATET: E(Y11 - Y10|D=1) = θ = δ1 - δ0    (7)

“the difference in potential treated vs. baseline outcomes in the post period”

We can estimate θ with a regression on observed data of the form:

y = b0 + b1 d + b2 t + b3 d*t + e (3')

where b3 is our estimator for θ.

Common Trend Assumption:
Difference-in-difference (DD) estimators assume that in absence of treatment the difference between control (B) and treatment (A) groups would be constant or ‘fixed’ over time. This can be represented geometrically in a linear modeling context by 'parallel trends' in outcome levels between treatment and control groups in absence of a treatment:

In the usual graphical depiction, BB represents the trend in outcome Y for a control group, and AA represents the counterfactual trend, the parallel or common trend for the treatment group that would occur in absence of treatment. The distance A'A represents a departure from the parallel trend in response to treatment, and corresponds to our DD treatment effect, the value b3, our estimator of the treatment effect.

The common trend assumption following Lechner, can be expressed in terms of potential outcomes:

E(Y10|D=1)-E(Y00|D=1) = α + δ0 + γ - α – γ = δ0 (8)

E(Y10|D=0)-E(Y00|D=0) = α + δ0 - α = δ0 (9)

i.e. the pre and post period differences in baseline outcomes is the same (δ0) regardless if individuals are assigned to the treatment group (D=1) or control group (D=0).

Nonlinear Models:

In a GLM framework, with a specific link function G(.) a DD framework can be expressed in terms of potential outcomes as follows:

E(Yt1|D) = G(α + tδ1 + dγ) “potential outcome if treated” (10)

E(Yt0|D) = G(α + tδ0 + dγ) “potential baseline outcome” (11)

DID can be estimated by regression on observed outcomes:

E(y|d,t) = G(b0 + b1 d + b2 t + b3 d*t) (12)

Common Trend Assumption:

E(Y10|D=1)-E(Y00|D=1) = G(α + δ0 + γ) - G(α + γ) (13)

E(Y10|D=0)-E(Y00|D=0) = G(α + δ0 ) - G(α ) (14)

It turns out that in a GLM framework, for the common trend assumption to hold, group specific differences must be zero, i.e. γ = 0. The common trend assumption relies on differencing out specific terms of the unobservable potential outcome, the very group and individual specific effects we are trying to control for in the selection on unobservables scenario, but in a GLM setting we have to assume those effects are absent. In essence, the attractive feature of DD models, controlling for unobservable effects, is lost in a GLM setting.
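Lechner's conclusion is easy to check numerically with a logit link, G = plogis (the values of α, δ0, and γ below are arbitrary illustrations):

```r
# Common trend under a logit link: eq (13) vs eq (14)
a <- 0.2; d0 <- 0.5; g <- 1.0   # alpha, delta_0, gamma (illustrative values)
trend_treated <- plogis(a + d0 + g) - plogis(a + g)  # eq (13)
trend_control <- plogis(a + d0) - plogis(a)          # eq (14)
trend_treated - trend_control   # nonzero: common trend fails when gamma != 0

# with gamma = 0 the treated trend reduces exactly to the control trend
plogis(a + d0 + 0) - plogis(a + 0) - trend_control
```

The nonlinearity of G means the same index shift δ0 translates into different probability changes at different baseline levels, which is why the group effect γ no longer differences out.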
The Estimation of Causal Effects by Difference-in-Difference Methods
Michael Lechner
Foundations and Trends in Econometrics, Vol. 4, No. 3 (2010) 165–224

Program Evaluation and the
Difference-in-Difference Estimator
Course Notes
Education Policy and Program Evaluation
Vanderbilt University
October 4, 2008

Difference in Difference Models, Course Notes
ECON 47950: Methods for Inferring Causal Relationships in Economics
William N. Evans
University of Notre Dame
Spring 2008