Thursday, October 13, 2016

Why Data Science Needs Economics

Cleaning out my inbox recently, I ran across an article from 2015. Below is a link and excerpt:

4. Data Science Will Belong to the Economists

"We will start to see data science (to the extent that it operates as a coherent entity) increasingly rely on the domain expertise of economists. The early days of data science were very math, statistics and programming oriented. Then there was the rise of the “computational social scientist,” which added sociology to the mix. Many trend setting data science places are finding that sociology, and similar disciplines, tend to be retrospective, while other fields, like economics, offer simulation and auction modeling and other techniques to get more proactive and predictive with data. Of course, most economists don’t have the programming chops to land most data science jobs, but I think we’ll see that start to change significantly."
I think coding is important...and it would be nice if students interested in a career in data science could get more exposure to coding (SQL/R/SAS/Python, etc.) in the classroom, as well as algorithmic approaches (decision trees, neural networks, etc.). However, I think it's more important to have the analytical thinking skills and grounding in statistical inference that they get from an economics program (both undergraduate and graduate). That's the skillset I think will differentiate the data scientists of the future from the very technical, tools-focused ones in demand today.

Recently on EconTalk, Russ Roberts and Cathy O'Neil discussed her book Weapons of Math Destruction, taking on issues related to explaining vs. predicting and causality vs. fitting the data (see also their previous episode with Susan Athey). They emphasized the role of quasi-experimental methods and rigorous identification, as well as theory. And theory is something that, having spent the better part of my career focusing on empirical methods (both causal inference and machine learning), I have not given enough thought to until recently. But the more I think about it, the more I realize it is necessary. Can big data and algorithms alone deliver tighter, unbiased, and more truthful insights? This excerpt from the 10th edition of Heyne, Boettke, and Prychitko's The Economic Way of Thinking leads me to think exactly the opposite:

"We can observe facts, but it takes a theory to explain the causes. It takes a theory to weed out the irrelevant facts from the relevant ones."

And they give an anecdote:

"although the facts clearly show that most pot smokers were former milk drinkers, milk drinking probably is not a relevant fact in explaining pot smoking; similarly, the Superbowl is likely irrelevant when explaining Wall Street Interactions"(even if the data does show that the Dow does well when an NFC team does well)."

And more about theory:

"Our observations of the world are in fact drenched with theory, which is why we can usually make sense out of the buzzing confusion that assaults our eyes and ears. Actually we observe only a small fraction of what we "know," a hint here and a suggestion there. The rest we fill in from the theories we hold: small and broad, vague and precise..."

Big data is in many ways buzzing confusion, and yes, algorithmic approaches, i.e. machine learning, can help us find patterns and relationships that can be useful. But relying entirely on a data-driven process devoid of theory will more often lead us down the wrong path, depending on the questions we are trying to answer. Economics is a way of thinking, and economic theory can help us make sense of what we find, help us ask better and more important questions, and help guide us in understanding the answers to those questions. It is forward looking, as the article above states.

Of course we can test theories using data, through some clever identification strategy or even employing methods from machine learning in conjunction with conventional econometric approaches. And this brings me back full circle to my previous post about what the most important skillsets for data scientists may be going forward, and how economics training, and in fact economic theory can help fill that niche.

See also:

Are Data Scientists Going Extinct?
To Explain or Predict
Economists as Data Scientists
Why Study Agricultural and Applied Economics
Analytics vs Causal Inference
Culture War: Inferential Statistics vs Machine Learning
Big Data: Don't throw the baby out with the bathwater
Causal Inference and Quasi-Experimental Design Roundup 
Big Data: Causality and Local Expertise are Key in Agronomic Applications 
Data Scientists vs Algorithms vs Real Solutions to Real Problems 
Analytical Translators

Sunday, October 2, 2016

Are Data Scientists Going Extinct?

I came across a tweet by @YvesMulkers recently pointing to the following TechCrunch article:

It introduces an interesting perspective about the future work of data scientists:

Why we will continue to need data scientists:

"you need to intimately understand the problems that can be solved by data science first, which involves a very human process of interacting with the business. Crafting models will always require the subtle translation of real-world phenomena into mathematical expressions. And there is a human element to interpreting and presenting results that would be difficult to automate."

However, they discuss how some of the very technical aspects of data science will become more routine, automated, or modularized:

"Consider how the work of the software engineer has changed fundamentally in the last 20 years. They no longer need to write their own logging module or database access layer or UI widget. And agile methods have brought the “customer” more immediately into the development process. More and more, the job of the engineer is to stitch together higher-level components and collaborate with product managers and UX designers....Similarly, the job of the data scientist will be to take advantage of pre-built components in order to solve a greater variety of business problems. Instead of a few six-month analytics projects that focus on model accuracy and algorithmic niceties, business and analytics teams will be able to work on hundreds of projects that emphasize making concrete changes in the way business is done. And as the software available for analytics becomes more powerful, the result should be a continued steady demand for data scientists, playing a different but more prominent role in the day-to-day working of an organization."

I always thought about this from the standpoint of the dot-com boom in the 90s and the role of HTML programmers. This blog is a case in point...instead of focusing on HTML tags (OK, I know a lot of HTML has even been replaced by JavaScript and other languages I am not aware of, but that is the point), I can focus on content, analysis, and design vs. whatever script is behind this page. Will R and Python coding go the same way? I'm not sure. I'm a tried and true devotee to scripting my analysis work, regardless of the language.

But this discussion makes me think of an earlier article in Deloitte Press related to the role of Analytical Translators:

"Data scientists...can make Hadoop jump through hoops,....dream in SAS or R, ...extract two years of data from a medical device that normally dumps it after 20 minutes (a true request)....A “light quant” is someone who knows something about analytical and data management methods, and who also knows a lot about specific business problems. The value of the role comes, of course, from connecting the two."

And this is what I was getting at to some degree in a recent post:

"Sometimes you might need a PhD computer scientist or Engineer that can meet the strictest of data science thresholds, but lots of times what you really may need is a statistician, econometrician, biometrician, or just a good MBA or business analyst that understands predictive modeling, causal inference, and the basics of a left join."

Implementation of the solution is where the technical work comes into play...and lots of code. Currently for me it's SAS and R, maybe Python down the road. However, the crux of my work is understanding the context of the problem, and then scoping out the data requirements and methodology for solving it.

So should an experienced data scientist really sweat learning the latest and newest language...or focus more on the analytical thinking and analysis skills that ultimately drive the solution? What about an aspiring data scientist? How many languages/tools should they master? At the margin, is more time spent adding a new tool to the toolbox worth more than experience solving a problem with an older or existing tool? I'm thinking: learn enough to solve problems, and then become a more polished problem solver and analytical translator. Languages come and go, but questions begging for solutions, and the ability to provide them regardless of the tools used, are never ending.

See also:
Data Scientists vs Algorithms vs Real Solutions to Real Problems

Economists as Data Scientists

Analytical Translators

Sunday, August 7, 2016

The State of Applied Econometrics-Imbens and Athey on Causality, Machine Learning, and Econometrics

I recently ran across:

The State of Applied Econometrics - Causality and Policy Evaluation
Susan Athey, Guido Imbens 

A nice read, although I skipped directly to the section on machine learning, which makes a few interesting comments at the intersection of causality and machine learning.

They discussed some known issues with estimating propensity scores using various machine learning algorithms, in terms of the sensitivity of results, especially for propensity scores close to 0 or 1. They discuss trimming weights as one possible approach, which I have heard before in Angrist and Pischke and other work (see below). In fact, in a working paper where I employed gradient boosting to estimate propensity scores for IPTW regression, I trimmed weights. However, I did not trim them for the stratified matching estimator that I also used. I wish I still had the data, because I would like to see the impact on my previous results.
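For illustration, here is a rough sketch of what that might look like in R, assuming a hypothetical data frame d with a 0/1 treatment indicator treat, outcome y, and covariates x1 and x2 (this is not the code from my working paper):

library(gbm)
# estimate propensity scores with gradient boosting (hypothetical data/tuning values)
ps.mod <- gbm(treat ~ x1 + x2, data = d, distribution = "bernoulli",
              n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)
d$ps <- predict(ps.mod, newdata = d, n.trees = 1000, type = "response")
# trim extreme scores to [.10, .90] before weighting
d$ps.trim <- pmin(pmax(d$ps, 0.10), 0.90)
# inverse probability of treatment weights
d$wt <- ifelse(d$treat == 1, 1/d$ps.trim, 1/(1 - d$ps.trim))
# IPTW regression estimate of the treatment effect
summary(lm(y ~ treat, data = d, weights = wt))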

Another interesting application discussed in this paper was a two (or three?) stage LASSO estimation (they actually have a great overall discussion of penalized regression and regularization in machine learning): first run LASSO to select variables related to the outcome of interest, then run LASSO again to select variables related to selection into treatment, and finally run OLS on a causal model that includes the union of the variables selected by the previous LASSO steps.
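A rough sketch of that double-selection idea using the glmnet package, where X is a hypothetical matrix of controls, y the outcome, and w the treatment (this is my reading of the approach, not code from the paper):

library(glmnet)
# step 1: LASSO of the outcome on the controls
cv.y <- cv.glmnet(X, y, alpha = 1)
sel.y <- which(as.vector(coef(cv.y, s = "lambda.min"))[-1] != 0)
# step 2: LASSO of the treatment on the controls
cv.w <- cv.glmnet(X, w, alpha = 1)
sel.w <- which(as.vector(coef(cv.w, s = "lambda.min"))[-1] != 0)
# step 3: OLS of the outcome on the treatment plus the union of selected controls
keep <- union(sel.y, sel.w)
ols.dat <- data.frame(y = y, w = w, X[, keep, drop = FALSE])
summary(lm(y ~ ., data = ols.dat))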

The paper covers a range of other topics including decision trees, random forests, distinctions between traditional econometrics and machine learning, instrumental variables etc.

Some Additional Notes and References:

Multiple Algorithms (CART/Logistic Regression/Boosting/Random Forests) with PS weights and trimming:

Following Angrist and Pischke, I present results for regressions utilizing data that has been 'screened' by eliminating observations where ps > .90 or ps < .10, using the R 'MatchIt' package.
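A hypothetical sketch of that screening rule (variable names made up; MatchIt's default distance is a logistic-regression propensity score):

library(MatchIt)
m.out <- matchit(treat ~ x1 + x2, data = d, method = "nearest")
d$ps <- m.out$distance  # estimated propensity scores
# screen out observations with extreme scores
d.screened <- subset(d, ps >= 0.10 & ps <= 0.90)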

Estimating the Causal Effect of Advising Contacts on Fall to Spring Retention Using Propensity Score Matching and Inverse Probability of Treatment Weighted Regression

Matt Bogard, Western Kentucky University


In the fall of 2011 academic advising and residence life staff working for a southeastern university utilized a newly implemented advising software system to identify students based on attrition risk. Advising contacts, appointments, and support services were prioritized based on this new system and information regarding the characteristics of these interactions was captured in an automated format. It was the goal of this study to investigate the impact of this advising initiative on fall to spring retention rates. It is a challenge on college campuses to evaluate interventions that are often independent and decentralized across many university offices and organizations. In this study propensity score methods were utilized to address issues related to selection bias. The findings indicate that advising contacts associated with the utilization of the new software had statistically significant impacts on fall to spring retention for first year students on the order of a 3.26 point improvement over comparable students that were not contacted.

Suggested Citation

Matt Bogard. 2013. "Estimating the Causal Effect of Advising Contacts on Fall to Spring Retention Using Propensity Score Matching and Inverse Probability of Treatment Weighted Regression" The SelectedWorks of Matt Bogard
Available at:

Friday, July 29, 2016

Heckman...what the heck?

A while back I was presenting some research I did that involved propensity score matching, and I was asked why I did not utilize a Heckman model. My response was that I viewed my selection issues from the context of the Rubin causal model and a selection on observables framework. And truthfully, I was not that familiar with Heckman's approach. It is interesting that in Angrist and Pischke's Mostly Harmless Econometrics, Heckman is given scant attention. However, here are some of the basics:

Some statistical pre-requisites:
Incidental truncation: we do not observe y due to the effect of another variable z. This results in a truncated distribution of y:

f(y|z > a) = f(y,z)/Prob(z > a)  (1)

This is a ratio of a density to a cumulative distribution function, referred to as the inverse Mills ratio or selection hazard.

Of major interest is the expected value of a truncated normal variable:

E(y|z > a) = µ + ρσλ  (2)

Application: The Heckman model is often used in the context of truncation or incidental truncation (selection), where we only observe an outcome conditional on a decision to participate or self-select into a program or treatment. A popular example is the observation of wages only for people who choose to work, or outcomes only for people who choose to participate in a job training or coaching program.

Estimation: Estimation is a two-step process.

Step 1: Selection Equation

Z = wγ + µ  (3)

for each observation, compute λ_hat = φ(wγ)/Φ(wγ)  (4) from the estimates of the selection equation

Step 2: Outcome Equation

y|z > 0 = xβ + βλ λ_hat + v  (5) where βλ =ρσ (note the similarity to (2)

One way to think of λ is in terms of the correlation between the treatment variable and the error term, in the context of omitted variable bias and endogeneity:

Y= a + xc + bs  + e  (6)

where s = selection or treatment indicator

If the conditional independence or selection on observables assumption does not hold, i.e. there are factors related to selection not controlled for by x, then we have omitted variable bias and correlation between 's' and the error term 'e'. This results in endogeneity and biased estimates of the treatment effect 'b'.

If we characterize the correlation between e and s as λ = E(e | s,x)  (7),

then the Heckman model consists of deriving an estimate of λ and including it in the regression as previously illustrated:

Y= a + xc + bs  + hλ + e  (8)

As stated (paraphrasing somewhat) in Briggs (2004):

"The Heckman model goes from specifying a selection model to getting an estimate for the bias term E(e | s,x) by estimating the expected value of a truncated normal random variable. This estimate is known in the literature as the Mills ratio or hazard function, and can be expressed as the ration of the standard normal density function to the cumulative distribution."

The Heckman model is powerful because it handles selection bias in both selection on observables and selection on unobservables contexts. There are, however, a number of assumptions involved that could limit its use. For more details I recommend the article by Briggs in the references below.
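To make the two-step procedure concrete, here is a minimal sketch in R, using a hypothetical data frame d with selection indicator s, selection covariate w, outcome y (observed only when s = 1), and outcome covariate x:

# step 1: probit selection equation
probit <- glm(s ~ w, data = d, family = binomial(link = "probit"))
zhat <- predict(probit, type = "link")  # estimated w*gamma
d$imr <- dnorm(zhat)/pnorm(zhat)        # inverse Mills ratio, as in (4)
# step 2: outcome equation on the selected sample, adding the inverse Mills ratio
heck <- lm(y ~ x + imr, data = d, subset = s == 1)
summary(heck)
# the sampleSelection package's heckit() automates both steps:
# library(sampleSelection); heckit(s ~ w, y ~ x, data = d)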


References:

Briggs, Derek C. (2004). "Causal Inference and the Heckman Model." Journal of Educational and Behavioral Statistics, Winter 2004, Vol. 29, No. 4, pp. 397-420.

Veilleux, Gaétan (Valen). "Selection Bias: What You Don't Know Can Hurt Your Bottom Line." Casualty Actuarial Society presentation.

Wednesday, May 25, 2016

Divide by 4 Rule for Marginal Effects

Previously I wrote about the practical differences between marginal effects and odds ratios with regard to logistic regression.

Recently, I ran across a tweet from Michael Grogan linking to one of his posts using logistic regression to model dividend probabilities. This really got me interested:

"Moreover, to obtain a measure in probability terms – one could divide the logit coefficient by 4 to obtain an estimate of the probability of a dividend increase. In this case, the logit coefficient of 0.8919 divided by 4 gives us a result of 22.29%, which means that for every 1 year increase in consecutive dividend increases, the probability of an increase in the dividend this year rises by 22.29%."

I had never heard of this 'divide by 4' shortcut to get to marginal effects. While you can get those in Stata, R, or SAS with a little work, I think this trick would be very handy, for instance, if you are reading someone else's paper/results and just want a ballpark on marginal effects (instead of interpreting odds ratios).

I did some additional investigation on this and ran across Stephen Turner's Getting Genetics Done blog post related to this, where he goes a little deeper into the mathematics behind this:

"The slope of this curve (1st derivative of the logistic curve) is maximized at a+ßx=0, where it takes on the value:



So you can take the logistic regression coefficients (not including the intercept) and divide them by 4 to get an upper bound of the predictive difference in probability of the outcome y=1 per unit increase in x."

Stephen points to Andrew Gelman, who may be the originator of this, citing the text Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman, Jennifer Hill. There is some pushback in the comments to Stephen's post, but I still think this is a nice shortcut for an on the fly interpretation of reported results.

If you go back to the output from my previous post on marginal effects, the estimated logistic regression coefficients were:

             Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.92972    2.34258   2.531   0.0114 *
age         -0.14099    0.05656  -2.493   0.0127 *

And if you apply the divide by 4 rule you get:

-0.14099 / 4 = -.0352

While this is not a proof, or even a simulation, it is close to the minimum for this data, which for these negative marginal effects is the bound on their magnitude (see the full R program below):

   Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.03525 -0.03262 -0.02697 -0.02583 -0.02030 -0.01071

R Code:

 # PROGRAM NAME: MEFF and Odds Ratios
 # DATE: 3/3/16
 # PROJECT FILE:                       
 # 2011 
 # Simple Logit and Probit Marginal Effects in R 
 # Alan Fernihough, University College Dublin 
 # WP11/22 
 # October 2011 
#   generate data for continuous explanatory variable
Input = ("participate age
1 25
1 26
1 27
1 28
1 29
1 30
0 31
1 32
1 33
0 34
1 35
1 36
1 37
1 38
0 39
0 40
0 41
1 42
1 43
0 44
0 45
1 46
1 47
0 48
0 49
0 50
0 51
0 52
1 53
0 54")
dat1 <-  read.table(textConnection(Input),header=TRUE)
summary(dat1) # summary stats
#### run logistic regression model
mylogit <- glm(participate ~ age, data = dat1, family = "binomial")
exp(cbind(OR = coef(mylogit), confint(mylogit))) # get odds ratios
# marginal effects calculations
# mfx function for marginal effects from a glm model
# from:  
# based on:
# 2011 
# Simple Logit and Probit Marginal Effects in R 
# Alan Fernihough, University College Dublin 
# WP11/22 
#October 2011 
mfx <- function(x,sims=1000){
pdf <- ifelse(as.character(x$call)[3]=="binomial(link = \"probit\")",
mean(dnorm(predict(x, type = "link"))),
mean(dlogis(predict(x, type = "link"))))
pdfsd <- ifelse(as.character(x$call)[3]=="binomial(link = \"probit\")",
sd(dnorm(predict(x, type = "link"))),
sd(dlogis(predict(x, type = "link"))))
marginal.effects <- pdf*coef(x)
sim <- matrix(rep(NA,sims*length(coef(x))), nrow=sims)
for(i in 1:length(coef(x))){
sim[,i] <- rnorm(sims,coef(x)[i],diag(vcov(x)^0.5)[i])
pdfsim <- rnorm(sims,pdf,pdfsd) <- pdfsim*sim
res <- cbind(marginal.effects,sd(
colnames(res)[2] <- "standard.error"
# marginal effects from logit
### code it yourself for marginal effects at the mean
b0 <-  5.92972   # estimated intercept from logit
b1 <-  -0.14099  # estimated b from logit
xvar <- 39.5   # reference value (i.e. mean)for explanatory variable
d <- .0001     # incremental change in x
xbi <- (xvar + d)*b1 + b0
xbj <- (xvar - d)*b1 + b0
meff <- ((exp(xbi)/(1+exp(xbi)))-(exp(xbj)/(1+exp(xbj))))/(d*2) ;
### a different perhaps easier formulation for me at the mean
XB <- xvar*b1 + b0 # this could be expanded for multiple b's or x's
meffx <- (exp(XB)/((1+exp(XB))^2))*b1
### averaging the meff for the whole data set
dat1$XB <- dat1$age*b1 + b0
meffx <- (exp(dat1$XB)/((1+exp(dat1$XB))^2))*b1
summary(meffx) # get mean
#### marginal effects from linear model
lpm <- lm(dat1$participate~dat1$age)
# multivariable case
dat2 <- read.csv("")
summary(dat2) # summary stats
#### run logistic regression model
mylogit <- glm(admit ~ gre + gpa, data = dat2, family = "binomial")
exp(cbind(OR = coef(mylogit), confint(mylogit))) # get odds ratios
# marginal effects from logit
### code it yourself for marginal effects at the mean
b0 <-  -4.949378    # estimated intercept from logit
b1 <-  0.002691     # estimated b for gre
b2 <-   0.754687    # estimated b for gpa
x1 <- 587    # reference value (i.e. mean)for gre
x2 <- 3.39   # reference value (i.e. mean)for gre
d <- .0001   # incremental change in x
# meff at means for gre
xbi <- (x1 + d)*b1 + b2*x2 + b0
xbj <- (x1 - d)*b1 + b2*x2 + b0
meff <- ((exp(xbi)/(1+exp(xbi)))-(exp(xbj)/(1+exp(xbj))))/(d*2) ;
# meff at means for gpa
xbi <- (x2 + d)*b2 + b1*x1 + b0
xbj <- (x2 - d)*b2 + b1*x1 + b0
meff <- ((exp(xbi)/(1+exp(xbi)))-(exp(xbj)/(1+exp(xbj))))/(d*2) ;
### a different perhaps easier formulation for me at the mean
XB <- x1*b1 +x2*b2 + b0 # this could be expanded for multiple b's or x's
# meff at means for gre
meffx <- (exp(XB)/((1+exp(XB))^2))*b1
# meff at means for gpa
meffx <- (exp(XB)/((1+exp(XB))^2))*b2
### averaging the meff for the whole data set
dat2$XB <- dat2$gre*b1 + dat2$gpa*b2 + b0
# sample avg meff for gre
meffx <- (exp(dat1$XB)/((1+exp(dat1$XB))^2))*b1
summary(meffx) # get mean
# sample avg meff for gpa
meffx <- (exp(dat1$XB)/((1+exp(dat1$XB))^2))*b2
summary(meffx) # get mean
#### marginal effects from linear model
lpm <- lm(admit ~ gre + gpa, data = dat2)

Tuesday, May 24, 2016

Data Scientists vs Algorithms vs Real Solutions to Real Problems

A few weeks ago there was a short tweet in my tweetstream that kindled some thoughts.

"All people are biased, that's why we need algorithms!" "All algorithms are biased, that's why we need people!" via @pmarca

And a retweet/reply by Diego Kuonen:

"Algorithms are aids to thinking and NOT replacements for it"

This got me thinking about a lot of the work I have been doing, past job interviews and conversations with headhunters and 'data science' recruiters, as well as a number of discussions (or sometimes arguments) about what defines data science and about 'unicorns' and 'fake' data scientists. I ran across a couple of interesting perspectives related to this some years back:

What sets data scientists apart from other data workers, including data analysts, is their ability to create logic behind the data that leads to business decisions. "Data scientists extract data, formulate models and apply quantitative analysis in a proactive manner" -Laura Kelley, Vice President, Modis.

"They can suck data out of a server log, a telecom billing file, or the alternator on a locomotive, and figure out what the heck is going on with it. They create new products and services for customers. They can also interface with carbon-based lifeforms — senior executives, product managers, CTOs, and CIOs. You need them." - Can You Live Without a Data Scientist, Harvard Business Review.

I have seen numerous variations on Drew Conway's data science Venn diagram, but I think Drew still nails it down pretty well. If you can write code, understand statistics, and can apply those skills based on your specific subject matter expertise, to me that meets the threshold of the minimum skills most employers might require for you to do value-added work. But these tweets raise the question: for many business problems, do we need algorithms at all, and what kind of people do we really need?

Absolutely there are differences in the skillsets required for machine learning vs. traditional statistical inference, and I know there are definitely instances where knowing how to set up a Hadoop cluster can be valuable for certain problems. Maybe you do need a complex algorithm to power a killer app or recommender system.

I think part of the hype and snobbery around the terms data science and data scientist might stem from the fact that they are used in so many different contexts, and mean so many things to so many people, that there is fear that the true meaning will be lost, along with one's relevance in this space as a data scientist. It might be better to forget about semantics and just concentrate on the ends we are trying to achieve.

I think a vast majority of businesses really need insights driven by people with subject matter expertise and the ability to clean, extract, analyze, visualize, and probably most importantly, communicate. Sometimes the business need requires prediction, other times inference. Many times you may not need a complicated algorithm or experimental design at all, not as much as you need someone to make sense of the nasty transactional data your business is producing and summarize it all with something maybe as simple as cross tabs. Sometimes you might need a PhD computer scientist or Engineer that can meet the strictest of data science thresholds, but lots of times what you really may need is a statistician, econometrician, biometrician, or just a good MBA or business analyst that understands predictive modeling, causal inference, and the basics of a left join.
This is why one recommendation may be to pursue 'data scientists' outside of some of the traditional academic disciplines associated with data science.

There was a similar discussion a couple of years ago on the SAS Voices blog in a post by Tonya Balan titled "Data scientist or statistician: What's in a name?"

"Obviously, there are a lot of opinions about the terminology.  Here’s my perspective:  The rise and fall of these titles points to the rapid pace of change in our industry.  However, there is one thing that remains constant.  As companies collect more and more data, it is imperative that they have individuals who are skilled at extracting information from that data in a way that is relevant to the business, scientifically accurate, and that can be used to drive better decisions, faster.  Whether they call themselves statisticians, data scientists, or SAS users, their end goal is the same."

And really good insight from Evan Stubbs (in the comment section):

"Personally, I think it's just the latest attempt to clarify what's actually an extremely difficult role. I don't think it's synonymous with any existing title (including statistician, mathematician, analyst, data miner, and so on) but on the same note, I don't think it's where we'll finally end up.

From my perspective, the market is trying to create a shorthand signal that describes someone with:

* An applied (rather than theoretical) focus
* A broader (rather than narrow) set of technical skills
* A focus on outcomes over analysis
* A belief in creating processes rather than doing independent activities
* An ability to communicate along with an awareness of organisational psychology, persuasion, and influence
* An emphasis on recommendations over insight

Existing roles and titles don't necessarily identify those characteristics.

While "Data Scientist" is the latest attempt to create an all-encompassing title, I don't think it'll last. On one hand, it's very generic. On the other, it still implies a technical focus - much of the value of these people stems from their ability to communicate, interface with the business, be creative, and drive outcomes. "Data Scientist", to me at least, carries very research-heavy connotations, something that dilutes the applied and (often) political nature of the field. "

One thing is for certain, recruiters might have a better shot at placing candidates for their clients if role descriptions would just say what they mean and leave the fighting over who's a real data scientist to the LinkedIn discussion boards and their tweetstream.

See also:

Economists as Data Scientists

Why Study Agricultural and Applied Economics?