## Wednesday, May 25, 2016

### Divide by 4 Rule for Marginal Effects

Previously I wrote about the practical differences between marginal effects and odds ratios with regard to logistic regression.

Recently, I ran across a tweet from Michael Grogan linking to one of his posts using logistic regression to model dividend probabilities. This really got me interested:

"Moreover, to obtain a measure in probability terms – one could divide the logit coefficient by 4 to obtain an estimate of the probability of a dividend increase. In this case, the logit coefficient of 0.8919 divided by 4 gives us a result of 22.29%, which means that for every 1 year increase in consecutive dividend increases, the probability of an increase in the dividend this year rises by 22.29%."

I had never heard of this 'divide by 4' shortcut for getting to marginal effects. While you can get those in Stata, R, or SAS with a little work, I think this trick would be very handy if, for instance, you are reading someone else's paper or results and just want a ballpark on marginal effects (instead of interpreting odds ratios).

I did some additional investigation and ran across a related post on Stephen Turner's Getting Genetics Done blog, where he goes a little deeper into the mathematics:

"The slope of this curve (1st derivative of the logistic curve) is maximized at a+βx=0, where it takes on the value:

$$\frac{\beta e^{0}}{(1+e^{0})^{2}} = \frac{\beta(1)}{(1+1)^{2}} = \frac{\beta}{4}$$

So you can take the logistic regression coefficients (not including the intercept) and divide them by 4 to get an upper bound of the predictive difference in probability of the outcome y=1 per unit increase in x."
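
For reference, the quoted step can be written out in full. This is a sketch in the quote's own notation (intercept a, slope β); the symbol p(x) for the fitted probability is introduced here for convenience and does not appear in the original post.

```latex
\begin{aligned}
p(x) &= \frac{e^{a+\beta x}}{1+e^{a+\beta x}} \\[4pt]
\frac{dp}{dx} &= \frac{\beta\, e^{a+\beta x}}{\left(1+e^{a+\beta x}\right)^{2}}
               = \beta\, p(x)\,\bigl(1-p(x)\bigr) \\[4pt]
p(1-p) &\le \tfrac{1}{4}, \quad \text{with equality at } p=\tfrac{1}{2}
          \;\Longleftrightarrow\; a+\beta x = 0 \\[4pt]
\left|\frac{dp}{dx}\right| &\le \frac{|\beta|}{4}
\end{aligned}
```

Written this way, it is easier to see why the rule gives a bound rather than an estimate: the slope equals β/4 only where the fitted probability is 0.5, and shrinks toward zero as the probability approaches 0 or 1.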

Stephen points to Andrew Gelman, who may be the originator of this, citing the text Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill. There is some pushback in the comments to Stephen's post, but I still think this is a nice shortcut for an on-the-fly interpretation of reported results.
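
To get a feel for how rough the shortcut can be, here is a minimal sketch that compares the exact slope of the logistic curve, β·p·(1−p), with the β/4 ceiling. It uses the 0.8919 coefficient from the dividend quote; the grid of baseline probabilities `p` is hypothetical and chosen only for illustration.

```r
# Logit coefficient reported in the dividend example quoted above
beta <- 0.8919

# Hypothetical baseline probabilities at which to evaluate the slope
p <- c(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95)

# Exact slope of the logistic curve at each baseline: beta * p * (1 - p)
slope <- beta * p * (1 - p)

# Divide-by-4 ceiling on that slope
bound <- beta / 4

# share_of_bound simplifies to 4 * p * (1 - p), independent of beta
result <- data.frame(p = p, slope = slope, share_of_bound = slope / bound)
print(result)
```

The ceiling is reached only at p = 0.5. At a baseline probability of 0.10 or 0.90 the exact slope is 36% of the divide-by-4 figure, so the shortcut is most trustworthy when fitted probabilities sit near the middle of the range.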

If you go back to the output from my previous post on marginal effects, the estimated logistic regression coefficients were:

                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  5.92972    2.34258   2.531   0.0114 *
    age         -0.14099    0.05656  -2.493   0.0127 *

And if you apply the divide by 4 rule you get:

-0.14099 / 4 = -.0352

This is not a proof, or even a simulation, but it is close to the minimum marginal effect in this data. Because the coefficient is negative, the minimum is the largest effect in absolute value, which is what the rule bounds (see full R program below):

    summary(meffx)
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    -0.03525 -0.03262 -0.02697 -0.02583 -0.02030 -0.01071
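
The comparison above can be reproduced without refitting the model. The sketch below plugs the reported coefficients into the logistic density over the ages in the sample (25 through 54) and checks each observation-level marginal effect against the divide-by-4 bound. The names `xb`, `me`, and `bound` are introduced here for illustration; `dlogis()` is base R's logistic density, which equals exp(x)/(1+exp(x))^2.

```r
# Reported coefficients from the logit output above
b0 <- 5.92972   # intercept
b1 <- -0.14099  # age coefficient

# Ages observed in the sample
age <- 25:54

# Linear predictor and marginal effect of age at each observation
xb <- b0 + b1 * age
me <- b1 * dlogis(xb)

# Divide-by-4 bound (negative here, so it is a floor on the effects)
bound <- b1 / 4

summary(me)

# No marginal effect should fall below the bound
all(me >= bound)

# Age at which the effect comes closest to the bound
age[which.min(me)]
```

The effect comes closest to the bound at age 42, where the linear predictor is nearly zero and the fitted probability is therefore nearly 0.5. That is why the minimum in the summary lands so close to -0.14099 / 4.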

R Code:

```
#------------------------------------------------------------------
# PROGRAM NAME: MEFF and Odds Ratios
# DATE: 3/3/16
# CREATED BY: MATT BOGARD
# PROJECT FILE:
#------------------------------------------------------------------
# PURPOSE: GENERATE MARGINAL EFFECTS FOR LOGISTIC REGRESSION AND COMPARE TO:
# ODDS RATIOS / RESULTS FROM R
#
# REFERENCES: https://statcompute.wordpress.com/2012/09/30/marginal-effects-on-binary-outcome/
#             https://diffuseprior.wordpress.com/2012/04/23/probitlogit-marginal-effects-in-r-2/
#
# UCD CENTRE FOR ECONOMIC RESEARCH
#  WORKING PAPER SERIES
# 2011
# Simple Logit and Probit Marginal Effects in R
# Alan Fernihough, University College Dublin
# WP11/22
# October 2011
# http://www.ucd.ie/t4cms/WP11_22.pdf
#------------------------------------------------------------------;

#-----------------------------------------------------
#   generate data for continuous explanatory variable
#----------------------------------------------------

Input = ("participate age
1 25
1 26
1 27
1 28
1 29
1 30
0 31
1 32
1 33
0 34
1 35
1 36
1 37
1 38
0 39
0 40
0 41
1 42
1 43
0 44
0 45
1 46
1 47
0 48
0 49
0 50
0 51
0 52
1 53
0 54
")

dat1 <- read.table(textConnection(Input), header = TRUE) # parse the inline text into a data frame
summary(dat1) # summary stats

#### run logistic regression model

mylogit <- glm(participate ~ age, data = dat1, family = "binomial")
summary(mylogit)

exp(cbind(OR = coef(mylogit), confint(mylogit))) # get odds ratios

#-------------------------------------------------
# marginal effects calculations
#-------------------------------------------------

#--------------------------------------------------------------------
# mfx function for marginal effects from a glm model
#
# from: https://diffuseprior.wordpress.com/2012/04/23/probitlogit-marginal-effects-in-r-2/
# based on:
# UCD CENTRE FOR ECONOMIC RESEARCH
# WORKING PAPER SERIES
# 2011
# Simple Logit and Probit Marginal Effects in R
# Alan Fernihough, University College Dublin
# WP11/22
# October 2011
# http://www.ucd.ie/t4cms/WP11_22.pdf
#---------------------------------------------------------------------

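mfx <- function(x, sims = 1000){
  set.seed(1984)
  # pdf and pdfsd are used below but were never defined in this listing.
  # Reconstructed here (an assumption) as the mean and sd of the link
  # density evaluated at the fitted linear predictor.
  xb <- predict(x, type = "link")
  dens <- if (family(x)$link == "probit") dnorm(xb) else dlogis(xb)
  pdf <- mean(dens)
  pdfsd <- sd(dens)
  marginal.effects <- pdf*coef(x)
  # simulate each coefficient from its estimated sampling distribution
  sim <- matrix(rep(NA, sims*length(coef(x))), nrow = sims)
  for(i in 1:length(coef(x))){
    sim[,i] <- rnorm(sims, coef(x)[i], sqrt(diag(vcov(x)))[i])
  }
  pdfsim <- rnorm(sims, pdf, pdfsd)
  sim.se <- pdfsim*sim
  # column-wise sd; sd() applied to a whole matrix would pool all columns
  res <- cbind(marginal.effects, apply(sim.se, 2, sd))
  colnames(res)[2] <- "standard.error"
  # drop the intercept row when the model has one
  if (names(coef(x))[1] == "(Intercept)") res[-1, , drop = FALSE] else res
}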

# marginal effects from logit
mfx(mylogit)

### code it yourself for marginal effects at the mean

summary(dat1)

b0 <-  5.92972   # estimated intercept from logit
b1 <-  -0.14099  # estimated b from logit

xvar <- 39.5   # reference value (i.e. mean)for explanatory variable
d <- .0001     # incremental change in x

xbi <- (xvar + d)*b1 + b0
xbj <- (xvar - d)*b1 + b0
meff <- ((exp(xbi)/(1+exp(xbi)))-(exp(xbj)/(1+exp(xbj))))/(d*2) ;
print(meff)

### a different perhaps easier formulation for me at the mean

XB <- xvar*b1 + b0 # this could be expanded for multiple b's or x's
meffx <- (exp(XB)/((1+exp(XB))^2))*b1
print(meffx)

### averaging the meff for the whole data set

dat1$XB <- dat1$age*b1 + b0

meffx <- (exp(dat1$XB)/((1+exp(dat1$XB))^2))*b1
summary(meffx) # get mean

#### marginal effects from linear model

lpm <- lm(participate ~ age, data = dat1)
summary(lpm)

#---------------------------------------------------
#
#
# multivariable case
#
#
#---------------------------------------------------

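# dat2 is never created in this listing. The variable names (admit, gre, gpa)
# and the coefficients below are consistent with the UCLA graduate admissions
# example data, so loading it from there is an assumption:
dat2 <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
summary(dat2) # summary stats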

#### run logistic regression model

mylogit <- glm(admit ~ gre + gpa, data = dat2, family = "binomial")
summary(mylogit)

exp(cbind(OR = coef(mylogit), confint(mylogit))) # get odds ratios

# marginal effects from logit
mfx(mylogit)

### code it yourself for marginal effects at the mean

summary(dat2)

b0 <-  -4.949378    # estimated intercept from logit
b1 <-  0.002691     # estimated b for gre
b2 <-   0.754687    # estimated b for gpa

x1 <- 587    # reference value (i.e. mean)for gre
x2 <- 3.39   # reference value (i.e. mean) for gpa
d <- .0001   # incremental change in x

# meff at means for gre
xbi <- (x1 + d)*b1 + b2*x2 + b0
xbj <- (x1 - d)*b1 + b2*x2 + b0
meff <- ((exp(xbi)/(1+exp(xbi)))-(exp(xbj)/(1+exp(xbj))))/(d*2) ;
print(meff)

# meff at means for gpa
xbi <- (x2 + d)*b2 + b1*x1 + b0
xbj <- (x2 - d)*b2 + b1*x1 + b0
meff <- ((exp(xbi)/(1+exp(xbi)))-(exp(xbj)/(1+exp(xbj))))/(d*2) ;
print(meff)

### a different perhaps easier formulation for me at the mean

XB <- x1*b1 +x2*b2 + b0 # this could be expanded for multiple b's or x's

# meff at means for gre
meffx <- (exp(XB)/((1+exp(XB))^2))*b1
print(meffx)

# meff at means for gpa
meffx <- (exp(XB)/((1+exp(XB))^2))*b2
print(meffx)

### averaging the meff for the whole data set

dat2$XB <- dat2$gre*b1 + dat2$gpa*b2 + b0

# sample avg meff for gre
meffx <- (exp(dat2$XB)/((1+exp(dat2$XB))^2))*b1
summary(meffx) # get mean

# sample avg meff for gpa
meffx <- (exp(dat2$XB)/((1+exp(dat2$XB))^2))*b2
summary(meffx) # get mean

#### marginal effects from linear model

lpm <- lm(admit ~ gre + gpa, data = dat2)
summary(lpm)
```

## Tuesday, May 24, 2016

### Data Scientists vs Algorithms vs Real Solutions to Real Problems

A few weeks ago there was a short tweet in my tweetstream that kindled some thoughts.

"All people are biased, that's why we need algorithms!" "All algorithms are biased, that's why we need people!" via @pmarca

And a retweet/reply by Diego Kuonen:

"Algorithms are aids to thinking and NOT replacements for it"

This got me thinking about a lot of the work I have been doing, past job interviews, and conversations I have had with headhunters and 'data science' recruiters, as well as a number of discussions (and sometimes arguments) about what defines data science and about 'unicorns' and 'fake' data scientists. I ran across a couple of interesting perspectives related to this some years back:

What sets data scientists apart from other data workers, including data analysts, is their ability to create logic behind the data that leads to business decisions. "Data scientists extract data, formulate models and apply quantitative analysis in a proactive manner" -Laura Kelley, Vice President, Modis.

"They can suck data out of a server log, a telecom billing file, or the alternator on a locomotive, and figure out what the heck is going on with it. They create new products and services for customers. They can also interface with carbon-based lifeforms — senior executives, product managers, CTOs, and CIOs. You need them." -

I have seen numerous variations on Drew Conway's data science Venn diagram, but I think Drew still nails it pretty well. If you can write code, understand statistics, and can apply those skills based on your specific subject matter expertise, to me that meets the threshold of the minimum skills most employers might require for you to do value-added work. But these tweets raise the question: for many business problems, do we need algorithms at all, and what kind of people do we really need?

Absolutely there are differences in the skillsets required for machine learning vs traditional statistical inference, and I know there are definitely instances where knowing how to set up a Hadoop cluster can be valuable for certain problems. Maybe you do need a complex algorithm to power a killer app or recommender system.

I think part of the hype and snobbery around the terms data science and data scientist might stem from the fact that they are used in so many different contexts, and mean so many things to so many people, that there is fear that the true meaning will be lost along with one's relevance in this space as a data scientist. It might be better to forget about semantics and just concentrate on the ends that we are trying to achieve.

I think a vast majority of businesses really need insights driven by people with subject matter expertise and the ability to clean, extract, analyze, visualize, and probably most importantly, communicate. Sometimes the business need requires prediction, other times inference. Many times you may not need a complicated algorithm or experimental design at all, not as much as you need someone to make sense of the nasty transactional data your business is producing and summarize it all with something maybe as simple as cross tabs. Sometimes you might need a PhD computer scientist or engineer who can meet the strictest of data science thresholds, but lots of times what you really may need is a statistician, econometrician, biometrician, or just a good MBA or business analyst who understands predictive modeling, causal inference, and the basics of a left join.
This is why one recommendation may be to pursue 'data scientists' outside of some of the traditional academic disciplines associated with data science.

There was a similar discussion a couple of years ago on the SAS Voices Blog in a post by Tonya Balan titled "Data scientist or statistician: What's in a name?"

"Obviously, there are a lot of opinions about the terminology.  Here’s my perspective:  The rise and fall of these titles points to the rapid pace of change in our industry.  However, there is one thing that remains constant.  As companies collect more and more data, it is imperative that they have individuals who are skilled at extracting information from that data in a way that is relevant to the business, scientifically accurate, and that can be used to drive better decisions, faster.  Whether they call themselves statisticians, data scientists, or SAS users, their end goal is the same."

And really good insight from Evan Stubbs (in the comment section):

"Personally, I think it's just the latest attempt to clarify what's actually an extremely difficult role. I don't think it's synonymous with any existing title (including statistician, mathematician, analyst, data miner, and so on) but on the same note, I don't think it's where we'll finally end up.

From my perspective, the market is trying to create a shorthand signal that describes someone with:

* An applied (rather than theoretical) focus
* A broader (rather than narrow) set of technical skills
* A focus on outcomes over analysis
* A belief in creating processes rather than doing independent activities
* An ability to communicate along with an awareness of organisational psychology, persuasion, and influence
* An emphasis on recommendations over insight

Existing roles and titles don't necessarily identify those characteristics.

While "Data Scientist" is the latest attempt to create an all-encompassing title, I don't think it'll last. On one hand, it's very generic. On the other, it still implies a technical focus - much of the value of these people stems from their ability to communicate, interface with the business, be creative, and drive outcomes. "Data Scientist", to me at least, carries very research-heavy connotations, something that dilutes the applied and (often) political nature of the field. "

One thing is for certain: recruiters might have a better shot at placing candidates with their clients if role descriptions just said what they mean and left the fighting over who's a real data scientist to the LinkedIn discussion boards and the tweetstream.