Tuesday, February 13, 2018

Intuition for Random Effects

Previously I wrote a post, based on course notes from J. Blumenstock, that attempted to provide some intuition for how fixed effects estimators can account for unobserved heterogeneity (individual specific effects).

Recently someone asked if I could provide a similarly motivating and intuitive example for random effects. Although I was not able to come up with a new example, I can definitely discuss random effects in the context of the previous example. But first, a little (less intuitive) background.

Background

To recap, the purpose of both fixed and random effects estimators is to model treatment effects in the face of unobserved individual specific effects.

y_it = β x_it + α_i + u_it    (1)

In the model above the individual specific effect is represented by α_i. In terms of estimation, the difference between fixed and random effects depends on how we choose to model this term. In a fixed effects framework it can be captured through dummy variable estimation (which creates different intercepts or shifts capturing the individual specific effects) or by transforming the data, subtracting group (fixed effects) means from the individual observations within each group. In random effects models, individual specific effects are captured by a composite error term (α_i + u_it), which assumes that the individual intercepts are drawn from a random distribution of possible intercepts. The random component α_i of the error term captures the individual specific effects in a different way than fixed effects models do.

As noted in another post, Fixed, Mixed, and Random Effects, the random effects model is estimated using Generalized Least Squares (GLS):

β_GLS = (X'Ω^-1 X)^-1 (X'Ω^-1 Y), where Ω = I ⊗ Σ    (2)

Where Σ is the variance of the composite error α_i + u_it. If Σ is unknown, it is estimated, producing a feasible generalized least squares estimate β_FGLS.
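To make equation (2) concrete, here is a minimal numpy sketch of the GLS calculation, assuming Ω is known; the data and Ω below are made up purely for illustration and are not tied to any example in this post.

import numpy as np

rng = np.random.default_rng(123)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept plus one regressor
beta = np.array([1.0, 2.0])
Omega = np.diag(rng.uniform(0.5, 2.0, size=n))           # stand-in error covariance matrix
y = X @ beta + rng.multivariate_normal(np.zeros(n), Omega)

# beta_GLS = (X' Omega^-1 X)^-1 (X' Omega^-1 y), as in equation (2)
Omega_inv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
print(beta_gls)   # should be close to [1.0, 2.0]

In practice Ω is not known, so an estimate is plugged in (the FGLS step mentioned above); for the random effects model, Ω has the block structure implied by the composite error term.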

Intuition for Random Effects

In my post Intuition for Fixed Effects I noted: 

"Essentially using a dummy variable in a regression for each city (or group, or type to generalize beyond this example) holds constant or 'fixes' the effects across cities that we can't directly measure or observe. Controlling for these differences removes the 'cross-sectional' variation related to unobserved heterogeneity (like tastes, preferences, other unobserved individual specific effects). The remaining variation, or 'within' variation can then be used to 'identify' the causal relationships we are interested in."

Let's look at the toy data I used in that example.

[Plots from that post: price and quantity data for each city, with an ellipse drawn around each city's observations.]
The crude ellipses in the plots above (motivated by the example given in Kennedy, 2008) indicate the data for each city and the 'within' variation exploited by fixed effects models (which allowed us to identify the correct price/quantity relationships expected in the previous post). The differences between the ellipses represent 'between' variation. As Kennedy discusses, random effects models differ from fixed effects models in that they are able to exploit both 'within' and 'between' variation, producing an estimate that is a weighted average of both kinds of variation (via Σ in equation 2 above). OLS, on the other hand, exploits both kinds of variation as an unweighted average.
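To make this concrete, here is a rough sketch in Python (using statsmodels) with simulated data loosely in the spirit of the toy example: each city gets its own intercept that happens to be correlated with its price level, while the true within-city slope is -2. None of these numbers or variable names come from the original post; it is just an illustration.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Simulate a toy panel: 10 cities, 20 observations each.
# Each city has its own intercept (alpha) that is correlated with its typical
# price level, but within every city the true slope of quantity on price is -2.
rows = []
for j in range(10):
    mean_price = rng.uniform(2, 8)
    alpha = 5 + 2 * mean_price                      # city effect correlated with price
    price = mean_price + rng.uniform(-1, 1, size=20)
    quantity = alpha - 2.0 * price + rng.normal(0, 0.5, size=20)
    rows.append(pd.DataFrame({"city": f"city_{j}", "price": price, "quantity": quantity}))
df = pd.concat(rows, ignore_index=True)

# Pooled OLS: treats within- and between-city variation the same (unweighted)
pooled = smf.ols("quantity ~ price", data=df).fit()

# Fixed effects via city dummies: uses only the within-city ('ellipse') variation
fe = smf.ols("quantity ~ price + C(city)", data=df).fit()

# Random effects via a random intercept per city (one common implementation):
# a weighted combination of within and between variation
re = smf.mixedlm("quantity ~ price", data=df, groups=df["city"]).fit()

print("pooled OLS slope:      ", round(pooled.params["price"], 3))
print("fixed effects slope:   ", round(fe.params["price"], 3))
print("random intercept slope:", round(re.params["price"], 3))

With the city effects correlated with price, the pooled slope is pulled away from the true within-city value, the dummy variable (fixed effects) slope recovers it, and the random intercept slope generally sits between the two, reflecting the weighted combination of within and between variation (and previewing the orthogonality concern discussed below).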

More Details 

As Kennedy discusses, both FE and RE can be viewed as running OLS on different transformations of the data.

For fixed effects: "this transformation consists of subtracting from each observation the average of the values within its ellipse"

For random effects: "the EGLS (or FGLS above) calculation is done by finding a transformation of the data that creates a spherical variance-covariance matrix and then performing OLS on the transformed data."
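Here is a rough sketch of those two transformations for a single group, assuming a balanced panel and treating the variance components as known. The quasi-demeaning factor theta follows the standard textbook random effects transformation; the numbers are purely illustrative and not from Kennedy.

import numpy as np

T = 5                                  # observations per group
sigma2_u, sigma2_alpha = 1.0, 4.0      # illustrative variance components

y_group = np.array([3.1, 2.8, 3.5, 3.0, 2.9])   # one group's observations

# Fixed effects: subtract the group (ellipse) mean from each observation
fe_transformed = y_group - y_group.mean()

# Random effects: subtract only a fraction theta of the group mean
# (quasi-demeaning); this makes the composite error spherical so that
# OLS on the transformed data reproduces the GLS estimate
theta = 1.0 - np.sqrt(sigma2_u / (sigma2_u + T * sigma2_alpha))
re_transformed = y_group - theta * y_group.mean()

print("theta:", round(theta, 3))
print("within-transformed:", fe_transformed)
print("quasi-demeaned:    ", re_transformed)

As theta approaches 1 (large group variance relative to the idiosyncratic error, or long panels) the random effects transformation approaches the fixed effects demeaning; as theta approaches 0 it approaches plain OLS on the untransformed data.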

As Kennedy notes, the increased information used by RE makes it the more efficient estimator, but correlation between 'x' and the error term creates bias; i.e., RE assumes that α_i is uncorrelated with (orthogonal to) the regressors. Angrist and Pischke (2009) note in a footnote (p. 223) that they prefer FE because the gains in efficiency are likely to be modest while the finite sample properties of RE may be worse. As noted on p. 243, an important assumption for identification with FE is that the most important sources of omitted variables bias are time invariant (time invariant factors get differenced out, which also means FE cannot identify the effects of time invariant regressors). Angrist and Pischke also have a nice discussion on pages 244-245 of the choice between FE and lagged dependent variable models.

References:

A Guide to Econometrics. Peter Kennedy. 6th Edition. 2008.
Mostly Harmless Econometrics. Angrist and Pischke. 2009.

See also: ‘Metrics Monday: Fixed Effects, Random Effects, and (Lack of) External Validity (Marc Bellemare).

Marc notes: 

"Nowadays, in the wake of the Credibility Revolution, what we teach students is: “You should use RE when your variable of interest is orthogonal to the error term; if there is any doubt and you think your variable of interest is not orthogonal to the error term, use FE.” And since the variable can be argued to be orthogonal pretty much only in cases where it is randomly assigned in the context of an experiment, experimental work is pretty much the only time the RE estimator should be used."

Friday, February 2, 2018

Deep Learning vs. Logistic Regression, ROC vs. Calibration, Explaining vs. Predicting

Frank Harrell writes Is Medicine Mesmerized by Machine Learning? Some time ago I wrote about predictive modeling and the differences between what the ROC curve may tell us and how well a model 'calibrates.'

There I quoted from the journal Circulation:

'When the goal of a predictive model is to categorize individuals into risk strata, the assessment of such models should be based on how well they achieve this aim...The use of a single, somewhat insensitive, measure of model fit such as the c statistic can erroneously eliminate important clinical risk predictors for consideration in scoring algorithms'

Not too long ago Dr. Harrell shared the following tweet related to this:

I have seen hundreds of ROC curves in the past few years.  I've yet to see one that provided any insight whatsoever.  They reverse the roles of X and Y and invite dichotomization.  Authors seem to think they're obligatory.  Let's get rid of 'em. @f2harrell 8:42 AM - 1 Jan 2018

In his Statistical Thinking post above, Dr. Harrell writes:

"Like many applications of ML where few statistical principles are incorporated into the algorithm, the result is a failure to make accurate predictions on the absolute risk scale. The calibration curve is far from the line of identity as shown below...The gain in c-index from ML over simpler approaches has been more than offset by worse calibration accuracy than the other approaches achieved."

That is, depending on the goal, better ROC scores don't necessarily mean better models.
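One quick way to see this: the c-statistic (area under the ROC curve) depends only on how the predictions rank the outcomes, so any monotone distortion of the predicted probabilities leaves it unchanged while calibration can fall apart. The sketch below is my own illustration with simulated data (not from Harrell's post), using scikit-learn.

import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
p_true = rng.uniform(0.01, 0.99, size=5000)   # true event probabilities
y = rng.binomial(1, p_true)                    # simulated outcomes

p_calibrated = p_true          # a perfectly calibrated set of predictions
p_distorted = p_true ** 3      # same ranking, badly miscalibrated

print("AUC, calibrated:  ", round(roc_auc_score(y, p_calibrated), 3))
print("AUC, distorted:   ", round(roc_auc_score(y, p_distorted), 3))      # identical
print("Brier, calibrated:", round(brier_score_loss(y, p_calibrated), 3))
print("Brier, distorted: ", round(brier_score_loss(y, p_distorted), 3))   # much worse

Both sets of predictions discriminate identically, but only one gives sensible absolute risks, which is exactly the discrimination vs. calibration distinction.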

But his post was about more than discrimination and calibration. It was discussing the logistic regression approach taken in Exceptional Mortality Prediction by Risk Scores from Common Laboratory Tests vs. the deep learning approach used in Improving Palliative Care with Deep Learning.

"One additional point: the ML deep learning algorithm is a black box, not provided by Avati et al, and apparently not usable by others. And the algorithm is so complex (especially with its extreme usage of procedure codes) that one can’t be certain that it didn’t use proxies for private insurance coverage, raising a possible ethics flag. In general, any bias that exists in the health system may be represented in the EHR, and an EHR-wide ML algorithm has a chance of perpetuating that bias in future medical decisions. On a separate note, I would favor using comprehensive comorbidity indexes and severity of disease measures over doing a free-range exploration of ICD-9 codes."

This kind of pushes back against the idea that deep neural nets can effectively bypass feature engineering, or at least raises cautions in specific contexts.

Actually, he is not as critical of the authors of this paper as he is of what he considers the undue accolades it has received.

This ties back to my post on LinkedIn a couple weeks ago, Deep Learning, Regression, and SQL. 

See also:

To Explain or Predict
Big Data: Causality and Local Expertise Are Key in Agronomic Applications

And: 

Feature Engineering for Deep Learning
In Deep Learning, Architecture Engineering is the New Feature Engineering