Friday, December 2, 2011

Culture War (original post)


'Statistical Modeling: The Two Cultures' by L. Breiman (Statistical Science 2001, Vol. 16, No. 3, 199–231) is an interesting paper and a must-read for anyone traditionally trained in statistics but new to the concept of machine learning. It gives perspective and context to anyone who may attempt to learn data mining software such as SAS Enterprise Miner or who may take a course in machine learning (like Dr. Ng's (Stanford) YouTube lectures in machine learning). The algorithmic machine learning paradigm stands in great contrast to the traditional probabilistic approaches of 'data modeling' in which I had been groomed both as an undergraduate and in graduate school.

From the article, two cultures are defined:

"There are two cultures in the use of statistical modeling to reach conclusions from data."

Classical Statistics/Stochastic Data Modeling Paradigm:

"assumes that the data are generated by a given stochastic data model."

Algorithmic or Machine Learning Paradigm:

"uses algorithmic models and treats the data mechanism as unknown."

In a lecture for Eco 5385, Data Mining Techniques for Economists, Professor Tom Fomby of Southern Methodist University distinguishes machine learning from classical statistical techniques:

Classical Statistics: Focus is on hypothesis testing of causes and effects and interpretability of models. Model choice is based on parameter significance and in-sample goodness-of-fit.

Machine Learning: Focus is on predictive accuracy even in the face of lack of interpretability of models. Model choice is based on cross-validation of predictive accuracy using partitioned data sets.
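To make the difference between the two model choice criteria concrete, here is a minimal sketch in plain Python. It is my own illustration, not something from Professor Fomby's lecture: the simulated data, the nearest-neighbor smoother, and the helper names (knn_predict, cross_validated_mse) are all assumptions made for the example. It shows how a very flexible model can look perfect in-sample and still lose on partitioned data.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the y values of the k training points closest to x."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, test, k):
    """Mean squared prediction error on `test` for a model built on `train`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in test) / len(test)

def cross_validated_mse(data, k, folds=5):
    """Hold out each fold in turn and average the out-of-fold errors."""
    errors = []
    for f in range(folds):
        test = data[f::folds]
        train = [pt for i, pt in enumerate(data) if i % folds != f]
        errors.append(mse(train, test, k))
    return sum(errors) / folds

# Simulated data (an assumption for illustration): y = sin(x) + noise.
random.seed(1)
data = [(x, math.sin(x) + random.gauss(0, 0.3))
        for x in (random.uniform(0, 6) for _ in range(200))]

results = {}
for k in (1, 10):
    results[k] = (mse(data, data, k), cross_validated_mse(data, k))
    print(f"k={k:2d}  in-sample MSE={results[k][0]:.3f}  "
          f"cross-validated MSE={results[k][1]:.3f}")
```

Judged by in-sample fit, the one-neighbor model looks unbeatable because it simply reproduces every training point. Judged by cross-validation, the smoother ten-neighbor model comes out ahead, which is the kind of reversal the machine learning criterion is designed to catch.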

For some, this distinction may be made more transparent by comparing the methods used under each approach. Professor Fomby does a great job making these distinctions:

Classical Statistics Methods: Regression, Logit/Probit, Duration Models, Principal Components, Discriminant Analysis, Bayes Rules

Artificial Intelligence/Machine Learning/Data Mining: Classification and Regression Trees, Neural Nets, K-Nearest Neighbors, Association Rules, Cluster Analysis
 
From the standpoint of econometrics, the data modeling culture is described very well in this post by Tim Harford:

"academic econometrics is rarely used for forecasting. Instead, econometricians set themselves the task of figuring out past relationships. Have charter schools improved educational standards? Did abortion liberalisation reduce crime? What has been the impact of immigration on wages?"

This is certainly consistent with the comparisons presented in the Statistical Science article. Note, however, that the methodologies referenced in the article (like logistic regression) that are utilized under the data modeling or classical statistics paradigm are a means to fill what Breiman refers to as a black box. Under this paradigm, analysts attempt to characterize an outcome by estimating parameters and making inferences about them based on some assumed data generating process. This is not to say that these methods are never used under the machine learning paradigm; what differs is how they are used. The article provides a very balanced 'ping-pong' discussion citing various experts from both cultures, including some who seem to promote both, such as the authors of The Elements of Statistical Learning: Data Mining, Inference, and Prediction.

In my first econometrics course, the textbook cautioned against 'data mining,' described as using techniques such as stepwise regression. It insisted on letting theory drive model development and on rating the model by total variance explained and the significance of individual coefficients. This advice was certainly influenced by the 'data modeling' culture. The text was published in the same year as the Breiman article. (I understand this caution has been moderated in contemporary editions.)
Of course, as the article mentions, if what you are interested in is theory and the role of particular variables in underlying processes, then traditional inference seems to be the appropriate direction to take. (Breiman, of course, still takes issue, arguing that we can't trust the significance of an estimated coefficient if the model overall is a poor predictor.)

"Higher predictive accuracy is associated with more reliable information about the underlying data mechanism. Weak predictive accuracy can lead to questionable conclusions."

"Algorithmic models can give better predictive accuracy than data models, and provide better information about the underlying mechanism."

"The goal is not interpretability, but accurate information."

When algorithmic models are more appropriate (especially when the goal is prediction), a stochastic model designed to make inferences about specific model coefficients may provide "the right answer to the wrong question," as Emanuel Parzen puts it in his comments on Breiman.

While I was in graduate school, the stochastic culture was dominant. However, there was some inclusiveness. Take, for example, Kennedy's A Guide to Econometrics. Kennedy discusses the expectation maximization algorithm (also a lecture in Dr. Ng's course) and provides a great introduction to neural networks, which greatly influenced my understanding of such allegedly 'uninterpretable' algorithmic techniques.


Keeping an Open Mind: Multiculturalism in Data Science

As Breiman states:

"Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems."

A multicultural approach to analysis (stochastic or algorithmic) seems to be the takeaway message of the Breiman article and the discussions that follow. In fact, this may be the best approach in the universe of all things empirical. This is certainly true in the new field of data science, and it is clearly depicted in Drew Conway's data science Venn diagram.





As Parzen states, "I believe statistics has many cultures." He points out that many practitioners are well aware of the divides that exist between Bayesians and frequentists, algorithmic approaches aside. Even if we restrict our tool box to stochastic methods, we can often find our hands tied if we are not open-minded. We can still sacrifice results for the sake of technique. And there are plenty of divisive debates. For example, in discussing the state of modern econometrics, Leamer faults econometric results that emphasize statistical significance over practicality or importance in an interesting EconTalk podcast.

Breiman touches on his idea of importance as well:

"My definition of importance is based on prediction. A variable might be considered important if deleting it seriously affects prediction accuracy"

Note that Breiman makes no mention of statistical significance (a child of the stochastic data modeling culture).
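A literal reading of this definition can be sketched in a few lines of plain Python. This is my own toy illustration, not Breiman's procedure: the simulated data, the nearest-neighbor classifier, and the helper names (knn_classify, holdout_accuracy) are assumptions made for the example. Permutation-based measures, such as those commonly used with random forests, shuffle a variable's values rather than refitting without it; the sketch below simply deletes each variable and refits.

```python
import random

def knn_classify(train, x, k=5):
    """Majority vote among the k training rows closest to x."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def project(rows, keep):
    """Keep only the feature columns listed in `keep`."""
    return [([x[j] for j in keep], y) for x, y in rows]

def holdout_accuracy(train, test, keep):
    """Accuracy on held-out rows using only the features in `keep`."""
    tr, te = project(train, keep), project(test, keep)
    return sum(knn_classify(tr, x) == y for x, y in te) / len(te)

# Simulated data (an assumption for illustration):
# feature 0 determines the outcome, feature 1 is pure noise.
random.seed(2)
rows = []
for _ in range(400):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    rows.append((x, 1 if x[0] > 0 else 0))
train, test = rows[:300], rows[300:]

features = [0, 1]
baseline = holdout_accuracy(train, test, keep=features)

# Importance of a variable = drop in hold-out accuracy when it is deleted.
importance = {}
for j in features:
    remaining = [i for i in features if i != j]
    importance[j] = baseline - holdout_accuracy(train, test, keep=remaining)
    print(f"feature {j}: accuracy drop when deleted = {importance[j]:.3f}")
```

Nothing here involves a standard error or a p-value. A variable earns its importance only by how much prediction suffers without it.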

This is echoed in a paper from the Joint Statistical Meetings, Section on Statistical Education (2009), 'The Cult of Statistical Significance' by Stephen T. Ziliak and Deirdre N. McCloskey, who have authored a book by the same title. They note:

"Statistical significance is, we argue, a diversion from the proper objects of scientific study…Significant does not mean important and insignificant does not mean unimportant"

I even find a hint of this in Greene, a well known econometrics textbook author:

"It remains an interesting question for research whether fitting y well or obtaining good parameter estimates is a preferable estimation criterion. Evidently, they need not be the same thing."
Greene, Econometric Analysis, 5th ed., p. 686

It seems that cultural divides are hard to avoid. Take another example from the article below:

Statistical alternatives for studying college student retention: A comparative analysis of logit, probit, and linear regression. Eric L. Dey and Alexander W. Astin. Research in Higher Education, Volume 34, Number 5, 569-581 (1993).

The above article concluded:

"Results indicate that despite the theoretical advantages offered by logistic regression and probit analysis, there is little practical difference between either of these two techniques and more traditional linear regression."

It never fails: go to a paper presentation and you will see someone getting uptight over linear probability models. In fact, there is some interesting literature on the robustness and pragmatism of linear probability models.

An emphasis on robustness or even parsimony seems to be an overarching theme in the fairly new book "Mostly Harmless Econometrics" by Joshua Angrist and Jörn-Steffen Pischke. Angrist and Pischke write about the robustness of the LPM in their book (but also see econometrician Dave Giles's post with some challenging objections to the LPM, a followup by Angrist and Pischke, and more discussion from economist Marc Bellemare).

Their web site for the book seems to caution practitioners about being married to fancier techniques:

"fancier techniques are typically unnecessary and even dangerous."


I have become more and more open-minded as my experience working under both paradigms has increased. Software packages like SAS Enterprise Miner certainly accommodate open-minded curiosity. Packages like (or even SAS IML) let the very curious get their hands even dirtier. Both have been very influential to me. When it comes to estimating the marginal effects of a treatment for a binary outcome, I usually have no issue with using an LPM over logistic regression (but I would have taken issue with it not long ago). But when it comes to prediction, I certainly won't shy away from using logistic regression or, for that matter, a neural net, a decision tree, or a 'multicultural' ensemble of all of the above.
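For the curious, here is a minimal sketch in plain Python of why the LPM can be a comfortable choice for a treatment effect. It is a toy illustration under strong assumptions: simulated data, a single binary treatment, no other covariates, and helper names (ols_slope, fit_logit) that I made up for the example. In this saturated case the LPM coefficient and the logit marginal effect are both just the difference in outcome rates between the two groups, so they agree. With additional covariates the two will generally differ, and how much is an empirical question.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ols_slope(d, y):
    """Slope from regressing y on a constant and d (the LPM effect)."""
    n = len(d)
    d_bar, y_bar = sum(d) / n, sum(y) / n
    cov = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
    var = sum((di - d_bar) ** 2 for di in d)
    return cov / var

def fit_logit(d, y, iterations=25):
    """Newton-Raphson for P(y = 1) = sigmoid(b0 + b1 * d)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for di, yi in zip(d, y):
            p = sigmoid(b0 + b1 * di)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient of the log-likelihood
            g1 += (yi - p) * di
            h00 += w                # entries of X'WX
            h01 += w * di
            h11 += w * di * di
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulated data (an assumption for illustration): treatment raises
# the probability of the outcome from 0.40 to 0.65.
random.seed(3)
d = [1 if random.random() < 0.5 else 0 for _ in range(500)]
y = [1 if random.random() < (0.65 if di else 0.40) else 0 for di in d]

lpm_effect = ols_slope(d, y)
b0, b1 = fit_logit(d, y)
logit_effect = sigmoid(b0 + b1) - sigmoid(b0)  # change in fitted probability

print(f"LPM treatment effect:   {lpm_effect:.4f}")
print(f"Logit marginal effect:  {logit_effect:.4f}")
```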