Saturday, January 29, 2011

Culture War: Classical Statistics vs. Machine Learning

'Statistical Modeling: The Two Cultures' by L. Breiman (Statistical Science 2001, Vol. 16, No. 3, 199–231) is an interesting paper and a must-read for anyone traditionally trained in statistics but new to the concept of machine learning. It gives perspective and context to anyone who may attempt to learn to use data mining software such as SAS Enterprise Miner or who may take a course in machine learning (like Dr. Ng's (Stanford) YouTube lectures on machine learning). The algorithmic machine learning paradigm is in great contrast to the traditional probabilistic approaches of 'data modeling' in which I had been groomed both as an undergraduate and in graduate school.

From the article, two cultures are defined:

"There are two cultures in the use of statistical modeling to reach conclusions from data."

Classical Statistics/Stochastic Data Modeling Paradigm:

"assumes that the data are generated by a given stochastic data model."

Algorithmic or Machine Learning Paradigm:

"uses algorithmic models and treats the data mechanism as unknown."

In a lecture for Eco 5385, Data Mining Techniques for Economists, Professor Tom Fomby of Southern Methodist University distinguishes machine learning from classical statistical techniques:

Classical Statistics: Focus is on hypothesis testing of causes and effects and interpretability of models.  Model Choice is based on parameter significance and In-sample Goodness-of-fit.

Machine Learning:  Focus is on Predictive Accuracy even in the face of lack of interpretability of models.  Model Choice is based on Cross Validation of Predictive Accuracy using Partitioned Data Sets.
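
To make the contrast between in-sample goodness-of-fit and cross validation concrete, here is a minimal sketch of my own (it is not taken from Professor Fomby's lecture). It assumes simulated data from a noisy linear relationship and uses a hand-rolled k-nearest-neighbors predictor; the function names, the noise level, and the choice of 5 folds are all illustrative assumptions rather than anything prescribed by the sources above.

```python
import random

def knn_predict(train, x, k):
    """Average the y values of the k training points whose x is nearest."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, test, k):
    """Mean squared error of a k-NN model built on `train`, scored on `test`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in test) / len(test)

def cv_mse(data, k, folds=5):
    """Cross-validated MSE: each partition is held out once for scoring."""
    errors = []
    for i in range(folds):
        test = data[i::folds]                                      # held-out partition
        train = [p for j, p in enumerate(data) if j % folds != i]  # remaining data
        errors.append(mse(train, test, k))
    return sum(errors) / folds

# Simulated data (an assumption for illustration): y = 2x + Gaussian noise.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 2.0 * x + random.gauss(0, 3)) for x in xs]

# Score two candidate models (1 neighbor vs. 10 neighbors) both ways.
in_sample = {k: mse(data, data, k) for k in (1, 10)}
cross_val = {k: cv_mse(data, k) for k in (1, 10)}

for k in (1, 10):
    print(f"k={k:2d}  in-sample MSE={in_sample[k]:.2f}  cross-validated MSE={cross_val[k]:.2f}")
```

With one neighbor, every training point is its own nearest neighbor, so the in-sample error is exactly zero and an in-sample criterion would always pick that model. Scoring on held-out partitions tells a different story, because the one-neighbor model is largely memorizing noise. This is the sense in which the two cultures can choose different models from the same data.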

For some, this distinction may be made more transparent by comparing the methods used under each approach. Professor Fomby does a great job making these distinctions:

Methods:

Classical Statistics: Regression, Logit/Probit, Duration Models, Principal Components, Discriminant Analysis, Bayes Rules

Artificial Intelligence/Machine Learning/Data Mining: Classification and Regression Trees, Neural Nets, K-Nearest Neighbors, Association Rules, Cluster Analysis
 
From the standpoint of econometrics, the data modeling culture is described very well in this post by Tim Harford:

"academic econometrics is rarely used for forecasting. Instead, econometricians set themselves the task of figuring out past relationships. Have charter schools improved educational standards? Did abortion liberalisation reduce crime? What has been the impact of immigration on wages?"

This is certainly consistent with the comparisons presented in the Statistical Science article. Note, however, that the methodologies referenced in the article (like logistic regression) that are utilized under the data modeling or classical statistics paradigm are a means to fill what Breiman refers to as a black box. Under this paradigm, analysts attempt to characterize an outcome by estimating parameters and making inferences about them based on some assumed data generating process. This is not to say that these methods are never used under the machine learning paradigm; the difference lies in how they are used. The article provides a very balanced 'ping-pong' discussion citing various experts from both cultures, including some who seem to promote both, such as the authors of The Elements of Statistical Learning: Data Mining, Inference, and Prediction.

In my first econometrics course, the textbook cautioned against 'data mining,' described as using techniques such as stepwise regression. It insisted on letting theory drive model development and on rating the model by total variance explained and the significance of individual coefficients. This advice was certainly influenced by the 'data modeling' culture. The text was published in the same year as the Breiman article. (I understand this caution has been moderated in contemporary editions.)

Of course, as the article mentions, if what you are interested in is theory and the role of particular variables in underlying processes, then traditional inference seems to be the appropriate direction to take. (Breiman, of course, still takes issue, arguing that we can't trust the significance of an estimated coefficient if the model overall is a poor predictor.)

"Higher predictive accuracy is associated with more reliable information about the underlying data mechanism. Weak predictive accuracy can lead to questionable conclusions."

"Algorithmic models can give better predictive accuracy than data models, and provide better information about the underlying mechanism."

"The goal is not interpretability, but accurate information."

When algorithmic models are more appropriate (especially when the goal is prediction), a stochastic model designed to make inferences about specific model coefficients may provide "the right answer to the wrong question," as Emanuel Parzen puts it in his comments on Breiman.


I even find a hint of this in Greene, a well-known econometrics textbook author:

"It remains an interesting question for research whether fitting y well or obtaining good parameter estimates is a preferable estimation criterion. Evidently, they need not be the same thing."
p. 686, Greene, Econometric Analysis, 5th ed.

Keeping an Open Mind: Multiculturalism in Data Science

As Breiman states:

"Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems."

A multicultural approach to analysis (stochastic or algorithmic) seems to be the takeaway message of the Breiman article and the discussions that follow. The field of data science, as clearly depicted in Drew Conway's data science Venn diagram, is multi-pillared.




Parzen points out that many practitioners are well aware of the divides that exist between Bayesians and frequentists, algorithmic approaches aside. Even if we restrict our tool box to stochastic methods, we can often find our hands tied if we are not open-minded or do not understand the social norms that distinguish theory from practice. And there are plenty of divisive debates, the use of linear probability models for one.

As Parzen states, "I believe statistics has many cultures." We will be much more effective in our work and learning if we have this understanding, and embrace rather than fight the diversity of thought across various fields of science and each discipline's differing social norms and practices. Data science, at its best, is an interdisciplinary field of study and work.

This article was updated and abridged on December 2, 2014.

2 comments:

  1. Great article! I think a major problem for econometricians is that most machine-learning techniques provide some strange implications for traditional theory-based economic parameters, especially elasticities. Do you know of any papers that have addressed this?

  2. The permeation of machine learning techniques into science actually scares me because of this. Paraphrasing Terry Tao, the point of academic research isn't to prove facts true but to understand why/how things are the way they are.

    Machine learning has its place in the parts of science where the data is just too messy and/or uninteresting to study, but the way it's being used now is concerning.
