Monday, January 31, 2011

Culture War: Classical Statistics vs Machine Learning

'Statistical Modeling: The Two Cultures' by L. Breiman (Statistical Science 2001, Vol. 16, No. 3, 199–231) is an interesting paper that is a must-read for anyone traditionally trained in statistics but new to the concept of machine learning. It gives perspective and context to anyone who may attempt to learn to use data mining software such as SAS Enterprise Miner or who may take a course in machine learning (like Dr. Ng's (Stanford) YouTube lectures on machine learning). The algorithmic machine learning paradigm is in great contrast to the traditional probabilistic approaches of 'data modeling' in which I had been groomed both as an undergraduate and in graduate school.

From the article, two cultures are defined:

"There are two cultures in the use of statistical modeling to reach conclusions from data."

Classical Statistics/Stochastic Data Modeling Paradigm:

"assumes that the data are generated by a given stochastic data model."

Algorithmic or Machine Learning Paradigm:

"uses algorithmic models and treats the data mechanism as unknown."

In a lecture for Eco 5385, Data Mining Techniques for Economists, Professor Tom Fomby of Southern Methodist University distinguishes machine learning from classical statistical techniques:

Classical Statistics: Focus is on hypothesis testing of causes and effects and on the interpretability of models. Model choice is based on parameter significance and in-sample goodness-of-fit.

Machine Learning: Focus is on predictive accuracy, even at the cost of model interpretability. Model choice is based on cross-validation of predictive accuracy using partitioned data sets.
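To make the difference between the two model-choice criteria concrete, here is a minimal sketch in plain Python. It is my own illustration, not something taken from Professor Fomby's lecture or from Breiman's paper. The data are simulated from an assumed linear process, and the function names (fit_ols, fit_1nn, cv_mse) are hypothetical helpers invented for this example. It scores a simple regression line and a 1-nearest-neighbor predictor twice: once on the same data used for fitting (in-sample) and once by 5-fold cross-validation on partitioned data.

```python
import random

def fit_ols(xs, ys):
    """Closed-form simple linear regression; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_1nn(xs, ys):
    """1-nearest-neighbor regressor; it simply memorizes the training pairs."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on the given data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cv_mse(fit, xs, ys, k=5):
    """k-fold cross-validation: fit on k-1 folds, score on the held-out fold."""
    n = len(xs)
    squared_errors = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th observation
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        predict = fit(train_x, train_y)
        squared_errors += sum((predict(xs[i]) - ys[i]) ** 2 for i in held_out)
    return squared_errors / n

# Simulated data (an assumption for illustration): y = 2 + 0.5x + noise.
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(60)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

in_ols = mse(fit_ols(xs, ys), xs, ys)   # in-sample fit, regression line
in_knn = mse(fit_1nn(xs, ys), xs, ys)   # in-sample fit, 1-nearest neighbor
cv_ols = cv_mse(fit_ols, xs, ys)        # out-of-sample, regression line
cv_knn = cv_mse(fit_1nn, xs, ys)        # out-of-sample, 1-nearest neighbor

print("in-sample MSE:       OLS", in_ols, " 1-NN", in_knn)
print("cross-validated MSE: OLS", cv_ols, " 1-NN", cv_knn)
```

Because the nearest neighbor of any training point is the point itself, the 1-NN predictor reproduces its training data exactly, so in-sample fit alone would always favor it. Cross-validation scores each model only on observations it has not seen, which is why the machine learning culture relies on it for model choice.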

For some, this distinction may be made more transparent by comparing the methods used under each approach. Professor Fomby makes these distinctions:

Classical Statistics: Regression, Logit/Probit, Duration Models, Principal Components, Discriminant Analysis, Bayes Rules

Artificial Intelligence/Machine Learning/Data Mining: Classification and Regression Trees, Neural Nets, K-Nearest Neighbors, Association Rules, Cluster Analysis

From the standpoint of econometrics, the data modeling culture is described very well in this post by Tim Harford:

"academic econometrics is rarely used for forecasting. Instead, econometricians set themselves the task of figuring out past relationships. Have charter schools improved educational standards? Did abortion liberalisation reduce crime? What has been the impact of immigration on wages?"

The methodologies referenced in the article (like logistic regression) that are utilized under the data modeling or classical statistics paradigm are a means to fill what Breiman refers to as a black box. Under this paradigm, analysts attempt to characterize an outcome by estimating parameters and making inferences about them based on some assumed data generating process. This is not to say that these methods are never used under the machine learning paradigm; what differs is how they are used. The article provides a very balanced 'ping-pong' discussion citing various experts from both cultures, including some who seem to promote both, such as the authors of The Elements of Statistical Learning: Data Mining, Inference, and Prediction.

In my first econometrics course, the textbook cautioned against 'data mining,' described as using techniques such as stepwise regression. It insisted on letting theory drive model development, rating the model on total variance explained, and the significance of individual coefficients. This advice was certainly influenced by the 'data modeling' culture. The text was published in the same year as the Breiman article.

Of course, as the article mentions, if what you are interested in is theory and the role of particular variables in underlying processes, then traditional inference seems to be appropriate.

When algorithmic models are more appropriate (especially when the goal is prediction), a stochastic model designed to make inferences about specific model coefficients may provide "the right answer to the wrong question," as Emanuel Parzen puts it in his comments on Breiman.

Keeping an Open Mind: 'Multiculturalism' in Data Science

As Breiman states:

"Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems."

As Parzen states, "I believe statistics has many cultures." He points out that many practitioners are well aware of the divides that exist between Bayesians and frequentists, algorithmic approaches aside. Even if we restrict our tool box to stochastic methods, we can often find our hands tied if we are not open minded or do not understand the social norms that distinguish theory from practice. And there are plenty of divisive debates, the use of linear probability models for one.

This article was updated and abridged on December 2, 2018.
