Saturday, January 29, 2011

Culture War: Classical Statistics vs. Machine Learning

 'Statistical Modeling: The Two Cultures' by L. Breiman (Statistical Science
2001, Vol. 16, No. 3, 199–231) is an interesting paper that is a must read for anyone traditionally trained in statistics, but new to the concept of machine learning. It gives  perspective and context to anyone that may attempt to learn to use data mining software such as  SAS Enterprise Miner or who may take a course in machine learning  (like Dr. Ng's (Stanford) youtube lectures in machine learning .) The algorithmic machine learning paradigm is in great contrast to the traditional probabilistic approaches of 'data modeling' in which I had been groomed both as an undergraduate and in graduate school.

From the article, two cultures are defined:

"There are two cultures in the use of statistical modeling to reach conclusions from data.

Classical Statistics/Stochastc Data Modeling Paradigm:

" assumes that the data are generated by a given stochastic data model. "

Algorithmic or Machine Learning Paradigm:

"uses algorithmic models and treats the data mechanism as unknown."

In a lecture  for Eco 5385
 Data Mining Techniques for Economists
, Professor  Tom Fomby
 of Southern Methodist University distinguishes machine learning from classical statistical techniques:

Classical Statistics: Focus is on hypothesis testing of causes and effects and interpretability of models.  Model Choice is based on parameter significance and In-sample Goodness-of-fit.

Machine Learning:  Focus is on Predictive Accuracy even in the face of lack of interpretability of models.  Model Choice is based on Cross Validation of Predictive Accuracy using Partitioned Data Sets.

For some, this distinction may be made more transparent by comparing the methods used under each approach. Professor Fomby does a great job making these distinctions:

Methods Classical Statistics:  Regression, Logit/Probit, Duration Models, Principle Components, Discriminant Analysis, Bayes Rules

Artificial Intelligence/Machine Learning/Data Mining: Classification and Regression Trees, Neural Nets, K-Nearest Neighbors, Association Rules, Cluster Analysis
 
From the standpoint of econometrics, the data modeling culture is described very well in this post by Tim Harford:

"academic econometrics is rarely used for forecasting. Instead, econometricians set themselves the task of figuring out past relationships. Have charter schools improved educational standards? Did abortion liberalisation reduce crime? What has been the impact of immigration on wages?"

This is certainly consistent with the comparisons presented in the Statistical Science article. Note however, that the methodologies referenced in the article (like logistic regression)  that are utilized under the data modeling or classical statistics paradigm are a means to fill what Brieman refers to as a black box. Under this paradigm analysts are attempting to characterize an outcome by estimating parameters and making inferences about them based on some assumed data generating process. It is not to say that these methods are never used under the machine learning paradigm, but how they are used. The article provides a very balanced 'ping-pong' discussion citing various experts from both cultures, including some who seem to promote both including the authors of The Elements of Statistical Learning: Data Mining, Inference, and Prediction.

In my first econometrics course, the textbook cautioned against 'data mining,' described as using techniques such as stepwise regression. It insisted on letting theory drive model development, rating the model on total variance explained, and the significance of individual coefficients. This advice was certainly influenced by the 'data modeling' culture. The text was published in the same year as the Breiman article. ( I understand this caution has been moderated in contemporary editions).
Of course, as the article mentions, if what you are interested in is theory and the role of particular variables in underlying processes, then traditional inference seems to be the appropriate direction to take. (Breiman of course still takes issue, arguing that we can't trust the significance of an estimated co-efficient if the model overall is a poor predictor).

"Higher predictive accuracy is associated with more reliable information about the underlying data mechanism. Weak predictive accuracy can lead to questionable conclusions."

"Algorithmic models can give better predictive accuracy than data models,and provide better information about the underlying mechanism.
"

"The goal is not interpretability, but accurate information."

When algorithmic models are more appropriate (especially when the goal is prediction) a stoachastic model designed to make inferences about specific model co-efficients may provide "the right answer to the wrong question" as Emanuel Parzen puts it in his comments on Breiman.


I even find a hint of this in Greene, a well known econometrics textbook author:

 "It remains an interesting question for research whether fitting y well or obtaining good parameter estimates is a preferable estimation criterion. Evidently, they need not be the same thing."
  p. 686 Greene,  Econometric Analysis 5th ed

Keeping an Open Mind: Multiculturalism in Data Science

As  Breiman states:

"Approaching problems by looking for a data model imposes an apriori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems."

A multicultural approach to analysis (stochastic or algorithmic) seems to be the take away message of the Breiman article and discussions that follow. This is certainly true in the new field of data science, and is clearly depicted in Drew Conway's  data science Venn diagram depicted below. 




As Parzen states "I believe statistics has many cultures." He points out that many practitioners are well aware of the divides that exist between Bayesians and frequentists, algorithmic approaches aside. Even if we restrict our tool box to stochastic methods, we can often find our hands tied if we are not open minded or understand the social norms that distinguish theory from practice.  And there are plenty of divisive debates, like the use of linear probability models for one.


I have become more and more open minded as my experience working under both paradigms has increased. Software packages like SAS Enterprise Miner certainly accommodate open minded curiosity. Packages like (or even SAS IML) let the very curious get their hands even dirtier. When it comes to estimating the marginal effects of a treatment for a binary outcome, I usually have no issue with using an LPM over logistic regression. But when it comes to prediction I certainly won't shy away from using logistic regression or for that matter a neural net, decision tree, or a 'multicultural' ensemble of all of the above.

This article was updated and abridged on December 2, 2014. You can find the original longer discussion here.

4 comments:

  1. Great article! I think a major problem for econometricians is that most machine-learning techniques provide some strange implications for traditional theory-based economic parameters, especially elasticities. Do you know of any papers to have addressed this?

    ReplyDelete
  2. The permeation of machine learning techniques into science actually scares me because of this. Paraphrasing Terry Tao the point of academic research isn't to prove facts true but to understand why/how things are they way they are.

    Machine Learning has its place in the parts of science where the data is just too messy and/or uninteresting to study but the way its being used now is concerning.

    ReplyDelete
  3. What in the world makes you think that logistic regression is not used in Machine Learning?

    It is used ubiquitously.

    ReplyDelete
  4. Of course it is. If you are referring to Fomby's distinction, I wouldn't take it for more than an illustration. He could easily include Logit models in both camps. I'm guessing in his course he probably makes a distinction between using them in econometric vs. machine learning applications. And just think about logistic activation functions in neural networks. I sometimes prefer logistic regression over other algorithms I mention like decision trees in certain machine learning problems that have required continuous posterior probabilities although sometimes I can get them from boosted trees. In fact, I use logistic regression more often in a machine learning context than I do in causal inference when I require estimates of marginal effects of a given treatment. Typically, a linear probability model with robust standard error will get the job done with little practical difference in the results. I wouldn't take Fomby's distinctions too seriously or as literal. I'm thinking he is trying to appeal to econometricians or students already familiar with those methods prior to introducing a totally different paradigm. I think he is saying, you guys are already familiar with these methods that you have used in econometrics, but here are some algorithmic approaches used in machine learning that you may not be familiar with that we will introduce in this course. I don't think he is in the business of making hard and fast distinctions nor would I.

    ReplyDelete

Note: Only a member of this blog may post a comment.