'Statistical Modeling: The Two Cultures' by L. Breiman (Statistical Science, 2001, Vol. 16, No. 3, 199–231) is an interesting paper and a must-read for anyone traditionally trained in statistics but new to machine learning. It gives perspective and context to anyone who may attempt to learn data mining software such as SAS Enterprise Miner, or who may take a course in machine learning (like Dr. Ng's (Stanford) YouTube lectures on machine learning). The algorithmic machine learning paradigm stands in sharp contrast to the traditional probabilistic approaches of 'data modeling' in which I had been groomed both as an undergraduate and in graduate school.
From the article, two cultures are defined:

"There are two cultures in the use of statistical modeling to reach conclusions from data."

Classical Statistics / Stochastic Data Modeling Paradigm: "assumes that the data are generated by a given stochastic data model."

Algorithmic or Machine Learning Paradigm: "uses algorithmic models and treats the data mechanism as unknown."
In a lecture for Eco 5385, Data Mining Techniques for Economists, Professor Tom Fomby of Southern Methodist University distinguishes machine learning from classical statistical techniques:
Classical Statistics: Focus is on hypothesis testing of causes and effects and interpretability of models. Model choice is based on parameter significance and in-sample goodness-of-fit.

Machine Learning: Focus is on predictive accuracy, even in the face of a lack of interpretability. Model choice is based on cross-validation of predictive accuracy using partitioned data sets (a sketch contrasting these two model-choice criteria follows the method lists below).
For some, this distinction may be made
more transparent by comparing the methods used under each approach. Professor
Fomby does a great job making these distinctions:
Methods

Classical Statistics: Regression, Logit/Probit, Duration Models, Principal Components, Discriminant Analysis, Bayes Rules

Artificial Intelligence/Machine Learning/Data Mining: Classification and Regression Trees, Neural Nets, K-Nearest Neighbors, Association Rules, Cluster Analysis
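To make that model-choice contrast concrete, here is a minimal sketch in R (my own illustration, not Fomby's or Breiman's), using simulated data with hypothetical variable names. It contrasts the in-sample R-squared a classical workflow might report with the kind of cross-validated error estimate the machine learning culture favors.

# Sketch: in-sample goodness-of-fit vs. cross-validated predictive accuracy
# (simulated data; all variable names are hypothetical)
set.seed(42)
n     <- 500
x1    <- rnorm(n)
x2    <- rnorm(n)
noise <- matrix(rnorm(n * 20), n, 20)   # 20 irrelevant predictors
y     <- 1 + 2 * x1 - x2 + rnorm(n)
dat   <- data.frame(y, x1, x2, noise)

# 'Data modeling' habit: judge the model by in-sample fit
fit <- lm(y ~ ., data = dat)
summary(fit)$r.squared                  # flattered by the noise columns

# 'Algorithmic' habit: judge the model by out-of-sample error (5-fold CV)
k     <- 5
folds <- sample(rep(1:k, length.out = n))
cv_rmse <- sapply(1:k, function(i) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]
  m     <- lm(y ~ ., data = train)
  sqrt(mean((test$y - predict(m, newdata = test))^2))
})
mean(cv_rmse)                           # honest estimate of predictive accuracy

The point of the sketch is only the change in criterion: the in-sample R-squared can only improve as irrelevant predictors are added, while the cross-validated error will tend to worsen with overfitting.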
From the standpoint of econometrics,
the data modeling culture is described very well in this post by Tim Harford:
"academic
econometrics is rarely used for forecasting. Instead, econometricians set
themselves the task of figuring out past relationships. Have charter schools
improved educational standards? Did abortion liberalisation reduce crime? What
has been the impact of immigration on wages?"
This is certainly consistent with the comparisons presented in the Statistical Science article. Note, however, that the methodologies referenced in the article (like logistic regression) that are utilized under the data modeling or classical statistics paradigm are a means to fill what Breiman refers to as a black box. Under this paradigm, analysts attempt to characterize an outcome by estimating parameters and making inferences about them based on some assumed data generating process. This is not to say that these methods are never used under the machine learning paradigm; the difference lies in how they are used. The article provides a very balanced 'ping-pong' discussion citing various experts from both cultures, including some who seem to promote both, such as the authors of The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
In my first econometrics course, the textbook cautioned against 'data mining,' described as using techniques such as stepwise regression. It insisted on letting theory drive model development and on rating the model by total variance explained and the significance of individual coefficients. This advice was certainly influenced by the 'data modeling' culture. The text was published in the same year as the Breiman article. (I understand this caution has been moderated in contemporary editions.)
Of course, as the article mentions, if what you are interested in is theory and the role of particular variables in underlying processes, then traditional inference seems to be the appropriate direction to take. (Breiman still takes issue, arguing that we can't trust the significance of an estimated coefficient if the model overall is a poor predictor.)
"Higher
predictive accuracy is associated with more reliable information about the
underlying data mechanism. Weak predictive accuracy can lead to questionable
conclusions."
"Algorithmic models can give better predictive accuracy than data models,and provide better information about the underlying mechanism."
"Algorithmic models can give better predictive accuracy than data models,and provide better information about the underlying mechanism."
"The
goal is not interpretability, but accurate information."
When algorithmic models are more appropriate (especially when the goal is prediction), a stochastic model designed to make inferences about specific model coefficients may provide "the right answer to the wrong question," as Emanuel Parzen puts it in his comments on Breiman.
While I was in graduate school, the stochastic culture was dominant. However, there was some inclusiveness; take, for example, Kennedy's A Guide to Econometrics. Kennedy discusses the expectation maximization algorithm (also a lecture in Dr. Ng's course) and provides a great introduction to neural networks, which greatly influenced my understanding of such allegedly 'uninterpretable' algorithmic techniques.
Keeping an Open Mind: Multiculturalism in Data Science
As Breiman states:
"Approaching
problems by looking for a data model imposes an apriori straight jacket that
restricts the ability of statisticians to deal with a wide range of statistical
problems."
A multicultural approach to analysis (stochastic or algorithmic) seems to be the takeaway message of the Breiman article and the discussions that follow. In fact, this may be the best approach in the universe of all things empirical. This is certainly true in the new field of data science, and is clearly captured in Drew Conway's data science Venn diagram depicted below.

As Parzen states, "I believe statistics has many cultures." He points out that many practitioners are well aware of the divides that exist between Bayesians and frequentists, algorithmic approaches aside. Even if we restrict our toolbox to stochastic methods, we can often find our hands tied if we are not open-minded. We can still sacrifice results for the sake of technique. And there are plenty of divisive debates. For example, in discussing the state of modern econometrics, Leamer faults econometric results that emphasize statistical significance over practicality or importance, in an interesting podcast from EconTalk here.
Breiman touches on his idea of importance as well:

"My definition of importance is based on prediction. A variable might be considered important if deleting it seriously affects prediction accuracy."

Note: no mention of statistical significance (a child of the stochastic data modeling culture) by Breiman.
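As a rough illustration of that prediction-based notion of importance, here is a sketch in R on simulated data (my own minimal drop-one-variable version, with hypothetical names, not a procedure taken from the paper). Each variable is scored by how much held-out prediction error worsens when it is deleted from the model.

# Sketch: prediction-based variable importance via drop-one-variable
# (simulated data; variable names are hypothetical)
set.seed(1)
n  <- 1000
x1 <- rnorm(n)                     # genuinely predictive
x2 <- rnorm(n)                     # genuinely predictive
x3 <- rnorm(n)                     # pure noise
y  <- 3 * x1 + 1.5 * x2 + rnorm(n)
dat   <- data.frame(y, x1, x2, x3)
train <- dat[1:700, ]
test  <- dat[701:1000, ]

rmse <- function(model) sqrt(mean((test$y - predict(model, newdata = test))^2))

full     <- lm(y ~ x1 + x2 + x3, data = train)
baseline <- rmse(full)

# Importance = increase in held-out RMSE when a variable is deleted
sapply(c("x1", "x2", "x3"), function(v) {
  reduced <- lm(reformulate(setdiff(c("x1", "x2", "x3"), v), response = "y"),
                data = train)
  rmse(reduced) - baseline
})

In this setup, dropping x1 or x2 degrades held-out accuracy substantially, while dropping the noise variable x3 barely moves it, which is exactly the sense of 'importance' Breiman describes.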
This de-emphasis of statistical significance is echoed in a paper from the Joint Statistical Meetings, Section on Statistical Education (2009), 'The Cult of Statistical Significance,' by Stephen T. Ziliak and Deirdre N. McCloskey, who have authored a book by the same title. They note:
"Statistical
significance is, we argue, a diversion from the proper objects of scientific
study…Significant does not mean important and insignificant does not mean
unimportant"
I even find a hint of this in Greene, a well-known econometrics textbook author:
"It remains an interesting question for research
whether fitting y well or obtaining good parameter estimates is a preferable
estimation criterion. Evidently, they need not be the same thing."
(Greene, Econometric Analysis, 5th ed., p. 686)
It seems that cultural divides are hard
to avoid. Take another example from the article below:
Statistical alternatives for studying college student retention: A comparative analysis of logit, probit, and linear regression. Eric L. Dey and Alexander W. Astin. Research in Higher Education, Vol. 34, No. 5, 569–581 (1993).

The above article concluded:

"Results indicate that despite the theoretical advantages offered by logistic regression and probit analysis, there is little practical difference between either of these two techniques and more traditional linear regression."
It never fails: I go to a paper presentation and see someone getting uptight over linear probability models. In fact, there is some interesting literature on the robustness and pragmatism of linear probability models.
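As a small illustration of that pragmatism (my own sketch in R on simulated data, not the Dey and Astin analysis itself; all names are hypothetical), the LPM coefficient on a treatment dummy can be compared with the average marginal effect implied by a logit, and in practice the two are often very close.

# Sketch: LPM vs. logistic regression for a binary outcome
# (simulated data; variable names are hypothetical)
set.seed(7)
n     <- 5000
treat <- rbinom(n, 1, 0.5)                 # binary 'treatment'
x     <- rnorm(n)
p     <- plogis(-0.5 + 0.6 * treat + 0.4 * x)
y     <- rbinom(n, 1, p)

# Linear probability model: the coefficient on treat is its marginal effect
lpm <- lm(y ~ treat + x)
coef(lpm)["treat"]

# Logit: average marginal effect of treat, computed by hand
logit <- glm(y ~ treat + x, family = binomial)
p1 <- predict(logit, newdata = data.frame(treat = 1, x = x), type = "response")
p0 <- predict(logit, newdata = data.frame(treat = 0, x = x), type = "response")
mean(p1 - p0)                              # typically close to the LPM estimate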
An emphasis on robustness or even parsimony seems to be an overarching theme in the fairly new book "Mostly Harmless Econometrics" by Joshua Angrist and Jörn-Steffen Pischke. Angrist and Pischke write about the robustness of the LPM in their book (but also see econometrician Dave Giles' post with some challenging objections to the LPM, a follow-up by Angrist and Pischke, and more discussion from economist Marc Bellemare). Their website for the book seems to caution practitioners against being married to fancier techniques:

"fancier techniques are typically unnecessary and even dangerous."
I have become more and more open-minded as my experience working under both paradigms has increased. Software packages like SAS Enterprise Miner certainly accommodate open-minded curiosity. Packages like R (or even SAS IML) let the very curious get their hands even dirtier. Both have been very influential to me. When it comes to estimating the marginal effects of a treatment for a binary outcome, I usually have no issue with using an LPM over logistic regression (though I would have taken issue with it not long ago). But when it comes to prediction, I certainly won't shy away from using logistic regression or, for that matter, a neural net, a decision tree, or a 'multicultural' ensemble of all of the above.
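To close, here is a minimal sketch of what I mean by a 'multicultural' ensemble, again in R on simulated data (hypothetical names; it assumes the rpart package, which ships with standard R installations): a logit from the data modeling culture and a classification tree from the algorithmic culture, with their predicted probabilities simply averaged.

# Sketch: a 'multicultural' ensemble of a logit and a classification tree
# (simulated data; assumes the rpart package)
library(rpart)
set.seed(99)
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(x1 - x2 + 0.5 * x1 * x2))
dat   <- data.frame(y = factor(y), x1, x2)
train <- dat[1:1500, ]
test  <- dat[1501:2000, ]

logit <- glm(y ~ x1 + x2, data = train, family = binomial)
tree  <- rpart(y ~ x1 + x2, data = train, method = "class")

p_logit <- predict(logit, newdata = test, type = "response")
p_tree  <- predict(tree,  newdata = test, type = "prob")[, "1"]
p_ens   <- (p_logit + p_tree) / 2          # simple average of the two cultures

# Out-of-sample accuracy of each approach
truth <- test$y == "1"
c(logit    = mean((p_logit > 0.5) == truth),
  tree     = mean((p_tree  > 0.5) == truth),
  ensemble = mean((p_ens   > 0.5) == truth))

Whether the ensemble wins on any particular data set is an empirical question, which is rather the point.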