## Thursday, March 26, 2015

### To Explain or Predict

It is easy to overlook the important distinctions between prediction and inference when it comes to modeling approaches, methodologies, data handling, and assumptions. I recently ran across a blog post by Rob J. Hyndman that pointed to the following article in the journal Statistical Science:

Galit Shmueli, "To Explain or to Predict?", Statist. Sci., Volume 25, Number 3 (2010), 289-310.

"Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and description. In many disciplines there is near-exclusive use of statistical modeling for causal explanation and the assumption that models with high explanatory power are inherently of high predictive power. Conflation between explanation and prediction is common, yet the distinction must be understood for progressing scientific knowledge. While this distinction has been recognized in the philosophy of science, the statistical literature lacks a thorough discussion of the many differences that arise in the process of modeling for an explanatory versus a predictive goal. The purpose of this article is to clarify the distinction between explanatory and predictive modeling, to discuss its sources, and to reveal the practical implications of the distinction to each step in the modeling process."

This is a nice article, which I think complements Leo Breiman's paper on the two cultures of statistical modeling, discussed here before. Rob gives a nice synopsis of some of the main points from the paper:

1. The AIC is better suited to model selection for prediction as it is asymptotically equivalent to leave-one-out cross-validation in regression, or one-step-cross-validation in time series. On the other hand, it might be argued that the BIC is better suited to model selection for explanation, as it is consistent.
2. P-values are associated with explanation, not prediction. It makes little sense to use p-values to determine the variables in a model that is being used for prediction. (There are problems in using p-values for variable selection in any context, but that is a different issue.)
3. Multicollinearity has a very different impact if your goal is prediction from when your goal is estimation. When predicting, multicollinearity is not really a problem provided the values of your predictors lie within the hyper-region of the predictors used when estimating the model.
4. An ARIMA model has no explanatory use, but is great at short-term prediction.
5. How to handle missing values in regression is different in a predictive context compared to an explanatory context. For example, when building an explanatory model, we could just use all the data for which we have complete observations (assuming there is no systematic nature to the missingness). But when predicting, you need to be able to predict using whatever data you have. So you might have to build several models, with different numbers of predictors, to allow for different variables being missing.
6. Many statistics and econometrics textbooks fail to observe these distinctions. In fact, a lot of statisticians and econometricians are trained only in the explanation paradigm, with prediction an afterthought. That is unfortunate as most applied work these days requires predictive modelling, rather than explanatory modelling.
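
Point 5 is perhaps the easiest one to make concrete. What follows is only a minimal sketch of the "several models" idea, not code from Rob's post or Galit's paper: it fits one least-squares model per subset of predictors on complete cases, then at prediction time picks the model that matches whichever predictors are actually observed. The function names (`solve`, `fit_ols`, `fit_all_subsets`, `predict`), the predictor names `x1` and `x2`, and the noiseless toy data are all my own assumptions for illustration, and it uses only the Python standard library.

```python
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y, names):
    """Least-squares fit of y on an intercept plus the named predictors."""
    X = [[1.0] + [row[k] for k in names] for row in rows]
    p = len(names) + 1
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)  # normal equations: (X'X) beta = X'y


def fit_all_subsets(rows, y, names):
    """Fit one model per subset of predictors (including intercept-only)."""
    models = {}
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            models[frozenset(subset)] = (subset, fit_ols(rows, y, subset))
    return models


def predict(models, record):
    """Predict with the model whose predictors match the observed values."""
    available = frozenset(k for k, v in record.items() if v is not None)
    subset, beta = models[available]
    return beta[0] + sum(b * record[k] for b, k in zip(beta[1:], subset))


# Hypothetical toy data: y = 1 + 2*x1 + 3*x2 exactly, no noise.
x1 = [0, 1, 2, 3, 4, 5]
x2 = [1, 0, 2, 1, 3, 2]
rows = [{"x1": a, "x2": b} for a, b in zip(x1, x2)]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

models = fit_all_subsets(rows, y, ("x1", "x2"))
full = predict(models, {"x1": 2, "x2": 2})        # both predictors observed
partial = predict(models, {"x1": 2, "x2": None})  # x2 missing: x1-only model
print(full, partial)
```

The number of models doubles with every predictor, so this brute-force version only makes sense for a handful of variables. An explanatory analysis would more likely stop at the complete-case fit, which is the contrast Rob is drawing.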

Rob also links to the web page of Galit Shmueli (the author of the article above), who has apparently done extensive research related to these distinctions. There are lots of additional resources there, including a blog. Galit states:

"My thesis is that statistical modeling, from the early stages of study design and data collection to data usage and reporting, takes a different path and leads to different results, depending on whether the goal is predictive or explanatory."

I have touched on these distinctions before, but did not realize the extent of the work Galit has done in this area.

- Analytics vs Causal Inference
- Big Data: Don't throw the baby out with the bath water

See also: Paul Allison on multicollinearity