## Tuesday, January 14, 2014

### Analytics vs. Causal Inference

When I think of analytics, I primarily think about what Leo Breiman referred to as an 'algorithmic' approach to analysis. Breiman states: "There are two cultures in the use of statistical modeling to reach conclusions from data." One "assumes that the data are generated by a given stochastic data model"; the other "uses algorithmic models and treats the data mechanism as unknown."

To take an example from higher education, let's look at a hypothetical I have proposed before. Assume we want to evaluate the causal effects of a summer camp designed to prepare high school graduates for their first year of college. If we are interested in making inferences about the causal impact of the camp on retention (this fits under the stochastic data modeling culture), we have to recognize that the impact of the camp program itself (the causal effect of interest) is confounded with academic potential and motivation. Students who attend the camp may also be very motivated and academically strong, and likely to have high retention rates regardless of their attendance. We proceed with our analysis using some form of experimental or quasi-experimental design to attempt to statistically estimate or 'identify' the causal effects related to camp. We are concerned with things like standard errors, confidence intervals, statistical significance, etc. We may use the results to evaluate the effectiveness of the camp program and improve resource allocation.
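To make the confounding problem concrete, here is a minimal simulation sketch. Everything in it is hypothetical: the retention rates, the attendance probabilities, the `true_effect` of 0.05, and the assumption that motivation is a single observed yes/no variable. In practice motivation is rarely measured this cleanly, which is exactly why experimental or quasi-experimental designs are needed. The sketch only illustrates how a naive comparison of attendees and non-attendees can overstate the camp's effect, while comparing within levels of the confounder gets closer to it.

```python
import random

def simulate(n=20000, true_effect=0.05, seed=0):
    """Simulate students as (motivated, camp, retained) tuples.

    All probabilities here are made-up illustrative values.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        motivated = rng.random() < 0.5
        # Motivated students are more likely to attend camp (the confounding path).
        camp = rng.random() < (0.7 if motivated else 0.2)
        # Retention depends on motivation AND on camp attendance.
        p_retain = 0.60 + (0.25 if motivated else 0.0) + (true_effect if camp else 0.0)
        retained = rng.random() < p_retain
        rows.append((motivated, camp, retained))
    return rows

def retention_rate(rows):
    return sum(retained for _, _, retained in rows) / len(rows)

def naive_difference(rows):
    """Retention rate of attendees minus retention rate of non-attendees."""
    attended = [r for r in rows if r[1]]
    skipped = [r for r in rows if not r[1]]
    return retention_rate(attended) - retention_rate(skipped)

def stratified_difference(rows):
    """Average the within-stratum differences, weighted by stratum size."""
    estimate = 0.0
    for level in (True, False):
        stratum = [r for r in rows if r[0] == level]
        estimate += (len(stratum) / len(rows)) * naive_difference(stratum)
    return estimate

rows = simulate()
print("naive difference:     ", round(naive_difference(rows), 3))
print("stratified difference:", round(stratified_difference(rows), 3))
```

With these settings the naive difference comes out well above the simulated effect of 0.05, because attendees are disproportionately motivated, while the stratified estimate lands near it.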

On the other hand, we may simply be interested in building a model that gives a measure of the probability of retention for first year students. We may want to take those results and segment the student population into strata based on their risk profile, and tailor programs to help with their academic success and improve resource allocation. The variable indicating 'camp' attendance, among others, might be a good predictor and aid in producing the probability estimates and well calibrated stratifications. We might accomplish this with logistic regression, decision trees, neural networks, random forests, gradient boosting, or some other machine learning algorithm. We are concerned with things like predictive accuracy, discrimination, sensitivity, specificity, true positives, false positives, ranking, and calibration. We are not so concerned with p-values, statistical significance, standard errors, etc.
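As a rough sketch of what that evaluation looks like, the code below takes predicted retention probabilities from some already-fitted model and computes sensitivity, specificity, accuracy, and a simple calibration check by risk stratum. The ten probabilities and outcomes are invented for illustration, and the 0.5 classification threshold and the 0.4/0.7 stratum cutoffs are arbitrary assumptions, not recommendations.

```python
def confusion_metrics(probs, outcomes, threshold=0.5):
    """Classify as 'retained' when prob >= threshold, then score the result."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, outcomes):
        predicted = p >= threshold
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif not predicted and y:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),   # share of retained students correctly flagged
        "specificity": tn / (tn + fp),   # share of non-retained students correctly flagged
        "accuracy": (tp + tn) / len(outcomes),
    }

def calibration_by_stratum(probs, outcomes, cutoffs=(0.4, 0.7)):
    """Compare mean predicted vs. observed retention within each risk stratum."""
    low_cut, high_cut = cutoffs
    strata = {"high risk": [], "medium risk": [], "low risk": []}
    for p, y in zip(probs, outcomes):
        if p < low_cut:
            strata["high risk"].append((p, y))
        elif p < high_cut:
            strata["medium risk"].append((p, y))
        else:
            strata["low risk"].append((p, y))
    report = {}
    for name, members in strata.items():
        if not members:
            continue  # skip empty strata rather than divide by zero
        report[name] = {
            "n": len(members),
            "mean_predicted": sum(p for p, _ in members) / len(members),
            "observed_rate": sum(y for _, y in members) / len(members),
        }
    return report

# Hypothetical model output: predicted probability of retention, and
# the actual outcome (1 = retained, 0 = not retained).
probs = [0.95, 0.90, 0.85, 0.80, 0.65, 0.60, 0.55, 0.35, 0.30, 0.20]
outcomes = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]

print(confusion_metrics(probs, outcomes))
for name, stats in calibration_by_stratum(probs, outcomes).items():
    print(name, stats)
```

A model is reasonably well calibrated when the mean predicted probability and the observed retention rate are close within each stratum. Note that nothing here involves a p-value or a standard error.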

Both approaches are data driven, and one is not more meritorious than the other. People often adopt one culture or paradigm and impugn the other. As Breiman states:

"Approaching problems by looking for a data model imposes an apriori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems."

In a past Campus Technology article, "The Predictive Analytics Reporting Framework Moves Forward," the following comments depict the schism that sometimes occurs between cultures:

"For some in education what we’re doing might seem a bit heretical at first--we’ve all been warned in research methods classes that data snooping is bad! But the newer technologies and the sophisticated analyses that those technologies have enabled have helped us to move away from looking askance at pattern recognition. That may take a while in some research circles, but in decision-making circles it’s clear that pattern recognition techniques are making a real difference in terms of the ways numerous enterprises in many industries are approaching their work"

In fact, both paradigms can often be complementary in application. We might first develop an algorithm as indicated in the second approach and, based on those results, design a program or intervention targeted at students with a certain risk profile. We might then step into the world of inferential statistics and attempt to evaluate the effectiveness of this program.

In summary, both are going to utilize some sort of data driven model development, and as the popular paraphrase of statistician George E.P. Box goes, all models are wrong, but some are useful. Regardless of the paradigm you are working in, the key is to produce something useful for solving the problem at hand.

References:

The Predictive Analytics Reporting Framework Moves Forward
A Q & A with WCET Executive Director Ellen Wagner on the PAR Framework.  Mary Grush 01/18/12 Campus Technology. http://campustechnology.com/articles/2012/01/18/the-predictive-analytics-reporting-framework-moves-forward.aspx

'Statistical Modeling: The Two Cultures' by L. Breiman (Statistical Science 2001, Vol. 16, No. 3, 199–231)