Friday, May 2, 2014
Big Data: Don't Throw the Baby Out with the Bathwater
"Data and algorithms alone will not fulfill the promises of “big data.” Instead, it is creative humans who need to think very hard about a problem and the underlying mechanisms that drive those processes. It is this intersection of creative critical thinking coupled with data and algorithms that will ultimately fulfill the promise of “big data.”
I couldn't agree more. I think the above article is interesting because, on one hand, people can get carried away with 'big data,' but on the other hand they can throw the big data baby out with the bathwater. It's true: there is no law of large numbers that implies that as n approaches infinity, selection bias and unobserved heterogeneity go away. Correlations in large data sets still do not imply causation. But I don't think people who have seriously thought about the promises of 'big data' and predictive analytics believe that anyway. In fact, if we are trying to predict or forecast vs. make causal inferences, selection bias can be our friend. We can still get useful information from an algorithm.
If a 'big data' app tells me that someone is spending 14 hours each week on the treadmill, that might be a useful predictor of their health status. If all I care about is identifying people based on health status, hours of physical activity would provide useful information. I might care less whether the relationship is causal, as long as it is stable. Maybe there are lots of other factors correlated with time at the gym, like better food choices, stress management, or even income and geographic and genetic factors. But in a strictly predictive framework, this kind of 'healthier people are more likely to go to the gym anyway' selection bias actually improves my prediction without my having to collect all of that other data. The rooster crowing does not cause the sun to come up, but if I'm blindfolded and don't have an alarm clock, hearing the crow might serve as a decent indicator that dawn is approaching. As long as I can reliably identify healthy people, I may not care about the causal connection between hours at the gym and health status, or about any of the other variables that may actually be more important in determining health status. It may not be worth the cost of collecting that data if I get decent predictions without it.
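The point that selection bias can help rather than hurt prediction is easy to see in a toy simulation. In the sketch below (entirely made-up numbers, not real gym or health data), a hidden 'lifestyle' factor drives both gym hours and health, and gym hours have zero causal effect on health; gym hours are still a strong predictor:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hidden 'lifestyle' factor drives both gym attendance and health;
# gym hours themselves have NO causal effect on health in this simulation.
lifestyle = rng.normal(size=n)
gym_hours = 5 + 2 * lifestyle + rng.normal(size=n)             # selection: healthier types go more
health = 50 + 10 * lifestyle + rng.normal(scale=5, size=n)     # health depends only on lifestyle

# Yet gym hours are a useful *predictor* of health via the shared factor.
r = np.corrcoef(gym_hours, health)[0, 1]
print(f"corr(gym_hours, health) = {r:.2f}")  # roughly 0.8 despite no causal link
```

The correlation stays strong and stable precisely because of the selection mechanism, which is all a purely predictive application needs.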
Similarly, if I can get a SNP profile that correlates with some health or disease status, it may tell me very little about what is really going on from a molecular, biochemical, or 'causal' standpoint, but the test might still be very useful. In both of these cases, correlations or 'flags' from big data might not 'identify' causal effects, but they are useful for prediction and might point us in directions where we can more rigorously investigate causal relationships if interested, and 'big data,' or having access to more, richer, or novel data, never hurts. If causality is the goal, then merge 'big data' from the gym app with biometrics and the SNP profiles and employ some quasi-experimental methodology to investigate causality.
UPDATE: A very insightful and related article by Tim Harford:
"But a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down."
"'Big data' has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever."
This relates back to my earlier statements: "As long as I can reliably identify healthy people, I may not care about the causal connection between hours at the gym and health status, or about any of the other variables that may actually be more important in determining health status... In both of these cases, correlations or 'flags' from big data might not 'identify' causal effects, but they are useful for prediction and might point us in directions where we can more rigorously investigate causal relationships if interested."
In a strictly algorithmic and predictive modeling context, the best we can do is assess generalization error through some form of cross validation or the use of training, validation, and test data. And, always, we should monitor the performance of our models so we can recognize when some of these correlations begin to break down, and update our models to incorporate new information if possible.
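The workflow above can be sketched briefly with scikit-learn. This is a minimal illustration on synthetic stand-in data (the features and 'healthy' label are invented for the example), not a recipe for any particular application:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))                          # synthetic predictors
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)   # synthetic 'healthy' label

# Hold out a final test set; validate only on the training portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression()

# 5-fold cross validation estimates generalization error on the training data.
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")

# Fit on all training data, then score the untouched test set; tracking this
# score over time is one way to notice when the correlations start breaking down.
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

In production, the same test-set check would be repeated on fresh incoming data, with a drop in accuracy serving as the signal to retrain.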
If we are interested in causality, it is crucial that we address these issues with the appropriate methodology. But again, if all I need is an accurate, stable, cost-effective prediction, it may not be worth the effort to find an instrumental variable or seek out proper controls to identify causal effects.
See also: http://econometricsense.blogspot.com/2014/01/analytics-vs-causal-inference.html