Saturday, May 31, 2014

Big Data: Causality and Local Expertise Are Key in Agronomic Applications

In a previous post, Big Data: Don't Throw the Baby Out with the Bathwater, I made the case that in many instances we aren't concerned with issues related to causality.

"If a 'big data' app tells me that someone is spending 14 hours each week on the treadmill, that might be a useful predictor for their health status. If all I care about is identifying people based on health status, I think hours of physical activity would provide useful information. I might care less if the relationship is causal as long as it is stable.... Correlations or 'flags' from big data might not 'identify' causal effects, but they are useful for prediction and might point us in directions where we can more rigorously investigate causal relationships."

But sometimes we are interested in causal effects. If that is the case, the article that I reference in the previous post makes a salient point:

"But a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down."

"“Big data” has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever."

I think that may be the case in many agronomic applications of big data. I've written previously about the convergence of big data, genomics, and agriculture. In those cases, when I think about applications like ACRES or Field Scripts, I have algorithmic approaches (finding patterns and correlations) in mind, not necessarily causation.

But Dan Frieberg points out some very important things to think about when it comes to using agronomic data in a Corn and Soybean Digest article, "Data Decisions: Meaningful data analysis involves agronomic common sense, local expertise."

He gives an example where the data indicate that better yields are associated with faster planting speeds, but something else is really going on:

"Sometimes, a data layer is actually a “surrogate” for another layer that you may not have captured. Planting speed was a surrogate for the condition of the planting bed.  High soil pH as a surrogate for cyst nematode. Correlation to slope could be a surrogate for an eroded area within a soil type or the best part of the field because excess water escaped in a wet year."
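Frieberg's surrogate-variable point can be illustrated with a small simulation. The scenario below is entirely hypothetical (invented numbers, zero true causal effect of planting speed by construction): seedbed condition drives both planting speed and yield, so speed looks strongly predictive until we partial out the real driver.

```python
# Hypothetical simulation: planting speed as a "surrogate" for seedbed condition.
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Unobserved driver: seedbed condition (higher = better-prepared bed)
seedbed = rng.normal(0, 1, n)

# Growers plant faster on well-prepared beds; yield responds to the bed,
# not to the speed itself (speed has zero causal effect here by construction)
speed = 0.8 * seedbed + rng.normal(0, 0.6, n)
yield_bu = 180 + 10 * seedbed + rng.normal(0, 5, n)

# Naive correlation: speed looks strongly "associated" with yield
naive_r = np.corrcoef(speed, yield_bu)[0, 1]

def residualize(y, x):
    """Remove the linear component of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Partial out seedbed condition: the speed-yield relationship vanishes
partial_r = np.corrcoef(residualize(speed, seedbed),
                        residualize(yield_bu, seedbed))[0, 1]

print(f"naive correlation:   {naive_r:.2f}")   # large and positive
print(f"partial correlation: {partial_r:.2f}") # near zero
```

If the data layer for seedbed condition was never captured, the naive correlation is all an algorithm can see, which is exactly where local agronomic context comes in.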

He concludes:

"big data analytics is not the crystal ball that removes local context. Rather, the power of big data analytics is handing the crystal ball to advisors that have local context"

This is definitely a case where we might want to look more rigorously at relationships identified by data mining algorithms that may not capture this kind of local context. It may or may not apply to the seed selection algorithms coming to market these days, but as we think about all the data that can potentially be captured through the internet of things (seed choice, planting speed, depth, temperature, moisture, etc.), this could become especially important.

This might call for a much more personal service, including data-savvy reps to help agronomists and growers get the most from these big data apps, or from the data that new devices and software tools can collect and aggregate. Data-savvy agronomists will need to know the assumptions and nature of any predictions, analysis, or data captured by these devices and apps to know whether surrogate factors like those Dan mentions have been appropriately considered. And agronomists, data savvy or not, will be key in identifying these kinds of issues. Is there an app for that? I don't think there is an automated replacement for this kind of expertise, but as economist Tyler Cowen argues, the ability to interface well with technology and use it to augment human expertise and judgment is the key to success in the new digital age of big data and automation.

References:

Big Data…Big Deal? Maybe, if Used with Caution. http://andrewgelman.com/2014/04/27/big-data-big-deal-maybe-used-caution/

See also: Analytics vs. Causal Inference http://econometricsense.blogspot.com/2014/01/analytics-vs-causal-inference.html



Thursday, May 29, 2014

AllAnalytics - Michael Steinhart - Doctors: Time to Unleash Medical Big Data

Examples:  "Correlating grocery shopping patterns with incidence of obesity and diabetes
Measuring response rates to cholesterol-lowering drugs by correlating pharmacy refills with exercise data from wearable sensors
Correlating physical distance to hospitals and pharmacies with utilization of healthcare services
Analyzing the influence of social network connections on lifestyle choices and treatment compliance."

http://www.allanalytics.com/author.asp?section_id=3314&doc_id=273502&f_src=allanalytics_sitedefault 

Friday, May 2, 2014

Big Data: Don't Throw the Baby Out with the Bathwater


"Data and algorithms alone will not fulfill the promises of “big data.” Instead, it is creative humans who need to think very hard about a problem and the underlying mechanisms that drive those processes. It is this intersection of creative critical thinking coupled with data and algorithms that will ultimately fulfill the promise of “big data.”"

From:  http://andrewgelman.com/2014/04/27/big-data-big-deal-maybe-used-caution/


I couldn't agree more. I think the above article is interesting because, on one hand, people can get carried away with 'big data,' but on the other hand they can throw the big data baby out with the bathwater. It's true, there is no law of large numbers that implies that as n approaches infinity, selection bias and unobserved heterogeneity go away. Correlations in large data sets still do not imply causation. But I don't think people who have seriously thought about the promises of 'big data' and predictive analytics believe that anyway. In fact, if we are trying to predict or forecast vs. make causal inferences, selection bias can be our friend. We can still get useful information from an algorithm.

If a 'big data' app tells me that someone is spending 14 hours each week on the treadmill, that might be a useful predictor for their health status. If all I care about is identifying people based on health status, I think hours of physical activity would provide useful information. I might care less if the relationship is causal as long as it is stable. Maybe there are lots of other factors correlated with time at the gym, like better food choices, stress management, or even income and geographic and genetic factors. But in a strictly predictive framework, this kind of 'healthier people are more likely to go to the gym anyway' selection bias actually improves my prediction without my having to collect all of that other data. The rooster crowing does not cause the sun to come up, but if I'm blindfolded and don't have an alarm clock, hearing the crow might serve as a decent indicator that dawn is approaching. As long as I can reliably identify healthy people, I may not care about the causal connection between hours at the gym and health status, or any of the other variables that may actually be more important in determining health status. It may not be worth the cost of collecting that data if I get decent predictions without it.

Similarly, if I can get a SNP profile that correlates with some health or disease status, it may tell me very little about what is really going on from a molecular, biochemical, or 'causal' standpoint, but the test might be very useful. In both of these cases, correlations or 'flags' from big data might not 'identify' causal effects, but they are useful for prediction and might point us in directions where we can more rigorously investigate causal relationships if interested. And 'big data,' or having access to more or richer or novel data, never hurts. If causality is the goal, then merge 'big data' from the gym app with biometrics and the SNP profiles and employ some quasi-experimental methodology to investigate causality.
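The gym-hours argument can be made concrete with a toy simulation. Everything below is hypothetical and constructed so that gym time has zero causal effect on health; the only link is selection (healthier people choose to go more). The non-causal 'flag' still classifies people well.

```python
# Hypothetical sketch: a purely selection-driven correlate still predicts well.
import numpy as np

rng = np.random.default_rng(0)
n = 10000

# Latent health status driven by diet, stress, genetics (none observed)
latent_health = rng.normal(0, 1, n)
healthy = latent_health > 0

# Selection: healthier people spend more hours at the gym; by construction,
# gym time has zero causal effect on health in this simulation
gym_hours = 5 + 3 * latent_health + rng.normal(0, 2, n)

# Classify health status using the gym-hours flag alone
predicted_healthy = gym_hours > 5
accuracy = np.mean(predicted_healthy == healthy)
print(f"accuracy from the non-causal flag: {accuracy:.2f}")
```

The flag works precisely because of the selection bias, and it keeps working as long as the selection mechanism is stable, which is the whole caveat about monitoring for breakdown below.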

UPDATE: A very insightful and related article by Tim Harford:

http://timharford.com/2014/04/big-data-are-we-making-a-big-mistake/

"But a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down."

"“Big data” has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever."

This relates back to my earlier statements: "As long as I can reliably identify healthy people, I may not care about the causal connection between hours at the gym and health status, or any of the other variables that may actually be more important in determining health status... In both of these cases, correlations or 'flags' from big data might not 'identify' causal effects, but they are useful for prediction and might point us in directions where we can more rigorously investigate causal relationships if interested."

In a strictly algorithmic and predictive modeling context, the best we can do is assess generalization error through some form of cross-validation, or through the use of training, validation, and test data. And we should always monitor the performance of our models so we can recognize when some of these correlations begin to break down, and update our models to incorporate new information if possible.

It is crucial that if we are interested in causality, we ensure we are addressing these issues with the appropriate methodology. But again, if all I care about is an accurate, stable, cost-effective prediction, it may not be worth the effort to find an instrumental variable that identifies causal effects, or to seek out proper controls.
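To show what that extra causal machinery looks like, here is a hypothetical instrumental variable example on simulated data (the simplest just-identified IV estimator, the ratio of covariances; numbers are invented). An unobserved confounder biases the naive OLS slope, while the instrument recovers the true effect.

```python
# Minimal instrumental variable (IV) sketch on simulated data.
import numpy as np

rng = np.random.default_rng(1)
n = 20000

z = rng.normal(0, 1, n)  # instrument: shifts x but affects y only through x
u = rng.normal(0, 1, n)  # unobserved confounder of x and y
x = 1.0 * z + 1.0 * u + rng.normal(0, 1, n)
y = 2.0 * x + 2.0 * u + rng.normal(0, 1, n)  # true causal effect of x is 2

# Naive OLS slope is biased upward by the confounder u
ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# IV (Wald) estimator: cov(z, y) / cov(z, x) recovers the causal slope
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"OLS slope (biased):     {ols:.2f}")  # well above 2
print(f"IV slope (consistent):  {iv:.2f}")   # close to 2
```

Finding a credible real-world instrument is the hard part, which is the effort the paragraph above says a purely predictive application may not need to pay for.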

See also: http://econometricsense.blogspot.com/2014/01/analytics-vs-causal-inference.html