In a previous post, "Big Data: Don't throw the baby out with the bathwater," I made the case that in many instances we aren't concerned with causality.
"If a 'big data' app tells me that someone is spending 14 hours each week
on the treadmill, that might be a useful predictor for their health
status. If all I care about is identifying people based on health status
I think hrs of physical activity would provide useful info. I might
care less if the relationship is causal as long as it is stable.... correlations or 'flags' from big data might not 'identify' causal
effects, but they are useful for prediction and might point us in
directions where we can more rigorously investigate causal relationships"
But sometimes we are interested in causal effects. If that is the case, the article that I referenced in the previous post makes a salient point:
"But a theory-free analysis of mere correlations is inevitably
fragile. If you have no idea what is behind a correlation, you have no
idea what might cause that correlation to break down."
"“Big data” has arrived, but big insights have not. The challenge now
is to solve new problems and gain new answers – without making the same
old statistical mistakes on a grander scale than ever."
I think that may be the case in many agronomic applications of big data. I've written previously about the convergence of big data, genomics, and agriculture. In those cases, when I think about applications like ACRES or Field Scripts, I have algorithmic approaches (finding patterns and correlations) in mind, not necessarily causation.
But Dan Frieberg points out some very important things to think about when it comes to using agronomic data in a Corn and Soybean Digest article, "Data Decisions: Meaningful data analysis involves agronomic common sense, local expertise."
He gives an example where data indicates better yields are associated with faster planting speeds, but something else is really going on:
"Sometimes, a data layer is actually a “surrogate” for another layer that
you may not have captured. Planting speed was a surrogate for the
condition of the planting bed. High soil pH as a surrogate for cyst
nematode. Correlation to slope could be a surrogate for an eroded area
within a soil type or the best part of the field because excess water
escaped in a wet year."
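To make the surrogate idea concrete, here is a small simulation sketch in Python. Everything in it is an assumption for illustration: the function names, the coefficients, and the noise levels are made up and are not taken from Dan's article or from any real field data. Yield is generated to depend only on seedbed condition, while operators are assumed to plant faster on better seedbeds. A naive regression of yield on planting speed then shows a strong positive slope, but once seedbed condition is accounted for, the slope on speed should fall to roughly zero.

```python
import random


def simulate_fields(n=2000, seed=42):
    """Generate hypothetical (seedbed, speed, yield) records. All numbers are illustrative."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        seedbed = rng.gauss(0, 1)                        # seedbed condition, often not captured
        speed = 5.0 + 0.5 * seedbed + rng.gauss(0, 0.3)  # assumed: faster planting on better beds
        yld = 180 + 10 * seedbed + rng.gauss(0, 5)       # yield depends on seedbed only, not speed
        rows.append((seedbed, speed, yld))
    return rows


def slope(x, y):
    """Ordinary least squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after a simple linear regression on x."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]


def compare_slopes(rows):
    """Return (naive slope of yield on speed, slope after adjusting for seedbed)."""
    seedbed = [r[0] for r in rows]
    speed = [r[1] for r in rows]
    yld = [r[2] for r in rows]
    naive = slope(speed, yld)
    # Partial out seedbed from both variables, then regress residual on residual.
    adjusted = slope(residuals(seedbed, speed), residuals(seedbed, yld))
    return naive, adjusted


if __name__ == "__main__":
    naive, adjusted = compare_slopes(simulate_fields())
    print(f"Naive slope of yield on speed:      {naive:.2f}")
    print(f"Slope after adjusting for seedbed:  {adjusted:.2f}")
```

Of course, the adjustment only works here because the simulation records seedbed condition. In a real data set that layer may never have been captured, which is exactly why local knowledge is needed to suspect it is there.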
"big data analytics is not the crystal ball that removes local context. Rather, the power of big data analytics is handing the crystal ball to advisors that have local context"
This is definitely a case where we might want to look more rigorously at relationships identified by data mining algorithms that may not capture this kind of local context. It may or may not apply to the seed selection algorithms coming to market these days, but as we think about all the data that can potentially be captured through the internet of things (seed choice, planting speed, depth, temperature, moisture, etc.), this could become especially important. This might call for a much more personal service, including data-savvy reps to help agronomists and growers get the most from these big data apps or from the data that new devices and software tools can collect and aggregate.
Data-savvy agronomists will need to know the assumptions behind any predictions or analysis, and the nature of the data captured by these devices and apps, to know whether surrogate factors like the ones Dan mentions have been appropriately considered. And agronomists, data-savvy or not, will be key in identifying these kinds of issues. Is there an app for that? I don't think there is an automated replacement for this kind of expertise, but as economist Tyler Cowen says, the ability to interface well with technology and use it to augment human expertise and judgment is the key to success in the new digital age of big data and automation.
Big Data…Big Deal? Maybe, if Used with Caution. http://andrewgelman.com/2014/04/27/big-data-big-deal-maybe-used-caution/
See also: Analytics vs. Causal Inference http://econometricsense.blogspot.com/2014/01/analytics-vs-causal-inference.html