Because I don't always have time to fully develop a post on everything I come across, here are a few shorties:
Paul Allison has done some great posts recently related to logistic regression and model assessment.
With regard to the pseudo R^2, see this post as well as the article associated with a new proposed alternative:
Tjur, T. (2009) “Coefficients of determination in logistic regression
models—A new proposal: The coefficient of discrimination.” The American Statistician 63: 366-372.
(I've written about the pseudo R-square before here.)
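For readers who want to see what Tjur's proposal amounts to, here is a minimal sketch. The coefficient of discrimination is the mean fitted probability among the events minus the mean fitted probability among the non-events. The function name, the outcomes, and the fitted probabilities below are all made up for illustration; in practice the probabilities would come from a fitted logistic regression.

```python
from statistics import mean


def tjur_r2(y, p_hat):
    """Tjur's coefficient of discrimination.

    y      -- observed 0/1 outcomes
    p_hat  -- fitted probabilities from some logistic model (assumed given)
    """
    events = [p for yi, p in zip(y, p_hat) if yi == 1]
    non_events = [p for yi, p in zip(y, p_hat) if yi == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    # Difference in average predicted probability between the two groups.
    return mean(events) - mean(non_events)


# Hypothetical data, for illustration only.
y = [1, 1, 1, 0, 0, 0]
p_hat = [0.9, 0.7, 0.6, 0.4, 0.2, 0.1]
d = tjur_r2(y, p_hat)
print(round(d, 3))
```

A model that separates the two groups perfectly would give a value of 1, and a model whose fitted probabilities are the same on average in both groups would give 0.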
His most recent post discusses the Hosmer-Lemeshow test. He's critical of the test because it is sensitive to strata size. I am too, and I've also seen many criticisms related to its sensitivity to large sample sizes. I'd like to expand on this in the future, perhaps in a separate post, but for now I'm just looking forward to his next article, in which Paul is going to discuss some recent advancements and alternatives to the HL test.
An attempt to make sense of econometrics, biostatistics, machine learning, experimental design, bioinformatics, ....
Friday, March 29, 2013
Thursday, March 21, 2013
Big Ag Meets Big Data Part II
Previously I discussed the role of social media in producing ‘big data’ and tools that may be used to get the most from this data in the ag industry. In this second installment I’m going to discuss other sources of ‘big data.’
I recall attending a UK College of Agriculture field day in Princeton, KY about 10 years ago, where someone made a comment that went something like this:
“these events are good because on the farm we don’t have time to set up experiments, collect data, and analyze to figure out best practices. We can’t stop and measure and record and report about everything we do.”
It’s certainly true that extension services will continue to conduct valuable research, and producers probably still won’t have the time and resources to reduce their operation to a collection of well-crafted scientific experiments. However, every decision made on the farm is a trial of sorts, and with modern technology it is much easier to collect and log data about your operation. Some companies are now figuring out ways to take this farm level data and turn it into powerful analytical tools that can boost productivity and efficiency. In a recent article, ‘Building Big Data: Farming Big Data Goes To The Cows’, the following statement is made:
"The
major problem we keep on seeing — especially in bigger, modern farms — is that
there's a lot of data being created and not being used, on how they're
performing, what they're doing."
How is this data being generated? Lots of it is generated via your equipment, including GPS:
“Next generation farm equipment
like combines and tillers are going to be able to take soil samples as they
move along, perform analysis on those samples, and feed the results of the
analysis back to the manufacturer for crunching on a macro scale. This will result
in a better understanding of what is happening in that entire area and make it
possible to adjust things like the amount or types of fertilizer and chemicals
that should be applied. If the farm equipment manufacturers figure out how to
harness all this information, this kind of big-picture analysis could
change the commodity trading markets forever." – from 4 Examples of Big Data Trends. September 27, 2012. VMware Blogs.
And how might we use this data? Well, some seed companies are already combining farm level data, public data, and their own proprietary data to develop some pretty powerful analytical tools. As discussed recently in an AgWeb technology article, Steyer Seeds offers a great example with its ACRES tool, which is based on a complex form of decision tree.
Another company, the Climate Corporation, is also taking advantage of massive amounts of data useful in agricultural applications:
"We took 60 years
of crop yield data, and 14 terabytes of information on soil types, every two
square miles for the United States, from the Department of Agriculture,"
says David Friedberg, chief executive of the Climate Corporation, …We match
that with the weather information for one million points the government scans with
Doppler radar — this huge national infrastructure for storm warnings — and make
predictions for the effect on corn, soybeans and winter wheat." –New
York Times
We’ve seen lots of efficiency, environmental, and productivity gains in agriculture related to GPS/GIS and biotechnology. But with every trip across the field more and more data is being generated. Combining these technologies with ‘big data’ will definitely have its benefits, and may well continue to revolutionize the industry.
References and Further Reading:
Climate Corp. Updates Crop Insurance via High Tech. BloombergBusinessWeek. By Ashlee Vance on March 22, 2012. http://www.businessweek.com/articles/2012-03-22/climate-corp-dot-updates-crop-insurance-via-high-tech
Big Data Goes to the Cows
Big Data in the Dirt (and the Cloud). October 11, 2011. NYT. Quentin Hardy.
4 Examples of Big Data Trends. September 27, 2012. VMware Blogs.
Data analysis, biotech are key in agriculture's future sustainability. By Sarah Gonzalez. © Copyright Agri-Pulse Communications, Inc.
Unlock Your Farm Data. February 15, 2013. By Ben Potter, Farm Journal Technology Editor.
Wednesday, March 6, 2013
Decision Trees and Gradient Boosting
Decision Trees
Decision tree algorithms search through the input space and find values of the input variables (split values) that maximize the differences in the target value between groups created by the split. The final model is characterized by the split values for each explanatory variable and creates a set of rules for classifying new cases.
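To make the idea of searching for a split value concrete, here is a minimal sketch for a single input variable. It assumes one common splitting criterion (minimizing the summed squared error of the target around each group's mean); real tree software offers other criteria, and the function name and toy data are hypothetical.

```python
def best_split(x, y):
    """Return (split_value, sse) for the threshold on x that best separates y.

    Criterion (an assumption for this sketch): the sum of squared errors of y
    around the mean of each of the two groups created by the split.
    """
    def sse(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values)

    best_value, best_sse = None, float("inf")
    distinct = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct inputs,
    # so both groups are always non-empty.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = sse(left) + sse(right)
        if total < best_sse:
            best_value, best_sse = threshold, total
    return best_value, best_sse


# Toy data: the target switches from 0 to 1 between x = 3 and x = 10.
x = [1, 2, 3, 10, 11, 12]
y = [0, 0, 0, 1, 1, 1]
split, err = best_split(x, y)
print(split, err)
```

The resulting rule ("if x is at or below the split value, predict the left group's mean, otherwise the right group's") is the kind of rule a full tree builds repeatedly, one split at a time, across all the inputs.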
Gradient Boosting
Boosting algorithms are ensemble methods that make predictions by combining the results of a series of weak learners. Gradient boosting involves fitting a series of trees, with each successive tree fit to the errors left by the trees before it: in Friedman's (2001) formulation the target for each new tree is the residual (more generally, the negative gradient of the loss) from the current ensemble, and some implementations also resample the training data at each step. The combined series of trees forms a single predictive model. This differs from other ensemble methods using trees, such as random forests. Random forests are a modified type of bootstrap aggregation, or bagging, estimator (Hastie et al., 2009). With random forests, we get a predictor that is an average of a series of trees, each grown on a bootstrap sample of the training data with only a random subset of the available inputs used to fit each tree (De Ville, 2006). Gradient boosting can perform similarly to random forests, and boosting may tend to dominate bagging methods in many applications (Hastie et al., 2009).
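Here is a rough sketch of gradient boosting for a numeric target with squared error loss, using one-split trees (stumps) as the weak learners. It follows the residual-fitting formulation in Friedman (2001) and leaves out resampling entirely, so it is an illustration under those assumptions rather than a picture of any particular software. The function names, the learning rate, and the toy data are hypothetical, and the code assumes the input has at least two distinct values.

```python
def fit_stump(x, targets):
    """Fit a one-split regression tree to targets; return a predict function."""
    best = None
    distinct = sorted(set(x))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [t for xi, t in zip(x, targets) if xi <= threshold]
        right = [t for xi, t in zip(x, targets) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or err < best[0]:
            best = (err, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_trees=50, learning_rate=0.1):
    """Build an additive model of stumps, each fit to the current residuals."""
    base = sum(y) / len(y)          # start from the overall mean
    pred = [base] * len(y)
    trees = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is just the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + learning_rate * tree(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + learning_rate * sum(t(xi) for t in trees)


def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


# Toy data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(x, y)
baseline = lambda xi: sum(y) / len(y)
print(mse(baseline, x, y), mse(model, x, y))
```

Each stump on its own is a poor predictor, but because every new stump is aimed at what the current ensemble still gets wrong, the training error of the combined model shrinks as trees are added.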
References:
Friedman, Jerome H. (2001), Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189-1232. Available at http://stat.stanford.
Hastie, Tibshirani and Friedman. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. Springer-Verlag.
DeVille, Barry. (2006). Decision Trees for Business Intelligence and Data Mining Using SAS® Enterprise Miner. SAS® Institute.
Monday, March 4, 2013
Big Ag Meets Big Data: Part 1
Social media has allowed farmers to organize and communicate about their industry. The #agchat conversations on Twitter are a good example, not to mention Facebook (see Agriculture Proud for example) and YouTube (like this look behind the scenes of a family farm). We've seen powerful examples of how social media can be used to mobilize voices and impact perceptions on a national level (for example, issues related to Yellow Tail wine and Pilot Travel Centers).
Social media also provides a rich data source for measuring sentiment or perceptions about the industry. Take for instance text mining. With Twitter, Facebook, email, online forums, open response surveys, and customer and reader comments on web pages and news articles, there is a lot of information available to companies and organizations in the form of text. Instead of hiring experts to read through thousands of pages of text and make subjective claims about its meaning, text mining allows us to take otherwise unusable 'qualitative' data and convert it into quantitative measures that we can use for various types of reporting and modeling. Companies are finding that by mining text from web pages, comments, blogs, and social media, they can measure consumer perceptions almost as well as or better than they can through explicit surveys or other directly measurable outcomes in their databases.
In my own personal experience, I've benchmarked predictions made from traditional database variables against predictions from text mining and found remarkably comparable performance. The validity of these tools is not necessarily based on their ability to make new breakthrough discoveries. On the contrary, it rests on how these algorithms give us almost exactly what we would expect if we had time to manually process all of the information social media provides. (For a basic example of mining tweets related to 'factory farms' see: http://ageconomist.blogspot.com/2011/04/mining-tweets-abou-factory-farms.html ).
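As a very small illustration of turning text into numbers, the sketch below counts terms in each post and scores sentiment against a word list. The word lists, the example posts, and the function names are all invented for this illustration; real text mining work would use much richer lexicons or a trained model.

```python
import re
from collections import Counter

# Hypothetical sentiment lexicons, for illustration only.
POSITIVE = {"proud", "healthy", "sustainable"}
NEGATIVE = {"cruel", "toxic", "upset"}


def term_counts(text):
    """Lowercase the text, split it into word tokens, and count each term."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def sentiment_score(text):
    """Positive term count minus negative term count for one document."""
    counts = term_counts(text)
    positive = sum(counts[word] for word in POSITIVE)
    negative = sum(counts[word] for word in NEGATIVE)
    return positive - negative


# Made-up posts standing in for tweets or comments.
posts = [
    "Proud of our healthy, sustainable family farm",
    "Upset about toxic runoff",
]
scores = [sentiment_score(post) for post in posts]
print(scores)
```

The term counts and scores are ordinary numeric variables, so they can be used in reports or fed into a predictive model alongside traditional database fields.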
Besides the actual text we get from social media, the structure of social networks can also be very informative. Social network analysis (SNA) allows us to answer questions such as: Who are the key actors in a network? Who are the most influential members of a network? Who seems to be acting on the periphery? Which connections in the network are most important? Are there key players bridging connections or information between otherwise disconnected groups? Have policies or other forces changed the overall dynamics or interaction between people in the network (i.e. has the network structure changed in any meaningful way), and does that relate to some other performance outcome or goal? I’ve recently used this kind of information to help a company develop a predictive model to improve its viral marketing campaigns.
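Two of these questions can be sketched with a few lines of code. The example below computes degree centrality (the share of other members each person is directly connected to) and finds members whose removal would disconnect the network, which is one simple way to spot players bridging otherwise separate groups. The network, the names, and the function names are hypothetical, and the code assumes an undirected network with at least two members.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (person, person) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def degree_centrality(graph):
    """Fraction of the other members each member is directly tied to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def is_connected(graph, skip=None):
    """Breadth-first search: can every member (ignoring `skip`) be reached?"""
    nodes = [node for node in graph if node != skip]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for nbr in graph[current]:
            if nbr != skip and nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return len(seen) == len(nodes)


def bridging_members(graph):
    """Members whose removal splits the network into disconnected pieces."""
    return sorted(node for node in graph if not is_connected(graph, skip=node))


# Made-up network: two tight groups joined only through cat and dan.
edges = [
    ("ann", "bob"), ("bob", "cat"), ("ann", "cat"),
    ("cat", "dan"),
    ("dan", "eve"), ("eve", "fay"), ("dan", "fay"),
]
graph = build_graph(edges)
centrality = degree_centrality(graph)
bridges = bridging_members(graph)
print(centrality, bridges)
```

In this toy network cat and dan have the most direct ties and are also the only members connecting the two groups, so information passing between the groups has to go through them.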
Of course, it doesn't take a rocket scientist to read tweets, Facebook posts, or blog comments to know when people are upset about a product. But there is also a wealth of knowledge to be gained from this type of information that is so voluminous that it would take an army of social media experts to glean and analyze it. This is the essence of what has been termed in the industry 'big data.' It requires new tools for capturing, storing, processing, and analyzing this data, and a new type of analyst referred to as a data scientist. These powerful analytics could be very beneficial to those in the ag industry or agvocacy groups. But this goes beyond social media, and I will discuss how big data is revolutionizing agriculture at the farm level in the second part of this two-part series on big data.
*Note: I’m not using the term ‘big ag’ in the derogatory sense used by anti-agricultural activists, but in a complimentary sense referring to the complex network of modern family farms, biotechnology companies, food processors, other agribusinesses and retailers that cooperate to bring healthy and sustainable food to your table.
References:
Social Media Analytics. Matt Bogard, Applied Econometric and Analytical Consulting.
http://econometricsense.blogspot.com/2012/09/social-media-analytics.html
With Hadoop, Big Data Analytics Challenges Old-School Business Intelligence. Doug Henschen, Information Week
http://www.informationweek.com/software/business-intelligence/with-hadoop-big-data-analytics-challenge/240001922
Big Bets On Big Data. Eric Savitz, Forbes. http://www.forbes.com/sites/ciocentral/2012/06/22/big-bets-on-big-data/
Creative Commons Image Attributions:
Handheld GPS: By Paul Downey from Berkhamsted, UK (Earthcache De Slufter, uploaded by Partyzan_XXI) [CC-BY-2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons
Satellite: NAVSTAR-2 (GPS-2) satellite. Source: http://www.jpl.nasa.gov/images/grace/grace_083002_browse.jpg. Status: PD-USGov-Military-Air Force
Tractor: bdk [CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons