Wednesday, August 12, 2015

Index Based Crop Insurance and Big Data

There is some interesting work underway in agricultural risk management as it relates to 'big data.'

"Agriculture risk management is about having access to ‘big data’ since growth conditions, risk types, climate and insurance terms vary largely in space. Solid crop models are based on large databases including simulated weather patterns with tempo-spatial correlations, crop planting areas, soil types, irrigation application, fertiliser use, crop rotation and planting calendars. … Similarly, livestock data need to include livestock densities which drive diseases, disease spread vectors and government contingency plans to address outbreaks of highly contagious diseases…Ultimately, big data initiatives will support selling agriculture insurance policies via smart phones based on highly sophisticated indices and will make agriculture insurance and risk management rapidly scalable and accessible to a large majority of those mostly affected – the farmers." - from The Actuary, March 2015

Similarly, in another issue of The Actuary there is more discussion related to this:

"In more recent years IBI (Indemnity Based Insurance) has received a renewed interest, largely drivenby advances in infrastructure (i.e., weather stations), technology (i.e., remote sensing
and satellites), as well as computing power, which has enabled the development of new statistical and mathematical models. With an IBI contract, indemnities are paid based on some index level, which is highly correlated to actual losses. Possible indices include rainfall, yields, or vegetation levels measured by satellites. When an index exceeds a certain predetermined threshold, farmers receive a fast, efficient payout, in some cases delivered via mobile phones. "

The article notes several benefits of IBI products, including decreased moral hazard and adverse selection as well as the ability to transfer risk. However, it also notes challenges related to 'basis' risk, where the index used to determine payments may not be closely linked to actual losses. In such cases, a farmer may receive a payment when no loss is realized, or may experience a loss when the index values don't trigger a payment. In the latter case, the farmer is left feeling like they have paid for something without benefit. The article discusses three types of basis risk: variable, spatial, and temporal. Variable risk occurs when factors not measured by the index drive losses, for example wind speed during pollination or undocumented pest damage, as opposed to measured items like temperature or humidity. Spatial risk arises when the index data come from meteorological stations too far from the field to accurately trigger payments for perils related to rain or temperature. Temporal risk is really interesting to me in terms of the potential for big data:

"The temporal component of the basis risk is related to the fact that the sensitivity of yield to the insured peril often varies over the crops’ stages of growth. Factors such as changes in planting dates, where planting decisions are made based on the onset of rains, for example, can have a substantial impact on correlation as they can shift critical growth stages, which then do not align with the critical periods of risk assumed when the crop insurance product was designed." 

It would seem to me that the kinds of data elements being captured by services from companies like Climate Corp, Farmlink, and John Deere, in combination with other apps and devices (drones, smartphones, and other modes of sensing and data collection), might be informative for creating and monitoring the performance of better indexes to help mitigate the basis risk associated with IBI products.
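To make the payout mechanics and the idea of basis risk a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the rainfall-deficit index, the linear payout between a trigger and an exit level, the contract values, and the three example seasons. It is not taken from The Actuary articles or from any actual insurance product.

```python
from dataclasses import dataclass


@dataclass
class Season:
    """One farm-season: the measured index and the loss actually realized."""
    label: str
    rainfall_mm: float   # index value, e.g. from the nearest weather station
    actual_loss: float   # loss realized on the farm, in currency units


# Hypothetical contract terms, chosen only for illustration.
TRIGGER_MM = 300.0      # payouts start when rainfall drops below this level
EXIT_MM = 100.0         # the maximum payout is reached at or below this level
MAX_PAYOUT = 1000.0


def index_payout(rainfall_mm: float) -> float:
    """Linear payout between trigger and exit, based only on the index."""
    if rainfall_mm >= TRIGGER_MM:
        return 0.0
    shortfall = min(TRIGGER_MM - rainfall_mm, TRIGGER_MM - EXIT_MM)
    return MAX_PAYOUT * shortfall / (TRIGGER_MM - EXIT_MM)


def classify(season: Season) -> str:
    """Compare the index payout to the realized loss to flag basis risk."""
    paid = index_payout(season.rainfall_mm) > 0
    lost = season.actual_loss > 0
    if paid and not lost:
        return "payout without loss"
    if lost and not paid:
        return "loss without payout"
    return "index and loss agree"


# Made-up seasons: one where the index works, and two showing basis risk.
seasons = [
    Season("drought year", rainfall_mm=150.0, actual_loss=800.0),
    Season("pest damage, normal rain", rainfall_mm=350.0, actual_loss=600.0),
    Season("dry station, wet field", rainfall_mm=250.0, actual_loss=0.0),
]

for s in seasons:
    print(f"{s.label}: payout={index_payout(s.rainfall_mm):.0f}, {classify(s)}")
```

The second season is a stylized case of variable basis risk (a peril the index does not measure), and the third is a stylized case of spatial basis risk (the station is dry but the field is not). Better field-level data would, in principle, shrink both gaps.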


New frontiers in agricultural insurance. The Actuary. March 2015. Dr Auguste Boissonnade

Lysa Porth and Ken Seng Tan

See also:

Copula Based Agricultural Risk Models
 Big Ag Meets Big Data (Part 1 & Part 2)

Copula Based Agricultural Risk Models

I have written previously about copulas, with some very elementary examples (see here and here). Below are some papers I have added to my reading list with applications in agricultural risk management. I'll likely follow up in the future with some annotation/review.

Zimmer, D. M. (2015), Crop price comovements during extreme market downturns. Australian Journal of Agricultural and Resource Economics. doi: 10.1111/1467-8489.12119

Energy prices and agricultural commodity prices: Testing correlation using copulas method
Krishna H. Koirala, Ashok K. Mishra, Jeremy M. D'Antoni, Joey E. Mehlhorn
Energy 01/2015; DOI:10.1016/

Xiaoguang Feng, Dermot J. Hayes
Diversifying Systemic Risk in Agriculture: A Copula-based Approach

Mixed-Copula Based Extreme Dependence Analysis: A Case Study of Food and Energy Price Comovements
Feng Qiu and Jieyuan Zhao
Selected Paper prepared for presentation at the Agricultural & Applied Economics Association’s 2014 AAEA Annual Meeting, Minneapolis, MN, July 27-29, 2014.

Price asymmetry between different pork cuts in the USA: a copula approach
Panagiotou and Stavrakoudis. Agricultural and Food Economics (2015) 3:6
DOI 10.1186/s40100-015-0029-2

Copula-Based Models of Systemic Risk in U.S. Agriculture: Implications for Crop Insurance and Reinsurance Contracts
Barry K. Goodwin
October 22, 2012

Friday, August 7, 2015

Data Cleaning

I previously wrote about the importance of hacking skills for economics researchers and said that it may be one of the most important econometrics lessons you never learned.

In one of his regular 'metrics Monday posts, Marc Bellemare recently wrote about data cleaning. He made a lot of great points and this really stands out to me:

"So what I suggest–and what I try to do myself–is to write a .do file that begins by loading raw data files (i.e., Excel or ASCII files) in memory, merges and appends them with one another, and which documents every data-cleaning decision via embedded comments (in Stata, those comments are lines that begin with an asterisk) so as to allow others to see what assumptions have been made and when. This is like writing a chemistry lab report which another chemist could use to replicate your work.

Lastly, another thing I did when I first cleaned data was to “replicate” my own data cleaning: When I had received all the files for my dissertation data in 2004, the data were spread across a dozen spreadsheets. I first merged and cleaned them and did about a month’s worth of empirical work with the data. I then decided to re-merge and re-clean everything from scratch just to make sure I had done everything right the first time."

The ability to document everything you do with the data, and to ensure repeatability and accuracy, is priceless! I somehow get the feeling that this is not practiced religiously in a lot of spaces, and that makes me cringe every time I hear about results from a ‘study.’ In the corporate space where I work, collaboration with other groups and teams goes a lot smoother if everyone can get back to the same starting point when it comes to processing the data. This is especially helpful when pulling data together from the myriad servers and non-traditional data sources and formats that characterize today's big data challenges. And in the academic space, just think about the Quarterly Journal of Political Science's requirements related to providing a replication package with article submissions. Sometimes this sort of 'janitor' work can constitute up to 80% of a data scientist's workload, but it's well worth the effort if done appropriately. Good documentation throughout the process adds to the workload, but pays dividends.
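Marc describes doing this in a Stata .do file. As a rough analogue, here is a minimal sketch in Python of the same habit: start from the raw files, record every cleaning decision as a comment, and then re-clean from scratch to confirm the result replicates. The two small raw extracts, the column names, and the cleaning rules are all hypothetical, invented only to illustrate the workflow.

```python
import csv
import hashlib
import io

# Hypothetical raw extracts standing in for the spreadsheets or ASCII files.
RAW_YIELDS = """farm_id,year,yield_bu
101,2014,172
102,2014,
103,2014,-5
101,2014,172
"""

RAW_FARMS = """farm_id,county
101,Warren
102,Logan
103,Simpson
"""


def read_rows(text):
    """Load a raw CSV extract into a list of dicts without altering it."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(raw_yields, raw_farms):
    """Rebuild the analysis data set from raw inputs, noting every decision."""
    farms = {row["farm_id"]: row["county"] for row in read_rows(raw_farms)}
    seen = set()
    cleaned = []
    for row in read_rows(raw_yields):
        key = (row["farm_id"], row["year"])
        # Decision 1: drop repeated farm-year records (keep the first one).
        if key in seen:
            continue
        seen.add(key)
        # Decision 2: drop records with a missing yield rather than impute.
        if row["yield_bu"] == "":
            continue
        # Decision 3: treat negative yields as data-entry errors and drop them.
        if float(row["yield_bu"]) < 0:
            continue
        # Decision 4: merge in the county from the farm table.
        cleaned.append({
            "farm_id": row["farm_id"],
            "year": int(row["year"]),
            "yield_bu": float(row["yield_bu"]),
            "county": farms[row["farm_id"]],
        })
    return cleaned


def fingerprint(rows):
    """Hash the cleaned rows so two cleaning runs can be compared."""
    return hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()


first_pass = clean(RAW_YIELDS, RAW_FARMS)
second_pass = clean(RAW_YIELDS, RAW_FARMS)  # re-clean from scratch
print(first_pass)
print("replicates:", fingerprint(first_pass) == fingerprint(second_pass))
```

Because the raw inputs are never modified and every rule lives in one script, anyone on the team can get back to the same starting point, and the fingerprint gives a quick check that they did.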

See also:
In God we trust, all others show me your code.
Data Science, 10% inspiration, 90% perspiration

Friday, July 3, 2015

Analytical Translators

A recent Deloitte Press article discusses a role that is becoming more and more important as a result of the explosion of big data and data science in industry. The article, In praise of “light quants” and “analytical translators”, discusses the important role of analytical translators, who may be even harder to find than data scientists themselves.

“When we think about the types of people who make analytics and big data work, we typically think of highly quantitative or computational folks with hard knowledge and skills. You know the usual suspects: data scientists who can make Hadoop jump through hoops, statisticians who dream in SAS or R, data wizards who can extract two years of data from a medical device that normally dumps it after 20 minutes (a true request)....A “light quant” is someone who knows something about analytical and data management methods, and who also knows a lot about specific business problems. The value of the role comes, of course, from connecting the two."

Actually, in some of my previous ponderings and speculations about the coming convergence of big data, analytics, and genomics, I discussed the potential for such a role in the precision agriculture and data science space (See Big Data: Causality and Local Expertise Are Key in Agronomic Applications):

"as we think about all the data that can potentially be captured through the internet of things from seed choice, planting speed, depth, temperature, moisture, etc this could become especially important. This might call for a much more personal service including data savvy reps to help agronomists and growers get the most from these big data apps or the data that new devices and software tools can collect and aggregate.  Data savvy agronomists will need to know the assumptions and nature of any predictions or analysis, or data captured by these devices and apps to know if surrogate factors like Dan mentions have been appropriately considered. And agronomists, data savvy or not will be key in identifying these kinds of issues.  Is there an ap for that? I don't think there is an automated replacement for this kind of expertise, but as economist Tyler Cowen says, the ability to interface well with technology and use it to augment human expertise and judgement is the key to success in the new digital age of big data and automation. "

In fact, I recently discovered an actual position for a major player in this space with the title "BioAg Knowledge Transfer Agronomist" that seems to fit the bill. I think we will see more roles like this in the future.

Farm Link: The Rise of Data Science in Agriculture

The Internet of Things, Big Data, and John Deere

The Use of Knowledge (in a) Big Data Society

Friday, June 19, 2015

Got Data? Probably not like your econometrics textbook!

Recently there has been a lot of discussion of the Angrist and Pischke piece entitled "Why Econometrics Teaching Needs an Overhaul." (read more...) and I have discussed before the large gap between theoretical and applied econometrics.

But here I plan to discuss another potential gap between teaching and application, one that often is not introduced at any point in a traditional undergraduate or graduate economics curriculum: hacking skills. This becomes extremely important for economists who someday find themselves doing applied work in a corporate environment, or working in the area of data science. Drew Conway points out that there are three spheres of data science: hacking skills, math and statistics knowledge, and subject matter expertise. For many economists, the hacking sphere might be the weakest (read also Big Data Requires a New Kind of Expert: The Econinformatrician), while their quantitative training otherwise makes them ripe to become very good data scientists.

Drew Conway's Data Science Venn Diagram

In a recent whitepaper, I discuss this issue:

Students of econometrics might often spend their days learning proofs and theorems, and if they are lucky they will get their hands on some data and access to software to actually practice some applied work, whether it be for a class project or part of a thesis or dissertation. I have written before about the large gap between theoretical and applied econometrics, but there is another gap to speak of, and it has nothing to do with theoretical properties of estimators or interpreting output from STATA, SAS or R. This has to do with raw coding, hacking, and data manipulation skills; the ability to tease out relevant observations and measures from both large structured transactional databases and unstructured log files or web data like tweet-streams. This gap becomes more of an issue as econometricians move from more academic environments to corporate environments and especially so for those economists that begin to take on roles as data scientists. In these environments, not only is it true that problems don’t fit the standard textbook solutions (see article ‘Applied Econometrics’), but the data doesn't look much like the simple data sets often used in textbooks either. One cannot always expect IT people to be able to just hand over a flat file with all the variables and formats that will work for a research project. In fact, the absolute best you might hope for in many environments is a SQL or Oracle database with hundreds or thousands of tables and the tiny bits of information you need spread across a number of them. How do you bring all of this information together to do an analysis? This can be complicated, but for the uninitiated I will present some ‘toy’ examples to give a feel for executing basic database queries to bring together different pieces of information housed in separate tables in order to produce a ‘toy’ analytics ready data set.
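To give a flavor of what such a toy example might look like (this is a sketch of my own for this post, not the example from the whitepaper), the Python snippet below uses the standard library's sqlite3 module to build three small in-memory tables and join them into one analytics-ready row per customer. The table names, columns, and values are all made up for illustration.

```python
import sqlite3

# In-memory database with three hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
CREATE TABLE web_visits (customer_id INTEGER, visits INTEGER);

INSERT INTO customers VALUES (1, 'Midwest'), (2, 'South'), (3, 'West');
INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0);
INSERT INTO web_visits VALUES (1, 12), (2, 3);
""")

# Aggregate orders to one row per customer first, then LEFT JOIN so that
# customers with no orders or no visits are kept (with zeros) rather than lost.
query = """
SELECT c.customer_id,
       c.region,
       COALESCE(o.total_amount, 0) AS total_amount,
       COALESCE(o.n_orders, 0)     AS n_orders,
       COALESCE(w.visits, 0)       AS visits
FROM customers AS c
LEFT JOIN (
    SELECT customer_id, SUM(amount) AS total_amount, COUNT(*) AS n_orders
    FROM orders
    GROUP BY customer_id
) AS o ON o.customer_id = c.customer_id
LEFT JOIN web_visits AS w ON w.customer_id = c.customer_id
ORDER BY c.customer_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

Two choices here tend to matter in practice. Aggregating the orders before joining avoids duplicating customer rows, and using LEFT JOIN instead of an inner join keeps the customers who never ordered, which is often exactly the group an analysis needs to see.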

I am certain that many schools actually do teach some of the basics related to joining and cleaning data sets, and where they don't, students might figure this out on the job or through one research project or another. I am not certain that this gap necessarily needs to be filled as part of any econometrics course. However, it is something students need to be aware of, and offering some sort of workshop, lab or formal course (maybe as part of a more comprehensive data science curriculum like this) would be very beneficial.

Read the whole paper here:

Matt Bogard. 2015. "Joining Tables with SQL: The most important econometrics lesson you may ever learn" The SelectedWorks of Matt Bogard
Available at:  

See also: Is Machine Learning Trending with Economists?

Wednesday, June 17, 2015

Farmlink and the Rise of Data Science in Agriculture

At a recent Global Ag Investing Conference, Dave Gebhardt (Chief Strategy Officer for FarmLink) spoke about the rise of data science in agriculture. You can read the story and find a link to the podcast here:

In the podcast he discusses the way data science is revolutionizing agriculture, and how we are at a "tipping point where advances in science, IT, technology, and computing power have put a whole new level of opportunities before us."

This sounds a lot like what I have previously discussed in relation to big data and the internet of things: 

Watch more about how FarmLink is leveraging IoT, big data, and advanced analytics:


 Big Ag Meets Big Data (Part 1 & Part 2)

Saturday, June 13, 2015

SAS vs R? The right answer to the wrong question?

For a long time I tracked a discussion on LinkedIn that consisted of various opinions about using SAS vs R. Some people can take this very personally. Recently there was an interesting post at the DataCamp blog addressing this topic. They also provided an interesting infographic making some comparisons between SAS and R as well as SPSS. Other popular debates also include Python in the mix. (By the way, it is possible to integrate all three on the SAS platform and you can also run R via the open source integration node in SAS Enterprise Miner 13.1).

Aside: For older versions of SAS EM, can you drop in a code node and call R via PROC IML?

Anyway, getting back to the article, I tend to agree with this one point:

"While these debates are a good thing for the community and the programming language as a whole, they unfortunately also have a negative effect on those individuals that are just in the beginning of their data analytics career. Biased opinions on all sides of the table make it difficult for new data analysts to see the forest for the trees when choosing a statistical programming language."

While I agree with this notion, I want to reflect for a minute on the concept of a programming language. If you think of SAS as just a programming language, then perhaps these kinds of comparisons and discussions make sense, but for a data scientist, I think one's view of analytics should transcend just a language. When we think of an overall analytical solution there is a lot to consider: how the data is generated, how it is captured and warehoused, how it is extracted, cleaned, and accessed by whatever programming tool(s), how it is visualized and analyzed, and ultimately, how we operationalize the solution so that it can be consumed by business users.

So to me the relevant question is not, which programming language is preferred by data scientists, or which program is better for implementing specific machine learning algorithms; but perhaps what is the best analytical solutions platform for solving the problems at hand?