Wednesday, September 30, 2015

Big Data, IoT, Ag Finance, and Causal Inference

Over at my applied economics blog, I recently discussed an article from AgWeb, "How the Fed's Interest Rate Decision Affects Farmers." This got me questioning some of the ramifications of leveraging data analysis in ag lending (from both a farmer and a lender perspective), which ultimately led me to some interesting questions that would be exciting to investigate:
  1. Is there a causal relationship between producers' adoption of IoT and Big Data analytics applications and farm output/performance/productivity?
  2. How do we quantify the outcome? Is it some measure of efficiency or some financial ratio?
  3. If we find improvements in this measure, is it simply a matter of selection? Are great producers likely to be productive anyway, with or without the technology?
  4. Among the best producers, is there still a marginal impact (i.e., a treatment effect) for those that adopt a technology/analytics-based strategy?
  5. Can we segment producers based on the kinds of data collected by IoT devices on equipment, apps, financial records, GPS, etc. (maybe this is not much different from the TrueHarvest benchmarking done at FarmLink)? And are there differentials in outcomes, farming practices, product use patterns, etc. by segment? (A clustering sketch follows this list.)
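As a rough illustration of question 5, below is a minimal sketch of how producer segmentation might work. Everything here is made up for illustration: the data are simulated and the variable names (seed_population, nitrogen_rate, etc.) are hypothetical stand-ins for features that might come off of IoT devices and financial records.

```python
# Hypothetical sketch: segmenting simulated producers with k-means.
# All variable names and data are made up for illustration.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
n = 500  # simulated producers

producers = pd.DataFrame({
    "seed_population": rng.normal(34000, 2000, n),   # seeds per acre
    "nitrogen_rate":   rng.normal(180, 25, n),       # lbs per acre
    "equipment_hours": rng.normal(1200, 300, n),
    "current_ratio":   rng.normal(1.8, 0.5, n),      # a financial liquidity measure
    "yield_bu_acre":   rng.normal(175, 20, n),
})

# Standardize features and assign each producer to one of four segments
X = StandardScaler().fit_transform(producers)
producers["segment"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Compare average outcomes across segments
print(producers.groupby("segment")["yield_bu_acre"].mean())
```

Whether the resulting segments mean anything agronomically or financially is exactly the kind of question raised above.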
See also:
Big Ag Meets Big Data (Part 1 & Part 2)
Big Data- Causality and Local Expertise are Key in Agronomic Applications
Big Ag and Big Data-Marc Bellemare
Other Big Data and Agricultural related Application Posts at EconometricSense
Causal Inference and Experimental Design Roundup

Friday, September 25, 2015

Propensity Score Matching Meets Difference-in-Differences

I have recently stumbled across a number of studies incorporating both difference-in-differences (DD) and propensity score methods. As discussed before, DD is a special case of fixed effects panel methods.

In the World Bank's publication "Impact Evaluation in Practice," the authors give a nice summary of the power of DD for identifying causal effects:

"...we can conclude that many unobserved characteristics of individuals are also more or less constant over time. Consider, for example, a person's intelligence or such personality traits as motivation, optimism, self-discipline, or family health history...Interestingly, we are canceling out(or controling for) not only the effect of observed time invariant characteristics but also the effect of unobserved time invariant characteristics such as those mentioned above"

So with DD we can actually control for unobserved characteristics that we may not have data on, or perhaps couldn't measure appropriately or even quantify. That's powerful. In this framework we are controlling for (time-invariant) unobservable characteristics that may be contributing to selection bias; we are achieving identification of treatment effects in a selection-on-unobservables context.
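To see the mechanics, here is a minimal sketch of the canonical two-group, two-period DD regression on simulated data. The numbers and the "true" effect of 5.0 are made up; the point is that the coefficient on the treated-by-post interaction recovers the effect even though the simulated confounder never enters the model.

```python
# Minimal difference-in-differences sketch on simulated two-period data.
# The true treatment effect is set to 5.0; everything is made up for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
treated = rng.integers(0, 2, n)                    # treatment group indicator
unobserved = rng.normal(0, 2, n) + 2 * treated     # time-invariant unobserved confounder

y_pre  = 10 + unobserved + rng.normal(0, 1, n)
y_post = 12 + unobserved + 5.0 * treated + rng.normal(0, 1, n)

df = pd.DataFrame({
    "y": np.concatenate([y_pre, y_post]),
    "treated": np.tile(treated, 2),
    "post": np.repeat([0, 1], n),
})

dd = smf.ols("y ~ treated + post + treated:post", data=df).fit()
print(dd.params["treated:post"])   # close to 5.0 despite the omitted confounder
```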

On the other hand, with propensity score matching, we are appealing to the conditional independence assumption, the idea that matched comparisons imply balance on observed covariates, which 'recreates' a situation similar to a randomized experiment where all subjects are essentially the same except for the treatment (Thoemmes and Kim, 2011). Propensity score matching can identify treatment effects in a selection-on-observables context.
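For comparison, here is a bare-bones propensity score matching sketch under selection on observables: a logistic propensity model followed by one-nearest-neighbor matching on the estimated score. Again, the data and the "true" effect of 2.0 are simulated for illustration only.

```python
# Bare-bones propensity score matching sketch on simulated data (selection on observables).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
n = 2000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
p_treat = 1 / (1 + np.exp(-(0.8 * x1 - 0.5 * x2)))        # selection on observables only
d = rng.binomial(1, p_treat)
y = 2.0 * d + 1.5 * x1 + 1.0 * x2 + rng.normal(size=n)    # true effect = 2.0

# Estimate the propensity score with a logistic regression
X = np.column_stack([x1, x2])
pscore = LogisticRegression().fit(X, d).predict_proba(X)[:, 1]

# Match each treated unit to its nearest control on the propensity score
treated_idx, control_idx = np.where(d == 1)[0], np.where(d == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(pscore[control_idx].reshape(-1, 1))
matches = control_idx[nn.kneighbors(pscore[treated_idx].reshape(-1, 1))[1].ravel()]

att = np.mean(y[treated_idx] - y[matches])
print(att)   # should be close to the true effect of 2.0
```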

But what if we combine both approaches? The Impact Evaluation book has a section on mixed methods that gives a really good treatment of the power of using both PSM and DD:

"Matched difference-in-differences is one example of combining methods. As discussed previusly, simple propensity score matching cannot account for unobserved characteristics that might explain why a group chooses to enroll in a program and that might also affect outcomes. By contrast, matching combined with difference-in-differences at least takes care of any unobserved characteristics that are constant across time between the two groups"

Below are several papers that utilize the combination of DD and PSM:

Does Matching Overcome LaLonde's Critique of Nonexperimental Estimators? Jeffrey Smith and Petra Todd. University of Maryland. 2003.

Do Agricultural Land Preservation Programs Reduce Farmland Loss? Evidence from a Propensity Score Matching Estimator. Xiangping Liu and Lori Lynch. January 2010.

Measuring the Impact of Meat Packing and Processing Facilities in the Nonmetropolitan Midwest: A Difference-in-Differences Approach. Georgeanne M. Artz, Peter Orazem, and Daniel Otto. Working Paper #03003, Iowa State University. November 2005.

How Effective is Health Coaching in Reducing Health Services Expenditures? Yvonne Jonk, Karen Lawson, Heidi O'Connor, Kirsten S. Riise, David Eisenberg, Bryan Dowd, and Mary J. Kreitzer. Medical Care, Volume 53, Number 2, February 2015.

References:

Impact Evaluation in Practice. Paul J. Gertler, Sebastian Martinez, Patrick Premand, Laura B. Rawlings, and Christel M. J. Vermeersch. The World Bank, December 2010.

Friday, September 11, 2015

Mastering Metrics....and the Grain Markets

I recently finished two great books: Mastering 'Metrics and Mastering the Grain Markets.

Mastering the Grain Markets

While I have a background in agricultural and applied economics, my interests were always related to public choice and the environmental implications of biotechnology, as well as econometrics (hence this blog). So I didn't really have much formal background related to commodity markets, other than a little exposure to options through a couple of finance classes. I have certainly read some really good extension publications related to futures, options, and hedging, but Mastering the Grain Markets by Elaine Kub really brings these issues to life. She brought me back to my crop scouting days in her many discussions of corn production and the agronomics of our major commodities. She also tackles some major issues and controversies associated with modern agriculture, everything from speculation to biotech to sustainability issues, gluten fad diets, and more. Prepare for a trip from gate to plate in this book that teaches like a textbook but reads like a novel!

Even if you think all you are interested in is the specifics of how futures and options work, you'll end up being convinced that the holistic approach is essential. To borrow one quote:

"..any participation in the grain markets is a form of participation in agriculture, and it should be regarded as one piece of a beautiful, challenging, miraculous whole."
 
A couple of areas that struck me as particularly interesting were her discussions of counterparty risk and over-the-counter (OTC) contracts. I'll probably have a separate post on this blog or my ag econ blog regarding counterparty risk.

So, why share a review of a grain markets book on an applied econometrics blog? Well, all the discussion about OTCs and risk management rekindled my interest in copulas, which I have blogged about before, and also made me a little more curious about index-based crop insurance. Risk modeling in commodities goes hand in hand with econometrics. Oh, and she even hits on precision agriculture and alludes to big data in agriculture:

"At the end of the growing season, he has every data point he could possibly need (seed population, seed depth, input rates, final yield, soil moisture, etc.) to fine tune his production practices on each GPS mapped square foot of his farm."

Mastering Metrics

Before reading MM, I had read Angrist and Pischke's Mostly Harmless Econometrics. It was my first rigorous introduction to the potential outcomes framework and causal inference. It took me a while to work through, and I still reference it often. Even though Mastering 'Metrics was supposed to be a 'lite' version, or maybe an undergraduate version, of MHE, reading in 'reverse' order worked out well. What I really liked was their intro to regression, and the presentation of regression as a matching estimator became even more crystal clear to me than it did in MHE. To borrow a quote:

"Specifically, regression estimates are weighted averages of multiple matched comparisons"

I really think a lot of people I encounter have a hard time thinking about that. I also got better insight and clarification on a number of issues related to instrumental variables, regression discontinuity, and difference-in-differences. Within the IV discussion, I really liked the causal-chain-of-effects presentation and the discussion of 'intent to treat', and I now better understand the issues related to compliers and noncompliers. They also got me up to speed on the differences between parametric and non-parametric RD and an important distinction between fuzzy and sharp RD:

"....with fuzzy, applicants who cross a threshold are exposed to a more intense treatment, while with a sharp design, treatment switches cleanly on or off at the cutoff."

Another thing that stood out to me: in their DD chapter they made some clarifications about weighted regression and clustered standard errors that were very helpful. More generally, I really liked their treatment of the regression anatomy formula and understood it much better on this reading. Their basic review and treatment of inference, standard errors, and t-statistics is really great and a good way to segue an undergraduate student from an introductory statistics class into the more advanced topics they present later in the text. I could also see certain graduate programs, even outside of economics, making use of this text.
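To make the "weighted averages of multiple matched comparisons" quote above concrete, here is a small simulation of my own (a toy example, not from the book). With a binary treatment and a single discrete control entered as a full set of dummies, the OLS coefficient on treatment equals the within-cell treated-vs-control differences averaged with weights proportional to each cell's share of the data times the within-cell variance of treatment.

```python
# Toy illustration: regression as a weighted average of matched comparisons.
# Simulated data; the regression coefficient on d matches the variance-weighted
# average of within-cell treated-vs-control differences.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 5000
group = rng.integers(0, 3, n)                        # discrete covariate (3 cells)
p_treat = np.array([0.2, 0.5, 0.8])[group]           # treatment probability varies by cell
d = rng.binomial(1, p_treat)
y = 1.0 * d + 2.0 * group + rng.normal(size=n)       # true effect = 1.0

df = pd.DataFrame({"y": y, "d": d, "g": group})

# OLS with the covariate entered as a saturated set of dummies
beta = smf.ols("y ~ d + C(g)", data=df).fit().params["d"]

# Weighted average of within-cell matched comparisons
diffs, weights = [], []
for _, cell in df.groupby("g"):
    diffs.append(cell.loc[cell.d == 1, "y"].mean() - cell.loc[cell.d == 0, "y"].mean())
    weights.append(len(cell) / len(df) * cell["d"].var(ddof=0))
weighted_avg = np.average(diffs, weights=weights)

print(beta, weighted_avg)   # the two agree (up to floating point)
```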

Both Mastering the Grain Markets and Mastering Metrics end with a final chapter tying everything together.

I highly recommend both books.

More thoughts....

So above I mentioned that risk modeling and econometrics go hand in hand, but I have been wondering: were any of the techniques covered in MM useful for work related to the commodity markets? In terms of informing marketing and risk management strategies, I'm not sure. Maybe some readers have some idea. But in terms of policy analysis as it relates to commodity markets, perhaps. There are some who advocate that we should restrict speculation in commodity markets. Scott Irwin looked at the impact of index funds on commodity markets using Granger causality (although Granger causality was not discussed in MHE or MM). Other work has relied on panel methods. A quick Google search reveals some related work using the instrumental variables methods discussed in MM. For now I'll just say to be continued.....
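This isn't a replication of Irwin and Sanders by any means, but just to show the mechanics of a Granger causality test in this kind of setting, here is a minimal sketch on simulated series. The variable names (index_flows, futures_returns) and the data are hypothetical.

```python
# Minimal Granger causality sketch on simulated data (not a replication of Irwin & Sanders).
# Tests whether lagged index-fund flows help predict futures returns.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(4)
T = 300
index_flows = rng.normal(size=T)                      # hypothetical changes in index-fund positions
futures_returns = 0.2 * np.roll(index_flows, 1) + rng.normal(size=T)
futures_returns[0] = rng.normal()                     # discard the wrap-around lag

df = pd.DataFrame({"futures_returns": futures_returns,
                   "index_flows": index_flows})

# Null hypothesis: index_flows do NOT Granger-cause futures_returns
grangercausalitytests(df[["futures_returns", "index_flows"]], maxlag=4)
```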

References:

Irwin, S. H. and D. R. Sanders (2010), "The Impact of Index and Swap Funds on Commodity Futures Markets: Preliminary Results", OECD Food, Agriculture and Fisheries Working Papers, No. 27, OECD Publishing. doi: 10.1787/5kmd40wl1t5f-en


Saturday, September 5, 2015

Econometrics, Math, and Machine Learning...what?

Noah Smith recently wrote a Bloomberg View piece titled "Economics Has a Math Problem" that has caught a lot of attention. There were three arguments or themes in it that I found particularly interesting.

#1 In economics, theory often takes a unique role in the determination of causality

"In most applied math disciplines -- computational biology, fluid dynamics, quantitative finance -- mathematical theories are always tied to the evidence. If a theory hasn’t been tested, it’s treated as pure conjecture....Not so in econ. Traditionally, economists have put the facts in a subordinate role and theory in the driver’s seat. "

This alone might seem controversial to some, but to many economists causality is a theory-driven phenomenon and can never truly be determined by data alone. I won't expand on this any further. But the point is that economists, outside of a purely predictive or forecasting scenario, are often interested in answering causal questions, and despite all the work since the credibility revolution in terms of quasi-experimental designs, theory still plays an important role in determining causality and the direction of effects.

#2 In economics and econometrics, there is a huge emphasis on explaining causal relationships, both theoretically and empirically, but in machine learning the emphasis is on prediction, classification, and pattern recognition, largely devoid of theory or assumptions about data generating processes

"Machine learning is a broad term for a collection of statistical data analysis techniques that identify key features of the data without committing to a theory. To use an old adage, machine learning “lets the data speak.”…machine learning techniques emphasized causality less than traditional economic statistical techniques, or what's usually known as econometrics. In other words, machine learning is more about forecasting than about understanding the effects of policy."

That really gets at what I have written before about machine learning vs. classical inference. (If Noah's article is interesting to you, then I highly recommend the Leo Breiman paper I reference in that post.) It's true, at first it might seem that most economists interested in causal inference would sideline machine learning methods for their lack of emphasis on identification of causal effects or a data generating process. One of the biggest differences between the econometric theory most economists have been trained in and the new field of data science is, in effect, familiarity with and use of methods from machine learning. But if they are interested strictly in predictive modeling and forecasting, these methods might be quite appealing. (I've argued before that economists are ripe for being data scientists.) As we know, the methods and approaches we take to analyzing our data differ substantially depending on whether we are trying to explain vs. predict.

But then things start to get interesting:

#3 Recent work in econometrics has narrowed the gap between machine learning and econometrics

"But Athey and Imbens have also studied how machine learning techniques can be used to isolate causal effects, which would allow economists to draw policy implications."

I have not actually drilled into the references and details around this, but it is interesting. Just thinking about it a little, I recalled that not long ago I worked on a project where I used gradient boosting (a machine learning algorithm) to estimate propensity scores for estimating treatment effects associated with a web app.
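I won't share that project's code, but a generic, hedged sketch of the idea looks something like the following: simulated data, scikit-learn's gradient boosting classifier for the propensity model, and inverse probability weighting (rather than matching) for the treatment effect. None of the numbers or settings here come from the actual project.

```python
# Generic sketch: gradient-boosted propensity scores used for inverse probability weighting.
# Simulated data only; not the actual project code.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(5)
n = 5000
X = rng.normal(size=(n, 4))
p_treat = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1] ** 2 - 1)))   # nonlinear selection
d = rng.binomial(1, p_treat)
y = 2.0 * d + X[:, 0] + X[:, 1] ** 2 + rng.normal(size=n)         # true effect = 2.0

# Boosting picks up the nonlinear selection without specifying it by hand
gbm = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=0.05)
pscore = gbm.fit(X, d).predict_proba(X)[:, 1]
pscore = np.clip(pscore, 0.01, 0.99)   # trim extreme scores to stabilize weights

# Inverse probability weighted (Hajek-style) estimate of the average treatment effect
ate = (np.average(y[d == 1], weights=1 / pscore[d == 1])
       - np.average(y[d == 0], weights=1 / (1 - pscore[d == 0])))
print(ate)   # should land roughly near the true effect of 2.0
```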

Even one of the masters of 'metrics and causal inference, Josh Angrist, is offering a course titled "Applied Econometrics: Mostly Harmless Big Data" via MIT's open course platform. And for a long time, economist Kenneth Sanford has been following this trend of emphasis on data science and machine learning in econometrics.

Overall, I think it will be interesting to see more examples of applications of machine learning in causal inference. But when these applications involve big data and the internet of things, economists will really have to test their knowledge of a range of other big data tools that have little to do with building models or doing calculations.

See also:
Analytics vs Causal Inference
Big Data: Don't throw the baby out with the bath water
Propensity Score Weighting: Logistic vs CART vs Boosting vs Random Forests 
Data Cleaning
Got Data? Probably not like your econometrics textbook!
In God we trust, all others show me your code.
Data Science, 10% inspiration, 90% perspiration
Related:
 Big Ag Meets Big Data (Part 1 & Part 2)