Tuesday, December 2, 2014

A Cookbook Econometrics Analogy

Previously I wrote a post on applied econometrics, which really was not all that original. It was motivated by a previous post made by Marc Bellemare and Dave Giles, and I just added some commentary on my personal experience as well as some quotes from Peter Kennedy's A Guide to Econometrics.

Since then, I've been reading Kennedy's chapter on applied econometrics in greater detail (I have a copy of the 6th edition) and I found the following interesting analogy. Typically, cookbook analogies are used negatively, to describe practitioners mind-numbingly running regressions and applying tests without a strong appreciation for the underlying theory. This one is of a different flavor, and to me it gives a good impression of what 'doing econometrics' actually feels like:

From Valavanis (1959, p. 83), as quoted in Kennedy, 6th edition:

"Econometric theory is like an exquisitely balanced French recipe, spelling out precisely with how many turns to mix the sauce, how many carats of spice to add, and for how many milliseconds to bake the mixture at exactly 474 degrees of temperature. But when the statistical cook turns to raw materials, he finds that hearts of cactus fruit are unavailable, so he substitutes chunks of cantaloupe; where the recipe calls for vermicelli he uses shredded wheat; and he substitutes green garment dye for curry, ping-pong balls for turtle's eggs, and for Chalifougnac vintage 1883, a can of turpentine."

It really gets uncomfortable when you are presenting at a seminar, a conference, or before some other audience, and someone who isn't elbow deep in the data points out that your estimator isn't theoretically valid because you used 'turpentine' when the recipe (or econometric theory) calls for Chalifougnac vintage 1883, or when someone well-versed in theory but unaware of the social norms of applied econometrics tries to make you look incompetent by pointing out this 'mistake.'

It also gives me That Modeling Feeling.

Friday, November 28, 2014

Applied Econometrics

I really enjoy Marc Bellemare's applied econ posts, but I especially enjoy his econometrics-related posts (for instance, a while back he wrote some really nice posts related to linear probability models here and here).

Recently he wrote a piece entitled "In defense of the cookbook approach to econometrics." At one point he states:

"The problem is that there is often a wide gap between the theory and practice of econometrics. This is especially so given that the practice of econometrics is often the result of social norms..."

He goes on to make the point that 'cookbook' classes should be viewed as a complement to, not a substitute for, theoretical econometrics classes.

It is the gap between theory and practice that has given me a ton of grief in the last few years. After spending hours and hours in graduate school working through tons of theorems and proofs to basically restate everything I learned as an undergraduate in a more rigorous tone, I found that when it came to actually doing econometrics I wasn't much the better, and sometimes wondered if maybe I had even regressed. At every corner was a different challenge that seemed at odds with everything I learned in grad school. Doing econometrics felt like running in water. As Peter Kennedy states in the applied econometrics chapter of his popular A Guide to Econometrics, "econometrics is much easier without data".

As Angrist and Pischke state: "if applied econometrics were easy theorists would do it."

I was very lucky that my first job out of graduate school was at the same university where I attended as an undergraduate, and I had the benefit of my former professors to show me the ropes, or the 'social norms' as mentioned by Bellemare. The thing is, all along, I just thought that since I was an MS vs PhD graduate, maybe I didn't know these things because I just hadn't had that last theory course, or maybe the 'applied' econometrics course I took was too weak on theory. But as Kennedy points out:

"In virtually every econometric analysis there is a gap, usually a vast gulf, between the problem at hand and the closest scenario to which standard econometric theory is applicable....the issue here is that in their econometric theory courses students are taught standard solutions to standard problems, but in practice there are no standard problems...Applied econometricians are continually faced with awkward compromises..."

The hard part for the recent graduate who has not had a good applied econometrics course is figuring out how to compromise, or which sins are more forgivable or harmless than others.

Another issue with applied vs. theoretical econometrics is software implementation. Most economists I know seem to use Stata, but I have primarily worked in a SAS shop and have taught myself R. Most of my coursework in econometrics, however, was done on PAPER, with some limited work in SPSS. GRETL is also popular as a teaching tool. Statistical programming is a whole new world, and propensity score matching in SAS is not straightforward (although paper 314-2012 is a really nice one if you are interested). Speaking of which, if you don't have the luxury of someone showing you the ropes, maybe the best thing you can do is attend some conferences. While not strictly an academic conference, SAS Global Forum has been a great conference, with proceedings replete with applied papers that include software implementation. R bloggers also offer some good examples of applied work with software implementation.

See also:
Culture War: Classical Statistics vs. Machine Learning

Ambitious vs. Ambiguous Modeling

Mostly Harmless Econometrics as an off-road backwoods survival manual for practitioners.

Friday, August 15, 2014

Implications of Maximum Likelihood Methods for Missing Data in Predictive Modeling Applications

I believe there are some key things to consider when we have missing data and we are 1) estimating a model to be used for prediction and 2) using the model to predict new cases, which may also have missing values for key predictor variables. For purposes of this discussion, I'm thinking specifically about situations where one intends to use ML or FIML to estimate the parameters that define (or train) the model. However, if you are using some other method, you still must consider how to handle missing values in both the model training and scoring exercises. Also, there may be distinctions to consider in a purely predictive/machine learning application vs. causal inference.

But for this conversation, I will start with discussion around maximum likelihood estimation.

Standard Maximum Likelihood: 

Maximize L = Π f(y,x1,…xk;β)  

With standard ML, the likelihood function is optimized, providing the values for β which define our regression model (like Y = β0 + β1 x1 + … + βk xk + e).

As is the case in many modeling scenarios, with standard MLE only complete cases are used to estimate the model. That is, for each 'row' or individual case, all values of 'x' and 'y' must be defined. If a single explanatory variable 'x' or the dependent variable 'y' has a missing value, then that individual/case/row is excluded from the data. This is referred to as listwise deletion. In many scenarios this can be undesirable because, for one thing, you are reducing the amount of information used to estimate your model. Paul Allison has a very informative discussion of this in a recent post at Statistical Horizons.

Full Information Maximum Likelihood (FIML): 

Maximize L = Π f(y,x1,…xk;β)  Π f(y,x3,…xk;β)  

Full information maximum likelihood is an estimation strategy that allows us to get parameter estimates even in the presence of missing data. The overall likelihood is the product of the likelihoods specified for all observations. If there are m observations with no missing values but n observations missing x1 and x2, we account for that by specifying the overall likelihood function as the product of two terms, i.e. the likelihood function is specified as a product of likelihoods for both complete and incomplete cases. In the example above, the second term in the product depicts the cases where individual 'i' has missing values for the first two variables (x1 and x2). The first term represents the likelihood for all of the complete cases. The overall likelihood is then optimized, providing the values for β which define our regression model.
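
To make the 'product of two terms' idea concrete, here is a minimal sketch that only evaluates (it does not maximize) an observed-data log-likelihood. It assumes a simplified setting that is not from the sources cited here: just two jointly normal variables, y and a single x, with x sometimes missing. Complete cases contribute the bivariate density, and cases missing x contribute only the marginal density of y. The function names and the toy numbers are hypothetical.

```python
import math

def normal_logpdf(v, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)

def bivariate_normal_logpdf(y, x, mu_y, mu_x, var_y, var_x, cov):
    """Log density of a bivariate normal distribution for the pair (y, x)."""
    det = var_y * var_x - cov ** 2
    dy, dx = y - mu_y, x - mu_x
    quad = (var_x * dy ** 2 - 2 * cov * dy * dx + var_y * dx ** 2) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)

def observed_data_loglik(rows, mu_y, mu_x, var_y, var_x, cov):
    """Sum of log-likelihood contributions, using whatever each case has observed."""
    total = 0.0
    for y, x in rows:
        if x is None:
            # Incomplete case: only y is observed, so use the marginal density of y.
            total += normal_logpdf(y, mu_y, var_y)
        else:
            # Complete case: use the joint density of (y, x).
            total += bivariate_normal_logpdf(y, x, mu_y, mu_x, var_y, var_x, cov)
    return total

# Toy data (hypothetical): None marks a missing x.
rows = [(1.2, 0.5), (0.7, None), (2.1, 1.4), (1.5, None)]
print(observed_data_loglik(rows, mu_y=1.0, mu_x=0.5, var_y=1.0, var_x=1.0, cov=0.3))
```

In practice an optimizer would search over the parameters to maximize this quantity; the point of the sketch is only that no case is dropped and no value is filled in.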

Both ML and FIML are methods for estimating parameters; they are not imputation procedures per se. As Karen Grace-Martin (The Analysis Factor) aptly puts it: “This method does not impute any data, but rather uses each case's available data to compute maximum likelihood estimates.”

Predictive Modeling Applications

So if we have missing data, we could use FIML to obtain parameter estimates for a model, but what if we actually want to predict outcomes ‘y’ for some new data set? By assumption, if we are trying to predict ‘y’ we don’t have values for y in our data set. We will attempt to take the model or parameter estimates we got from FIML and predict y based on the estimated values of our β’s and the observed x’s. But what if in the new data we have missing x’s? Can’t we just use FIML to get our model and predictions? No. First, we have already derived our model via FIML using our original or training data. Again, FIML is a model or parameter estimation procedure. To apply FIML in our new data set would imply two things:

1) We want to estimate a new model.
2) We have observed values for ‘y’, the thing we are trying to predict, which by assumption we don’t have. So there is no way to properly specify the likelihood to even implement FIML.

But we don't want to estimate a new model in the first place. If we want to make new predictions based on our original model estimated using maximum likelihood, we have to use some type of actual imputation procedure to derive values for the missing x’s. This may often be the case when scoring or targeting customers for special promotions, where the original scoring model used income data but income has to be imputed for some customers based on relationships with other available data sources, like income for the zip code tabulation area.
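
A minimal sketch of that scoring step, under assumptions of my own: a one-predictor linear model whose coefficients were already estimated elsewhere (the values below are made up), and a simple rule that fills in a missing income with the average for the customer's zip code area. The function name, field names, and numbers are all hypothetical.

```python
def score_new_cases(cases, beta0, beta_income, zip_mean_income):
    """Score new cases with fixed coefficients, imputing missing income first."""
    predictions = []
    for case in cases:
        income = case["income"]
        if income is None:
            # No re-estimation here: we only fill in the missing predictor,
            # using the zip-code-level average as a stand-in.
            income = zip_mean_income[case["zip"]]
        predictions.append(beta0 + beta_income * income)
    return predictions

# Hypothetical coefficients from a previously estimated model (income in $1,000s).
beta0, beta_income = 1.0, 0.5

# Hypothetical zip-code-level average incomes used for imputation.
zip_mean_income = {"40000": 60.0}

new_cases = [
    {"zip": "40000", "income": 40.0},  # income observed
    {"zip": "40000", "income": None},  # income missing, will be imputed
]
print(score_new_cases(new_cases, beta0, beta_income, zip_mean_income))
```

The model parameters never change during scoring; only the missing input is replaced, which is why some imputation rule has to be chosen and documented alongside the model.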

SAS Global Forum Paper 312-2012
Handling Missing Data by Maximum Likelihood
Paul D. Allison, Statistical Horizons, Haverford, PA, USA

Two Recommended Solutions for Missing Data: Multiple Imputation and Maximum Likelihood. Karen Grace-Martin. The Analysis Factor: http://www.theanalysisfactor.com/missing-data-two-recommended-solutions/ Accessed 8/14/14

Listwise Deletion: It's Not Evil. Paul Allison, Statistical Horizons. June 13, 2014. http://www.statisticalhorizons.com/listwise-deletion-its-not-evil

Saturday, June 28, 2014

Linear Probability Models for Skewed Distributions with High Mass Points

There are a lot of methods discussed in the literature related to modeling skewed distributions with high mass points, including log transformations, two-part models, GLM, etc. In some previous posts I have discussed linear probability models in the context of causal inference. I've also discussed the use of quantile regression as a strategy to model highly skewed continuous and count data. Mullahy (2009) alludes to the use of quantile regression as well:

"Such concerns should translate into empirical strategies that target the high-end parameters of particular interest, e.g. models for Prob(y ≥ k | x) or quantile regression models"

The focus on high-end parameters using linear probability models is mentioned in Angrist and Pischke (2009):

"COP [conditional-on-positive] effects are sometimes motivated by a researcher's sense that when the outcome distribution has a mass point-that is, when it piles up on a particular value, such as zero-or has a heavily skewed distribution, or both, then an analysis of effects on averages misses something. Analysis of effects on averages indeed miss some things, such as changes in the probability of specific values or a shift in quantiles away from the median. But why not look at these distribution effects directly? Distribution outcomes include the likelihood that annual medical expenditures exceed zero, 100 dollars, 200 dollars, and so on. In other words, put 1[Yi > c] for different choices of c on the left hand side of the regression of interest...the idea of looking directly at distribution effects with linear probability models is illustrated by Angrist (2001),...Alternatively, if quantiles provide a focal point, we can use quantile regressions to model them."
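
The suggestion to 'put 1[Yi > c] on the left hand side' can be sketched in a few lines. This is only an illustration with made-up data, assuming a single binary treatment indicator and no other controls; in that special case the OLS slope of the linear probability model is simply the difference between the two groups in the share of outcomes exceeding c. The function names and numbers are hypothetical.

```python
def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def lpm_effect(treated, outcome, c):
    """Linear probability model: regress the indicator 1[y > c] on treatment."""
    indicator = [1.0 if yi > c else 0.0 for yi in outcome]
    return ols_slope(treated, indicator)

# Hypothetical data: a treatment dummy and annual medical spending in dollars,
# with a mass point at zero and a long right tail.
treated = [1, 1, 1, 1, 0, 0, 0, 0]
spending = [0, 50, 150, 300, 0, 0, 20, 80]

for c in (0, 100, 200):
    print(c, lpm_effect(treated, spending, c))
```

Each threshold c gives a separate estimate of how treatment shifts the probability of exceeding that level, which is the 'distribution effect' the quote refers to.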


Mostly Harmless Econometrics. Angrist and Pischke. 2009

Angrist, J.D. Estimation of Limited Dependent Variable Models With Dummy Endogenous Regressors: Simple Strategies for Empirical Practice. Journal of Business & Economic Statistics January 2001, Vol. 19, No. 1.

John Mullahy Univ. of Wisconsin-Madison
January 2009

Friday, June 27, 2014

Is distance a proxy for pesticide exposure and is it related to ASD? Some thoughts...

Recently a paper has made some headlines, and the message getting out seems to be that living near a farm field where there have been pesticide applications has been found to increase the risk of autism spectrum disorder (ASD). A few things about the paper. First, one of the things I admire about econometric work is the attempt to make use of some data set, some variable, or some measurement to estimate the effect of some intervention or policy, in a world where we can’t always get our hands on the thing we are really trying to measure. The book Freakonomics comes to mind, or quasi-experimental designs and the use of instrumental variables.

Second, I’m not an epidemiologist, toxicologist, or entomologist, nor do I have a background in medicine or public health. I don’t have the subject matter expertise to critique this article, but I can express my appreciation for the statistical methods that the authors used. While the authors could not (or simply did not) actually measure pesticide exposure in any medical or biological sense, they attempted to infer that distance from an agricultural field might correlate well enough to proxy for exposure. That is a large assumption and perhaps one of the greatest challenges of the study. It is not a study of actual exposure, so from this point on I’ll try to refer to exposure only in quotes. But the authors did make clever use of some interesting data sources. They matched required reports of pesticide applications and report dates with the zip codes of the study respondents and their reported pregnancy stages to determine distance from application and the point in pregnancy at which they were ‘exposed’. They reported distance in three bands or buffer zones of 1.25, 1.5, and 1.75 km. This was actually nice work, if distance could be equated to some known level of exposure. Unfortunately, while they cited some other work attempting to tie exposure to ASD, I did not see a citation in the body of the text where any work had been done justifying the use of distance as a proxy, or those particular bands. More on this later. They also attempted to control for a number of confounders, applied survey weighting to ‘weight up’ the effects to reflect the parent population, and in addition, at least based on my reading, may have even tried to control for some level of selection bias by using IPTW regression with SAS.

Discussion of Results

There were at least four major findings in the paper:

(1) Proximity to organophosphates at some point during gestation was associated with a 60% increased risk for ASD

(2) higher for 3rd trimester exposures [OR = 2.0, 95% confidence interval (CI) = (1.1, 3.6)],

(3) and 2nd trimester chlorpyrifos applications: OR = 3.3 [95% CI = (1.5, 7.4)].

(4) Children of mothers residing near pyrethroid insecticide applications just prior to conception or during 3rd trimester were at greater risk for both ASD and DD, with ORs ranging from 1.7 to 2.3.

So where do we go with these results? First of all, these findings are all based on odds ratios. The reported odds ratio in the first finding above was 1.60, which implies a [1.6-1.0]*100 = 60% increase in the odds of ASD for ‘exposed’ vs ‘non-exposed’ children. This is an increase in odds, and does not have exactly the same interpretation as an increase in probability (see more about logistic regression and odds ratios here). Some might read the headline and walk away with the wrong idea that living within proximity of farm fields with organophosphate applications constitutes ‘exposure’ to organophosphates and is associated with a 60% increased probability of ASD, but that is stacking one large assumption on top of another misinterpretation.
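
To see why a 60% increase in odds is not a 60% increase in probability, here is a small sketch that converts an odds ratio into probabilities for an assumed baseline risk. The baseline probabilities below are hypothetical and are not taken from the paper; the function name is my own.

```python
def probability_after_odds_ratio(baseline_prob, odds_ratio):
    """Apply an odds ratio to a baseline probability and return the new probability."""
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds_ratio * baseline_odds
    return new_odds / (1.0 + new_odds)

# Hypothetical baseline probabilities, chosen only to show how the answer varies.
for p0 in (0.01, 0.10, 0.50):
    p1 = probability_after_odds_ratio(p0, 1.6)
    relative_change = (p1 - p0) / p0
    print(f"baseline={p0:.2f}  new={p1:.4f}  relative increase={relative_change:.1%}")
```

For rare outcomes the relative increase in probability is close to (but still below) the increase in odds, while for common outcomes it is much smaller, so the translation always depends on the baseline.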

However, these findings are but a slice of the full results reported in the paper. Table 3 reports a number of findings across the distance bands, types of pesticide, and pregnancy stages. One thing about odds ratios: an odds ratio of ‘1’ implies no effect. The vast majority of these findings were associated with odds ratios whose 95% confidence intervals contained 1, or had a bound very close to 1. For those who like to interpret p-values, a 95% CI for an odds ratio that contains 1 implies that the estimated regression coefficient in the model has a p-value > .05, i.e. a non-significant result.

Another interesting thing about the table is that there doesn’t seem to be any pattern of distance/pregnancy stage/chemistry associated with the estimated effects or odds ratios. This point is made well in a recent blog post regarding this study at scienceblogs.com here.
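
The link between a confidence interval and a p-value can be sketched as follows. This assumes the interval is a standard Wald interval that is symmetric on the log-odds scale, which is my assumption and not something confirmed by the paper, so the p-values are only approximations. The first two sets of numbers are findings (2) and (3) quoted above; the third is a made-up example of an interval that contains 1.

```python
import math

def approx_p_value(odds_ratio, lower, upper, z_crit=1.96):
    """Approximate two-sided p-value backed out from an odds ratio and its 95% CI."""
    # Standard error of the log odds ratio implied by the width of the interval.
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided normal tail probability.
    return math.erfc(abs(z) / math.sqrt(2))

examples = [
    ("3rd trimester organophosphates", 2.0, 1.1, 3.6),
    ("2nd trimester chlorpyrifos", 3.3, 1.5, 7.4),
    ("hypothetical CI containing 1", 1.2, 0.8, 1.8),
]
for label, or_, lo, hi in examples:
    print(label, round(approx_p_value(or_, lo, hi), 4))
```

An interval that excludes 1 maps to a p-value below .05 and one that includes 1 maps to a p-value above .05, which is all the statement above relies on.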


From the paper: “In additional analyses, we evaluated the sensitivity of the estimates to the choice of buffer size, using 4 additional sizes between 1 and 2km: results and interpretation remained stable (data not shown).”

That’s unfortunate too. Given the previous discussion of odds ratios and the lack of empirical support or literature related to using distance as a proxy for exposure, you would think more sensitivity analysis would be merited to show robustness to all of these assumptions, even if, and especially if, there is no precedent in the literature related to distance. This, in combination with the large number of insignificant odds ratios and the selective reporting of the marginally significant results, is probably what fueled accusations of data dredging.

Omitted Controls

From the Paper: “Primarily, our exposure estimation approach does not encompass all potential sources of exposure to each of these compounds: among them external non-agricultural sources (e.g. institutional use, such as around schools); residential indoor use; professional pesticide application in or around the home for gardening, landscaping or other pest control; as well as dietary sources (Morgan 2012).”

So, there are a number of important routes of exposure that were not controlled for, which suggests there may be a good deal of omitted variable bias and unobserved heterogeneity. The point of my post is not to pick apart a study linking pesticides to ASD. There are no perfect data sets and no perfect experimental designs. All studies have weaknesses, and my interpretation of this study certainly has flaws. The point is that while this study has made headlines with some media outlets and seems scary, it is not one that should be used to draw sharp conclusions or to run to your legislator for new regulations.
This reminds me of a quote I have shared here recently:
"Social scientists and policymakers alike seem driven to draw sharp conclusions, even when these can be generated only by imposing much stronger assumptions than can be defended. We need to develop a greater tolerance for ambiguity. We must face up to the fact that we cannot answer all of the questions that we ask." (Manski, 1995)

Manski, C.F. 1995. Identification Problems in the Social Sciences. Cambridge: Harvard University Press.

Neurodevelopmental Disorders and Prenatal Residential Proximity to Agricultural Pesticides: The CHARGE Study
Janie F. Shelton, Estella M. Geraghty, Daniel J. Tancredi, Lora D. Delwiche, Rebecca J. Schmidt, Beate Ritz, Robin L. Hansen, and Irva Hertz-Picciotto
Environmental Health Perspectives.   June 23, 2014

Sunday, June 15, 2014

Big Ag and Big Data | Marc F. Bellemare

A very good post about big data in general and applications in agriculture specifically by Marc Bellemare can be found here:


He clears up a misconception that I've talked about before, where some gainsay big data because it doesn't solve all of the fundamental issues of causal inference.

The promises of big data were never about causal inference. The promise of big data is prediction:

"There is a fundamental difference between estimating causal relationships and forecasting. The former requires a research design in which X is plausibly exogenous to Y. The latter only requires that X include as much stuff as possible."

"When it comes to forecasting, big data is unbeatable. With an ever larger number of observations and variables, it should become very easy to forecast all kinds of things …"
"But when it comes to doing science, big data is dumb. It is only when we think carefully about the research design required to answer the question "Does X cause Y?" that we know which data to collect, and how much of them. The trend in the social sciences over the last 20 years has been toward identifying causal relationships, and away from observational data — big or not."
He goes on to discuss how big data is being leveraged in food production, and shares a point of enthusiasm that I think reveals an important point I have made before regarding the convergence of big data, technology, and genomics:

"This is exactly the kind of innovation that makes me so optimistic about the future of food and that makes me think the neo-Malthusians, just like the Malthusians of old, are wrong."

Saturday, May 31, 2014

Big Data: Causality and Local Expertise Are Key in Agronomic Applications

In a previous post Big Data: Don't throw the baby out with the bathwater, I made the case that in many instances, we aren't concerned with issues related to causality.

"If a 'big data' app tells me that someone is spending 14 hours each week on the treadmill, that might be a useful predictor for their health status. If all I care about is identifying people based on health status I think hrs of physical activity would provide useful info. I might care less if the relationship is causal as long as it is stable....correlations or 'flags' from big data might not 'identify' causal effects, but they are useful for prediction and might point us in directions where we can more rigorously investigate causal relationships"

But sometimes we are interested in causal effects. If that is the case, the article that I reference in the previous post makes a salient point:

"But a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down."

"'Big data' has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever."

I think that may be the case in many agronomic applications of big data. I've written previously about the convergence of big data, genomics, and agriculture. In those cases, when I think about applications like ACRES or Field Scripts, I have algorithmic approaches (finding patterns and correlations) in mind, not necessarily causation.

But Dan Frieberg points out some very important things to think about when it comes to using agronomic data in a Corn and Soybean Digest article, "Data Decisions: Meaningful data analysis involves agronomic common sense, local expertise."

He gives an example where data indicates better yields are associated with faster planting speeds, but something else is really going on:

"Sometimes, a data layer is actually a “surrogate” for another layer that you may not have captured. Planting speed was a surrogate for the condition of the planting bed.  High soil pH as a surrogate for cyst nematode. Correlation to slope could be a surrogate for an eroded area within a soil type or the best part of the field because excess water escaped in a wet year."

He concludes:

"big data analytics is not the crystal ball that removes local context. Rather, the power of big data analytics is handing the crystal ball to advisors that have local context"

This is definitely a case where we might want to look more rigorously at relationships identified by data mining algorithms that may not capture this kind of local context. It may or may not apply to the seed selection algorithms coming to market these days, but as we think about all the data that can potentially be captured through the internet of things (seed choice, planting speed, depth, temperature, moisture, etc.), this could become especially important. This might call for a much more personal service, including data savvy reps to help agronomists and growers get the most from these big data apps or the data that new devices and software tools can collect and aggregate. Data savvy agronomists will need to know the assumptions behind, and the nature of, any predictions, analysis, or data captured by these devices and apps in order to know if surrogate factors like those Dan mentions have been appropriately considered. And agronomists, data savvy or not, will be key in identifying these kinds of issues. Is there an app for that? I don't think there is an automated replacement for this kind of expertise, but as economist Tyler Cowen says, the ability to interface well with technology and use it to augment human expertise and judgment is the key to success in the new digital age of big data and automation.


Big Data…Big Deal? Maybe, if Used with Caution. http://andrewgelman.com/2014/04/27/big-data-big-deal-maybe-used-caution/

See also: Analytics vs. Causal Inference http://econometricsense.blogspot.com/2014/01/analytics-vs-causal-inference.html