Many methods are discussed in the literature for modeling skewed distributions with high mass points, including log transformations, two-part models, and GLMs. In some previous posts I have discussed linear probability models in the context of causal inference. I've also discussed the use of quantile regression as a strategy to model highly skewed continuous and count data. Mullahy (2009) alludes to the use of quantile regression as well:
"Such concerns should translate into empirical strategies that target the high-end parameters of particular interest, e.g. models for Prob(y ≥ k | x) or quantile regression models"
The focus on high-end parameters using linear probability models is mentioned in Angrist and Pischke (2009):
"COP [conditional-on-positive] effects are sometimes motivated by a researcher's sense that when the outcome distribution has a mass point-that is, when it piles up on a particular value, such as zero-or has a heavily skewed distribution, or both, then an analysis of effects on averages misses something. Analysis of effects on averages indeed miss some things, such as changes in the probability of specific values or a shift in quantiles away from the median. But why not look at these distribution effects directly? Distribution outcomes include the likelihood that annual medical expenditures exceed zero, 100 dollars, 200 dollars, and so on. In other words, put 1[Yi > c] for different choices of c on the left hand side of the regression of interest...the idea of looking directly at distribution effects with linear probability models is illustrated by Angrist (2001),...Alternatively, if quantiles provide a focal point, we can use quantile regressions to model them."
References:
Mostly Harmless Econometrics. Angrist and Pischke. 2009
Angrist, J.D. Estimation of Limited Dependent Variable Models With Dummy Endogenous Regressors: Simple Strategies for Empirical Practice. Journal of Business & Economic Statistics January 2001, Vol. 19, No. 1.
Econometric Modeling of Health Care Costs and Expenditures: A Survey of Analytical Issues and Related Policy Considerations. John Mullahy, Univ. of Wisconsin-Madison. January 2009.
An attempt to make sense of econometrics, biostatistics, machine learning, experimental design, bioinformatics, ....
Saturday, June 28, 2014
Friday, June 27, 2014
Is distance a proxy for pesticide exposure and is it related to ASD? Some thoughts...
Recently a paper has made some headlines, and the message getting out seems to be that living near a farm field where pesticides have been applied has been found to increase the risk of autism spectrum disorder (ASD). A few things about the paper. First, one of the things I admire about econometric work is the attempt to make use of some data set, some variable, or some measurement to estimate the effect of some intervention or policy in a world where we can’t always get our hands on the thing we are really trying to measure. The book Freakonomics comes to mind, as do quasi-experimental designs and the use of instrumental variables.
Second, I’m not an epidemiologist or an entomologist, and I don’t have a background in toxicology. My expertise is more focused on statistical methods, so I will comment on the article from that perspective. While the authors could not (or simply did not) actually measure pesticide exposure in any medical or biological sense, they attempted to infer that distance from an agricultural field might correlate well enough to proxy for exposure. That is a large assumption and perhaps one of the greatest challenges of the study. It is not a study of actual exposure, so from this point on I’ll try to refer to exposure only in quotes. But the authors did make clever use of some interesting data sources. They matched required pesticide application reports and report dates with the ZIP codes of the study respondents and their reported pregnancy stages to determine distance from application and the point in pregnancy at which they were ‘exposed’. They reported distance in three bands or buffer zones of 1.25, 1.5, and 1.75 km. This was actually nice work, if distance can be equated to some known level of exposure. Unfortunately, while they cited some other work attempting to tie exposure to ASD, I did not see a citation in the body of the text to any work justifying the use of distance as a proxy, or those particular bands. More on this later. They also attempted to control for a number of confounders, applied survey weighting to ‘weight up’ the effects to reflect the parent population, and, at least based on my reading, may have even tried to control for some level of selection bias by using IPTW regression with SAS.
Discussion of Results
There were at least four major findings in the paper:
(1) Proximity to organophosphates at some point during gestation was associated with a 60% increased risk for ASD,
(2) higher for 3rd trimester exposures [OR = 2.0, 95% confidence interval (CI) = (1.1, 3.6)],
(3) and 2nd trimester chlorpyrifos applications: OR = 3.3 [95% CI = (1.5, 7.4)].
(4) Children of mothers residing near pyrethroid insecticide applications just prior to conception or during 3rd trimester were at greater risk for both ASD and DD, with ORs ranging from 1.7 to 2.3.
So where do we go with these results? First of all, these findings are all based on odds ratios. The reported odds ratio in the first finding above was 1.60, which implies a (1.6 - 1.0) * 100 = 60% increase in the odds of ASD for ‘exposed’ vs. ‘non-exposed’ children. This is an increase in odds, which does not have the same interpretation as an increase in probability (see more about logistic regression and odds ratios here). Some might read the headline and walk away with the wrong idea that living in proximity to farm fields with organophosphate applications constitutes ‘exposure’ to organophosphates and is associated with a 60% increased probability of ASD, but that is stacking one large assumption on top of a misinterpretation.
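The gap between a 60% increase in odds and a 60% increase in probability can be sketched with a little arithmetic. The odds ratio of 1.6 comes from the finding above; the baseline probabilities below are hypothetical values chosen only to show how the conversion depends on the baseline, and the function name is my own.

```python
def risk_from_odds_ratio(baseline_risk, odds_ratio):
    """Convert a baseline probability and an odds ratio into the implied probability."""
    baseline_odds = baseline_risk / (1.0 - baseline_risk)
    exposed_odds = odds_ratio * baseline_odds
    return exposed_odds / (1.0 + exposed_odds)

ODDS_RATIO = 1.6  # reported odds ratio from the first finding

# Hypothetical baseline probabilities, for illustration only.
for baseline in (0.01, 0.10, 0.50):
    exposed = risk_from_odds_ratio(baseline, ODDS_RATIO)
    relative_increase = (exposed / baseline - 1.0) * 100.0
    print(f"baseline {baseline:.2f} -> implied {exposed:.4f} "
          f"({relative_increase:.1f}% relative increase in probability)")
```

The relative increase in probability is always smaller than the increase in odds when the odds ratio is above 1. The two are close only when the outcome is rare.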
However, these findings are but a slice of the full results reported in the paper. Table 3 reports a number of findings across the distance bands, types of pesticide, and pregnancy stages. Recall that an odds ratio of 1 implies no effect. The vast majority of these findings were associated with odds ratios whose 95% confidence intervals contained 1 or came very close to it. For those who like to interpret p-values, a 95% CI for an odds ratio that contains 1 implies that the estimated regression coefficient in the model has a p-value > .05, i.e. a non-significant result.
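The link between a confidence interval and a p-value can be sketched by backing out an approximate Wald test from a reported odds ratio and its 95% CI. This assumes the interval is symmetric on the log-odds scale and ignores rounding in the published numbers, so the results are rough approximations, not the paper's own p-values. The first two inputs are the odds ratios quoted above; the third is a hypothetical interval that contains 1.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def wald_p_value(odds_ratio, ci_lower, ci_upper):
    """Approximate two-sided p-value implied by an odds ratio and its 95% CI."""
    # Standard error of the log odds ratio, recovered from the CI width.
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2.0 * Z_95)
    z = math.log(odds_ratio) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2.0))

examples = {
    "3rd trimester organophosphates (reported)": (2.0, 1.1, 3.6),
    "2nd trimester chlorpyrifos (reported)": (3.3, 1.5, 7.4),
    "hypothetical interval containing 1": (1.2, 0.8, 1.8),
}
for label, (or_, lo, hi) in examples.items():
    print(f"{label}: approximate p = {wald_p_value(or_, lo, hi):.4f}")
```

Intervals that exclude 1 map to p-values below .05, and the hypothetical interval that spans 1 maps to a p-value above .05.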
Another interesting thing about the table is that there doesn’t seem to be any pattern across distance, pregnancy stage, or chemistry in the estimated effects or odds ratios, a point made well in a recent blog post regarding this study at scienceblogs.com here.
Sensitivity
From the paper: “In additional analyses, we evaluated the sensitivity of the estimates
to the choice of buffer size, using 4 additional sizes between 1 and 2km:
results and interpretation remained stable (data not shown).”
That’s unfortunate too. Given the previous discussion of odds ratios and the lack of empirical support or literature for using distance as a proxy for exposure, you would think more sensitivity analysis would be merited to show robustness to these assumptions, especially if there is no precedent in the literature related to distance. This, in combination with the large number of insignificant odds ratios and the selective reporting of the marginally significant results, is probably what fueled accusations of data dredging.
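One way to see why many comparisons with a few marginal results invite the data dredging charge is to ask how often chance alone produces a "significant" odds ratio. The sketch below assumes independent tests with no true effects, which is a simplification. The test counts are hypothetical and are not taken from Table 3.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance of one or more false positives across independent null tests."""
    return 1.0 - (1.0 - alpha) ** num_tests

def expected_false_positives(num_tests, alpha=0.05):
    """Expected number of false positives across null tests."""
    return num_tests * alpha

# Hypothetical numbers of comparisons, for illustration only.
for m in (1, 10, 20, 50):
    print(f"{m:>3} tests: P(at least one false positive) = "
          f"{prob_at_least_one_false_positive(m):.3f}, "
          f"expected false positives = {expected_false_positives(m):.2f}")
```

With dozens of distance, chemistry, and pregnancy stage combinations, a handful of intervals that barely exclude 1 is not surprising on its own.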
Omitted Controls
From the Paper: “Primarily, our exposure estimation approach does not encompass all
potential sources of exposure to each of these compounds: among them external
non-agricultural sources (e.g. institutional use, such as around schools);
residential indoor use; professional pesticide application in or around the
home for gardening, landscaping or other pest control; as well as dietary
sources (Morgan 2012).”
So, there are a number of important routes of exposure that were not controlled for, which suggests there may be a good deal of omitted variable bias and unobserved heterogeneity. The point of my post is not to pick apart a study linking pesticides to ASD. There are no perfect data sets and no perfect experimental designs. All studies have weaknesses, and my interpretation of this study certainly has flaws. The point is that while this study has made headlines with some media outlets and seems scary, it is not one that should be used to draw sharp conclusions or to run to your legislator for new regulations.
This reminds me of a quote
I have shared here recently:
"Social scientists and policymakers alike seem driven to draw
sharp conclusions, even when these can be generated only by imposing much
stronger assumptions than can be defended. We need to develop a greater
tolerance for ambiguity. We must face up to the fact that we cannot answer all
of the questions that we ask." (Manski, 1995)
References:
Manski, C.F. 1995. Identification Problems in the Social Sciences. Cambridge: Harvard University Press.
Neurodevelopmental Disorders and Prenatal Residential Proximity to Agricultural Pesticides: The CHARGE Study. Janie F. Shelton, Estella M. Geraghty, Daniel J. Tancredi, Lora D. Delwiche, Rebecca J. Schmidt, Beate Ritz, Robin L. Hansen, and Irva Hertz-Picciotto. Environmental Health Perspectives. June 23, 2014.
Sunday, June 15, 2014
Big Ag and Big Data | Marc F. Bellemare
A very good post about big data in general and applications in agriculture specifically by Marc Bellemare can be found here:
http://marcfbellemare.com/wordpress/2014/06/big-ag-and-big-data/#comment-40620
He clears up a misconception that I've talked about before, where some gainsay big data because it doesn't solve all of the fundamental issues of causal inference.
The promises of big data were never about causal inference. The promise of big data is prediction:
"There is a fundamental difference between estimating causal relationships and forecasting. The former requires a research design in which X is plausibly exogenous to Y. The latter only requires that X include as much stuff as possible."
"When it comes to forecasting, big data is unbeatable. With an ever larger number of observations and variables, it should become very easy to forecast all kinds of things …"
"But when it comes to doing science, big data is dumb. It is only when we think carefully about the research design required to answer the question "Does X cause Y?" that we know which data to collect, and how much of them. The trend in the social sciences over the last 20 years has been toward identifying causal relationships, and away from observational data — big or not."
He goes on to discuss how big data is being leveraged in food production, and shares a point of enthusiasm that I think reveals an important point I have made before regarding the convergence of big data, technology, and genomics:
"This is exactly the kind of innovation that makes me so optimistic about the future of food and that makes me think the neo-Malthusians, just like the Malthusians of old, are wrong."