Saturday, November 28, 2015

Econometrics, Multiple Testing, and Researcher Degrees of Freedom

Some have criticized econometrics courses for giving too much emphasis to things like heteroskedasticity, multicollinearity, and clinical concerns about linearity, perhaps at the expense of more important concerns related to causality and prediction.

On the other hand, the experimental design courses I took in graduate school provided a treatment of multiple testing; things like Bonferroni adjustments in an analysis of variance setting. And in a non-inferential, predictive modeling context, Bonferroni and Kass adjustments are key to some decision tree implementations I have worked with. But not so much in a lot of econometrics work that I have seen.
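For reference, the mechanics of a Bonferroni adjustment are simple enough to sketch in a few lines of Python (the p-values below are purely hypothetical): with m tests and a familywise error rate of alpha, each individual test is evaluated at alpha/m.

```python
# Bonferroni adjustment: with m tests at familywise level alpha,
# each individual test is evaluated at alpha / m.
alpha = 0.05
p_values = [0.001, 0.012, 0.043, 0.30]   # hypothetical p-values from m = 4 tests
m = len(p_values)

reject = [p < alpha / m for p in p_values]      # compare each p to 0.0125, not 0.05
adjusted = [min(p * m, 1.0) for p in p_values]  # equivalent "adjusted" p-values, capped at 1
```

Note that the third test (p = 0.043) would have been "significant" at the usual 0.05 level but does not survive the adjustment.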

Why the gap in emphasis on multiple testing? Probably because a lot of what I have read (or work that I have done) involves regressions with binary treatment indicators. The emphasis is almost entirely on a single test of significance related to the estimated regression coefficient...or so it would seem. More on this later. 

But I have spent more and more time in the last couple of years in the literature related to epidemiology, health, and wellness research. In one particular article, the authors noted, "Because of the exploratory character of the study, no adjustments for multiple hypotheses testing were performed" (Bender et al., 2002). They cited an earlier article (Bender and Lange, 2001) that drew a distinction between multiple testing adjustments for confirmatory inferential studies and what might be characterized as more exploratory work.

"Exploratory studies frequently require a flexible approach for design and analysis. The choice and the number of tested hypotheses may be data dependent, which means that multiple significance tests can be used only for descriptive purposes but not for decision making, regardless of whether multiplicity corrections are performed or not. As the number of tests in such studies is frequently large and usually a clear structure in the multiple tests is missing, an appropriate multiple test adjustment is difficult or even impossible. Hence, we prefer that data of exploratory studies be analyzed without multiplicity adjustment. “Significant” results based upon exploratory analyses should clearly be labeled as exploratory results. To confirm these results the corresponding hypotheses have to be tested in further confirmatory studies."

They certainly follow their own advice in the 2001 paper. De Groot provides some great context for the distinction between confirmatory and exploratory analysis, describing exploratory analysis as follows:

"the material has not been obtained specifically and has not been processed specifically as concerns the testing of one or more hypotheses that have been precisely postulated in advance. Instead, the attitude of the researcher is: “This is interesting material; let us see what we can find.” With this attitude one tries to trace associations (e.g., validities); possible differences between subgroups, and the like. The general intention, i.e. the research topic, was probably determined beforehand, but applicable processing steps are in many respects subject to ad- hoc decisions. Perhaps qualitative data are judged, categorized, coded, and perhaps scaled; differences between classes are decided upon “as suitable as possible”; perhaps different scoring methods are tried along-side each other; and also the selection of the associations that are researched and tested for significance happens partly ad-hoc, depending on whether “something appears to be there”, connected to the interpretation or extension of data that have already been processed."

" does not so much serve the testing of hypotheses as it serves hypothesis-generation, perhaps theory-generation — or perhaps only the interpretation of the available material itself."

Gelman gets at this in his discussion of multiple testing and researcher degrees of freedom (see the Garden of Forking Paths). But the progress of science might not be possible without some flavor of multiple testing, and tying your hands with strict and clinical adjustment processes might hinder important work.

"At the same time, we do not want demands of statistical purity to strait-jacket our science. The most valuable statistical analyses often arise only after an iterative process involving the data" (see, e.g., Tukey, 1980, and Box, 1997).

What Gelman addresses in this paper goes beyond a basic discussion of failing to account for multiple comparisons or even multiple hypotheses:

"What we are suggesting is that, given a particular data set, it is not so difficult to look at the data and construct completely reasonable rules for data exclusion, coding, and data analysis that can lead to statistical significance—thus, the researcher needs only perform one test, but that test is conditional on the data…Whatever data-cleaning and analysis choices were made, contingent on the data, would seem to the researchers as the single choice derived from their substantive research hypotheses. They would feel no sense of choice or “fishing” or “p-hacking”—even though different data could have led to different choices, each step of the analysis…To put it another way, we view these papers—despite their statistically significant p-values—as exploratory, and when we look at exploratory results we must be aware of their uncertainty and fragility."

This is starting to sound familiar to me. Looping back to a discussion about applied econometrics, this reminds me a lot of the EconTalk podcast discussion between Russ Roberts and Ed Leamer. They discuss something very similar to what I think Gelman is getting at. They suggest that a lot of empirical work has a very exploratory flavor that needs to be admitted. Leamer recognized this a long time ago in his essay about taking the con out of econometrics.

"What is hidden from us as the readers and is the unspoken secret Leamer is referring to in his 1983 article, is that we don't get to go in the kitchen with the researcher. We don't see all the different regressions that were done before the chart was finished. The chart was presented as objective science. But those of us who have been in the kitchen--you don't just sit down and say you think these are the variables that count and this is the statistical relationship between them, do the analysis and then publish it. You convince yourself rather easily that you must have had the wrong specification--you left out a variable or included one you shouldn't have included. Or you should have added a squared term to allow for a nonlinear relationship. Until eventually, you craft, sculpt a piece of work that is a conclusion; and you publish that. You show that there is a relationship between A and B, x and y. Leamer's point is that if you haven't shown me all the steps in the kitchen, I don't really know whether what you found is robust. "

Going back to Gelman's garden of forking paths, he also seems to suggest that the solution is in fact to show all of the steps in the kitchen, or make sure the dish can be replicated:

"external validation which is popular in statistics and computer science. The idea is to perform two experiments, the first being exploratory but still theory-based, and the second being purely confirmatory with its own preregistered protocol."

So, in econometrics, even if all I am after is a single estimate of a given regression coefficient, multiple testing and researcher degrees of freedom may actually become quite a relevant concern, despite the minimal treatment in many econometrics courses, textbooks, and literature. Since Leamer's article, and the credibility revolution, sensitivity analysis and careful identification have certainly become more prevalent in a lot of empirical work. Showing all the steps in the kitchen, providing external validation, and/or explicitly recognizing the exploratory nature of your work (as in Bender et al., 2002) appear to be the best ways of dealing with this. But it's not yet true in every case, and this reveals a fragility in a lot of empirical work that prudence would require us to view with a critical eye when it comes to important policy papers.

See also:

In God We Trust, All Others Show Me Your Code

Pinning p-values to the wall


Am J Epidemiol. 2002 Aug 1;156(3):239-45.
Body weight, blood pressure, and mortality in a cohort of obese patients.
Bender R, Jöckel KH, Richter B, Spraul M, Berger M.

J Clin Epidemiol. 2001 Apr;54(4):343-9.
Adjusting for multiple testing--when and how?
Bender R, Lange S.

The Meaning of "Significance" for Different Types of Research. A.D. de Groot. 1956.

The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Andrew Gelman and Eric Loken.

"Let's Take the 'Con' Out of Econometrics," by Ed Leamer. The American Economic Review, Vol. 73, Issue 1, (Mar. 1983), pp. 31-43. 

"The Credibility Revolution in Empirical Economics: How Better Research Design is Taking the Con out of Econometrics," by Joshua Angrist and Jörn-Steffen Pischke. NBER Working Paper No. 15794, Mar. 2010.

Wednesday, November 11, 2015

Directed Acyclic Graphs (DAGs) and Instrumental Variables

Previously I discussed several of the most useful descriptions of instrumental variables that I have encountered in various sources. I was recently reviewing some of Lawlor's work related to Mendelian instruments and realized it was the first place I had seen the explicit use of directed acyclic graphs to describe how instrumental variables work.

In describing the application of Mendelian instruments, Lawlor et al present instrumental variables with the aid of directed acyclic graphs. They describe an instrumental variable (Z) as depicted above in the following way, based on three major assumptions:

(1) Z is associated with the treatment or exposure of interest (X)
(2) Z is independent of the unobserved confounding factors (U) that impact both X and the outcome of interest (Y)
(3) Z is independent of the outcome of interest Y given X and U (i.e., the 'exclusion restriction': Z impacts Y only through X)

Our instrumental variable estimate, βIV, is the ratio Cov(Z,Y)/Cov(Z,X) (equivalently, for a binary instrument, the ratio of the difference in E[Y|Z] to the difference in E[X|Z]), which can be estimated by two-stage least squares:

X = γ0 + γ1 Z + e1, with fitted values X* = γ̂0 + γ̂1 Z
Y = β0 + βIV X* + e2

The first regression captures only the variation in our treatment or exposure of interest that is related to Z, leaving all the variation related to U in the residual term. The second regression estimates βIV, retaining only the 'quasi-experimental' variation in X related to the instrument Z.
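A minimal simulation sketch (a hypothetical data-generating process, using numpy) illustrates how the two stages recover the causal effect even when U confounds both X and Y:

```python
import numpy as np

# Hypothetical data-generating process: U confounds X and Y; Z is a valid instrument.
rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                       # instrument
u = rng.normal(size=n)                       # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)         # treatment, driven by Z and U
y = 2.0 * x + 3.0 * u + rng.normal(size=n)   # true causal effect of X on Y is 2.0

cov = lambda a, b: np.cov(a, b)[0, 1]

# Naive OLS of Y on X is biased upward because of U
beta_ols = cov(x, y) / np.var(x, ddof=1)

# Two-stage least squares: regress X on Z, then Y on the first-stage fitted values
x_fit = z * cov(z, x) / np.var(z, ddof=1)    # fitted values (variables are approximately mean-zero)
beta_iv = cov(x_fit, y) / np.var(x_fit, ddof=1)

# Equivalently, the ratio form: Cov(Z, Y) / Cov(Z, X)
beta_ratio = cov(z, y) / cov(z, x)
```

Here beta_ols lands well above the true effect of 2.0, while beta_iv and beta_ratio agree (they are algebraically identical) and land near 2.0.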

Stat Med. 2008 Apr 15;27(8):1133-63. Mendelian randomization: using genes as instruments for making causal inferences in epidemiology. Lawlor DA, Harbord RM, Sterne JA, Timpson N, Davey Smith G.

Causal diagrams for empirical research. Judea Pearl. Biometrika (1995), 82(4), pp. 669-710.

Wednesday, November 4, 2015

Instrumental Explanations of Instrumental Variables

I have recently discussed Marc Bellemare's 'Metrics Monday posts, but he has written many more applied econometrics posts that are really good. Another example is his post, Identifying Causal Relationships vs. Ruling Out All Other Possible Causes. The post is not about instrumental variables per se, but in it he describes IVs in this way:

"As many readers of this blog know, disentangling causal relationships from mere correlations is the goal of modern science, social or otherwise, and though it is easy to test whether two variables x and y are correlated, it is much more difficult to determine whether x causes y. So while it is easy to test whether increases in the level of food prices are correlated with episodes of social unrest, it is much more difficult to determine whether food prices cause social unrest."

"In my work, I try to do so by conditioning food prices on natural disasters. To make a long story short, if you believe that natural disasters only affect social unrest through food prices, this ensures that if there is a relationship between food prices and social unrest, that relationship is cleaned out of whatever variation which is not purely due to the relationship flowing from food prices to social unrest. In other words, this ensures that the estimated relationship between the two variables is causal. This technique is known as instrumental variables estimation."

The idea of 'cleaning' out the bias or endogeneity is consistent with how I tried to build intuition for IVs before, depicting an instrumental variable as a 'filter' that picks up only variation in the treatment (CAMP) unrelated to an omitted variable (INDEX) or selection bias.

"A very non-technical way to think about this is that we are taking Z and going through CAMP to get to Y, and bringing with us only those aspects of CAMP that are unrelated to INDEX.  Z is like a filter that picks up only the variation in CAMP (what we may refer to as ‘quasi-experimental variation’) that we are interested in and filters out the noise from INDEX.  Z is technically related to Y only through CAMP."

Z → CAMP → Y

 (you can read the full post for more context)

See also:

Below are some more examples of discussions and descriptions of instrumental variables that have been the most beneficial to my understanding:

Kennedy: “The general idea behind this estimation procedure is that it takes the variation in the explanatory variable that matches up with variation in the instrument (and so is uncorrelated with the error), and uses only this variation to compute the slope estimate. This in effect circumvents the correlation between the error and the troublesome variable, and so avoids the asymptotic bias”

Mastering Metrics: “The instrumental variables (IV) method harnesses partial or incomplete random assignment, whether naturally occurring or generated by researchers…"

“The IV method uses these three assumptions to characterize a chain reaction leading from the instrument to student achievement. The first link in this causal chain-the first stage-connects randomly assigned offers with KIPP attendance, while the second link-the one we’re after-connects KIPP attendance with achievement.”

Dr. Andrew Gelman with comments from Hal Varian: How to think about instrumental variables when you get confused

“Suppose z is your instrument, T is your treatment, and y is your outcome. So the causal model is z -> T -> y… when I get stuck, I find it extremely helpful to go back and see what I've learned from separately thinking about the correlation of z with T, and the correlation of z with y. Since that's ultimately what instrumental variables analysis is doing.”

"You have to assume that the only way that z affects Y is through the treatment, T. So the IV model is
T = az + e
y = bT + d

It follows that
E(y|z) = b E(T|z) + E(d|z)
Now if we
1) assume E(d|z) = 0
2) verify that E(T|z) != 0
we can solve for b by division. Of course, assumption 1 is untestable.
An extreme case is a purely randomized experiment, where e=0 and z is a coin flip."
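Varian's algebra is easy to check with a quick simulation (a hypothetical setup: z is a coin flip as in his extreme case, though here the first-stage error e is kept nonzero). Solving for b "by division" amounts to contrasting E(y|z) against E(T|z) across the two arms:

```python
import numpy as np

# Hypothetical setup: z is a randomized coin flip, d is independent of z.
rng = np.random.default_rng(1)
n = 200_000
z = rng.integers(0, 2, size=n)           # instrument: coin flip
d = rng.normal(size=n)                   # structural error, E(d|z) = 0
t = 1.5 * z + rng.normal(size=n)         # treatment: T = a*z + e, with a = 1.5
y = 4.0 * t + d                          # outcome: y = b*T + d, true b = 4.0

# Since E(y|z) = b * E(T|z) + E(d|z) and E(d|z) = 0,
# b is recovered by dividing the contrast in y by the contrast in T:
b_hat = (y[z == 1].mean() - y[z == 0].mean()) / (t[z == 1].mean() - t[z == 0].mean())
```

With z randomized, assumption (1) holds by design and b_hat lands near the true value of 4.0.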

A Guide to Econometrics. Peter Kennedy.
Mastering 'Metrics. Joshua Angrist and Jörn-Steffen Pischke 

'Big' Data vs. 'Clean' Data

I've previously written about the importance of data cleaning, and recently I was reading a post, Data Science Can Transform Agriculture, If We Get It Right, on FarmLink's blog, and I was impressed by the following:

"We believe it's a transformational time for this industry – call it Ag 3.0 – when the combination of human know-how and insight, coupled with robust data science and analytics will change the productivity, profitability and sustainability of agriculture."

This reminds me, as I have discussed before in relation to big data in agriculture, of economist Tyler Cowen's comments on an EconTalk podcast: "the ability to interface well with technology and use it to augment human expertise and judgement is the key to success in the new digital age of big data and automation."

 But in relation to data cleaning, I thought this was really impressive:

"...we disqualified more than two-thirds of the data collected during our first year. Now, we inspect each combine before use following a 50 point check list to identify any problem that could affect accuracy of collection, have developed a world class Quality Assurance process to test the data, and created IP addressable access to our combines to be able to identify and compensate for operator error. As a result, last year over 95% of collected data met our standard for being actionable. Admittedly, our first year data was “big.” But we chose to view it as largely worthless to our customers, just as much of the data being collected through farmer exchanges, open APIs, or memory sticks for example will be. It simply lacks the rigor to justify use in such important undertakings."

It sometimes takes patience and discipline to make the necessary sacrifices and put the necessary resources into data quality, and it looks like this company gets it. Data cleaning isn't just academic. It's serious. Maybe it's time to replace #BigData with #CleanData.
See also:

Big Data
Data Cleaning
Got Data? Probably not like your econometrics textbook!
Big Ag Meets Big Data (Part 1 & Part 2)
Big Data- Causality and Local Expertise are Key in Agronomic Applications
Big Ag and Big Data-Marc Bellemare
Big Data, IoT, Ag Finance, and Causal Inference
In God we trust, all others show me your code.
Data Science, 10% inspiration, 90% perspiration