Saturday, November 28, 2015

Econometrics, Multiple Testing, and Researcher Degrees of Freedom

Some have argued that econometrics courses often give too much emphasis to things like heteroskedasticity, multicollinearity, or clinical concerns about linearity, maybe even at the expense of more important concerns related to causality and prediction.

On the other hand, the experimental design courses I took in graduate school provided a treatment of multiple testing: things like Bonferroni adjustments in an analysis of variance setting. And in a non-inferential, predictive modeling context, Bonferroni and Kass adjustments are key in some of the decision tree models I have implemented. But I have not seen them get nearly as much attention in econometrics work.
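As a reminder of what the simplest of these adjustments does, here is a minimal sketch in Python. The p-values are made up for illustration, and the family-wise error rate formula assumes the tests are independent, which real comparisons often are not.

```python
def family_wise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive across m independent tests
    when every null hypothesis is true (independence is an assumption)."""
    return 1 - (1 - alpha) ** m


def bonferroni(p_values, alpha=0.05):
    """Bonferroni adjustment: multiply each p-value by the number of tests,
    cap at 1, and reject only if the adjusted value is at most alpha."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, reject


# Hypothetical p-values from four comparisons (illustration only)
p_values = [0.004, 0.020, 0.030, 0.400]
adjusted, reject = bonferroni(p_values)

print("unadjusted FWER for 4 tests:", round(family_wise_error_rate(4), 4))
print("adjusted p-values:", [round(p, 3) for p in adjusted])
print("reject:", reject)
```

Three of the four raw p-values fall below 0.05, but only the smallest survives the adjustment.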

Why the gap in emphasis on multiple testing? Probably because a lot of what I have read (or work that I have done) involves regressions with binary treatment indicators. The emphasis is almost entirely on a single test of significance related to the estimated regression coefficient...or so it would seem. More on this later. 

But I have spent more and more time in the last couple of years in the literature related to epidemiology, health, and wellness research. In one particular article, the authors noted, "Because of the exploratory character of the study, no adjustments for multiple hypotheses testing were performed." (Bender et al., 2002). They cited an earlier article (Bender and Lange, 2001), which distinguishes between multiple testing adjustments for inferential, confirmatory studies and what might be characterized as more exploratory work.

"Exploratory studies frequently require a flexible approach for design and analysis. The choice and the number of tested hypotheses may be data dependent, which means that multiple significance tests can be used only for descriptive purposes but not for decision making, regardless of whether multiplicity corrections are performed or not. As the number of tests in such studies is frequently large and usually a clear structure in the multiple tests is missing, an appropriate multiple test adjustment is difficult or even impossible. Hence, we prefer that data of exploratory studies be analyzed without multiplicity adjustment. “Significant” results based upon exploratory analyses should clearly be labeled as exploratory results. To confirm these results the corresponding hypotheses have to be tested in further confirmatory studies."

They certainly follow their own advice from the 2001 paper. De Groot provides some great context around the distinction between confirmatory and exploratory analysis. De Groot describes exploratory analysis as follows:

"the material has not been obtained specifically and has not been processed specifically as concerns the testing of one or more hypotheses that have been precisely postulated in advance. Instead, the attitude of the researcher is: “This is interesting material; let us see what we can find.” With this attitude one tries to trace associations (e.g., validities); possible differences between subgroups, and the like. The general intention, i.e. the research topic, was probably determined beforehand, but applicable processing steps are in many respects subject to ad- hoc decisions. Perhaps qualitative data are judged, categorized, coded, and perhaps scaled; differences between classes are decided upon “as suitable as possible”; perhaps different scoring methods are tried along-side each other; and also the selection of the associations that are researched and tested for significance happens partly ad-hoc, depending on whether “something appears to be there”, connected to the interpretation or extension of data that have already been processed."

" does not so much serve the testing of hypotheses as it serves hypothesis-generation, perhaps theory-generation — or perhaps only the interpretation of the available material itself."

Gelman gets at this in his discussion of multiple testing and researcher degrees of freedom (see The Garden of Forking Paths). But the progress of science might not be possible without some flavor of multiple testing, and tying your hands with strict and clinical adjustment processes might hinder important work.

"At the same time, we do not want demands of statistical purity to strait-jacket our science. The most valuable statistical analyses often arise only after an iterative process involving the data" (see, e.g., Tukey, 1980, and Box, 1997).

What Gelman addresses in this paper goes beyond a basic discussion of failing to account for multiple comparisons or even multiple hypotheses:

"What we are suggesting is that, given a particular data set, it is not so difficult to look at the data and construct completely reasonable rules for data exclusion, coding, and data analysis that can lead to statistical significance—thus, the researcher needs only perform one test, but that test is conditional on the data…Whatever data-cleaning and analysis choices were made, contingent on the data, would seem to the researchers as the single choice derived from their substantive research hypotheses. They would feel no sense of choice or “fishing” or “p-hacking”—even though different data could have led to different choices, each step of the put it another way, we view these papers—despite their statistically significant p-values—as exploratory, and when we look at exploratory results we must be aware of their uncertainty and fragility."

This is starting to sound familiar to me. Looping back to a discussion about applied econometrics, this reminds me a lot of the EconTalk podcast discussion between Russ Roberts and Ed Leamer. They discuss something very similar to what I think Gelman is getting at. They suggest that a lot of empirical work has a very exploratory flavor to it that needs to be admitted. Leamer recognized this a long time ago in his essay about taking the con out of econometrics.

"What is hidden from us as the readers and is the unspoken secret Leamer is referring to in his 1983 article, is that we don't get to go in the kitchen with the researcher. We don't see all the different regressions that were done before the chart was finished. The chart was presented as objective science. But those of us who have been in the kitchen--you don't just sit down and say you think these are the variables that count and this is the statistical relationship between them, do the analysis and then publish it. You convince yourself rather easily that you must have had the wrong specification--you left out a variable or included one you shouldn't have included. Or you should have added a squared term to allow for a nonlinear relationship. Until eventually, you craft, sculpt a piece of work that is a conclusion; and you publish that. You show that there is a relationship between A and B, x and y. Leamer's point is that if you haven't shown me all the steps in the kitchen, I don't really know whether what you found is robust. "

Going back to Gelman's garden of forking paths, he also seems to suggest that the solution is in fact to show all of the steps in the kitchen, or to make sure that the dish can be replicated:

"external validation which is popular in statistics and computer science. The idea is to perform two experiments, the first being exploratory but still theory-based, and the second being purely confirmatory with its own preregistered protocol."

So, in econometrics, even if all I am after is a single estimate of a given regression coefficient, multiple testing and researcher degrees of freedom may actually become quite a relevant concern, despite the minimal treatment in many econometrics courses, textbooks, and literature. Since Leamer's article and the credibility revolution, sensitivity analysis and careful identification have certainly become more prevalent in empirical work. Showing all the steps in the kitchen, providing external validation, and/or explicitly recognizing the exploratory nature of your work (as in Bender et al., 2002) appear to be the best ways of dealing with this. But it's not yet done in every case, which reveals a fragility in a lot of empirical work. Prudence requires us to view such work with a critical eye, especially when it comes to important policy papers.

See also:

In God We Trust, All Others Show Me Your Code

Pinning p-values to the wall


Am J Epidemiol. 2002 Aug 1;156(3):239-45.
Body weight, blood pressure, and mortality in a cohort of obese patients.
Bender R, Jöckel KH, Richter B, Spraul M, Berger M.

J Clin Epidemiol. 2001 Apr;54(4):343-9.
Adjusting for multiple testing--when and how?
Bender R, Lange S.

The Meaning of "Significance" for Different Types of Research. A.D. de Groot. 1956.

The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time (Andrew Gelman and Eric Loken)

"Let's Take the 'Con' Out of Econometrics," by Ed Leamer. The American Economic Review, Vol. 73, Issue 1, (Mar. 1983), pp. 31-43. 

"The Credibility Revolution in Empirical Economics: How Better Research Design is Taking the Con out of Econometrics," by Joshua Angrist and Jörn-Steffen Pischke. NBER Working Paper No. 15794, Mar. 2010.
