Saturday, March 30, 2019

Abandoning Statistical Significance - Or - Two Ways to Sell Snake Oil

There was recently a very good article in Nature pushing back against dichotomizing thresholds for p-values (i.e., p < .05). This follows the ASA's statement on the interpretation of p-values.

I've blogged before about previous efforts to push back against p-values and about proposals to focus on confidence intervals (which often just reframe the problem in other ways that get misinterpreted; see here, here, here, and here). And absolutely, there are problems with p-hacking, failures to account for multiple comparisons and multiple testing, and gardens of forking paths.

The authors in the Nature article state:

"We are not calling for a ban on P values. Nor are we saying they cannot be used as a decision criterion in certain specialized applications (such as determining whether a manufacturing process meets some quality-control standard). And we are also not advocating for an anything-goes situation, in which weak evidence suddenly becomes credible. Rather, and in line with many others over the decades, we are calling for a stop to the use of P values in the conventional, dichotomous way — to decide whether a result refutes or supports a scientific hypothesis"

What worries me is that some readers won't read this with the careful thought and attention it deserves.

Andrew Gelman, one of those signing the letter, seems to have had some similar concerns, which he notes in his post “Retire Statistical Significance”: The discussion. In that post he shares a number of statements from the article and discusses how they could be misleading. We have to remember that statistics and inference can be hard. It's hard for PhDs who have spent their entire lives doing this stuff. It's hard for practitioners who have made their careers out of it. So it is important to consider the ways these statements could be interpreted by others who are not as skilled in inference and experimental design as the authors and signatories.

Gelman states:

"the Comment is written with an undercurrent belief that there are zillions of true, important effects out there that we erroneously dismiss. The main problem is quite the opposite: there are zillions of nonsense claims of associations and effects that once they are published, they are very difficult to get rid of. The proposed approach will make people who have tried to cheat with massaging statistics very happy, since now they would not have to worry at all about statistics. Any results can be spun to fit their narrative. Getting entirely rid of statistical significance and preset, carefully considered thresholds has the potential of making nonsense irrefutable and invincible."

In addition Gelman says:

"statistical analysis at least has some objectivity and if the rules are carefully set before the data are collected and the analysis is run, then statistical guidance based on some thresholds (p-values, Bayes factors, FDR, or other) can be useful. Otherwise statistical inference is becoming also entirely post hoc and subjective"


An Illustration: Two Ways to Sell Snake Oil

So let me propose a fable. Suppose there is a salesman with an elixir he claims is a miracle breakthrough for weight loss. Suppose he has lots and lots of data, large sample sizes, and randomized controlled trials supporting its effectiveness. In fact, in all of his studies he finds that, on average, consumers using the elixir lose weight, with highly statistically significant results (p < .001). Ignoring effect sizes (i.e., how much weight do people actually lose on average?), the salesman touts the precision of the results and sells lots and lots of elixir based on the significance of the findings.

If the salesman were willing to confess that, while the estimated effects of taking the elixir are very precise, what is being precisely measured is an average loss of only about 1.5 pounds per year compared to controls, it would destroy his sales pitch!
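To see how the first half of the fable can happen, here is a minimal simulation sketch in Python (the per-arm sample size, the standard deviation of 15 pounds, and the seed are made-up assumptions for illustration, not the salesman's data): with a large enough sample, even a trivial true effect of about 1.5 pounds per year yields a very small p-value.

```python
# Sketch: a huge trial with a tiny true effect still gives p < .001.
# All numbers here (n, sd, seed) are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n = 50_000    # hypothetical subjects per arm
sd = 15.0     # assumed std dev of annual weight loss (lbs)

control = rng.normal(0.0, sd, n)   # pounds lost per year with no treatment
elixir = rng.normal(1.5, sd, n)    # true effect: ~1.5 lbs more loss per year

t_stat, p_value = stats.ttest_ind(elixir, control)
print(f"mean difference: {elixir.mean() - control.mean():.2f} lbs")
print(f"p-value: {p_value:.2e}")   # extremely small despite a trivial effect
```

The precision is real; the practical importance is not.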

So now the salesman reads our favorite article in Nature. He conducts a number of additional trials. This time he's going to focus only on the effect sizes from the studies, and perhaps he goes with smaller sample sizes. After all, R&D is expensive! Looking only at effect sizes, he knows that a directional finding of 1.5 pounds per year isn't going to sell. So how large does the effect need to be to take his snake oil to market with data to support it? Is 2 pounds convincing? Or 3, 4, 5, even 10? Suppose his data show an average annual weight loss of nearly 10 pounds greater for those using the elixir than for a control group. He goes to market with this claim. As he is making a pitch to a crowd of potential buyers, one savvy consumer presses him with a critical question: were his results statistically significant? The salesman, having read our favorite Nature article, replies that mainstream science these days is more concerned with effect sizes than with dichotomous notions of statistical significance. To the crowd this sounds like a sophisticated and informed answer, so that day he sells his entire stock.

Eventually someone uncovers the actual research related to the elixir. They find that yes, on average most of those studies found an effect of about 10 pounds of annual weight loss. But the p-values associated with these estimates ranged from .25 to .40. What does this mean?

P-values tell us the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

Simplifying, we could say: if the elixir really is snake oil (i.e., it has zero effect), a p-value of .25 tells us there is a 25% probability of observing an average weight loss of 10 pounds or more relative to controls just from chance variation. In other words, a difference that large could plausibly show up even if the elixir did nothing at all.
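A quick simulation sketch makes this concrete (the per-arm sample size of 10 and the standard deviation of 33 pounds are numbers I picked so the arithmetic roughly matches the fable; they come from me, not from the article or the ASA statement):

```python
# Sketch: when the elixir truly does nothing, small noisy studies still show
# an apparent advantage of 10+ lbs roughly a quarter of the time.
# All numbers here (n, sd, seed) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n = 10          # hypothetical subjects per arm (a small study)
sd = 33.0       # assumed std dev of annual weight loss (lbs)
studies = 100_000

# per-study average pounds lost; both arms share the same zero-effect distribution
control = rng.normal(0.0, sd, (studies, n)).mean(axis=1)
elixir = rng.normal(0.0, sd, (studies, n)).mean(axis=1)

frac = np.mean(elixir - control >= 10.0)
print(f"share of null studies showing a 10+ lb advantage: {frac:.2f}")  # ~0.25
```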

A p-value of .25 doesn't necessarily mean that the elixir is ineffective. That is sort of the point of the article in Nature. It just means that the evidence for rejecting the null hypothesis of zero effect is weak.

What if, instead of selling elixir, the salesman were taking bets with a two-headed coin? How would we catch him in the act? Suppose he flipped the coin twice and got two heads in a row (and we just lost $100 at $50 per flip). If we considered only the observed outcomes, and knew nothing about the distribution of coin flips (and completely ignored intuition), we might think this is evidence of cheating. After all, two heads in a row would be consistent with a two-headed coin. But I wouldn't be dialing my lawyer yet.

If we consider the simple probability distribution associated with tossing a fair coin, we know that there is a 50% chance of flipping a normal coin once and getting heads, and a 25% chance of flipping a normal coin twice and getting two heads in a row. This is roughly analogous to a p-value equal to .25. In other words, there is a good chance that even if our con artist were using a fair coin, he could in fact flip two heads in a row. This does not mean he is innocent; it just means that when we consider the distribution, variation, and probabilities associated with flipping coins, the evidence just isn't that convincing. We might say that our observed data are compatible with the null hypothesis that the coin is fair. We could say the same thing about the evidence from our fable about weight loss, or about any study with a p-value equal to .25.

What if our snake oil salesman flipped his coin 4 times and got 4 heads in a row? The probability of 4 heads in a row is 6.25% if he has a fair coin. What about 5? Under the null hypothesis of a 'fair' coin, the probability of observing an event as extreme as 5 heads in a row is 3.125%. Do we think our salesman could be that lucky? Many people would have their doubts. Once we get past whatever threshold is required to start doubting the null hypothesis, we intuitively begin to feel comfortable rejecting it. As the article in Nature argues, this cutoff should not necessarily be 5% or p < .05. However, in this example the probabilities are analogous to p-values of .0625 and .03125, which are in the vicinity of our traditional threshold of .05. I don't think reading the article in Nature should change your mind about that.
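The arithmetic behind those numbers is just 0.5 raised to the number of consecutive heads; here is the same calculation spelled out:

```python
# Probability of k heads in a row with a fair coin -- the rough analog of a
# one-sided p-value against the "two-headed coin" hypothesis.
for k in (2, 4, 5):
    print(f"{k} heads in a row: {0.5 ** k:.5f}")   # 0.25000, 0.06250, 0.03125
```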

Conclusion

We see from our fable that the pendulum could swing too far in either direction, leading to abusive behavior and questionable conclusions. Economist Noah Smith discussed the pushback against p-values a few years ago. He stated rightly that 'if people are doing science right, these problems won't matter in the long run.' Focusing only on effect size while ignoring distribution, variation, and uncertainty risks backsliding from the science that revolutionized the 20th century into the world of anecdotal evidence. Clearly the authors and signatories of the Nature article are not advocating this, as they stated in the excerpts I shared above. It is how this article gets interpreted and cited that matters most. As Gelman states:

"some rule is needed for the game to be fair. Otherwise we will get into more chaos than we have now, where subjective interpretations already abound. E.g. any company will be able to claim that any results of any trial on its product to support its application for licensing"