Saturday, August 08, 2015
The backlash to the backlash against p-values
Suddenly, everyone is getting really upset about p-values and statistical significance testing. The backlash has reached such a frenzy that some psych journals are starting to ban significance testing. Though there are some well-known problems with p-values and significance testing, this backlash doesn't pass the smell test. When a technique has been in wide use for decades, it's certain that LOTS of smart scientists have had a chance to think carefully about it. The fact that we're only now getting the backlash means that the cause is something other than the inherent uselessness of the methodology.
The problems with significance testing are pretty obvious. First of all, p-hacking causes publication bias - scientists have an incentive to keep mining the data until they get something with a p-value just under the level that people consider interesting. Also, significance testing doesn't usually specify alternative hypotheses very well - this means that rejection of a null hypothesis is usually over-interpreted as being evidence for the researcher's chosen alternative.
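To get a feel for how much data mining can inflate false positives, here is a minimal Python sketch. The setup is an illustrative assumption, not taken from any particular study: each simulated study has no real effect at all, measures some number of independent outcomes with 30 observations each, runs a simple two-sided z-test (standard deviation known to be 1) on every outcome, and counts as a "finding" if any outcome clears p < 0.05. The function names are hypothetical.

```python
import math
import random


def p_value_two_sided(sample, sigma=1.0):
    """Two-sided z-test p-value for H0: mean = 0, with known sigma."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_outcomes, n_studies=2000, n_obs=30, alpha=0.05, seed=0):
    """Fraction of null studies that report at least one 'significant' outcome."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Every outcome is pure noise: the true mean is exactly zero.
        pvals = [
            p_value_two_sided([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
            for _ in range(n_outcomes)
        ]
        # The p-hacker reports only the best-looking outcome.
        if min(pvals) < alpha:
            hits += 1
    return hits / n_studies


honest = false_positive_rate(n_outcomes=1)    # one pre-specified test
hacked = false_positive_rate(n_outcomes=20)   # best of 20 tries
print(f"one test: {honest:.3f}   best of 20: {hacked:.3f}")
```

With independent outcomes, the chance that at least one of 20 null tests clears 0.05 is 1 - 0.95^20, or about 64%, so the second number should land near there while the first stays near 5%. Real p-hacking usually involves correlated specifications rather than independent outcomes, so treat this as a rough illustration of the mechanism and not as an estimate of how bad the problem is in practice.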
But the backlash against p-values and significance testing has been way overdone. Here are some reasons I think this is true.
1. Both of the aforementioned problems are easily correctable by the normal practices of science - by which I mean that if people are doing science right, these problems won't matter in the long run. Both p-hacking and improperly specified alternatives will cause false positives - people will think something is interesting when it's really not. The solution to false positives is replication. Natural sciences have been doing this for centuries. When an interesting result comes out, people 1) do the experiment again, 2) do other kinds of experiments to confirm the finding, and 3) try to apply the finding to a bunch of other stuff. If the finding was a false positive, it won't work. The person who got the false positive will take a hit to his or her reputation. Science will advance. My electrical engineer friends tell me that their field is full of false positives, and that they always get caught eventually when someone tries to apply them. That's how things are supposed to work.
2. Significance test results shouldn't be used in a vacuum - to do good science, you should also look at effect sizes and goodness-of-fit. There is a culture out there - in econ, and probably in other fields - that thinks "if the finding is statistically significant, it's interesting." This is a bad way of thinking. Yes, it's true that most statistically insignificant findings are uninteresting, but the converse is not true. For something to be interesting, it should also have a big effect size. The definition of "big" will vary depending on the scientific question, of course. And in cases where you care about predictive power in addition to treatment effects, an interesting model should also do well on some kind of goodness-of-fit measure, like an information criterion or an adjusted R-squared or whatever - again, with "well" defined differently for different problems. Yes, there are people out there who only look at p-values when deciding whether a finding is interesting, but that just means they're using the tool of p-values wrong, not that p-values are a bad tool.
3. There is no one-size-fits-all tool for data analysis. Take any alternative to classical frequentist significance testing - for example, Bayesian techniques or machine learning approaches - and you'll find situations where it works well right out of the box, and situations where it has to be applied carefully in order to give useful results, and situations in which it doesn't work nearly as well as alternatives. Now, I don't have proof of this assertion, so it's just a conjecture. If any statisticians, data scientists, or econometricians want to challenge me on this - if any of y'all think there is some methodology that will always yield useful results whenever any research assistant presses a button to apply it to any given data set - please let me know. In the meantime, I will continue to believe that it's the culture of push-button statistical analysis that's the problem, not the thing that the button does when pushed.
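Points 1 and 2 above can be made concrete with a little arithmetic. The Python sketch below is a back-of-the-envelope illustration under loudly stated assumptions: the numbers (10% of tested hypotheses are true, 80% power, a 0.05 threshold) are made up for the example, replications are treated as fully independent and honest, and the significance test is a simple two-sided z-test with known standard deviation. The function names are hypothetical.

```python
import math


def false_positive_share(prior_true, power, alpha, n_tests):
    """Share of false findings among those significant in all n_tests independent tests."""
    true_pos = prior_true * power ** n_tests
    false_pos = (1.0 - prior_true) * alpha ** n_tests
    return false_pos / (true_pos + false_pos)


def z_test_p_value(effect_size, n_obs):
    """Two-sided p-value for H0: mean = 0 when the sample mean equals
    effect_size (measured in standard deviations) and sigma is known."""
    z = effect_size * math.sqrt(n_obs)
    return math.erfc(abs(z) / math.sqrt(2))


# Point 1: requiring one independent replication filters out most false positives.
after_original = false_positive_share(0.10, 0.80, 0.05, n_tests=1)
after_replication = false_positive_share(0.10, 0.80, 0.05, n_tests=2)
print(f"false share, original only:    {after_original:.3f}")
print(f"false share, plus replication: {after_replication:.3f}")

# Point 2: significance and effect size are different things.
tiny_effect_big_sample = z_test_p_value(0.01, 1_000_000)
big_effect_small_sample = z_test_p_value(0.5, 10)
print(f"effect 0.01 SD, n = 1,000,000: p = {tiny_effect_big_sample:.2e}")
print(f"effect 0.5 SD,  n = 10:        p = {big_effect_small_sample:.2f}")
```

Under these assumed numbers, about 36% of initially significant findings are false, and that drops to roughly 3% once a finding also has to survive one independent replication. In the second half, an effect of one hundredth of a standard deviation is wildly significant with a million observations, while an effect fifty times larger misses the 0.05 cutoff with ten observations (p is about 0.11). Neither p-value alone tells you which finding is the interesting one.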
Fortunately I see a few signs of a backlash-to-the-backlash against significance testing. In 2005 we had a famous paper that used simulations to show that "most published research findings [should be] false". Now, in 2015, we have a meta-analysis showing that the effect of p-hacking, though real, is probably quantitatively small. In addition, I see some signs on Twitter, blogs, etc. that people are starting to get tired of the constant denunciation of significance testing - it's more of a hipster trend than anything. Dissing p-values in 2015 is a little like dissing macroeconomics in 2011 - something that gives you a free pass to sound smart in certain circles (and as someone who did a lot of macro-dissing in 2011, I should know!). But like all hipster fads, I expect this one to fade.