Saturday, August 08, 2015
The backlash to the backlash against p-values
Suddenly, everyone is getting really upset about p-values and statistical significance testing. The backlash has reached such a frenzy that some psych journals are starting to ban significance testing. Though there are some well-known problems with p-values and significance testing, this backlash doesn't pass the smell test. When a technique has been in wide use for decades, it's certain that LOTS of smart scientists have had a chance to think carefully about it. The fact that we're only now getting the backlash means that the cause is something other than the inherent uselessness of the methodology.
The problems with significance testing are pretty obvious. First of all, p-hacking causes publication bias - scientists have an incentive to keep mining the data until they get something with a p-value just under the level that people consider interesting. Also, significance testing doesn't usually specify alternative hypotheses very well - this means that rejection of a null hypothesis is usually over-interpreted as being evidence for the researcher's chosen alternative.
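To see concretely why p-hacking inflates false positives, here's a quick simulation of my own (an illustrative sketch, not from any particular paper): generate pure noise, test many outcome variables, and report only the smallest p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def min_p_of_k_tests(k, n=50):
    # k independent outcome variables, all pure noise (true effect = 0);
    # return the smallest of the k two-sided one-sample t-test p-values
    return min(stats.ttest_1samp(rng.standard_normal(n), 0).pvalue
               for _ in range(k))

trials = 2000
honest = sum(min_p_of_k_tests(1) < 0.05 for _ in range(trials)) / trials
hacked = sum(min_p_of_k_tests(20) < 0.05 for _ in range(trials)) / trials
print(honest)  # ~0.05: the nominal false positive rate
print(hacked)  # ~0.64: roughly 1 - 0.95**20, after cherry-picking the best of 20
```

Running one honest test keeps the false positive rate at the nominal 5%; quietly running twenty and reporting the winner pushes it toward two-thirds.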
But the backlash against p-values and significance testing has been way overdone. Here are some reasons I think this is true.
1. Both of the aforementioned problems are easily correctable by the normal practices of science - by which I mean that if people are doing science right, these problems won't matter in the long run. Both p-hacking and improperly specified alternatives will cause false positives - people will think something is interesting when it's really not. The solution to false positives is replication. Natural sciences have been doing this for centuries. When an interesting result comes out, people 1) do the experiment again, 2) do other kinds of experiments to confirm the finding, and 3) try to apply the finding for a bunch of other stuff. If the finding was a false positive, it won't work. The person who got the false positive will take a hit to his or her reputation. Science will advance. My electrical engineer friends tell me that their field is full of false positives, and that they always get caught eventually when someone tries to apply them. That's how things are supposed to work.
2. Significance test results shouldn't be used in a vacuum - to do good science, you should also look at effect sizes and goodness-of-fit. There is a culture out there - in econ, and probably in other fields - that thinks "if the finding is statistically significant, it's interesting." This is a bad way of thinking. Yes, it's true that most statistically insignificant findings are uninteresting, but the converse is not true. For something to be interesting, it should also have a big effect size. The definition of "big" will vary depending on the scientific question, of course. And in cases where you care about predictive power in addition to treatment effects, an interesting model should also do well on some kind of goodness-of-fit measure, like an information criterion or an adjusted R-squared or whatever - again, with "well" defined differently for different problems. Yes, there are people out there who only look at p-values when deciding whether a finding is interesting, but that just means they're using the tool of p-values wrong, not that p-values are a bad tool.
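Here's a made-up illustration of the point (my own sketch, not from the post): with a big enough sample, even a practically negligible effect is highly "statistically significant."

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# a real but practically negligible effect, with a huge sample
n = 1_000_000
x = rng.standard_normal(n)
y = 0.005 * x + rng.standard_normal(n)  # true slope is only 0.005

res = stats.linregress(x, y)
print(res.pvalue)     # comfortably below 0.05: "statistically significant"
print(res.rvalue**2)  # ...yet x explains only a few thousandths of a percent of the variance
```

The p-value alone says "interesting"; the effect size says the finding is negligible.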
3. There is no one-size-fits-all tool for data analysis. Take any alternative to classical frequentist significance testing - for example, Bayesian techniques or machine learning approaches - and you'll find situations where it works well right out of the box, and situations where it has to be applied carefully in order to give useful results, and situations in which it doesn't work nearly as well as alternatives. Now, I don't have proof of this assertion, so it's just a conjecture. If any statisticians, data scientists, or econometricians want to challenge me on this - if any of y'all think there is some methodology that will always yield useful results whenever any research assistant presses a button to apply it to any given data set - please let me know. In the meantime, I will continue to believe that it's the culture of push-button statistical analysis that's the problem, not the thing that the button does when pushed.
Fortunately I see a few signs of a backlash-to-the-backlash against significance testing. In 2005 we had a famous paper arguing that "most published research findings [should be] false". Now, in 2015, we have a meta-analysis showing that the effect of p-hacking, though real, is probably quantitatively small. In addition, I see some signs on Twitter, blogs, etc. that people are starting to get tired of the constant denunciation of significance testing - it's become more of a hipster trend than anything. Dissing p-values in 2015 is a little like dissing macroeconomics in 2011 - something that gives you a free pass to sound smart in certain circles (and as someone who did a lot of macro-dissing in 2011, I should know!). But like all hipster fads, I expect this one to fade.
Strongly disagree. p-values, at least as used in null hypothesis testing, are useless.ReplyDelete
Rejecting a hypothesis of zero width is completely stupid. You are rejecting an infinitesimal slice of your useful hypothesis space. Using a p-value to reject an effect of zero does not even theoretically reject an effect of 0.00000000001 or -0.0000000001, and effects are _always_ at least that large because of systematic bias in the experiment. With enough data your null hypothesis will always be rejected, because the test is sensitive to even tiny biases.
At the very least you need a confidence interval with bounds sufficiently far from your "null hypothesis" to clear a reasonable approximation of the systematic bias.
Also, I've witnessed a sickening amount of p-hacking in university labs.
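The point above about tiny biases is easy to simulate; here's a minimal sketch of my own (the 0.01 offset is a made-up stand-in for systematic bias):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
bias = 0.01  # a tiny systematic offset nobody would care about in practice

pvals = {}
for n in [100, 10_000, 1_000_000]:
    x = rng.standard_normal(n) + bias  # "null" data plus a tiny bias
    pvals[n] = stats.ttest_1samp(x, 0).pvalue
    print(n, pvals[n])
# H0: mu = 0 is false by exactly 0.01, so with enough data the test
# rejects it -- statistical significance without practical relevance
```

At small n the test can't see the bias; at a million observations the point null is rejected decisively, exactly as the comment describes.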
Hear hear! Very well said.Delete
As Tukey pointed out many years ago, the null hypothesis may never be true but the direction of an effect is often uncertain. A significant difference allows a confident conclusion about the direction of the effect.Delete
Did you even read the paper you have linked? It does not say what you think it says.ReplyDelete
"it's certain that LOTS of smart scientists have had a chance to think carefully about it."ReplyDelete
That comment reminded me of the time, 50 years ago, when E. T. Jaynes showed a simple real quality control problem to a group of engineers where the resulting confidence interval was impossible (according to a simple deduction from the same assumptions used to derive the CI). Here was the reaction:
"I first presented this result to a recent convention of reliability and quality control statisticians working in the computer and aerospace industries; and at this point the meeting was thrown into an uproar, about a dozen people trying to shout me down at once. They told me, "This is complete nonsense. A method as firmly established and thoroughly worked over as confidence intervals couldn't possibly do such a thing. You are maligning a very great man; Neyman would never have advocated a method that breaks down on such a simple problem. If you can't do your arithmetic right, you have no business running around giving talks like this."
After partial calm was restored, I went a second time, very slowly and carefully, through the numerical work leading to (18), with all of them leering at me, eager to see who would be the first to catch my mistake [it is easy to show the correctness of (18), at least to two figures, merely by applying a parallel ruler to a graph of F(u)]. In the end they had to concede that my result was correct after all.
To make a long story short, my talk was extended to four hours (all afternoon), and their reaction finally changed to: "My God - why didn't somebody tell me about these things before? My professors and textbooks never said anything about this. Now I have to go back home and recheck everything I've done for years."
This incident makes an interesting commentary on the kinds of indoctrination that teachers of orthodox [frequentist] statistics have been giving their students for two generations now.”
see page 198 here: http://bayes.wustl.edu/etj/articles/confidence.pdf
Frequentist statistics was a known theoretical disaster 50 years ago. It's proven to be an even bigger practical disaster in the meantime. There are lots of examples of very smart people clinging to failed theories for a long time. Your rationalizations for keeping these failed tools around are a good case study in how that can happen.
"There is no one-size-fits-all tool for data analysis."ReplyDelete
This is not true. What's true is that there is no one set of model assumptions that fits every case, but that's a very different statement from claiming there is some problem that can't be tackled in a Bayesian framework. Bayesians need never use frequentist methods (the use of frequencies is NOT the same as using frequentist methods), and many never do. We could in fact teach and use only Bayesian methods, and the only result would be a considerable conceptual/pedagogical/mathematical simplification of statistics.
"Both p-hacking and improperly specified alternatives will cause false positives"ReplyDelete
This is not a fact. It's a supposition born from the same intuitions that led to the failed statistical methods in the first place. It might be true that p-hacking is correlated with false positives, but even that isn't known for sure.
Suppose two researchers are working together. One only computes one p-value, while the other computes 500. They both find the same p-value is significant and publish it together. So in this case will "p-hacking" cause a false positive?
In reality p-hacking, and a host of related ideas such as Mayo's "nominal p-values" or Gelman's "Garden of Forking Paths" silliness, are so expansive and so dependent on the subjective psychological states of researchers that they're all-purpose, always-applicable, irrefutable explanations for any failure of frequentist methods.
In other words, frequentism is unfalsifiable. No amount of failures could ever convince an anti-Bayesian fanatic that frequentism itself was responsible for the failures. There's always an excuse at hand.
P.S. The real reason there are so many unreplicable results is that frequentist model checking doesn't really check the model.Delete
For example, given some data e1,...,en a frequentist might check whether "it's normally distributed". If it passes one of their normality tests (p-value greater than .05 or whatever), they think that means future data will also be normally distributed. All their blather about "frequentist guarantees" and "objective verifications" is based on that.
Most of the time the data rather stubbornly refuses to behave like it did before, and their model-checking "test" couldn't have been more irrelevant to the question of whether it would change or not.
" to do good science, you should also look at effect sizes and goodness-of-fit."ReplyDelete
Here's the story behind all this. Long ago some people couldn't understand what Laplace and others did. Just like anti-Bayesians today, they concluded that since they couldn't understand it, it must be nonsense.
So they went out and created alternative methods based on their ad-hoc intuitions. Since each of these methods is inconsistent with the sum/product rules of probability theory applied in a Bayesian manner, they all work in some cases, but fail miserably when pushed too far.
Over time, some of these failures were observed, and when they're pointed out to frequentists, they simply add new ad-hoc methods on top of the old ones to correct the mistakes. Each time they correct a problem, their solutions and methods get closer to the Bayesian answer.
Any anti-Bayesian can, if they so choose, keep this process up forever. They can inch ever closer to the Bayesian answer, giving up just enough ground to cover up the past mistakes, but not so much that they have to admit Bayesians were right.
"Now, I don't have proof of this assertion, so it's just a conjecture. If any statisticians, data scientists, or econometricians want to challenge me on this - if any of y'all think there is some methodology that will always yield useful results whenever any research assistant presses a button to apply it to any given data set - please let me know."ReplyDelete
Shouldn't the question be whether there is some methodology such that, in every case where it fails to yield useful results, no alternative methodology would yield useful results either?
This post makes some good points but I also think it's arguing against a bit of a straw man. Noah writes "Yes, there are people out there who only look at p-values when deciding whether a finding is interesting, but that just means they're using the tool of p-values wrong, not that p-values are a bad tool." But I don't think many critics deny that p-values can be a useful tool when used correctly. Instead, the steel-manned argument is that the current focus on p-values invites misinterpretation and misuse. Statistics are always going to be used by researchers who are not methodologists and have an imperfect understanding of the tools they are using, so I think it's perfectly reasonable to criticize a method for being frequently abused in practice.ReplyDelete
I've always believed that one of the biggest problems with p-values in economics is that they feed into a focus on academically interesting work over policy-relevant work by forcing researchers to specify a null hypothesis. Often a null hypothesis is interesting from the perspective of testing some particular model, but in the end we already know that all models are wrong, so this kind of model testing is of limited use. The actually important question usually isn't whether an effect is distinguishable from a certain value, but rather what size the effect is and how precisely it is measured. Given that, for much work, there is no natural null hypothesis, focusing on p-values encourages researchers to choose an arbitrary null hypothesis that is likely to be irrelevant (e.g. an effect size of zero).
In general, I struggle to think of reasons it would not be better to have a norm of reporting confidence intervals alone instead of reporting p-values. I'm not saying frequentist confidence intervals are optimal, just that they might dominate p-values as a norm.
How is it a straw man when there are journals refusing to accept any papers that do significance testing???Delete
One can develop statistical testing procedures, show theoretically how these tests behave asymptotically, assess their finite-sample properties via Monte Carlo, and study interesting research questions using real data. StataTheLeft, you can look at the studies of Ait-Sahalia, Jean Jacod and Protter to see how researchers can use a statistical testing approach to understand the functioning of the world (e.g. asset pricing, portfolio management, financial contagion).Delete
"In general, I struggle to think of reasons it would not be better to have a norm of reporting confidence intervals alone instead of reporting p-values."Delete
Slowly working my way through the statistics Ph.D. coursework here, and that is in fact how the books are written and the material presented.
Surprised at the pushback in the comments. I'm a stats PhD, CS BA, left academia and am now a data scientist, so I've seen this from all angles. It seems self-evident to me that Noah is basically right. Any tool that gets developed into a plug-and-play version to allow non-specialists to utilize it will get misused. Classical hypothesis testing just has a long history and a deceptively simplistic interpretation, so its misuse is widespread and problematic.ReplyDelete
I'm somewhat agnostic but was trained in Bayesian/likelihood based methods, so I lean that way. But even so, it seems to me that basically all the criticisms except the Gelman style thousand forks one are basically solved if you understand the issues of sample size, multiple testing/fishing, practical significance, and humility concerning the limits of statistics and single experiments/studies.
What's the thousand forks criticism?Delete
See Gelman's paper:Delete
Gelman, Andrew, and Eric Loken. "The garden of forking paths: Why multiple comparisons can be a problem, even when there is no 'fishing expedition' or 'p-hacking' and the research hypothesis was posited ahead of time."
"Dissing p-values in 2015 is like dissing macroeconomics in 2011..."ReplyDelete
I'd agree that the p-values thing is a dead horse, and there's no point in further flogging.
Macro is a different story, however. First of all, it's not a horse, it's a jackass, and second, it's far from dead. In fact it's reveling in spraying its foul-smelling piss on any sensible discussion of political economy. We need to move beyond mere flogging, and put that sucker down for good.
You should look at the work by Deborah Mayo on Severe testing and error statistics. She seems to me to answer all the usual criticisms and more than that provide a solid base for acquiring reliable knowledge. She has done a lot of work with econometrician Aris Spanos and they are both at Virginia Polytechnic. Mayo has a blog at errorstatistics.com and here is one post that makes a good start:ReplyDelete
Mayo is a philosopher not a statistician. She appears to have never done any statistics or mastered any theory beyond the first couple of weeks of stat 101.Delete
What she did was invent a new thing called "severity". She can't do any of the normal theoretical, simulation, or real-data work needed to analyze it, but she was able to test it out on the very simplest of stat 101 examples (basically mean-of-a-normal-distribution stuff), where the "severity" measure is exactly identical to the Bayesian posterior.
Based on that example, which is really the Bayesian posterior, she goes through and shows how all the criticisms of frequentism are answered by the new severity measure (the Bayesian posterior)! If you look at the severity measure in any example where it's not equal to the Bayesian posterior, it's a complete joke and obviously doesn't work.
This has been hailed by many as the best defense of frequentist statistics out there. This appears to be true.
The concept of severity is not the same as a bayesian posterior. It's about the properties of the test not the probability of the hypothesis being tested. Being a philosopher might be thought to be an advantage for these sorts of underlying conceptual issues. But if it bothers you try reading Aris Spanos who is a distinguished econometrician.Delete
I think statisticians should do their thing, but we need to put it into context. Statistics are not a primary methodology for good economists, in my view.ReplyDelete
They're just a tool to provide evidence. I'm a big fan of the model as the primary tool. This is controversial today because there's a lot of money in statistical economics. But good economics shouldn't be about the money it produces.
Read more Andrew Gelman and you are likely to come around.ReplyDelete
Actually a lot of my thinking on this comes from reading Andrew Gelman!Delete
"most statistically insignificant findings are uninteresting, but the converse is not true. "ReplyDelete
Which is why replication doesn't happen, no? The studies that get insignificant results don't get published, so no one even submits them to journals. And the problem remains. Anyway, why exactly are they "uninteresting"?
(Also, this actually depends on the field. For example, for a while there in international econ, finding insignificant results in PPP equations was all the rage. Best as I can tell, what path a particular literature takes depends on whether some famous person published a "look, the data reject this theory" paper early on or not.)
This is another area where economics can learn from the latest methods in analytics. Another tool to fight the issues with p-values is target shuffling. Essentially, the idea is that you randomize the target order and try to fit models against that shuffled target. What this tests is whether "anything" can be modeled with the driver set.ReplyDelete
Here is a good explanation of target shuffling by John Elder: http://semanticommunity.info/@api/deki/files/30744/Elder_-_Target_Shuffling_Sept.2014.pdf
That being said, for economics the problems are as much with the empirics as with the statistics. Statistics can't save you if what you are measuring has huge issues.
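The target-shuffling idea described above can be sketched in a few lines; this is my own toy illustration with an OLS fit (Elder's slides describe the general procedure; the data and model here are invented):

```python
import numpy as np

rng = np.random.default_rng(2)

# toy data: 5 candidate predictors, none actually related to the target
n, k = 200, 5
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)

def fit_r2(X, y):
    """R^2 of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid.var() / y.var()

actual = fit_r2(X, y)
# target shuffling: permute y to destroy any real X-y relationship,
# then see what R^2 the same modeling procedure achieves on known noise
shuffled = [fit_r2(X, rng.permutation(y)) for _ in range(500)]
frac_as_good = float(np.mean([s >= actual for s in shuffled]))
print(actual, frac_as_good)
```

If a large fraction of the shuffled-target fits do as well as the real one, the model's apparent fit is indistinguishable from what "anything" would achieve with this driver set.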
Yes, a tool is a tool, and as such it can be useful if used properly. However, the question is not the tool but the practice. Saying that replication can help is meaningless when it doesn't actually happen, since negative results don't get published. The big problem is the publishing filter, which is set to search for noise. More than that, the problem is the incentive structure in academia, which pays a premium for clever games with noise with little data discipline.ReplyDelete
Here is the test to spot anti-Bayesian fanaticism. I wonder if Noah passes:ReplyDelete
Haha, that's great.Delete
So... a test not constructed to be uniformly-most-powerful attains a lower than possible degree of power (equal to 0.9), and so gives a false negative 10% of the time under H1. My, what a shocking surprise.Delete
Noah, in #2 you probably meant the inverse (negation of premise + conclusion) isn't true. Converse (most uninteresting results are stat insig) is probably true given the base rate.
Laplace, my strong prior is you are Anonymous at the Gelman blog.
As a stats consumer only, I see P value as an attractive nuisance, like a well into which kids will fall -- it *cannot* be popularized, paraphrased, or used for policy inference without error. Roughly half of "definitions" of P in "stats primers" are wrong, all glosses in newspapers are wrong.
Do you pass the test for anti-Bayesian fanaticism?Delete
I'm on your side here, Laplace/Anonymous.Delete
Kyle C: inverse <--> converse, so I'm all good.Delete
Laplace: There are much better things in life to go be a nutcase about.
Converse has the same (contrapositive) truth value as the inverse only in simple Boolean syntax. When you introduce a weight like "most" to a statement, the agreement disappears. Often important in law when weighing evidence.Delete
S: most statistically insignificant findings are uninteresting (true)
C: most uninteresting findings are statistically insignificant (true by base rate because most findings are statistically insignificant)
I: most interesting findings are statistically significant ....
Sure. But if we're talking about implementing a cutoff decision rule, we're back to Boolean syntax.Delete
Laplace, what do you think of this "marriage"? Unnatural? Disgusting?ReplyDelete
Well, I don't think the use of p-values should or will be widely banned, but lazy over-reliance on p-values does definitely belong in the junk heap of junk science.ReplyDelete
"The fact that we're only now getting the backlash [against p-values] means that the cause is something other than the inherent uselessness of the methodology."ReplyDelete
Well, the cause of this backlash is that there are now algorithms (such as Metropolis-Hastings and other MCMC methods) and faster computers that enable us to do Bayesian statistics fairly easily, as a more appropriate alternative to p-values. Because of this, p-values (and the null hypothesis testing that goes with them) have become basically useless. Why would you want to use p-values when you can do Bayesian statistics, which gives you better results and is more flexible?
With Bayesian statistics spreading to all fields of science (I work with a bunch of ecologists who prefer it to p-values), I don't see "dissing p-values" as a "hipster fad", but as the realization that there is a better way of doing science, and the spreading of that word.
Because Bayesian statistics requires specification of a prior, and prior-hacking is even worse than p-hacking. It's not, in fact, a better way of doing statistics at all. It is simply different.Delete
That might seem reasonable, especially if you view 'Bayesian statistics' as just another collection of tools in the 'data science' toolbox, but it doesn't stand up to scrutiny:Delete
"[...] Ah, the bogey-man of priors is trotted out to scare the children, as if priors can be capriciously set to anything the analyst wants (insert sounds of mad, wicked laughter here) and thereby predetermine the conclusion. [...]" [penultimate paragraph].
I agree that 'Bayesian statistics' isn't a (better) way of 'doing statistics'. It's a way - the way - to do probabilistic inference. It's 'just' applied probability theory.
Most of the economics papers I've seen always discuss the size of effects and their implications in terms of magnitudes that we care about. In most cases you can get around this whole debate by reporting confidence intervals and relying on the Bernstein-von Mises theorem, which says that asymptotically, maximum likelihood confidence intervals converge to Bayesian posterior probability intervals. In that case you can interpret classical confidence intervals as giving you approximate probability bounds. Of course, this doesn't work well in small samples, for which priors also matter a lot in Bayesian estimation. Alternatively you can work with a flat prior, though that's sometimes tricky to define when we care about functions of a parameter instead of just the parameter.ReplyDelete
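The Bernstein-von Mises point is easy to see numerically. Here's a sketch of my own using Bernoulli data (the sample size and true probability are arbitrary): the classical Wald interval from the MLE and the Bayesian posterior interval under a flat prior nearly coincide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Bernoulli data with true success probability 0.3
n = 2000
x = rng.binomial(1, 0.3, size=n)
phat = x.mean()

# classical (Wald) 95% confidence interval from the MLE
se = np.sqrt(phat * (1 - phat) / n)
wald = (phat - 1.96 * se, phat + 1.96 * se)

# Bayesian 95% posterior interval under a flat Beta(1, 1) prior
post = stats.beta(1 + x.sum(), 1 + n - x.sum())
bayes = (post.ppf(0.025), post.ppf(0.975))

print(wald)
print(bayes)  # nearly identical at this sample size
```

At n = 2000 the two intervals agree to roughly three decimal places; at small n, or with an informative prior, they can diverge noticeably.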
"Why Most Published Research Findings Are False", by John Ioannidis, August 30, 2005.ReplyDelete
This Ioannidis paper gives a formula for the probability that a published statistical relationship is true, calling it the Positive Predictive Value (PPV): "After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV."
PPV = (1 - beta) R / (R - beta * R + alpha) = 1 / [1 + alpha / ((1 - beta) R)]
Here alpha (usually 0.05) is the probability of a Type I error; beta is the probability of a Type II error (1 - beta is the power); and R is the ratio of true relationships to false relationships in that field of tests. R / (1 + R) is then the pre-study probability that the relationship is true; I will call this the "Background Probability" of a true relationship.
While the researcher/statistician can set alpha = 0.05 and design for power 1 - beta = 0.80, the probability meaning of these numbers is clouded by their frequentist interpretation. What the statistician can't set, and what is never mentioned -- the Background Probability -- is what matters most in most research! "PPV depends a lot on the pre-study odds (R)." This becomes obvious when research seeks, from among 30,000 genes, the (at most) 30 genes that influence a genetic disease, for which R = 30/30,000 = 0.001.
When the Background Probability is moderate, a design with moderate power (1 - beta) can get a good PPV. But research often works in fields of previously unseen results, where R really is 0.01 or even 0.001. In these many fields, the Background Probability swamps any statistical design's alpha and beta: "Most research findings are false for most research designs and for most fields... a PPV exceeding 50% is quite difficult to get." Indeed, a look at the PPV formula shows that, whatever the alpha, even a power of 1 (a little thought reveals why more power hardly helps here) produces mostly false results if the Background Probability itself is less than alpha! "Too large and too highly significant effects may actually be more likely to be signs of large bias in most fields of modern research."
It is important to refine these ideas by bounding the "field" of study: not all research, nor even all biological research, but maybe research on cancer -- involving a careful choice of bounds. Here, the choice of "field" affects the Background Probability, equivalently R.
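The PPV formula quoted above is easy to play with directly; here is a straightforward transcription (the function and parameter names are mine):

```python
def ppv(R, alpha=0.05, power=0.80):
    """Ioannidis's post-study probability that a claimed finding is true.

    R     -- pre-study odds of a true relationship in the field
    alpha -- Type I error rate
    power -- 1 - beta, the complement of the Type II error rate
    """
    # PPV = (1 - beta) R / (R - beta R + alpha), with power = 1 - beta
    return power * R / (power * R + alpha)

print(ppv(1.0))    # even pre-study odds: PPV ~ 0.94
print(ppv(0.1))    # PPV ~ 0.62
print(ppv(0.001))  # the gene-hunting example: PPV ~ 0.016, mostly false positives
```

Note that even with power = 1, ppv(0.001) is still only about 0.02, which is the "more power hardly helps" point made above.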
Points #1 and #2 were first prominently brought up by Deirdre McCloskey and Steve Ziliak in a number of papers and a book. McCloskey has been crusading for years about the misuse of p-value. Their book is a great read, highly recommended for those that want to learn more about this issue.ReplyDelete