Comments on Noahpinion: "The backlash to the backlash against p-values" (Noah Smith)

Anonymous (2015-08-16 09:19):
Points #1 and #2 were first prominently raised by Deirdre McCloskey and Steve Ziliak in a number of papers and a book. McCloskey has been crusading for years against the misuse of p-values. Their book is a great read, highly recommended for those who want to learn more about this issue.

Paul Hayes (2015-08-15 07:34):
That might seem reasonable, especially if you view "Bayesian statistics" as just another collection of tools in the "data science" toolbox, but it doesn't stand up to scrutiny:

"[...] Ah, the bogey-man of priors is trotted out to scare the children, as if priors can be capriciously set to anything the analyst wants (insert sounds of mad, wicked laughter here) and thereby predetermine the conclusion. [...]" (penultimate paragraph of http://doingbayesiandataanalysis.blogspot.co.uk/2011/10/false-conclusions-in-false-positive.html)

I agree that "Bayesian statistics" isn't a (better) way of "doing statistics". It's a way - the way - to do probabilistic inference. It's "just" applied probability theory.
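One way to see the point about priors not "predetermining the conclusion": with enough data, analysts who start from very different priors end up in nearly the same place. A toy conjugate-Beta sketch (all numbers invented for illustration, not from the thread):

```python
def posterior_mean(a, b, heads, tails):
    """Posterior mean for a coin's bias under a Beta(a, b) prior
    updated on binomial data (standard conjugate update)."""
    return (a + heads) / (a + b + heads + tails)

heads, tails = 70, 30  # hypothetical data: 70 heads in 100 flips

skeptic = posterior_mean(1, 9, heads, tails)   # prior mean 0.1
optimist = posterior_mean(9, 1, heads, tails)  # prior mean 0.9

# Both posteriors sit near the observed frequency 0.7, so the prior
# cannot simply be "set to anything" to force a conclusion.
print(skeptic, optimist)
```

With 100 flips the two posterior means differ by less than 0.08 despite prior means of 0.1 and 0.9; with more data the gap shrinks further.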
Anonymous (2015-08-13 16:29):
"Why Most Published Research Findings Are False" by John Ioannidis, PLoS Medicine, August 30, 2005:
http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124

This Ioannidis paper gives a formula for the probability that a published statistical relationship is wrong, calling it the positive predictive value (PPV): "After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV."

    PPV = (1 - beta) R / (R - beta R + alpha)
        = 1 / [1 + alpha / ((1 - beta) R)]

where

    alpha = the probability of a Type I error (usually set to 0.05)
    beta  = the probability of a Type II error (1 - beta is the power)
    R     = the pre-study odds: the ratio of true relationships to false relationships in that field of tests

The pre-study probability that the relationship is true is R / (1 + R); I will call this the "background probability" of a true relationship. While the researcher can set alpha = 0.05 and design for power (1 - beta) = 0.80, the probability meaning of these numbers is clouded by their frequentist interpretation. What the statistician can't set, and what is never mentioned -- the background probability -- is what matters most in most research!

"PPV depends a lot on the pre-study odds (R)." This becomes obvious when research seeks, from 30,000 genes, the (at most) 30 genes that influence a genetic disease, for which

    R = 30/30000 = 0.001

When the background probability is moderate, a design with moderate power (1 - beta) can get good PPV. But research often works in a field of
previously unseen results, where R equals 0.01 or even 0.001. In these many fields, the background probability swamps any statistical design's alpha and beta. "Most research findings are false for most research designs and for most fields... a PPV exceeding 50% is quite difficult to get." Indeed, a look at the PPV formula shows that, whatever alpha, even a power of 1 (a little thought reveals why more power hardly helps here) produces mostly false results if the pre-study odds R are less than alpha!

"Too large and too highly significant effects may actually be more likely to be signs of large bias in most fields of modern research."

It is important to refine these ideas by bounding the "field" of study: not all research, not even all biological research, but maybe research on cancer -- a careful choice of bounds. The choice of "field" sets the background probability, equivalently R.
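The PPV formula quoted above is easy to check numerically. A minimal sketch, using the comment's illustrative alpha, power, and R values plus one made-up moderate-odds case:

```python
def ppv(alpha, power, R):
    """Positive predictive value from Ioannidis (2005):
    PPV = (1 - beta) R / (R - beta R + alpha), with power = 1 - beta."""
    return power * R / (power * R + alpha)

# A well-powered genome scan: 30 true genes among 30,000 candidates.
print(ppv(alpha=0.05, power=0.80, R=30 / 30000))  # ~0.016: most "hits" are false

# Even perfect power cannot rescue tiny pre-study odds R < alpha.
print(ppv(alpha=0.05, power=1.00, R=0.001))       # ~0.02

# With moderate pre-study odds (made-up value), the same design looks fine.
print(ppv(alpha=0.05, power=0.80, R=0.5))         # ~0.89
```

The second call shows the commenter's point directly: with R = 0.001 and alpha = 0.05, PPV stays below 50% no matter how much power the design has.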
daniels (2015-08-12 01:58):
Most of the economics papers I've seen discuss the size of effects and their implications in terms of magnitudes we care about. In most cases you can get around this whole debate by reporting confidence intervals and relying on the Bernstein-von Mises theorem, which says that asymptotically, maximum likelihood confidence intervals converge to Bayesian posterior probability intervals. In that case you can interpret classical confidence intervals as giving approximate probability bounds. Of course, this doesn't work well in small samples, for which priors also matter a lot in Bayesian estimation. Alternatively you can work with a flat prior, though that's sometimes tricky to define when we care about functions of a parameter rather than just the parameter.

Noah Smith (2015-08-11 23:27):
Sure. But if we're talking about implementing a cutoff decision rule, we're back to Boolean syntax.

Anonymous (2015-08-11 18:17):
So... a test not constructed to be uniformly most powerful attains a lower-than-possible degree of power (equal to 0.9), and so gives a false negative 10% of the time under H1. My, what a shocking surprise.

Fangz (2015-08-11 17:26):
Because Bayesian statistics requires specification of a prior, and prior-hacking is even worse than p-hacking. It's not, in fact, a better way of doing statistics at all. It is simply different.

Anonymous (2015-08-11 16:44):
"The fact that we're only now getting the backlash [against p-values] means that the cause is something other than the inherent uselessness of the methodology."

Well, the cause of this backlash is that there are now MCMC algorithms (such as Metropolis-Hastings) and faster computers that enable us to do Bayesian statistics fairly easily, as a more appropriate alternative to p-values.
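As an aside, for readers who haven't seen it, the Metropolis-Hastings sampler just mentioned fits in a dozen lines. A toy sketch sampling a standard normal target (the target density, step size, and sample count are arbitrary illustrative choices):

```python
import math
import random

def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings: propose x' ~ Normal(x, step),
    accept with probability min(1, target(x') / target(x))."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_alpha = log_target(proposal) - log_target(x)
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            x = proposal          # accept the move
        samples.append(x)         # on rejection, the old x is repeated
    return samples

# Toy target: standard normal log-density (up to an additive constant).
draws = metropolis_hastings(lambda x: -0.5 * x * x, x0=3.0, n_samples=20000)
mean = sum(draws) / len(draws)                 # should be near 0
var = sum(d * d for d in draws) / len(draws)   # should be near 1
```

Only the ratio of target densities is needed, which is why MCMC made Bayesian posteriors (known only up to a normalizing constant) computable in practice.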
Because of this, p-values (and the null hypothesis testing that goes with them) have become basically useless. Why would you want to use p-values when you can do Bayesian statistics, which gives you better results and is more flexible?

With Bayesian statistics spreading to all fields of science (I work with a bunch of ecologists who prefer it to p-values), I don't see "dissing p-values" as a "hipster fad", but as a realization that there is a better method of doing science, and as spreading the word about it.

Tom Warner (2015-08-11 15:40):
Well, I don't think the use of p-values should or will be widely banned, but lazy over-reliance on p-values definitely belongs in the junk heap of junk science.

ldm (2015-08-11 11:53):
As Tukey pointed out many years ago, the null hypothesis may never be true, but the direction of an effect is often uncertain. A significant difference allows a confident conclusion about the direction of the effect.

Kyle C (2015-08-11 11:40):
The converse has the same (contrapositive) truth value as the inverse only in simple Boolean syntax. When you introduce a weight like "most" to a statement, the agreement disappears.
Often important in law when weighing evidence.

S: Most statistically insignificant findings are uninteresting. (True.)
C: Most uninteresting findings are statistically insignificant. (True by base rate, because most findings are statistically insignificant.)
I: Most interesting findings are statistically significant...

Derpometrician (2015-08-11 03:50):
Laplace, what do you think of this "marriage" (http://davegiles.blogspot.com/2011/07/bayesian-and-non-bayesian-marriage.html)? Unnatural? Disgusting?

David Manheim (2015-08-11 00:17):
See Gelman's paper:

Gelman, Andrew, and Eric Loken. "The garden of forking paths: Why multiple comparisons can be a problem, even when there is no 'fishing expedition' or 'p-hacking' and the research hypothesis was posited ahead of time."

Noah Smith (2015-08-10 22:11):
Kyle C: inverse <--> converse, so I'm all good.

Laplace: There are much better things in life to go be a nutcase about.

Kyle C (2015-08-10 21:52):
I'm on your side here, Laplace/Anonymous.

Benoit Essiambre (2015-08-10 19:37):
Haha, that's great.

Laplace (2015-08-10 19:29):
Do you pass the test for anti-Bayesian fanaticism?

http://www.bayesianphilosophy.com/test-for-anti-bayesian-fanaticism/

Kyle C (2015-08-10 16:53):
FWIW --

Noah, in #2 you probably meant the inverse (negation of premise and conclusion) isn't true. The converse (most uninteresting results are statistically insignificant) is probably true given the base rate.

Laplace, my strong prior is that you are Anonymous at the Gelman blog.

As a stats consumer only, I see the p-value as an attractive nuisance, like a well into which kids will fall -- it *cannot* be popularized, paraphrased, or used for policy inference without error. Roughly half of the "definitions" of p in "stats primers" are wrong, and all glosses in newspapers are wrong.

Laplace (2015-08-10 14:46):
Here is the test to spot anti-Bayesian fanaticism. I wonder if Noah passes:

http://www.bayesianphilosophy.com/test-for-anti-bayesian-fanaticism/

Haynes (2015-08-10 12:28):
"In general, I struggle to think of reasons it would not be better to have a norm of reporting confidence intervals alone instead of reporting p-values."

Slowly working my way through the statistics Ph.D. coursework here, and that is in fact how the books are written and the material presented.

Krzys (2015-08-10 10:00):
Yes, a tool is a tool, and as such it can be useful if used properly. However, the question is not the tool but the practice.
Saying that replication can help is meaningless when it doesn't actually happen, since negative results don't get published. The big problem is the publishing filter, which is set to search for noise. More than that, the problem is the incentive structure in academia, which pays a premium for clever games with noise and little data discipline.

Zathras (2015-08-10 09:09):
This is another area where economics can learn from the latest methods in analytics. Another tool to fight the issues with p-values is target shuffling. Essentially, the idea is that you randomize the target order and try to fit models against that shuffled target. What this tests is whether "anything" can be modeled with the driver set.

Here is a good explanation of target shuffling by John Elder: http://semanticommunity.info/@api/deki/files/30744/Elder_-_Target_Shuffling_Sept.2014.pdf

That being said, for economics the problems are as much with the empirics as with the statistics. Statistics can't save you if what you are measuring has huge issues.
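Target shuffling as described can be sketched in a few lines. This is an illustrative toy (random made-up "driver" data and a simple correlation-based fit score), not Elder's implementation:

```python
import random

def fit_score(xs, ys):
    """Toy model-fit score: squared Pearson correlation of one driver
    with the target (a stand-in for any modeling procedure's R^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def target_shuffle_pvalue(xs, ys, n_shuffles=1000, seed=0):
    """Refit after randomly shuffling the target; the empirical p-value
    is the share of shuffled fits that match or beat the real fit."""
    rng = random.Random(seed)
    real = fit_score(xs, ys)
    ys_copy = list(ys)
    beats = 0
    for _ in range(n_shuffles):
        rng.shuffle(ys_copy)            # break any real x-y relationship
        if fit_score(xs, ys_copy) >= real:
            beats += 1
    return beats / n_shuffles

rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(200)]
noise_target = [rng.gauss(0, 1) for _ in range(200)]   # no real signal
signal_target = [x + rng.gauss(0, 0.5) for x in xs]    # real signal

print(target_shuffle_pvalue(xs, noise_target))   # p-value for pure noise
print(target_shuffle_pvalue(xs, signal_target))  # near 0: shuffles rarely win
```

The shuffled fits answer exactly the question the comment poses: how well can this driver set "model" a target that is known to be unrelated to it?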
mike (2015-08-10 04:40):
The concept of severity is not the same as a Bayesian posterior. It's about the properties of the test, not the probability of the hypothesis being tested. Being a philosopher might be thought an advantage for these sorts of underlying conceptual issues. But if it bothers you, try reading Aris Spanos, who is a distinguished econometrician.

Laplace (2015-08-10 00:10):
Mayo is a philosopher, not a statistician. She appears never to have done any statistics or mastered any theory beyond the first couple of weeks of Stat 101.

What she did was invent a new thing called "severity". She can't do any of the normal theoretical, simulation, or real-data work needed to analyze it, but she was able to test it out on the very simplest Stat 101 cases (basically mean-of-a-normal-distribution stuff), where the "severity" measure is exactly identical to the Bayesian posterior.

Based on that example, which is really the Bayesian posterior, she goes through and shows how all the criticisms of frequentism are answered with the new severity measure (the Bayesian posterior)! If you look at the severity measure in any example where it's not equal to the Bayesian posterior, it's a complete joke and obviously doesn't work.

This has been hailed by many as the best defense of frequentist statistics out there. That appears to be true.

Anonymous (2015-08-10 00:04):
This comment has been removed by the author.