## Thursday, July 24, 2014

### Why I like Frequentism

My post about "Bayesian Superman" wasn't actually intended to be a knock on Bayesianism - it was just about a quirk of rationality. I certainly don't think Bayesianism is a "dangerous religion that harms science"!

But reading this Andrew Gelman essay made me think about Bayesian inference in science, which then got me to thinking about Frequentist inference, and why I think Frequentism is a bit underrated these days.

Frequentist hypothesis testing has come under sustained and vigorous attack in recent years. It's arbitrary, it doesn't obey the likelihood principle, it throws away information, it can lead to silly results. All this is true. But there are a couple of good things about Frequentist hypothesis testing that I haven't seen many people discuss. Both of these have to do not with the formal method itself, but with social conventions associated with the practice. These are:

1. The unwritten rule that you "don't protect the null hypothesis" (or that you "penalize type I errors relative to type II errors"), and

2. The implicit three-valued logic of "hypothesis rejection".

The first of these is about what kind of prior (to use Bayesian language) the scientists should start with. It basically says that you should bias your conclusions against your own hypothesis. This is in contrast to, say, using flat or "uninformative" priors.

The second social convention is more amorphous and difficult to define. It's about what conclusions you draw from the hypothesis test. If you don't protect the null, and you reject the null in favor of your own hypothesis, Frequentism says you've found something interesting, and you should follow up on it. If you fail to reject the null, though, it doesn't mean you believe in the null any more than you did before - it means you shrug and move on. Frequentism implicitly assigns results to one of three categories: "interesting evidence against a hypothesis", "interesting evidence for a hypothesis", and "nothing interesting". It's sort of like three-valued logic, in a way. Compare this to the implicit logic of the likelihood principle, in which you compare alternative hypotheses directly.
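To make the three-valued logic concrete, here is a minimal sketch in Python. It assumes a two-sided z-test of a sample mean against a null of zero with a known standard deviation, and it assumes the researcher's own hypothesis is that the effect is positive. The function names and the 0.05 threshold are illustrative choices, not anything canonical.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def three_valued_verdict(sample, sigma, alpha=0.05):
    """Sort a result into one of the three informal categories.

    Assumptions (illustrative only): the null is "mean = 0", sigma is
    known, and the researcher's hypothesis is "mean > 0".
    """
    n = len(sample)
    mean = sum(sample) / n
    z = mean / (sigma / math.sqrt(n))
    if two_sided_p_value(z) >= alpha:
        # Failing to reject is NOT evidence for the null: shrug and move on.
        return "nothing interesting"
    if z > 0:
        return "interesting evidence for"
    return "interesting evidence against"
```

Note that the "nothing interesting" branch says nothing about whether the null is true, which is the point of the convention.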

Why do I like these social conventions? Two reasons. First, I think they cut down a lot on scientific noise. "Statistical significance" is sort of a first-pass filter that tells you which results are interesting and which ones aren't. Without that automated filter, the entire job of distinguishing interesting results from uninteresting ones falls to the reviewers of a paper, who have to read through the paper much more carefully than if they can just scan for those little asterisks of "significance". Naturally, this filter also has a downside - it creates publication bias against "negative results" that may in fact be interesting. But that may be a small price to pay to avoid the flood of paper submissions that would result if everyone just wrote up and sent in the results of any estimation exercise.

Second, the discipline of the Frequentist social conventions acts against scientists' natural tendency to favor and promote and believe in their own theories. It tries to enforce the idea of scientific honesty. Feynman talks about this in his famous speech, "Cargo Cult Science":

> It's a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty--a kind of leaning over backwards. For example, if you're doing an experiment, you should report everything that you think might make it invalid--not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other experiment, and how they worked--to make sure the other fellow can tell they have been eliminated.
>
> Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can--if you know anything at all wrong, or possibly wrong--to explain it.

The kind of integrity Feynman is talking about concerns systematic error. The Frequentist social conventions are an attempt to do something similar for random error. This provides a natural defense against "scientific trolling", which is a term I just invented to mean "the tendency of unscrupulous researchers to report weak results to advance some ulterior agenda." Scientific trolling means that ulterior-motivated researchers will submit a flood of weak results, while scrupulous, scientifically-motivated researchers will voluntarily restrain themselves from reporting equally weak results that go in the opposite direction. That sort of reporting bias will tend to contaminate the beliefs of a neutral observer. (I can think of at least one very good real-world example of this, but I will be polite and not discuss it on a blog.)

Now, of course, the Frequentist social conventions are weak, inadequate defenses against subjectivism and noise. They have drawbacks, like discouraging the reporting of negative results. And they are subject to being gamed by unscrupulous researchers. But at least they are something.

Bayesian inference seems to me like a perfectly fine and good method of inference. It's more appealing in many ways than classical Frequentism. But I think that Bayesianism might want to get some standardized social conventions similar to (and hopefully superior to) the Frequentist ideas of "not protecting the null" and "statistical significance" (note: it may already have these, and I'm just not aware of them). These conventions would unavoidably be arbitrary, and would throw away some information in many cases. But they would help lean against the natural incentives of the scientific reporting and publishing process. Maybe there could be more than one set of conventions, for use in recognizably different situations.

Classical Frequentist hypothesis testing is probably on its way out in the long term. But the fact that it has survived and dominated scientific publishing as long as it has, in spite of all its well-known problems, might be a testament to the usefulness of the unspoken social conventions associated with it.

1. This is definitely a good post to get people thinking. I just hope your average readership isn't too entrenched in any one particular camp.

Gelman has written a couple very inspiring pieces (with total humility) about his interpretation of Bayesianism and how it fits into his personal philosophy. In Gelman's pieces, his appreciation for philosophy of science is actually what turned me on to Deborah Mayo's (and Aris Spanos') 'Error Statistics' philosophy which is frequentist-based.

I have a feeling I can predict half of the next ten comments that will appear here. And it's not going to be pretty.

1. Do you have links to those posts?

2. Anonymous 10:49 AM

Take a look here:
http://larspsyll.wordpress.com/2013/08/03/andrew-gelman-vs-deborah-mayo-on-the-significance-of-significance-tests-wonkish/

3. Thanks!

2. Oh, I very much agree with most of what you say, Noah. Where I disagree is on your prediction about the future of frequentism. If it is on the way out, it is on a very, very slow boat. That is particularly true in the biomedical sciences, my own field, where the depth of the overall statistical illiteracy would shock a lot of people. Researchers there tend to feel so overwhelmed by their biological research problems that they don't have the patience to "learn" proper frequentist statistics, much less Bayesian statistics. Yet frequentist tests are embedded in some popular, easy-to-use graphing software packages we use. And though everybody overuses t tests and mangles the P value, it is easily understood as a simple threshold, much like an expiration date on a food package.

3. Ronald Fisher (pictured) was not a bayesian, but he wasn't a frequentist either. He followed what he called the likelihood principle (which seems to me like bayes without acknowledging the prior). I just wanted to let you know so that Robert Murphy doesn't bite your head off for using the wrong picture again.

1. Wasn't he the guy who coined the term "significance test"?

2. If you're going to write a post praising frequentism, Ronald Fisher is as good an illustration as any. Works for me.

3. Fisher did lots of statistics that is associated with frequentist methods, but his writings were somewhat ambiguous as to what his actual philosophical position was wrt statistical method. I would have chosen Neyman and/or Pearson. In any case it's not a terrible choice. I even liked your picture choice for the Star Trek brain worms blog. It wasn't literally correct, but it was impressionistically correct.

4. JohnR 2:11 PM

"It wasn't literally correct, but it was impressionistically correct."

That may be one of the best descriptions of a lot of economic work that I've ever seen...

4. A couple of points about Bayesianism, if I may. One scholar in the field emailed me this morning reminding me that a datum does not strengthen a hypothesis if the datum in question is equally consistent with a contrary hypothesis. In the Superman case, a contrary hypothesis might be, "God, who meant to wipe me out for my impudence years ago, is a forgetful Being." Every day of survival supports H2 as fully as it supports H1, so on Bayesian grounds it supports neither.

Another point concerns that Andrew Gelman essay to which you've linked us. For those who haven't followed the link: it is an article that appeared in a scholarly journal called BAYESIAN ANALYSIS in 2008. Gelman considers himself a Bayesian, as the article makes clear, so he doesn't consider the objections he outlines there decisive. The article describes itself as "A Bayesian's attempt to see the other side." I think those who read the quotes you've provided should also ask themselves why those same considerations didn't cause Gelman himself to identify with the "other side," why they left him Bayesian.

Having said that, I'll also pass along a link to something I wrote, posted on an alternative-investment blog yesterday. http://allaboutalpha.com/blog/2014/07/23/a-challenge-to-bayesian-probability/
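The first point in this comment, that a datum equally consistent with two hypotheses supports neither, can be written out in odds form. This is just Bayes' rule; the symbol $D$ for the datum (another day of survival) is notation added here for illustration, with $H_1$ and $H_2$ as in the comment.

$$
\frac{P(H_1 \mid D)}{P(H_2 \mid D)}
= \frac{P(D \mid H_1)}{P(D \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)}
$$

If $P(D \mid H_1) = P(D \mid H_2)$, the likelihood ratio is 1 and the posterior odds equal the prior odds, so the datum moves belief toward neither hypothesis relative to the other.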

1. I did make the point about the "equally likely contrary hypothesis" in my Superman post!

2. "God, who meant to wipe me out for my impudence years ago, is a forgetful Being." Every day of survival supports H2 as fully as it supports H1 so, on Bayesian grounds it supports neither.

I'm over my head in this discussion, BUT, how is it possible to make a scientific hypothesis about an idea, 'God,' for which there is little or no evidence? Aren't we garbled here before we even start?

3. Phil Koop 12:50 PM

Personally, I find the later essay by Gelman and Shalizi more illuminating (http://www.stat.columbia.edu/~gelman/research/published/philosophy.pdf).

This essay casts considerable doubt on the claim that Gelman does not in fact identify with "the other side". I think it would be more accurate to say that Gelman and Shalizi advocate Bayesian models to be employed by a "hypothetico-deductive" (aka "frequentist") modeler. In their own words, "the value of Bayesian inference as an approach for obtaining statistical methods with good frequency properties."

You might come to the same conclusion by reading the exchange between Mayo and Gelman, but reading Mayo always gives me a headache.

5. Student 10:34 AM

I disagree on a couple of levels.
1.) I don't see that the convention of arbitrarily specifying alpha upfront (1.) provides any benefit to science.
2.) It is not the case that there are no standards for hypothesis testing in bayesian inference.
I know you were specifically avoiding the issue of how the two schools interpret probability but that difference is inseparable from how the social conventions have come to be in the first place. IMHO, there is really no reason anymore to rely on the frequentist approach at all and it provides no benefit to science.

Look, what you are ultimately trying to do with hypothesis testing is to estimate which hypothesis is more probable, given the data, right?
Technically, frequentist hypothesis tests do test hypotheses, given the data. Rather, they test the data, given the hypothesis. Bayesians, on the other hand, test the hypothesis, given the data. That is simply more appropriate and provides many advantages, such as allowing Bayesians to compare model fit even when the various models rely on different methods (how is that not beneficial to science, given that results are most often based on different approaches?).
The differences between the two schools lead to results that can range from virtually identical to quite different. Since Bayesian inference is coherent even in a frequentist sense while frequentist inference is incoherent in a Bayesian sense, a Bayesian approach is always preferable. Again, there is no good reason to be a frequentist anymore. Due to advances in MCMC methods and computers, it is now possible to estimate solutions to even highly complex problems that do not fit the frequentist interpretation of probability but that fit the Bayesian one.

1. What do you see as being the most widely used standards for hypothesis testing in Bayesian inference, just out of curiosity?

2. Student 3:39 PM

oops, I responded below rather than here.

3. RaStudent 12:03 AM

Come on, I expected a response or two. I read this blog for the comments (love the anonymity and smart readers). Am I wrong?

6. Anonymous 12:48 PM

Why I don't like frequentism: in almost all scientific applications, frequentist methods are harder to do correctly than Bayesian analyses. In most cases, this leads to systematically incorrect results. An extremely simple example is fitting a straight line to data where the spread (noise) in the x-values is comparable to the width of the real linear signal. In a frequentist approach you cannot get the right answer without marginalizing over the noise distribution of the x-values. In fact you won't even be close; the slope will be wrong by 20% or more unless you do an extremely difficult integral over the distribution that the x-values are drawn from. And this integral is extremely difficult even if you assume only a simple gaussian prior. If you throw this problem into a bayesian analysis, it is incredibly simple to marginalize over any arbitrary prior, and get the right answer for the slope. I don't like frequentism because most people do it wrong, and doing it right is harder than doing a bayesian analysis.
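The bias this comment describes can be seen in a small simulation. The sketch below is a hedged illustration, not the commenter's own analysis: it assumes Gaussian signal and noise, uses only the Python standard library, and shows what happens to a naive ordinary-least-squares fit when the x-values are measured with noise. It does not speak for every frequentist method, since errors-in-variables estimators exist on that side too. All names and parameter values are made up for the example.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate_naive_fit(true_slope=2.0, signal_sd=1.0, x_noise_sd=1.0,
                       y_noise_sd=0.1, n=20000, seed=0):
    """Fit a line to data whose x-values are observed with noise."""
    rng = random.Random(seed)
    true_x = [rng.gauss(0.0, signal_sd) for _ in range(n)]
    # We only ever see a noisy version of x.
    observed_x = [x + rng.gauss(0.0, x_noise_sd) for x in true_x]
    ys = [true_slope * x + rng.gauss(0.0, y_noise_sd) for x in true_x]
    return ols_slope(observed_x, ys)


def expected_attenuation(signal_sd, x_noise_sd):
    """Classical attenuation factor: var(signal) / (var(signal) + var(noise))."""
    return signal_sd ** 2 / (signal_sd ** 2 + x_noise_sd ** 2)
```

With the x-noise as wide as the signal, the attenuation factor is 0.5, so the naive fit recovers roughly half the true slope; setting `x_noise_sd=0.0` brings the estimate back to about the true value.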

1. Phil Koop 12:54 PM

This is just as broad as it is long; there are also problems that are trivial to solve from a frequentist approach but for which it is extremely difficult to construct well-behaved priors. See Stone's Paradox, for example.

2. Anonymous 1:22 PM

I should obviously state then that most of the problems I solve on a daily basis are much more easily solved correctly with Bayesian analysis. There are certainly scenarios where problems are trivial in either method but very difficult in the other. But the misuse of frequentist p-values is rampant; most studies based on p-values of 5% are probably wrong. http://arxiv.org/abs/1407.5296

I think there is a place and time for both methods. I think one should solve each problem with the simplest method possible, but not simpler (to paraphrase someone else's paraphrasing of Einstein). A significant fraction of frequentist analyses out there in academia today are just plain wrong; that's not to say that their conclusions are wrong or that the effect isn't real, it's just to say that most of the time the authors have no real understanding of their error bars, which causes them to underestimate their p-value or overestimate their significance.

Gelman's stats package "Stan" is gaining popularity because of how simple it makes building a proper Bayesian model. If someone writes a package that allows a user to properly build an arbitrary frequentist model, then it will also be widely used.

7. Student 12:57 PM

first, above I meant "technically, frequentist hypothesis tests do NOT test hypotheses, ..."

I would say the most widely used are Bayes factors, the deviance information criterion, and probability intervals based on the marginal posterior. However, you could conduct hypothesis testing just like frequentists if you wanted by evaluating the posterior a certain way. The only difference would be that frequentists treat the probabilities as fixed long-run constants while Bayesians treat the probabilities as parameters conditional on the data. That aside, you could use non-informative priors, compute pseudo p-values, and conduct hypothesis testing to conform to the frequentist conventions if you really wanted to. I just don't see what's so great about doing that. I mean, significance at the 95% level is OK, but at the 94% level it's a non-finding? I don't see much benefit in that.

8. Thanks Noah: this is all new to me, and I find it very interesting. BTW, yesterday Jason Smith referenced you when he wrote "macro data is uninformative." Can you direct me to where you may have discussed that?

9. Noah, if I rewrite your Feynman quote above slightly:

"It's a kind of integrity amongst economists, a principle of economic thought that corresponds to a kind of utter honesty--a kind of leaning over backwards. For example, if you're evaluating a hypothesis, you should report everything that you think might make it invalid--not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other set of results, and how they worked--to make sure the other fellow can tell they have been eliminated.

Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can--if you know anything at all wrong, or possibly wrong--to explain it."

On a scale of 0 to 10, how well can those modified paragraphs actually be applied to economists? (give your guess for an average "score" averaged over whatever group you like: economists who write blogs, all economists, whatever). Now Ideally, over this same group, where should that number be in your opinion? Offhand it seems to me like it should be at 10, but perhaps you've got a good reason why it shouldn't.

1. BTW, for what it's worth, my very amateurish impression is that the current score (which you can take to be over blog writers, since that's the only way I encounter economists), is about 1. I have very very little sense that this "kind of integrity" has been adopted by economists. I never see blog writer X do multiple posts on all the things that might be wrong with interpreting the evidence as supporting his hypothesis. Quite the opposite actually... no "leaning over backwards" whatsoever. Lol. Of course I'm sure there's much I'm unaware of too.

10. You may underestimate the problem of publication bias. If you believe the meta-analyses of minimum wage research starting with Card & Krueger's AER article, virtually all the findings of major job loss (3% per 10% minwage increase) are GIGO. Another econ area where publication bias seems serious is in the spillover effects from foreign direct investment.

11. Frequentists' greatest triumph was when they proved that all swans are white.

12. IMHO the Bayesian prior serves to quantify the researcher's bias, whereas the frequentist may go on acting as if they have none, all the while it lingers under the surface, undetected by the formal calculus. That Bayesianism is subjective is a virtue because it acknowledges the way that most people actually form judgements and pulls all the subjectivity out into the open.

13. marcel 12:23 PM

"Now, of course, the Frequentist social conventions are weak, inadequate defenses against subjectivism and noise. They have drawbacks, like discouraging the reporting of negative results. And they are subject to being gamed by unscrupulous researchers. But at least they are something."

The last sentence needs tweaking.

"Say what you will about the social conventions of frequentism, dude, at least it's an ethos"

14. Jonathan Goodman 1:46 PM

There are more practical issues pointing toward Bayesian or frequentist approaches. Two negatives for frequentism (?):

1. Frequentist hypothesis testing isn't robust against modeling errors. You ask: "What is the probability of getting this data if H_0 is true?" The answer, if the model is at all complex, is: "almost zero". H_0 is not a single number (mean zero), but a complex model.

2. Frequentism gives little insight into error bars

Pro Bayesian: sampling the posterior gives deep insight into remaining uncertainty, given the data.

Anti Bayesian: the prior is completely made up -- "replacing ignorance with fiction".

15. JohnR 2:19 PM

That's a useful point, Noah - I look at it that I'm more likely to take seriously people who find results that contradict their expectations. Most of us tend to (consciously or not) weight things that support what we like to believe. Some of us make a real effort to stay neutral, but few of us actually raise the bar on what we want to see. That way the normal human tendency to see what we prefer is often enough to lead us into error. It works the other way 'round, too, of course - many of us simply cannot see what we don't wish to. Supposedly, that's one thing statistics was invented to deal with. Being married to a statistician, however, I'm well aware that the human need to bend evidence to support a desired idea is perfectly able to handle statistics. That's why you need to see the raw data whenever you can. After all, trust, but verify...

16. Noah has had this discussion with Gelman before, and Eli riffed off it with Socrates

IEHO you need to have a pretty good idea about the answer to find a useful prior.

17. E.T. Jaynes, "Probability Theory: the logic of science" is surely the most entertaining way to learn Bayesian methods.

I prefer using the Bayes factor as a way of comparing two hypotheses, no bias toward either, given the data. One then uses AIC or even BIC to see if the difference between the two hypotheses is large enough. There are again three possible outcomes: H0 is better; H1 is better; equivalent explanatory power.

1. Dikran Marsupial 9:20 AM

Following the common interpretation of the Bayes factor (http://en.wikipedia.org/wiki/Bayes_factor#Interpretation), we can easily get to the same sort of three-valued logic: if K < 1:3 then it is substantial (in the sense of having some existence) evidence against the hypothesis, if K > 3:1 then it is substantial evidence for the hypothesis, and if it is in between there is no substantial evidence either way.

The real problem with frequentist hypothesis testing is that the underlying logic of the test is subtle and widely misunderstood, which leads to errors of understanding. Bayesian methods are conceptually easy to understand (and hence harder to misunderstand), but hard to apply, whereas frequentist methods are easy to apply, but conceptually hard to understand (and hence easy to misinterpret).

It is easy to find examples where the test is not biased in favour of H0, especially where someone is arguing in favour of H0 (e.g. "no significant global warming since 1998"). Most people don't understand that a lack of a statistically significant trend in global temperatures does not imply that there has been no warming (or even that warming has not continued at the same rate).
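The Bayes-factor version of three-valued logic described at the top of this comment can be sketched in a few lines of Python. This is an illustration under stated assumptions: the lower cutoff is taken to be the reciprocal of the upper one (1:3), the two hypotheses are simple point hypotheses about a coin's heads probability, and every function name and default value is hypothetical.

```python
def coin_bayes_factor(heads, tails, p1=0.7, p0=0.5):
    """Bayes factor K = P(data | H1) / P(data | H0) for two point hypotheses.

    H1 says the heads probability is p1 and H0 says it is p0. The binomial
    coefficient cancels in the ratio, so it is left out.
    """
    likelihood_h1 = (p1 ** heads) * ((1.0 - p1) ** tails)
    likelihood_h0 = (p0 ** heads) * ((1.0 - p0) ** tails)
    return likelihood_h1 / likelihood_h0


def bayes_factor_verdict(k, threshold=3.0):
    """Map a Bayes factor onto three outcomes using a symmetric cutoff."""
    if k <= 0:
        raise ValueError("a Bayes factor must be positive")
    if k > threshold:
        return "substantial evidence for"
    if k < 1.0 / threshold:
        return "substantial evidence against"
    return "no substantial evidence either way"
```

Under these assumptions, 14 heads in 20 flips lands in the "for" bucket, 10 in 20 lands in "against", and 12 in 20 falls in between, which mirrors the frequentist "nothing interesting" outcome.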