## Wednesday, January 28, 2015

### Priors and posteriors

A really wonderful blog post by Stephen Senn, head of the Methodology and Statistics Group at the Competence Center for Methodology and Statistics in Luxembourg, sums up the philosophical problem I've always had with Bayesian inference in scientific studies. Basically, the question is: Where does the prior come from? Senn argues that it can't be your real prior, since you can't quantify your real prior.

Where else could your prior come from? Here's the list I thought of:

1. You could use a "standard" prior that a bunch of other people use because it's "noninformative" in some sense (e.g. a Jeffreys prior). See this Larry Wasserman blog post on some of the potential problems with that approach.

2. You could use some prior that comes from empirical data. This is the foundation of the "empirical Bayes" approach.

3. You could choose a bunch of different priors and see how sensitive the posterior is to the choice of prior. This could be done in a haphazard or a systematic way, and it's not immediately clear if one of those is always better than the other. The drawback of this approach is that it's a bit cumbersome, and hard to interpret.

4. You could choose a prior that is close to the answer you want to get. The less informative your data is, the closer your posterior will stay to your prior. This seems a bit scientifically dishonest. But I bet someone out there has tried it.

5. You can choose an "adversarial prior" that is similar to what you think someone who disagrees with your conclusion would say. (Thanks to Sean J. Taylor of Twitter for pointing this out.)

Have I missed any big ones?

Anyway, as always in stats, there's some element of intuition that can't be incorporated into the estimation in a systematic way.
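
Option 3 in the list above (try several priors and see how much the posterior moves) can be sketched with a conjugate Beta-Binomial model. This is only an illustration: the data (7 successes in 10 trials, then 70 in 100) and the four priors are made up, and the posterior mean is just one of many summaries you could compare.

```python
# Candidate Beta(a, b) priors on a success probability (illustrative choices).
PRIORS = {
    "flat Beta(1, 1)": (1.0, 1.0),
    "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
    "skeptical Beta(2, 8)": (2.0, 8.0),
    "optimistic Beta(8, 2)": (8.0, 2.0),
}


def posterior_means(successes, trials):
    """Posterior mean under each prior; the posterior is Beta(a + k, b + n - k)."""
    return {
        name: (a + successes) / (a + b + trials)
        for name, (a, b) in PRIORS.items()
    }


def spread(means):
    """Gap between the largest and smallest posterior mean across priors."""
    return max(means.values()) - min(means.values())


small = posterior_means(7, 10)     # hypothetical small study
large = posterior_means(70, 100)   # same success rate, ten times the data

for label, means in (("n = 10", small), ("n = 100", large)):
    print(label)
    for name, mean in means.items():
        print(f"  {name:>24}: posterior mean = {mean:.3f}")
    print(f"  spread across priors = {spread(means):.3f}")
```

In this toy setup the spread across priors shrinks as the sample grows, which is the same effect option 4 relies on in reverse: with weak data, the prior does most of the talking.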

Anyway, Andrew Gelman, one of the high priests of Bayesianism, so to speak, had this to say about Senn's post:

> I agree with Senn’s comments on the impossibility of the de Finetti subjective Bayesian approach. As I wrote in 2008, if you could really construct a subjective prior you believe in, why not just look at the data and write down your subjective posterior. The immense practical difficulties with any serious system of inference render it absurd to think that it would be possible to just write down a probability distribution to represent uncertainty. I wish, however, that Senn would recognize “my” Bayesian approach (which is also that of John Carlin, Hal Stern, Don Rubin, and, I believe, others). De Finetti is no longer around, but we are!
>
> I have to admit that my own Bayesian views and practices have changed. In particular, I resonate with Senn’s point that conventional flat priors miss a lot and that Bayesian inference can work better when real prior information is used. Here I’m not talking about a subjective prior that is meant to express a personal belief but rather a distribution that represents a summary of prior scientific knowledge. Such an expression can only be approximate (as, indeed, assumptions such as logistic regressions, additive treatment effects, and all the rest, are only approximations too), and I agree with Senn that it would be rash to let philosophical foundations be a justification for using Bayesian methods. Rather, my work on the philosophy of statistics is intended to demonstrate how Bayesian inference can fit into a falsificationist philosophy that I am comfortable with on general grounds.

Cool.

Update: Thinking about it a little more, I don't think Senn's point really has any implications for study design. But it does seem to have implications for how a "client" (or reader of a paper) should treat a "Bayesian" researcher's results. Basically, a researcher doing Bayesian inference is not the same as a Bayesian agent in a model. A Bayesian agent in a model always uses her own prior, and thus always uses information optimally. A researcher doing Bayesian inference cannot use his own prior, and so may not be using information optimally. So using Bayesian inference shouldn't be a free ticket to respectability for research results.

1. If you can't formulate a prior, you don't understand the implications of your model, and have no business estimating it.

1. Sure you can formulate *a* prior. Obviously you have to do so in order to carry out a Bayesian estimation. But what Senn and Gelman are saying is that you can't formulate *your own subjective prior*, like a Bayesian agent does.

2. My alternate answer is that the prior comes from the same place that the likelihood comes from.

2. Anonymous, 9:51 PM

3. You could choose a bunch of different priors and see how sensitive the posterior is to the choice of prior.

Bootstrapping?

3. I don't understand what Gelman means by this,
"my work on the philosophy of statistics is intended to demonstrate how Bayesian inference can fit into a falsificationist philosophy that I am comfortable with on general grounds."

The term "falsificationist philosophy" has an alarmingly glib connotation. Please tell me why I a wrong, for peace of mind?

1. Oops, that should be "...why I AM wrong" (terribly embarrassing sentence in which to have a typo).

Also, to be on topic, Deborah Mayo has an entire series of posts about that Normal Deviate here:
http://errorstatistics.com/?s=deconstructing+larry+wasserman

4. Excellent post! And some radical reconception of the whole field (relativity, quantum mechanics) should force a re-evaluation of priors, not a Bayesian updating! If you look at Bayes's work itself, it is clear he is talking about situations in which we understand the experimental setup well enough to set priors!

5. Noah, have you come across the [maximum entropy principle](http://en.wikipedia.org/wiki/Principle_of_maximum_entropy) for choosing priors advocated by E.T. Jaynes?

6. Who is the cutie in the "nested statistician/ hall of mirrors" image? Is that Stephen Senn? I'm not certain, as the linked post had this rather different 2012 photograph:
https://errorstatistics.files.wordpress.com/2012/01/photo.jpg

7. Corkscrew, 4:48 AM

So... is Senn's blog post basically saying that frequentist approaches are what we use when we can't agree on a Bayesian prior? This makes considerable sense to me, and it does indeed bridge the gap between Bayesian inference and frequentist hypothesis testing quite nicely.

David: If you're looking at approaches to selecting priors, you might be interested in Solomonoff induction. I'm pretty sure that the principle of maximum entropy, or something very similar, could be deduced as a special case of this.

8. Phil Koop, 7:56 AM

"I don't think Senn's point really has any implications for study design"

That is obviously not true, because his view is not compatible with the strong likelihood principle. Maybe you never accepted SLP in the first place, but Bayesians of the sort to whom he refers mostly do.

9. Every few weeks Noah goes back to Bayes! I thought my links to Prof. Brown at Trinity University amplified that priors don't matter as long as...

3.2 Washing Out of the Priors
The idea that P(T) could be based on a mere hunch may seem unsettling. After all,
different people may have very different hunches about the truth of a theory, and so may
begin the process with very different values for P(T)! In a way, however, this does not
matter very much. This is because of the phenomenon of the washing out of the priors.
If you and I begin with very different evaluations of T, but we agree on P(E|T) and
P(E|¬T), then our posterior probabilities will get closer and closer to each other the
more evidence we investigate. In the long run, we will end up with the same assessment
of T even if we started out with very different guesses.

The operating constraint is:

...If you and I begin with very different evaluations of T, but we agree on P(E|T) and
P(E|¬T),

If Noah did his research on the web, in the comment sections of his own blog, found Prof. Brown's article, and sat down in one place to read the 9-page paper, he would not have these posterior farts. Better yet, just call up Prof. Brown and learn.

1. Washing out the prior happens when the data are so informative that the likelihood function converges to a sharp Gaussian. In this situation, frequentist and Bayesian answers basically always agree. It's the fact that this doesn't always happen that makes the whole question of which approach to use an interesting and important one.
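
The washing-out effect quoted above can be sketched with two people who apply Bayes' rule to the same stream of evidence. Everything numeric here is an assumption for illustration: the starting values of P(T), the shared likelihoods P(E|T) = 0.8 and P(E|¬T) = 0.4, and the simplification that evidence E is observed on every trial.

```python
def update(prior, p_e_given_t, p_e_given_not_t):
    """One application of Bayes' rule after observing evidence E."""
    numerator = p_e_given_t * prior
    return numerator / (numerator + p_e_given_not_t * (1.0 - prior))


# Likelihoods that both people agree on (illustrative numbers).
P_E_GIVEN_T = 0.8
P_E_GIVEN_NOT_T = 0.4

skeptic, believer = 0.05, 0.95  # very different starting values of P(T)
gaps = []
for step in range(1, 21):
    skeptic = update(skeptic, P_E_GIVEN_T, P_E_GIVEN_NOT_T)
    believer = update(believer, P_E_GIVEN_T, P_E_GIVEN_NOT_T)
    gaps.append(believer - skeptic)
    if step in (1, 5, 10, 20):
        print(f"after {step:2d} observations: "
              f"skeptic = {skeptic:.4f}, believer = {believer:.4f}")
```

The convergence depends entirely on the two people agreeing on the likelihoods. If they disagree about P(E|T) or P(E|¬T), or if the evidence is sparse, nothing forces the gap to close.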

2. I am not a Bayesian fan. I think Bayes wanted to convert nonbelievers into believers, and what a way to do that. One can make a guess at P(T) (believe or not believe), but one cannot refute the miracles P(E|T)! And if one keeps believing in miracles, we can easily recruit folks to do crazy things, like chopping heads, burning people alive, etc. Once there, it does not matter what the likelihoods are.

10. Do a post about what happens to money when markets crash. Does a conservation of wealth exist?

1. Let's make a deal. I'll do that if you stop constantly showing up here and at BV and posting "race and intelligence" crap! Deal?

11. Anonymous, 11:31 AM

Whenever Frequentists comment on Bayes, it tells you something about Frequentists and nothing about Bayes.

Drop the idea that probabilities are frequencies. Real world frequencies are physical Facts, like the temperature of the sun's surface. They can be predicted, estimated and measured.

Probabilities are tools for modeling uncertainty. If our Evidence doesn't determine some Fact exactly then there's some uncertainty which we model using P(Fact | Evidence). Our evidence may not tell us the exact temp of the sun's surface, but it gives us a range of reasonable values that are consistent with everything we do know about it.

Prob distros model this uncertainty by placing their high probability region (probability mass) over the region of reasonable values. So if the sun's surface temp is reasonably between 5000 K and 6000 K, then you could use a N(5500, 500) to model that uncertainty. A Bayesian Credibility Interval formed from N(5500, 500) would contain the true temp. The smaller the interval, the less uncertainty, naturally.

The goal in modeling uncertainty is to find consequences which are true for almost every reasonable value of the true temp. If you use that N(5500, 500) to determine that some “hypothesis A” has very high probability, then that means A holds true for almost every potential temp in the range 5000 K to 6000 K. Since the true temperature is in that range, that's strong reason to think A is actually true.

The best way to think of it is as a very intuitive and simple sensitivity analysis. Basically, the high probability of A shows that hypothesis A is insensitive to the exact temperature as long as it's in the high probability region of N(5500, 500).
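
To make the N(5500, 500) example concrete, here is a small sketch using Python's standard library. It assumes the second parameter is a standard deviation of 500 K (the comment doesn't say whether it is a standard deviation or a variance), and "hypothesis A" is a made-up stand-in: that the temperature exceeds 4500 K.

```python
from statistics import NormalDist

# Assumption: N(5500, 500) means mean 5500 K and standard deviation 500 K.
temp = NormalDist(mu=5500, sigma=500)

# How much probability mass sits inside the "reasonable" range?
mass_in_range = temp.cdf(6000) - temp.cdf(5000)
print(f"P(5000 K < T < 6000 K) = {mass_in_range:.3f}")

# A central 95% credible interval from the same distribution.
lower, upper = temp.inv_cdf(0.025), temp.inv_cdf(0.975)
print(f"central 95% interval: {lower:.0f} K to {upper:.0f} K")

# Hypothetical "hypothesis A": the temperature is above 4500 K.
p_a = 1.0 - temp.cdf(4500)
print(f"P(A) = {p_a:.3f}")
```

Under that reading, the 5000 K to 6000 K range is one standard deviation either side of the mean, so it holds roughly two thirds of the mass. A 95% interval is noticeably wider than the range the commenter calls reasonable.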

So to select prob distributions take what you know and find a distribution that covers the right area. For example, if you're trying to create a prior used to search for a downed aircraft, you could use the max radius of travel based on how much fuel they had and place a uniform distribution over the entire disk. That way the true location is bound to be within that area. That defines your overall search region before considering anything else.
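
The search-area prior described above can be sketched as a uniform distribution over a disk. The 300 km maximum range is a hypothetical number, and the function name `sample_uniform_disk` is mine. A real search prior would account for much more than fuel.

```python
import math
import random


def sample_uniform_disk(max_radius, rng):
    """Draw one (x, y) point uniformly from a disk centred on the last known position."""
    # The square root makes the density uniform over area, not over radius.
    r = max_radius * math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    return r * math.cos(theta), r * math.sin(theta)


MAX_RADIUS_KM = 300.0  # hypothetical fuel-limited range
rng = random.Random(0)  # fixed seed so the sketch is repeatable
points = [sample_uniform_disk(MAX_RADIUS_KM, rng) for _ in range(20_000)]

# Under a uniform prior, the inner half-radius disk has a quarter of the area,
# so it should hold about a quarter of the prior mass.
inner_fraction = sum(
    math.hypot(x, y) <= MAX_RADIUS_KM / 2 for x, y in points
) / len(points)
print(f"fraction within half the max radius: {inner_fraction:.3f}")
```

Sampling the radius without the square root would pile points up near the centre, which quietly turns a "uniform" prior into an informative one.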

Alternatively, you could ask an expert for their intuition and create a prior out of that. Of course, if the true location isn't in the high probability region of the expert's prior, then the prior is “wrong” and will lead to a failed search effort.

Generally, creating prob distros is a competition between two competing goals. You want to make the distribution as “informative” as possible about the true value, but if you shrink its high probability area (region of plausible values) smaller than your evidence allows, then you run the serious risk of missing the true value.

Note, there is no difference between sampling distributions and priors. They both model uncertainty in the exact same way. When you use a NIID model for errors you're not saying that future errors will have a NIID frequency histogram. Statisticians almost never have that kind of knowledge and most of the time it's not true or even meaningful.

What you're really doing is defining a region of plausible values for the one set of errors that actually exist in the data you actually took. You're saying the data you just measured had an error vector which resides somewhere in the high probability region of the NIID (an n-sphere with a radius of a few standard deviations). As long as that's where those errors actually are, everything is fine; future unrealized errors are irrelevant and can be absolutely anything.

That's why NIID assumptions work better than they “should”. Errors of real measuring devices almost never have frequencies that look NIID over a long period of time. But they don't have to, because that's not what that assumption means and it's not what determines success!
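
A quick simulation gives a feel for where the high probability region of an NIID error model sits. This is a sketch under assumed values (standard deviation 1, 10,000 measurements, a fixed seed), not anything taken from the comment itself.

```python
import math
import random

SIGMA = 1.0   # assumed error standard deviation
N = 10_000    # assumed number of measurements
rng = random.Random(1)  # fixed seed so the sketch is repeatable

# One realised error vector drawn from the NIID model.
errors = [rng.gauss(0.0, SIGMA) for _ in range(N)]

# Length of that vector, compared with sigma * sqrt(n).
norm = math.sqrt(sum(e * e for e in errors))
ratio = norm / (SIGMA * math.sqrt(N))
print(f"||errors|| / (sigma * sqrt(n)) = {ratio:.3f}")
```

For independent normal errors the length of the error vector concentrates near sigma times the square root of n, so the "radius" of the plausible region grows with the number of measurements rather than staying at a few standard deviations.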

P.S. I respect Senn, but his description of Jaynes's views in that paper was so wrong he lost all credibility when it comes to identifying real Bayesians.