The second approach to an agent’s self-assessment was presented by Philip Dawid in his “The Well-Calibrated Bayesian” (1982 — much discussed at the time).
The idea of calibration as an assessment of probabilities is now, I think, a very familiar one. If a weather forecaster gives his probability for rain tomorrow as 80%, then the next day it either rains or not, so what? The obvious response is that the forecaster can only be graded on performance over a period of time. We say that the forecasting is perfectly calibrated over a given year, say, if it rains on 80% of the days for which the forecast was ‘chance of rain is 80%’; and similarly for all the other announced numbers. Scores less than perfect can be appropriately defined.
Calibration is not a proper scoring rule, but it is still an important, and straightforward, assessment of ‘right prediction’.
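Before going further, it may help to see the bookkeeping behind such a check. Here is a minimal sketch (the function name, the probabilities, and the outcome record are all made up for illustration) of grading a forecast record for calibration:

```python
# Minimal sketch: group days by announced probability and compare each
# announced value with the observed frequency of rain in that group.
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """For each announced probability, the observed frequency of rain (1 = rain, 0 = no rain)."""
    buckets = defaultdict(list)
    for p, rained in zip(forecasts, outcomes):
        buckets[p].append(rained)
    return {p: sum(days) / len(days) for p, days in buckets.items()}

# A perfectly calibrated (made-up) record: on the 0.8-days it rained 4 times
# out of 5, on the 0.3-days it rained 3 times out of 10.
forecasts = [0.8] * 5 + [0.3] * 10
outcomes  = [1, 1, 1, 1, 0] + [0, 1, 0, 0, 0, 0, 1, 0, 1, 0]
print(calibration_table(forecasts, outcomes))   # {0.8: 0.8, 0.3: 0.3}
```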
To give us a hold on this notion, let's first establish that there is in principle a way the forecaster can proceed so that the forecasting will be perfectly calibrated. The idea is simple: the forecast probabilities must match the actual relative frequencies. Since the forecaster does not have clairvoyance or a crystal ball, this could only be partly due to knowledge of past statistics and would have to involve a good measure of luck; but the point is that it is possible. And it will provide us with a stepping stone to self-assessment by the forecaster.
So imagine this. Just after midnight the forecaster draws on the meteorological data and classifies the new day as belonging to a reference class B. This reference class is one of a family of classes into which the forecaster (or model) sorts days on the basis of incoming data. Together they form the class X, a partition of all the possible ways the day can be. Let's say that for day x he chooses reference class Bx, and he announces the probability of, say, rain for this day in the form
Px(rain)
What will this number be? Well, let m be the measure (the size, the number of days) of a given part of the relevant range of days. This is a simple additive function, and the actual proportion of rainy days in reference class Bx, defined in the usual way, is
m(rain | Bx)
Now, on the dream scenario in which the forecaster has (by luck or insight) latched onto the actual proportions in his reference classes, his announcement for the day will be
Px(rain) = m(rain | Bx)
for every day x on which he makes an announcement at all.
And we ask immediately: in that case, what is the proportion of rainy days, among those days on which the forecast probability of rain = r?
The answer, by a little theorem about perfect calibration, is:
m(rain | {x: Px(rain) = r}) = r
and similarly of course for every other proposition subject to forecast (hail, storm, temperature above 80, what have you) and for every number that is ever given as a forecast probability.
(For proof see Appendix below.)
So this would be success: perfect calibration. No surprise here! For this is success on a dream scenario in which the forecaster is latching onto the actual relative frequencies.
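For readers who want to see numbers, here is a small check of the little theorem on an invented partition (the class sizes and rain proportions below are made up; the point is only that the identity falls out of the arithmetic):

```python
# Numerical check of the little theorem on a made-up partition of 100 days.
from fractions import Fraction as F

# Each reference class has a size m(B) and an actual proportion m(rain | B).
classes = [
    {"size": 20, "rain_prop": F(4, 5)},   # 16 rainy days
    {"size": 30, "rain_prop": F(4, 5)},   # 24 rainy days
    {"size": 50, "rain_prop": F(1, 5)},   # 10 rainy days
]

# Dream scenario: on each day x in class B the forecaster announces
# Px(rain) = m(rain | B).  Check that m(rain | A(r)) = r for every announced r.
for r in sorted({c["rain_prop"] for c in classes}):
    A_r = [c for c in classes if c["rain_prop"] == r]       # classes making up A(r)
    rainy = sum(c["size"] * c["rain_prop"] for c in A_r)    # m(rain & A(r))
    total = sum(c["size"] for c in A_r)                     # m(A(r))
    print(f"announced r = {r}:  m(rain | A(r)) = {rainy / total}")
# announced r = 1/5:  m(rain | A(r)) = 1/5
# announced r = 4/5:  m(rain | A(r)) = 4/5
```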
We ask now for a self-assessment: what is the probability that your forecast probabilities will match the actual proportions of rain in the days for which you make forecasts?
It is actually quite clear how the agent can, and has to, answer this question. Without a crystal ball, he does not have direct access to the proportion m(rain | Bx), but he has his own estimate of it, his subjective probability P(rain | Bx), and that is just Px(rain), his announcement on day x.
And here is the point: the proof of the little theorem about perfect calibration assumes only that m is an additive function. Subjective probability P is also an additive function! So the entire argument goes through with P replacing m.
Conclusion: in reply to the above request for self-assessment the probabilist forecaster proves:
P(rain | {x: Px(rain) = r}) = r
In other words, by his self-assessment, which is to say, by his own lights, this forecasting is perfectly calibrated.
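To see that nothing beyond additivity is doing the work, one can rerun the same little computation with the forecaster's own credences in place of the actual class sizes. In the sketch below the weights P(B) are arbitrary made-up numbers; the conditional probability still comes out r:

```python
# Same computation with subjective weights P(B) in place of actual class sizes
# (the numbers are arbitrary; only the additivity of P is used).
from fractions import Fraction as F

P_rain_given_B = {"B1": F(4, 5), "B2": F(4, 5), "B3": F(1, 5)}    # announced Px(rain) per class
P_B            = {"B1": F(1, 10), "B2": F(6, 10), "B3": F(3, 10)} # his credences in the classes

r = F(4, 5)
A_r = [B for B, p in P_rain_given_B.items() if p == r]            # classes making up A(r)
P_rain_and_Ar = sum(P_B[B] * P_rain_given_B[B] for B in A_r)      # P(rain & A(r))
P_Ar = sum(P_B[B] for B in A_r)                                   # P(A(r))
print(P_rain_and_Ar / P_Ar)   # 4/5, i.e. r, whatever weights P(B) he assigns
```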
But this self-assessment is truly and amazingly orgulous!
We expect that, again, the response to the non-plussed frequentist will be: so, should I assess myself by someone else’s lights? Of course, someone with different opinions from mine will think that I won’t be well calibrated … a trivial truth!
But how can the probabilist square this orgulity with the simple, common-sense insight that there are infinitely many ways in which any person's forecasting can fail to match the actual frequencies? Or with the knowledge that even expert weather forecasters, using advanced models and data gathering procedures, are not in fact perfectly calibrated?
But on the other hand, how could one consistently say that by one’s own lights one is well calibrated but may actually not be?
We haven’t even gotten to Philip Dawid’s paper yet! But now we are well prepared, so that is the subject for the next post.
Appendix. Proof of the little theorem about perfect calibration
We are interested in the set of days A(r) = {x: Px(rain) = r} on which the forecast probability of rain was r. By hypothesis that includes each and every day x which belongs to a reference class B such that m(rain | B) = r. So that set is precisely the union of all those reference classes; let us call them B1, …, Bk:
A(r) = B1 v B2 v … v Bk
where I am using “v” as the sign for set union and will use “&” for set intersection (due to type limitations here). Thus
m(rain & A(r)) = m( (rain & B1) v … v (rain & Bk) )
= m(rain & B1) + … + m(rain & Bk)     (by additivity, since the Bi are disjoint)
= m(B1)m(rain | B1) + … + m(Bk)m(rain | Bk)
= r[ m(B1) + … + m(Bk) ]
= r[ m(B1 v … v Bk) ]
= r m(A(r))
that is to say:
m(rain & A(r)) = r m(A(r))
that is, m(rain & A(r)) divided by m(A(r)) equals r, which means
m(rain | A(r)) = r,
or equivalently, and recalling what A(r) is:
m(rain | {x: Px(rain) = r} ) = r
provided of course that the denominator does not equal zero.
Only the additivity of function m is important to this proof (given that the reference classes form a partition of the space). The proof would go through as well for a denumerable partition and a sigma-additive function.
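A concrete (made-up) instance of the chain of equalities: suppose A(r) is the union of just two classes with m(B1) = 20, m(B2) = 30, and m(rain | B1) = m(rain | B2) = 0.8 = r. Then m(rain & A(r)) = 20(0.8) + 30(0.8) = 40 = 0.8[20 + 30] = r m(A(r)), so that m(rain | A(r)) = 40/50 = 0.8 = r, as the theorem says.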