The second approach to an agent’s self-assessment was presented by Philip Dawid in his “The Well-Calibrated Bayesian” (1982 — much discussed at the time).
The idea of calibration as an assessment of probabilities is now, I think, a very familiar one. If a weather forecaster gives his probability for rain tomorrow as 80%, then the next day it either rains or not, so what? The obvious response is that the forecaster can only be graded on performance over a period of time. We say that the forecasting is perfectly calibrated over a given year, say, if it rains on 80% of the days for which the forecast was ‘chance of rain is 80%’; and similarly for all the other announced numbers. Scores less than perfect can be appropriately defined.
Calibration is not a proper scoring rule, but it is still an important, and straightforward, assessment of ‘right prediction’.
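Before going further, it may help to see the bookkeeping behind such a check. Here is a minimal sketch (the function name, the probabilities, and the outcome record are all made up for illustration) of grading a forecast record for calibration:

```python
# Minimal sketch: group days by announced probability and compare each
# announced value with the observed frequency of rain in that group.
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """For each announced probability, the observed frequency of rain (1 = rain, 0 = no rain)."""
    buckets = defaultdict(list)
    for p, rained in zip(forecasts, outcomes):
        buckets[p].append(rained)
    return {p: sum(days) / len(days) for p, days in buckets.items()}

# A perfectly calibrated (made-up) record: on the 0.8-days it rained 4 times
# out of 5, on the 0.3-days it rained 3 times out of 10.
forecasts = [0.8] * 5 + [0.3] * 10
outcomes  = [1, 1, 1, 1, 0] + [0, 1, 0, 0, 0, 0, 1, 0, 1, 0]
print(calibration_table(forecasts, outcomes))   # {0.8: 0.8, 0.3: 0.3}
```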
To give us a hold on this notion, let's first establish that there is in principle a way the forecaster can proceed so that the forecasting will be perfectly calibrated. The idea is simple: the forecast probabilities must match the actual relative frequencies. Since the forecaster does not have clairvoyance or a crystal ball, this could only be partly due to knowledge of past statistics and would have to involve a good measure of luck; but the point is that it is possible. And it will provide us with a stepping stone to self-assessment by the forecaster.
So imagine this. Just after midnight the forecaster draws on the meteorological data and classifies the new day as belonging to a reference class B. This reference class is one of a family of classes into which the forecaster (or model) sorts days on the basis of incoming data. Together they form the class X, a partition of all the possible ways the day can be. Let's say that for day x he chooses reference class Bx, and he announces the probability of, say, rain for this day in the form
Px(rain)
What will this number be? Well, let m be the measure (the size, the number of days) of a given part of the relevant range of days. This is a simple additive function, and the actual proportion of rainy days in reference class Bx, defined in the usual way, is
m(rain | Bx)
Now, on the dream scenario in which the forecaster has (by luck or insight) latched onto the actual proportions in his reference classes, his announcement for the day will be
Px(rain) = m(rain | Bx)
for every day x on which he makes an announcement at all.
And we ask immediately: in that case, what is the proportion of rainy days, among those days on which the forecast probability of rain = r?
The answer, by a little theorem about perfect calibration, is:
m(rain | {x: Px(rain) = r}) = r
and similarly of course for every other proposition subject to forecast (hail, storm, temperature above 80, what have you) and for every number that is ever given as a forecast probability.
(For proof see Appendix below.)
So this would be success: perfect calibration. No surprise here! For this is success on a dream scenario in which the forecaster is latching onto the actual relative frequencies.
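For readers who want to see numbers, here is a small check of the little theorem on an invented partition (the class sizes and rain proportions below are made up; the point is only that the identity falls out of the arithmetic):

```python
# Numerical check of the little theorem on a made-up partition of 100 days.
from fractions import Fraction as F

# Each reference class has a size m(B) and an actual proportion m(rain | B).
classes = [
    {"size": 20, "rain_prop": F(4, 5)},   # 16 rainy days
    {"size": 30, "rain_prop": F(4, 5)},   # 24 rainy days
    {"size": 50, "rain_prop": F(1, 5)},   # 10 rainy days
]

# Dream scenario: on each day x in class B the forecaster announces
# Px(rain) = m(rain | B).  Check that m(rain | A(r)) = r for every announced r.
for r in sorted({c["rain_prop"] for c in classes}):
    A_r = [c for c in classes if c["rain_prop"] == r]       # classes making up A(r)
    rainy = sum(c["size"] * c["rain_prop"] for c in A_r)    # m(rain & A(r))
    total = sum(c["size"] for c in A_r)                     # m(A(r))
    print(f"announced r = {r}:  m(rain | A(r)) = {rainy / total}")
# announced r = 1/5:  m(rain | A(r)) = 1/5
# announced r = 4/5:  m(rain | A(r)) = 4/5
```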
We ask now for a self-assessment: what is the probability that your forecast probabilities will match the actual proportions of rain in the days for which you make forecasts?
It is actually quite clear how the agent can, and has to, answer this question. Without a crystal ball, he does not have direct access to the proportion m(rain | Bx), but he has his own estimate of it, his subjective probability P(rain | Bx), and that is just Px(rain), his announcement on day x.
And here is the point: the proof of the little theorem about perfect calibration assumes only that m is an additive function. Subjective probability P is also an additive function! So the entire argument goes through with P replacing m.
Conclusion: in reply to the above request for self-assessment the probabilist forecaster proves:
P(rain | {x: Px(rain) = r}) = r
In other words, by his self-assessment, which is to say, by his own lights, this forecasting is perfectly calibrated.
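To see that nothing beyond additivity is doing the work, one can rerun the same little computation with the forecaster's own credences in place of the actual class sizes. In the sketch below the weights P(B) are arbitrary made-up numbers; the conditional probability still comes out r:

```python
# Same computation with subjective weights P(B) in place of actual class sizes
# (the numbers are arbitrary; only the additivity of P is used).
from fractions import Fraction as F

P_rain_given_B = {"B1": F(4, 5), "B2": F(4, 5), "B3": F(1, 5)}    # announced Px(rain) per class
P_B            = {"B1": F(1, 10), "B2": F(6, 10), "B3": F(3, 10)} # his credences in the classes

r = F(4, 5)
A_r = [B for B, p in P_rain_given_B.items() if p == r]            # classes making up A(r)
P_rain_and_Ar = sum(P_B[B] * P_rain_given_B[B] for B in A_r)      # P(rain & A(r))
P_Ar = sum(P_B[B] for B in A_r)                                   # P(A(r))
print(P_rain_and_Ar / P_Ar)   # 4/5, i.e. r, whatever weights P(B) he assigns
```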
But this self-assessment is truly and amazingly orgulous!
We expect that, again, the response to the non-plussed frequentist will be: so, should I assess myself by someone else’s lights? Of course, someone with different opinions from mine will think that I won’t be well calibrated … a trivial truth!
But how can the probabilist square this orgulity with the simple, common-sense insight that there are infinitely many ways in which any person's forecasting can fail to match the actual frequencies? Or with the knowledge that even expert weather forecasters, using advanced models and data gathering procedures, are not in fact perfectly calibrated?
But on the other hand, how could one consistently say that by one’s own lights one is well calibrated but may actually not be?
We haven’t even gotten to Philip Dawid’s paper yet! But now we are well prepared, so that is the subject for the next post.
Appendix. Proof of the little theorem about perfect calibration
We are interested in the set of days A(r) = {x: Px(rain) = r} on which the forecast probability of rain was r. By hypothesis that includes each and every day x which belongs to a reference class B such that m(rain | B) = r. So that set is precisely the union of all those reference classes; let us call them B1, …, Bk:
A(r) = B1 v B2 v … v Bk
where I am using “v” as the sign for set union and will use “&” for set intersection (due to type limitations here). Thus
m(rain & A(r)) = m( (rain & B1) v … v (rain & Bk) )
= m(rain & B1) + … + m(rain & Bk)     (by additivity, since the Bi are disjoint)
= m(B1)m(rain | B1) + … + m(Bk)m(rain | Bk)
= r[ m(B1) + … + m(Bk) ]
= r[ m(B1 v … v Bk) ]
= r m(A(r))
that is to say:
m(rain & A(r)) = r m(A(r))
that is, m(rain & A(r)) divided by m(A(r)) equals r, which means
m(rain | A(r)) = r,
or equivalently, and recalling what A(r) is:
m(rain | {x: Px(rain) = r} ) = r
provided of course that the denominator does not equal zero.
Only the additivity of function m is important to this proof (given that the reference classes form a partition of the space). The proof would go through as well for a denumerable partition and a sigma-additive function.
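A concrete (made-up) instance of the chain of equalities: suppose A(r) is the union of just two classes with m(B1) = 20, m(B2) = 30, and m(rain | B1) = m(rain | B2) = 0.8 = r. Then m(rain & A(r)) = 20(0.8) + 30(0.8) = 40 = 0.8[20 + 30] = r m(A(r)), so that m(rain | A(r)) = 40/50 = 0.8 = r, as the theorem says.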