We tend to have wrong beliefs about many things. The criteria for having a belief do not stop at introspection and so we may be wrong also about what beliefs we have. We are not fully self-transparent, and so it may not be right to blame us for such mistakes.
But it is still appropriate to point out debilitating forms of error, just as we would for a distracted or forgetful accountant. After all, the success of our practical projects may depend on the beliefs we had to begin with.
A Criterion for Minimal Consistency
As a most minimal norm, beyond mere logical consistency, I would propose this:
our belief content should not include any avowal of something we have definitely disavowed.
We can avow just by asserting, but to disavow we need to use a word that signifies belief in some way. For example, to the question: “Is it raining?”, you can just say yes. But if you do want to demur, without giving information that you may not have, the least you must do is to say “I don’t believe that it is raining”.
Definition. The content of someone’s beliefs is B-inconsistent if it includes some proposition p and also the proposition that one does not believe that p.
B-consistency is just its opposite.
I am modeling belief content as a set of propositions, and minimally consistent belief contents are B-consistent sets of propositions. I will also take it that belief can be represented as a modal operator on propositions: Bp is the proposition that encodes, for the agent, the proposition that s/he believes that p.
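To make the definition concrete, here is a minimal sketch in Python. The string encoding of propositions and of the belief operator is my own illustration, not a proposal about the logic:

```python
# A minimal sketch: propositions are strings, and the agent's belief
# operator B is encoded by the prefix "B:" (so "~B:p" is the disavowal of p).

def disavows(content, p):
    """True if the content includes the proposition that one does not believe p."""
    return ("~B:" + p) in content

def b_inconsistent(content):
    """B-inconsistent: the content includes some p together with the
    proposition that one does not believe that p."""
    return any(disavows(content, p)
               for p in content if not p.startswith(("B:", "~B:")))

beliefs = {"it is raining", "~B:it is raining"}
print(b_inconsistent(beliefs))  # True: p is avowed while belief in p is disavowed
```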
Normal Updating
Now the study of belief systems has often focused on problems of consistency for updating policies. Whatever you currently believe, it may happen that you learn, or have warrant to add, or just an impulse to add, a new belief. That would be a proposition that you have not believed theretofore. The updating problem is to do so without landing in some inconsistency. That is not necessarily easy, since the reason you did not believe it theretofore may well be that you had contrary beliefs. So there is much thought and literature about when such a new belief can just be added, and when not, and if not, what to do.
However, responses to the updating problem generally begin by mapping out a safe ground, where the new belief can just be added. Under what conditions is that unproblematic?
A typical first move is just to require consistency: that is, if a new proposition p is consistent with (consistent) belief content B then adding p to B yields (perhaps part of) a (consistent) belief content. I think we had better be more conservative, and so we should require that the prior beliefs include an explicit disavowal of any belief both of p and of its contraries.
So here is a modest proposal for when a new belief can just be added without courting inconsistency of any sort:
Thesis. If a belief system meets all required criteria of consistency, and it includes disavowal of both p and not-p, then the result of adding p, while removing its disavowal, does not violate those criteria of consistency.
We might think of the Thesis as articulating the condition for a system of belief to be updatable in the normal way under the best of circumstances.
A pertinent example then goes like this:
I have no idea whether or not it is now raining in Peking. I do not have the belief that it is so, nor the belief that it is not so. For all I know or believe, it is raining there, or it is not raining there, I have no idea.
The Thesis then implies that if I were to add that it is raining in Peking to my beliefs (whether with or without warrant) the result would not be to make me inconsistent in any pertinent sense.
The Dilemma
But now we have a problem. In that example, I have expressed my belief that I do not believe that it is raining in Peking – that part is definite. But whether it is raining in Peking, about that I have no beliefs. Let’s let p be the proposition that it is raining in Peking. In that case it is clear that I neither believe nor disbelieve the following conjunction:
p & ~Bp
So according to the Thesis I can add this to my beliefs, while removing its disavowal, and remain consistent.
But it will take me only one step to see that I have landed myself in B-inconsistency. For surely I believe this conjunction only if I believe both conjuncts. I will be avowing something that I explicitly disavow.
Dilemma: should we accept B-consistency as a minimal consistency criterion for belief, or should we insist that a good system of beliefs must be one that is updatable in the normal way, when it includes nothing contrary, and even disavows anything contrary, to the new information to be added?
(It may not need mentioning, but this dilemma appears when we take into account instances of Moore’s Paradox.)
Parallel for Subjective Probability Conditionalization
If we represent our opinion by means of a subjective probability function, then (full) belief corresponds to probability 1. Lack of both full belief and full disbelief corresponds to probability strictly between 0 and 1.
Normal updating of a prior probability function P, when new information E is obtained, consists in conditionalizing P on E. That is to say, the posterior probability function will be P′ = P( · |E), defined by P′(A) = P(A|E) = P(A ∩ E)/P(E), provided P(E) > 0.
So this is always the normal update, whenever one has no full belief either way about E.
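For concreteness, here is a minimal sketch of conditionalization for a probability function on finitely many worlds; the dict encoding is purely illustrative:

```python
# A sketch: a probability function P is a dict from worlds to probabilities,
# and conditionalizing on evidence E renormalizes the mass inside E.

def conditionalize(P, E):
    mass = sum(prob for w, prob in P.items() if w in E)
    if mass == 0:
        raise ValueError("cannot conditionalize on a probability-zero event")
    return {w: (prob / mass if w in E else 0.0) for w, prob in P.items()}

prior = {n: 1/6 for n in range(1, 7)}            # a fair die
posterior = conditionalize(prior, E={1, 3, 5})   # learn: the outcome is odd
print(posterior[5])                              # 0.333...
```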
In a passage famous in certain quarters David Lewis wrote about the “class of all those probability functions that represent possible systems of beliefs” that:
This class, we may reasonably assume, is closed under conditionalizing. (1976, 302)
In previous posts I have argued that probabilistic versions of Moore’s Paradox raise the same problem for this thesis, that a class of subjective probability functions represent possible systems of belief only if it is closed under conditionalization.
A Moore Statement (one that instantiates Moore’s Paradox) is a statement that could be true, but could not be believed. For example, “It is raining but I don’t believe that it is raining”.
We find interesting new varieties of such statements when we replace the intuitive notion of belief with subjective probability. Then there are two kinds of Moore Statements to be distinguished:
An Ordinary Moore Statement is one that could be true, but cannot have probability one. A Strong Moore Statement is one that could have positive probability, but could not have probability one.
When we introduce statements about objective chance there are Moore Statements in our language. Consider first the following (not a Moore statement) said when about to toss a die:
[1] The number six won’t come up, but the chance that six will come up is 1/6.
On this occasion both conjuncts can be true. The die is fair, so the second conjunct is true, and when we have tossed the die we may verify that our prediction (the first conjunct) was true as well.
Moreover, [1] can be believed, perhaps by a gambler who bet that the outcome will be odd and is feeling lucky. Or at least he could say, even with some warrant, that it seems likely (or at least a little likely) that [1] is the case. The gambler could even say (and who could disagree, if the die is known to be fair?) that the probability that [1] is true is 5/6!
The way I will symbolize that is: P(~Six & [ch(Six) = 1/6]) = 5/6.
In this sort of example we express two sorts of probability, one subjective and one objective. Are there some criteria to be met? Is there to be some harmony between the two?
Like so much else, there are some controversies about this. I propose what I take to be an absolutely minimal constraint:
Minimal Harmony. P(ch(A) > 0) = 1 implies P(A) > 0. If I am sure that there is some positive chance that A, then it seems to me at least a little likely that A.
I really cannot imagine someone seriously, and rationally, saying anything like
“I am certain that there is some chance that the six will come up, but I am also absolutely certain that it will not happen”.
Except a truly deluded gambler, with a gambling strategy sure to lead to eventual ruin?
To construct a Moore Statement we only need to modify [1] a little:
[2] The number six won’t come up, but the chance that six will come up is not zero
~Six & ~[ch(Six) = 0]
That [2] could be true we can argue just like we did for [1]. But [2] is a Moore Statement for it could not have subjective probability 1, by the following argument.
Assume that P([2]) = 1. Then:
1. P(~Six) = 1
2. P(Six) = 0
3. P(~[ch(Six) = 0]) = 1
4. ~[ch(Six) = 0] is equivalent to [ch(Six) > 0]
5. P(ch(Six) > 0) = 1
6. Contradiction between 2. and 5.: a violation of the principle Minimal Harmony.
Here 1. and 3. follow directly from the assumption, and 2. follows from 1. For 4., note that the situation being modeled here is the tossing of a die, with chance defined for the six possible outcomes of that toss.
Not closed under conditionalization
This means also that [2] is a statement on which you cannot conditionalize your subjective probability, in the sense that if you do, your posterior opinion will violate Minimal Harmony.
So we have here another case where the space of admissible probability functions is not closed under conditionalization.
I will make all this precise in the Appendix.
REFERENCE
My previous post called ‘Stalnaker’s Thesis → Moore’s Paradox’
APPENDIX. Semantic analysis: language of subjective probability and assessment of chance
As an intuitive guiding example we can think of a model of a tossed die. There is a set of possible worlds, and in each there is a die (fair or loaded in some fashion) that is tossed and a number that is the outcome of the toss. To represent the die we need only the corresponding chance function, e.g. the function that assigns 1/6 to the set of worlds in which the outcome is x (for x = 1, 2, 3, 4, 5, 6). Then, a special feature of this sort of model, there is the set of probability functions on these worlds, representing the different subjective probabilities one might have for (a) what the outcome is, and (b) in what fashion the die is loaded.
Definition. A probability space M is a triple <K, F, PP> where K is a non-empty set, F is a Borel field of subsets of K, and PP is a family of probability measures with domain F.
The members of K we call “worlds” and the members of F, the ‘measurable sets’, we call propositions.
Definition. A subset PP* of PP in probability space M = <K, F, PP> is closed under conditionalization iff for all P in PP* and all elements A of F, P( · |A) is in PP* if P(A) > 0.
Definition. A probability space with chance M is a quadruple <K, ch, F, PP> where <K, F, PP> is a probability space and ch is a function that assigns to each world w in K a probability function ch(w) defined on F.
Definition. For world w in K, GoodProb(w) = {P in PP: for all A in F, if P(ch(w)(A) > 0) = 1 then P(A) > 0}.
Theorem. GoodProb(w) is not closed under conditionalization.
Proved informally the Moore Paradox way, in the body of this post.
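Here is a minimal Python sketch of the theorem on the fair-die model; the encoding and all names are my own illustration, not part of the formal Appendix:

```python
from fractions import Fraction
from itertools import chain, combinations

# A sketch on the fair-die model: every world has the same uniform chance
# function, and Minimal Harmony is checked by brute force over all propositions.

K = frozenset({1, 2, 3, 4, 5, 6})

def ch(w, A):
    """Objective chance at world w: the die is fair in every world."""
    return Fraction(len(A), len(K))

def make_P(weights):
    total = sum(weights.values())
    return lambda A: sum(weights[w] for w in A) / total

def minimally_harmonious(P):
    """If ch(A) > 0 at every world (so P(ch(A) > 0) = 1), require P(A) > 0."""
    subsets = map(frozenset, chain.from_iterable(
        combinations(K, r) for r in range(len(K) + 1)))
    return all(P(A) > 0 for A in subsets
               if A and all(ch(w, A) > 0 for w in K))

prior = make_P({w: Fraction(1, 6) for w in K})
print(minimally_harmonious(prior))       # True

# The Moore statement [2] = ~Six & ~[ch(Six) = 0] denotes just ~Six here,
# since ch(Six) = 1/6 > 0 at every world.  Conditionalizing on it:
posterior = make_P({w: Fraction(1 if w != 6 else 0) for w in K})
print(minimally_harmonious(posterior))   # False: P(Six) = 0, yet chance of Six > 0
```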
The relevant language has as vocabulary a set of atomic sentences, connectives & and ~, propositional operators (subnectors, in Curry’s terminology) P and ch, relational symbols = and >, and a set of numerals including 0.
There is no iteration or nesting of P or ch, which form terms from sentences.
Simultaneous inductive definition of the set of terms and sentences:
An atomic sentence is a sentence
If A is a sentence then ~A is a sentence
If A, B are sentences then (A & B) is a sentence
If A is a sentence and no terms occur in A then ch(A) is a term
If A is a sentence and P does not occur in A then P(A) is a term
If t is a term and n is a numeral then (t = n) and (t > n) are sentences.
Truth conditions for sentences:
For M = <K, ch, F, PP> a probability space with chance, and P a member of PP, a P-admissible interpretation ||…|| of the language in M is a function that maps the sentences to propositions, and numerals to numbers (with 0 mapped to 0), subject to the conditions:
||A & B|| = ||A|| ∩ ||B||
||~A|| = K – ||A||
||ch(A) = n|| = {w in K: ch(w)(||A||) = ||n||}
||ch(A) > n|| = {w in K: ch(w)(||A||) > ||n||}
||P(A) = n|| = {w in K: P (||A||) = ||n||}
||P(A) > n|| = {w in K: P (||A||) > ||n||}
Note that ||P(A) = n|| is in each case either K or empty, and similarly for ||P(A) > n||.
We call a sentence A true in world w exactly if w is a member of ||A||.
For example, if A is an atomic sentence then there is no constraint on ||A|| except that it is a proposition. And then sentence P(A & ch(A) > 0) = n is true under this interpretation (in all worlds) exactly if P assigns probability ||n|| to the intersection of set ||A|| and the set of worlds w such that ch(w)(||A||) is greater than zero. And otherwise that sentence is not true in any world.
Thinking about odds brings new insights into how we deal with probabilities. It illuminates puzzles about confirmation, conditionalization, and Bayes’ Theorem, which, as I illustrated informally in earlier posts, tend to take a helpfully simpler and more intuitive form when put in terms of odds. Now I’ll explore the ins and outs of odds in a natural mathematical setting. (With examples and exercises.)
Odds more general than probabilities
Odds are ratios of probabilities. For example if the odds of snow against no-snow are 3 to 5 then the probabilities of snow and no-snow are 3/8 and 5/8 respectively. And vice versa.
But that example is special: it allows the deduction of the probabilities from the odds. Sometimes we know the odds but not the probabilities. Suppose four horses are running: I might say that I don’t know how likely it is that Table Hands will win, but he is twice as likely to win as True Marvel. The odds for Table Hands against True Marvel are two to one.
So odds provide a more general framework for reasoning and deliberation.
To move smoothly from our intuitions to more precise notions, let’s begin with a finite probability space, and redescribe it in terms of odds.
M = <K, F, PP> is a probability space iff K is a non-empty set, F is a field of subsets of K, and PP is a non-empty set of probability functions with a domain that includes F. M is a simple probability space if PP has only one member.
The elements of K are variously called points, outcomes, events, possibilities, or – as I will do here – worlds. For example these worlds could be the outcomes of tossing a die, or the points at which a team can wipe out in the World Cup. The elements of F are sometimes called events too, or – as I will do here – propositions.
A field of sets is a Boolean algebra of sets, with ⊆, ∩, ∪, and ~.
In this post, to keep things simple, I will take K to be the finite set {x1, …, xn}, F the family of all subsets of K, and PP to be the set of all probability functions defined on F.
A probability vector p is a function which assigns a number to each world, these numbers being non-negative and summing to 1. We write it in vector notation: p = < p1, p2, …, pn>. A probability function P defined on the propositions is determined entirely by a certain probability vector p: P(A) = Σ{p(x): x is in A}, or equivalently, p(x) = P({x}), for each world x. So we can stick to just the probability vectors in examples.
Let’s spell out odds in the same way. An odds vector is like a probability vector: the values it assigns to worlds are non-negative real numbers. But the sum of the assigned numbers need not be 1. For example if x1, …, xn are the outcomes of the toss of a fair die, that is represented by odds vector <1,1,1,1,1,1>.
Odds vector v satisfies the statement that the odds of A to B are a : b exactly if a : b = Σ{v(x): x is in A} : Σ{v(x): x is in B}. (I’ll make this precise below.)
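Here is the satisfaction clause as a small Python sketch (the dict representation of odds vectors is mine):

```python
# Sketch: an odds vector as a dict from worlds to non-negative numbers,
# and satisfaction of a simple odds statement O(A : B) = a : b.

def satisfies(v, A, B, a, b):
    """v satisfies 'the odds of A to B are a : b'."""
    sA = sum(v[x] for x in A)
    sB = sum(v[x] for x in B)
    return a * sB == b * sA           # compare ratios without dividing

die = {x: 1 for x in range(1, 7)}     # fair die: odds vector <1,1,1,1,1,1>
print(satisfies(die, A={1, 3, 5}, B={2, 4, 6}, a=1, b=1))   # True
print(satisfies(die, A={6}, B={1, 2, 3, 4, 5}, a=1, b=5))   # True
```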
Note 1, the null vector. There is no practical use for a locution like “odds of 0 to 0”, and so it would be reasonable to exclude the null vector, which assigns 0 to each world. It certainly does not correspond to any probability vector. But sometimes it simplifies equations or calculations to let it in, so I will call it an odds vector too, for convenience, and by courtesy.
Note 2, certainty. That proposition A is certain means that the odds of A to ~A are 1:0, that is, infinite. This makes sense in the extended real number system, with ∞ the symbol for (positive) infinity. When A is certain its odds against anything incompatible with A are infinite.
It may be convenient to put it in a negative way. The odds of (It both won’t and will rain tomorrow) to (It either will or will not rain tomorrow) are 0 : 1. That is a well-defined ratio, and means that the first proposition is certainly not true (ranked as such by the odds vector in question). Equivalently, of course, its negation (the second proposition) is certain.
Probability vectors are odds vectors. But now we have a redundancy. Two odds vectors are equivalent if one is a positive multiple of the other. For example, the odds of 4:2 are the same as the odds of 2:1.
If P and P’ are probability functions defined on F then so is P* = xP + (1-x)P’, provided x is in [0,1]. P* is a mixture (convex combination) of P and P’. [1] It is an important feature of the model that the set PP of probability functions defined on F is closed under the formation of mixtures.
Mutatis mutandis for odds: here the method of combination is not convex but linear. The restriction on the coefficients in the mixing equation for probability is not needed.
Definition. A mixture of odds vectors is a linear combination: v* = av + bv’, provided a, b are non-negative real numbers.
Note well that you cannot just replace an odds vector by an equivalent vector anywhere. Scalar multiplication distributes over addition: k(v + v’) = kv + kv’, so v + v’ is equivalent to k(v + v’). But even though v’ is equivalent to kv’, v + v’ is not in general equivalent to v + kv’.
EXAMPLE 1. K is the set {1, 2, 3, 4, 5, 6} of outcomes of the toss of a die. This die is one of two dice. One of them is fair, the other is loaded in such a way that the higher numbers 4, 5, 6 come up twice as often as the lower ones. And the die that is tossed is equally likely to be the one or the other.
We have vector v = <1,1,1,1,1,1> to give the odds on the assumption that the die is fair. Similarly, vector v’= <1,1,1, 2,2,2> represents the case of a loaded die with the higher numbers twice as likely to come up as the lower ones. We are unsure whether our die is fair or loaded in that particular way, with no preference for either, so for our betting we adopt an equal mixture:
v* = v + v’ = <2,2,2, 3, 3, 3>
and now our odds for any outcome – e.g. that the outcome is an odd number – will be halfway in between. For example, the odds of the outcome being ‘high’ to its being ‘low’ are half-way between what they are for the two dice, that is, one and a half (i.e. 3/2, as you can easily see).
EXERCISE 1. A hiker is at point A and would like to get to point X. Being ignorant of the trail system he chooses at random at each juncture. From A there go 6 trails: one goes to B1, two go to B2, and three go to B3. From each of these B points, there go 3 trails. From B1, one goes to X, from B2 two go to X, and from B3 all three go to X. What are the odds that the hiker reaches X (and what is the probability that he does so)? Reason with odds.
Answer. For the model, K = {X, ~X}. At the B points the odds vectors are <1, 2>, <2, 1>, <3, 0> respectively. At A, there are different odds to reach the B points, 1:2:3. So the correct odds vector for this situation is the mixture:
1<1,2> + 2<2,1> + 3<3,0> = <14, 4>.
The odds are 14: 4, or 7:2, the probability of reaching X is 14/18 or 7/9. (Check this by also doing the exercise reasoning with probabilities.)
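A quick numerical check of this answer, mixing the odds vectors in Python (the pair encoding <reach X, don’t reach X> is mine):

```python
# Mixing odds vectors linearly, as in EXERCISE 1.

def mix(*weighted):
    """Linear combination of equal-length odds vectors given as (weight, vector)."""
    n = len(weighted[0][1])
    return tuple(sum(w * v[i] for w, v in weighted) for i in range(n))

# From B1, B2, B3 the odds of reaching X are <1,2>, <2,1>, <3,0>;
# the B points themselves are reached with odds 1 : 2 : 3.
v = mix((1, (1, 2)), (2, (2, 1)), (3, (3, 0)))
print(v)               # (14, 4): odds 7 : 2
print(v[0] / sum(v))   # 0.777... = 7/9, the probability of reaching X
```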
Truth conditions for odds statements[2]
A proposition is true in a world in model M (the world satisfies that proposition) iff that world is a member of that proposition. Notation:
M, w╞A iff w is in A
We will similarly say that an odds vector satisfies an odds statement under the appropriate conditions.
Our language has a classical sentential syntax with &, v, ~ and one special sentential operator O, numerals, and atomic sentences. I will use the same capital letters for sentences as for propositions, and K for the tautology. The sentence formed by applying connective O to A, B, in that order and the numerals x and y I will write, to be reader-friendly as O(A : B) = x : y. It is read as “the odds of A to B are x : y”, and it is called a (simple) odds statement.
I’ll use the same symbol for satisfaction, and write “v╞ E” for “ odds vector v satisfies odds statement E”. The truth conditions for simple odds statements are then, as you would expect:
M, v╞ O(A : B) = a : b if and only if a : b = Σ{v(x): x is in A} : Σ{v(x): x is in B}
EXAMPLE 2. In EXAMPLE 1, v* = <2,2,2,3,3,3>. If A = (outcome is odd) and B = (outcome is even) then
M, v*╞ O(A : B) = 7 : 8.
For Σ{v*(x): x = 1, 3, 5 } : Σ{v*(x): x = 2, 4, 6 }
= (2 + 2 + 3) : (2 + 3 + 3) = 7 : 8
To complete this part we have to look at the more general concept of odds relative to a ‘given’.
As I discussed in an earlier post, conditionalizing an odds vector on a proposition A consists just in assigning 0 to the elements not in A. We can make it precise the following way.
Let JA be the indicator function of A, that is, the function that gives value 1 to elements of A and value 0 to all other elements. For example, with x and y in K, if A = {x, y}, JA(x) = 1 = JA(y), and if z is neither x nor y then JA(z) = 0.
Definition. (IAv)(x) =def JA(x)v(x), where v is any vector with real-number components and x is a world in K.
It may be puzzling why I give this definition for vectors with negative components at the same time, although they are not odds vectors. But it will simplify things later to do that.
So IA is an operator: it operates on vectors, and it is a linear operator. The most visual way to display that, in the case of finite vectors, is by representing the operator by a matrix. Suppose K = {x1, x2, x3, x4}, that v = <2, 3, 4, 5> and A = {x1, x3}. Then the matrix representation of the action of IA on v is this:

IAv =
|1 0 0 0| |2|   |2|
|0 0 0 0| |3| = |0|
|0 0 1 0| |4|   |4|
|0 0 0 0| |5|   |0|
(Briefly, the rows in the matrix are vectors too. To get the first component of the new vector multiply the top row of the matrix and the old vector. And so forth. The multiplication is the inner product of the two vectors, which I will discuss properly below.)
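Here is the same computation done with numpy, as a quick machine check (illustrative only):

```python
import numpy as np

# The operator I_A as a diagonal 0/1 matrix, with K = {x1, x2, x3, x4},
# A = {x1, x3}, and v = <2, 3, 4, 5> as in the text.
I_A = np.diag([1, 0, 1, 0])
v = np.array([2, 3, 4, 5])
print(I_A @ v)   # [2 0 4 0]: the worlds outside A are zeroed out
```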
Truth conditions for conditional odds statements
Let’s extend the language. The sentence formed by applying connective O to A, B, C in that order and the numerals x and y, I will write, to be reader-friendly, as O(A : B|C) = x : y. It is read as “given C, the odds of A to B are x : y”, and it is called a conditional odds statement.
A conditional odds statement has the following truth conditions:
M, v╞ O(A : B|C) = x : y if and only if x : y = Σ{v(x): x is in A ∩ C} : Σ{v(x): x is in B ∩ C}
which is equivalent to
M, v╞ O(A : B|C) = x : y if and only if x : y = Σ{ICv(x): x is in A} : Σ{ICv(x): x is in B}
and to the more intuitive
M, v╞ O(A : B|C) = x : y if and only if ICv╞ O(A : B) = x: y
It is easily seen now that we can define the binary O, in simple odds statements, in terms of the ternary O:
Definition. ‘O(A : B) = x : y’ for ‘O(A : B|K) = x : y’.
Here the ‘given’ is a tautology, so imposes no constraint.
EXAMPLE 3. We have our die loaded so as to favor the higher numbers, represented by the odds vector v = <1,1,1, 2, 2, 2>. What are the odds of throwing a 5, conditional on the outcome being an odd number?
Here C is {1, 3, 5} so ICv = <1, 0, 1, 0, 2, 0> while A and B are {5} and {1, 2, 3, 4, 6}. The odds in question are therefore 2: (1 +1) = 2:2, i.e. fifty-fifty, as they say.
EXERCISE 2. Olivia and Norman play a game: they have an urn with 35 black balls and 1 red ball. They take turns drawing without replacement, and the one to get the red ball wins. But they are interested only in getting a Superwin: get the red ball on your first draw. Norman chivalrously offers Olivia the choice to go first if she wishes. She thinks she could end the game at once with a Superwin (chance 1 in 36). But if she doesn’t then Norman will have an advantage: 1 out of 35 to get a Superwin. Would Olivia going first be to Norman’s advantage?
Answer. There are three worlds: OW (Olivia wins), NW (Norman wins), NL (both lose). Suppose Olivia chooses to go first. About the correct odds vector v = <v(OW), v(NW), v(NL)> for this situation, we know its conditionalizations on OW and on not-OW: given not-OW, 35 balls remain, and Norman’s first draw wins in 1 case out of 35 and loses in the other 34. So:
IOWv = <1, 0, 0>; I~OWv = <0, 1, 34>
v = IOWv + I~OWv
= <1, 0, 0> + <0, 1, 34>
= <1, 1, 34>
From this we can see that, even if Olivia goes first, the odds of Norman winning are the same as for her, namely 1 : (1 + 34) = 1 : 35, that is, probability 1/36 for each. Olivia going first gives Norman no advantage.
The example that follows was in an earlier post, with a discussion of how reasoning by Bayes’ Theorem amounts to finding the Bayes Factor, which is the number by which the prior odds are multiplied to yield the final odds. I’ll repeat a small part here to illustrate how we now see conditionalization on evidence, taken from tests with known error probabilities.
EXERCISE 3. There is a virus on your college campus, and the medical team announces that 1 in 500 students have this virus. There is a test: it has 1% false negatives, and 1% false positives (1% of those who do not have the virus nevertheless test positive). You are one of the students, with normal behavior, and reckon that your odds of having the virus are accordingly 1 : 499. You take the test and the result is positive. What are your new odds of having the virus?
Answer. Let’s say that all told there are 50,000 students and they are all tested. There are 100 students who have the virus. 99 of them test positive and 1 tests negative. There are 49,900 students who do not have the virus, and 499 of them test positive anyway. So you are one of 99 + 499 = 598 students who test positive, and only 99 of those have the virus while 499 do not. So the odds for you to have the virus are 99 : 499. Your odds for having the virus have been multiplied by 99.
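The same computation as odds arithmetic, in a short Python sketch (numbers from the exercise; the Bayes Factor here is the ratio of the test’s hit rate to its false positive rate):

```python
# Odds updating by a Bayes Factor: posterior odds = prior odds * factor.

prior_odds = (1, 499)                  # virus : no virus
bayes_factor = 0.99 / 0.01             # P(positive | virus) / P(positive | no virus)
posterior_odds = (prior_odds[0] * bayes_factor, prior_odds[1])
print(posterior_odds)                           # (99.0, 499)
print(posterior_odds[0] / sum(posterior_odds))  # ≈ 0.166, probability of virus
```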
That was the intuitive frequency argument, told in terms of odds. But what exactly was the manipulation of odds vectors that was involved?
There are four worlds: x1 (positive & virus), x2 (positive & no virus), x3 (negative & virus), x4 (negative & no virus). The prior odds vector, which we can read off from the narrative, is
v = <99, 499, 1, [50,000 − 99 − 499 − 1]> = <99, 499, 1, 49,401>
But you tested positive, so let’s conditionalize on that, with C = {x1, x2}:
ICv = <99, 499, 0, 0>
which gives the posterior odds of 99 : 499 for virus against no virus, as before.
Putting mixtures and conditionalization together we can define Jeffrey Conditionalization. I call a Jeffrey shift the following operation on a probability function designed to change the value of a given proposition A to a specified number x:
(A → x)P = xP( · |A) + (1 − x)P( · |~A), where 0 ≤ x ≤ 1 and P(A) > 0 < P(~A)
Informally: while A gets the new probability x, the ratios of the probabilities of subsets of A to each other remain the same as they were, and similarly for the ratios of the probabilities of subsets of ~A.
I’ll use a similar notation (A → x : y) for the corresponding operator on odds vectors, which changes the odds of A to ~A to x : y.
Definition. (A → x : y)v = xIAv + yI~Av, with x, y non-negative
(If x = y = 0 this is a Jeffrey shift in odds only by courtesy notation.)
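A minimal sketch of this operator on odds vectors (the indexing is mine; worlds 0..5 stand for outcomes 1..6):

```python
# The Jeffrey shift (A -> x : y) on odds vectors: scale the A-part by x
# and the rest by y, i.e. x*I_A(v) + y*I_~A(v).

def jeffrey_shift(v, A, x, y):
    return tuple((x if i in A else y) * vi for i, vi in enumerate(v))

v = (1, 1, 1, 1, 1, 1)             # a fair die
A = {1, 3, 5}                      # indices of the even outcomes 2, 4, 6
print(jeffrey_shift(v, A, 2, 1))   # (1, 2, 1, 2, 1, 2), cf. EXAMPLE 5 below
```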
When we use the matrix representation it is clear how Jeffrey Conditionalization is a straightforward generalization of ordinary conditionalization.
EXAMPLE 4. Suppose K = {x1, x2, x3, x4}, that v = <2, 3, 4, 5> and A = {x1, x3}. The current odds of A to ~A are 6 : 8 or 3 : 4 or, to make the generalization more obvious, 1 : (4/3). Now if you want to double the odds for ~A, instead of multiplying by 0 for the ~A worlds, multiply by 2:
(A → 1 : 2)v = IAv + 2I~Av = <2, 0, 4, 0> + <0, 6, 0, 10> = <2, 6, 4, 10>
and the new odds for A against ~A are 6 : 16 or 3 : 8 or 1 : (8/3).
EXAMPLE 5. We thought we had a fair die, and so adopted odds vector v = <1,1,1,1,1,1>. Then we learned that the even outcomes A = {2, 4, 6} are twice as likely to come up as the odd numbers. So we update to the odds vector <1, 2, 1, 2, 1, 2>. What was that? It was the Jeffrey shift:
(A → 2 : 1)v = 2IAv + I~Av = 2<0, 1, 0, 1, 0, 1> + <1, 0, 1, 0, 1, 0> = <1, 2, 1, 2, 1, 2>
A partition in model M = <K, F, PP> is a set of mutually disjoint propositions which is exhaustive, that is, its union is K. If T = {B1, …, Bm} is a partition and P is a probability function then the law of total probability says:
P = P(B1)P( · |B1) + … + P(Bm)P( · |Bm)
The components P( · |Bj), j = 1, …, m, are mutually orthogonal, by the following definition:
Definition. If P and P’ are probability functions defined on the same algebra F then P is orthogonal to P’ if and only if there is a proposition A in F such that P(A) = 1 and P’(A) = 0.
Notation: P ⊥ P’. This relation is symmetric and irreflexive.
The corresponding definition for odds vectors is:
Definition. If v and v’ are odds vectors defined on the same set K then v is orthogonal to v’ if and only if, for each member x of K, either v(x) = 0 or v’(x) = 0 or both.
Clearly two probability vectors are orthogonal iff the probability functions which they determine are orthogonal.
Using the same symbol for this relation, we note that in mathematics there is a standard definition of orthogonality for vectors in general:
v ⊥ v’ iff Σ {v(x)v’(x): x in K} = 0
Since the numbers in odds vectors are all non-negative, this sum equals 0 if and only if, for each x in K, at least one of v(x) and v’(x) equals zero. So suppose that v ⊥ v’, and let E = {x: v(x) = 0}. Then for v, E is certainly not true, while for v’, E is certainly true (by the definition in Note 2 above). So this corresponds exactly to the condition of orthogonality for probability functions. We can also put it a third way:
v and v’ are orthogonal exactly if there is a proposition A such that v = IAv and v’ = I~Av’
Now we also have a neater way to give, parallel to the law of total probability, the law of total odds:
v = IBv + I~Bv
If T is a partition then v = Σ{ IBv: B a member of T}
and this is an orthogonal decomposition of odds vector v.
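In code, the law of total odds is just this decomposition (a sketch):

```python
# Orthogonal decomposition of an odds vector over a partition.

def I(A, v):
    """Zero out the components outside A."""
    return tuple(vi if i in A else 0 for i, vi in enumerate(v))

v = (2, 3, 4, 5)
partition = [{0, 1}, {2, 3}]
parts = [I(B, v) for B in partition]
print(parts)                              # [(2, 3, 0, 0), (0, 0, 4, 5)]
print(tuple(map(sum, zip(*parts))) == v)  # True: the parts sum back to v
```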
Odds’ natural habitat in mathematics
The odds vectors are part of a finite-dimensional vector space. A vector space over the real numbers is a set of items (‘vectors’) closed under addition and scalar multiplication by real numbers. When the vectors are sequences of numbers (as they are in our context) the odds vectors are singled out by having no negative number components.
The dimensions correspond to the worlds – the worlds are the dimensions, you might say. With the worlds numbered as above, world x1 is represented by the unit vector v(x1) = <1, 0, 0, …, 0>, world x2 by v(x2) = <0, 1, 0, …, 0>, and so forth. The unit vector that corresponds to world x is the one which ranks world x – or more precisely, the proposition {x} – as certain. These unit vectors are mutually orthogonal and span the space in this sense: each vector in that space is a linear combination of these unit vectors.
Propositions correspond to subspaces. If A = {x1, x2, x3} then A corresponds to the subspace [A] spanned by {v(x1), v(x2), v(x3)}. Proposition A is ranked as certain by precisely those vectors which are in [A].[3]
The operator IA is a projection operator: it is the projection on subspace [A]. If v is any vector then IAv is the vector that is exactly like v except for having 0s for worlds not in A.
So let’s make it official. The finite probability space M = <K, F, PP> has an associated vector space V(M). Most of its description is already there in the discussion above.
The Boolean algebra of propositions F has as counterpart in V(M) a Boolean algebra of subspaces of V(M). (Note well: that is not the algebra of all subspaces of V(M), which is not Boolean – I will illustrate below.)
Each proposition A in F is a set of worlds {xj, …, xk}
Definition. [A] = the subspace spanned by the vectors v(y): y in A.
Call [A] the image of A in V(M).
Notation: if X is any set of vectors, [X] is the least subspace that contains X, and we say that X spans that subspace. In the case of a unit set {v}, I’ll abbreviate [{v}] to [v].
Define the algebra of subspaces [F] to be the set of images of members of F, with the following operations:
meet: [A] ∧ [B] = [A ∩ B]
join: [A] ⊗ [B] = the least subspace that contains both [A] and [B]
orthocomplement: [A]⊥ = {v in V(M): v ⊥ v’ for all v’ in [A]}
order: [A] ≤ [B] iff A ⊆ B
First of all, the order is just set inclusion: [A] ≤ [B] iff [A] ⊆ [B]. Secondly, the meet is just set intersection: [A] ∧ [B] = [A] ∩ [B], for the largest subspace contained in two subspaces is their intersection.
The other two operations are less obvious. [A] ⊗ [B] does not just contain [A] and [B] but also the linear combinations of vectors in [A] and vectors in [B].
Clearly [A] ⊗ [A]⊥ = [K], but the vectors that belong to neither [A] nor [A]⊥ are not to be ignored.
That [F] is isomorphic to F, though the algebra of subspaces is not Boolean
The point is, first, that [F] is indeed Boolean, isomorphic to F, but second that there are subspaces that are not images of propositions, and because of these, there are violations of the Boolean law of distributivity.
To take the second point first, let v = av(x1) + bv(x2), with both a and b positive. Since v(x1) = <1, 0, …> and v(x2) = <0, 1, 0, …> we see that v = <a, b, 0, …>. Suppressing the extra zeroes, we can picture v as a point in the positive quadrant of the plane spanned by v(x1) and v(x2).
[v] is not an image of any proposition. The least subspace that contains v is [v] = {kv: k a real number}, the ray (one-dimensional subspace) spanned by v. Note that v is a mixture of those two unit vectors, so [v] is part of ([{x1}] ⊗ [{x2}]). Denoting the null space (the subspace that contains only the null vector) as f, we then have a failure of distributivity:
[v] ∧ ([{x1}] ⊗ [{x2}]) = [v], but ([v] ∧ [{x1}]) ⊗ ([v] ∧ [{x2}]) = f ⊗ f = f
In other terminology: the lattice of (all) subspaces of a vector space is a non-distributive lattice.
Why is the smaller algebra of subspaces [F] nevertheless Boolean, and isomorphic to F? The reason is that the unit vectors corresponding to worlds are all mutually orthogonal. That makes the images of propositions mutually compatible, in the sense in which this word is used in quantum mechanics.[4] We need only verify:
[C] ≤ [A] ⊗ [B] iff C ⊆ A ∪ B
That is so because the right-hand side is equivalent to [C] ⊆ [A ∪ B] = [A] ⊗ [B], and the order in [F] is set inclusion.
[C] ≤ [A]⊥ iff C ⊆ ~A
That is so because ~ A contains precisely those worlds x such that v(x) is orthogonal to all vectors v(y) such that y is in A.
The General Reflection Principle demands that your current opinion (represented by a probability or expectation function) is within the range (convex closure) of the future opinions you foresee as possible. How does that idea look with odds?
The simplest case is obvious. Suppose the worlds are the possible outcomes of an experiment (e.g. toss of a die) and you are sure that the outcome will be one of the first three. Then your current opinion must assign 0 to the other dimensions, i.e. be in the subspace spanned by those first three corresponding unit vectors v(x1), v(x2), v(x3).
EXAMPLE 6. We are conducting an experiment whose set of possible outcomes is the partition T = {B1, …, Bm}. Our current opinion for the outcome is vector v, so we know our possible posterior opinion will be one of the vectors in the orthogonal decomposition {IB1v, …, IBmv}. This corresponds to conditionalization in the case of probability – that the outcome of an experiment is a projection on a subspace is called the Projection Postulate in discussions of quantum mechanics.
It is a bit more complicated when you have a more arbitrary set of foreseen possible posteriors, say a set X of odds vectors of some sort. Then the principle should demand that your current opinion is represented by an odds vector that lies within the least subspace that contains X. What is that?
The answer appeals to ‘double negation’. First take the set of all vectors that are orthogonal to all members of X, which is the orthocomplement X⊥ of X. Those are the opinions certainly ruled out by the principle. Then take the orthocomplement of that: X⊥⊥.
It is a theorem that, whatever X is, X⊥⊥ is a subspace, and it is the least subspace that contains X.
The Reflection Principle then demands that your current opinion is constrained to be an odds vector that lies in that subspace.
What are called quantities elsewhere statisticians call random variables. A random variable on the model is any function that assigns a real number to each world in K. For example, K might be the days this week and function r assigns each day its precipitation amount. So a random variable r, in this case, is representable by a vector r = <r(x1), r(x2), …, r(x7)>.
Definition. The expectation value Ep(r) of r for p is Σ{ p(xj)r(xj) : j = 1, …, 7}, provided p is a probability vector.
But that is exactly the definition of the inner product (also called scalar product) on our vector space:
Definition. The inner product of vectors v and v’ is the number (v, v’) = Σ{v(x)v’(x) : x in K}.
So the expectation value of quantity r for probability p is the inner product of their representing vectors.[5]
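For instance (a sketch with invented precipitation numbers):

```python
import numpy as np

# Expectation value as an inner product: probability vector dotted with
# the random-variable vector (days of the week, inches of rain).
p = np.full(7, 1/7)                                  # uniform over the days
r = np.array([0.0, 0.2, 0.0, 1.1, 0.3, 0.0, 0.4])    # precipitation amounts
print(np.dot(p, r))                                  # 0.2857... = 2/7
```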
Since the values of the random variable, e.g. the amounts of precipitation, are absolute values on a chosen scale (e.g. inches), the expectation value is not something comparative, and there is no advantage in adapting the concept of expectation value from probability to odds.
But this subject was originally created in the 17th century, before the technical concepts had solidified in our culture, and we can read the texts as discussing the matter in terms of odds, quite naturally. (Translations tend to render them in terms of probabilities and expectation values, that is, in terms of the concepts we mainly employ today, but I suggest that this may be anachronistic.)
For example, here is Huyghens’s Proposition III:
If I have p chances to get a, and the number of chances I have to get b is q, then (assuming always that each chance can occur equally easily): that is worth (pa + qb)/(p + q) to me. (My translation from the Dutch.)
Here p and q can be any natural numbers, say 17 and 51. The division by their sum points us to reading his text as, in effect, ‘If the probability to get a equals p/(p + q) …’. I am not saying that is wrong; I agree that if values to me are described in absolute rather than comparative terms, that reading is natural as well.
But think of this in a larger context:
I have 17 chances to get a, 51 chances to get b, 101 chances to get c, …
You want to buy from me the chances to get a and to get b
How much do you owe me?
Three remarks:
the first line is most easily read as displaying two vectors, namely an odds vector <17, 51, 101, …> and a random variable vector <a, b, c, …>;
to calculate the fair price, reference to all the other contracts or lottery tickets that I have can be omitted,
the price must be an appropriate fraction of (a + b), with proportions of a and of b in the ratio 17 : 51, that is, 1 : 3.
So this is a way of reading the text, I think very naturally, in terms of odds thinking.
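As a sketch of that reading: the fair price is the inner product of the chances vector with the payoff vector, divided by the sum of the chances (the payoff numbers here are invented):

```python
# Huyghens' Proposition III in odds terms: price = (p*a + q*b) / (p + q).

def fair_price(chances, payoffs):
    return sum(c * v for c, v in zip(chances, payoffs)) / sum(chances)

print(fair_price([17, 51], [100, 40]))   # (17*100 + 51*40)/(17 + 51) = 55.0
```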
Admittedly these three remarks do not yet, taken together, yield Huyghens’ result. The gap is filled by his symmetry argument about games in the proof of his Proposition III. (See my post “Huygens’ probability theory: a love of symmetry” of April 2023.)
ENDNOTES
[1] The term “mixture” is common in physics, not in mathematics, but I like it because it is a term that aids visual imagery.
[2] I’m going to fuzz the use/mention distinction a bit from here on. As my friend Bob Meyer used to say, we are following the conventions of Principia Mathematica.
[3] Think about quantum logic. As introduced by von Neumann: subspaces are identified as the propositions for that logic. Various intuitive motivations have been offered for this.
[4] The unit vectors that correspond to worlds, in the way indicated, and which form a basis for the space, are the eigenvectors of a single observable. Propositions correspond to statements to the effect that the eigenvalues of this observable are within a certain range.
[5] Geometrically, the inner product measures the angle between the two vectors, and the inner product of a vector with itself measures its magnitude. Notation:
||v|| = square root of (v,v)
𝜙 is the angle v^v’ between vectors v and v’ iff the cosine of 𝜙 = (v, v’)/(||v||·||v’||).
Equivalently, (v,v’) = ||v||.||v’||cos(v^v’).
Note that the cosine varies inversely with the angle.
Stalnaker’s Thesis, that the probability of a conditional is the conditional probability of the consequent given the antecedent, ran quickly into serious trouble, in the first instance (famously) at the hands of David Lewis.
When I took issue with David Lewis’s triviality results, Robert Stalnaker wrote me a letter in 1974 (Stalnaker 1976). Stalnaker showed that my critique of Lewis did not save his Thesis when applied to his (Stalnaker’s) own logic of conditionals (logic C2).
Stalnaker proved, without relying on Lewis’ special assumptions:
If the logic of conditionals is C2, and for all statements A and B, P(A → B) = P(B|A) when defined, then there are at most two disjoint propositions with probability > 0.
At first blush this proof must raise a problem for a result I had presented, namely:
Theorem. Any antecedently given probability measure on a countable field of sets can be extended into a model structure with probability, in which Stalnaker’s Thesis holds, while the field of sets is extended into a probability algebra.
This theorem does not hold for a language of which the logic is Stalnaker’s C2. Rather, it can be presented equivalently as a result for a language that has the same syntax as C2, but has a weaker logic, that I called CE.
While Stalnaker acknowledged that his proof was specifically for C2, and did not claim that it applied to CE, neither he nor I showed then just how the difference between the two logics resolves the apparent tension.
Here I will show just how Stalnaker’s triviality argument does not hold for CE, with a simple counterexample.
2. Stalnaker’s Lemma
Stalnaker’s argument relies on C2 at the following point, stated without proof, which I will call his Lemma.
Definition. C = A v (~A & (A → ~B))
Lemma. ~C entails C → ~(A & ~B)
We may note in passing that these formulas can be simplified using principles that hold in both C2 and CE, for sentences A and B that are neither tautologies nor contradictions. Although I won’t rely on this below, let’s just note that C is then equivalent to [A v (A → ~B)] and ~C to [~A & (A → B)].
3. The CE counter-example to the Lemma
I will show that this Lemma has a counter-example in the finite partial model of CE that I constructed in the post “Probabilities of Conditionals: (1) Finite Set-ups” (March 29, 2021).
The propositions are sets of possible outcomes of a tossed fair die, named just by the numbers of spots that are on the upper face. To begin we take propositions
p = {1, 3, 5} “the outcome is odd”
q = {1, 2, 3} “the outcome is low”
The probability of (p → q) will be P(q|p) = P(1, 3)/P(1, 3, 5) = 2/3. That is the clue to the construction of the selection function s(x, p) for worlds x = 1, 2, 3, 4, 5, 6.
In this model the choices are these. First of all if x is in p then s(x, p) = x. For the other three worlds we choose:
s(2, p) = 1, s(4, p) = 3, s(6, p) = 5
Thus (p → q) is true in 1 and 3, which belong to (p ∩ q), and also in 2 and 4, but not in 5 or 6.
Hence (p → q) = {1, 3, 2, 4}, “if the outcome is odd then it is low”, which has probability 2/3 as required.
Similarly we see that (p → ~q) = {5, 6}.
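The finite model is easy to check mechanically. Here is a minimal sketch (the encoding is mine; s is defined only for the antecedent p used in the text):

```python
# A sketch of the finite model: the conditional (p -> q) is true at world x
# iff the selected world s(x, p) is in q.

p = {1, 3, 5}                       # "the outcome is odd"
q = {1, 2, 3}                       # "the outcome is low"

def s(x, A):
    if x in A:
        return x
    return {2: 1, 4: 3, 6: 5}[x]    # the choices s(2,p)=1, s(4,p)=3, s(6,p)=5

def arrow(A, B):
    return {x for x in range(1, 7) if s(x, A) in B}

print(arrow(p, q))                  # {1, 2, 3, 4}: probability 4/6 = 2/3
print(arrow(p, {4, 5, 6}))          # (p -> ~q) = {5, 6}
```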
To test Stalnaker’s Lemma we define:
c = p ∪ (~p ∩ (p → ~q))
= {1, 3, 5} ∪ ({2,4, 6} ∩ {5, 6})
= {1,3, 5} ∪ {6}
= {1,3,5, 6} “the outcome is odd or 6” or “the outcome is neither 2 nor 4”
~c = {2, 4} “the outcome is 2 or 4” (the premise of the Lemma)
Now proposition c has four members, and that means that in the construction of the model we need to go to Stage 2. There the original 6 world model is embedded in a 60 world model, with each possible outcome x replaced by ten worlds x(1), …, x(10). These are the same as x, except that the selection function can be extended so as to evaluate new conditionals. The previously determined choices for the selection function carry over. For example, s(4(i), p) = 3(i), so (p → q) is true in each world 4(i), for i = 1, …, 10.
We refer to the set {x(1), …, x(10)} as [x]. So in this stage,
c = [1] ∪ [3] ∪ [5] ∪ [6]
The conclusion of the Lemma is:
c → ~(p ∩ ~q) = c → ~[([1] ∪ [3] ∪ [5]) ∩ ([4] ∪ [5] ∪ [6])]
= c → ~[5] “If the outcome is either odd or 6 then it is not 5”
What must s(x, c) be? The way to determine that is to realize again that each member of c must have probability ¼ conditional on c. Probability ¼ equals 15/60, so for example (c → [1]) must have 15 members.
Since [1] is part of c, we must set s(1(1), c) = 1(1), and so forth, through s(1(10), c) = 1(10). Similarly for the other members of c.
To finish the construction we need to get up to 15, so we must choose five worlds y not in [1] such that s(y, c) is in [1]. Similarly for the rest. To do so is fairly straightforward, because we can divide up the members of [2] and [4] into four bunches of five worlds each:
s(2(i), c) = 1(i) for i = 1, …, 5
s(2(j), c) = 3(j) for j = 6, …, 10
s(4(i), c) = 5(i) for i = 1, …, 5
s(4(j), c) = 6(j) for j = 6, …, 10
Now each conditional c → [x] is defined for each of the 60 worlds, and has probability ¼ for x = 1, 3, 5, 6.
The Lemma now amounts to this, in this model:
~c implies c → ~[5]
or, explicitly,
[2] ∪ [4] ⊆ ([1] ∪ [3] ∪ [5] ∪ [6]) → ~[5]
For a counter-example we look at a specific world in which ~c is true, namely world 4(1). Above we see that s(4(1), c) = 5(1). Therefore in that world the conditional c → {5(1)} is true, and hence also c → [5], which is contrary to the conclusion of the Lemma.
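For those who like to double-check such constructions, here is a small Python sketch of the 60-world Stage 2 model (the encoding is mine):

```python
# Worlds are pairs (x, i); c is the union [1] u [3] u [5] u [6],
# and s follows the choices made in the text (for antecedent c only).

K = [(x, i) for x in range(1, 7) for i in range(1, 11)]
c = {w for w in K if w[0] in {1, 3, 5, 6}}

def s(w, C):
    x, i = w
    if w in C:
        return w
    if x == 2:
        return (1, i) if i <= 5 else (3, i)
    return (5, i) if i <= 5 else (6, i)    # x == 4

def arrow(C, B):
    return {w for w in K if s(w, C) in B}

def bracket(x):                            # [x] = {x(1), ..., x(10)}
    return {(x, i) for i in range(1, 11)}

for x in (1, 3, 5, 6):
    assert len(arrow(c, bracket(x))) == 15   # probability 15/60 = 1/4

w = (4, 1)
print(w not in c)                   # True: ~c is true at 4(1)
print(w in arrow(c, bracket(5)))    # True: c -> [5] is true at 4(1), refuting the Lemma
```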
4. Conclusion
To recap: in this finite partial model of CE the examined instance of Stalnaker’s Lemma amounts to:
Premise. The outcome is either 2 or 4
Conclusion. If the outcome is neither 2 nor 4 then it is not 5 either
And the counter-example is that in this tossed die model, there is a certain world in which the outcome is 4, but the relevant true conditional there is that if the outcome is not 2 or 4 then it is 5.
Of course, given that the Lemma holds in C2, this partial model of CE is not a counter-example to Stalnaker’s argument as it applies to his logic C2 or its extensions. It just removes the apparent threat to CE.
REFERENCES
Stalnaker, Robert (1976) “Stalnaker to van Fraassen”. Pp. 302-306 in W. L. Harper and C. A. Hooker (eds.) Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science. Dordrecht: Reidel.
This puzzle was devised by Roger White (2010: 175ff.) in support of an argument against the very idea of vague probability judgements. (See e.g. Topey 2012 for discussion.)
To begin I will take up the puzzle itself, in its general form, as it applies also to precise opinion, and the fallacy it tends to evoke. Then I’ll discuss decisions under vague uncertainty, and end with an Appendix for open problems of a technical sort.
1. A Version Of The Puzzle
Time 0. Jake has a coin that you know to be fair. There is a certain proposition p about which you are uncertain (in one way or another), but you know that Jake knows whether p. Jake paints the coin so that you can’t see which side is Heads and which side is Tails, then writes ‘p’ on one side and ‘~p’ on the other. Jake tells you that he has placed whichever is true on the Heads side, and its contradictory on the Tails side. Jake will toss the coin so that you can see how it lands.
Time 1. Jake tosses the coin, and you see that it has landed with the side marked ‘p’ facing up.
What does this do to your opinion about how likely it is that p is true?
Now we may be inclined to reason as follows:
[ARG] “This coin is fair, so the probability is 0.5 that it landed Heads up. But given that p is showing, p is true iff the coin landed Heads up. Therefore the probability that p is true is 0.5.”
Notice that it does not matter what p is, except that you are uncertain about it. Also note that your prior probability for p (whether precise or vague) makes no difference to what your posterior probability becomes.
Notice also that if you had seen that the coin had landed showing ~p, you would have come to the posterior probability 0.5 for ~p, and hence also for p, by an exactly similar argument. Therefore, it is predictable beforehand that your posterior probability for p is 0.5, regardless of which proposition p is, and regardless of your prior probability for it. As soon as Jake has told you what he is going to do, if you believe you will look at the coin when it has landed, you know how likely you will take p to be at the end.
White dismisses this argument, with the words “But this can’t be right. If you really know this in advance of the toss, why should you wait for the toss in order to set your credence in p to 1/2?”.
And dismiss it he should! For the form of argument [ARG] quickly leads to total incoherence.
EXAMPLE
Jake has three cousins, called Jack, Jim, and Jules. They approach Mark, offering the same procedure as Jake’s, but for specific propositions. They have looked at a die, and have recorded which face is up. Jack tells Mark he has a fair coin, and will write “The face up was either 1 or 2” on the Heads side if that was true, and on the Tails side otherwise, with the negation on the other side. Then he will toss that coin and Mark can see the result.
Mark remembers the entire discussion of Jake’s procedure; he accepts [ARG] as the proper reasoning, and so concludes that after seeing the result, whatever it is, he will have probability 0.5 that the outcome was either 1 or 2.
Jim now gets into the act, in precisely the same way, with the proposition “The outcome was 3 or 4”. Then Jules, of course, for the proposition “The outcome was 5 or 6”. Each is referring to the same recorded die outcome as Jack.
After they are done Mark has probability 0.5 that the outcome was either 1 or 2, and 0.5 that it was 3 or 4, and 0.5 that it was 5 or 6. So his probability that the outcome was 1, 2, 3, 4, 5, or 6 is now 1.5. His opinion is now completely incoherent.
To press the point home: a little theorem
It takes just a couple of lines to prove that for any probability function P and any propositions p and q in its domain, P(p) is between P(p|q) and P(p|~q), when defined.
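For the record, the couple of lines: P(p) = P(q)P(p|q) + P(~q)P(p|~q), by the law of total probability. Since P(q) and P(~q) are non-negative and sum to 1, P(p) is a weighted average of P(p|q) and P(p|~q), and a weighted average always lies between its extremes.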
So applications of [ARG] will be invalid whenever P(p) is a (sharp) probability other than 0.5.
For example, suppose p = (BvF won the lottery), and for me, p has a probability of less than one in a million (as it does). Then there does not exist any proposition q such that I must assign 0.5 to p, by conditionalization, both when my evidence is q and when it is ~q.
2. Diagnosis
Argument [ARG] is spurious.
When Jake sets his procedure in motion, the question I must ask myself is this:
when Jake goes to place p on one side of the coin, how likely is he to place it on the Heads side?
Well, he will do so only if p is true. And how likely is that?
Suppose I bought one ticket in the lottery and Jake has checked whether it was the winning ticket. For p he selects “You have won a million dollars”.
Well, how likely is it that p is true?
For me, it has probability less than one in a million. So if I see that sentence on top, I say: this coin landed Heads up on this particular occasion only if Jake wrote p on the Heads side. And this he did only if I turned out to have had the winning ticket. So the probability that the coin landed on the Heads side, on this particular occasion, is the probability that I won a million dollars, which is less than one in a million.
Landed Heads implies I won a million. So Prob(Landed Heads) ≤ Prob(I won a million)
This does not deny for a moment that the coin is fair, and that it certainly was the case that the probability was 0.5 that the coin would land Heads up on that particular toss. But now that the coin is lying there, we have to go with what we know about Jake’s procedure.
3. What About Vague Probabilities?
Let’s first discuss vague probability taken as a general subject, setting aside for now any questions about the ‘orthodox’ representation of vague opinion (which is by means of families of probability functions).
Suppose then that I have no precise opinion at all, let alone sharp probabilities, for proposition p. In that case, when I see p displayed on top of the coin, I can’t reason with myself about how likely Jake was to place p on the Heads side of the coin. Thus what Jake has told me about how he would proceed, depending on whether p is true, has given me no usable information at all. There is nothing for me to process.
So I am at a loss, in the extreme case where I have no opinion at all. But what that sort of case can be is not easy to grasp, and I will give a concrete example below.
Vague opinion is not usually so totally vague as all that. In a more practical case, e.g. that the weatherman’s forecast was that it will rain tomorrow, I do have some opinion. For example, I may say that this is at least as likely as not. That is, my probability is at least 0.5, or equivalently (if we want to put it that way) the interval [0.5, 1].
What if I am offered a bet on this, with prize 1 utile? There is one highly conservative policy I could follow: if buying the bet, pay no more than 0.2, if selling take no less than 1. As to any other offer, just say no.
Well, that is fine with such a cozy bet on an innocuous subject, but what if a great deal depends on it? What if, in William James’ terms, the choice is forced, so that not betting is itself a choice with possibly awful consequences? To jump the chasm may cost you your life or it may save you, but if you do not jump you are almost certain to suffer debilitating exposure.
The other, highly permissive policy is to say: if you want, buy the bet at any price between 0.5 and 1, inclusive. None of these choices has anything to favor it over the others, but each has the merit that you may prefer it to inaction, although you cannot calculate a higher expectation value.
THE GAMBLE = AN OPINION UPDATE?
Suppose that in the above illustration I am offered a bet on ‘it will rain tomorrow’, with payoff 1 if true (and 0 if false), for 0.6 utiles. Suppose I buy the bet. Am I irrational?
If that is irrational, then we are all irrational all the time, when we go into stores and buy things.
What did I do?
(I) Taking into account all the information I have, and judging it at least as likely as not that it will rain tomorrow, though not above nine times as likely as not, I know that I take a risk by buying the bet for 0.6, a risk that I cannot quantify.
Now, there is a longstanding idea that my opinion is whatever it is that is exhibited in my willingness to bet. If we apply that idea here, directly and uncritically, we arrive at:
(II) The act of betting 0.6 for a 1/0 option on rain tomorrow, at that point, shows that I have just updated my probability for rain tomorrow to the sharp probability 0.6.
Certainly plausible, in view of the tradition concerning credence or subjective probability that we are all part of. But (II) contradicts (I). For if (II) is correct, then the agent, me, has quantified the risk.
(I) says in effect that I am not changing my opinion about rain tomorrow at all. Rather, my opinion does not suffice to determine my decision. Note that there was clearly no opinion updating going on, for between the formulation of my opinion and the offer of the bet there was no new information to update on!
To show what my opinion is, I will continue to counsel anyone who asks that I can say no better than that the probability of rain tomorrow is at least 0.5. Then they can decide for themselves whether to take a risk with bets that cost more than 0.5, or not.
To me this is common sense.
A concrete example of ‘no opinion at all’
Roger White has an objection to (I), arguing that the permissive policy would lead to financial ruin. The policy would permit you to bet the same 0.6 each time, which would ignore all that is left open by that vague opinion. Although we do not know this, the chance might in each case be 0.5, while the agent keeps buying the bet for 0.6.
But this just ignores learning. Even an uneducated but reasonable gambler will keep lowering his bets if he is consistently losing. To be more concrete, since such a repetition of chances of rain is not plausible, suppose that the Jake puzzle example has been set up with proposition p identified so as to make sure of our total ignorance.
An experiment has been set up with a coin of unknown bias, it is tossed, and p is the proposition that it landed Heads up. Then Jake, who knows the result, continues the process with his fair coin, as in the puzzle.
What does it mean that the first coin is a coin with unknown bias? The probability that this coin lands Heads up is x, and x could equally be any number in [0,1]. Well, what is “equally”? What is it for x to be a random selection from [0,1]? There are different answers we could give here, but let’s take this one: for any two sub-intervals of [0,1] that are of equal length, the probability that x belongs to them is the same.
Then Jake’s procedure is in effect a two-coin process with unknown bias in the smaller interval [0, 0.5].
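As a check on what this uniform model implies, here is a minimal simulation sketch in Python (the uniform prior and the fair second coin are as described above; everything else in the code is merely illustrative):

```python
import random

def trial():
    x = random.random()              # bias drawn uniformly from [0, 1]
    first = random.random() < x      # the biased coin: Heads with chance x
    second = random.random() < 0.5   # Jake's fair coin
    return first and second

n = 100_000
freq = sum(trial() for _ in range(n)) / n
print(freq)   # about 0.25 under this uniform model
```

The long-run frequency of double Heads is about 0.25 on the uniform model; the vague opinion itself fixes only that the chance lies somewhere in [0, 0.5].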
On the liberal policy, if I am now asked to bet on whether both coins landed Heads up on a specific occasion, I could for example choose to buy the bet for 0.2. White's argument implies that this liberal policy permits me to make that same choice each time if the experiment is endlessly repeated, and that this strategy would lead to financial ruin with certainty.
Is that so?
If the experiment is repeated, there are two possibilities that will merit the “unknown bias” label. First, it may be repeated each time with the same coin (or coin with the same bias). Second, the choice of bias in the tossed coin may be randomized as well.
In the first case, if the real bias is below 0.2 then I will lose more often than by chance. White ignores the information gained from this: in fact the results will allow me to learn, to modify my betting behavior, so as to converge on the real bias, whereafter I will not be consistently losing. If on the other hand the real bias is above 0.2 then I am making money! More power to me.
The second case is not so different, for to make this precise we must again specify what the randomness, in the successive choices of coins, amounts to. And depending on what it is, there will typically be in effect an average bias. The gambler can learn from the results, and depending on the gains or losses may be consistently lowering his bets, or else, be happily raking in the money!
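To make the learning point vivid, here is a toy sketch with a simple frequency-matching rule of my own (nothing this specific is claimed in the text; it only illustrates that a losing bettor who attends to the results will stop losing):

```python
import random

def gamble(real_bias, rounds=10_000, start_price=0.2, weight=50):
    # Each round, buy a bet that pays 1 on Heads at the current price.
    # The price starts at start_price and is pulled toward the observed
    # frequency of Heads (a pseudo-count rule; the details are illustrative).
    wealth, heads_seen = 0.0, 0
    for t in range(1, rounds + 1):
        price = (weight * start_price + heads_seen) / (weight + t - 1)
        heads = random.random() < real_bias
        wealth += (1.0 if heads else 0.0) - price
        heads_seen += heads
    return wealth, price

print(gamble(0.1))   # early losses, but the price converges near 0.1 and the losing stops
print(gamble(0.4))   # the price lags below 0.4 for a while, so the gambler makes money
```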
But we still have a question. What can updating vague opinion be like, in a case where there is genuine new information? Nothing in the above discussion touches that question as yet.
There is more than one answer in the literature; I will mention some in the NOTES. White targets the 'orthodox' probabilistic representation of vague opinion ("mushy credence"), so let us look at that. But since phenomena are all and theories are creatures of the imagination only, I am isolating the technical questions from the general discussion.
4. APPENDIX. The ‘Orthodox’ Representation Of Vague Opinion
Take it that the agent’s opinion is stored as a coherent set S of judgments of the following forms:
P(p) ≤ x, P(p) ≥ y
with p belonging to a specific Boolean algebra, the domain of P. That will in effect include P(p) = x, when S includes both P(p) ≤ x and P(p) ≥ x.
The agent’s representor is the set of all probability functions on that algebra which satisfy all members of S. So for example, if the agent’s opinion is that rain is as likely as not, then all the members of the representor assign 0.5 to rain.
As an example to illustrate the main difficulty, suppose that p and q are logically independent propositions, and that the agent judges that each of them is as likely as not.
For example, p = it will rain tomorrow and q = I mislaid my hat.
Now the agent gets evidence q.
The orthodox recipe for updating is this: replace each member by its conditionalization on q if that is well-defined, and eliminate that member if not.
What is the result? Well, for each number y in [0,1] there is a function Q belonging to this representor such that Q(p|q) = y. So after this updating, there is for each number y in [0, 1] a function in the posterior representor which assigns y to p. So after updating, the opinion about rain tomorrow, which was entirely irrelevant to my mislaying my hat, is now totally vague.
Updating in this way is debilitatingly destructive.
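The destruction is easy to exhibit concretely. A minimal sketch, assuming the representor is constrained only by the two judgments P(p) = 0.5 and P(q) = 0.5 on the four-atom algebra generated by p and q:

```python
# Atoms: (p&q, p&~q, ~p&q, ~p&~q). Judgments: P(p) = 0.5, P(q) = 0.5.
# Each member of the representor is then fixed by t = P(p & q) in [0, 0.5]:
def member(t):
    return {'pq': t, 'p~q': 0.5 - t, '~pq': 0.5 - t, '~p~q': t}

for t in [0.0, 0.1, 0.25, 0.4, 0.5]:
    m = member(t)
    p_given_q = m['pq'] / (m['pq'] + m['~pq'])   # conditionalize on q
    print(t, p_given_q)                          # P(p | q) = 2t

# As t runs over [0, 0.5], P(p | q) sweeps all of [0, 1]: after the
# update on q, the opinion about p has become totally vague.
```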
Two options
The above result, with examples, and surrounded by both informal and technical discussions, was in the literature well before Roger White’s paper.
The first idea we can try out is that we could prevent this disaster by putting constraints on the representor, by additions to the state of opinion S. We can add judgments of expectation value, rather than just probability, and these allow us to add judgments of conditional probability. But the problem recurs at that level: any fix that stays with linear constraints does not suffice. We'd have to add non-linear constraints, in some way, for independence and correlation are not expressible in any other way.
Anyone have suggestions? Constructive attempts to find a better representation of vague opinion?
The second idea is that it is conditionalization that is at fault, and that indeed the fault lies with the idea that the representor is to be updated point-wise. Updating the representor needs to be a holistic action, an action that preserves certain important structure of the representor as a whole.
How can we think about this? The representor is a convex structure: for if P(p) ≤ x and P′(p) ≤ x then any convex combination of P and P′ satisfies that constraint as well. (Similarly for expectation value constraints.)
That suggests looking at the theory of convex structures taken as wholes. Anyone have suggestions?
NOTES
Originally I subscribed to the ‘orthodox’ representation of vague probability, with conditionalization as updating method. But looking at the dilation effect (cf. Seidenfeld and Wasserman 1993) I found that it ran into the trouble with conditionalization that I described above (see my papers listed below).
I mentioned that we could look into non-linear constraints on the representor. Difficult probably, but there is a study by Fagin, Halpern, and Megiddo (1990) that could be a resource for this idea.
As I said above, there are different answers in the literature, for questions about how to represent vague probability. One that is quite different from the ‘orthodox’ way is by Fagin and Halpern (1991).
For the weakness of arguments for updating pointwise by conditionalization, and the possibility of alternatives, the place to begin is Grove and Halpern (1998).
As to the different policies for decision making under vague uncertainty, an important technical discussion is by Teddy Seidenfeld (2004). Isaac Levi’s concept of E-admissibility is a candidate for the precise form of the liberal policy. Levi himself is easier to read. The quickest introduction though is section 3 of Seidenfeld’s retrospective on Levi’s work.
REFERENCES
Fagin, R., J. Y. Halpern, and N. Megiddo (1990) "A Logic for Reasoning about Probabilities". Information and Computation 87: 78-128.
Fagin, R. and J. Y. Halpern (1991) "Uncertainty, belief, and probability". Computational Intelligence 7: 160-173.
Seidenfeld, T. (2004) “A contrast between two decision rules for use with (convex) sets of probabilities”. Synthese 140:69-88.
Seidenfeld, T., and Wasserman, L. (1993). “Dilation for Sets of Probabilities”. Annals of Statistics 21: 1139-54.
Topey, Brett (2012) “Coin flips, credences, and the Reflection Principle”. Analysis 72: 478-488.
van Fraassen, Bas C. (2005) “Conditionalizing on violated Bell’s inequalities”. Analysis 65.1: 27-32.
van Fraassen, Bas C. (2006) "Vague Expectation Loss". Philosophical Studies 127: 483-491.
White, Roger (2010) “Evidential symmetry and mushy credence”. In Oxford Studies in Epistemology, Vol 3. Ed. T. S. Gendler and J. Hawthorne, 161-188. New York: Oxford U Press.
Christiaan Huygens is well-known for his use of symmetry arguments in mechanics. But he also used symmetry arguments when he set out the foundations of the modern theory of probability, in delightfully easy form. That is my reading, I’ll explain it here.
Note: I rely on Freudenthal's translation from the Dutch where possible, my own otherwise, and will list the English translation in the Bibliography, for comparison.
1. What is a symmetry argument?
An object or situation is symmetric in a certain respect if relevant sorts of changes leave it just the same in that respect. For example, if you hang a square painting upside down it is still square, though in other respects it is not the same. If you say it is symmetric, you are referring to certain characteristics and ignoring certain differences.
A symmetry argument exploits the differences that remain after the selection of what counts as relevant. Suppose we want to solve a problem that pertains to a given situation S. We state the same problem for a different situation S*, mutatis mutandis, and solve it there. Then we argue that S* and S are essentially the same, that is, the same in all respects relevant to the problem. On this basis we take the solution for this problem for S to be the same as the solution we found for S*.
Whatever be the problem at issue, it is not well posed except in the presence of a prior selection of which aspects will count as the relevant respects. So that prior selection must be understood as a given in each case.
2. Huygens’ fundamental postulate
Postulate. For a decision to be made under uncertainty, in a given situation S, there is an equitable (just, fair, “rechtmatig”) game of chance S* that is in all relevant respects the same as S.
More specifically, if I am offered an opportunity for an investment, what that opportunity is worth equals what it would be worth for me to enter the corresponding equitable game.
A game is equitable if no player is at a disadvantage compared to the others (“daer in niemandt verlies geboden is”). That is a symmetry: the players all have the same role. If roles are reversed the game is still the same. For example if a Bookie offers a bet to a Player, the bet is an equitable bet if the chances and amounts of gain and loss would remain the same for each if their roles were reversed.
The Netherlands was a mercantile nation; financiers would get together to outfit ships for trade in the Orient. In this sort of situation, with its opportunities for investment, the merchant must determine what those opportunities are worth. The relevant respects are just the chances the players have of gaining the various possible gains or losses, and the amounts of those gains or losses. Note well that the selection of the respects which alone count as relevant is a matter of choice, of the participants. Given that Huygens addresses the case in which the relevant respects are the gains and chances alone, the Postulate is not a substantive assertion, nor an empirical claim.
Given this postulate and the concept of an equitable game, everything is in place for a typical symmetry argument.
3. A chancy set-up with two outcomes with equal chances
Proposition I. If I have the same chance to get a or b it is worth as much to me as (a + b)/2.
The situation might be the offer of an investment opportunity. What is the corresponding equitable game?
In an equitable game each player places the same stake, and the winner will get the total stake, but may have side-contracts with other players (as long as the relevant symmetries are not broken).
I play against another person, we each put up the same stake x (what we take the game to be worth for us to play), and we agree that whoever wins will give a to the other. The event (e.g. a coin toss) has two possible outcomes, with equal chance. What must be the stake x so that the game situation is perfectly symmetric for us, that is, for each of us to have equal chances to receive either a or b?
We have equal chances to be winner or loser. The winner gets the total stake, which is 2x, but gives a to the loser. The result must be that the winner ends up with b. So b = (2x – a). So solving for x, we arrive at the solution that the stake is x = (a + b)/2.
That is straightforward, and all the relevant details are explicit. It is not so straightforward when the proposition is generalized to a number of possible outcomes.
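Before turning to that generalization, here is a minimal arithmetic check of Proposition I, with hypothetical prizes a = 2 and b = 6:

```python
from fractions import Fraction

a, b = Fraction(2), Fraction(6)   # hypothetical prizes
x = (a + b) / 2                   # the stake that Proposition I derives

# Each player stakes x; the winner takes the pot 2x and gives a to the loser.
assert 2 * x - a == b             # the winner ends up with exactly b
# So each player faces equal chances of ending with a or with b, which is
# just the situation that the Proposition prices at (a + b)/2.
```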
4. A chancy set-up with three outcomes with equal chances
Proposition II. If I have the same chance to get a or b or c it is worth as much to me as (a + b + c)/3.
Huygens’ argument here makes the corresponding equitable game to be one with three players. Each will enter with the same stake x. Let me be player P1. With player P2 I agree that each of us, if winning, will give the other b. With player P3 I agree that each of us, if winning, will give the other c.
The event is one with three outcomes, with equal chances, and the winner being P1 on the first outcome, P2 on the second, and P3 on the third. If P2 wins, I get b. If P3 wins, I get c. What must the stake x be to complete the picture, with me getting a if I win?
If I win I get the total stake 3x, but pay out b to P2 and c to P3. So what is required is that a = (3x – b – c), or equivalently, that the stake x = (a + b + c)/3.
At first sight this is again straightforward. It answers what my stake should be.
But what if the other players didn’t want to put up that stake? Is player P2 in the same position in this game as P3, given that they have made different contracts with me?
What was not explicit in Huygens’ proof is that players P2 and P3 must do something ‘just like’ what I did by entering those special contracts.
Suppose all players do place stake x = (a + b + c)/3. Player P3 will get a if he wins, and will get c if I win. To complete the picture he must get b if P2 wins. And similarly, P2 needs to get c if P3 wins. So P2 and P3 would have to make an un-symmetric contract.
Is that not to the disadvantage of one, depending on which of b or c is the greater? No, for whoever wins will get a, and if they do not win they have an equal chance of getting b or c.
WHAT THEY RECEIVE

                     If the winner is:
                     P1              P2                       P3
P1 receives:         3x − b − c      b                        c
P2 receives:         b               3x − b − c (c to P3)     c
P3 receives:         c               b (from P2)              3x − b − c (b to P2)
In the description of the game, the roles I gave myself and the roles the others play are not the same. But in fact the game is equitable, because the different roles are the same in the relevant respects. The difference does not affect the chances that each of us has to get any of the three amounts a, b, c, which are the same for all of us. Therefore if we reverse roles, if P3 and I change chairs, so to speak, our chances for receiving any of those amounts are not changed. Think of a game of cards in which one of the players holds the bank. If the game is equitable, it does not matter who holds the bank, and it makes no difference if the banker changes roles with one of the others.
What matters once again is the prior selection of what will count as relevant.
As Huygens points out, this argument is easily extended to 4, 5, … different outcomes with equal chances.
5. A chancy set-up with unequal chances
So far we have looked at games in which the deciding event is something like a toss with a fair coin or with a fair die, or with several of them, or some other such device. What about games in which the decision about who wins is made with a biased coin or biased die? What if the possible gains are a and b but their chances are different?
The words “chance” and “chances”. There is some ambiguity in how we use the word “chance” which appears saliently in Proposition III. If I buy a ticket in a 1000-ticket lottery, the chance I have of winning is 1/1000. What if I buy five tickets? Then I have five chances of winning! Or, we also calculate, then the chance of winning I have is 5/1000.
In the vernacular these are two ways of saying the same thing, but these two ways of speaking do not go together. Putting them together we get nonsense: If I say that I have five chances of winning, and that the chance of winning is 5/1000, can I then ask which of those five chances is the chance of winning?
Proposition III. If the number of chances I have for a is p, and the number of chances I have for b is q, then assuming that every chance can happen as easily, it is worth as much to me as (pa + qb)/(p + q).
Huygens uses the first way of speaking, and we can understand it as follows, with an example. For each game with a biased coin or die there is an equivalent game with cards. Consider a game in which the deciding event is the toss of a biased coin, so that the chance of Heads is 2/3 and the chance of Tails is 1/3. This game is equivalent to a 3-card game, with equal chances, in which two cards are labeled H and one is labeled T, and the prize is a if a card with H is drawn and is b if a card with T is drawn.
This assertion of equivalence is an appeal to symmetry. The two games are the same in all relevant respects; therefore reasoning about the one is equally relevant reasoning about the other.
Just by choosing the first way of speaking, Huygens has reduced the problem of the biased coin or die to the previous case. The game with unequal chances is equivalent to a game with equal chances, and this Proposition III is the straightforward generalization of Proposition II to arbitrarily large cases.
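A small sketch of that reduction, with hypothetical prizes and with p = 2, q = 1 as in the biased-coin example:

```python
from fractions import Fraction

a, b = 10, 4   # hypothetical prizes
p, q = 2, 1    # chances for a and for b (the coin with chance 2/3 of Heads)

# Proposition III directly:
worth = Fraction(p * a + q * b, p + q)

# The equivalent card game with equal chances: two H-cards, one T-card.
cards = [a, a, b]
card_game_worth = Fraction(sum(cards), len(cards))

assert worth == card_game_worth
print(worth)   # 8
```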
NOTES
Expectation value. What Huygens calls "what it is worth for me" (e.g. "dit is mij zoveel weerdt als …" in Proposition I) matches what we now call the expectation value. We nevertheless read Huygens' monograph as a treatise on probability, for the two notions are interdefinable. For example the probability of proposition A is the expectation value of a gamble on A with outcome 1 if true and 0 if false.
Finitude. Huygens' arguments go only as far as cases with a finite number of outcomes, with probabilities that are all rational numbers.
Translation. Christiaan Huygens' little 1657 monograph was the first modern book on probability theory. It was first written in Dutch with the title Van Rekeningh in Spelen van Geluck. We note in passing that the Dutch word ("geluk", in modern spelling) means "chance" in this context but it is the same word for fortune and for happiness. (Does that reveal something about the Dutch character?) Van Schooten translated this into Latin and published it as part of a larger work. Hans Freudenthal (1980) offered a number of criticisms of how the text is understood (the French translation was excellent, he says, and the German abysmal) and so provided his own English translation of various parts. The English translation of 1714 is precious for its poetic character, and still easily available.
The original Dutch version and information on the translations is available online from the University of Leiden:
My previous blog post on this subject was quite abstract. To help our imagination we need to have an example.
Result that we had. Let A→ be a Boolean algebra with additional operator →. Let P(A→) be the set of all probability measures on A→ such that m(a → b) = m(b|a) when defined ("Stalnaker's Thesis" holds). Then:
Theorem. If for every non-zero element a of A→ there is a member m of P(A→) such that m(a) > 0 then P(A→) is not closed under conditionalization.
For the example we can adapt one from Paolo Santorio. A fair die is to be tossed, and the possible outcomes (possible worlds) are just the six different numbers that can come up. So the proposition “the outcome will be even” is just the set {2, 4, 6}. Now we consider the proposition:
Q. If the outcome is odd or six then, if the outcome is even it is six.
For the probability function m we choose the natural one: the probability of "the outcome will be even" is the proportion of {2, 4, 6} in {1, 2, 3, 4, 5, 6}, that is, 1/2. And so forth. Let's use E to stand for "the outcome is even" and S for "the outcome is six". So Q is [(~E ∪ S) → (E → S)].
PLAN. What we will do is first to determine m(Q). Then we will look at the conditionalization m# of m on the antecedent (~E v S), and next on the conditionalization m## of m# on E. If everything goes well, so to speak, then the probability m(Q) will be the same as m##(S). If that is not so, we will have our example to show that conditionalization does not always preserve satisfaction of Stalnaker’s Thesis.
EXECUTION. Step One is to determine the probability m(Q). The antecedent of Q is (~E ∪ S), which is the proposition {1, 3, 5, 6}. What about the consequent, (E → S)?
Well, E → S is true in world 6, and definitely false in worlds 2 and 4. Where else will it be true or false?
Here we appeal to Stalnaker's Thesis. The probability of (E → S) is the conditional probability of S given E, which is 1/3. So that proposition (E → S) must have exactly two worlds in it (2/6 = 1/3). Since it is true in 6, it must also be true in precisely one of {1, 3, 5}. Which it is does not affect the argument, so let it be 5. Then (E → S) = {5, 6}.
Now we can see that the probability of Q is therefore, by Stalnaker’s Thesis, the probability of {5,6} given {1, 3, 5, 6}, that is, 1/2. (Notice how often Q is false: if the outcome turns out to be 1 then the antecedent is true, but there is no reason why “if it is even it is six” would be true there, etc.)
Step Two is to conditionalize m on the antecedent (~E ∪ S), to produce probability function m#. If m# still satisfies Stalnaker’s Thesis then m#(E → S) = m(Q). Next we conditionalize m# on E, to produce probability function m##. Then, if things are still going well, m##(S) = m(Q).
To compute m##(S): conditionalizing m on (~E ∪ S) = {1, 3, 5, 6} gives m#, uniform on those four worlds; conditionalizing m# on E = {2, 4, 6} then leaves only world 6, so m##(S) = 1. Bad news! That is greater than m(Q) = 1/2. So things did not go well, and we conclude that conditionalization has taken us outside P(A→).
Why does that show that conditionalization has taken us outside P(A→)? Well suppose that m# obeyed Stalnaker’s Thesis. Then we can argue:
m##(S) = 1, so m#(S|E) = 1. Therefore m#(E → S) = 1 by Stalnaker’s Thesis. Hence m(E → S | ~E v S) = 1. Finally therefore m((~E v S) → (E → S)) = m(Q) = 1. But that is false, as we saw above m(Q) = 1/2.
So, given that m obeys the Thesis, its conditionalization m# does not.
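The whole counterexample can be checked by brute enumeration. A minimal sketch, with world 5 chosen for (E → S) as in the text:

```python
from fractions import Fraction

worlds = {1, 2, 3, 4, 5, 6}   # the fair die
E = {2, 4, 6}                 # "the outcome is even"
S = {6}                       # "the outcome is six"
ES = {5, 6}                   # (E -> S), with world 5 chosen as in the text
ANT = (worlds - E) | S        # (~E ∪ S) = {1, 3, 5, 6}

def m(A, given=frozenset(worlds)):   # the uniform measure, conditionalized
    return Fraction(len(A & given), len(given))

print(m(ES, ANT))     # m(Q) = m(E -> S | ~E ∪ S) = 1/2
print(m(S, ANT & E))  # m##(S) = m(S | {6}) = 1, not 1/2
```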
Note. This also shows along the way that the Extended Stalnaker’s Thesis, that m(A → B|X) = m(B|A ∩ X) for all X, is untenable. (But this is probably just the 51st reason to say so.)
APPENDIX
Just to spell out what is meant by conditionalization, let’s note that it must be defined carefully to show that it is a matter of adding to any condition already present (and of course to allow that it is undefined if the result is a condition with probability zero).
So m(A|B) = m(A ∩ B)/m(B), defined iff m(B) > 0. Hence m(A) = m(A|K), where K is the tautology (unit element of the algebra).
Then the conditionalization m# of m on B is m(. | K ∩ B), and the conditionalization m## of m# on C is m#(. |K ∩ C) = m(. | K ∩ B ∩ C), and so forth. Calculation:
m##(X) = m#(X|C) = m#(X ∩ C)/m#(C) = m(X ∩ C | B) divided by m(C | B),
that is, [m(X ∩ C ∩ B)/m(B)] divided by [m(C ∩ B)/m(B)], which equals m(X ∩ C ∩ B)/m(C ∩ B) = m(X | B ∩ C).
In his new book The Meaning of If Justin Khoo discusses the inference from “Either not-A or B” to “If A then B”. Consider: “Either he is not in France at all, or he is in Paris”. Who would not infer “If he is in France, he is in Paris”? Yet, who would agree that “if … then” just means “either not … or”, the dreaded material conditional?
I do not want to argue either for or against the validity of the ‘or to if’ inference. The curious fact is that just thinking about it brings out something very unusual about conditionals. Perhaps it will have far reaching consequences for the concept of logical entailment.
To set out the traditional concept of entailment let A be a Boolean algebra of propositions and P(A) the set of all probability measures with domain A. I will use “&” for the meet operator. Then entailment, as a relation between propositions, can be characterized in three different ways, which are in fact, in this case, equivalent:
(1) the natural partial ordering of A, with (a ≤ b) defined as (a&b) = a.
(2) For all m in P(A), if m(a) = 1 then m(b) = 1.
(3) For all m in P(A), m(a) ≤ m(b).
The argument for their equivalence, which is spelled out in the Appendix, requires just two facts about P(A):
P(A) is closed under conditionalization, that is, if m(a) > 0 then m(. |a) is also in P(A), if defined.
If a is a non-zero element of A then there is a measure m in P(A) such that m(a) > 0.
Enter the Conditional: the ‘Or to If’ Counterexample
The Thesis, aka Stalnaker’s Thesis, is that the probability of conditional (a → b) is the conditional probability of b given a, when defined:
m(a → b) = m(b|a) = m(b & a)/m(a), if defined.
Point: if the special operator → is added to A with the condition that m(a → b) = m(b|a) when defined, then these three candidate definitions are no longer equivalent. For:
(4) For all m in P(A), if m(~a v b) = 1 then m(b|a) = 1
(5) For many m in P(A), m(~a v b) > m(b|a)
For (4) note that if m(~a v b) = 1 then m(a & ~b) = 0, so m(a) = m(a & b). Therefore m(b|a) = 1. So on the second characterization of entailment, the 'or to if' inference is valid. If you are sure of the premise you will be sure of the conclusion.
But not so for the third characterization of entailment. For (5) take this example (I will call it the counterexample): we are going to toss a fair die:
Probability that the outcome will be either not even or six (i.e. in {1, 3, 5, 6}) = 4/6 = 2/3.
Probability that the outcome is six, given that the outcome is even = 1/3.
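The two numbers can be checked by counting worlds (a trivial sketch):

```python
from fractions import Fraction

worlds = {1, 2, 3, 4, 5, 6}           # the fair die
E, S = {2, 4, 6}, {6}

premise = (worlds - E) | S            # "not even, or six" = {1, 3, 5, 6}
print(Fraction(len(premise), 6))      # m(~a v b) = 2/3
print(Fraction(len(S & E), len(E)))   # m(b|a) = m(six | even) = 1/3
```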
So in this context the traditional three-fold concept of entailment comes apart.
Losing Closure Under Conditionalization
Recalling that to prove the equivalence of (1)-(3) for a Boolean algebra, we needed just two assumptions, we can use that, together with the counterexample, to draw a conclusion that holds for any and every logic of conditionals with Stalnaker's Thesis.
Let A→ be a Boolean algebra with additional operator →. Let P(A→) be the set of all probability measures on A→ such that m(a → b) = m(b|a) when defined. Then:
Theorem. If for every non-zero element a of A→ there is a member m of P(A→) such that m(a) > 0 then P(A→) is not closed under conditionalization.
I was surprised. Previous examples of such lack of closure were due to special principles like Miller’s Principle and the Reflection Principle.
I do not think this result looks really bad for the Thesis, though it needs to be explored. It does mean that from a semantic point of view, there are in the same set-up two distinct logics of conditionals.
However, it seems to look bad for the Extended Thesis (aka ‘fully resilient Adams Thesis’):
(*) m(A → B| E) = m(B | E & A) if defined
For if we look at the conditionalization of m on a proposition X, namely the function m*( . | .. ) = m( . | .. & X), then if m* is well defined and m satisfies (*) we get
m*(A → B| E) = m(A → B| E & X) = m(B | E & A & X) = m*(B| E & A)
that is, m* also satisfies the Extended Thesis. So it appears that the Extended Thesis entails or requires closure under conditionalization for the set of admissible probability measures.
But it can’t have it, in view of the ‘or to if’ counterexample.
Appendix.
That (1) – (3) are equivalent for a Boolean algebra (with no modal operators).
Clearly, if (a & b) = a then m(a) ≤ m(b), and hence also: if m(a) = 1 then m(b) = 1. This includes the case of a = 0.
So I need to show that if the first relation does not hold, that is, if it is not the case that a ≤ b, then neither do the other two.
Note: I will make use of just two features of P(A):
P(A) is closed under conditionalization, that is, if m(a) > 0 then m(. |a) is also in P(A), if defined.
If a is a non-zero element of A then there is a measure m in P(A) such that m(a) > 0.
Lemma. If it is not the case that (a & b) = a then there is a measure p such that p(a & ~b) > 0 while p(b & ~a) = 0.
For if (a & b) is not a then (a & ~b) is a non-zero element. Hence there is a measure m such that m(a & ~b) > 0, and so also m(a) > 0. So m( . |a) is well defined. And then m(a & ~b | a) > 0 while m(b & ~a | a) = 0.
Ad condition (3): Suppose now that (a & b) is only part of a, and m(a & ~b) > 0. Then m(a) > 0, so m( . |a) is well defined and in P(A). Now m(b|a) = m(b & a)/[m(a & b) + m(a & ~b)], hence < 1, hence < m(a|a) = 1.
Ad condition (2): All we have left now to show is that if (a & b) is not a, and a is not 0, then condition (2) does not hold either. But that follows from what we just saw: there is then a member m of P(A) such that m(a) > m(b & a). So consider the measure m(.|a), which is also in P(A): m(b|a) < 1, while of course m(a|a) = 1.
(Puzzle *) We, Able and Baker, A and B for short, are two propositions. Baker does not imply the negation of Able. Yet our conjunction is a self-contradiction. Who are we?
In any first or even second year logic course the right answer will be “you do not exist at all!” For if Baker does not imply the negation of Able then their conjunction could be true.
But the literature on epistemic modals furnishes examples, to wit:
“It is raining, but it might not be” cannot be true. Yet, “it might not be raining” does not imply “It is not raining”.
Such examples do rest on assumptions that may be challenged – for example, the assumption that the quoted sentences must all be true or false. But let that go. The interesting question is how such a logical situation as depicted in (Puzzle *) could be represented.
That sort of situation was studied in quantum logic, with its geometric models, where the propositions are represented by the subspaces.
A quantum mechanics model is built on a finite-dimensional or separable Hilbert space. In quantum logic the special properties of the infinite-dimensional, separable space do not play a role till quite late in the game. What matters is mainly that there is a well-defined orthogonality relation on this space. So it suffices, most of the time, to think just about a finite-dimensional Hilbert space (that is, a finite-dimensional inner product vector space, aka a Euclidean space).
For illustration think just of the ordinary 3-space of high school geometry but presented as a vector space. Draw the X, Y, Z axes as straight lines perpendicular to each other. The origin is their intersection. A vector is a straight line segment starting at the origin and ending at a point t, its tip; we identify this vector by its tip. The null vector 0 is the one with zero length. Vectors are orthogonal iff they are perpendicular, that is, the angle between them is a right angle.
In the diagram, the vectors drawn along the axes have tips (3, 0, 0), (0,5,0), and (0,0,2). The vector with tip (3, 5, 2) is not orthogonal to any of those.
If A is any set of vectors, its orthocomplement ~A is the set of vectors that are orthogonal to every vector in A. The subspaces are precisely the sets A such that A = ~~A. In this diagram the subspaces are the straight lines through the origin, and the planes through the origin, and of course the whole space. So the orthocomplement of the X-axis is the YZ plane. The orthocomplement of the solid arrow, with tip (3, 5, 2) is thus a plane, the one to which it is perpendicular.
About (Puzzle *). Our imaginative, intuitive picture of a 3-space provides an immediate illustration to solve (Puzzle *). In quantum logic, the propositions are the subspaces of a Hilbert space. Just let A and B be two lines through the origin that are not orthogonal to each other. Their conjunction (intersection) is {0}, the ‘impossible state’, the contradiction. But neither is in the other’s orthocomplement. In that sense they are compatible.
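A minimal numerical sketch of this solution (the particular vectors are mine, chosen only for illustration):

```python
import numpy as np

# Two lines through the origin in R^3, as in the solution to (Puzzle *):
A = np.array([1.0, 0.0, 0.0])   # the X-axis
B = np.array([1.0, 1.0, 0.0])   # a line not orthogonal to it

print(np.dot(A, B))   # 1.0, nonzero: B is not in ~A, and A is not in ~B
print(np.linalg.matrix_rank(np.vstack([A, B])))   # 2: the lines are independent,
# so the two subspaces intersect only in {0}, the impossible state, although
# neither proposition lies in the other's orthocomplement.
```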
That the propositions are taken to be the subspaces has a rationale, introduced by von Neumann, back in the 1930s. The vectors represent physical states. Each subspace can be described as the set of states in which a certain quantity has a particular value with certainty. (That means: if that quantity is measured in that state, the outcome is that value, with probability 1.)
Von Neumann introduced the additional interpretation that this quantity has that value if and only if the outcome of a measurement will show that value with certainty. This became orthodoxy: here truth coincides with relevant probability = 1.
Given this gloss, we have:
subspace A is true in (the state represented by) vector v if and only if v is in A.
We note here that if vector u = kv (in our illustration, that they lie on the same straight line through the origin) then they belong to all the same subspaces. As far as truth is concerned, they are distinct but indiscernible. (For the textbook emphasis on unit vectors see note 1.)
Since the subspaces are the closed sets for the closure operation ~~ (the sets S such that S = ~~S, the orthocomplement of the orthocomplement of S), they form a complete lattice (note 2).
The self-contradictory proposition contains only the null-vector 0 (standardly called the origin), the one with zero length, which we count as orthogonal to all other vectors. Conjunction (meet) is represented by intersection.
Disjunction (join) is special. If X is a set of vectors, let [X] be the least subspace that contains X. The join of subspaces S and S’, denoted (S ⊕ S’), is [S ∪ S’]. It is a theorem that [S ∪ ~S] is the whole space. That means specifically that there is an orthonormal basis for the whole space which divides neatly into a basis for S and a basis for ~S. Thus every vector is the sum of a vector in S and a vector in ~S (one of these can be 0 of course).
One consequence is of course that, in traditional terms, the Law of Excluded Middle holds, but the Law of Bivalence fails. For v may be in A ⊕ B while not being either in A or in B.
The term "orthologic" refers to any logic which applies to a language in which the propositions form an orthocomplemented lattice. So orthologic is a generalization of quantum logic.
The idea, once advanced by Hilary Putnam, that the logic of natural language is quantum logic, was never very welcome, if only because learning quantum logic seemed just too hefty a price to pay.
But the price need not be so high if most of our discourse remains on the level of 'ordinary' empirical propositions. We can model that realm of discourse by specifying a sufficiently large Boolean sublattice of the lattice of subspaces.
For a non-trivial orthocomplemented lattice, such as the lattice of subspaces of a Hilbert space, has clearly identifiable Boolean sublattices. Suppose for example that the empirical situations that we can discern have only familiar classical logical relations. That means that, in effect, all the statements we make are attributions, precise or vague, to mutually compatible quantities (equivalently, there is a single maximal observable Q such that all humanly discernible quantities are functions of Q).
Then the logic of our 'normal' discourse, leaving aside such subtleties as epistemic modals, is classical, even if it is only a (presumably large) fragment of natural language. For the corresponding sublattice is Boolean.
Quantum states are variously taken to be physical states or information states. The paper by Holliday and Mandelkern (henceforth H&M) deals with information, and instead of “states” they say “possibilities” (note 3). Crucial to their theory is the relation of refinement:
x is a refinement of y exactly if, for all propositions A, if y is in A then x is in A.
I will use x, y, z for possibilities, which in our case will be quantum states (those, we'll see below, are not limited to vectors).
If we do take states to be vectors and propositions to be subspaces in a vector space, then the refinement relation is trivial. For if u is in every subspace that contains t, then it is in [t], the least subspace to which t belongs (intuitively, the line through the origin on which t lies), and that would then be the least subspace to which u belongs as well. So then refinement is the equivalence relation: u and t belong to the same subspaces. As far as what they represent, whether it is a physical state or an information state, there is no difference between them. They are distinct but indiscernible. Hence the refinement relation restricted to vectors is trivial.
But we can go a step further with Holliday and Mandelkern by turning to a slightly more advanced quantum mechanics formalism.
When quantum states are interpreted as information states, the uncertainty relations come into play, and maximal possible information is no longer classically complete information. Vectors represent pure states, and thought of in terms of information they are maximal: they are as complete as can be. But it is possible, and required (not just for practical reasons), to work with less than maximal information. Mixtures, or mixed states, can be used to represent the situation that a system is in one of a set of pure states, with different probabilities. (Caution: though this is correct it is, as I'll indicate below, not tenable as a general interpretation of mixed states.)
To explain what mixtures are we need to shift focus to projection operators. For each subspace S other than {0} there is the projection operator P[S]: vector u is in S if and only if P[S]u = u, P[S]u = 0 if and only if u is in ~S. This operator ‘projects’ all vectors into S.
For the representation of pure states, the job of vector u is done equally well by the projection operator P[u], which we now also refer to as a pure state.
Mixed states are represented by statistical operators (aka density matrices) which are, so to speak, weighted averages of mutually orthogonal pure states. For example, if u and t are orthogonal vectors then W = (1/2)P[u] + (1/2)P[t] is a mixed state.
Intuitively we can think of W as being the case exactly if the real state is either u or t and we don’t know which. (But see below.)
W is a statistical operator (or density matrix) if and only if there are mutually orthogonal vectors u(i) (other than 0) such that W = Σb(i)P[u(i)] where the numbers b(i) are positive and sum to 1. In other words, W is a convex combination of a set of projections along mutually orthogonal vectors. We call the equation W = Σb(i)P[u(i)] an orthogonal decomposition of W.
What about truth? We need to extend that notion by the same criterion that was used for pure states, namely that the probability of a certain measurement outcome equals 1.
What is certain in state W = (1/2)P[u] + (1/2)P[t] must be what is certain regardless of whether the actual pure state is u or t. So that should identify the subspaces which are true in W.
But now the geometric complexities return. If u and t both lie in subspace S then so do all linear combinations of u and t. So we should look rather to all the vectors v such that, if the relevant measurement probability is 1 in W then it is 1 in pure state v. Happily those vectors form a subspace, the support of W. If W = Σb(i)P[u(i)], then that is the subspace [{u(i)}]. This, as it happens, is also the image space of W, the least subspace that contains the range of W. (Note 4.)
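A small sketch of these notions, with hypothetical orthogonal vectors u and t in R^3:

```python
import numpy as np

def proj(v):
    v = v / np.linalg.norm(v)
    return np.outer(v, v)       # projection onto the line through v

u = np.array([1.0, 0.0, 0.0])
t = np.array([0.0, 1.0, 0.0])   # orthogonal to u
W = 0.5 * proj(u) + 0.5 * proj(t)   # a statistical operator (mixed state)

# The support (image space) of W is the whole XY-plane, not just the two lines:
print(np.linalg.matrix_rank(W))           # 2
v = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
print(np.allclose(W @ v, 0.5 * v))        # True: v lies in the support of W
```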
It is clear then how the notion of truth generalizes:
Subspace S is true in W exactly if the support of W is part of S
And we do have some redundancy again, because of the disappearance of any probabilities short of certainty, since truth is construed following von Neumann. For every subspace is the support of some pure or mixed state, and for any mixed state that is not pure there are infinitely many mixed states with the same support.
While a pure state P[u] has no refinements but itself, if v is any vector in the support of W then P[v] is a refinement of W. And in general, if W’ is a statistical operator whose support is part of W’s support, then W’ is a refinement of W.
So we have here a non-trivial refinement relation.
Note: the geometric complexities. I introduced mixed states in a way seen in textbooks: that, for example, W = (1/2)P[u] + (1/2)P[t] represents a situation in which the state is either u or t, with equal probabilities. That is certainly one use (note 5).
But an ‘ignorance interpretation’ of mixtures in general is not tenable. The first reason is that orthogonal decomposition of a statistical operator is not unique. If W = (1/2)P[u] + (1/2)P[t] and W = (1/2)P[v] + (1/2)P[w] then it would in general be self-contradictory to say that the state is really either u or t, and that it is also really v or w. For nothing can be in two pure states at once. Secondly, W has non-orthogonal decompositions as well. And there is a third reason, having to do with interaction.
All of this has to do with the non-classical aspects of quantum mechanics. Well, good! For if everything became classical at this point, we’d lose the solution to (Puzzle *).
7. An open question
So, if we identify what Holliday and Mandelkern call possibilities as quantum states, we have ways to represent such situations as depicted in (Puzzle *), and we have a non-trivial refinement relation.
But there is much more to their theory. It’s a real question, whether continuing with quantum-mechanical states we could find a model of their theory. Hmmm ….
NOTES
1. In textbooks and in practice this redundancy is eliminated by the statement that pure states are represented by unit vectors (vectors of length 1). In foundations it is more convenient to say that all vectors represent pure states, but multiples of a vector represent the same state.
2. See e.g. page 49 in Birkhoff, Garrett (1948) Lattice Theory. Second edition. New York: American Mathematical Society. For a more extensive discussion see the third edition of 1967, Chapter V section 7.
3. Holliday, W. and M. Mandelkern (2022) "The orthologic of epistemic modals". https://arxiv.org/abs/2203.02872v3
4. For the details about statistical operators used in this discussion see my Quantum Mechanics pages 160-162.
5. See P. J. E. Peebles' brief discussion of the Stern-Gerlach experiment, on page 240 of his textbook Quantum Mechanics, Princeton 1992. Peebles is very careful, when he introduces mixed states starting on page 237 (well beyond what a first year course would get to, I imagine!) not to imply that an ignorance interpretation would be generally tenable. But the section begins by pointing to cases of ignorance in order to motivate the introduction of mixtures: "it is generally the case …[that] the state vector is not known: one can only say that the state vector is one of some statistical ensemble of possibilities."
Is this a valid inference? Or better: under what conditions, if any, is this a valid inference?
It may well seem a natural, intuitive inference in certain cases. For example, if I am certain that the coin is fair then I am certain that the probability of Heads in a fair toss of this coin is 0.5. If instead I am certain that the coin is biased 3:1 in favor of Heads then I am certain that the probability of Heads in a fair toss of this coin is 0.75. And in both cases, my conditional probability for a Heads outcome of a fair toss is the corresponding probability, 0.5 in the one case and 0.75 in the other.
But that is one sort of example, and not all examples are so simple.
First I will show a very general form of modeling probabilistic situations in which this inference is indeed valid.
Secondly I will show, with reference to Miller’s Principle for objective chance and the Reflection Principle for subjective probability, that there are important forms of modeling probabilistic situations in which the inference is not valid at all.
And thereby hangs a tale.
One. Simple chance or simple factual opinion
We can think about the coin tossing example in either of two ways. The first is to equate the probabilities with objective chances, resulting from the structure of the coin and of the coin tossing mechanism. The second is to equate the probabilities with the subjective probabilities of a person who has certain odds for the coin toss outcomes. In both cases the set of probability functions that can represent the situation is simply all those that can be defined on the space {Heads, Tails}.
That set, call it PR, has a feature which could remain in similar models of more complex or more sophisticated forms: PR is closed under conditionalization. That is, if P is in PR, and A in the domain of P, then P( . | A) is also in PR.
Assumption I: the set of all probability functions that could represent the probabilities of the propositions in a certain possibility space S = <S, F> is closed under conditionalization.
Explanation: S is the set of possible states of affairs, F is a field of subsets of S (including S) — the members of F we call propositions. A model satisfying the Assumption is a couple M = <S , PR> where PR is a set of probability functions defined on S, which is closed under conditionalization (where defined).
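To fix ideas, here is a minimal sketch of such closure for the two-point space, using a finite stand-in for PR (the cutoff at denominator 6 is mine, purely to keep the check finite):

```python
from fractions import Fraction

# A finite stand-in for PR on {H, T}: all measures with denominator up to 6.
PR = [{'H': Fraction(k, n), 'T': 1 - Fraction(k, n)}
      for n in range(1, 7) for k in range(n + 1)]

def conditionalize(P, A):
    pa = sum(P[w] for w in A)
    if pa == 0:
        return None   # undefined
    return {w: (P[w] / pa if w in A else Fraction(0)) for w in P}

# Conditionalizing any member on {'H'} yields the point mass on H (or is
# undefined), which is again in PR: the closure that Assumption I demands.
updates = [conditionalize(P, {'H'}) for P in PR]
assert all(Q is None or Q in PR for Q in updates)
```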
Theorem 1. If for all P in PR it holds that if P(A) = 1 then P(B) = y, then for all P in PR, P(B | A) = y when defined.
Proof. Suppose that for all P in PR that if P(A) = 1 then P(B) = y.
Suppose per absurdum that for some member P’ of PR it is the case that P’(B |A) = z, and it is not the case that z = y.
This implies that P'(A) > 0. Let Q = P'( . |A), the conditionalization of P' on A; by Assumption I, Q is again in PR.
Then Q(A) = 1 and Q(B) = z. So there is a member of PR, namely Q, such that Q(A) = 1 and it is not the case that Q(B) = y. That contradicts the first supposition, which completes the proof.
Two. Enter higher order probabilities
It may be tempting to think that the theorems for probability when higher order probabilities are not admitted all remain valid when we extend the theory to higher order probabilities. Here we have a test case.
Sometimes one and the same formula plays a role in the modeling of very different situations, and sometimes a formula’s status in various roles ranges from audacity to triviality, from truism to absurdity. All of that happens to be the case with the formula
(*) P(A | pr(A) = x) = x
first appearing as Miller's Principle (connecting logical and statistical probability, now usually read as connecting the measure of ignorance P with objective chance pr) and later as the Reflection Principle (connecting present opinion about facts with present opinion about possible future opinions about those facts). Both principles have a very mixed history (see Notes).
To model probabilities connected with probabilities of those probabilities we will not assume Assumption I (indeed, will show it running into trouble) but rather principle (*), which I will refer to by the name of its second role, the Reflection Principle.
Assumption II. There is a function pr which maps S into PR. For each number r and member A of F we define [pr(A) = r] = { x in S: pr(x)(A) = r}. For all A and r, [pr(A) = r] is a member of F.
(For most numbers, perhaps even for all but a finite set of numbers, [pr(A) = r] will be the empty set.)
(This looks audacious, but it is just how Haim Gaifman sets it up.)
The Reflection Principle is satisfied exactly if for all P in PR, and all A in F and all numbers r,
P(A | pr(A) = r) = r when defined
Theorem 2. If the Reflection Principle is satisfied then Theorem 1 does not hold for PR.
Proof. Suppose P(A) = 1. The Reflection Principle implies that P(A | pr(A) = 0.5) = 0.5 if defined, that is, if P(pr(A) = 0.5) > 0.
But given that P(A) = 1, P(A | pr(A) = 0.5) = 1 also.
Therefore, if P(A) = 1 then P(pr(A) = 0.5) = 0.
So, with B = [pr(A) = 0.5] we see that for all P in PR,
if P(A) = 1 then P(B) = 0.
However, it is not always the case that for all P in PR, P([pr(A) = 0.5] | A) = 0.
This last point I will not prove. Think back to the example: the probability that the chance of getting outcome Heads = 0.5, given that the actual outcome will be Heads, is certainly not zero. For the actual outcome does not determine the chance that outcome had of occurring. Similarly, if I am ignorant of the outcome, then my personal probability for that outcome is independent of what that outcome actually is.
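The corollary below can also be seen in a concrete miniature model (my own construction: two possible biases, a prior that is uniform over them and chance-matching within each, so that Reflection holds):

```python
from fractions import Fraction

# Worlds are (bias, outcome) pairs; pr is the chance function fixed by the bias.
worlds = [(Fraction(1, 2), 'H'), (Fraction(1, 2), 'T'),
          (Fraction(3, 4), 'H'), (Fraction(3, 4), 'T')]
P = {(b, o): Fraction(1, 2) * (b if o == 'H' else 1 - b) for (b, o) in worlds}

A = [w for w in worlds if w[1] == 'H']              # "the outcome is Heads"
B = [w for w in worlds if w[0] == Fraction(1, 2)]   # [pr(A) = 0.5]

def prob(Q, X): return sum(Q[w] for w in X)
def cond(Q, X, Y): return prob(Q, [w for w in X if w in Y]) / prob(Q, Y)

print(cond(P, A, B))   # 1/2: Reflection holds for P

Pa = {w: (P[w] / prob(P, A) if w in A else Fraction(0)) for w in worlds}
print(cond(Pa, A, B))  # 1, not 1/2: the update violates Reflection, so this
                       # PR is not closed under conditionalization
```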
Corollary. If Assumption II holds and the Reflection Principle is satisfied, then the appropriate set PR is not closed under conditionalization.
Of course that corollary was something already known, from the probabilist version of Moore’s Paradox.
NOTES
[1] This is a recap, in more instructive and more general form of three preceding posts: “Moore’s paradox”, “Moore’s Paradox and Subjective Probability”, and “A brief note on the logic of subjective probability”.
[2] I take for granted the concept of a probability function P defined on F. As to conditional probability, P(B | A) this is a binary partial function defined by P(B | A) = P(B ∩ A)/P(A), provided P(A) > 0.
[3] David Miller introduced what came to be called Miller’s Principle in 1966, and produced a paradox. Dick Jeffrey pointed out, in effect, that this came by means of a modal fallacy (fallacy of replacing a name by a definite description in a modal context). Karl Popper, Miller’s teacher, compounded the fallacy. But there was nothing wrong with the principle as such, and it was adapted, for example, by David Lewis in his theory of subjective probability and objective chance.
[4] When I say that the Reflection Principle too has a mixed history I am referring to fallacies by its critics.
BIBLIOGRAPHY
Jeffrey, Richard C. (1970) Review of eight discussion notes. Journal of Symbolic Logic 35: 124-127.
Miller, David (1966) "A paradox of information". The British Journal for the Philosophy of Science 17.1: 59-61.