Consistency of the Reflection Principle for Subjective Probability

A recent article by Cieslinski, Horsten, and Leitgeb, “Axioms for Typefree Subjective Probability”, ends with a proof that the Reflection Principle cannot be consistently added to the axiomatic untyped probability theory which they present.

On the other hand, Haim Gaifman’s “A Theory of Higher Order Probabilities” can be read, despite the glaring difference in interpretation, as establishing the consistency of the Reflection Principle.  

Gaifman’s theory is not untyped, and Gaifman’s approach is not axiomatic but model-theoretic. Thus it stays much closer to the original, informal presentation of the Reflection Principle.  But it is still noticeably abstract.  We can think of his models roughly like this:  certain sets of possible worlds are propositions, and there is a function pr which serves to select those propositions that can express factual statements of form “My (or, the agent’s) probability for A equals r”.

What I would like to do here is present a similar theory, staying in closer touch with the original presentation of the Reflection Principle, and entirely explicit about the way the opinion I currently express (about A, say) is constrained to harmonize with my opinions about how that opinion (about A) could change in time to come.

Introduction

The Reflection Principle purports to be an additional criterion of synchronic coherence: it relates current opinion to other current opinions.  The principle has both a general form (the General Reflection Principle) and a form specifically for agents who have opinions about their own (current and/or future) doxastic states.  The latter was the original formulation, but should now properly be called the Special Reflection Principle.  I will formulate both forms precisely below.

Satisfying Reflection does not require any relation between one’s actual opinions over time.  Nevertheless it is pertinent also for diachronic coherence, because it is a constraint on the agent’s current expectation of her future opinions, and because a policy for managing one’s opinion must preserve synchronic coherence.  

So a minimal probability model, of an agent whose opinion satisfies Reflection, will consist of a probability function P with a domain that includes this sort of proposition:

(Q)   A & my opinion at (current  or future) time t is that the probability of A equals r.

I symbolize the second conjunct as pt(A) = r.  Hence, symbolically,

            (Q) A & pt(A) = r.

Statement pt(A) = r is a statement of fact, true or false, about the agent’s doxastic state at time t.  The agent can express opinions about this, as about any other facts.  

In contrast I  will use capital P to stand for the probability function that encodes the agent’s opinion.  This is the opinion that she expresses or would express with statements like “It seems twice as likely as not (to me) that it will snow tonight”.  So the sentence P(A) = r is one the agent uses to express such an opinion, and she does this in first-person language.  

The (special) Reflection Principle implies a constraint on the opinion expressed in form P(A & pt(A) = r), which relates the opinion expressed about A to the factual statement that the agent has that opinion. 

There is in the corresponding language no nesting: nothing of form P( … P …).  Whenever the agent expresses an opinion, it is an opinion about matters of fact.

We can proceed in two stages.  The first is just to see what the more modest General Reflection Principle is, and how it is to be satisfied. Then we can build on that to do the same for the Special Reflection Principle.  I will focus on modeling, and — except at one point — just take it that the relation to a corresponding language will be sufficiently clear.

Stage 1: General Reflection

My current probability for A must lie within the range spanned by the probabilities for A that I may have or come to have at any time t (present or future), as far as my present opinion is concerned.

To illustrate:  I am a weather forecaster and realize that, depending on whether a certain storm front moves in during the night, my forecast tomorrow morning will be either 0.2 or 0.8 chance of rain.  Then my present forecast for rain must be a chance x of rain tomorrow with x a number in the open interval (0.2, 0.8).

The basic model to represent an agent who satisfies the General Reflection Principle will be the quadruple M = <S, F, TPROB, Pin>, with its elements specified as follows.

T, the set of times, is a finite or countable linearly ordered set with a first member.  For each t in T, TPROB(t) is a finite set of probability functions.  These are functions defined on a field F of subsets of S, with F having S itself as a member.  The members of F represent propositions about which, at any time t, I have an opinion, and the members of TPROB(t) are the opinions I could have at time t.

S = <S, F> I will call the basic space.  I will use A, B, … for members of F, which I will also call the elementary propositions.  The set of probability functions defined on the space S = <S, F> I will call Sp.

At the initial time the agent expresses an opinion, which for now I designate as Pin, consisting in probabilities both for the events represented in space S and for how likely she is to have, at each time t, the various opinions represented in TPROB(t).

The General Reflection Principle requires that for all A in F, Pin(A) is within the span (convex closure, convex hull) of the set {p(A): p is in TPROB(t)}. I will designate that convex closure as [TPROB(t)].  The members of TPROB(t) are the vertices of [TPROB(t)].

Pin assigns probabilities to the members of TPROB(t), which thus belong to the domain of Pin itself.  General Reflection then implies that Pin is a mixture (convex combination) of those members, with the weights thus assigned:

Pin(A) = ∑ {Pin(p)p(A): p in TPROB(t)}

Equivalently, <S, F, Pin > is a probability space, and as it happens, for each t in T, there are appropriate weights such that Pin is a convex combination of the members of TPROB(t).  
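To see the mixture condition at work, here is a small numerical sketch (the space, the probability functions, and the weights are all invented for illustration, not part of the formal model):

```python
# A toy basic space: S has three points; propositions are subsets of S.
S = [0, 1, 2]
A = {0, 1}   # an elementary proposition

# Two opinions I might have at time t, given by their values on the points.
p1 = {0: 0.1, 1: 0.1, 2: 0.8}
p2 = {0: 0.4, 1: 0.4, 2: 0.2}
TPROB_t = [p1, p2]

def prob(p, event):
    return sum(p[x] for x in event)

# The weights Pin assigns to the members of TPROB(t): Pin(p1), Pin(p2).
weights = [0.25, 0.75]

def Pin(event):
    # Pin as the mixture: Pin(A) = sum of Pin(p) * p(A) over p in TPROB(t).
    return sum(w * prob(p, event) for w, p in zip(weights, TPROB_t))

# General Reflection: Pin(A) lies in the span of {p(A): p in TPROB(t)}.
values = [prob(p, A) for p in TPROB_t]
assert min(values) <= Pin(A) <= max(values)
print(Pin(A))   # 0.25 * 0.2 + 0.75 * 0.8 = 0.65
```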

Pin cannot be more than one thing, so those convex combinations must produce, for each time t, the same initial opinion.  We can ensure that this is possible by requiring that for all t and t’,  [TPROB(t’)] = [TPROB(t)].  Of course these sets TPROB(t) can be quite different for different times t; the vertices are different, my opinions are allowed to change.  And specifically, I will later on have some new certainties, for example after seeing the result of an experiment.  What this constraint on the span of foreseen possibilities about my opinion implies for certainties is this:  

if today I am not certain whether A,  then, if I foresee a possibility that I will become certain that A at a later time, then I foresee also a possibility that I will become certain of the opposite at that time.

SUMMARY:  In this construction so far we have Pin defined on a large family of distinct sets, namely the field F of elementary propositions, and each of the sets TPROB(t), for t in T.  

The construction guarantees that Pin, in basic model M = <S, F, TPROB, Pin>, satisfies the General Reflection Principle.

But we have not arrived yet at anything like (Q), and we have not yet given any sense to ‘pt(A) = r’.  This we must do before we can arrive at a form in which the Special Reflection Principle is properly modeled.

Stage 2: Special Reflection

The function Pin cannot do all that we want from it, for we need to represent opinions that relate the agent’s probabilities for events in space S to the probabilities assigned to those events by opinions that the agent may have at various (other) times.

Intuitively, (pt(A) = r) is the case exactly if the ‘actual’ opinion at time t is represented by a function p in TPROB(t) such that p(A) = r.  In general there may be either no, one, or many members of TPROB(t) which assign probability r to A.

So the proposition in question is thus:

(pt(A) = r)  =   {p in TPROB(t): p(A) = r}

Since Pin is defined for each p in TPROB(t), Pin assigns a probability to this proposition:

            Pin(pt(A) = r)  = ∑{Pin(p): p(A) = r and p is in TPROB(t)}.  

But what is not well-defined at this point is a probability for the conjunction (Q), mentioned above, since A is a member of field F and (pt(A) = r) is a member of a quite different field, of subsets of TPROB(t).

We must depart from the minimalist construction in the preceding section, and extend the function Pin  to construct a function P which is well-defined, for each time t, on a larger space.  This process is what Dick Jeffrey called Superconditioning. 

I have explained its relevant form in the preceding post, with an illustration and intuitive commentary.  So I will here proceed a bit more formally than in the preceding post and without much intuitive explanation.  

NOTE.  At this point we should be a bit more explicit about how the model relates to a corresponding language.  Suppose L is a language of sentential logic, and is interpreted in the obvious way in model M:  the semantic value [[Q]] of a sentence Q in L is an elementary proposition, that is, a subset of S, a member of field F.  

As we now build a larger model, call it M*, by Superconditioning, I need to have a notion of something in M* being ‘the same proposition’ as a given elementary proposition in M.  I will use the * notation to do that: there will be a relation * between M and M* such that a sentence Q which has value [[Q]] in M has semantic value [[Q]]* in M*.

Quick overview of the final model, restricted to a specific time t:  

Given: the basic model defined above, to which we refer in the description of final model M*.  

M*(t) = <S*, F*, TPROB*(t), P>, with

S* = S x TPROB(t)

If A is in F then A* = {<x, p>:  x is in  A, p is in TPROB(t)}, 

equivalently, A* = A x TPROB(t)

F* is a field of subsets of S* which includes {A*: A is in  F}

TPROB*(t) and P, defined on F*, are such that for all A in F, P(A*) = Pin(A)

Construction of the final model, for specific time t

We focus on a specific time t, but the procedure is the same for each t in T.  Let TPROB(t) = {p1, …, pn}.  Each of these probability functions is defined on the space S.

But now we will think instead about the combination of each of these probability functions with the space S, as a separate entity.

For each j, from 1 to n, there is a set Sj = {<x, pj>: x in S}.  Equivalently, Sj = S x {pj}.

We define:  

            for A in F, Aj = {<x, pj>:  x is in A},

            the field Fj = {Aj : A is in F}.  

Clearly Sj = <Sj, Fj> is an isomorphic copy of S, disjoint from Sk unless j = k.

            S* = <S*, F*> is the sample space with S* = ∪{Sj: j = 1, …, n}.

Equivalently, S* = S x TPROB(t)

            F* is the least field of subsets of S* that includes S* and includes ∪{Fj: j = 1, …, n}.  

The sets Sj therefore belong to F* and are the cells in a partition of S*.  (These cells represent the distinct situations associated with the different probability functions pj, j = 1, …, n.)

Equivalently, F* is the closure of ∪{Fj: j = 1, …, n} under finite union.  This is automatically closed under finite intersection, since each field Fj is closed under intersection, and these fields are disjoint.  F* has S* as a member, because S* is the union of all the cells.  And the infimum of F* is Λ, because Λ is a member of each of the fields Fj; note also that Λ x TPROB(t) is just Λ.

Clearly, all members of F* are unions of subsets of those cells, specifically finite unions of sets Ak such that A is in F, for certain numbers k between 1 and n, inclusive.

For A in F, we define A* = ∪{Aj: j = 1, …, n}.  Clearly, A* = {<x, p>: x in A, p in TPROB(t)}

The function f: A –> A* is a set isomorphism between F and the field {A*: A is in F}, a subfield of F*.  For example,

A* ∩ B*    = [∪{Aj:  j = 1, …, n}] ∩ [∪{Bj:  j = 1, …, n}]

                  = ∪{Aj ∩ Bj:  j = 1, …, n}

                  = (A ∩ B)*

Now we come to the probabilities.

Definition.   pj* is the probability function on Sj defined by pj*(Aj) = pj(A) for each proposition A in F.

            TPROB*(t) = {pj*: j = 1, …, n}

Looking back once again to our basic model we recall that there are positive numbers bj for j = 1, …, n, summing to 1 such that Pin = ∑{bjpj: j = 1, …, n}.  

We use these same numbers to define a probability function P on sample space S* as follows:

            For j = 1, …, n:

  1. P(Sj) = bj
  2. for each A in F, P(Aj | Sj) = pj*(Aj).  Equivalently, P(A* ∩ Sj) = P(Aj) = P(Sj)pj*(Aj).
  3. P is additive: if A and B are disjoint members of F* then P(A ∪ B) = P(A) + P(B).

Since all members of F* are finite unions of sets Aj, for A in F and j = 1, …, n, it follows that these clauses define P on all members of F*.

It is clear that 3. does not conflict with 2. since pj* is additive.  Since the weights bj are positive and sum to 1, and each function pj* is a probability function which assigns 1 to Sj it follows that P is a probability function with domain F*, and is the appropriate convex combination of the functions pj*.

P(A*) = ∑{P(A* ∩ Sj): j = 1, …, n}

= ∑{P(Aj): j = 1, …, n}

= ∑{bj pj*(Aj): j = 1, …, n}

= ∑{bj pj(A): j = 1, …, n}

= Pin(A)

About the Special Reflection Principle

Define:

(pt(A) = r) = ∪{Sj : P(A*|Sj) = r}

Equivalently,

(pt(A) = r) = ∪{Sj : pj*(Aj) = r}

Since TPROB*(t) is finite, we can switch to a list:

(pt(A) = r)  =  ∪{Sj : j = k, …, m}

            P(pt(A) = r)  =  ∑{P(Sj): j = k, …, m} = ∑{bj: j = k, …, m}

With this in hand we now calculate the probability of the conjunction (Q)

A* ∩ (pt(A) = r)  =  A* ∩ ∪{Sj : j = k, …, m}

                                = ∪{A* ∩ Sj : j = k, …, m}

                                = ∪{Aj : j = k, …, m}

            P(A* ∩ (pt(A) = r))  =  ∑{P(Aj): j = k, …, m}

                                              = ∑{P(Sj)pj*(Aj): j = k, …, m}

                                              = ∑{bj pj*(Aj): j = k, …, m}

                                              = r ∑{bj: j = k, …, m}

because for each j = k, …, m, pj*(Aj) = r.

Given both these results, and the definition of conditional probability, we arrive at:

            P(A* | pt(A) = r) = r, if defined, that is, if P(pt(A) = r) > 0.

This is the Special Reflection Principle.
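Here is a computational sketch of the whole Stage 2 construction for one time t, continuing the invented numbers from the earlier sketch; it checks both that P(A*) = Pin(A) and that the conditional probability comes out as Reflection requires:

```python
# A toy instance of the Stage 2 construction (all numbers invented).
S = [0, 1, 2]
A = {0, 1}
p1 = {0: 0.1, 1: 0.1, 2: 0.8}
p2 = {0: 0.4, 1: 0.4, 2: 0.2}
TPROB_t = [p1, p2]
b = [0.25, 0.75]   # the weights Pin assigns to p1 and p2

# S* = S x TPROB(t): points are pairs (x, j), with j indexing pj.
P_point = {(x, j): b[j] * TPROB_t[j][x]
           for x in S for j in range(len(TPROB_t))}

def P(event):
    # P of a subset of S*, by additivity over points.
    return sum(P_point[pt] for pt in event)

def cell(j):
    # The cell Sj = S x {pj}.
    return {(x, j) for x in S}

A_star = {(x, j) for x in S for j in range(len(TPROB_t)) if x in A}

# Check: P(A*) = Pin(A).
Pin_A = sum(b[j] * sum(TPROB_t[j][x] for x in A)
            for j in range(len(TPROB_t)))
assert abs(P(A_star) - Pin_A) < 1e-12

# The proposition (pt(A) = r): the union of the cells Sj with pj(A) = r.
r = 0.8
prop = set()
for j in range(len(TPROB_t)):
    if abs(sum(TPROB_t[j][x] for x in A) - r) < 1e-12:
        prop |= cell(j)

# Special Reflection: P(A* | pt(A) = r) = r.
assert abs(P(A_star & prop) / P(prop) - r) < 1e-12
```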

NOTES

1]  The same formalism can have many uses and interpretations — just like, in physics, the same equation can represent many different processes.  Of course, here “the equation” refers just to the mathematical form, with no reference to meaning or interpretation.

In that sense the Reflection Principle appeared first (as far as I can remember) as Miller’s Principle, connecting subjective probability with objective chance, and used in that sense by David Lewis in his theory thereof.  

Then Haim Gaifman, who uses the notation pr, gave Miller’s Principle the interpretation that the person expressing her opinion P takes pr to be the opinion of someone(s) or something(s) recognized as expert(s), to which she defers.  I have drawn on Gaifman’s theory with that interpretation elsewhere, to give a sense to acceptance of a scientific theory.

2] But the possibility of this sort of reading, which I had mentioned in “Belief and the Will” only to dismiss it for the issue at hand, did promote a misreading of the Reflection Principle (by David Christensen, for example).  It would clearly be irrational for me to defer to my future opinion except while supposing that I will then be both of sound mind and more knowledgeable than I am now.  But it is not irrational even now to expect myself to be both of sound mind and more knowledgeable, as a result of the sort of good management of my opinion over time that I am committed to.  And this, all the while knowing that I may be interrupted in this management by events beyond my control, or may interrupt myself in the course of gaining new insights.

This is exactly of a piece with the fact that I can morally promise, for example, to protect someone, and expect myself to keep my promise, and morally expect others to rely on my promise, while knowing — as we all do —  the general and irremediable fact that, due to circumstances presently unpredictable, I may fail to do so, either because of force majeure or because of overriding moral concerns.  In epistemology we must strive for the same subtlety as in ethics.

3] See previous post, “Conditionalizing on a combination of probabilities” for Jeffrey’s concept of Superconditioning and its relation to the informal Reflection Principle.

REFERENCES

Cieslinski, Cezary,  Leon Horsten, and Hannes Leitgeb (2022) “Axioms for Typefree Subjective Probability”.  arXiv:2203.04879v1

Gaifman, Haim (1988)  “A Theory of Higher Order Probabilities”.  Pages 191–219 in Brian Skyrms and William L. Harper (eds.), Causation, Chance and Credence.  Dordrecht: Kluwer.

Van Fraassen, Bas C. (1995)  “Belief and the Problem of Ulysses and the Sirens.”  Philosophical Studies 77: 7–37.

A Rudimentary Algebraic Approach to the True, the False, and the Probable

A motivation for this, which will show up in Application 2, is to show that it is tenable to hold that in general, typically, conditionals are true only if they are certain. I do not propose this for conditionals in natural language. But I think it has merits in certain contexts in philosophy of physics, notably interpretation of the conditionals that appear in Einstein-Podolsky-Rosen and Bell Inequality arguments.

[1] The algebra

[2] The language: first step in its interpretation

[3] The algebra: filters and ideals

[4] The language and algebra together: specifying a truth-filter

[5] The language, admissible valuations, validity and the consequence relation

APPLICATION 1: probability space models

APPLICATION 2: algebraic logic of conditionals with probability

NOTE:  I will explain this approach informally, and just for the simple case in which we begin with a Boolean algebra.  

The languages constructed will in general not be classical, but in this case validity of the classical sentential logic theorems will be preserved, even if other classical features are absent.

But this approach can be applied starting with some other sort of algebra.

[1] The algebra

Let us begin with a Boolean algebra A, with the operations ∩,  ∪, -, relation ⊆, top K, and bottom Λ.  From my choice of symbols you can see that I find it useful to think of it as an algebra of sets.  That will be characteristic of some applications.  But this plays no role for now; it just helps the imagination.

I will use little letters p, q, r, … to stand for elements of A.

I have left open here whether there are other operations on this algebra, such as modal operators.  Application 2 will be to a Boolean algebra with modal operator ==>.

[2] The language: first step in its interpretation

As far as the algebra is concerned, all elements have the same status.  But we can introduce distinctions from outside, by choosing a language that can be interpreted in that algebra.  When we do that each sentence E has a semantic value [[E]], which is an element of A, and we call it the proposition expressed by that sentence.

So let us introduce a language L.  It has atomic sentences, the classical (‘Boolean’) connectives &, v, ~.  It may have a lot more.  The interpretation is such that

[[~E]] = -[[E]] (the Boolean complement)

[[E & D]] = [[E]] ∩ [[D]]

[[E v D]] =  [[E]] ∪ [[D]]

and there will of course be more clauses if the language has more resources for generating complex sentences.

The atomic sentences, together with those three classical connectives, form a sub-language, which I will call Lat.  This is a quantifier-free, modal-operator-free fragment of L.  I tend to think of the members of Lat as the empirical sentences, the language of the data, but again, that is at this point only a mnemonic.

The set of propositions expressed by sentences in Lat I will call A0, that is {[[E]]: E is in Lat}, and it is clearly a Boolean algebra too, a sub-algebra of A.  In general A will be much larger than A0.

[3] The algebra:  filters and ideals

What about truth and falsity?  I will take it that the true sentences in the language together form a theory, that is, a set closed under the language’s consequence relation — which clearly includes the consequence relation of classical sentential logic.  I take it also that this theory is consistent, but do not assume that it must be complete.

The algebraic counterpart of a theory is a filter: a set F of elements of A such that, if p ⊆ q and p is in F then so is q,  and if r, q are both in F then so is (r ∩ q).  A filter is proper  exactly if it does not have  Λ as a member.  That corresponds to consistency.

The filter that consists of the propositions expressed by the members of a consistent theory is a proper filter.  Obviously all filters contain K.

A set G of elements of A is an ideal exactly if: if p ⊆ q and q is in G then so is p, and if r, q are both in G then so is (r ∪ q).  The ideal is proper if K is not in it.  Obviously any ideal contains Λ.

Filter F has as counterpart an ideal G = {-p: p is in F}, where -p is the complement of p in A.  This corresponds to what the theory rules out as false.  
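For readers who like to experiment, here is a small sketch (my own illustration, with an invented example) checking these closure properties for a principal filter and its mirror-image ideal in the powerset algebra of a three-element set:

```python
from itertools import combinations

K = frozenset({1, 2, 3})
algebra = [frozenset(c) for r in range(4) for c in combinations(sorted(K), r)]

# The principal filter generated by q: everything that includes q.
q = frozenset({1})
F = [p for p in algebra if q <= p]

assert all(r in F for p in F for r in algebra if p <= r)   # upward closed
assert all((p & r) in F for p in F for r in F)             # closed under meets
assert frozenset() not in F                                # proper: Lambda not a member
assert K in F                                              # contains the top K

# The corresponding ideal: the complements of the filter's members.
G = [K - p for p in F]
assert all(r in G for p in G for r in algebra if r <= p)   # downward closed
assert all((p | r) in G for p in G for r in G)             # closed under joins
assert frozenset() in G and K not in G                     # proper ideal, contains Lambda
```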

[4]  The language and algebra together: specifying a truth-filter

Now we are ready to talk about assigning truth-values.  Remember that the language L already has an interpretation [[.]] into the algebra of propositions A.  What we need to do next then is to select the propositions that are true, and then assign value T to the sentences that express those propositions.

Well, I will show one way we can do that; but there are many ways.  I would like the ‘empirical sentences’ all to get a truth-value.  In addition there may be a class of sentences that also should get truth-values, for some reason.  They could be selected syntactically (in the way Lat is), or they could be selected as the ones that express a certain sort of proposition.  The latter would be a new way of doing the job, so that is what I will outline.

Step 1 is to specify a proper filter T on A, which will be the set of propositions that we directly specify as true, regardless of whether they belong to A0.  Its corresponding ideal U is then the set of propositions that we directly specify as false.

Step 2  is to specify a filter T0 on A0, as the set of true propositions which are values of ‘empirical sentences’, and indeed we want T0 to be a maximal proper filter on A0.  Then its corresponding ideal U0 on A0 is a maximal proper ideal, and A0 is the union of T0 and U0.  So every proposition in A0  is classified as true or false.

There is one important constraint on this step.  Clearly we do not want any proposition to be selected as true in one step and false in the other step.  So the constraint is:

                        Constraint on Truth Filtering.   T0  does not overlap U.  

It follows then also that U0 does not overlap T.

The final step is this: T* is the smallest filter that contains both T and T0.  We designate T* as the set of true propositions in A.  This is the truth-filter.  Its corresponding ideal U* is the set of false propositions in A.

This is an unusual way of specifying truth conditions, not least because there will in general be propositions that belong neither to T* nor to U*: in general, bivalence fails.

We need to show that T* is a proper filter.  

Lemma. For every proposition p in T* there is a proposition q in T and a proposition r in T0 such that q ∩ r ⊆ p.

It is easiest to prove this via the relation between filters and theories.  Let Z be the least theory that contains theories X and Y:  thus Z  is the set of sentences implied by X ∪ Y.  Implication, in our context, is finitary, so if A is in Z then there is a finite set of sentences belonging to X ∪ Y whose conjunction implies A.

Suppose now that T* is not proper.  Then there is a proposition p such that both p and -p are in T*.  They cannot both be in T nor both in T0.  The Constraint on Truth Filtering implies that if p is in T0 then -p is not in T, so -p must be a proposition that is in neither T nor T0.  Similarly, if p is in T then -p cannot be in T0, so it must be in neither T nor T0.  So we see that either p or -p belongs to neither T nor T0, but must be in the part of T* that is ‘implied’ by meets of elements taken from T and from T0.

By the Lemma there must be propositions q and r in T and T0 respectively such that (q ∩ r) ⊆ p, and also q’ and r’ in T and T0 respectively such that (q’ ∩ r’) ⊆ -p.  But then there is a proposition s = (q ∩ q’) in T and a proposition t = (r ∩ r’) in T0 such that (s ∩ t) ⊆ (p ∩ -p) = Λ.

In that case t ⊆ -s, while t is in T0 and -s belongs to U; since U is an ideal, t belongs to U as well.  And that is not possible, given the Constraint on Truth Filtering.

Therefore T* is a proper filter.
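A small sketch of these steps, with invented choices of T and T0 in a toy powerset algebra, checking that the generated T* comes out proper when the Constraint holds:

```python
from itertools import combinations

K = frozenset({1, 2, 3})
algebra = [frozenset(c) for r in range(4) for c in combinations(sorted(K), r)]

def least_filter(generators, algebra):
    # The least filter containing the generators: upward closure of
    # pairwise meets (pairwise suffices here, since the generating
    # sets are themselves filters, already closed under meets).
    meets = {g1 & g2 for g1 in generators for g2 in generators}
    return [p for p in algebra if any(m <= p for m in meets)]

T  = [p for p in algebra if frozenset({2, 3}) <= p]   # directly specified truths
T0 = [p for p in algebra if frozenset({3}) <= p]      # 'empirical' truths
U  = [K - p for p in T]                               # the ideal corresponding to T

# Constraint on Truth Filtering: T0 does not overlap U.
assert not (set(T0) & set(U))

T_star = least_filter(T + T0, algebra)
assert frozenset() not in T_star   # T* is proper, as the argument above shows
```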

[5] The language, admissible valuations, validity and the consequence relation

Time to look into the logic in language L when the admissible assignments of truth-values are all of this sort!

What we have described informally now is the class of algebraic models of language L.  The sentences E in L have as semantic values propositions [[E]] in A, a Boolean algebra with a designated filter T* and designated ideal U* = {-p: p is in T*}.  An admissible valuation of L is a function v such that for all sentences E of L:

  • v(E) = T if and only if [[E]] is in T*
  • v(E) = F if and only if [[E]] is in U*

This function is not defined on other sentences: those other sentences, if any, do not have a truth-value.

So an admissible valuation is in general a partial function on the set of sentences of L.

Validity

Boolean identities correspond to the theorems of classical sentential logic.  If E is such a theorem then [[E]] = K, which belongs to every filter, and hence E is true.  

This holds for any model of the sort we have described, so all theorems of classical sentential logic are valid.

Deductive consequence

E1, …, En imply F exactly if, in each such model, if [[E1]], …, [[En]] are all in T* then [[F]] is in T*.

In classical sentential logic E1, …, En imply F exactly if (E1 & … & En) ≡ (E1 & … & En & F) is a theorem.  So then ([[E1]] ∩ … ∩ [[En]]) = ([[E1]] ∩ … ∩ [[En]] ∩ [[F]]).

It follows that if  [[E1]], …, [[En]] are all in a given filter then so is [[F]].

Therefore all such classically valid direct inferences (such as Modus Ponens) are valid in L.

Natural deduction rules

Those which involve sub-arguments can be expected to fail.  For example, (E v ~E) is valid, but it is possible that E lacks a truth-value, and so we would expect reasoning by cases (disjunction elimination) to fail.

We’ll see examples below.

 APPLICATION 1:  probability space models

The structure S = <K, F, P> is a probability space exactly if K is a non-empty set, F is a field of subsets of K (including K), and P is a probability function with domain F.  

A field of sets is a Boolean algebra of sets.  So we can proceed as above.

First there is a language LS, and if E is a sentence of LS then [[E]] is a measurable subset of K, that is to say, a set in F, a member of the domain of P.  And as before we have a fragment LSat which is the closure of the set of atomic sentences under the Boolean connectives.  The range of [[.]] restricted to LSat is a subfield — a Boolean subalgebra — F0 of  F.

The set TS = {p in F: P(p) = 1} is a proper filter.  That is so because P(Λ) = 0, P(p) is less than or equal to P(q) if p ⊆ q, and P(p ∩ q) = 1 if and only if P(p) = P(q) = 1.

Similarly, there is a corresponding proper ideal US = {p in F: P(p) = 0}.

Just as above, TS is the beginning, so to speak, of the set of true propositions.  To determine an appropriate set of true propositions in F0 we begin with X = US ∩ F0.  That is a proper ideal as well, within that subalgebra.  Every such proper ideal can be extended (not uniquely) to a maximal proper ideal US0 on F0.  This we choose as the set of false propositions in that subalgebra, and the corresponding maximal filter TS0 on F0 is the set of true propositions there.

And now, to complete the series of steps we are following, we define TS* to be the least filter on F which contains both TS and TS0. The general argument above applies mutatis mutandis to show that TS* is a proper filter — our truth filter in this setting.

Unless LSat is the whole of LS we will now have truth-value gaps:  there will be non-empirical sentences that receive some probability intermediate between 0 and 1, and these are neither true nor false.

As before, there is no doubt that the axiomatic classical sentential logic is sound here.  However there are natural deduction rules which are not admissible.  For example, if something follows from each of P(p) = 1 and P(q) = 1 it may still not follow from P(p ∪ q) = 1: if we are going to toss a coin then Probability(Heads) = 1 entails that the coin is biased, and Probability(Tails) = 1 also entails that the coin is biased, but Probability(Heads or Tails) = 1 is true also if the coin is fair.

APPLICATION 2: algebraic logic of conditionals with probability

This is an example of a probability space model, in which the algebra is a Boolean algebra with a binary modal operator ==>.  It begins with a ‘ready to wear’, off-the-shelf construction, which I’ll describe.  And then I will apply the recipe developed above to give a picture of a language in which conditionals, typically, are true only if they have probability 1, and false only if they have probability 0.

I am referring to the logic CE, which is like Stalnaker’s logic of conditionals, but weaker (van Fraassen 1976; see also my preceding blogs on probabilities of conditionals).

The language has the Boolean connectives plus binary connective –>.  A structure M = <K,F, s> is a model of CE exactly if K is a non-empty set (the worlds), F is a field of subsets of K (the propositions), and s, the selection function, is a function which maps K x F into the subsets of K, with these properties:

  • s(x,A) ⊆ A
  • if x is in A then s(x,A) = {x}
  • s(x, A) has at most one member
  • s(x, A) =  Λ only if A =  Λ

The truth conditions for &, v, ~ are as usual, and for –> it is:

          A –> B is true in world x if and only if s(x,A) ⊆ B

          equally:  [[A –> B]] = {x is in K: s(x, [[A]]) ⊆ [[B]]}

and we can see that there is therefore an operator on F, for which I’ll use the symbol ==>:

          [[A –>B]] =  [[A]] ==> [[B]].

This differs from Stalnaker’s semantics only in not imposing the further restriction on the selection function that it must derive from an ordering.  We may intuitively refer to s(x, A) as the world nearest to x that is in [[A]], but this “nearest” metaphor has no content here.

When this language is thus interpreted in model M, the propositions form a Boolean algebra with operator ==>, which has the properties:

(i)      [p ==> (q ∪ c)] = [(p ==> q) ∪ (p ==> c)]

(ii)     [p==> (q ∩ c)] = [(p ==> q) ∩ (p ==> c)]

(iii)    [p ∩ (p ==> q)] = (p ∩ q)

(iv)    (p ==> p) = K                                        ( “necessity” )

(v)     (p ==> -p) =  Λ unless p =  Λ                 (“impossibility”)

Let us call this a CE algebra.
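These properties can be checked by brute force in a tiny model.  The selection function in the sketch below is one invented example among the many that satisfy the four conditions on s:

```python
from itertools import combinations

K = frozenset({'w1', 'w2', 'w3'})   # a toy set of worlds
F = [frozenset(c) for r in range(4) for c in combinations(sorted(K), r)]

def s(x, A):
    # An invented selection function: if x is in A, select x itself;
    # if A is empty, select nothing; otherwise select the
    # alphabetically first world in A.
    if x in A:
        return frozenset({x})
    if not A:
        return frozenset()
    return frozenset({min(A)})

def arrow(p, q):
    # [[A --> B]] = {x in K: s(x, [[A]]) is a subset of [[B]]}
    return frozenset(x for x in K if s(x, p) <= q)

# Brute-force check of the CE-algebra properties (i)-(v).
for p in F:
    for q in F:
        for c in F:
            assert arrow(p, q | c) == arrow(p, q) | arrow(p, c)   # (i)
            assert arrow(p, q & c) == arrow(p, q) & arrow(p, c)   # (ii)
        assert p & arrow(p, q) == p & q                           # (iii)
    assert arrow(p, p) == K                                       # (iv)
    if p:
        assert arrow(p, K - p) == frozenset()                     # (v)
```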

A probability model for CE is a structure <K, F, s, P> such that <K, F, s> is a model for CE and P is a probability function with domain F such that for all p, q in F

            P(p ==> q) = P(q | p) when defined

This condition is generally called Stalnaker’s Thesis (or more recently, just “the Thesis”).  Stalnaker’s logic of conditionals could not be nontrivially combined with this thesis but CE could.  As it happens, CE has a rich family of probability models.

Thus, if  <K, F, s, P> is a probability model for CE then S = < K, F, ==>, P> is a probability space model in the sense of the previous section, with some extra structure.

Now we can proceed precisely as in the preceding section to define a truth filter T* on the algebra of propositions.  As empirical statements we take the closure of the set of atomic sentences under just the Boolean connectives, that is, the sentences in which there are no occurrences of –>.  The image of this language fragment under the map [[.]] is the relevant, privileged Boolean subalgebra F0 of F, in which every proposition is classified as true or false, as a first step.

In addition the propositions which have probability 1 are true.  And finally, anything implied by true propositions is true — all this understood as coming about as shown in the preceding section. Thus all theorems of CE are valid, and inference by modus ponens is valid.

As to sentences of form (A –> B), they are typically true only if P(B | A) = 1.  I say “typically” because we cannot rule out that the proposition [[A]] ==> [[B]] is a member of F0.  For the model of CE could be a model of a stronger theory, perhaps one that entails (implausibly!) that “if it is lit then it burns” is the meaning of “it is flammable”.  But typically that will not be the case, so typically (A –> B) will be classified as true only if P([[B]] | [[A]]) = 1.

REFERENCES

van Fraassen, B. C. (1976)  “Probabilities of Conditionals”.  Pages 261–308 in W. Harper and C. A. Hooker (eds.), Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science, Volume I.  Dordrecht: Reidel.

Odds are more intuitive than probability (2) Bayes’ Theorem

We are all tested these days, by the medical profession for viruses and bacteria, by the police for alcohol consumption, by the sports team doctor for marihuana intake, and so forth. I’m sure their tests are awfully well designed. But I’ll give an imaginary example with simpler numbers and ratios to show how understanding the tests, and Bayes’ Theorem, is much more intuitive if we think in terms of odds.

I will first give an intuitive statistical argument to show how a probability assessment is updated. Then I will recast this as Bayesians do, first in terms of probabilities and then (see how much simpler!!) in terms of odds.

EXAMPLE. The Y-virus is an STD, with an incidence of 1 in 500 in the college population.  There is a test for the Y-virus, and it is 99% accurate, in the sense that

            The probability that someone without the virus tests positive (a false positive) is 1%.

            The probability that someone with the virus tests negative (a false negative) is 1%.

Student Jones is not very worried, since he considers himself a normal, average student and the incidence is so low.  But he takes the test, and the test result is positive! How likely is it, after seeing this result, that Jones has the Y-virus?

Test yourself: just off the top of your head, do you think the answer is:

 99%, Between 75% and 99%, Between 50% and 75%, Less than 50% ?

STEP 1.  AN INTUITIVE STATISTICAL ARGUMENT

Imagine a total college population of 50,000 students, with the actual incidence of the Y-virus precisely 1/500.  Thus, in this population there are 100 students with the Y-virus. Imagine furthermore that all the students are tested, and the test performs exactly as specified, with 1% false positive and 1% false negatives.

The results will then be: Of the 100 students with the Y-virus exactly 99 test positive, and one tests negative. Of the 49,900 students who do not have the Y-virus, 499 test positive (false positives). Jones belongs to the sub-population which tested positive, which therefore has 598 members.  In this sub-population, there are just 99 which have the Y-virus.  So the probability that Jones has the Y-virus is 99 out of 598, which is approximately 16.555…%, or approximately 1 out of 6.

So, roughly speaking the probability that Jones has the Y-virus, in the light of the positive result, is about 1/6.  That is not as bad as he feared!

But it is certainly true that after he sees the evidence, his probability does get much higher than it was. How much higher?   The probability changed from 0.2% to roughly 16.6%: it was multiplied by a factor of approximately 80.

What were Jones’ odds, and how did they change? To begin, the odds of (virus) : (no virus) were 1 : 499. Afterward they were 99 : 499. How much higher is this? We get a nice whole number: the odds were multiplied by 99, precisely.

It is this multiplier, which changes the old odds to the new odds that is called the Bayes factor. Look for it below!

STEP 2. THE BAYESIAN RECIPE, IN TWO FORMS

In Bayesian terminology, the initial probability or odds are called the prior ones, and those after the evidence is accommodated the posterior ones. To determine the posterior probability, they use Bayes’ Theorem (named after the 18th century Reverend Thomas Bayes).

First version: for probabilities

I will use + for a positive test result, y for having the Y-virus, and P for probability. From the details above we have the following data: P(y) = 0.002, P(~y) = 0.998; P(+|y) = 0.99, P(+|~y) = 0.01.

Bayes’ theorem says:

(*) P(y|+) = P(+|y) times P(y)/P(+).

We can calculate P(+), using another theorem: P(+) = P(y)P(+|y) + P(~y)P(+|~y)

Plugging in the numbers we get P(y|+) = 0.16555… . Which is in accord with our earlier statistical calculation.

Was this intuitive? Oh, it would be if you do it often enough! 🙂
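For anyone who wants to check the arithmetic, the whole calculation fits in a few lines (a sketch of this example only):

```python
# Data from the Y-virus example.
P_y = 0.002               # P(y): prior probability of the virus (1 in 500)
P_not_y = 0.998           # P(~y)
P_pos_given_y = 0.99      # P(+|y): true positive rate
P_pos_given_not_y = 0.01  # P(+|~y): false positive rate

# Total probability of a positive result: P(+) = P(y)P(+|y) + P(~y)P(+|~y).
P_pos = P_y * P_pos_given_y + P_not_y * P_pos_given_not_y

# Bayes' Theorem: P(y|+) = P(+|y) * P(y) / P(+).
P_y_given_pos = P_pos_given_y * P_y / P_pos
print(P_y_given_pos)      # 0.16555..., roughly 1 out of 6
```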

Second version: for odds, if you think in terms of probability

The odds of A to B are the ratio (Probability of A) : (Probability of B). You can also write this with the familiar symbol / for ratio or division; that is the same thing.

When B is ~ A, we just call that the odds on A. If it is just as likely to snow as not to snow, then the probability of snow is 1/2, while the odds on its snowing are 1 : 1. (or 50/50 as people like to say). Jones’ prior odds of having the Y virus are 1 : 499.

How do my prior odds change to my posterior odds, when I get evidence like a positive test result? The odds formulation of Bayes’ Theorem is

Posterior odds = prior odds times the Bayes factor

The Bayes factor is actually an odds ratio itself, it is the odds of getting a positive test result:

( test result +, given that you have Y) : (test result +, given that you don’t have Y).

With P* for the new probability and P for the old probability, that means this:

P*(y|+) : P*(~y|+) = [P(y) : P(~y)] x [P(+|y) : P(+|~y)]

Here it is quite easy to see the numbers to fill in. The odds on having Y are 1 to 499; that is [P(y) : P(~y)]. And the Bayes factor is the odds of a true positive to a false positive, which is the ratio of 0.99 to 0.01. So we arrive at:

The posterior odds P*(y|+) : P*(~y|+) = (1/499) x (99/1) = 99/499.

News for Jones: his odds of having the virus have been multiplied by 99. The Bayes factor!

Third version: if you think in terms of odds in the first place

Then you don’t need the formulas, you will have a simple visual calculation. Just remember what I wrote in the earlier post about how to conditionalize an odds vector: replace the ruled out parts’ numbers by zeroes.

The prior odds of having virus Y are 1 : 499. The odds of a correct test result are 99 : 1. So before we have the test result, the odds vector for the relevant partition {y & +, y & −, ~y & +, ~y & −} looks like this: 99 : 1 : 499 : 49,401.

Now the positive result comes in; we conditionalize on this by replacing the numbers for what did not happen by zeroes: 99 : 0 : 499 : 0.

So the odds changed from 1 : 499 to 99 : 499. The prior odds were multiplied by 99 (the Bayes factor), as seen in this simple, intuitive change of the odds vector.
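The same recipe in a few lines of code (a sketch, with the cell labels invented for readability):

```python
# Odds vector over the four-cell partition, in proportion to the
# imagined population counts; any common rescaling gives the same odds.
cells = ['y & +', 'y & -', '~y & +', '~y & -']
odds = [99, 1, 499, 49401]

# The positive result comes in: zero out the ruled-out cells.
after = [o if '+' in c else 0 for c, o in zip(cells, odds)]
print(after)   # [99, 0, 499, 0]

# Posterior odds on y, read off from the surviving entries: 99 : 499.
print(after[0], ':', after[2])
```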

Note. It is a peculiarity of my example that the odds of getting a correct result are the same as the odds of getting a positive result. Exercise: change the example so that the test still has only 1% false negatives but, say, 10% false positives.

Probability statements (3) Form

An elementary statement assigning probability was defined to be one whose semantic value (set of probability measures that satisfy it) is convex.

Here I want to show that there is a wide range of elementary statements beyond the examples we had, including statements assigning odds and conditional probabilities.

Familiar sorts of statements include, besides the ones examined in the previous posts:

  • ODDS

“It is twice as likely as not to snow today”, with logical form P(A) = 2P(~A),

“The odds of A to B are 3 to 1”, with logical form P(A): P(B) = 3 : 1,

  • CONDITIONAL PROBABILITY

“The probability of A, given B, is 2/3”, with logical form P(A/B) = 2/3, or equivalently, P(A∩B):P(B) = 2: 3

  • CORRELATION

“Rain is more likely in the winter than at other times” , with logical form P(A/B) > P(A/~B).

A good tactic will be to look for a general form in which these can be expressed, and then to see when statements of that general form are elementary.

The concept of expectation

To introduce the more general form we can look to the form of a judgement about expectation (terms vary: expected value, expectation value) as it is understood in probability theory. So I’ll begin by introducing this informally, then give the precise definition, and after that, examine the above examples in those terms.

A good example of an expectation value is the announcement sometimes seen by casinos, “98% payback!”. Does that mean that, if you gamble there, the probability is 98% that you will end up with your money back? Certainly not: this is about a sort of average over huge wins, huge losses, and myriads of small losses. Even if we don’t go to casinos, we are always seeing expectation values announced. For example, the weather forecast gives 0.1 inch of precipitation for Seattle in the next 24 hours. That is not a prediction that it will be precisely, or even approximately 0.1 inch. Rather it is a weighted average over the chances of various amounts of quantity of precipitation within that period. In both these examples what is announced is the expected value of the quantity in question.

Suppose a quantity q has values a(1), a(2), … in certain possible scenarios, and the probabilities of those scenarios are p(1), p(2), …; then

the expectation value of q, for this probability function, is the sum

(a(1)p(1) + a(2)p(2) + …. )

In statistics, the term used is not “quantity” but “random variable”. A poor choice of terminology, mystifying for the uninitiated, but well, there you go.

Statements of expected value

Let’s add to our statement forms Exp(q) = x. A probability function or measure p satisfies Exp(q) = x if and only if the expectation value of q, for p, equals x.

This is the form of an elementary statement, for the semantic value of such a statement is a convex set of probability measures. For imagine that (a(1)p(1) + a(2)p(2) + …) = x and also (a(1)p'(1) + a(2)p'(2) + …) = x. If we then evaluate the expected value for the mixture bp + (1-b)p', each value a(i) gets coupled with the number bp(i) + (1-b)p'(i). The resulting sum is therefore b times the first sum plus (1-b) times the second: bx + (1-b)x, in other words, equal to x.

This argument generalizes very easily to the expectation value being in some interval of numbers. So we can write Exp(q) ε I, for any interval I, and this will also be an elementary statement.
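A quick numerical sketch of this argument, with invented values and measures:

```python
# Three scenarios; a quantity q with values a(1), a(2), a(3).
values = [0.0, 1.0, 5.0]
p  = [0.5, 0.3, 0.2]   # one measure: Exp(q) = 0.3 + 1.0 = 1.3
pp = [0.1, 0.8, 0.1]   # another measure, also with Exp(q) = 0.8 + 0.5 = 1.3

def exp_value(probs):
    return sum(a * w for a, w in zip(values, probs))

# Any mixture of p and pp again satisfies Exp(q) = 1.3.
for b in [0.0, 0.25, 0.5, 0.75, 1.0]:
    mix = [b * x + (1 - b) * y for x, y in zip(p, pp)]
    assert abs(exp_value(mix) - 1.3) < 1e-12
```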

I will put the precise account in the Appendix below, but this is enough to show how the above common probability statements can all be put in terms of expectation.

Translating probability statements into statements of expected value

  • Example 1: our paradigm example P(A) = x.

There are two relevant possible ways things can be, A and ~A. Now, we can define a function 1A, which takes value 1 in the first way things can be, and value 0 in the other way they can be. (1A is called the indicator quantity for A.) So, for any probability function p, the expectation value of 1A equals [1.p(A) + 0.p(~A)], and that is just p(A).

Thus we can translate (P(A) = x) into (Exp(1A) = x): these statements are satisfied by the same probability functions, they have the same semantic content.

In many cases it is easy to rewrite the elementary statement so that it obviously has the form of equating an expectation value to 0.

  • Example 2:”It is twice as likely as not that A”, which would have the form P(A) = 2P(~A)

Rewrite this as 1.P(A) – 2.P(~A) = 0. This is the probability-weighted sum of 1 and -2, corresponding to the two possibilities A and ~A. So define r to be the quantity which takes value 1 on A and value -2 on ~A. Then for any probability function p, the expected value of r is the sum 1.p(A) + (-2).p(~A). Therefore our example statement is equivalent to Exp(r) = 0.

  • Example 3: “The odds of A to B are m to 1″, which would have the form P(A):P(B) = m:1, or equivalently, P(A) = m.P(B)

Now we require a bit of ingenuity, because A and B — unlike A and ~A — may overlap.

To spell this out we should think of the four cells in the partition {A - B, A ∩ B, B - A, W - B - A}, where W is the whole space. Suppose that measure p assigns to these cells the probabilities x, y, z, u respectively.

Thus p assigns x + y to A and assigns y + z to B. That is, p(A) = x + y, p(B) = y + z. The equation that p must satisfy can now be rewritten, till it looks like the sort of sum we see in an expectation value:

p(A) = m.p(B),

x +y = m(y + z),

x + (1-m)y – mz = 0

This last line shows us how to define the relevant quantity, call it s. The four cells of the partition are the four ways things could possibly be. Quantity s is defined to have value 1 on (A-B), value (1-m) on (A ∩ B), value -m on (B-A), and value 0 elsewhere. Therefore

Exp(s, p) = 1p(A-B) + (1-m)p(A∩B) + (-m)p(B-A) + 0p(W - B - A)

= x + (1-m)y + (-m)z + 0

so the statement P(A):P(B) = m:1 is equivalent to Exp(s) = 0, for this random variable s.

  • Example 4: “The probability of A, given B, is m”, with logical form P(A/B) = m.

This is similar to the preceding, and is solved in the same way. P(A/B) is defined as P(A∩B) : P(B), and this ratio is here asserted to be m. So for a probability measure to satisfy this it must meet the condition

p(A∩B) : p(B) = m

p(A∩B) = mp(B)

p(A∩B) – mp(B) = 0

p(A ∩ B) – m[p(A ∩ B) +p(B-A)] = 0

p(A ∩ B) – mp(A∩B) – mp(B-A)= 0

(1-m)p(A ∩ B) – mp(B-A)= 0

The deduction is quite similar to the preceding, noting that B is the union of (A∩ B) and (B -A), and we can define the random variable t:

t takes value (1-m) on (A ∩ B), value (-m) on (B - A), and value 0 elsewhere.

Then we see that P(A/B) = m is equivalent to Exp(t, P) = 0.
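A quick numerical check of this equivalence, with invented cell probabilities:

```python
m = 2 / 3   # the asserted value of P(A/B)

def exp_t(x, y, z, u):
    # x, y, z, u: probabilities of the cells A-B, A∩B, B-A, and the rest.
    # t takes value (1-m) on A∩B, value -m on B-A, and 0 elsewhere.
    return (1 - m) * y - m * z

# A measure with P(A|B) = 0.4 / (0.4 + 0.2) = 2/3: Exp(t) = 0.
assert abs(exp_t(0.1, 0.4, 0.2, 0.3)) < 1e-12

# A measure with P(A|B) = 0.3 / (0.3 + 0.3) = 1/2, not 2/3: Exp(t) is not 0.
assert abs(exp_t(0.1, 0.3, 0.3, 0.3)) > 1e-3
```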

Conclusion

In these results we only needed recourse to statements of the form Exp(r) = x; in the Appendix I show that not only these but also those of the more general form Exp(r) ε I, where I is a convex set of real numbers, are elementary statements.

I am not taking up the last of the series of examples, the example of positive correlation, because however symbolized, it is definitely not an elementary statement. Correlation is a non-linear relation, and convexity is not preserved.

But we have seen that the examples of the previous post, as well as odds statements and conditional probability statements, are elementary statements. For they are equivalent to statements that say that certain quantities have expected value 0, and such statements (we saw along the way) are elementary.

Open question at this point: is it also possible to formulate statements which are elementary, in the defined sense, but not equivalent to statements of expected value?

APPENDIX

Here I will spell out the above in the precise way introduced in the first post, in terms of frames and model structures.

Let K be the frame <W, F>, and M the model structure <W,F, I>.

A random variable r on K is a measurable function that has a numerical value at each world in W, for example, the height of the highest mountain or the price of wheat. To say that r is measurable means that, for each numerical value, the set of worlds at which r has that value is a legitimate proposition, that is, it is a member of F.

That r has value m on proposition A in a given model structure is defined to mean that r has the same value m at each world in proposition A.

I will here restrict the discussion to what I will call simple random variables: ones with finite range. If r has a finite range then its set of values V(r) = {a, b, c, …, k} corresponds to a partition of W, the cells being the propositions C(r, j) = {w: r(w) = j}, with j = a, b, c, …, k. Call this partition the characteristic partition of r.

If p is a probability measure in P(M), let the probability of C(r, j) be p(j). Then the expectation (or, expected value) of r relative to p is defined to be:

Exp(r, p) = ap(a) + bp(b) + … + kp(k)

the general formula being

Exp(r, p) = Σ{x p(C(r, x)): x in V(r)}

Now we can introduce statements that assign expectation values by specifying their semantic value, that is the set of probability measures that satisfy them.

Exp(r, P) = y is satisfied by p in P(M) if and only if Exp(r, p) = y

Exp(r, P) ε [a, b] is satisfied by p in P(M) if and only if Exp(r, p) ε [a, b]

and similarly for open and half-open intervals.

Theorem. The statement Exp(r, P) = y, and more generally the statement Exp(r, P) ε I, where I is a convex set of real numbers, are elementary statements.

It suffices to prove the second part, since {y} is itself a convex set. Suppose that x = Exp(r, p) and y = Exp(r, p’) both lie in the convex set I. For the convex combination p” = mp + (1-m)p’, the expectation Exp(r, p”) equals m Exp(r, p) + (1-m) Exp(r, p’) = mx + (1-m)y, a convex combination of x and y. So Exp(r, p”) is also in I.

This is a bit abstract, so a simple example:

ap(1) + bp(2) = x and ap'(1) + bp'(2) = y, so

ap”(1) + bp”(2) = map(1) + mbp(2) +(1-m)ap'(1) + (1-m)bp'(2)

= a[mp(1) + (1-m)p'(1)] + b[mp(2) + (1-m)p'(2)]

so a, b are being multiplied by a number that is somewhere between the numbers that multiplied them in the initial line — their sums when that is done must then also be between the two original sums, hence lie in the same convex set of real numbers.

Probability statements (2) Compounds

There is never any difficulty in adding truth-functional connectives to a set of statements, but doing so doesn’t give us any new insight into their character or structure. The better approach is to ask: are there, in this very class of statements itself, already ones that count as conjunctions, disjunctions, and the like? (What is the ‘internal’ logic of this sort of discourse?)

Conjunction

A simple example is (P(A) > x & P(A) < y), which can also be written as: P(A) ε (x, y). But that is the special case, of two statements about the probability of a single proposition. What about such combinations when different propositions are involved?

Theorem: If C is a family of convex sets (finite, countable or uncountable), then the intersection of the members of C is a convex set.

Proof: The intersection ∩C is trivially convex if it is empty or has just one member. If ∩C has more than one member, consider any two of its members p, p’: these belong to each member of C, and hence all their convex combinations also belong to each member of C, and hence to ∩C.

This Theorem shows that it is fine to introduce the usual sort of conjunction into the language, for then the set of measures that satisfy both of two elementary statements will also be convex. So if Q and R are elementary statements then (Q & R) is the statement such that |Q & R| = |Q| ∩ |R|, and this is again an elementary statement.

Disjunction

The same ease is not to be found for disjunction. The union of two convex sets is not in general convex. Anyway, we already know that disjunction does not behave like a truth-function when it comes to probability. In fact, it does not make sense to ask whether p satisfies (P(A) = r or P(A) = s) , as opposed to asking whether it satisfies either (P(A) = r) or satisfies (P(A) = s). At most we can ask whether p satisfies what the two ‘have in common’.

Can we find an operation on elementary statements that has the main characteristics we require of disjunction, in general?

Requirement. The general concept of disjunction of two statements, Q, R, in any kind of language, requires that it must be the logically strongest statement that is implied by both Q and R, and thus itself implies all that is implied by both Q and R.

In the first post I defined entailment for elementary statements. What we should look at therefore is this situation. Suppose that S is a statement such that

Q entails S and R entails S

What is the relation that |S| bears to |Q| and |R|?

Theorem. If Q, R, S are elementary statements, Q entails S, and R entails S, then all convex combinations of members p of |Q| and p’ of |R| belong to |S|.

For note that if Q and R entail S then both |Q| and |R| are part of |S|. Therefore, if p is in |Q| and p’ is in |R| then both belong to |S|. Since |S| is convex it will also contain all the convex combinations of p and p’.

The smallest set that fits the Requirement above is therefore the convex hull of the union |Q| ⋃ |R|, that is the set of all convex combinations of members of those two sets. That is the smallest convex set which contains both. Since that is more than just the union, it does not correspond to a truth-functional disjunction. So let’s introduce a special symbol:

Definition. The join of convex sets X and Y is (X ⊕ Y) = {ap +(1-a)p’: a ε [0,1], p in X, p’ in Y}.

That is precisely the convex hull of X ⋃Y. Following upon this we can introduce a statement connective of ‘disjunction’ to the language, which will combine elementary statements into other elementary statements. Without expecting any confusion from this, I will use ⊕ equally for the statement connective and for the operation on convex sets:

| Q ⊕ R| = |Q| ⊕ |R|.
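Here is a small sketch of the join in action, with each measure reduced to the single number it assigns to a fixed proposition A (an invented simplification, but enough to show the effect):

```python
# A measure is represented here just by the value it gives to A,
# so a semantic value becomes a set of numbers in [0, 1].
def join(X, Y, steps=4):
    # Sample the convex combinations a*p + (1-a)*p' for p in X, p' in Y.
    return sorted({round((i / steps) * p + (1 - i / steps) * q, 10)
                   for i in range(steps + 1) for p in X for q in Y})

# |P(A) = 0.25| join |P(A) = 0.75|: everything in between appears.
print(join([0.25], [0.75]))   # [0.25, 0.375, 0.5, 0.625, 0.75]
```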

Negation

Really, there is no negation. In a specific case we can make up the negation, but it will typically not be an elementary statement. For example what would be the negation of P(A) = 0.5?

Its contraries are P(A) < 0.5 and P(A) > 0.5. Each of these is an elementary statement. But there is no truth-functional ‘or’ that would combine them into the contradictory of P(A) = 0.5, at least not one that would produce an elementary statement.

The sort of disjunction we do have produces something, but not the contradictory of P(A) = 0.5. In fact,

(P(A) < 0.5) ⊕ (P(A) > 0.5)

is satisfied by the 50/50 convex combination p” of the measures p and p’ which assign 0.25 and 0.75 to A respectively, and p”(A) is 0.5. So we have arrived at a tautology: this disjunction has the same semantic value as |P(A) ε [0,1]|.

The underlying reason is of course that there is no largest convex subset of [0,1] disjoint from {0.5}. The two maximal convex sets disjoint from {0.5} are [0, 0.5) and (0.5, 1]; neither contains the other, so there is no largest.

New Question: are there other forms that elementary statements can have? What about other sorts of probability talk, such as talk about odds or conditional probabilities?

That will be the topic of the next post.

Probability statements (1) Elements

A word to begin: the literature has a number of sophisticated approaches to the logic of statements about probability (notably Fagin, Halpern, and Megiddo, Information and Computation 87 (1990): 78–128). What I want to do here is much less ambitious and more elementary. But I hope that by keeping our focus very narrow, on just what I will call elementary statements and their ‘internal’ logic (rather than their Boolean combinations) we can get some interesting insights into probabilistic thinking.

That the probability of rain today is 60% is surely the paradigm example of a proposition that assigns a probability. Its form, using the usual symbols, is P(A) = x, and I will say that this statement is satisfied by any probability measure which assigns value x to proposition (or event) A.

I will take P(A) = x as my paradigm example of what should count as an elementary statement. Then I hope to arrive at a useful concept of elementary statements in general, with a clear notion of when they are satisfied by a probability measure. We’ll soon see that P(A) < x, P(A) ≥ x, and the like, are other good candidates for this status. Given all that it will be possible to define the semantic correlate of logical consequence:

if Q, R are elementary statements then Q entails R exactly if every probability measure which satisfies Q also satisfies R.

The task will be first to settle on a useful concept of elementary statement, and then to explore the variety of such statements. These will surely include ones like P(A) ≥ 0.5, and other inequalities, and perhaps some combinations of those.

As my main clue I will take the concept of convexity. That is a notion that appears in many places, wherever it is useful to use the word “between”. For example, a geometric figure is convex if, for any two points that lie inside it, the points between those lie inside as well. So a solid cube or sphere is convex, but a squiggly worm is not.

A set of numbers is convex if for any two numbers in it, the numbers between them belong to it as well, in the following precise sense:

the number x is a convex combination of numbers y and z if and only if there is a number a ε [0,1] such that x = ay + (1-a)z

So an interval is convex (contains all convex combinations of its own members), but the set of prime numbers is not.

Of course the two uses of “between” don’t stand for the same relationship — in each case, to talk about convexity we have to fix what will correspond to familiar cases of betweenness.

Here is the notion for probabilities. If P(A) = x is satisfied by both p and p’ then it is also satisfied by any mixture (convex combination) of p and p’, and that means by any function defined by an equation of this form:

p” = ap + (1-a)p’, with a ε [0,1]

Spelled out that means that

p”(A) = ap(A) + (1-a)p'(A), for all propositions A in the relevant domain.

The really important point to remember: as we can readily verify, if p” is a convex combination of p and p’ then p”(A) is always a number between p(A) and p'(A), inclusive.
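This betweenness claim is easy to check numerically. Here is a quick sketch (my own, with made-up numbers) that mixes two probability assignments on a four-world space:

```python
# A quick numerical check (my own, with made-up numbers): a mixture of two
# probability measures assigns, to each proposition, a value between the two
# originals, inclusive.
worlds = ['w1', 'w2', 'w3', 'w4']
p1 = {'w1': 0.1, 'w2': 0.2, 'w3': 0.3, 'w4': 0.4}
p2 = {'w1': 0.4, 'w2': 0.3, 'w3': 0.2, 'w4': 0.1}

def mix(a, p, q):
    """The convex combination a*p + (1-a)*q, defined worldwise."""
    return {w: a * p[w] + (1 - a) * q[w] for w in worlds}

def prob(m, event):
    """The probability of an event (a set of worlds) under measure m."""
    return sum(m[w] for w in event)

A = {'w1', 'w2'}                      # an arbitrary proposition
lo, hi = sorted((prob(p1, A), prob(p2, A)))
for a in (0.0, 0.25, 0.5, 0.75, 1.0):
    value = prob(mix(a, p1, p2), A)
    assert lo <= value <= hi          # betweenness, inclusive
    print(a, round(value, 3))
```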

The precise definition capturing the relevant notion of ‘between’ has the same form for probability functions as it has for numbers. And similarly, a set of probability measures is convex if and only if it contains all the mixtures of its members.

Still informally (before we put it all in terms of models and model structures), let us refer to the set of probability measures that satisfy a statement Q as |Q|, and call it the semantic value of Q. Then we note that |P(A) = x| is convex. Trivially so, because the only number between x and x, inclusive, is x itself. But |P(A) < 0.5| is also convex, for if two numbers are less than 0.5 then so is every number that lies between them. Similarly for |P(A) ≥ 0.5| and the like.

This feature, that the semantic value is a convex set, I will choose as the defining mark of a useful concept of elementary statement.

Of course there is a connection between convexity for the numbers assigned as probabilities and convexity for sets of probability measures. Note that each of the examples can be written as P(A) ε I, for I an (open, closed, or half-open) interval. Thus P(A) < 0.5 is the same as P(A) ε [0, 0.5).

Theorem. If I is any set of numbers then |P(A) ε I| is convex if I is convex.

Suppose first that I is convex and that p, p’ satisfy the statement P(A) ε I. Then p(A) and p'(A) are in I, and since I is convex, any number between them is in I also. But if p” is a convex combination of p and p’ then p”(A) is a number between those two, so is also in I. Hence p” satisfies P(A) ε I.

Theorem. If |P(A) ε I| is convex then there is a convex subset J of I such that |P(A) ε I| = |P(A) ε J|.

Suppose that |P(A) ε I| is empty. The empty set Λ is convex by definition, definitely a subset of I, and clearly |P(A) ε I| = |P(A) ε Λ|.

If |P(A) ε I| is not empty, let J be the set of numbers x in I such that p(A) = x for some p that satisfies P(A) ε I. Suppose |P(A) ε I| is convex, hence contains all mixtures of its members. Then J contains all convex combinations of its members, since a mixture of two satisfying measures assigns to A the corresponding convex combination of their values for A. So J is convex, and clearly |P(A) ε I| = |P(A) ε J|.
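A quick numerical illustration (my own) of why the value set matters: with a non-convex I the semantic value fails to be convex, which is exactly what the second theorem’s subset J repairs. Since only the value p(A) matters here, the sketch tracks just that number:

```python
# A small sketch (mine) of the theorems' content: whether the satisfying
# measures are closed under mixing tracks whether the value set I is convex.
# Only the single number p(A) matters here, so that is all we track.

def satisfies(value, I):
    """Would a measure assigning this value to A satisfy P(A) ε I?"""
    return value in I if isinstance(I, set) else I[0] <= value <= I[1]

# Non-convex I = {0.25, 0.75}: measures assigning 0.25 and 0.75 satisfy
# P(A) ε I, but their 50/50 mixture assigns 0.5 and does not.
I = {0.25, 0.75}
print(satisfies(0.25, I), satisfies(0.75, I), satisfies(0.5, I))  # True True False

# Convex J = [0.25, 0.75]: every mixture of satisfying values satisfies.
J = (0.25, 0.75)
print(all(satisfies(a * 0.25 + (1 - a) * 0.75, J) for a in (0.1, 0.5, 0.9)))  # True
```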

Conclusion: We have a general notion of elementary statement, namely one whose semantic value is a convex set of probability measures. And we have one general form of statements which are elementary, namely P(A) ε I, with I a convex set of numbers (more specifically, a convex subset of [0,1]).

Question: are there other forms that elementary statements can have? What about combinations, like conjunction, disjunction, negation?

This will be the topic of the next post. In the Appendix below I will make things precise, as we have them so far. Mostly, though, I will proceed in the informal way of the above.

APPENDIX: making it precise

Probability is a modality, so it seems apt to set this up in the same way as we do for normal modal logics, in so far as that is possible.

Syntax: a set of proposition terms A, B, C, … The sentences are of the forms P(A) R x, where R is any of the relations of equality or inequality (=, <, ≤, >, ≥), and P(A) ε I, where I is a set of numbers.

Semantics: A frame (‘sample space’) is a couple K = <W, F>, where W is a non-empty set and F a field of subsets of W, called the propositions. We can think of the members of W as possible worlds or as events of some sort. The family of probability measures with domain F will be called P(K).

A model structure is a triple M = <W, F, I> where <W, F> is a frame and I is an interpretation, that is, I assigns members of F to the proposition terms. The set of probability measures on F will still be called P(M); since it depends only on the frame, P(M) = P(K). A member p of P(M) satisfies P(A) R x relative to I if and only if p(A) bears R to x.

NOTE: I am using the capital letters from early in the alphabet equally for proposition terms and for the propositions for which they stand. If we needed to be more precise, the last phrase in the preceding paragraph would be “if and only if p(I(A)) bears R to x”.

Theorem. If R is any relation of equality or inequality, the set of probability measures in P(M) that satisfy P(A) R x is convex. Similarly for P(A) ε I, provided I is a convex set of numbers: see the theorems above for details.

Notation. If Q is a statement and M is a model structure then the set of elements of P(M) that satisfy Q will be called |M, Q|. In a context where a single model structure M is under discussion this will be abbreviated to |Q|.

Definition. A statement Q is elementary if and only if |M, Q| is convex, for every model structure M.

All the statements in our syntax so far are therefore elementary, by our definition.
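For readers who like to see definitions executed, here is a minimal rendering of this appendix in Python (entirely my own sketch; the names and the toy numbers are mine). It sets up a small frame with an interpretation, defines satisfaction, and spot-checks the convexity theorem on one pair of measures:

```python
import operator

# A minimal executable rendering of this appendix (my own sketch; the names
# and toy numbers are mine).  The frame: a set of worlds W, with the full
# powerset implicitly serving as the field F of propositions.
W = {1, 2, 3, 4}
INTERP = {'A': frozenset({1, 2})}   # the interpretation I of proposition terms

RELATIONS = {'=': operator.eq, '<': operator.lt, '<=': operator.le,
             '>': operator.gt, '>=': operator.ge}

def prob(p, event):
    """p extended to propositions: the sum of the weights of their worlds."""
    return sum(p[w] for w in event)

def satisfies(p, term, rel, x):
    """p satisfies P(term) R x iff p(I(term)) bears R to x."""
    return RELATIONS[rel](prob(p, INTERP[term]), x)

def mix(a, p, q):
    """The convex combination a*p + (1-a)*q."""
    return {w: a * p[w] + (1 - a) * q[w] for w in W}

# Spot-check the theorem: |P(A) >= 0.5| is closed under mixing.
p = {1: 0.3, 2: 0.3, 3: 0.2, 4: 0.2}     # p(A) = 0.6
q = {1: 0.5, 2: 0.0, 3: 0.25, 4: 0.25}   # q(A) = 0.5
assert satisfies(p, 'A', '>=', 0.5) and satisfies(q, 'A', '>=', 0.5)
assert all(satisfies(mix(a, p, q), 'A', '>=', 0.5) for a in (0.2, 0.5, 0.8))
print("convexity spot-check passed")
```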

What is Bayesian orgulity? (1)

Orgulity is the opposite of humility. Not being a native speaker, I had to open a dictionary when I saw Gordon Belot’s paper “Bayesian Orgulity” (Philosophy of Science 2013).

This putative orgulity concerns two sorts of theorems about subjective probability, frequencies, and calibration. Informally put, the first sort shows that a Bayesian agent will, indeed must, be sure that his probabilities are the right ones, and that statistics would bear that out in the long run. “Sure” means here that in his self-assessment, he will give zero probability to the contrary. Equally informally put, the other sort of theorem shows that Bayesian agents will, in an overwhelming majority of possible cases, be wrong in just that respect.

All the makings of a true paradox!

Need we take it as a paradox? It could just stand as an indictment of Bayesian epistemology. The main arguments concern an orthodox Bayesian agent with a numerically precise prior probability function, whose sole means of updating is conditionalization on data coming in with certainty (like the voice of an angel). So there is much to complain about here already.

But the arguments are in some respects very general and would seem to indict much more liberal forms of probabilism as well.

I want to offer, at the end of the discussion, a resolution of the paradox, based on an idea that I know already encounters a lot of resistance.

I’ve heard that early missionaries in Polynesia faced first of all the onerous task of convicting the natives of sin, before they could preach salvation. So, first of all, I’ll make a good effort to show that we are not dealing with a narrow technical issue here, but with a paradox eminently worth taking seriously.

My plan for the posts to follow:

First, how is subjective probability related to actual frequencies? (How does a probabilist agent — whose opinion is represented by a subjective probability function — reply when asked about the relative frequencies of actual occurrences ‘in the world’?)

Second, what about self-assessment by such an agent? (If asked whether, or to what extent, his or her own opinions concerning the future are well-calibrated, how must s/he answer?)

Third, how does Belot present the Bayesians’ orgulity, and how badly do they fare given his results? (Spoiler: Belot takes the wind out of their sails if they seek refuge in the leeway between zero probability and impossibility.)

Fourth, and most importantly, are there different ways to read the results? (Must we take them as an indictment of probabilism, or of the idea of subjective probability, or can we, on the contrary, offer a narrative on which everything makes sense?)

Glymour, bootstrapping, and the puzzle about theory confirmation

In his early papers Clark Glymour mounted a devastating attack on the hypothetico-deductive method and associated ideas about confirmation. Instead he offered his account of relevant evidence, and developed it into what he called his bootstrapping method.

Baron von Münchhausen pulled himself up by his bootstraps — a theory obtains evidential support from evidence via calculations within the theory itself, drawing on parts of that very theory. This understanding of the role of evidence in scientific practice was a further development of Duhem’s insight about the role of auxiliary hypotheses and of Weyl’s insistence that measurement results are obtained via theoretical calculations, based on principles from the very theory that is at issue.

Glymour’s account involves only deductive implication relations. But within these limits it arrives at the conclusion I listed in the previous blog post on a puzzle about theory confirmation. For an important result concerning the logic of relevant evidence, developed in Glymour’s book, is this:

If T implies T’ and E is consistent with T, and E provides [weakly, strongly] relevant evidence for consequence A of T’ relative to T’, then E also provides this for A relative to T.

Here T’ may simply be the initial postulates that introduced the theory. In some of the examples, T’ may have no relevant evidence at all, or may even not be testable in and by itself. A whole theory may be better tested than a given subtheory.

This should have been widely noted. It upends entirely the popular, traditional impression that ‘the greater the evidential support the higher the probability’! For the probability of the larger theory T cannot be higher than that of its part T’, while T can at the same time have much larger evidential support.

In 2006 Igor Douven and Wouter Meijs published a paper in Synthese that introduces probability relations into Glymour’s account (“Bootstrap Confirmation Made Quantitative”). In view of the above, and of the puzzle for confirmation in my previous blog, it made sense to ask whether a similar result could be proved for this version.

Recall, the puzzle was this.

A theory may start with a flamboyant first postulate, which is typically, if just taken by itself, not even testable.  Let’s call it A.  Then new postulates are added as the theory is developed, and taken altogether they make it possible to design an experiment to test the theory, with positive empirical result, call it E.

Now, at the outset, the prior probability would naturally have A and E mutually irrelevant, since any connection between them would emerge only from the combination of A with other postulates introduced later on.  So for prior probability P, P(A | E) = P(A).

What we found was that in this case, even when the probability of the entire theory increases when evidence E is assimilated, the probability of A does not change. And similarly when the evidence is information held with less than certainty, so that Jeffrey conditionalization is applied.

So how does this result fare on Douven and Meijs’ account? Here is their definition:

(Probabilistic Bootstrap Confirmation) Evidence E probabilistically bootstrap confirms theory T = {H1, . . . , Hn} precisely if p(T & E) > 0 and for each Hi in T it holds that

1. there is a part T’ of T such that Hi is not in T’ and p(Hi | T’ & E) > p(Hi | T’); and

2. there is no part T” of T such that p(Hi | T” & E) < p(Hi | T”).
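The definition is mechanical enough to check by brute force on a small example. Here is a sketch (my own rendering, not Douven and Meijs’s code; I read a ‘part’ T’ of T as any subset of its hypotheses, the empty one included, and I pick toy numbers on which every conditioning event has positive probability):

```python
from itertools import chain, combinations, product

# Worlds are truth-value assignments to (H1, H2, E), with prior weights.
atoms = list(product([True, False], repeat=3))
weight = dict(zip(atoms, [0.20, 0.05, 0.05, 0.20, 0.05, 0.10, 0.15, 0.20]))

def pr(pred):
    return sum(w for a, w in weight.items() if pred(a))

def cond(pred, given):
    """p(pred | given); the toy weights keep every denominator positive."""
    return pr(lambda a: pred(a) and given(a)) / pr(given)

H = {1: lambda a: a[0], 2: lambda a: a[1]}      # the theory T = {H1, H2}
E = lambda a: a[2]                              # the evidence

def conj(hs):
    """The conjunction of a set of hypotheses (empty set: the whole space)."""
    return lambda a: all(h(a) for h in hs)

def parts(excluding=None):
    """All subsets of T's hypotheses, optionally excluding one of them."""
    hs = [h for i, h in H.items() if i != excluding]
    return chain.from_iterable(combinations(hs, r) for r in range(len(hs) + 1))

def and_E(t):
    return lambda a: conj(t)(a) and E(a)

def bootstrap_confirms():
    if pr(and_E(tuple(H.values()))) == 0:       # the requirement p(T & E) > 0
        return False
    for i, hi in H.items():
        clause1 = any(cond(hi, and_E(t)) > cond(hi, conj(t))
                      for t in parts(excluding=i))
        clause2 = all(cond(hi, and_E(t)) >= cond(hi, conj(t))
                      for t in parts())
        if not (clause1 and clause2):
            return False
    return True

print(bootstrap_confirms())                     # True on these toy numbers
```

On these toy numbers the test comes out positive; readers who construe ‘part’ more strictly can tighten the parts function accordingly.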

The following diagram now shows what happens on this account when the initial postulate and the eventual evidence are mutually probabilistically independent.

So we see here again that in this case, the probability of the initial postulate is not changed. And although the probability of the theory as a whole does increase, it just goes from small to less small, for it can never exceed the probability of its initial postulate.

Igor Douven and I had a very interesting correspondence about this. Douven immediately produced an example in which the probability of the initial postulate decreased. In that example, of course, the prior probability relationship between the initial postulate and the eventual evidence is not one of mutual independence. But in this way too, it is clear that in epistemology it is not the case that ‘a rising tide lifts all boats’.

Subjective Probability and a Puzzle about Theory Confirmation

A new scientific (or quasi-scientific) theory often begins with a flamboyant, controversial new postulate. Just think of Copernicus’ theory, which starts with the postulate that the Sun is stationary and the earth moves. Or Dalton’s, that all substances are composed of atoms, which combine in molecules in remarkable ways. Or von Däniken’s, that the earth has had extra-terrestrial visitors.

The first reaction is usually that this sort of speculation can’t even be tested. But the theory is developed, with many new additions, and eventually a testable consequence appears. When that is tested, and the result is positive, the theory is said to be confirmed.

I will take it here that “confirm” has a very specific meaning: that information confirms a theory if and only if it makes that theory more likely to be true. And in addition, I will take the “likely” to be a subjective probability: my own, but it could be yours, or the community’s. So, using the symbolism I introduced in the previous post (“Moore’s Paradox and Subjective Probability”) the relation is this:

Information E confirms theory T if and only if P(T | E) > P(T)

Now, the question I want to raise is this:

In this sort of scenario, does the confirmation of the theory also raise the probability that the initial flamboyant postulate is true?

I will argue now that in general, the answer to this question must be NO. The reason is that from the prior point of view, what is eventually tested is not relevant to that initial postulate — though of course it is relevant to that postulate relative to the developed theory.

The answer NO must, I think, be surprising at first blush. But I will blame that precisely on a failure to distinguish prior relevance from relevance relative to the theory.

I will present the argument in two forms — the first quick and easy, the second a bit more finicky (relegated to the Appendix).

For my first argument I will represent the impact of the positive test as a Jeffrey Conditionalization. The testable consequence of the theory is a proposition (or if you prefer the terminology, an event) B, in a probability space S.

The prior probability function I will call P as usual, the posterior probability function P*. Let q = P(B). Then, for any event Y in S,

P(Y) = qP(Y|B) + (1 – q)P(Y| ~B)

Now when the test is performed, the impact on our subjective probability is that the probability of B is raised from q to r. Jeffrey’s recipe for the posterior probability P* is simple: all probability ratios ‘inside’ B or ‘inside’ ~B are to be kept the same as they were. Hence:

for all events Y in S, P*(Y) = rP(Y|B) + (1 – r)P(Y| ~B)

In general there can be quite a large redistribution of probabilities due to such a Jeffrey shift. However, something remains the same. Both the above formulas, for P and for P*, assign to each event Y a number that is a convex combination of two end points, namely P(Y|B) and P(Y| ~B).

What is characteristic of a convex combination is that it will be a number between the two end points.

So in the case in which Y and B are mutually irrelevant, from a prior point of view, those two endpoints are the same:

P(Y|B) = P(Y| ~B) = P(Y)

hence any convex combination of those two is also just precisely that number.

Application: Suppose A is the initial flamboyant postulate of the theory. Typically, from the prior point of view, there is no relevance between A and the eventual tested consequence of the entire theory, B. So the prior probability P is such that P(A | B) = P(A | ~B) = P(A). Therefore, when the positive evidence comes in (and the probability of the entire theory rises!) the probability of that initial flamboyant postulate stays the same.
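A toy numerical check (entirely my own, with made-up numbers) of this application: a Jeffrey shift on B moves other probabilities around but leaves P(A) exactly where it was, so long as A and B are independent under the prior:

```python
from itertools import product

# A toy check (mine): worlds are truth-value pairs (A, B); the prior makes
# A and B independent, with P(A) = 0.3 and P(B) = 0.5.
P = {(a, b): (0.3 if a else 0.7) * 0.5
     for a, b in product([True, False], repeat=2)}

def jeffrey(P, r):
    """Raise the probability of B to r, keeping all ratios inside B and
    inside ~B the same as they were."""
    pB = sum(w for (a, b), w in P.items() if b)
    return {(a, b): w * (r / pB if b else (1 - r) / (1 - pB))
            for (a, b), w in P.items()}

def prob(Q, pred):
    return sum(w for world, w in Q.items() if pred(world))

Pstar = jeffrey(P, 0.9)        # the positive test raises P(B) from 0.5 to 0.9
print(prob(P, lambda w: w[1]), prob(Pstar, lambda w: w[1]))   # 0.5  0.9
print(prob(P, lambda w: w[0]), prob(Pstar, lambda w: w[0]))   # 0.3  0.3 (up to rounding)
```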

For example, in Dalton’s time, 1810, when he introduced the atomic hypothesis into chemistry, the prior probabilities were such that any facts about Brownian motion were irrelevant to that hypothesis. (Everyone involved was ignorant of Lucretius’ argument about the movement of dust particles, and although the irregular movement of coal dust particles had been described by the Dutch physiologist Jan Ingenhousz in 1785, the phenomenon was not given serious attention until Brown discussed it in 1827.)

So when many additions and elaborations had turned the atomic theory into one with a testable consequence in data about Brownian motion (1905), that full theory was confirmed in everyone’s eyes, but the initial hypothesis about unobservable atomic structure did not become any more likely than it was in 1810.

Right?

And notice this: the entire theory is in effect a conjunction of the initial postulate with much else. But a conjunction is never more likely to be true than any of its conjuncts. So the atomic theory is not now more likely to be true than it was in Dalton’s time.

Confirmation of empirical consequences raises the probability of the theory as a whole, but that is an increase in a very low probability, which remains below that of the initial postulate; and the postulate’s probability never rises at all.

My Take On This

The confirmation of empirical consequences, most particularly when they are the results of experiments designed on the basis of the theory itself, provides evidential support for the theory.

But that has been confusedly misunderstood as confirmation of the theory as a whole, in a way that raises its probability above its initial very low plausibility. What is confirmed are certain empirical consequences, and we are right to rely ever more on the theory, in our decisions and empirical predictions, as this support increases.

The name of the game is not confirmation but credentialing and empirical grounding.

APPENDIX

It is regrettable that discussions of confirmation give so often the impression of faith in the freakonomics slogan, that A RISING TIDE LIFTS ALL BOATS.

It just isn’t so.

Confirmation is more familiarly presented as due to conditionalization on new evidence, so let’s recast the argument in that form. The following diagram will illustrate this, with the same conclusion that the probability of the initial postulate does not change when the new evidence achieves relevance only because of the other parts of the theory.

[Diagram: Q(H | B) = 2/3; Q(A & H | B) = 2/3]

Explanation: Proposition A is the initial postulate, and proposition B is what will eventually be cited as evidence. However, A by itself is still too uninformative to be testable at all.

The theory is extended by adding hypothesis H to A, and the more informative theory does allow for the design of a test. The test result is that proposition B is true.

The function q is the prior probability function. The sizes of the areas labeled A, B, H in the diagram represent their prior probabilities — notice that A and B are independent as far as the prior probability is concerned.

The function Q is the posterior probability, which is q conditionalized on the new evidence B.

The increase in probability of the conjunction (A & H) shows that the evidence confirms the theory taken as a whole. But the probability of A does not increase: Q(A) = q(A). The theory as a whole was confirmed only because its other part, H, was confirmed, and this ‘rising tide’ did not ‘lift the boat’ of the initial flamboyant postulate that gave the theory its name.
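Since the diagram does not reproduce well here, the same point can be made with explicit numbers. The following sketch (my own; the weights are chosen only so that A and B are independent under the prior q, not to match the diagram) shows Q(A & H) rising while Q(A) stays put:

```python
# Worlds are truth-value triples (A, H, B); the weights q are mine, chosen
# only so that A and B are independent under the prior.
q = {(True,  True,  True):  0.20, (True,  True,  False): 0.05,
     (True,  False, True):  0.05, (True,  False, False): 0.20,
     (False, True,  True):  0.00, (False, True,  False): 0.00,
     (False, False, True):  0.25, (False, False, False): 0.25}

def pr(P, pred):
    return sum(w for world, w in P.items() if pred(world))

B = lambda w: w[2]
# The posterior Q is q conditionalized on the new evidence B:
Q = {world: (w / pr(q, B) if B(world) else 0.0) for world, w in q.items()}

A      = lambda w: w[0]
theory = lambda w: w[0] and w[1]     # the conjunction A & H

print(pr(q, A), pr(Q, A))            # 0.5  0.5  -- the 'boat' A is not lifted
print(pr(q, theory), pr(Q, theory))  # 0.25 0.4  -- the theory as a whole rises
```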

Moore’s Paradox and Subjective Probability

As I presented Moore’s Paradox in the preceding post, its point is that there are statements which can be true but cannot be believed.

In epistemology the view called probabilism (of which there is a variety of variants) starts by modeling opinion not as a matter of just belief and disbelief, but as a graded subjective probability. The version I like is liberal; for example, opinion is generally not numerically sharp at all, but may just go as far as something qualitative (“it seems likely to rain”) or comparative (“it seems more likely to rain than to snow”) or vague in various ways (e.g. “it seems at least twice as likely to rain as not”). But these various forms must be at least consistent with some (perhaps fictional) precise statistics, so it is appropriate to focus on the case of precise, sharp probabilities.

These terms are subjective: the “seems” is important, and actually stands for “seems to me”.

But the distinction between the two linguistic functions, of stating and of expressing, discussed in the previous post, is important here too. The same words could be used in two different ways, to state that I have a certain opinion (an autobiographical statement of fact) or to express the opinion I have.

What we saw earlier is that this affects logic: the logic of expression of opinion is not the same as the familiar logic of statements. We should expect a similar result here.

What must the logic of expression of subjective probabilities be like then?

The most obvious analogue of something incoherent like “I believe that (A and I do not believe that A)” must be something like “I am sure that (A and to me A just seems as likely as not)”.

I’ve been avoiding symbolism, but here things will get too cumbersome without a bit of that. So I will use the capital “P” for the subjective probability that is expressed as mine, and the little letter “p” for the attribution to myself of a state of opinion in terms of probability.

Then the above example of an incoherent subjective probability judgment is rendered as follows:

P(A & p(A) = 1/2) = 1

In the case of subjective certainty (probability = 1), each conjunct in a conjunction is certain as well, so that judgment implies:

P(A) = 1 and P(p(A) = 1/2) = 1

which shows very clearly a disharmony between the opinion I express and the state of opinion I attribute to myself.

There must be something less than certainty that could attach to this sort of thing — let us ask what x could be such that

P(A & p(A) = 1/2) = x

This number x cannot be higher than P(A), and if I have any insight into what state of opinion I must actually be having, it can’t be higher than 1/2 either.

Though x cannot equal 1, there is no reason to think that x must be zero, in general. What more could we say?

I will make a proposal: there must be operative here a principle of coherence that goes beyond purely logical consistency, a principle of minimum harmony between the expression and self-attribution of opinion, on pain of incoherence.

To formulate this principle, we need the notion of conditional probability: the probability of something, given (or: on the supposition that) something else is the case. This is quite intuitive: for example the probability that it will rain, given that it is cloudy, is higher than if the sky is clear, and the probability that a tossed coin will land heads up, given that it is a fair coin, is one-half.

We symbolize “given” with the upright bar “|”. So the principle of minimum harmony is this:

P(A | p(A) = x) = x

Now this leads us to a departure from the familiar basic logic of probability, that is quite analogous to the departure from the familiar logic of statements seen in the previous post.

For in the familiar basic logic of probability we have a theorem similar to the Deduction Theorem, namely:

If the supposition that P(A) = 1 implies that P(B) = y,

then P(B | A) = y.

For example, if it is certain that a tossed die is fair then the probability of a toss landing with an even number up equals 1/2. And indeed, from this it follows that the probability of a tossed die landing with an even number up, given that the die is fair, equals 1/2.

But given our above principle of minimum harmony we can deduce a counterexample for the logic of expressed subjective probability:

(a) That P(A) = 1 implies that P(p(A) = 1/2) = 0

(b) But it is not in general the case that P(p(A) = 1/2 | A) = 0

To see that (b) is correct, think of some examples. At the moment, for me, the probability that (rain in Peking seems to me only as likely as not), given (that it is in fact raining in Peking), is not zero. And if a coin has been tossed, and I have not seen the outcome, my state of opinion assigns 1/2 to heads whatever the outcome actually was; so my probability that (heads seems to me as likely as not), given that the outcome was in fact heads, is just what it is without that supposition — certainly not 0.
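Here is a small joint model (entirely my own construction, with made-up weights) in which minimum harmony holds by construction, and in which the conditional probability in (b) is visibly non-zero:

```python
# A toy joint model (entirely my construction): worlds are pairs (x, a) of a
# self-attributed probability x = p(A) and a truth value a for A.  Minimum
# harmony is built in by giving A probability x within each state x.
states = {0.25: 0.4, 0.5: 0.4, 0.75: 0.2}      # P(p(A) = x), summing to 1

P = {}
for x, w in states.items():
    P[(x, True)]  = w * x                       # state x together with A
    P[(x, False)] = w * (1 - x)                 # state x together with ~A

def pr(pred):
    return sum(w for world, w in P.items() if pred(world))

A = lambda w: w[1]

# Check minimum harmony: P(A | p(A) = x) = x for each state x.
for x in states:
    print(x, pr(lambda w: A(w) and w[0] == x) / pr(lambda w: w[0] == x))

# (b): P(p(A) = 1/2 | A) is not zero -- here it is 0.2 / 0.45 ≈ 0.44.
print(pr(lambda w: w[0] == 0.5 and A(w)) / pr(A))

# In keeping with (a): the state x = 1/2 can carry positive weight only
# because P(A) = 0.45 < 1 in this model.
```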

That substantiates (b); what about (a)? For that the argument is a little more technical, and so if you believe me you can stop reading here. For anyone wanting to check my reasoning, here is an Appendix.

APPENDIX: that (a) is correct.

We need to use the standard definition of conditional probability, which is:

P(X | Y) = P(X & Y) divided by P(Y)

and have to use the theorem that if P(X) = 1 then P(X & Y) = P(Y).

So let us apply these: suppose P(A) = 1 and by minimum harmony, P(A | p(A) = 1/2) = 1/2.

The latter amounts to P(A & p(A) = 1/2) divided by P(p(A) = 1/2), and so that must equal 1/2.

Or, equivalently, P(A & p(A) = 1/2) equals (1/2) times P(p(A) = 1/2)

But since by supposition P(A) = 1, that amounts to P(p(A) = 1/2) equals (1/2) times P(p(A) = 1/2). There is only one number that equals half itself, the number 0. So indeed, if P(A) = 1 then P(p(A) = 1/2) = 0. As was to be shown.

NOTE: another way to give this last little argument is to show that, with our principle of minimum harmony, if P(A) = 1 then P(A | p(A) = 1/2) is not a well-defined ratio. For if it were, it would imply that P(p(A) = 1/2) divided by itself would equal 1/2, but of course it also equals 1, a contradiction if the ratio is well-defined. Either way we get to the right conclusion, for if a ratio is not well-defined then the denominator equals zero.