This post is formal, perhaps too formal. It is not intended as a one-stop shop for the beginner to comprehend probability. Instead, the beginner should scan this post and use it as a Bayesianism-focused reference while studying the first few chapters of a traditional undergraduate probability textbook or some other professional work.4 Do not feel the need to understand everything here before proceeding further.
As I explain in the next post, the central idea of Bayesianism is the representation of credence, the strength of commitment to a proposition, by a non-negative real number [see (1)]. When encountering this idea and the axioms of probability, think about what Bayesianism assumes in doing so. Ask yourself, for example, why negative reals do not represent credence. Think of what one could gain, if anything, by allowing them. Ask yourself whether credence should lie on a real interval at all. Why not the rational numbers? Why not a vector space (to allow multi-dimensional values)? Why employ the continuum? Why should credence be normalizable [see (2)]?
I intend to discuss most of these questions in detail later, but I hope those encountering the axioms now will prefer honest toil to theft and think about why these postulates are postulated.
First, some notes on notation. The subset symbol `$\subset$' denotes general inclusion, not strict inclusion. I write $\mathbb{Z}^+$ for the set of positive integers. This is much more elegant than the explicit version, and the symbol for the positive numbers helps diminish confusion with the natural numbers (non-negative integers).
As I am more comfortable with set notation, I introduce the topic as such, but it is possible to rewrite what follows in propositional form. It is quite common to encounter this, but the difference is symbolic.5 Conjunction, disjunction, `and', `or', negation, etc. may be substituted as needed.
Let $\Omega$ denote the sample space, the set of possibilities, subsets of which are called events. For a simple coin toss, the sample space is $\Omega = \{H, T\}$, and the set of events is the powerset $\mathcal{P}(\Omega) = \{\emptyset, \{H\}, \{T\}, \{H, T\}\}$. An event $A$ occurs if an element $\omega \in A$ occurs. Then $p$ is a probability function on $\Omega$ if it satisfies the following: $\text{Domain}(p)$ is closed under set complement, and
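1. Positivity: $p(A) \geq 0$ for all $A \in \text{Domain}(p)$.
2. Normalization: $p(\Omega) = 1$.
3. Finite additivity: if $A, B, A \cup B \in \text{Domain}(p)$ and $A \cap B = \emptyset$, then $p(A \cup B) = p(A) + p(B)$.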
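For a finite sample space, positivity, normalization, and finite additivity can be checked by brute force. The following is a minimal Python sketch, assuming p is defined on the whole powerset and is generated by point weights; the helper names (`powerset`, `make_p`, `satisfies_axioms`) and the example weights are illustrative assumptions, not part of the formal development.

```python
from fractions import Fraction
from itertools import chain, combinations


def powerset(omega):
    """Return every subset of omega as a frozenset."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


def make_p(weights):
    """Build a set function on events from hypothetical point weights."""
    return lambda event: sum((weights[w] for w in event), Fraction(0))


def satisfies_axioms(p, omega):
    """Check positivity, normalization, and finite additivity on the powerset."""
    events = powerset(omega)
    positivity = all(p(a) >= 0 for a in events)
    normalization = p(frozenset(omega)) == 1
    additivity = all(
        p(a | b) == p(a) + p(b)
        for a in events
        for b in events
        if not (a & b)  # only disjoint pairs are constrained
    )
    return positivity and normalization and additivity


omega = {"H", "T"}
fair = make_p({"H": Fraction(1, 2), "T": Fraction(1, 2)})
biased = make_p({"H": Fraction(7, 10), "T": Fraction(3, 10)})
unnormalized = make_p({"H": Fraction(1, 2), "T": Fraction(1, 4)})

print(satisfies_axioms(fair, omega))
print(satisfies_axioms(biased, omega))
print(satisfies_axioms(unnormalized, omega))
```

Exact fractions are used so that the additivity check is not disturbed by floating-point rounding. Note that the biased coin passes just as the fair one does: the axioms constrain the form of p, not its particular values.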
Those familiar with probability theory will notice certain differences from a standard presentation. Note that $p$ is defined on subsets of $\Omega$, but I have not assumed that it is defined on all of $\mathcal{P}(\Omega)$ or on a sigma-field of its subsets. The only general assumption about Domain(p) which I make is that it is closed under complements, i.e. $A \in \text{Domain}(p)$ implies $A^c \in \text{Domain}(p)$, where $A^c$ is the complement of $A$ in $\Omega$. Axiom (3) is a weakening of countable additivity. Usually, (3) is introduced as countable additivity:
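3*. Countable additivity: if $A_1, A_2, \dots$ are pairwise disjoint, then $p\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} p(A_i)$.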
(Note that I have ceased to make the domain assumptions explicit.) It is an elementary theorem that (1), (2), and (3*) entail (3). But accepting countable additivity, along with assuming that Domain(p) is a sigma-field, may assume too much. For purposes of simplification, I will implicitly assume that p is defined on all subsets of the sample space. But as we will see in later sections, this is not necessarily the case.1 And of course, one may generalize probability further, but that would require more advanced mathematics, i.e. measure theory.
To understand the appeal of these axioms, it is important to remember that they are relatively young: Kolmogorov first published them in 1933. Before then, probabilities were defined as relative frequencies, i.e.
$$p(A) = \frac{\text{number}(A)}{N},$$
where number(A) denotes the number of occurrences of A in N trials, or as limiting long-run frequencies, i.e.
$$p(A) = \lim_{N \to \infty} \frac{\text{number}(A)}{N}.$$
One notices several problems with these notions. For the latter, the assumption that a limit exists is required. For both, N counts a reference class of events which requires specification, and reference classes are not always clear.2 There are also counter-factual commitments implicit in the definition, e.g. `if you were to toss this coin ad infinitum...' Worse still, relative frequency presupposes the uniform distribution and a finite sample space, and not all possibilities are equiprobable. But the Kolmogorov axioms capture such notions, where applicable.
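The relative-frequency notion is easy to illustrate by simulation. The sketch below is only an illustration under stated assumptions: the function name `relative_frequency`, the chance parameter `p_heads`, and the fixed seed are all hypothetical choices. The simulation also has to presuppose a chance of heads in order to generate the tosses, so it illustrates the definition rather than justifying it.

```python
import random


def relative_frequency(n_trials, p_heads=0.5, seed=0):
    """Return number(heads) / N for N simulated tosses.

    p_heads is an assumed chance of heads used to drive the simulation.
    """
    rng = random.Random(seed)  # local generator, reproducible for a given seed
    heads = sum(rng.random() < p_heads for _ in range(n_trials))
    return heads / n_trials


# The ratio fluctuates for small N and settles as N grows.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n))
```

Any finite run yields only a ratio over the trials performed; the limiting frequency is never observed, which is one way of seeing why the existence of the limit has to be assumed.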
Before going further, conditional probabilities need to be introduced. Very often, conditional probability is presented as a definition rather than as an analysis, although in practice it almost always functions as an analysis. Usually, it is given in ratio analysis form:
$$p(A|B) = \frac{p(A \cap B)}{p(B)}, \quad p(B) > 0.$$
The assumption that $p(B) > 0$ may be dispensed with when given in multiplication law form:
$$p(A \cap B) = p(A|B)\,p(B).$$
As the ratio `definition' is most common, I accept it as the default, with caution as to its shortcomings.3 When $p(B) > 0$, the ratio definition allows a quick proof that $p(\cdot|B)$ is itself a probability, a process known as reduction of sample space. The multiplication law also easily follows from the ratio form.
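The ratio form and the reduction of sample space can be made concrete on a small example. The following Python sketch assumes a fair six-sided die, which is not an example from the text; the names `OMEGA`, `p`, and `conditional` are illustrative.

```python
from fractions import Fraction

# Assumed example: a fair six-sided die.
OMEGA = frozenset(range(1, 7))


def p(event):
    """Uniform probability of an event (a set of die faces)."""
    return Fraction(len(event & OMEGA), len(OMEGA))


def conditional(a, b):
    """Ratio form: p(A|B) = p(A & B) / p(B), defined only when p(B) > 0."""
    if p(b) == 0:
        raise ValueError("ratio form is undefined when p(B) = 0")
    return p(a & b) / p(b)


even = frozenset({2, 4, 6})
at_least_four = frozenset({4, 5, 6})

# Conditioning on "at least four" leaves two even faces out of three.
print(conditional(even, at_least_four))

# Multiplication law recovered from the ratio form.
print(p(even & at_least_four) == conditional(even, at_least_four) * p(at_least_four))

# Reduction of sample space: p(.|B) is normalized and additive on disjoint events.
print(conditional(OMEGA, at_least_four) == 1)
print(
    conditional(frozenset({4, 5}), at_least_four)
    == conditional(frozenset({4}), at_least_four)
    + conditional(frozenset({5}), at_least_four)
)
```

Raising an error when p(B) = 0 mirrors the restriction on the ratio form; it is a design choice of this sketch, not a claim about how such cases ought to be handled.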
In what follows, probabilities are assumed to be defined wherever they appear. Items marked with an asterisk require countable additivity. [I currently see no need to include such theorems, some of which (e.g. continuity with respect to series of subsets and supersets) require calculus.] Unless stated otherwise, theorems follow from (1)-(3) and the ratio analysis of conditional probability. I will state these theorems in logical order; their derivation can be found in any elementary probability textbook, although many needlessly invoke countable additivity in the process.
Probability of impossibility. $p(\emptyset) = 0$. Note that the converse does not hold: some probability functions assign probability 0 to non-empty events, just as some assign probability 1 to uncertain events. Such events are said to almost never occur and almost surely occur, respectively. There are some philosophical difficulties with such events, as I discuss later. Many Bayesians employ the regularity norm (an unfortunate name), which states that uncertain events always have positive probability.
This follows by noticing that $p(A) + p(A^c) = 1$ for any event $A$ and its complement $A^c$, a rule which can be generalized to account for larger partitions of $\Omega$.
This is often encountered in its simpler form:
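Law of Total Probability. $p(A) = p(A \cap B) + p(A \cap B^c)$.
This has a ready generalization: for a partition $\{B_1, \dots, B_n\}$ of $\Omega$,
$$p(A) = \sum_{i=1}^{n} p(A \cap B_i).$$
Which has an equivalent in terms of conditional probability:
$$p(A) = \sum_{i=1}^{n} p(A|B_i)\,p(B_i).$$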
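As a quick arithmetic check of the Law of Total Probability in its conditional form, where p(A) is the sum of p(A|B_i) p(B_i) over the cells B_i of a partition, here is a small Python sketch. The function name `total_probability` and the numbers are hypothetical, chosen only so that the sum is easy to verify by hand.

```python
from fractions import Fraction


def total_probability(priors, likelihoods):
    """Return the sum over i of p(A|B_i) * p(B_i).

    priors      -- p(B_i) for each cell of the partition (must sum to 1)
    likelihoods -- p(A|B_i) for the same cells, in the same order
    """
    if sum(priors) != 1:
        raise ValueError("the cell probabilities of a partition must sum to 1")
    return sum((l * q for l, q in zip(likelihoods, priors)), Fraction(0))


# Hypothetical three-cell partition.
priors = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
likelihoods = [Fraction(1, 10), Fraction(1, 5), Fraction(1, 2)]

# 1/20 + 1/15 + 1/12 = 12/60
print(total_probability(priors, likelihoods))  # → 1/5
```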
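Multiplication law. For a set of events $\{A_1, \dots, A_n\}$ in which arbitrary intersections have nonzero probability,
$$p(A_1 \cap \dots \cap A_n) = p(A_1)\,p(A_2|A_1)\,p(A_3|A_1 \cap A_2) \cdots p(A_n|A_1 \cap \dots \cap A_{n-1}).$$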
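The multiplication law for several events expresses the probability of an intersection as a product of successive conditional probabilities. A minimal Python sketch follows, using an assumed example that is not from the text: drawing three aces in a row from a standard 52-card deck without replacement. The name `chain_rule` is illustrative.

```python
from fractions import Fraction
from functools import reduce
from math import comb
from operator import mul


def chain_rule(conditionals):
    """Multiply p(A_1), p(A_2|A_1), ..., p(A_n|A_1 & ... & A_{n-1})."""
    return reduce(mul, conditionals, Fraction(1))


# Assumed example: each factor conditions on the aces already drawn.
draws = [Fraction(4, 52), Fraction(3, 51), Fraction(2, 50)]
via_chain_rule = chain_rule(draws)

# Independent check by counting: choose 3 of the 4 aces out of all 3-card hands.
via_counting = Fraction(comb(4, 3), comb(52, 3))

print(via_chain_rule, via_counting)  # → 1/5525 1/5525
```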
Sample space reduction. $p(B) > 0$ implies that $p(\cdot|B)$ is itself a probability map [over Domain(p)]. I.e., $p(\cdot|B)$ satisfies (1)-(3) and therefore satisfies each of the previous theorems.
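Bayes' Theorem. $p(B|A) = \dfrac{p(A|B)\,p(B)}{p(A)}$.
This may be generalized using the Law of Total Probability: for a partition $\{B_1, \dots, B_n\}$ of $\Omega$,
$$p(B_j|A) = \frac{p(A|B_j)\,p(B_j)}{\sum_{i=1}^{n} p(A|B_i)\,p(B_i)}.$$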
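Of great importance is the odds form of Bayes' Theorem: for H a hypothesis and E some evidence,
$$\frac{p(H|E)}{p(H^c|E)} = \frac{p(E|H)}{p(E|H^c)} \cdot \frac{p(H)}{p(H^c)}.$$
In this form, the term $\frac{p(H|E)}{p(H^c|E)}$ is called the posterior odds on H, $\frac{p(E|H)}{p(E|H^c)}$ the Bayes factor, and $\frac{p(H)}{p(H^c)}$ the prior odds on H. The significance of the terminology is explained in the next post.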
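The odds form reduces updating to a single multiplication: posterior odds equal the Bayes factor times the prior odds. The Python sketch below works through one case under assumed numbers (a prior of 1/100 and likelihoods of 9/10 and 1/10); the function names are illustrative, not standard.

```python
from fractions import Fraction


def posterior_odds(prior_h, like_e_given_h, like_e_given_not_h):
    """Posterior odds on H = Bayes factor * prior odds on H."""
    prior_odds = prior_h / (1 - prior_h)
    bayes_factor = like_e_given_h / like_e_given_not_h
    return bayes_factor * prior_odds


def odds_to_probability(odds):
    """Convert odds o on H into the probability o / (1 + o)."""
    return odds / (1 + odds)


# Hypothetical numbers: p(H) = 1/100, p(E|H) = 9/10, p(E|not H) = 1/10.
odds = posterior_odds(Fraction(1, 100), Fraction(9, 10), Fraction(1, 10))
print(odds)                       # → 1/11
print(odds_to_probability(odds))  # → 1/12
```

A Bayes factor of 9 multiplies the prior odds of 1/99 to give 1/11, so evidence nine times likelier under H than under its negation still leaves H improbable when the prior is small. The same posterior, 1/12, results from the ordinary form of Bayes' Theorem with the denominator expanded by total probability.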
1. The subjective interpretation of probability allows for this. It is possible to be separately committed to events .
2. This is discussed further in the subjective/objective post.
3. See e.g. Alan Hájek's What Conditional Probability Could Not Be.
4. I largely follow Colin Howson and Peter Urbach. Scientific Reasoning: The Bayesian Approach. Open Court: La Salle, 1989.
5. Ibid., pp. 18-19.