## Friday, July 15, 2011

### A Primer on Bayesian Philosophy: 1 - Basic axiomatic probability

Many questions are philosophically prior to adopting the (mostly) accepted axioms of probability, i.e. Kolmogorov's axioms. But these questions cannot be discussed adequately until probability-as-employed has been introduced.

This post is going to be formal, even too formal. It is not intended to be a one-stop shop for the beginner to comprehend probability. Instead, the beginner should scan this post and use it as a Bayesianism-focused reference while studying the first few chapters of a traditional undergraduate probability textbook or some other professional work.<sup>4</sup> Do not feel the need to understand everything here before proceeding further.

As I explain in the next post, the central idea of Bayesianism is the representation of credence, the strength of commitment to a proposition, by a non-negative real number [see (1)]. When encountering this idea and the axioms of probability, think about what Bayesianism assumes in doing so. Ask yourself, for example, why negative reals do not represent credence. Think of what one could gain, if anything, by allowing them. Ask yourself whether credence should lie on the real interval at all. Why not the rational numbers? Why not a vector space (to allow multi-dimensional values)? Why employ the continuum? Why should credence be normalizable [see (2)]?

I hope to discuss most of these questions in detail later, but I hope those encountering the axioms now will prefer honest toil to theft and think about why these postulates are postulated.

First, some notes on notation. The subset symbol '$\subset$' denotes general inclusion, not strict inclusion. I write $[n]:=\{1,...,n\}$ for $n\in\mathbb{P}$, where $\mathbb{P}$ is the set of positive integers. This is much more elegant than the explicit version, and the symbol for the positive integers helps diminish confusion with the natural numbers (non-negative integers).

As I am more comfortable with set notation, I introduce the topic as such, but it is possible to rewrite what follows in propositional form. It is quite common to encounter this, but the difference is symbolic.<sup>5</sup> Conjunction, disjunction, 'and', 'or', negation, etc. may be substituted as needed.

Let $\Omega$ denote the sample space, the set of possibilities, subsets of which are called events. For a simple coin toss, the sample space is $\{heads,tails\}$, and the set of events is the powerset $2^\Omega=\{\emptyset,\{heads\},\{tails\},\{heads,tails\}\}$. An event occurs if one of its elements occurs. Then $p$ is a probability function on $\Omega$ if it satisfies the following: $Domain(p)\subset 2^\Omega$ is closed under set complement, and

1. Positivity: $E\in Domain(p)\implies 0\leq p(E)\in\mathbb{R}$.

2. Normalizability: $\Omega\in\text{Domain}(p)\text{ and }p(\Omega)=1$

3. Finite additivity: $E,F,E\cup F\in Domain(p)\text{ and }E\cap F=\emptyset\implies p(E\cup F)=p(E)+p(F)$
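As a rough illustration, the three axioms can be checked mechanically on the coin-toss sample space. The sketch below assumes a fair coin and takes Domain(p) to be the full powerset. The names `powerset`, `weights` and `satisfies_axioms` are illustrative choices, not part of the text above.

```python
from fractions import Fraction
from itertools import combinations

# Sample space for a single coin toss, as in the text.
omega = frozenset({"heads", "tails"})

def powerset(s):
    """Return all subsets of s as frozensets."""
    items = list(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

# Assumed fair-coin weights; any non-negative weights summing to 1 would do.
weights = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}

def p(event):
    """Probability of an event: the sum of the weights of its elements."""
    return sum((weights[x] for x in event), Fraction(0))

def satisfies_axioms(p, omega):
    """Check (1) positivity, (2) normalizability, (3) finite additivity."""
    events = powerset(omega)
    positivity = all(p(e) >= 0 for e in events)
    normalized = p(omega) == 1
    # Additivity is only required for disjoint pairs of events.
    additive = all(p(e | f) == p(e) + p(f)
                   for e in events for f in events if not (e & f))
    return positivity and normalized and additive

print(satisfies_axioms(p, omega))  # True
```

Exact fractions are used instead of floats so that the equality tests in (2) and (3) are not disturbed by rounding.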

Those familiar with probability theory will notice certain differences from a standard presentation. Note that p is defined on subsets of $\Omega$, but I have not assumed that it is defined on all of $2^\Omega$ or on a sigma-field of its subsets. The only general assumption about Domain(p) which I make is that it is closed under complements, i.e. $E\in Domain(p)\implies E^c\in Domain(p)$. Axiom (3) is a weakening of the third axiom as it is usually introduced, namely countable additivity:

3*. Countable additivity: $\text{For }\{E_i\}_{i\in I}\text{ a countable set of mutually disjoint events, }p\left(\bigcup_{i\in I}E_i\right)=\sum_{i\in I}p(E_i)$.

(Note that I have ceased to make the domain assumptions explicit.) It is an elementary theorem that (1), (2), and (3*) entail (3). But accepting countable additivity, along with assuming that Domain(p) is a sigma-field, may assume too much. For purposes of simplification, I will implicitly assume that p is defined on all subsets of the sample space. But as we will see in later sections, this is not necessarily the case.<sup>1</sup> And of course, one may generalize probability further, but that would require more advanced mathematics, namely measure theory.

To understand the appeal of these axioms, it is important to remember that they are relatively young: Kolmogorov first published them in 1933. Before then, probabilities were defined as relative frequencies, i.e.

$p(A)=\frac{number(A)}{N}$

where number(A) denotes the number of occurrences of A in N trials, or limiting long run frequencies, i.e.

$p(A)=\lim_{N\rightarrow\infty}\frac{number(A)}{N}$.

One notices several problems with these notions. For the latter, the assumption that a limit exists is required. For both, N counts a reference class of events which requires specification, and reference classes are not always clear.<sup>2</sup> There are also counter-factual commitments implicit in the definition, e.g. 'if you were to toss this coin ad infinitum...' Worse still, relative frequency presupposes the uniform distribution and a finite sample space, and not all possibilities are equiprobable. But the Kolmogorov axioms capture such notions, where applicable.
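The finite relative frequency $number(A)/N$ is easy to compute from a record of trials. The sketch below simulates coin tosses with a fixed seed, which is an assumption made purely for repeatability. It shows the frequency at several values of N, but a finite run cannot by itself establish that the limit exists.

```python
import random

def relative_frequency(outcomes, event):
    """number(A) / N for a finite record of N trials."""
    return sum(1 for o in outcomes if o in event) / len(outcomes)

rng = random.Random(0)  # fixed seed so the run is repeatable
tosses = [rng.choice(["heads", "tails"]) for _ in range(10_000)]

# Relative frequency of heads after N = 10, 100, 1000, 10000 tosses.
for n in (10, 100, 1_000, 10_000):
    print(n, relative_frequency(tosses[:n], {"heads"}))
```

Note that the simulated coin already has a probability of heads built in. The code illustrates the arithmetic of the definition, not a way of grounding probability in frequency.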

Before going further, conditional probabilities need to be introduced. Very often, conditional probability is presented as a definition rather than as an analysis, although in practice it almost always functions as an analysis. Usually, it is given in ratio analysis form:

$p(A|B)=\frac{p(A\cap B)}{p(B)}\text{ where }p(B)>0$

The assumption that $p(B)>0$ may be dispensed with when given in multiplication law form:

$p(A|B)=X\in\mathbb{R}:p(B)\times X=p(A\cap B)$

As the ratio 'definition' is most common, I accept it as the default, with caution as to its shortcomings.<sup>3</sup> When $p(B)>0$, the ratio definition allows a quick proof that $p(\cdot|B)$ is itself a probability, a process known as reduction of sample space. The multiplication law also easily follows from the ratio form.
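A minimal sketch of the ratio form, assuming a fair six-sided die as the sample space (the die and the names `cond`, `even` and `at_least_four` are illustrative assumptions):

```python
from fractions import Fraction

omega = frozenset(range(1, 7))  # assumed: one roll of a fair six-sided die

def p(event):
    """Uniform probability on omega."""
    return Fraction(len(event & omega), len(omega))

def cond(a, b):
    """Ratio form: p(A|B) = p(A ∩ B) / p(B), undefined when p(B) = 0."""
    if p(b) == 0:
        raise ValueError("the ratio form requires p(B) > 0")
    return p(a & b) / p(b)

even = frozenset({2, 4, 6})
at_least_four = frozenset({4, 5, 6})

print(cond(even, at_least_four))   # 2/3
print(cond(omega, at_least_four))  # 1, so p(.|B) is normalized
# Multiplication law form: p(B) * p(A|B) = p(A ∩ B).
print(p(at_least_four) * cond(even, at_least_four) == p(even & at_least_four))
```

The explicit `ValueError` marks the point where the ratio form gives out. The multiplication law form still holds when $p(B)=0$, but it no longer fixes a unique value.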

Theorem List

In what follows, probabilities are assumed to be defined wherever they appear. Items marked with an asterisk require countable additivity. [I currently see no need to include such theorems, some of which (e.g. continuity with respect to series of subsets and supersets) require calculus.] Unless stated otherwise, theorems follow from (1)-(3) and the ratio analysis of conditional probability. I will state these theorems in logical order; their derivation can be found in any elementary probability textbook, although many needlessly invoke countable additivity in the process.

Equivalence condition. $A=B\implies p(A)=p(B)$

Probability of impossibility. $p(\emptyset)=0$. Note that the converse does not hold: some probabilities assign probability 0 to non-empty events, just as some probabilities assign probability 1 to uncertain events. Such events are said to almost never occur and almost surely occur, respectively. There are some philosophical difficulties with such events, as I discuss later. Many Bayesians employ the regularity norm (an unfortunate name), which states that uncertain events always have positive probability.

Complement rule. $\text{For }E\text{ an event}, p(E^c)=1-p(E)$

This follows by noticing that, for $E$ an event, $1=p(\Omega)=p(E^c)+p(E)$, a rule which can be generalized to account for larger partitions of $\Omega$.

Subset rule. $A\subset B\implies p(A)\leq p(B)$
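One way to see the subset rule, sketched under the assumption that $B\cap A^c$ also lies in Domain(p): split $B$ into two disjoint pieces, apply (3), then apply (1) to the second piece.

$A\subset B\implies B=A\cup(B\cap A^c)\text{ and }A\cap(B\cap A^c)=\emptyset$

$p(B)=p(A)+p(B\cap A^c)\geq p(A)$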

Finite additivity. $\text{For }\{E_i\}_{i\in[n]}\text{ a set of mutually disjoint events, }p\left(\cup_{i\in[n]}E_i \right)=\sum_{i\in[n]}p(E_i)$

Inclusion/Exclusion Principle. $\text{ For }\{E_i\}_{i\in [n]}\text{ events, }p\left(\bigcup_{i\in[n]} E_i\right)=\sum_{\emptyset\neq I\subset[n]}(-1)^{|I|-1}p\left(\bigcap_{i\in I}E_i\right)$

This is often encountered in its simpler form: $p(A\cup B)=p(A)+p(B)-p(A\cap B)$
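The general formula can be checked numerically by summing over the non-empty index sets. This sketch assumes a uniform probability on $\{1,...,12\}$ and three overlapping events, all chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations

omega = frozenset(range(1, 13))  # assumed uniform sample space {1,...,12}

def p(event):
    return Fraction(len(event), len(omega))

def inclusion_exclusion(events):
    """Sum over non-empty index sets I of (-1)^(|I|-1) * p(intersection)."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        for subset in combinations(events, r):
            intersection = frozenset.intersection(*subset)
            total += (-1) ** (r - 1) * p(intersection)
    return total

# Multiples of 2, 3 and 4: overlapping, not mutually disjoint.
events = [frozenset(x for x in omega if x % k == 0) for k in (2, 3, 4)]
union = frozenset.union(*events)

print(inclusion_exclusion(events), p(union))  # 2/3 2/3
```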

Law of Total Probability. $\text{ For }A,B\text{ arbitrary events, }p(A)=p(A\cap B)+p(A\cap B^c)$

$\text{ For }A\text{ an event and }\{B_i\}_{i\in[n]}\text{ a partition of }\Omega,\ p(A)=\sum_{i\in[n]}p(A\cap B_i)$

This has an equivalent in terms of conditional probability:

$\text{ For }A\text{ an event, }\{B_i\}_{i\in[n]}\text{ a partition of }\Omega:p(B_i)>0\forall i,\ p(A)=\sum_{i\in[n]}p(B_i)p(A|B_i)$

Multiplication law. For $\{E_i\}_{i\in[n]}$ a set of events in which arbitrary intersections have nonzero probability,
$p(\cap_{i\in[n]}E_i)=p(E_1)p(E_2|E_1)\times\cdots\times p(E_n|E_1\cap\cdots\cap E_{n-1})$
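As a small worked case of the multiplication law, consider drawing three aces in a row from a standard 52-card deck without replacement. The deck example and the name `chain_rule` are assumptions for illustration. Here $E_i$ is 'the i-th card is an ace'.

```python
from fractions import Fraction
from math import comb

def chain_rule(factors):
    """Multiply p(E_1), p(E_2|E_1), ..., p(E_n|E_1 ∩ ... ∩ E_{n-1})."""
    product = Fraction(1)
    for factor in factors:
        product *= factor
    return product

# p(E_1) = 4/52, p(E_2|E_1) = 3/51, p(E_3|E_1 ∩ E_2) = 2/50.
factors = [Fraction(4, 52), Fraction(3, 51), Fraction(2, 50)]
print(chain_rule(factors))  # 1/5525

# Cross-check by counting: choose 3 of the 4 aces out of all 3-card hands.
print(Fraction(comb(4, 3), comb(52, 3)))  # 1/5525
```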

Sample space reduction. $p(B)>0$ implies that $p(\cdot|B)$ is itself a probability map [over Domain(p)], i.e. $p(\cdot|B)$ satisfies (1)-(3) and therefore satisfies each of the previous theorems.

Bayes' Theorem. $\text{ For }A,B\text{ events, } p(A|B)=\frac{p(B|A)p(A)}{p(B)}$

This may be generalized using the Law of Total Probability: $\text{For }A_i\text{ a block in a partition }\{A_i\}_{i\in[n]},$ $p(A_i|B)=\frac{p(B|A_i)p(A_i)}{\sum_{j\in[n]}p(A_j)p(B|A_j)}$
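The generalized form can be sketched as a short function that takes the priors $p(A_i)$ and the likelihoods $p(B|A_i)$ over a partition. The three-machine scenario and all of its numbers are hypothetical, chosen only to keep the arithmetic exact.

```python
from fractions import Fraction

def posteriors(priors, likelihoods):
    """Return p(A_i|B) for each block A_i of a partition.

    priors[i] is p(A_i); likelihoods[i] is p(B|A_i).
    """
    # Law of Total Probability: p(B) = sum_j p(A_j) p(B|A_j).
    evidence = sum(pr * lk for pr, lk in zip(priors, likelihoods))
    if evidence == 0:
        raise ValueError("p(B) = 0: the ratio form is undefined")
    return [pr * lk / evidence for pr, lk in zip(priors, likelihoods)]

# Hypothetical: three machines make 1/2, 3/10 and 1/5 of all parts,
# with defect rates 1/100, 2/100 and 3/100. B = 'the part is defective'.
priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
likelihoods = [Fraction(1, 100), Fraction(2, 100), Fraction(3, 100)]

post = posteriors(priors, likelihoods)
print([str(x) for x in post])  # ['5/17', '6/17', '6/17']
```

The posteriors sum to 1, as they must if $p(\cdot|B)$ is a probability and the $A_i$ partition $\Omega$.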

Of great importance is the odds form of Bayes' Theorem: for H a hypothesis and E some evidence,

$\frac{p(H|E)}{p(H^c|E)}=\frac{p(E|H)}{p(E|H^c)}\frac{p(H)}{p(H^c)}$

In this form, the term $\frac{p(H|E)}{p(H^c|E)}$ is called the posterior odds on H, $\frac{p(E|H)}{p(E|H^c)}$ the Bayes factor, and $\frac{p(H)}{p(H^c)}$ the prior odds on H. The significance of the terminology is explained in the next post.
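A small numerical sketch of the odds form, with made-up values for $p(H)$, $p(E|H)$ and $p(E|H^c)$. The function names are illustrative, and the code assumes $p(H)<1$ and $p(E|H^c)>0$ so that both ratios are defined.

```python
from fractions import Fraction

def posterior_odds(prior_h, p_e_given_h, p_e_given_not_h):
    """Posterior odds on H = Bayes factor * prior odds on H."""
    prior_odds = prior_h / (1 - prior_h)          # p(H) / p(H^c)
    bayes_factor = p_e_given_h / p_e_given_not_h  # p(E|H) / p(E|H^c)
    return bayes_factor * prior_odds

def odds_to_probability(odds):
    """Convert odds on H back into p(H|E)."""
    return odds / (1 + odds)

# Assumed values: p(H) = 1/100, p(E|H) = 9/10, p(E|H^c) = 1/10.
odds = posterior_odds(Fraction(1, 100), Fraction(9, 10), Fraction(1, 10))
print(odds, odds_to_probability(odds))  # 1/11 1/12
```

With these numbers the Bayes factor of 9 moves prior odds of 1/99 to posterior odds of 1/11. This agrees with applying Bayes' Theorem directly: $\frac{(9/10)(1/100)}{(9/10)(1/100)+(1/10)(99/100)}=\frac{1}{12}$.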

1. The subjective interpretation of probability allows for this. It is possible to be separately committed to events $E\text{ or }F,\text{ but not }E\cap F$.
2. This is discussed further in the subjective/objective post.
3. See e.g. Alan Hájek's "What Conditional Probability Could Not Be".
4. I largely follow Colin Howson and Peter Urbach. Scientific Reasoning: The Bayesian Approach. Open Court: La Salle, 1989.
5. Ibid., pp. 18-19.
