Preface

These are lecture notes for a one-semester course on probability theory. They are meant to be fully self-contained, assuming a basic knowledge of measure theory (which is reviewed briefly at the beginning). For further reading, I recommend the three following standard references, on which these lecture notes are in part based:

There are of course many other excellent books on the subject.

These notes are liable (i.e. virtually certain) to contain typos. If you find any, please make sure you tell me!

Before getting into the subject proper, in this short preface we give a very brief overview of the subject’s history and of its relation to the natural sciences, with which it has always had a close interaction. This is meant only for the curious reader, and does not constitute a part of the course itself.

Probability is the study of uncertain events – events whose outcome cannot be predicted with certainty. Examples of such events include

  1. I obtain heads when I flip a coin;

  2. it rains in Brig tomorrow;

  3. my kitchen light breaks in the next six months.

The classical view of how uncertainty arises in nature is based on nineteenth-century physics (Newtonian mechanics and Maxwell’s electrodynamics), where the state of a physical system at any time is a deterministic function of its initial state. In principle, therefore, the future state of any system is fully predictable, provided we have precise enough information about its current state. From this point of view, the uncertainty of a future event is simply an expression of a lack of knowledge about the present. In reality, however, this point of view is essentially useless for most systems of interest. This is because the complexity of the system and the sensitive dependence on the initial state mean that the required precision in the knowledge of the initial state is not achievable by any conceivable means. A famous example is the impossibility of predicting the weather more than two weeks into the future. A simpler example is the humble coin flip or toss of a die, whose outcome cannot be predicted in advance no matter how accurately the initial conditions of the toss are measured. The quantum revolution of the first half of the twentieth century went further: uncertainty is inherent in the laws of nature, and even simple physical systems behave in an intrinsically random fashion, no matter how accurately one determines the initial data (a famous example is the double-slit experiment in quantum mechanics).

The historical development of probability was initially motivated by the desire for a theoretical understanding of gambling, in the sixteenth and seventeenth centuries. Today, probability theory has become one of the theoretical foundations of our modern society. It underpins statistics, machine learning, artificial intelligence, and computer science. It also constitutes the bedrock of any experimental discipline, and as such lies at the heart of the natural and social sciences.

Aside from its applications, probability theory is an area of pure mathematics, which has flourished in the past fifty years. Having shed its former reputation as an application-driven, low-brow game of counting balls and boxes, it has become one of the most central and active areas of pure mathematics.

The study of probability can be roughly divided into two disciplines, which, while not wholly separate, have rather different goals and ways of thinking.

In most instances, if one is familiar with probability theory, simple common sense is sufficient to answer probabilistic questions about the real world. Nevertheless, aside from important philosophical questions it raises, the study of the interpretation of probability can be of great practical importance in several applied fields. This is typically discussed in more detail in classes on statistics.

Being a course on mathematics, this course is entirely devoted to mathematical probability theory. Henceforth, we shall wrap ourselves in the warm blanket of mathematical rigour and axiomatic deduction, without having to worry about the tricky epistemological questions raised by interpretation.

1 Recap of measure theory

Since probability theory is founded on measure theory, in this preliminary chapter we give a review of the most important ingredients from measure theory. It is meant to be understandable for a reader who has learned some basic measure theory but may have forgotten some details or more technical aspects of it.

For full details and for proofs, we refer to Chapter 1 of the course Calculus II that you took last year.

Throughout these notes we use the following standard notations. We write \(\mathbb{N}= \{0,1,2,\dots\}\) and \(\mathbb{N}^* = \{1,2,3,\dots\}.\) For a finite set \(X,\) we denote by \(\# X\) the number of elements of \(X.\) For a set \(X\) and a subset \(A \subset X,\) we write \(A^c :=X \setminus A\) and denote by \(\mathcal P(X)\) the collection of all subsets of \(X.\) We denote by \(\mathbf 1_{A}\) the indicator function of the set \(A,\) defined through \[\mathbf 1_{A}(x) := \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A\,. \end{cases}\] We also use the notations \(a \wedge b :=\min \{a,b\}\) and \(a \vee b :=\max\{a,b\},\) which are common in probability theory. Moreover, we write \(a_+ :=a \vee 0\) and \(a_- :=(-a) \vee 0\) for the positive and negative parts of a real number \(a.\) Hence, for any \(a \in \mathbb{R}\) we have \(a = a_+ - a_-\) with \(a_+, a_- \geqslant 0.\) We often use the nonnegative reals augmented with \(\infty,\) denoted by \([0,\infty].\) They satisfy the obvious order relations, together with the convention \(0 \cdot \infty = 0.\)
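If you like to experiment, these conventions are easy to mirror in a few lines of Python; the following snippet is only an illustration and not part of the notes’ formal development.

```python
# Python mirror of the notation: a ∧ b, a ∨ b, a_+, a_-, and 1_A.
def wedge(a, b):   # a ∧ b = min{a, b}
    return min(a, b)

def vee(a, b):     # a ∨ b = max{a, b}
    return max(a, b)

def pos(a):        # a_+ = a ∨ 0
    return max(a, 0)

def neg(a):        # a_- = (-a) ∨ 0
    return max(-a, 0)

def indicator(A):  # 1_A as a function x ↦ 1_A(x)
    return lambda x: 1 if x in A else 0

a = -3.5
assert a == pos(a) - neg(a)        # a = a_+ - a_-
assert abs(a) == pos(a) + neg(a)   # |a| = a_+ + a_-
f = indicator({1, 2, 3})
assert f(2) == 1 and f(7) == 0
```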
Definition 1.1

Let \(X\) be a set. A \(\sigma\)-algebra (or \(\sigma\)-field) on \(X\) is a collection \(\mathcal A\) of subsets of \(X\) satisfying

  1. \(X \in \mathcal A\);

  2. \(A \in \mathcal A \; \Rightarrow \; A^c \in \mathcal A\);

  3. if \(A_n \in \mathcal A\) for all \(n \in \mathbb{N}\) then \(\bigcup_{n \in \mathbb{N}} A_n \in \mathcal A.\)

If \(\mathcal A\) is a \(\sigma\)-algebra on \(X,\) then we say that any \(A \in \mathcal A\) is a measurable subset of \(X,\) and call \((X, \mathcal A)\) a measurable space.

The following construction plays a particularly prominent role in probability.

Definition 1.2

Let \(\mathcal C \subset \mathcal P(X).\) Then \[\sigma(\mathcal C) :=\bigcap_{\substack{\mathcal A \text{ is a $\sigma$-algebra}\\ \mathcal C \subset \mathcal A}} \mathcal A\] is the \(\sigma\)-algebra generated by \(\mathcal C\).

The \(\sigma\)-algebra generated by \(\mathcal C\) is indeed a \(\sigma\)-algebra, as its name implies, because any intersection of \(\sigma\)-algebras is again a \(\sigma\)-algebra. (The intersection in Definition 1.2 is over a nonempty collection, since \(\mathcal P(X)\) is always a \(\sigma\)-algebra containing \(\mathcal C.\))
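On a finite set \(X,\) one can compute \(\sigma(\mathcal C)\) concretely by closing \(\mathcal C\) under complements and unions until nothing new appears (for finite \(X,\) finite unions suffice). Here is a minimal Python sketch, purely for illustration:

```python
from itertools import combinations

def generated_sigma_algebra(X, C):
    """Compute sigma(C) on a finite set X by iterating closure under
    complement and pairwise union until a fixed point is reached.
    Sets are represented as frozensets."""
    X = frozenset(X)
    A = {frozenset(), X} | {frozenset(S) for S in C}
    while True:
        new = set(A)
        new |= {X - S for S in A}                      # complements
        new |= {S | T for S, T in combinations(A, 2)}  # unions
        if new == A:
            return A
        A = new

# sigma({{1}, {2}}) on X = {1, 2, 3} is the full power set of X.
X = {1, 2, 3}
A = generated_sigma_algebra(X, [{1}, {2}])
print(sorted(map(sorted, A)))
# [[], [1], [1, 2], [1, 2, 3], [1, 3], [2], [2, 3], [3]]
```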

Example 1.3

  1. Let \(X = \mathbb{R}^d\) and \(\mathcal O\) be the collection of open subsets of \(\mathbb{R}^d.\) (More generally, \(X\) can be a topological space whose collection of open sets is \(\mathcal O.\)) Then \(\mathcal B(X) :=\sigma(\mathcal O)\) is the Borel \(\sigma\)-algebra of \(X.\)

  2. Let \((X_1, \mathcal A_1)\) and \((X_2, \mathcal A_2)\) be measurable spaces. The product \(\sigma\)-algebra on \(X_1 \times X_2\) is \[\mathcal A_1 \otimes \mathcal A_2 :=\sigma( A_1\times A_2 \,\colon A_1 \in \mathcal A_1 , A_2 \in \mathcal A_2).\]

Definition 1.4

A (positive) measure on a measurable space \((X, \mathcal A)\) is a function \(\mu \colon \mathcal A \to [0,\infty]\) satisfying \(\mu(\emptyset) = 0\) and \(\mu \bigl(\bigcup_{n \in \mathbb{N}} A_n\bigr) = \sum_{n \in \mathbb{N}} \mu(A_n)\) for any countable family \((A_n)_{n \in \mathbb{N}}\) of pairwise disjoint measurable subsets.

Example 1.5

  1. Let \(X\) be finite or countable, \(\mathcal A = \mathcal P(X),\) and \(\mu(A) :=\# A.\) This is the counting measure on \(X.\)

  2. For \(x \in X\) we define the Dirac delta measure at \(x\) through \[\delta_x(A) := \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A\,. \end{cases}\]

  3. The Lebesgue measure on \((\mathbb{R}, \mathcal B(\mathbb{R}))\) is defined as the unique measure \(\lambda\) satisfying \(\lambda((a,b)) = b - a\) for all \(a < b.\) (Recall from your course on measure theory that the existence and uniqueness of \(\lambda\) is nontrivial. Later in this class we shall give a proof of uniqueness: see Example 3.10 below.)

A measurable space \((X, \mathcal A)\) endowed with a measure \(\mu\) is called a measure space and denoted by the triple \((X, \mathcal A, \mu).\)

Definition 1.6

Let \((X, \mathcal A, \mu)\) be a measure space. Then a property \(P(x)\) depending on \(x \in X\) holds almost everywhere if \[\mu(\{x \in X \,\colon P(x) \text{ false}\}) = 0\,.\]

For example, on \((\mathbb{R}, \mathcal B(\mathbb{R}))\) endowed with Lebesgue measure, the indicator function \(\mathbf 1_{\mathbb{Q}}\) equals \(0\) almost everywhere; equivalently, \(\mathbf 1_{\mathbb{Q}}(x) = 0\) for almost all \(x.\)

Definition 1.7

Let \((X, \mathcal A)\) and \((Y, \mathcal B)\) be measurable spaces. A function \(f \colon X \to Y\) is measurable if for all \(B \in \mathcal B\) we have \(f^{-1}(B) \in \mathcal A.\)

Here, \(f^{-1}\) denotes the preimage function on sets, i.e. \(f^{-1}(B) :=\{x \in X \,\colon f(x) \in B\}.\)

Often, the \(\sigma\)-algebras \(\mathcal A\) and \(\mathcal B\) are clear from the context, and we do not even mention them explicitly.

The following definition allows one to transport measures between measurable spaces using measurable functions.

Definition 1.8

Let \((X, \mathcal A)\) and \((Y, \mathcal B)\) be measurable spaces, \(f \colon X \to Y\) measurable, and \(\mu\) a measure on \((X, \mathcal A).\) Then we define the pushforward or image measure of \(\mu\) under \(f,\) denoted by \(f_*\mu,\) as the measure on \((Y, \mathcal B)\) defined by \[f_*\mu(B) :=\mu(f^{-1}(B)) \quad \text{for all } B \in \mathcal B\,.\]
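On a finite or countable space, the pushforward simply regroups point masses. The following Python sketch (an illustration only, using an ad hoc dict representation of measures) makes this concrete:

```python
from collections import defaultdict

def pushforward(mu, f):
    """Compute f_*mu for a measure mu on a finite space, represented as a
    dict {point: mass}: (f_*mu)({y}) = mu(f^{-1}({y}))."""
    nu = defaultdict(float)
    for x, mass in mu.items():
        nu[f(x)] += mass
    return dict(nu)

# Uniform probability measure on a die, pushed forward under the parity map.
mu = {k: 1 / 6 for k in range(1, 7)}
nu = pushforward(mu, lambda k: k % 2)
print(nu)  # {1: 0.5, 0: 0.5}
```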

We now recall the notation for the integral.

Definition 1.9

Let \(\mu\) be a measure on \((X, \mathcal A).\)

  1. Let \(f \colon X \to [0,\infty].\) We use the notation \[\int f \, \mathrm d\mu = \int f(x) \, \mu(\mathrm dx) \in [0,\infty]\] for the integral of \(f\) with respect to \(\mu\) (see the class on measure theory for its definition, which is also briefly reviewed below).

  2. A function \(f \colon X \to \mathbb{R}\) is called integrable if \(\int \lvert f \rvert \, \mathrm d\mu < \infty,\) in which case we define \[\int f \, \mathrm d\mu :=\int f_+ \, \mathrm d\mu - \int f_- \, \mathrm d\mu\,.\]

It is helpful to recall briefly the construction of the integral in Definition 1.9 (1). It proceeds in two main steps. First, for a nonnegative simple function \(f = \sum_{i=1}^n c_i \mathbf 1_{A_i}\) with \(c_i \in [0,\infty]\) and \(A_i \in \mathcal A,\) one sets \(\int f \, \mathrm d\mu :=\sum_{i=1}^n c_i \, \mu(A_i).\) Second, for a general measurable \(f \colon X \to [0,\infty],\) one defines \(\int f \, \mathrm d\mu\) as the supremum of \(\int g \, \mathrm d\mu\) over all simple functions \(g\) satisfying \(0 \leqslant g \leqslant f.\)

The preceding definition captures a basic idea of measure theory, which we shall consistently and often tacitly use in this class: one can define the integral of any measurable function provided that it is nonnegative, in which case the integral may be infinite. If the function is not nonnegative, then one has to impose that it is integrable for its integral to make sense. (Otherwise one might end up with expressions of the form \(\infty - \infty,\) which are ill-defined.)
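For a measure with countable support, the integral is just a weighted sum, and the dichotomy just described is easy to make concrete. A minimal Python sketch (illustration only, again representing a measure as a dict):

```python
import math

def integral_nonneg(f, mu):
    """Integral of a nonnegative function f against a measure mu with
    countable support, given as a dict {point: mass}; may be infinite."""
    return sum(f(x) * mass for x, mass in mu.items())

def integral(f, mu):
    """Integral of a signed function f, defined only when f is integrable,
    computed via the decomposition f = f_+ - f_-."""
    if integral_nonneg(lambda x: abs(f(x)), mu) == math.inf:
        raise ValueError("f is not integrable with respect to mu")
    f_plus = integral_nonneg(lambda x: max(f(x), 0), mu)
    f_minus = integral_nonneg(lambda x: max(-f(x), 0), mu)
    return f_plus - f_minus

mu = {k: 1.0 for k in range(10)}        # counting measure on {0, ..., 9}
print(integral(lambda x: x - 4.5, mu))  # 0.0
```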

The integral satisfies the three following convergence theorems, which are stated for some fixed measure space \((X, \mathcal A, \mu).\)

Proposition 1.10 • Monotone convergence, Beppo Levi

Let \(f_1, f_2, \dots \,\colon X \to [0,\infty]\) be a pointwise nondecreasing sequence of measurable functions. Then \[\lim_{n \to \infty} \int f_n \, \mathrm d\mu = \int \lim_{n \to \infty} f_n \, \mathrm d\mu\,.\]

Proposition 1.11 • Fatou’s lemma

Let \(f_1, f_2, \dots \,\colon X \to [0,\infty]\) be a sequence of measurable functions. Then \[\liminf_{n \to \infty} \int f_n \, \mathrm d\mu \geqslant\int \liminf_{n \to \infty} f_n \, \mathrm d\mu\,.\]
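The inequality in Fatou’s lemma can be strict. A standard example, on \((\mathbb{R}, \mathcal B(\mathbb{R}))\) with Lebesgue measure \(\lambda,\) is the escaping bump \(f_n :=\mathbf 1_{[n,n+1]}\): here \[\int f_n \, \mathrm d\lambda = 1 \quad \text{for all } n\,, \qquad \liminf_{n \to \infty} f_n = 0 \quad \text{pointwise}\,,\] so the left-hand side of the inequality equals \(1\) while the right-hand side equals \(0.\)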

Proposition 1.12 • Dominated convergence, Lebesgue

Let \(g,f,f_1,f_2, \dots\) be measurable functions. Suppose that \(f_n \to f\) almost everywhere, that \(g\) is integrable, and that \(\lvert f_n \rvert \leqslant g\) almost everywhere for all \(n.\) Then \[\lim_{n \to \infty} \int f_n \, \mathrm d\mu = \int f \, \mathrm d\mu\,.\]

Next, we recall the notion of product measure. Its uniqueness is guaranteed by the following finiteness property. A measure \(\mu\) on \((X,\mathcal A)\) is \(\sigma\)-finite if there exists a countable decomposition \(X = \bigcup_{n \in \mathbb{N}} X_n\) such that \(\mu(X_n) < \infty\) for all \(n \in \mathbb{N}.\) (For instance, Lebesgue measure on \(\mathbb{R}\) is \(\sigma\)-finite but not finite.)

Definition 1.13

Let \(\mu_1\) and \(\mu_2\) be \(\sigma\)-finite measures on \((X_1, \mathcal A_1)\) and \((X_2, \mathcal A_2),\) respectively. The product measure \(\mu_1 \otimes \mu_2\) is the unique measure on \((X_1 \times X_2, \mathcal A_1 \otimes \mathcal A_2)\) satisfying \[\mu_1 \otimes \mu_2 (A_1 \times A_2) = \mu_1(A_1) \, \mu_2(A_2) \quad \text{for all $A_1 \in \mathcal A_1$ and $A_2 \in \mathcal A_2$}.\] For the proof of existence and uniqueness, we refer to the class on measure theory.

The following theorem states that product measures can be integrated successively over each component separately, provided the function is nonnegative or integrable.

Proposition 1.14 • Fubini-Tonelli

Let \(\mu_1\) and \(\mu_2\) be \(\sigma\)-finite measures on \((X_1, \mathcal A_1)\) and \((X_2, \mathcal A_2),\) respectively. Let \(f \colon X_1 \times X_2 \to [0,\infty]\) be measurable. Then \[\tag{1.2} \begin{aligned} \int_{X_1 \times X_2} f \, \mathrm d(\mu_1 \otimes \mu_2) &= \int_{X_1} \biggl(\int_{X_2} f(x_1,x_2) \, \mu_2 (\mathrm dx_2)\biggr) \, \mu_1(\mathrm dx_1) \\ &= \int_{X_2} \biggl(\int_{X_1} f(x_1,x_2) \, \mu_1 (\mathrm dx_1)\biggr) \, \mu_2(\mathrm dx_2)\,. \end{aligned}\] The same identity holds if \(f \,\colon X_1 \times X_2 \to \mathbb{R}\) is integrable with respect to \(\mu_1 \otimes \mu_2.\)
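For measures with finite support, (1.2) reduces to the interchange of a double sum, which one can check numerically. A small Python sketch (illustration only, with dict-represented measures):

```python
# Numerical check of (1.2) for two finite (hence sigma-finite) measures,
# represented as dicts {point: mass}; the integrals become double sums.
mu1 = {x: 0.5 * x for x in range(1, 4)}
mu2 = {y: 1.0 / y for y in range(1, 5)}
f = lambda x, y: x * x + y  # a nonnegative measurable function

product = sum(f(x, y) * mx * my
              for x, mx in mu1.items() for y, my in mu2.items())
x2_then_x1 = sum(mx * sum(f(x, y) * my for y, my in mu2.items())
                 for x, mx in mu1.items())
x1_then_x2 = sum(my * sum(f(x, y) * mx for x, mx in mu1.items())
                 for y, my in mu2.items())
assert abs(product - x2_then_x1) < 1e-9 and abs(product - x1_then_x2) < 1e-9
```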

2 Foundations of probability theory

2.1 Probability spaces

In this section we shall give a motivation for Kolmogorov’s axioms of probability. We shall see that a mathematical formulation of probability theory rests on three core ingredients: (i) a set of realisations, (ii) a collection of events, and (iii) a probability measure that assigns probabilities to events.

A random experiment (such as the toss of a die) has a number of possible outcomes or realisations. The set of all possible realisations is denoted by \(\Omega.\)

We consider two basic examples.

Example 2.1

Toss of a die: \(\Omega = \{1,2,3,4,5,6\}.\) The realisation \(\omega \in \Omega\) denotes the number shown by the die.

Example 2.2

A game of darts. A person throws a dart at a disc-shaped dartboard. \(\Omega\) is the unit disc in the plane, \(\Omega = \{\omega \in \mathbb{R}^2 \,\colon\lvert \omega \rvert \leqslant 1\}.\) The realisation \(\omega \in \Omega\) denotes where the dart hits the dartboard.

These examples show that it makes sense to consider very general sets \(\Omega,\) from finite to uncountable. An event is a subset \(A \subset \Omega\): it represents a statement about the realisation that either holds or does not. To each event \(A\) we wish to assign a probability \(\mathbb{P}(A) \in [0,1].\)

Example 2.3 • Example 2.1 continued

The event \(A = \{2,4,6\}\) is the event that I obtained an even number. The event \(A = \{6\}\) is the event that I obtained a \(6.\)

Example 2.4 • Example 2.2 continued

The event \(A = \{\omega \in \mathbb{R}^2 \,\colon\lvert \omega \rvert \leqslant 1/20\}\) is the event that I hit the bull’s eye of the dartboard.

Example 2.5 • Example 2.1 continued

For a balanced die, we have \(\mathbb{P}(\{2,4,6\}) = 1/2\) and \(\mathbb{P}(\{6\}) = 1/6.\)

Example 2.6 • Example 2.2 continued

If the dart hits any region of the dartboard with uniform probability, then we have \(\mathbb{P}(\{\omega \in \mathbb{R}^2 \,\colon\lvert \omega \rvert \leqslant 1/20\}) = (1/20)^2\) (relative area of bull’s eye).

That \(\mathbb{P}(A) \in [0,1]\) reflects the fact that probabilities must be nonnegative and cannot exceed \(1 = 100\)%. Moreover, we require \(\mathbb{P}\) to satisfy the two following obvious properties:

  1. \(\mathbb{P}(\Omega) = 1\);

  2. \(\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B)\) whenever the events \(A\) and \(B\) are mutually exclusive, i.e. \(A \cap B = \emptyset.\)

The triple \((\Omega, \mathcal A, \mathbb{P})\) therefore looks rather similar to a measure space. Imposing that the additivity property for mutually exclusive events extends to countable families, we arrive at the following celebrated and fundamental definition.

Definition 2.7 • Kolmogorov, 1933

A probability space is a measure space \((\Omega, \mathcal A, \mathbb{P})\) satisfying \(\mathbb{P}(\Omega) = 1.\)

A measure \(\mathbb{P}\) on \((\Omega, \mathcal A)\) satisfying \(\mathbb{P}(\Omega) = 1\) is called a probability measure.

We give two examples that shall accompany us through much of this chapter.

Example 2.8

I throw a balanced die twice: \[\Omega = \{1,2,\dots, 6\}^2\,, \qquad \mathcal A = \mathcal P(\Omega)\,, \qquad \mathbb{P}(A) = \frac{\# A}{36}\,.\]

Example 2.9

Here is a more interesting (and more subtle) example. I throw a die repeatedly until I obtain a \(6.\) Since I may have to throw the die an arbitrarily large number of times, I choose \[\Omega = \{1,2,\dots,6\}^{\mathbb{N}^*}\,.\] As a reminder, this is the set of sequences \(\omega \,\colon\mathbb{N}^* \to \{1,2,\dots,6\}.\) We use the notation \(\omega = (\omega_k)_{k \in \mathbb{N}^*}\) for its elements.

The set \(\Omega\) is uncountable, and as we shall see it is ill-advised to take \(\mathcal A\) to be the full power set \(\mathcal P(\Omega).\) To find the correct choice for \(\mathcal A,\) let us begin by noting that we certainly want to assign a probability to any event depending on a finite number of throws (such as “the first 10 throws are all smaller than \(4\)”). Any such event can be built out of cylinder sets, which are events of the form \[\tag{2.1} \bigl\{\omega \in \Omega \,\colon\omega_1 = i_1, \dots, \omega_n = i_n\bigr\}\,,\] indexed by the parameters \(n \in \mathbb{N}^*\) and \(i_1, \dots, i_n \in \{1,2, \dots, 6\}.\) Hence, we define \(\mathcal A\) to be the \(\sigma\)-algebra generated by the cylinder sets, i.e. \[\tag{2.2} \mathcal A = \sigma \Bigl(\bigl\{\omega \in \Omega \,\colon\omega_1 = i_1, \dots, \omega_n = i_n\bigr\} \,\colon n \in \mathbb{N}^*, i_1, \dots, i_n \in \{1,2, \dots, 6\}\Bigr)\,.\] The \(\sigma\)-algebra \(\mathcal A\) thus constructed is called the cylinder \(\sigma\)-algebra, and it plays a fundamental role in probability. It is the canonical \(\sigma\)-algebra on an infinite product space (such as \(\Omega\)).

Clearly, the probability measure \(\mathbb{P}\) on \(\mathcal A\) should have the following value on any cylinder set: \[\tag{2.3} \mathbb{P}\Bigl(\bigl\{\omega \in \Omega \,\colon\omega_1 = i_1, \dots, \omega_n = i_n\bigr\}\Bigr) = \biggl(\frac{1}{6}\biggr)^n\,.\] In fact, we shall prove later that there exists a unique measure \(\mathbb{P}\) on \((\Omega, \mathcal A)\) satisfying (2.3).

We conclude with a more difficult example, which is of great interest in mathematics and the sciences. It goes beyond the scope of this course, but we can nevertheless mention its basic mathematical structure.

Example 2.10

You will probably have heard of Brownian motion, which was first observed by the botanist Robert Brown in 1827. With a microscope, he observed a particle of pollen immersed in water and noticed that it underwent an erratic random motion. Brownian motion was famously studied by Albert Einstein in one of his groundbreaking papers of 1905, where he gave a theoretical explanation of its origin.

The random realisation is the entire trajectory of the particle, so that we choose \[\Omega = C([0,\infty),\mathbb{R}^3)\] to be the space of continuous paths \(\omega = (\omega(t))_{t \geqslant 0}\) in \(\mathbb{R}^3.\) For the collection of events, as in the previous example, we choose the cylinder \(\sigma\)-algebra, which in this instance takes the form \[\mathcal A = \sigma \Bigl(\{\omega \in \Omega \,\colon\omega(t) \in B\} \,\colon t \in [0,\infty), B \in \mathcal B(\mathbb{R}^3)\Bigr)\,.\] (If you wish, you can think about the analogy between this definition and (2.2). It may help to consider intersections of cylinder sets \(\{\omega \in \Omega \,\colon\omega(t) \in B\}.\)) What about the probability measure \(\mathbb{P}\) on \((\Omega, \mathcal A)\)? Clearly, there are many possible choices, but one of them stands out as by far the most natural: it is called Wiener measure, an infinite-dimensional Gaussian measure which underlies the mathematical definition of Brownian motion. We shall not discuss it further in this course.

Remark 2.11 We conclude this section with an important remark. Since a probability measure \(\mathbb{P}\) is a measure, we always have \(\mathbb{P}\bigl(\bigcup_{n \in \mathbb{N}} A_n\bigr) = \sum_{n \in \mathbb{N}} \mathbb{P}(A_n)\) for any countable family \((A_n)_{n \in \mathbb{N}}\) of disjoint events. From this it is easy to deduce that, for any (not necessarily disjoint) family of events \((A_n)_{n \in \mathbb{N}},\) we have the bound \[\tag{2.4} \mathbb{P}\biggl(\bigcup_{n \in \mathbb{N}} A_n\biggr) \leqslant\sum_{n \in \mathbb{N}} \mathbb{P}(A_n)\,.\] An estimate of the form (2.4) is called a union bound. Union bounds are ubiquitous in probability, and we shall also use them throughout this course. Roughly, a union bound states that the union of unlikely events remains unlikely provided there are not too many of them.
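For completeness, here is the short argument behind (2.4). Set \(B_0 :=A_0\) and \(B_n :=A_n \setminus (A_0 \cup \cdots \cup A_{n-1})\) for \(n \geqslant 1.\) The events \((B_n)_{n \in \mathbb{N}}\) are disjoint, satisfy \(B_n \subset A_n,\) and have the same union as \((A_n)_{n \in \mathbb{N}},\) so that \[\mathbb{P}\biggl(\bigcup_{n \in \mathbb{N}} A_n\biggr) = \sum_{n \in \mathbb{N}} \mathbb{P}(B_n) \leqslant\sum_{n \in \mathbb{N}} \mathbb{P}(A_n)\,.\]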

2.2 Conditional probability

Often one is interested in conditional statements, where instead of considering the probability of an event \(A,\) we are interested in the probability of the event \(A\) knowing that the event \(B\) happened. For instance, suppose I’d like to know the probability that my car breaks down today (event \(A\)). If I condition on the event \(B\) that I am driving 1000 km today, this will likely influence the answer.

The idea is that we have some extra knowledge in computing the probability of \(A\): we know that \(B\) happened. This can change the probability of \(A\) dramatically, since now we only consider realisations within \(B\) and not in the whole space \(\Omega.\) In the frequentist interpretation, we count the relative frequency of the occurrence of the event \(A,\) but only among those realisations that lie in \(B.\)

Definition 2.12 • Conditional probability

Suppose that \(B\) is an event satisfying \(\mathbb{P}(B) > 0.\) Then the conditional probability of an event \(A\) given \(B\) is \[\mathbb{P}(A\mid B) :=\frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}\,.\]

This is clearly the correct definition in light of the intuition above: we only consider the probabilities of realisations in \(B,\) and we normalize by \(\mathbb{P}(B)\) to ensure that the following holds (check this carefully if you’re not sure).

Remark 2.13

\(\mathbb{P}(\cdot \mid B)\) is a probability measure for any event \(B\) satisfying \(\mathbb{P}(B) > 0.\)

Remark 2.14

Definition 2.12 only makes sense if \(\mathbb{P}(B) > 0.\) For brevity, we shall usually omit the explicit mention of this condition, with the general convention that any statement involving the conditional probability \(\mathbb{P}(A \mid B)\) is only valid provided that \(\mathbb{P}(B) > 0.\)

Moreover, we adopt the convention that \[\mathbb{P}(A \mid B) \, \mathbb{P}(B) :=0 \quad \text{if } \mathbb{P}(B) = 0\,.\] (Recall also the convention \(0 \cdot \infty = 0\) from Chapter 1.)

Example 2.15

Consider two throws of a balanced die from Example 2.8. Knowing that the sum of the throws is \(4,\) what is the probability that on the first throw I obtained \(2\)? Here, \[A = \{(2,1), (2,2), (2,3), (2,4), (2,5), (2,6)\}\,, \qquad B = \{(1,3), (2,2), (3,1)\}\,.\] We find \[A \cap B = \{(2,2)\}\] and hence \[\mathbb{P}(A \cap B) = \frac{1}{36}\,, \qquad \mathbb{P}(B) = \frac{3}{36}\,.\] We conclude that \[\mathbb{P}(A \mid B) = \frac{1}{3}\,,\] which is different from \[\mathbb{P}(A) = \frac{1}{6}\,.\] Intuitively, this is not surprising: if we know that the sum of the throws was small, this should increase the odds that the first throw was a small number. Similarly, knowing that the sum of the throws is \(4,\) the probability that on the first throw I obtained \(4\) (or more) is clearly zero.
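This computation is easily sanity-checked by simulation: conditioning on \(B\) amounts to keeping only those realisations for which \(B\) occurs. A minimal sketch in Python:

```python
import random

# Monte Carlo check of P(A | B) = 1/3 for two throws of a balanced die.
random.seed(0)
n_B = n_AB = 0
for _ in range(10**6):
    first, second = random.randint(1, 6), random.randint(1, 6)
    if first + second == 4:      # event B: the sum of the throws is 4
        n_B += 1
        if first == 2:           # event A: the first throw is 2
            n_AB += 1
print(n_AB / n_B)  # close to 1/3
```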

Example 2.16

Consider the following two simple questions.

  1. I have two children, one of whom is a girl. What is the probability that the other one is also a girl?

  2. I have two children, the oldest of whom is a girl. What is the probability that the other one is also a girl?

To address them, for the purposes of this mathematical exercise, we make the simplifying assumption that children are either boys (B) or girls (G), each with probability \(1/2,\) independently of each other. Thus, the probability space is \[\Omega = \{(\mathrm B, \mathrm B), (\mathrm B, \mathrm G), (\mathrm G, \mathrm B), (\mathrm G, \mathrm G)\}\,,\] and each of the four realisations occurs with probability \(1/4.\)

For question (1), we have \[A = \{(\mathrm G, \mathrm G)\}\,, \qquad B = \{(\mathrm B, \mathrm G), (\mathrm G, \mathrm B), (\mathrm G, \mathrm G)\}\,,\] and therefore \[\mathbb{P}(A \mid B) = \frac{1/4}{3/4} = \frac{1}{3}\,.\]

For question (2), we have \[A = \{(\mathrm G, \mathrm G)\}\,, \qquad B = \{(\mathrm B, \mathrm G), (\mathrm G, \mathrm G)\}\,,\] and therefore \[\mathbb{P}(A \mid B) = \frac{1/4}{2/4} = \frac{1}{2}\,.\]

More subtle apparent paradoxes easily arise from a careless use of conditional probabilities. For a famous, and a famously confusing and much debated, example, you can look up the Monty Hall problem online, e.g. on Wikipedia (we will not go into it here).

The following result is mathematically trivial, but it has profound consequences in statistics and the sciences.

Proposition 2.17 • Bayes’ theorem

Let \(A\) and \(B\) be events satisfying \(\mathbb{P}(A) > 0\) and \(\mathbb{P}(B) > 0.\) Then \[\mathbb{P}(A \mid B) = \frac{\mathbb{P}(B \mid A) \, \mathbb{P}(A)}{\mathbb{P}(B)}\,.\] Suppose that \(\Omega = B_1 \cup \cdots \cup B_n\) with disjoint events \(B_1, \dots, B_n\); such events are called a partition of \(\Omega.\) Then the denominator can be expressed as \[\mathbb{P}(B) = \sum_{i = 1}^n \mathbb{P}(B \mid B_i) \, \mathbb{P}(B_i)\,.\]

Many mistakes in science, and popular reporting of science, arise from fallacies related to a misunderstanding or a misuse of conditional probabilities. A surprisingly common mistake is to mix up \(\mathbb{P}(A \mid B)\) and \(\mathbb{P}(B \mid A).\) One issue is that our intuition is bad at estimating conditional probabilities, which is why having a clear and rigorous formulation of the concept is so crucial.

Example 2.18

Here is a classical application of Bayes’ theorem in medicine. A patient is tested for a disease. Suppose that

  • in 1 % of cases the test is positive even though the patient is healthy;

  • in 2 % of cases the test is negative even though the patient is sick.

We are interested in the following two questions.

  1. If a patient tests positive, what is the probability that he is healthy?

  2. If a patient tests negative, what is the probability that he is sick?

It turns out that the answer depends greatly on the prevalence of the disease. Let us suppose that one in a thousand patients is sick.

This is where clear mathematical thinking and Bayes’ theorem come in very handy. Introduce the events \[\begin{aligned} S &= \{\text{patient is sick}\} \\ H &= \{\text{patient is healthy}\} = S^c \\ P &= \{\text{test is positive}\} \\ N &= \{\text{test is negative}\} = P^c\,. \end{aligned}\] We know that \[\mathbb{P}(P \mid H) = 0.01\,, \qquad \mathbb{P}(N \mid S) = 0.02\,, \qquad \mathbb{P}(S) = 0.001\,.\]

By Bayes’ theorem, the answer to question (1) is \[\begin{aligned} \mathbb{P}(H \mid P) &= \frac{\mathbb{P}(P \mid H) \mathbb{P}(H)}{\mathbb{P}(P)} \\ &= \frac{\mathbb{P}(P \mid H) \mathbb{P}(H)}{\mathbb{P}(P \mid S) \mathbb{P}(S) + \mathbb{P}(P \mid H) \mathbb{P}(H)} \\ &= \frac{0.01 \cdot (1 - 0.001)}{(1 - 0.02) \cdot 0.001 + 0.01 \cdot (1 - 0.001)} \approx 91\%\,. \end{aligned}\] Thus, even though the patient tested positive, the probability that he is healthy is more than \(90 \%.\) This figure is perhaps higher than one would intuitively expect, and shows, first, the danger of relying on our intuition for questions of this type and, second, the usefulness of clear thinking combined with simple mathematics. A similar calculation gives the answer to question (2) as \[\mathbb{P}(S \mid N) \approx 0.002\%\,.\] Thus, a negative test is a very reliable sign of being healthy.
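The arithmetic above is easy to check numerically; a short Python sketch of both computations:

```python
p_P_given_H = 0.01   # false positive rate: P(P | H)
p_N_given_S = 0.02   # false negative rate: P(N | S)
p_S = 0.001          # prevalence: P(S)
p_H = 1 - p_S

# Total probability of a positive / negative test.
p_P = (1 - p_N_given_S) * p_S + p_P_given_H * p_H
p_N = 1 - p_P

# Bayes' theorem for both questions.
p_H_given_P = p_P_given_H * p_H / p_P
p_S_given_N = p_N_given_S * p_S / p_N
print(f"{p_H_given_P:.1%}")   # ≈ 91.1%
print(f"{p_S_given_N:.4%}")   # ≈ 0.0020%
```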

The assumption of one in a thousand patients being sick was crucial in the above calculations. If instead we consider a different population of patients, where the disease is far more prevalent, the conditional probabilities computed to answer questions (1) and (2) will change considerably.

2.3 Random variables

Informally, a random variable is a variable whose value depends on the realisation \(\omega \in \Omega.\)

Definition 2.19

A random variable is a measurable real-valued function on \(\Omega.\) More generally, for a measurable space \((E, \mathcal E),\) a random variable with values in \(E\) is a measurable function from \(\Omega\) to \(E.\)

For instance we can speak about vector-valued random variables, with values in \(E = \mathbb{R}^d.\)

Example 2.20 • Example 2.8 continued

The sum of both values is the random variable \(X \,\colon\Omega \to \mathbb{R}\) defined by \[X((i,j)) :=i+j\,,\] with the notation \(\omega = (i,j) \in \{1,2, \dots,6\}^2.\)

Example 2.21 • Example 2.9 continued

Define the random variable \(X \,\colon\Omega \to \mathbb{N}^* \cup \{\infty\}\) to be the number of throws required to obtain a \(6\) for the first time, i.e. \[X(\omega) :=\inf \{k \,\colon\omega_k = 6\}\] with the convention that \(\inf \emptyset = \infty\) (which happens if I never throw a \(6\)).

To see that \(X\) is indeed a random variable, we have to check that it is measurable. To that end, it suffices to check that, for any \(n \in \mathbb{N}^*,\) the set \(X^{-1}(\{n\})\) lies in \(\mathcal A.\) Indeed, \[X^{-1}(\{n\}) = \bigl\{\omega \in \Omega \,\colon\omega_1 \neq 6, \omega_2 \neq 6, \dots, \omega_{n - 1} \neq 6, \omega_n = 6\bigr\}\,,\] which is a finite union of cylinder sets of the form (2.1), as desired. Intuitively, that \(X\) is a random variable is clear since the event “\(X\) equals \(n\)” clearly depends only on the first \(n\) throws, and \(\mathcal A\) is constructed precisely so that such events are measurable.

Definition 2.22

The law of a random variable \(X\) with values in \(E\) is the measure \[\mathbb{P}_X :=X_* \mathbb{P}\] on \((E, \mathcal E).\) (Recall Definition 1.8.)

We sometimes use the equivalence relation \(\overset{\mathrm d}{=}\) on random variables, i.e. equality in law, defined by \[\tag{2.5} X \overset{\mathrm d}{=}Y \quad \Longleftrightarrow \quad \mathbb{P}_X = \mathbb{P}_Y\,.\]

Clearly, \(\mathbb{P}_X\) is a probability measure on \((E, \mathcal E).\) Hence, any random variable \(X\) with values in \(E\) gives rise to a new probability space \((E, \mathcal E, \mathbb{P}_X).\) The intuition is that this space is in general smaller than the original space, and it contains only information captured by the random variable \(X.\) If all we care about is the value of \(X,\) we can completely forget the original probability space \((\Omega, \mathcal A, \mathbb{P})\) and only work on the smaller space \((E, \mathcal E, \mathbb{P}_X),\) which is often much simpler.

For instance, in Example 2.8, if we only care about the value of \(X = i+j\) (and not, say, which of the two throws produced the larger value), we can work on the space \(E = X(\Omega) = \{2,3,\dots, 12\}\) instead of on the original larger space \(\Omega = \{1,2,\dots, 6\}^2.\) You can easily check that the probability measure \(\mathbb{P}_X\) on \(E\) is given by \[\mathbb{P}_X(\{k\}) = \frac{(k - 1) \wedge (13 - k)}{36}\,.\]
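The formula for \(\mathbb{P}_X\) is a quick exercise in counting, and can be verified by enumerating all 36 equally likely outcomes; a short Python sketch:

```python
from collections import Counter

# Count how many of the 36 outcomes (i, j) give each sum k = i + j.
counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))
for k in range(2, 13):
    assert counts[k] == min(k - 1, 13 - k)   # (k-1) ∧ (13-k) outcomes
print({k: f"{counts[k]}/36" for k in sorted(counts)})
```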

In general, for any \(B \in \mathcal E,\) we have \[\mathbb{P}_X(B) = \mathbb{P}(X^{-1}(B)) = \mathbb{P}(\{\omega \in \Omega \,\colon X(\omega) \in B\}) =:\mathbb{P}(X \in B)\,,\] where the notation on the right-hand side is being defined by this equation. This quantity is the probability that \(X\) lies in \(B.\)

Probability theory uses its own shorthand notation for events and probabilities determined by a random variable \(X\): \[\begin{aligned} \{\omega \,\colon X(\omega) \in B\} &\equiv \{X \in B\}\,, \\ \mathbb{P}\bigl(\{\omega \,\colon X(\omega) \in B\}\bigr) &\equiv \mathbb{P}(X \in B)\,. \end{aligned}\] In addition, inside \(\mathbb{P},\) intersection of events is often denoted with a comma instead of the symbol \(\cap.\) For instance, we write \[\tag{2.6} \mathbb{P}(\{X \in A\} \cap \{Y \in B\}) \equiv \mathbb{P}(X \in A, Y \in B)\,.\] We shall always use these shorthand notations.

Before looking at some examples, let us record the following rather banal remark, which is sometimes good to keep in mind. For a given probability measure \(\mu\) on a measurable space \((E, \mathcal E),\) can we construct a random variable \(X\) with law \(\mathbb{P}_X = \mu\)? Obviously yes, just by setting \((\Omega, \mathcal A, \mathbb{P}) = (E, \mathcal E, \mu)\) and \(X(\omega) = \omega.\)

2.3.1 Elementary special cases

Let us now review some special cases of random variables, some of which you may already have seen in school.

Let \(X\) be a random variable with values in \((E, \mathcal E).\)

Example 2.23 • Example 2.9 continued

For \(n \in \mathbb{N}^*\) let us compute the probability that we first obtain a \(6\) on the \(n\)th throw, \[\begin{aligned} \mathbb{P}(X = n) &= \mathbb{P}(\omega_1 \neq 6, \dots, \omega_{n-1} \neq 6, \omega_n = 6) \\ &= \mathbb{P}\Biggl(\bigcup_{i_1, \dots, i_{n-1} = 1}^5 \{\omega_1 = i_1, \dots, \omega_{n-1} = i_{n-1}, \omega_n = 6\}\Biggr) \\ &= \sum_{i_1, \dots, i_{n-1} = 1}^5 \mathbb{P}\bigl(\omega_1 = i_1, \dots, \omega_{n-1} = i_{n-1}, \omega_n = 6\bigr) \\ &= 5^{n-1} \biggl(\frac{1}{6}\biggr)^n \\ &= \frac{1}{6} \biggl(\frac{5}{6}\biggr)^{n-1}\,. \end{aligned}\] This computation shows the power of a clear and rigorous formulation in solving very concrete problems. In particular, we find that the probability that we never throw a \(6\) is \[\mathbb{P}(X = \infty) = 1 - \mathbb{P}(X < \infty) = 1 - \sum_{n \in \mathbb{N}^*} \mathbb{P}(X = n) = 1 - 1 = 0\,.\] Nevertheless, the event \(\{X = \infty\} = \{\omega \in \Omega \,\colon\omega_k < 6 \text{ for all } k \in \mathbb{N}^*\}\) is enormous, in particular uncountable.
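A short simulation agrees with this computation (and, consistently with \(\mathbb{P}(X = \infty) = 0,\) the loop below terminates with probability one); an illustrative Python sketch:

```python
import random

random.seed(1)

def first_six():
    """Number of throws of a fair die until the first 6."""
    n = 1
    while random.randint(1, 6) != 6:
        n += 1
    return n

# Compare empirical frequencies with P(X = n) = (1/6) (5/6)^(n-1).
samples = [first_six() for _ in range(10**5)]
for n in range(1, 6):
    empirical = sum(1 for s in samples if s == n) / len(samples)
    exact = (1 / 6) * (5 / 6) ** (n - 1)
    print(n, round(empirical, 4), round(exact, 4))
```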
