2.3 Expectation
Let \(X\) be a random variable with values in \(\mathbb{R}.\) We would like to determine a number that represents the typical, or mean, value of \(X.\) For example, consider a random variable \(X\) that is equal to \(a\) with probability \(p\) and to \(b\) with probability \(1 - p.\) In other words, \[\mathbb{P}_X = p \delta_a + (1-p) \delta_b\,.\] In school we learn that the mean value of \(X\) is \[p a + (1 - p) b = \int_{\mathbb{R}} x \, \mathbb{P}_X(\mathrm dx) = \int_\Omega X(\omega) \, \mathbb{P}(\mathrm d\omega)\,,\] where the second identity follows by definition of \(\mathbb{P}_X\) (Definition 2.15). Note that the choice of probability space on which \(X\) is defined does not matter.
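To make this concrete, here is a minimal Python sketch: by the law of large numbers, the empirical mean of many independent samples of the two-point law above approaches \(p a + (1-p) b.\) The parameter values \(a = 2,\) \(b = 5,\) \(p = 0.3\) are arbitrary illustrative choices.

```python
import random

# Monte Carlo check that the empirical mean of a two-point random variable
# approaches p*a + (1 - p)*b. The parameter values below are arbitrary
# illustrative choices, not taken from the text.
def sample_two_point(a, b, p, n, seed=0):
    rng = random.Random(seed)
    return sum(a if rng.random() < p else b for _ in range(n)) / n

a, b, p = 2.0, 5.0, 0.3
empirical = sample_two_point(a, b, p, n=100_000)
exact = p * a + (1 - p) * b  # = 4.1
```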
Let \(X\) be a random variable with values in \(\mathbb{R}.\) The expectation of \(X\) is \[\mathbb{E}[X] :=\int X(\omega) \, \mathbb{P}(\mathrm d\omega)\,,\] which is well-defined if \(X \geqslant 0\) (in which case \(\mathbb{E}[X] \in [0,\infty]\)) or if \(\mathbb{E}[\lvert X \rvert] < \infty\) (in which case \(X\) is called integrable).
If \(X = (X_1, \dots, X_d) \in \mathbb{R}^d\) has values in \(\mathbb{R}^d,\) then we define \[\mathbb{E}[X] :=(\mathbb{E}[X_1], \dots, \mathbb{E}[X_d])\,.\]
The following result states how to compute the expectation of a function of a random variable.
Let \(X\) be a random variable with values in \((E, \mathcal E),\) and let \(f \,\colon E \to \mathbb{R}\cup \{\infty\}\) be measurable. Then \[\mathbb{E}[f(X)] = \int f \, \mathrm d\mathbb{P}_X\] provided that \(f \geqslant 0\) or \(\mathbb{E}[\lvert f(X) \rvert] < \infty.\)
Proof. The proof is an archetypal argument from measure theory, which we recall here. See also the discussion after Definition 1.9. Consider first the case where \(f\) is an indicator function, \(f = \mathbf 1_{B}\) for \(B \in \mathcal E.\) Then \[\mathbb{E}[f(X)] = \mathbb{E}[\mathbf 1_{B}(X)] = \mathbb{P}(X \in B) = \mathbb{P}_X(B) = \int_B \mathrm d\mathbb{P}_X = \int \mathbf 1_{B} \, \mathrm d\mathbb{P}_X = \int f \, \mathrm d\mathbb{P}_X\,,\] as desired.
Moreover, since the integral with respect to \(\mathbb{P}\) appearing in \(\mathbb{E}[\cdot]\) and the integral with respect to \(\mathbb{P}_X\) are both linear, we deduce that the claim holds for any simple function \(f.\)
Next, let \(f\) be an arbitrary nonnegative measurable function. Choose a sequence of simple functions \(f_n\) that converge to \(f\) monotonically from below. Then by using the claim for simple functions, as well as the monotone convergence theorem twice (since \((f_n)\) and \((f_n(X))\) are pointwise nondecreasing sequences on \(E\) and \(\Omega,\) respectively), we obtain \[\mathbb{E}[f(X)] = \lim_{n \to \infty} \mathbb{E}[f_n(X)] = \lim_{n \to \infty} \int f_n \, \mathrm d\mathbb{P}_X = \int f \, \mathrm d\mathbb{P}_X\,.\] This concludes the proof for \(f \geqslant 0.\) If \(f\) is a general integrable function, we split \(f = f_+ - f_-\) into its positive and negative parts, and apply the result for nonnegative functions to \(f_+\) and \(f_-\) separately.
Let us compute the expectation of the sum of two throws of a die: \[\mathbb{E}[X] = \frac{1}{36} \sum_{i,j = 1}^6 (i+j) = \frac{1}{36} \biggl(6 \sum_{i = 1}^6 i + 6 \sum_{j = 1}^6 j\biggr) = 7\,.\]
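This computation can be checked by exact enumeration of the \(36\) equally likely outcomes; a short Python sketch using exact rational arithmetic:

```python
from fractions import Fraction

# Enumerate all 36 equally likely outcomes (i, j) of two die throws and
# average the sum i + j exactly, using rational arithmetic.
expectation = sum(Fraction(i + j, 36) for i in range(1, 7) for j in range(1, 7))
# expectation == Fraction(7, 1)
```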
Let us compute the expected number of throws until we obtain a \(6\): \[\mathbb{E}[X] = \sum_{n = 1}^\infty n \, \mathbb{P}(X = n) = \frac{1}{6} \sum_{n = 1}^\infty n \, \biggl(\frac{5}{6}\biggr)^{n-1} = \frac{1}{6} \frac{1}{(1/6)^2} = 6\,,\] where we used the geometric series identity \(\sum_{n=1}^\infty n x^{n-1} = \frac{1}{(1-x)^2}\) for \(\lvert x \rvert < 1,\) evaluated at \(x = \frac{5}{6},\) which is proved by differentiating \(\sum_{n=0}^\infty x^n = \frac{1}{1-x}\) term by term in \(x.\)
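As a numerical sanity check, one can truncate the series above in Python; the tail decays geometrically, so a few hundred terms suffice:

```python
# Truncation of E[X] = sum_{n>=1} n * (1/6) * (5/6)^(n-1); the neglected
# tail is of order 500 * (5/6)^500, which is negligibly small.
expected_throws = sum(n * (1 / 6) * (5 / 6) ** (n - 1) for n in range(1, 501))
```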
2.4 The classical laws
In this section we go to the zoo. We encounter the most common and useful laws and learn their names, along with any parameters they depend on. We leave it as a simple exercise to check that each of them is indeed a probability measure (i.e. that the total measure equals one).
Discrete laws
The following laws are defined on \((E, \mathcal E)\) with \(E\) finite or countable and \(\mathcal E = \mathcal P(E).\)
Uniform. Let \(E\) be a finite set and define \(\mathbb{P}(X = x) = \frac{1}{\# E}\) for all \(x \in E.\)
Bernoulli (\(p \in [0,1]\)). Let \(E = \{0,1\}\) and set \(\mathbb{P}(X = 1) = p\) and \(\mathbb{P}(X = 0) = 1-p.\) (This law models a coin toss that is biased for \(p \neq \frac12.\))
Binomial (\(n \in \mathbb{N}^*,\) \(p \in [0,1]\)). Let \(E = \{0,1,\dots, n\}\) and \[\mathbb{P}(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}\] for \(k \in E.\) Here \(\binom{n}{k} = \frac{n!}{k! (n-k)!}\) is the binomial coefficient. (This law models the number of heads when tossing a biased coin \(n\) times; see the exercises.)
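As an instance of the exercise mentioned above (checking that the total measure equals one), one can verify numerically that the binomial weights sum to \(1\); the parameter values \(n = 10,\) \(p = 0.3\) below are arbitrary choices for illustration.

```python
from math import comb

# The weights P(X = k) = C(n, k) p^k (1-p)^(n-k) sum to (p + (1-p))^n = 1
# by the binomial theorem; we check this for one choice of parameters.
n, p = 10, 0.3
weights = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
total = sum(weights)
```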
Geometric (\(p \in (0,1)\)). Let \(E = \mathbb{N}\) and \[\mathbb{P}(X=k) = (1-p) p^k\] for all \(k \in \mathbb{N}.\) (This law models the number of heads before the first tails when tossing a biased coin. See also Example 2.9 and its continuation in 2.2.1.) Note that the different convention \(\mathbb{P}(X = k) = (1 - p)^{k-1} p\) for \(k \in E = \mathbb{N}^*\) is also often used in the literature.
Poisson (\(\lambda > 0\)). Let \(E = \mathbb{N}\) and \[\mathbb{P}(X = k) = \frac{\lambda^k}{k!} \mathrm e^{-\lambda}\] for all \(k \in \mathbb{N}.\) (This law models the number of rare events observed in a long time interval. More precisely, if \(X_n\) has binomial law with parameters \(n\) and \(p_n\) satisfying \(n p_n \to \lambda\) as \(n \to \infty,\) then \(\mathbb{P}(X_n = k) \to \mathbb{P}(X = k)\) for all \(k.\) See the exercises.)
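The Poisson limit described in parentheses can be illustrated numerically; a sketch with the arbitrary choices \(\lambda = 2\) and \(n = 10^4,\) \(p_n = \lambda / n\):

```python
from math import comb, exp, factorial

# Compare binomial(n, lambda/n) probabilities with Poisson(lambda) ones
# for small k; the pointwise error is of order lambda^2 / n.
lam, n = 2.0, 10_000

def binom_pmf(k):
    p = lam / n
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    return lam**k / factorial(k) * exp(-lam)

max_error = max(abs(binom_pmf(k) - poisson_pmf(k)) for k in range(10))
```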
Continuous laws
The following laws are defined on \((\mathbb{R}, \mathcal B(\mathbb{R})).\) They are absolutely continuous, and therefore determined by their densities \(p(x),\) in the sense that \(\mathbb{P}_X(\mathrm dx) = p(x) \, \mathrm dx.\)
Uniform on \([a,b].\) Let \[p(x) = \frac{1}{b-a} \mathbf 1_{[a,b]}(x)\,.\]
Exponential (\(\lambda > 0\)). Let \[p(x) = \lambda \mathrm e^{-\lambda x} \mathbf 1_{[0,\infty)}(x)\,.\]
Gaussian or normal (\(m \in \mathbb{R}, \sigma > 0\)). Let \[p(x) = \frac{1}{\sqrt{2 \pi} \sigma} \mathrm e^{- \frac{(x - m)^2}{2 \sigma^2}}\,.\] (\(m\) is called the mean and \(\sigma\) the standard deviation.)
2.5 Cumulative distribution function
The law of a real-valued random variable can be completely characterised by a single function on \(\mathbb{R}.\)
Let \(X\) be a random variable with values in \(\mathbb{R}.\) We define its cumulative distribution function \(F_X \,\colon\mathbb{R}\to [0,1]\) by \[F_X(t) :=\mathbb{P}(X \leqslant t)\,.\] For brevity, sometimes we simply speak of the distribution function of \(X.\)
If \(F = F_X\) is the distribution function of a random variable \(X,\) then
- (i) \(F\) is nondecreasing;
- (ii) \(\lim_{t \to -\infty} F(t) = 0\) and \(\lim_{t \to \infty} F(t) = 1\);
- (iii) \(F\) is right-continuous, i.e. \(\lim_{s \downarrow t} F(s) = F(t)\) for all \(t \in \mathbb{R}.\)
Proof. Property (i) is immediate, since \(\{X \leqslant s\} \subset \{X \leqslant t\}\) for \(s \leqslant t\) and \(\mathbb{P}\) is monotone.
Let us prove (ii). It uses some basic facts from measure theory, which are reviewed in Exercise 1.1. Since the events \(\{X \leqslant n\}\) are increasing, we find \[\lim_{t \to \infty} F(t) = \lim_{n \to \infty} \mathbb{P}(X \leqslant n) = \mathbb{P}\Biggl(\bigcup_{n \in \mathbb{N}} \{X \leqslant n\}\Biggr) = \mathbb{P}(\Omega) = 1\,,\] where the second step follows by \(\sigma\)-additivity of \(\mathbb{P}\) and the last step from \(\bigcup_{n \in \mathbb{N}} \{X \leqslant n\} = \Omega\) since \(X \in \mathbb{R}.\) Analogously, since the events \(\{X \leqslant-n\}\) are decreasing, we find \[\lim_{t \to -\infty} F(t) = \lim_{n \to \infty} \mathbb{P}(X \leqslant-n) = \mathbb{P}\Biggl(\bigcap_{n \in \mathbb{N}} \{X \leqslant-n\}\Biggr) = \mathbb{P}(\emptyset) = 0\,,\] where the second step follows by \(\sigma\)-additivity of \(\mathbb{P}\) and the last step from \(\bigcap_{n \in \mathbb{N}} \{X \leqslant-n\} = \emptyset\) since \(X \in \mathbb{R}.\)
To prove (iii), let \((a_n)\) be a strictly decreasing sequence tending to \(0.\) Since the events \(\{X \leqslant t + a_n\}\) are decreasing, we find \[\lim_{n \to \infty} F(t + a_n) = \lim_{n \to \infty} \mathbb{P}(X \leqslant t + a_n) = \mathbb{P}\Biggl(\bigcap_{n \in \mathbb{N}} \{X \leqslant t + a_n\}\Biggr) = \mathbb{P}(X \leqslant t) = F(t)\,,\] where the second step follows by \(\sigma\)-additivity of \(\mathbb{P}\) and the last step from \(\bigcap_{n \in \mathbb{N}} \{X \leqslant t + a_n\} = \{X \leqslant t\}.\)
As a check that you have understood the proof of (iii), try to see what goes wrong when one attempts to prove that \(F\) is left-continuous, which is in general false.
Note that \(F_X\) is discontinuous whenever \(\mathbb{P}_X\) has an atom. For instance, if \(X\) is equal to a constant \(a,\) then \(\mathbb{P}_X = \delta_a\) and hence \(F_X(t) = \mathbf 1_{t \geqslant a}.\)
Conversely, it is natural to ask whether any function \(F\) satisfying the three properties (i), (ii), (iii) is the distribution function of some random variable. As the following proposition shows, the answer is yes!
This very important result is not just a theoretical curiosity; it is extremely useful. Indeed, the proof relies on an explicit construction that is of great use in both theoretical probability and applications. It is the standard algorithm for generating random variables with a given distribution, for instance how your computer generates normal random variables. See Remark 2.24 below.
If \(F \,\colon\mathbb{R}\to [0,1]\) satisfies (i), (ii), (iii) from Proposition 2.22 then there exists a random variable \(X\) with values in \(\mathbb{R}\) such that \(F = F_X.\)
Proof. To construct \(X,\) we have to start by constructing the probability space. We take \(\Omega = (0,1),\) \(\mathcal A = \mathcal B((0,1)),\) and \(\mathbb{P}\) to be Lebesgue measure. Then we define \[X(\omega) :=\sup \{s \in \mathbb{R}\,\colon F(s) < \omega\}\,.\] We note that \(X(\omega) \in \mathbb{R}\) for any \(\omega \in (0,1)\) by the assumption (ii). The rest of the proof consists in showing that this explicit choice satisfies \(F = F_X.\)
To that end, we shall show that for all \(t \in \mathbb{R}\) \[\tag{2.8} \{\omega \in \Omega \,\colon X(\omega) \leqslant t\} = \{\omega \in \Omega \,\colon\omega \leqslant F(t)\}\,.\] Supposing that (2.8) is true, we obtain \[F_X(t) = \mathbb{P}(X \leqslant t) = \mathbb{P}\bigl(\{\omega \in \Omega \,\colon\omega \leqslant F(t)\}\bigr) = F(t)\] by definition of Lebesgue measure, as desired.
Hence, what remains is to show (2.8), which follows from the two following observations.
Suppose that \(\omega \leqslant F(t).\) Then any \(s \in \mathbb{R}\) with \(F(s) < \omega\) satisfies \(F(s) < F(t),\) and hence \(s \leqslant t\) by monotonicity of \(F.\) Taking the supremum over such \(s\) yields \(X(\omega) \leqslant t.\)
Suppose that \(\omega > F(t).\) Since \(F\) is right-continuous, there exists \(\varepsilon> 0\) such that \(F(t + \varepsilon) < \omega.\) Thus, again by definition of \(X,\) we deduce that \(X(\omega) \geqslant t + \varepsilon.\) Hence, \(X(\omega) > t.\)
As a helpful check to see that you understood the proof, you can try to spot exactly where we used each of the assumptions (i), (ii), (iii) on \(F.\)
The function \(X \,\colon(0,1) \to \mathbb{R}\) constructed in the above proof is often called the inverse of \(F,\) and denoted by \(F^{-1},\) even if \(F\) is not bijective (indeed \(F\) can be locally constant). If \(F\) is bijective, then \(F^{-1}\) coincides with the usual inverse.
The construction of the proof is remarkable. It provides an algorithm for generating a random variable \(X\) with distribution function \(F\) starting from a random variable \(Y\) which is uniformly distributed on \((0,1)\): simply set \(X :=F^{-1}(Y).\) Thus, the problem is reduced to the generation of the special and simple random variable \(Y.\)
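For laws whose distribution function can be inverted in closed form, this algorithm is particularly simple. A Python sketch for the exponential law with parameter \(\lambda\): here \(F(t) = 1 - \mathrm e^{-\lambda t}\) for \(t \geqslant 0,\) so \(F^{-1}(u) = -\log(1-u)/\lambda.\)

```python
import random
from math import log

# Inverse transform sampling for the exponential law with parameter lam:
# F(t) = 1 - exp(-lam * t) on [0, infinity), so F^{-1}(u) = -log(1 - u) / lam.
def sample_exponential(lam, n, seed=0):
    rng = random.Random(seed)  # rng.random() is uniform on [0, 1)
    return [-log(1 - rng.random()) / lam for _ in range(n)]

samples = sample_exponential(lam=2.0, n=100_000)
empirical_mean = sum(samples) / len(samples)  # should be close to 1/lam = 0.5
```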
To be more concrete, suppose that we have a computer that generates a random floating point number \(\omega\) that is uniformly distributed in \((0,1).\) (We shall see later that such a generator can be easily constructed by taking the binary digits of \(\omega\) to be independent Bernoulli random variables with parameter \(p = 1/2.\) Thus, all that we need is a random number generator that generates such Bernoulli random variables.) We wish to generate a standard Gaussian random variable, with parameters \(m = 0\) and \(\sigma = 1.\) To that end, we define the function \[F(t) :=\frac{1}{\sqrt{2 \pi}} \int_{-\infty}^t \mathrm e^{-\frac{x^2}{2}} \, \mathrm dx\,.\] The function \(F\) is famously not an elementary function, but it can be easily computed numerically, and most software packages implement the closely related error function \(\operatorname{erf},\) linked to \(F\) by the identity \(F(t) = \frac{1}{2} \bigl(1 + \operatorname{erf}(t/\sqrt{2})\bigr).\) Then the desired Gaussian random variable is \(X :=F^{-1}(\omega),\) where the right-hand side is again evaluated numerically.
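In Python, a numerical inverse of the standard Gaussian distribution function is available in the standard library as statistics.NormalDist.inv_cdf; a minimal sketch of this construction:

```python
import random
from statistics import NormalDist

# Generate standard Gaussian samples by feeding uniform (0, 1) numbers
# through the numerically evaluated inverse distribution function F^{-1}.
# Note: inv_cdf requires 0 < p < 1; random() returns 0.0 only with
# probability 2^-53 per draw, which we ignore in this sketch.
rng = random.Random(1)
std_normal = NormalDist(mu=0.0, sigma=1.0)  # m = 0, sigma = 1
samples = [std_normal.inv_cdf(rng.random()) for _ in range(100_000)]

# Empirical mean and variance should be close to 0 and 1, respectively.
emp_mean = sum(samples) / len(samples)
emp_var = sum(x * x for x in samples) / len(samples) - emp_mean**2
```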