2.6 The \(\sigma\)-algebra generated by a random variable
Every random variable \(X\) naturally gives rise to a \(\sigma\)-algebra, which is the smallest (i.e. coarsest) \(\sigma\)-algebra on \(\Omega\) with respect to which \(X\) is measurable. To build intuition, consider the case where \(X\) is a random variable with values in \(\{1,2,3\},\) and define the events \(A_i :=X^{-1}(\{i\})\) for \(i = 1,2,3.\) Then \(X\) is measurable with respect to the \(\sigma\)-algebra \[\begin{aligned} \mathcal B &:=\{\emptyset, A_1, A_2, A_3, A_1 \cup A_2, A_1 \cup A_3, A_2 \cup A_3, \Omega\} \\ &= \{X^{-1}(\emptyset), X^{-1}(\{1\}), X^{-1}(\{2\}), X^{-1}(\{3\}), \\ &\qquad X^{-1}(\{1,2\}), X^{-1}(\{1,3\}), X^{-1}(\{2,3\}), X^{-1}(\{1,2,3\})\}\,. \end{aligned}\] You can convince yourself that this is the smallest \(\sigma\)-algebra with respect to which \(X\) is measurable.
In some sense, \(\mathcal B\) captures the resolving power of \(X,\) but it does not contain the full information about \(X.\) For instance, the random variable \(Y = 2 X\) generates the same \(\sigma\)-algebra as \(X,\) but it is clearly different from \(X.\) However, both \(X\) and \(Y\) have the same ability to resolve the probability space \(\Omega.\) This is the basic intuition behind the following definition.
Let \(X\) be a random variable with values in a measurable space \((E, \mathcal E).\) Then the \(\sigma\)-algebra generated by \(X\) is \[\sigma(X) :=\{X^{-1}(B) \,\colon B \in \mathcal E\}\,.\]
Note that, as advertised above, this is the smallest \(\sigma\)-algebra with respect to which \(X\) is measurable. Indeed, clearly any such \(\sigma\)-algebra will have to contain all sets of the form \(X^{-1}(B)\) for \(B \in \mathcal E\); moreover, the set \(\sigma(X)\) is a \(\sigma\)-algebra.
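To make the finite example above concrete, here is a minimal Python sketch (the sample space \(\Omega = \{0,\dots,5\}\) and the specific maps are illustrative choices, not taken from the notes). It enumerates \(\sigma(X)\) by listing all preimages \(X^{-1}(B)\) and confirms that \(Y = 2X\) generates the same \(\sigma\)-algebra, as claimed above.

```python
from itertools import chain, combinations

# Illustrative finite sample space with a three-valued random variable X, and Y = 2X.
Omega = list(range(6))
X = lambda w: w % 3 + 1        # values in {1, 2, 3}
Y = lambda w: 2 * (w % 3 + 1)  # values in {2, 4, 6}

def generated_sigma_algebra(Z, values):
    """All preimages Z^{-1}(B) for B ranging over the subsets of the value set."""
    subsets = chain.from_iterable(combinations(values, r) for r in range(len(values) + 1))
    return {frozenset(w for w in Omega if Z(w) in B) for B in subsets}

sigma_X = generated_sigma_algebra(X, [1, 2, 3])
sigma_Y = generated_sigma_algebra(Y, [2, 4, 6])
print(len(sigma_X))        # 8, matching the eight events listed above
print(sigma_X == sigma_Y)  # True: X and 2X have the same resolving power
```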
The following result is of fundamental importance in probability, and it is tacitly used throughout the field. It says that being a measurable function of a random variable \(X\) is the same thing as being \(\sigma(X)\)-measurable. It is sometimes called the Doob-Dynkin lemma.
Let \(X\) be a random variable with values in \((E, \mathcal E),\) and \(Y\) a random variable with values in \(\mathbb{R}.\) Then the following statements are equivalent.
- (i) \(Y\) is measurable with respect to \(\sigma(X).\)
- (ii) There is a measurable function \(f \,\colon(E, \mathcal E) \to (\mathbb{R}, \mathcal B(\mathbb{R}))\) such that \(Y = f(X).\)
Proof. The implication (ii) \(\Rightarrow\) (i) is clear, since \(Y^{-1}(\mathcal C) = X^{-1}(f^{-1}(\mathcal C))\) for all \(\mathcal C \in \mathcal B(\mathbb{R}).\)
To prove the reverse implication, (i) \(\Rightarrow\) (ii), suppose that \(Y\) is measurable with respect to \(\sigma(X).\) We consider two cases.
\(Y\) is a simple function. Then \(Y = \sum_{i = 1}^n \lambda_i \mathbf 1_{A_i}\) for \(\lambda_i \in \mathbb{R}\) and \(A_i \in \sigma(X).\) Hence, for all \(i = 1, \dots, n\) there exist \(B_i \in \mathcal E\) such that \(A_i = X^{-1}(B_i),\) which implies that \[Y = \sum_{i = 1}^n \lambda_i \mathbf 1_{X^{-1}(B_i)} = \sum_{i = 1}^n \lambda_i \mathbf 1_{B_i} \circ X = f \circ X\,,\] where we defined the function \(f :=\sum_{i = 1}^n \lambda_i \mathbf 1_{B_i},\) which is measurable from \(E\) to \(\mathbb{R}.\)
\(Y\) is general. We can write \(Y = \lim_{n \to \infty} Y_n\) pointwise, where each \(Y_n\) is simple and \(\sigma(X)\)-measurable. By the previous case, for each \(n\) we can find a measurable function \(f_n \,\colon E \to \mathbb{R}\) such that \(Y_n = f_n(X).\) Now define the function \(f \,\colon E \to \mathbb{R}\) through \[f(x) := \begin{cases} \lim_{n \to \infty} f_n(x) & \text{if the limit exists} \\ 0 & \text{otherwise}\,. \end{cases}\] Then \(f\) is measurable (this is an exercise from measure theory, using that \(\limsup_{n\to \infty} f_n\) and \(\liminf_{n\to \infty} f_n\) are measurable). Moreover, for every \(\omega \in \Omega\) the limit \(\lim_{n \to \infty} f_n(x)\) exists at \(x = X(\omega),\) because \(f_n(X(\omega)) = Y_n(\omega)\) converges by assumption to \(Y(\omega).\) We conclude that \(f(X(\omega)) = \lim_{n \to \infty} f_n(X(\omega)) = Y(\omega)\) for all \(\omega \in \Omega,\) as desired.
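On a finite sample space the construction in the proof can be carried out by hand: if \(Y\) is \(\sigma(X)\)-measurable, then \(Y\) is constant on each level set \(\{X = x\},\) and \(f(x)\) is simply that common value. The following sketch (with illustrative finite data, not from the notes) does exactly this.

```python
# Illustrative finite example of the Doob-Dynkin lemma: Y depends on omega
# only through X, so we can read off a function f with Y = f(X).
Omega = range(6)
X = {w: w % 3 for w in Omega}             # values in {0, 1, 2}
Y = {w: (w % 3) ** 2 + 1 for w in Omega}  # sigma(X)-measurable by construction

f = {}
for w in Omega:
    x = X[w]
    if x in f:
        # Y must be constant on the level set {X = x}; otherwise it is not sigma(X)-measurable.
        assert f[x] == Y[w]
    else:
        f[x] = Y[w]

print(all(Y[w] == f[X[w]] for w in Omega))  # True: Y = f(X)
```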
2.7 Moments and inequalities
Let \(X\) be a random variable with values in \(\mathbb{R}\) and \(p \geqslant 1.\) The \(p\)-th moment of \(X\) is \(\mathbb{E}[X^p],\) which is well-defined under either of the following conditions:
- \(p \in \mathbb{N}^*\) and \(\mathbb{E}[\lvert X \rvert^p] < \infty\);
- \(X \geqslant 0.\)
In probability, we say that some property \(P(\omega)\) depending on the realisation \(\omega\) holds almost surely, instead of almost everywhere as in measure theory, if \(\mathbb{P}(\{\omega \in \Omega \,\colon P(\omega) \text{ holds}\}) = 1.\) We often abbreviate almost surely to a.s.
We use the following definition from measure theory.
For \(p \in [1,\infty],\) we denote by \(L^p \equiv L^p(\Omega, \mathcal A, \mathbb{P})\) the usual \(L^p\)-space with norm denoted by \(\lVert X \rVert_p.\)
It might be helpful to do a quick review of measure theory to recall how these spaces are defined. As in measure theory, there is a technical annoyance, which arises from the need to identify random variables that are almost surely equal.
For \(p \in [1,\infty)\) we denote by \(\mathcal L^p(\Omega, \mathcal A, \mathbb{P})\) the set of real-valued random variables \(X\) satisfying \(\mathbb{E}[\lvert X \rvert^p] < \infty.\)
We denote by \(\mathcal L^\infty(\Omega, \mathcal A, \mathbb{P})\) the set of real-valued random variables \(X\) such that there exists a constant \(C\) satisfying \(\lvert X \rvert \leqslant C\) almost surely.
For \(p \in [1,\infty]\) we define the equivalence relation \(\sim\) on \(\mathcal L^p\) by setting \(X \sim Y\) if and only if \(X = Y\) almost surely.
For \(p \in [1,\infty]\) we define the quotient space \[L^p(\Omega, \mathcal A, \mathbb{P}) :=\mathcal L^p(\Omega, \mathcal A, \mathbb{P}) / \sim\,.\]
Thus an element of \(L^p\) is an equivalence class of random variables. Throughout the following, and in accordance with the literature, we usually skirt around this issue by abusing notation and identifying an element of \(L^p\) with a representative of its class. This convention is consistent provided that all operations performed on such representatives do not depend on the choice of the representative within its class. It is always good to keep the precise definition in mind, as this subtlety is sometimes important.
For \(p \in [1, \infty)\) and \(X \in L^p\) we write \[\lVert X \rVert_p :=\bigl(\mathbb{E}[\lvert X \rvert^p]\bigr)^{1/p}\,.\] Note that this definition makes sense, since it is independent of the representative of \(X.\)
We write \[\lVert X \rVert_\infty :=\inf\{C \geqslant 0 \,\colon\lvert X \rvert \leqslant C \text{ a.s.}\}\,.\] This number is sometimes called the essential supremum of \(\lvert X \rvert.\) It is independent of the representative of \(X\) (unlike \(\sup \lvert X \rvert\)).
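As a quick numerical illustration of these definitions, the following sketch (using an illustrative bounded random variable, uniform on \([0,2]\)) estimates \(\lVert X \rVert_p\) from a large sample for several values of \(p\); the estimates increase with \(p\) and stay below the essential supremum \(2.\)

```python
import numpy as np

# Monte Carlo estimates of ||X||_p for X ~ Uniform([0, 2]) (an illustrative choice,
# for which ||X||_p = 2 / (p + 1)^{1/p} and ||X||_infty = 2).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0, size=500_000)

for p in (1, 2, 4, 8, 16):
    print(p, np.mean(np.abs(X) ** p) ** (1.0 / p))  # estimate of ||X||_p
print("max over the sample:", X.max())              # close to ||X||_infty = 2
```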
For each \(p \in [1,\infty],\) the space \(L^p(\Omega, \mathcal A, \mathbb{P})\) is a Banach space.

Note that this result is one place where taking the quotient in the definition of \(L^p\) is essential; it is wrong for the space \(\mathcal L^p,\) on which \(\lVert \cdot \rVert_p\) is only a seminorm.
Also proved in the class on measure theory is Hölder's inequality, the most important inequality in all of analysis.
Let \(p,q \in [1,\infty]\) satisfy \(\frac{1}{p} + \frac{1}{q} = 1\) (with the convention \(\frac{1}{\infty} = 0\)). Then for any random variables \(X,Y\) we have \[\lVert XY \rVert_1 \leqslant\lVert X \rVert_p \lVert Y \rVert_q\,.\]
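The following sketch (illustrative distributions, exponents, and sample size) checks Hölder's inequality on a random sample. Note that the empirical averages define a probability measure of their own, so the inequality holds exactly for them rather than merely up to sampling error.

```python
import numpy as np

# Empirical check of Hölder's inequality for a pair of conjugate exponents.
rng = np.random.default_rng(1)
n = 100_000
X = rng.standard_normal(n)               # illustrative choice of X
Y = rng.exponential(scale=2.0, size=n)   # illustrative choice of Y

p, q = 3.0, 1.5                          # conjugate: 1/3 + 1/1.5 = 1
lhs = np.mean(np.abs(X * Y))                                            # ||XY||_1
rhs = np.mean(np.abs(X) ** p) ** (1 / p) * np.mean(np.abs(Y) ** q) ** (1 / q)
print(lhs <= rhs, lhs, rhs)   # True: Hölder for the empirical distribution
```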
Let us list some obvious but important special cases of Hölder’s inequality:
- \(\lVert X \rVert_p \leqslant\lVert X \rVert_q\) if \(1 \leqslant p \leqslant q.\)
- \(\mathbb{E}[\lvert XY \rvert] \leqslant\lVert X \rVert_2 \lVert Y \rVert_2\) (Cauchy-Schwarz inequality).
- \(\mathbb{E}[\lvert X \rvert]^2 \leqslant\mathbb{E}[X^2].\)
Let \(X \in L^2.\) The variance of \(X\) is \[\mathop{\mathrm{Var}}(X) :=\mathbb{E}[(X - \mathbb{E}[X])^2]\] and its standard deviation is \(\sigma_X :=\sqrt{\mathop{\mathrm{Var}}(X)}.\)
Just as the expectation measures the typical mean value of \(X,\) the variance measures the typical spread of \(X\) around its mean value. It is important to realise that the variance is not the only quantity that quantifies this spread; it is merely the most convenient and the most popular one. For example, another quantity that measures the spread is \(\mathbb{E}[\lvert X - \mathbb{E}[X] \rvert]\); as we shall see in the exercises, this quantity has advantages and disadvantages as compared to the variance, and it is sometimes used in statistics, where it is closely related to the median of \(X\) (see the exercises).
The following observations follow immediately from the definition of the variance.
- \(\mathop{\mathrm{Var}}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2.\)
- For all \(a \in \mathbb{R}\) we have \(\mathbb{E}[(X - a)^2] = \mathop{\mathrm{Var}}(X) + (\mathbb{E}[X] - a)^2,\) and hence \[\mathop{\mathrm{Var}}(X) = \inf_{a \in \mathbb{R}} \mathbb{E}[(X - a)^2]\,.\] This gives another, so-called variational, interpretation of the variance (see the numerical sketch after this list).
- \(\mathop{\mathrm{Var}}(X) = 0\) if and only if \(X\) is almost surely constant.
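The following sketch (with an illustrative exponential sample) checks the first two observations numerically: the two formulas for the variance agree, and \(a \mapsto \mathbb{E}[(X-a)^2]\) is minimised at \(a = \mathbb{E}[X],\) where it equals \(\mathop{\mathrm{Var}}(X).\)

```python
import numpy as np

# Illustrative sample: X ~ Exp(1/3), so E[X] = 3 and Var(X) = 9.
rng = np.random.default_rng(2)
X = rng.exponential(scale=3.0, size=200_000)

mean = X.mean()
var_def = np.mean((X - mean) ** 2)       # E[(X - E[X])^2]
var_alt = np.mean(X ** 2) - mean ** 2    # E[X^2] - E[X]^2
print(np.isclose(var_def, var_alt))      # True: the two formulas agree

# Variational interpretation: E[(X - a)^2] = Var(X) + (E[X] - a)^2 >= Var(X).
a_grid = np.linspace(mean - 2, mean + 2, 401)
values = np.array([np.mean((X - a) ** 2) for a in a_grid])
print(a_grid[values.argmin()])           # approximately the sample mean
print(values.min() >= var_def - 1e-9)    # True: the infimum is Var(X)
```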
Next we state the most important inequality in probability, which is traditionally associated with at least the names of Bienaymé, Chebyshev, and Markov. We shall call it Chebyshev's inequality, the name under which it is commonly known, for historical reasons that we do not go into here.
Let \(f \,\colon\mathbb{R}\to [0,\infty)\) be nondecreasing and \(X\) a random variable. Then for all \(a \in \mathbb{R}\) with \(f(a) > 0\) we have \[\mathbb{P}(X \geqslant a) \leqslant\frac{\mathbb{E}[f(X)]}{f(a)}\,.\]
Proof. Since \(f\) is nondecreasing, on the event \(X \geqslant a\) we have \(f(X) \geqslant f(a).\) Thus, \[\mathbb{P}(X \geqslant a) = \mathbb{E}[\mathbf 1_{X \geqslant a}] \leqslant\mathbb{E}\biggl[\mathbf 1_{X \geqslant a} \frac{f(X)}{f(a)}\biggr] \leqslant\mathbb{E}\biggl[\frac{f(X)}{f(a)}\biggr]\,,\] as claimed.
Here are some important and famous special cases of Chebyshev’s inequality:
- If \(X \geqslant 0\) and \(a > 0\) then \(\mathbb{P}(X \geqslant a) \leqslant\frac{\mathbb{E}[X]}{a}\) (often called Markov’s inequality).
- If \(X \in L^2\) and \(a > 0\) then \[\mathbb{P}(\lvert X - \mathbb{E}[X] \rvert \geqslant a) \leqslant\frac{\mathop{\mathrm{Var}}(X)}{a^2}\] (often simply called Chebyshev's inequality).
- \(\mathbb{P}(X \geqslant a) \leqslant\mathrm e^{-t a} \mathbb{E}[\mathrm e^{tX}]\) for any \(t > 0\) (often called Chernoff's inequality). Since this inequality holds for any \(t > 0,\) one can even take the infimum over \(t\) to deduce that \(\mathbb{P}(X \geqslant a) \leqslant\mathrm e^{-I(a)},\) where \[I(a) :=\sup_{t > 0} \{t a - \log \mathbb{E}[\mathrm e^{tX}]\}\,.\] This estimate is often very sharp, and it plays a fundamental role in the so-called theory of large deviations and in statistical mechanics, both of which go beyond the scope of this course. The three bounds above are compared numerically in the sketch after this list.
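As a crude illustration, the following sketch uses the illustrative choice \(X \sim \mathrm{Exp}(1),\) whose moment generating function \(\mathbb{E}[\mathrm e^{tX}] = 1/(1-t)\) for \(t < 1\) is known in closed form, and compares an empirical estimate of \(\mathbb{P}(X \geqslant a)\) with the three bounds above.

```python
import numpy as np

# X ~ Exp(1): E[X] = 1, Var(X) = 1, E[e^{tX}] = 1/(1 - t) for t < 1 (used below).
rng = np.random.default_rng(3)
X = rng.exponential(scale=1.0, size=1_000_000)
a = 10.0

empirical = np.mean(X >= a)             # estimate of P(X >= a) = e^{-10} ~ 4.5e-5
markov = 1.0 / a                        # E[X] / a
chebyshev = 1.0 / (a - 1.0) ** 2        # P(X >= a) <= P(|X - 1| >= a - 1) <= Var(X)/(a - 1)^2
t_grid = np.linspace(0.01, 0.99, 99)
chernoff = np.min(np.exp(-t_grid * a) / (1.0 - t_grid))  # inf_t e^{-ta} E[e^{tX}]
print(empirical, markov, chebyshev, chernoff)
# All three bounds hold; here the Chernoff bound (about 1.2e-3) is the sharpest.
```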
Finally, the notion of variance can be generalised to the covariance of several random variables, which roughly measures how strongly they tend to fluctuate jointly.
For \(X,Y \in L^2\) define the covariance of \(X\) and \(Y\) as \[\mathop{\mathrm{Cov}}(X,Y) :=\mathbb{E}\bigl[(X - \mathbb{E}[X]) (Y - \mathbb{E}[Y])\bigr] = \mathbb{E}[XY] - \mathbb{E}[X] \mathbb{E}[Y]\,.\] For a random vector \(X = (X_1, \dots, X_d)\) with values in \(\mathbb{R}^d\) such that \(X_i \in L^2\) for all \(i = 1, \dots, d,\) we define the \(d \times d\) covariance matrix \[\mathop{\mathrm{Cov}}(X) :=(\mathop{\mathrm{Cov}}(X_i, X_j))_{i,j = 1}^d\,.\]
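To illustrate the definition, the following sketch uses the illustrative construction \(X = AZ\) with \(Z\) standard Gaussian in \(\mathbb{R}^3,\) for which \(\mathop{\mathrm{Cov}}(X) = AA^{\mathsf T},\) and computes the covariance matrix entrywise from a sample.

```python
import numpy as np

# Illustrative random vector X = A Z with Z standard normal in R^3, so Cov(X) = A A^T.
rng = np.random.default_rng(4)
d, n = 3, 500_000
A = rng.normal(size=(d, d))
Z = rng.standard_normal(size=(n, d))
X = Z @ A.T                       # each row is one sample of X

mean = X.mean(axis=0)
# Entrywise definition: Cov(X_i, X_j) = E[X_i X_j] - E[X_i] E[X_j].
cov_emp = (X.T @ X) / n - np.outer(mean, mean)
print(np.max(np.abs(cov_emp - A @ A.T)))                         # small sampling error
print(np.allclose(cov_emp, np.cov(X, rowvar=False, bias=True)))  # agrees with numpy's estimator
```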
The covariance matrix of a random vector is one of the most fundamental objects of study in high-dimensional statistics and machine learning. We shall discuss some of its properties in the exercises.