4.2 The law of large numbers
In this section we prove the law of large numbers under optimal conditions on the random variables. Recall that we already proved the weak law of large numbers in Proposition 3.24, whose deficiency was that it established convergence in \(L^2\) instead of almost sure convergence, which led to the issues explained in Remark 3.26. This issue was remedied by the law of large numbers in \(L^4\), Proposition 3.27. But the latter result still required the random variables \(X_n\) to lie in \(L^4\) instead of in the optimal space, \(L^1.\) (This space is optimal since we clearly want \(\mathbb{E}[X_1]\) to be well defined and finite.)
We shall need the following tool from measure theory, which is a consequence of the monotone class lemma.
Lemma 4.6. For each \(i = 1, \dots, n,\) let \(\mathcal C_i \subset \mathcal A\) be a collection of events that is stable under finite intersections and contains \(\Omega.\) Define \(\mathcal B_i :=\sigma(\mathcal C_i).\) If for all \(C_1 \in \mathcal C_1, \dots, C_n \in \mathcal C_n\) we have \[\mathbb{P}(C_1 \cap \cdots \cap C_n) = \mathbb{P}(C_1) \cdots \mathbb{P}(C_n)\,,\] then \(\mathcal B_1, \dots, \mathcal B_n\) are independent.
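To illustrate how Lemma 4.6 is typically applied, here is a standard example (not needed in the sequel): factorization of the joint distribution function implies independence. Let \(X\) and \(Y\) be real random variables satisfying \[\mathbb{P}(X \leqslant x, \, Y \leqslant y) = \mathbb{P}(X \leqslant x)\, \mathbb{P}(Y \leqslant y) \quad \text{for all } x, y \in \mathbb{R}\,.\] Take \(\mathcal C_1 :=\{\{X \leqslant x\} \,\colon x \in \mathbb{R}\} \cup \{\Omega\}\) and \(\mathcal C_2 :=\{\{Y \leqslant y\} \,\colon y \in \mathbb{R}\} \cup \{\Omega\}.\) Both collections contain \(\Omega\) and are stable under finite intersections, since \(\{X \leqslant x\} \cap \{X \leqslant x'\} = \{X \leqslant \min(x, x')\},\) and they generate \(\sigma(X)\) and \(\sigma(Y)\) respectively. Lemma 4.6 then yields that \(\sigma(X)\) and \(\sigma(Y)\) are independent.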
Proof*. (Optional material.) We use the monotone class lemma from Section 3.2, whose notation we also adopt. Fix \(C_2 \in \mathcal C_2, \dots, C_n \in \mathcal C_n,\) and define \[\mathcal M_1 :=\bigl\{B_1 \in \mathcal B_1 \,\colon\mathbb{P}(B_1 \cap C_2 \cap \cdots \cap C_n) = \mathbb{P}(B_1)\, \mathbb{P}(C_2) \cdots \mathbb{P}(C_n)\bigr\}\,.\] By assumption, \(\mathcal C_1 \subset \mathcal M_1.\) Moreover, it is easy to verify that \(\mathcal M_1\) is a monotone class. Hence, \[\mathcal M_1 \supset \mathcal M(\mathcal C_1) = \sigma(\mathcal C_1) = \mathcal B_1\,,\] where the second step follows from the monotone class lemma (Proposition 3.8). We conclude: for all \(B_1 \in \mathcal B_1, C_2\in \mathcal C_2, \dots, C_n \in \mathcal C_n,\) we have \[\mathbb{P}(B_1 \cap C_2 \cap \cdots \cap C_n) = \mathbb{P}(B_1)\, \mathbb{P}(C_2) \cdots \mathbb{P}(C_n)\,.\]
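(It is perhaps worth spelling out why \(\mathcal M_1\) is a monotone class. The following sketch assumes the definition from Section 3.2 is the standard one — a collection containing \(\Omega,\) stable under proper differences and under increasing limits; if the definition there differs, the verification adapts accordingly. First, \(\Omega \in \mathcal M_1\) by the assumption, since \(\Omega \in \mathcal C_1.\) Next, if \(B \subset B'\) with \(B, B' \in \mathcal M_1,\) then additivity of \(\mathbb{P}\) gives \[\mathbb{P}\bigl((B' \setminus B) \cap C_2 \cap \cdots \cap C_n\bigr) = \mathbb{P}(B' \cap C_2 \cap \cdots \cap C_n) - \mathbb{P}(B \cap C_2 \cap \cdots \cap C_n) = \mathbb{P}(B' \setminus B)\, \mathbb{P}(C_2) \cdots \mathbb{P}(C_n)\,,\] so \(B' \setminus B \in \mathcal M_1.\) Finally, if \(B_k \uparrow B\) with \(B_k \in \mathcal M_1\) for all \(k,\) then continuity from below of \(\mathbb{P}\) yields \(\mathbb{P}(B \cap C_2 \cap \cdots \cap C_n) = \lim_k \mathbb{P}(B_k \cap C_2 \cap \cdots \cap C_n) = \mathbb{P}(B)\, \mathbb{P}(C_2) \cdots \mathbb{P}(C_n),\) so \(B \in \mathcal M_1.\))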
We now continue in this fashion, moving on to the second argument. More precisely, fix \(B_1 \in \mathcal B_1, C_3 \in \mathcal C_3, \dots, C_n \in \mathcal C_n\) and define \[\mathcal M_2 :=\bigl\{B_2 \in \mathcal B_2 \,\colon\mathbb{P}(B_1 \cap B_2 \cap C_3 \cap \cdots \cap C_n) = \mathbb{P}(B_1) \, \mathbb{P}(B_2) \, \mathbb{P}(C_3) \cdots \mathbb{P}(C_n)\bigr\}\,.\] As above, it is easy to see that \(\mathcal M_2\) is a monotone class, and by the previous step we know that \(\mathcal C_2 \subset \mathcal M_2.\) By the monotone class lemma, we find that \(\mathcal M_2 \supset \mathcal B_2.\) Repeating this procedure for each of the \(n\) arguments yields the claim.
Our proof of the law of large numbers rests on the following fundamental result.
Proposition 4.7 (Kolmogorov’s zero-one law). Let \((X_n)_{n \geqslant 1}\) be independent random variables. For \(n \geqslant 1\) define the \(\sigma\)-algebra \[\mathcal B_n :=\sigma(X_n, X_{n+1}, \dots) = \sigma \biggl(\bigcup_{k \geqslant n} \sigma(X_k)\biggr)\,.\] Then the tail \(\sigma\)-algebra \[\mathcal B_\infty :=\bigcap_{n \geqslant 1} \mathcal B_n\] satisfies a zero-one law, in the sense that any tail event \(B \in \mathcal B_\infty\) satisfies \(\mathbb{P}(B) = 0\) or \(\mathbb{P}(B) = 1.\)
It is important to understand the meaning of the objects in Proposition 4.7. The \(\sigma\)-algebra \(\mathcal B_n\) contains all the information from time \(n\) onwards, i.e. it discards all information up to time \(n - 1.\) The tail events in \(\mathcal B_\infty\) are precisely those whose occurrence can still be determined after an arbitrarily large but finite initial segment of the variables \(X_k\) is discarded. For example, \(\{\sup_n X_n \leqslant 1\}\) is in general not in \(\mathcal B_\infty,\) since it depends on all random variables \(X_n.\) But \(\{\limsup_n X_n \leqslant 1\}\) is in \(\mathcal B_\infty,\) since it depends only on the “distant future”: changing any finite number of the variables \(X_n\) does not affect whether it occurs.
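Another classical example of a tail event is the convergence of a series: by the Cauchy criterion, whether \(\sum_k X_k\) converges does not depend on any finite number of initial terms, so for every \(n\) \[\biggl\{\sum_{k \geqslant 1} X_k \text{ converges}\biggr\} = \biggl\{\sum_{k \geqslant n} X_k \text{ converges}\biggr\} \in \mathcal B_n\,,\] and hence this event lies in \(\mathcal B_\infty.\) By Proposition 4.7, a series of independent random variables therefore converges with probability \(0\) or \(1.\)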
Kolmogorov’s zero-one law is remarkable: it states that any tail event occurs almost surely or its complement occurs almost surely. As we shall see, the tail \(\sigma\)-algebra is rich (i.e. large) enough to make this statement very useful.
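For instance, for any Borel set \(A,\) the event \[\{X_n \in A \text{ infinitely often}\} = \bigcap_{m \geqslant n} \bigcup_{k \geqslant m} \{X_k \in A\}\] is \(\mathcal B_n\)-measurable for every \(n\) (the right-hand side does not change when \(n\) is increased), hence lies in \(\mathcal B_\infty\) and has probability \(0\) or \(1.\) This is consistent with the Borel–Cantelli lemmas, which identify which of the two alternatives occurs in terms of whether \(\sum_n \mathbb{P}(X_n \in A)\) converges.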
Proof of Proposition 4.7. Define \(\mathcal D_n :=\sigma(X_1, \dots, X_n)\) (the \(\sigma\)-algebra containing the information up to time \(n\)). Then we claim that \(\mathcal D_n\) and \(\mathcal B_{n+1}\) are independent. This sounds intuitively obvious, as \(\mathcal D_n\) contains information up to time \(n,\) and \(\mathcal B_{n+1}\) information starting from time \(n+1.\) For a rigorous proof, we proceed in two steps.
For any \(k \geqslant n+1\) we define \(\mathcal B_{n+1,k} :=\sigma(X_{n+1}, \dots, X_k).\) Define the collections \[\begin{aligned} \mathcal C_1 &:=\bigl\{B_1 \cap \cdots \cap B_n \,\colon B_i \in \sigma(X_i) \, \forall i\bigr\}\,, \\ \mathcal C_2 &:=\bigl\{B_{n+1} \cap \cdots \cap B_k \,\colon B_i \in \sigma(X_i) \, \forall i\bigr\}\,. \end{aligned}\] Clearly, these collections are stable under finite intersections and contain \(\Omega,\) and \(\mathcal D_n = \sigma(\mathcal C_1)\) and \(\mathcal B_{n+1,k} = \sigma(\mathcal C_2).\) By Lemma 4.6 and the independence of the random variables \((X_n),\) we therefore conclude that \(\mathcal D_n\) and \(\mathcal B_{n+1,k}\) are independent.
Next, define the collections \(\mathcal C_1 :=\mathcal D_n\) and \(\mathcal C_2 :=\bigcup_{k \geqslant n+1} \mathcal B_{n+1,k},\) which are clearly stable under finite intersections and contain \(\Omega\) (for \(\mathcal C_2,\) note that the \(\sigma\)-algebras \(\mathcal B_{n+1,k}\) increase in \(k\)). Moreover, \(\sigma(\mathcal C_1) = \mathcal D_n\) and \(\sigma(\mathcal C_2) = \mathcal B_{n+1}.\) By the previous step and Lemma 4.6, we conclude that \(\mathcal D_n\) and \(\mathcal B_{n+1}\) are independent, as desired.
Next, choose \(\mathcal C_1 :=\bigcup_{n \geqslant 1} \mathcal D_n\) and \(\mathcal C_2 :=\mathcal B_\infty,\) both stable under finite intersections (the \(\sigma\)-algebras \(\mathcal D_n\) increase in \(n\)). Any \(C_1 \in \mathcal C_1\) lies in \(\mathcal D_n\) for some \(n,\) and \(\mathcal B_\infty \subset \mathcal B_{n+1};\) since \(\mathcal D_n\) and \(\mathcal B_{n+1}\) are independent for all \(n,\) we conclude that \(\mathbb{P}(C_1 \cap C_2) = \mathbb{P}(C_1) \, \mathbb{P}(C_2)\) for all \(C_1 \in \mathcal C_1\) and \(C_2 \in \mathcal C_2.\) By Lemma 4.6, we deduce that \(\sigma(\mathcal C_1) = \sigma(X_1, X_2, \dots) = \mathcal B_1\) and \(\mathcal B_\infty\) are independent. Since \(\mathcal B_\infty \subset \mathcal B_1,\) we conclude that \(\mathcal B_\infty\) is independent of itself! This means that any tail event \(B \in \mathcal B_\infty\) satisfies \(\mathbb{P}(B) = \mathbb{P}(B \cap B) = \mathbb{P}(B)^2,\) from which the zero-one law follows.
At first sight, this proof seems quite strange. It is in fact nothing but a careful justification of a simple fact: \(\mathcal B_\infty\) is independent of itself. Since we are working with rather abstract \(\sigma\)-algebras, it is important to proceed slowly and carefully, as we have tried to do above. The zero-one law has deep implications in probability. The law of large numbers, which we are about to state and prove, is one of them; the following remark is another.
Let \((X_n)_{n \geqslant 1}\) be independent random variables. Clearly, \[X_+ :=\limsup_{k \to \infty} \frac{1}{k} (X_1 + \cdots + X_k) = \limsup_{k \to \infty} \frac{1}{k} (X_n + \cdots + X_k)\] for any \(n \in \mathbb{N}^*,\) since \(\frac{1}{k}(X_1 + \cdots + X_{n-1}) \to 0\) as \(k \to \infty.\) Hence, \(X_+\) is \(\mathcal B_n\)-measurable for all \(n \in \mathbb{N}^*,\) which implies that \(X_+\) is \(\mathcal B_\infty\)-measurable. The same holds for \(X_-,\) where \(\limsup\) is replaced with \(\liminf.\) In particular, the event \[\biggl\{\frac{1}{k}(X_1 + \cdots + X_k) \text{ converges}\biggr\} = \{X_- = X_+\}\] is \(\mathcal B_\infty\)-measurable, and hence has either probability \(1\) or \(0.\) In the former case, the limiting random variable \(X_- = X_+\) is \(\mathcal B_\infty\)-measurable, and it is therefore almost surely constant (exercise). In summary: averages of independent random variables either diverge almost surely or converge almost surely to a constant.
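A concrete instance of this dichotomy (the Cauchy distribution here serves only as an illustration): if the \(X_n\) are i.i.d. standard Cauchy, then \(\frac{1}{k}(X_1 + \cdots + X_k)\) is itself standard Cauchy for every \(k\) — a standard computation, e.g. with characteristic functions, \[\mathbb{E}\Bigl[\mathrm e^{\mathrm i t (X_1 + \cdots + X_k)/k}\Bigr] = \bigl(\mathrm e^{-\lvert t \rvert / k}\bigr)^k = \mathrm e^{-\lvert t \rvert}\,.\] If the averages converged almost surely, the limit would be an almost sure constant, and the averages would converge to it in distribution; this is impossible, as their law is standard Cauchy for every \(k.\) Hence the averages diverge almost surely. Note that \(X_1 \notin L^1\) here, so there is no contradiction with the law of large numbers below.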
We can now state and prove the law of large numbers.
Let \((X_n)_{n \geqslant 1}\) be independent random variables in \(L^1\) with the same law. Then \[\frac{1}{n} (X_1 + \cdots + X_n) \overset{\text{a.s.}}{\longrightarrow} \mathbb{E}[X_1]\,.\]
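Before turning to the proof, the statement can be illustrated numerically. The following is a minimal simulation sketch (purely illustrative and not part of the development; the distribution, sample size, and seed are arbitrary choices) showing how the running averages \(S_k / k\) settle down to \(\mathbb{E}[X_1]\) along a single sample path:

```python
# Minimal illustration of the strong law of large numbers:
# the running averages S_k / k of i.i.d. L^1 random variables
# approach E[X_1] along a single sample path.
import numpy as np

rng = np.random.default_rng(seed=0)      # fixed seed, arbitrary choice
n = 10**6
x = rng.exponential(scale=2.0, size=n)   # i.i.d. exponential, E[X_1] = 2
avg = np.cumsum(x) / np.arange(1, n + 1) # avg[k-1] = S_k / k

for k in (10, 10**3, 10**5, n):
    print(f"S_k/k at k = {k:>7}: {avg[k - 1]:.4f}")  # tends to 2
```

Of course, no finite simulation proves almost sure convergence; it merely shows the behaviour the theorem predicts along one sample path.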
Proof. Let \(S_n :=X_1 + \cdots + X_n\) and \(S_0 :=0.\) Let \(a > \mathbb{E}[X_1]\) and define \(M :=\sup_{n \in \mathbb{N}} (S_n - na).\) Since the term \(n = 0\) contributes \(S_0 - 0 \cdot a = 0,\) \(M\) is a random variable with values in \([0,\infty].\) The core of the proof is to show that \[\tag{4.1} M < \infty \quad \text{a.s.}\]
Let us suppose first that (4.1) has been proved and use it to conclude the proof of the law of large numbers. By definition of \(M\) we have \(S_n \leqslant na + M\) for all \(n\) and hence (4.1) implies, for all \(a > \mathbb{E}[X_1],\) \[\limsup_{n} \frac{S_n}{n} \leqslant a \quad \text{a.s.}\] This implies that \[\tag{4.2} \limsup_{n} \frac{S_n}{n} \leqslant\mathbb{E}[X_1] \quad \text{a.s.}\,,\] since \[\mathbb{P}\biggl(\limsup_{n} \frac{S_n}{n} \leqslant\mathbb{E}[X_1]\biggr) = \mathbb{P}\biggl(\bigcap_{k \in \mathbb{N}^*} \biggl\{\limsup_{n} \frac{S_n}{n} \leqslant\mathbb{E}[X_1] + \frac{1}{k}\biggr\}\biggr) = 1\,,\] where we used that a countable intersection of events of probability one has probability one.
Replacing \(X_n\) with \(-X_n\) we obtain \[\tag{4.3} \liminf_{n} \frac{S_n}{n} \geqslant\mathbb{E}[X_1] \quad \text{a.s.}\] From (4.2) and (4.3) we conclude the law of large numbers.
What remains, therefore, is to prove (4.1). First, we claim that \(\{M < \infty\} \in \mathcal B_\infty.\) Indeed, for all \(k \geqslant 0\) we have \[\{M < \infty\} = \biggl\{\sup_{n \geqslant 0} (S_n - na) < \infty\biggr\} = \biggl\{\sup_{n \geqslant k} \bigl((S_n - S_k) - na\bigr) < \infty\biggr\} \in \sigma(X_{k+1}, X_{k+2}, \dots)\,,\] since \(S_n - S_k = X_{k+1} + X_{k+2} + \cdots + X_n;\) here the second equality holds because dropping the finitely many terms \(n < k\) does not affect finiteness of the supremum, and the two remaining suprema differ by the finite quantity \(S_k.\) By the zero-one law, Proposition 4.7, to prove (4.1) it therefore suffices to prove that \(\mathbb{P}(M = \infty) < 1.\)
We proceed by contradiction and suppose that \(\mathbb{P}(M = \infty) = 1.\) For all \(k \in \mathbb{N}\) we define \[M_k :=\sup_{0 \leqslant n \leqslant k} (S_n - na)\,, \qquad M_k' :=\sup_{0 \leqslant n \leqslant k} (S_{n+1} - S_1 - na)\,.\] Since \(S_n = X_1 + \cdots + X_n\) and \(S_{n+1} - S_1 = X_2 + \cdots + X_{n+1},\) and since \((X_1, \dots, X_k)\) and \((X_2, \dots, X_{k+1})\) have the same law, we conclude that \(M_k \overset{\mathrm d}{=}M_k'\) (recall (2.5)). Moreover, \((M_k)\) and \((M'_k)\) are nondecreasing sequences that converge from below to their limits \(M\) and \(M' :=\sup_{n \in \mathbb{N}} (S_{n+1} - S_1 - na).\) By \(M_k \overset{\mathrm d}{=}M_k',\) we conclude that \(M \overset{\mathrm d}{=}M',\) since \[\mathbb{P}(M' \leqslant x) = \lim_{k \to \infty} \mathbb{P}(M'_k \leqslant x) = \lim_{k \to \infty} \mathbb{P}(M_k \leqslant x) = \mathbb{P}(M \leqslant x)\,.\] Moreover, \[\begin{aligned} M_{k+1} &= \sup \biggl\{0, \sup_{1 \leqslant n \leqslant k+1} (S_n - na)\biggr\} \\ &= \sup \biggl\{0, \sup_{0 \leqslant n \leqslant k} (S_{n+1} - (n+1)a)\biggr\} \\ &= \sup\{0, M'_k + X_1 - a\} \\ &= M'_k - \inf\{a - X_1, M'_k\}\,, \end{aligned}\] where the third step uses \(S_{n+1} - (n+1)a = (S_{n+1} - S_1 - na) + (X_1 - a).\) Hence, using \(\mathbb{E}[M'_k] = \mathbb{E}[M_k]\) (which follows from \(M_k \overset{\mathrm d}{=}M_k';\) note that \(M_k\) is integrable since \(0 \leqslant M_k \leqslant \sum_{i=1}^k \lvert X_i \rvert + k \lvert a \rvert\)), \[\tag{4.4} \mathbb{E}[\inf\{a - X_1, M'_k\}] = \mathbb{E}[M'_k] - \mathbb{E}[M_{k+1}] = \mathbb{E}[M_k] - \mathbb{E}[M_{k+1}] \leqslant 0\,,\] since the sequence \((M_k)\) is nondecreasing. Moreover, since \(M'_k \geqslant 0\) we have \[\lvert \inf\{a - X_1, M'_k\} \rvert \leqslant\lvert a - X_1 \rvert\] for all \(k,\) so that we may apply dominated convergence to (4.4) to get \[\mathbb{E}[\inf\{a - X_1, M'\}] \leqslant 0\,.\]
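The fourth step in the computation of \(M_{k+1}\) above is the elementary identity \(\sup\{0, u + c\} = u - \inf\{-c, u\},\) applied with \(u = M'_k\) and \(c = X_1 - a.\) It is verified by distinguishing two cases: \[\sup\{0, u + c\} = \begin{cases} u + c & \text{if } -c \leqslant u\,, \\ 0 & \text{if } -c > u\,, \end{cases} \qquad u - \inf\{-c, u\} = \begin{cases} u - (-c) & \text{if } -c \leqslant u\,, \\ u - u & \text{if } -c > u\,. \end{cases}\]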
Now if \(\mathbb{P}(M = \infty) = 1\) then also \(\mathbb{P}(M' = \infty) = 1\) (since \(M \overset{\mathrm d}{=}M'\)), and hence \(\inf\{a - X_1, M'\} = a - X_1\) almost surely. But \[\mathbb{E}[a - X_1] = a - \mathbb{E}[X_1] > 0\] by the choice of \(a.\) This contradicts the bound \(\mathbb{E}[\inf\{a - X_1, M'\}] \leqslant 0\) obtained above; hence \(\mathbb{P}(M = \infty) < 1,\) and (4.1) follows.