3.4 The Borel-Cantelli lemma

In this section we encounter the first deep result that has a genuinely probabilistic flavour. It is the standard tool to prove that some asymptotic property holds almost surely, and it lies at the heart of many important results in probability. A typical application is given in Example 3.20 below.

Let \((A_n)_{n \in \mathbb{N}}\) be a sequence of events in \(\mathcal A.\) We define the new events \[\begin{aligned} \limsup_n A_n &:=\bigcap_{n \geqslant 0} \bigcup_{k \geqslant n} A_k \\ \liminf_n A_n &:=\bigcup_{n \geqslant 0} \bigcap_{k \geqslant n} A_k\,. \end{aligned}\] The following interpretation is crucial, and it is often the better way to understand these events: \[\begin{aligned} \limsup_n A_n &= \{\omega \,\colon\omega \in A_k \text{ infinitely often}\}\,, \\ \liminf_n A_n &= \{\omega \,\colon\omega \in A_k \text{ eventually}\}\,. \end{aligned}\] In other words:

  • \(\limsup_n A_n\) is the set of realisations that appear in infinitely many \(A_k.\)

  • \(\liminf_n A_n\) is the set of realisations that appear in all \(A_k\) beyond a certain index.

It is very important that you play with these different formulations until you are comfortable with them. For instance, if \(A_n = [0,\tfrac12)\) for even \(n\) and \(A_n = [\tfrac12,1)\) for odd \(n,\) then \(\limsup_n A_n = [0,1)\) (every point lies in infinitely many \(A_n\)) while \(\liminf_n A_n = \emptyset\) (no point lies in all \(A_n\) beyond some index).

We always have \[\tag{3.7} \liminf_n A_n \subset \limsup_n A_n\,.\] If you’re not sure why this is true, pause here until you see why.

The reason behind the terminology \(\limsup\) and \(\liminf\) stems from the fact that \[\mathbf 1_{\limsup_n A_n} = \limsup_n \mathbf 1_{A_n}\,, \qquad \mathbf 1_{\liminf_n A_n} = \liminf_n \mathbf 1_{A_n}\,,\] where the right-hand sides are the usual \(\limsup\) and \(\liminf\) on \(\mathbb{R}\); see the exercises. (This remark also provides another way of seeing (3.7).)

The event \(\limsup_n A_n\) is more useful and more commonly used than \(\liminf_n A_n,\) which appears quite rarely in probability.

Proposition 3.16 • Borel-Cantelli lemma

  (i) If \(\sum_{n \in \mathbb{N}} \mathbb{P}(A_n) < \infty\) then \(\mathbb{P}(\limsup_n A_n) = 0.\) In other words, almost surely \(A_n\) happens only finitely often.

  (ii) If \(\sum_{n \in \mathbb{N}} \mathbb{P}(A_n) = \infty\) and \((A_n)_{n \in \mathbb{N}}\) are independent, then \(\mathbb{P}(\limsup_n A_n) = 1.\) In other words, almost surely \(A_n\) happens infinitely often.

Remark 3.17

In (ii), the independence is important. The conclusion is clearly wrong if we take \(A_n = A\) for all \(n \in \mathbb{N}\) with \(0 < \mathbb{P}(A) < 1.\)

Proof of Proposition 3.16. For (i), we write (using Exercise 1.1) \[\mathbb{P}(\limsup_n A_n) = \lim_n \mathbb{P}\Biggl(\bigcup_{k \geqslant n} A_k\Biggr) \leqslant\lim_n \sum_{k \geqslant n} \mathbb{P}(A_k) = 0\,,\] since the tails of the convergent series \(\sum_k \mathbb{P}(A_k)\) tend to zero.

For (ii), we choose \(N \in \mathbb{N}\) and write for any \(n \geqslant N,\) by independence of the events \(A_k,\) \[\mathbb{P}\Biggl(\bigcap_{k = N}^n A_k^c\Biggr) = \prod_{k = N}^n \mathbb{P}(A_k^c) = \prod_{k = N}^n (1 - \mathbb{P}(A_k)) \leqslant\prod_{k = N}^n \mathrm e^{-\mathbb{P}(A_k)} = \mathrm e^{-\sum_{k = N}^n \mathbb{P}(A_k)}\,,\] which tends to zero as \(n \to \infty\) because of the assumption \(\sum_{n \in \mathbb{N}} \mathbb{P}(A_n) = \infty.\) Here we used the basic inequality \(1 - x \leqslant\mathrm e^{-x}\) (which follows for instance by convexity of the function \(\mathrm e^{-x}\)). We conclude (by Exercise 1.1) that \[\mathbb{P}\Biggl(\bigcap_{k \geqslant N}A_k^c\Biggr) = 0\,,\] and hence also, since a countable union of null events is null, \[\mathbb{P}\Biggl(\bigcup_{N \geqslant 0} \bigcap_{k \geqslant N}A_k^c\Biggr) = 0\,,\] which, by taking complements, is equivalent to \[\mathbb{P}\Biggl(\bigcap_{N \geqslant 0} \bigcup_{k \geqslant N}A_k\Biggr) = 1\,. \]
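Before moving on, here is a minimal numerical sketch of the dichotomy (not part of the proof; the probabilities \(1/n\) and \(1/n^2\) and the horizon \(N = 10^5\) are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
n = np.arange(1, N + 1)

# Simulate one realisation of independent events A_n with P(A_n) = p_n.
for label, p in [("p_n = 1/n   (sum diverges) ", 1.0 / n),
                 ("p_n = 1/n^2 (sum converges)", 1.0 / n**2)]:
    hits = rng.random(N) < p          # indicator of A_n for n = 1, ..., N
    print(f"{label}: {hits.sum()} of the first {N} events occurred")

# In the divergent case the count keeps growing with N (about log N on average),
# in line with part (ii); in the convergent case only a few events ever occur,
# in line with part (i).
```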

Remark 3.18

A somewhat different proof of (i) follows from the observation that \[\mathbb{E}\biggl[\sum_n \mathbf 1_{A_n}\biggr] = \sum_n \mathbb{P}(A_n) < \infty\,,\] so that the random variable \(\sum_n \mathbf 1_{A_n}\) is almost surely finite, which implies that \(A_n\) happens only finitely often.

The Borel-Cantelli lemma states that whether \(A_n\) happens infinitely often depends on how fast the probabilities \(\mathbb{P}(A_n)\) tend to zero, or more precisely on whether the series \(\sum_n \mathbb{P}(A_n)\) converges. This principle is well illustrated by the following example, which provides a good intuition for Proposition 3.16 (i).

Example 3.19

Let \(b_1, b_2, \dots \in [0,1).\) We partition \([0, \sum_n b_n)\) into consecutive intervals \([a_{n-1}, a_{n})\) of length \(b_n,\) by setting \(a_0 = 0\) and \(a_{n} = a_{n - 1} + b_n\) for \(n \geqslant 1.\) Now we “fold” these intervals into the unit interval \([0,1)\) using the function \(f(x) :=x - \lfloor x \rfloor,\) the fractional part of \(x.\) Thus, we define \(A_n :=f([a_{n-1}, a_{n})).\) You may find it helpful to draw these sets.

  • If \(\sum_n b_n = \infty\) then \(\bigcup_n [a_{n-1}, a_{n}) = [0,\infty).\) This means that the folded sets \(A_n\) keep on “passing through” the interval \([0,1),\) and every \(\omega \in [0,1)\) is contained in infinitely many sets \(A_n.\)

  • If \(\sum_n b_n < \infty\) then \(\bigcup_n [a_{n-1}, a_{n}) = [0, \sum_n b_n)\) is a bounded interval. This means that the folded sets \(A_n\) do not cover enough ground to keep on passing through \([0,1),\) and every \(\omega \in [0,1)\) is contained in only finitely many sets \(A_n.\)

These observations can be interpreted in light of the Borel-Cantelli lemma on the probability space \(([0,1), \mathcal B([0,1)), \mathbb{P}),\) where \(\mathbb{P}\) is Lebesgue measure. Then \(\mathbb{P}(A_n) = b_n\) (why?), and Proposition 3.16 (i) is applicable. In particular, we see that in this example the condition \(\sum_n \mathbb{P}(A_n) < \infty\) is not only sufficient but also necessary for \(A_n\) to happen only finitely often almost surely. Note that Proposition 3.16 (ii) does not apply to this example, because the events \((A_n)\) are not independent (why?).
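To see the folding numerically, here is a small sketch (the choices \(b_n = 1/(n+1)\) versus \(b_n = 1/(n+1)^2\) and the point \(\omega = 0.3\) are arbitrary): it counts how many of the sets \(A_1, \dots, A_N\) contain \(\omega.\)

```python
import math

def count_hits(omega, b, N):
    """Count how many of the folded sets A_1, ..., A_N contain omega.

    A_n is the image of [a_{n-1}, a_n) under the fractional-part map, so
    omega lies in A_n iff omega + m lies in [a_{n-1}, a_n) for some integer m,
    i.e. iff the interval [a_{n-1} - omega, a_n - omega) contains an integer.
    """
    hits, a_prev = 0, 0.0
    for n in range(1, N + 1):
        a_next = a_prev + b(n)
        if math.ceil(a_next - omega) - math.ceil(a_prev - omega) >= 1:
            hits += 1
        a_prev = a_next
    return hits

omega, N = 0.3, 100_000
print(count_hits(omega, lambda n: 1.0 / (n + 1), N))       # divergent sum: the count grows (slowly) without bound as N grows
print(count_hits(omega, lambda n: 1.0 / (n + 1) ** 2, N))  # convergent sum: omega is hit only finitely many times
```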

Example 3.20

In this example we investigate the statistics of digits of random numbers. For simplicity, we work with binary digits, although the same argument can be easily adapted to any base (such as 10). Take the probability space \(([0,1), \mathcal B([0,1)), \mathbb{P}),\) where \(\mathbb{P}\) is Lebesgue measure. Use the notation \(\omega = 0.\omega_1\omega_2\omega_3 \cdots\) for the binary digits \(\omega_n \in \{0,1\}\) of \(\omega \in [0,1),\) i.e. \(\omega = \sum_{n = 1}^\infty \omega_n 2^{-n}\) (with the usual convention that we cannot have \(\omega_n = 1\) for all \(n\) above a certain index; this ensures the uniqueness of the binary representation). This definition is not very convenient to work with, and it turns out to be easier to define the digits through the events \[B_n :=\bigcup_{k = 1}^{2^{n - 1}} \biggl[\frac{k - 1/2}{2^{n - 1}}, \frac{k}{2^{n - 1}} \biggr)\,, \qquad n \geqslant 1\,.\] (Draw them!) Then we define the random variable \(X_n :=\mathbf 1_{B_n}.\) One can then show by induction that \(X_n(\omega) = \omega_n\); the details are left as an exercise (drawing the events \(B_n\) should make this clear).
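As a quick illustration of the claim \(X_n(\omega) = \omega_n\) (a sketch, not a replacement for the requested induction; the test value \(\omega = 0.6875 = 0.1011_2\) is an arbitrary dyadic choice), one can compare the digits obtained from the expansion with the indicators \(\mathbf 1_{B_n}(\omega)\):

```python
def digit_from_expansion(omega, n):
    # n-th binary digit of omega: omega_n = floor(2^n * omega) mod 2
    return int(2**n * omega) % 2

def digit_from_event(omega, n):
    # X_n(omega) = 1_{B_n}(omega): omega lies in some interval
    # [(k - 1/2) / 2^(n-1), k / 2^(n-1)) with 1 <= k <= 2^(n-1)
    return int(any((k - 0.5) / 2 ** (n - 1) <= omega < k / 2 ** (n - 1)
                   for k in range(1, 2 ** (n - 1) + 1)))

omega = 0.6875          # = 0.1011 in binary, exactly representable as a float
for n in range(1, 6):
    print(n, digit_from_expansion(omega, n), digit_from_event(omega, n))
# Both columns agree: 1 0 1 1 0.
```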

Next, we find that \(\mathbb{P}(X_n = 0) = \mathbb{P}(X_n = 1) = \frac{1}{2}\) for all \(n,\) and we claim that \((X_n)_{n \geqslant 1}\) are independent random variables. Indeed, for any \(p \in \mathbb{N}^*\) and \(x_1, \dots, x_p \in \{0,1\}\) we find \[\begin{gathered} \mathbb{P}(X_1 = x_1, \dots, X_p = x_p) = \mathbb{P}\Biggl(\Biggl[\sum_{k = 1}^p x_k 2^{-k}, \sum_{k = 1}^p x_k 2^{-k} + 2^{-p} \Biggr)\Biggr) \\ = \frac{1}{2^p} = \prod_{k = 1}^p \mathbb{P}(X_k = x_k)\,, \end{gathered}\] as desired.

Moreover, we claim that for any \(p \in \mathbb{N}^*\) and \(x_1, \dots, x_p \in \{0,1\}\) we have, almost surely, \[\tag{3.8} \# \{k \in \mathbb{N}\,\colon(X_{k+1}, \dots, X_{k+p}) = (x_1, \dots, x_p) \} = \infty\,.\] This is a simple consequence of Proposition 3.16 (ii). The only issue is that the events \[\bigl\{(X_{k+1}, \dots, X_{k+p}) = (x_1, \dots, x_p)\bigr\} \,, \qquad k \in \mathbb{N}\,,\] are not independent (since the collections of random variables on which they depend overlap). There’s a very easy solution to this, however: we can just pick every \(p\)-th one of them, in which case they are independent. Thus, defining \(Y_n :=(X_{np + 1}, \dots, X_{np + p}),\) from Remark 3.15 we conclude that \((Y_n)_{n \in \mathbb{N}}\) are independent random variables. Setting \(A_n :=\{Y_n = (x_1, \dots, x_p)\},\) the family \((A_n)_{n \in \mathbb{N}}\) is independent and satisfies \(\mathbb{P}(A_n) = 2^{-p}.\) Hence Proposition 3.16 (ii) yields (3.8).

We can even upgrade (3.8) by taking the countable union of the exceptional null events over \(p \in \mathbb{N}^*\) and \(x_1, \dots, x_p \in \{0,1\},\) to show that almost surely \[\# \{k \in \mathbb{N}\,\colon(X_{k+1}, \dots, X_{k+p}) = (x_1, \dots, x_p) \} = \infty \qquad \forall p \geqslant 1\,, \, \forall x_1, \dots, x_p \in \{0,1\}\,.\] In other words: Almost every real number exhibits every finite sequence of binary digits infinitely often.

3.5 Sums of independent random variables

The following definition gives the law of the sum of independent random variables in terms of their laws.

Definition 3.21

Let \(\mu\) and \(\nu\) be probability measures on \(\mathbb{R}.\) The convolution of \(\mu\) and \(\nu\) is the probability measure \[\mu * \nu :=p_*(\mu \otimes \nu)\,,\] where \(p(x,y) :=x + y.\)

Remark 3.22

This definition is a generalisation of the usual convolution of functions that you have learned in analysis. Indeed, suppose that \(\mu\) and \(\nu\) both have densities, i.e. \(\mu(\mathrm dx) = f(x) \, \mathrm dx\) and \(\nu(\mathrm dx) = g(x) \, \mathrm dx.\) Then we have \((\mu * \nu)(\mathrm dx) = h(x) \, \mathrm dx,\) where \[h(x) = \int f(x - y) \, g(y) \, \mathrm dy\,.\] The right-hand side is also usually denoted by \(f * g\) and it is called the convolution of \(f\) and \(g.\) To verify this assertion, take a nonnegative measurable function \(\phi\) and calculate, using Definition 3.21, \[\int \phi(x) \, (\mu * \nu)(\mathrm dx) = \int \phi(x + y) \, f(x) \, g(y) \, \mathrm dx \, \mathrm dy = \int \phi(x) \, f(x - y) \, g(y) \, \mathrm dx\, \mathrm dy\] as desired, where in the last step we used the change of variables \(x \mapsto x - y.\)
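As a sanity check of this formula (purely illustrative): if \(\mu\) and \(\nu\) are both the uniform distribution on \([0,1],\) so that \(f = g = \mathbf 1_{[0,1]},\) the convolution formula gives the triangular density \(h(x) = x\) on \([0,1]\) and \(h(x) = 2 - x\) on \([1,2].\) The sketch below compares this with a Monte Carlo simulation of \(X + Y\) for independent uniform \(X, Y.\)

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.random(1_000_000), rng.random(1_000_000)   # independent Uniform[0,1] samples
s = x + y

def h(t):
    # convolution h = f * g for f = g = 1_[0,1]: the triangular density on [0,2]
    return t if t <= 1.0 else 2.0 - t

# Compare P(X + Y in [a, b)) from the sample with the integral of h over [a, b).
for a, b in [(0.0, 0.5), (0.5, 1.0), (1.0, 1.5), (1.5, 2.0)]:
    empirical = np.mean((s >= a) & (s < b))
    exact = 0.5 * (h(a) + h(b)) * (b - a)   # trapezoid rule is exact here: h is linear on each interval
    print(f"[{a}, {b}): empirical {empirical:.4f}, exact {exact:.4f}")
```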

Remark 3.23

If \(X\) and \(Y\) are independent random variables then \(\mathbb{P}_{X+Y} = \mathbb{P}_X * \mathbb{P}_Y\): indeed, by independence the law of \((X,Y)\) is \(\mathbb{P}_X \otimes \mathbb{P}_Y,\) and \(X + Y = p(X,Y)\) with \(p(x,y) = x + y.\) Moreover, if \(X, Y \in L^2\) then \(\mathop{\mathrm{Var}}(X+Y) = \mathop{\mathrm{Var}}(X) + \mathop{\mathrm{Var}}(Y).\)

Next, we state a weak version of the most famous result in probability theory: the law of large numbers. It states that the average of a large number of independent copies of a random variable is close to its expectation. The following result is known as the weak law of large numbers because it falls short of the best possible statement, for two reasons: the assumption \(X_n \in L^2\) can be weakened to \(X_n \in L^1,\) and the convergence in fact holds almost surely instead of just in \(L^2.\) Later on, we shall see how to remedy both issues. The weak law has the advantage that it is very easy to prove.

Proposition 3.24 • Weak law of large numbers

Let \((X_n)_{n \geqslant 1}\) be independent random variables in \(L^2\) with the same law. Then \[\frac{1}{n} (X_1+ \cdots + X_n) \overset{L^2}{\longrightarrow} \mathbb{E}[X_1]\,.\]

Proof. The proof is a simple computation using Remark 3.23: \[\begin{gathered} \mathbb{E}\Biggl[\biggl(\frac{1}{n}(X_1 + \cdots + X_n) - \mathbb{E}[X_1]\biggr)^2\Biggr] = \mathop{\mathrm{Var}}\biggl(\frac{1}{n} (X_1 + \cdots + X_n)\biggr) \\ = \frac{1}{n^2} \mathop{\mathrm{Var}}(X_1 + \cdots + X_n) = \frac{1}{n^2} (\mathop{\mathrm{Var}}(X_1) + \cdots + \mathop{\mathrm{Var}}(X_n)) = \frac{1}{n} \mathop{\mathrm{Var}}(X_1)\,, \end{gathered}\] which tends to zero as \(n \to \infty.\)

Note that, for this proof to work, it in fact suffices that the random variables \((X_n)\) be uncorrelated (i.e. \(\mathop{\mathrm{Cov}}(X_n, X_m) = 0\) for \(n \neq m\)) instead of requiring independence (recall Remark 3.13 (ii)).
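The decay \(\mathop{\mathrm{Var}}(X_1)/n\) of the \(L^2\) error obtained in the proof is easy to observe numerically. Here is a sketch (the Bernoulli(\(1/2\)) distribution and the Monte Carlo parameters are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
trials, p = 20_000, 0.5                              # Bernoulli(1/2) variables, chosen only for illustration

for n in [10, 100, 1000]:
    samples = rng.binomial(1, p, size=(trials, n))   # many independent runs of X_1, ..., X_n
    means = samples.mean(axis=1)
    l2_error = np.mean((means - p) ** 2)             # Monte Carlo estimate of E[(S_n/n - E[X_1])^2]
    print(f"n = {n:4d}:  estimated L2 error {l2_error:.5f},  Var(X_1)/n = {p * (1 - p) / n:.5f}")
```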

The following is a nice application of the law of large numbers to polynomial approximation in numerical analysis.

Example 3.25 • Polynomial approximation

The problem of polynomial approximation is one of the classical problems in numerical analysis: Suppose that we are given the values of an unknown function \(f\) at \(n+1\) points. How do we approximate \(f\) with a polynomial of degree \(n\)?

To be more precise, let \(f\) be continuous on \([0,1].\) Define the Bernstein polynomial \[f_n(x) :=\sum_{k = 0}^n \binom{n}{k} x^k (1 - x)^{n-k} f \biggl(\frac{k}{n}\biggr)\,.\] Then we claim that \(f_n\) converges to \(f\) uniformly on \([0,1].\)

To see why, take independent random variables \(X_1, \dots, X_n,\) each having a Bernoulli law with parameter \(p.\) Then \(S_n = X_1 + \cdots + X_n\) has binomial law (recall Exercise 2.2) \[\mathbb{P}(S_n = k) = \binom{n}{k} p^k (1 - p)^{n-k}\] and hence \[\tag{3.9} \mathbb{E}\biggl[f\biggl(\frac{S_n}{n}\biggr)\biggr] = f_n(p)\,.\] Then by the law of large numbers we have \(\frac{S_n}{n} \to p\) in \(L^2,\) so that we expect \(f_n(p) \to f(p).\) Let us carry out this argument carefully. Let \(\varepsilon> 0\) and choose \(\delta > 0\) such that \(\lvert x-y \rvert < \delta\) implies \(\lvert f(x) - f(y) \rvert < \varepsilon\) (this is possible since \(f\) is continuous, and hence uniformly continuous, on the compact interval \([0,1]\)). Then we partition \[1 = \mathbf 1_{\lvert S_n/n - p \rvert < \delta} + \mathbf 1_{\lvert S_n/n - p \rvert \geqslant\delta}\] and use this to estimate \[\begin{gathered} \biggl\lvert \mathbb{E}\biggl[f\biggl(\frac{S_n}{n}\biggr)\biggr] - f(p) \biggr\rvert \leqslant\varepsilon+ 2 \lVert f \rVert_{L^\infty} \mathbb{P}\biggl(\biggl\lvert \frac{S_n}{n} - p \biggr\rvert \geqslant\delta\biggr) \\ \leqslant\varepsilon+ 2 \lVert f \rVert_{L^\infty} \frac{\mathop{\mathrm{Var}}(S_n / n)}{\delta^2} = \varepsilon+ 2 \lVert f \rVert_{L^\infty} \frac{p(1-p)}{n \delta^2} \leqslant\varepsilon+ \lVert f \rVert_{L^\infty} \frac{1}{2 n \delta^2}\,, \end{gathered}\] where in the second step we used Chebyshev’s inequality. Recalling (3.9) and noting that this bound is uniform in \(p\) (and that \(\varepsilon\) was arbitrary), we conclude that \(f_n \to f\) uniformly on \([0,1].\)
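The Bernstein construction is easy to try out numerically. A sketch (the test function \(f(x) = \lvert x - \tfrac12 \rvert\) and the evaluation grid are arbitrary choices):

```python
import numpy as np
from math import comb

def bernstein(f, n, x):
    """Evaluate the Bernstein polynomial f_n(x) = sum_k C(n,k) x^k (1-x)^(n-k) f(k/n)."""
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) * f(k / n) for k in range(n + 1))

f = lambda x: abs(x - 0.5)                   # continuous (but not smooth) test function
grid = np.linspace(0.0, 1.0, 201)

for n in [10, 50, 200]:
    errors = [abs(bernstein(f, n, x) - f(x)) for x in grid]
    print(f"n = {n:3d}:  max error on the grid = {max(errors):.4f}")
# The maximal error decreases as n grows, consistent with uniform convergence f_n -> f.
```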

As advertised, the convergence in \(L^2\) in Proposition 3.24 is often not strong enough, and we would need almost sure convergence. This issue is clarified in the following remark.

Remark 3.26

If \(Y_n \to Y\) in \(L^p\) for \(p \geqslant 1,\) it could well be that \(Y_n(\omega)\) fails to converge for every \(\omega \in \Omega.\) In other words, in no realisation of the randomness does \(Y_n\) converge. This is a major weakness of convergence in \(L^p.\)

For an example, we continue with Example 3.19. Suppose that \(\sum_n b_n = \infty\) and \(b_n \to 0\) (for instance \(b_n = 1/(n+1)\)). Take the random variable \(Y_n :=\mathbf 1_{A_n}.\) Then \(\lVert Y_n \rVert_p = b_n^{1/p} \to 0,\) so that \(Y_n\) converges to \(Y = 0\) in \(L^p.\) However, for all \(\omega \in \Omega\) we have \(Y_n(\omega) = 1\) infinitely often (and also \(Y_n(\omega) = 0\) infinitely often). In particular, although \(Y_n\) converges to \(0\) in \(L^p\) for any \(p \geqslant 1,\) almost surely \(Y_n\) does not converge.

Note, however, that if \(\sum_{n} b_n < \infty\) then \(Y_n \to 0\) almost surely.

Using the Borel-Cantelli lemma, we can upgrade the convergence in \(L^2\) to almost sure convergence. The result is the following proposition. It represents an archetypal use of the Borel-Cantelli lemma.

Proposition 3.27 • Strong law of large numbers in \(L^4\)

Let \((X_n)_{n \geqslant 1}\) be independent random variables in \(L^4\) with the same law. Then \[\frac{1}{n} (X_1+ \cdots + X_n) \overset{\text{a.s.}}{\longrightarrow} \mathbb{E}[X_1]\,.\]

Proof. The proof is very instructive. It consists of two main ideas. The first is that if we repeat the proof of Proposition 3.24 with a higher \(L^p\) norm, we get stronger decay in \(n\) of the error probabilities. The second is that if these error probabilities are summable over \(n,\) we can use the Borel-Cantelli lemma to conclude almost sure convergence.

To begin with, without loss of generality we can replace \(X_n\) with \(X_n - \mathbb{E}[X_n],\) and hence suppose that \(\mathbb{E}[X_n] = 0.\) Then, using the independence of the random variables \((X_n),\) we calculate \[\begin{gathered} \mathbb{E}\Biggl[\biggl(\frac{1}{n} \sum_{k = 1}^n X_k\biggr)^4\Biggr] = \frac{1}{n^4} \sum_{k_1, \dots, k_4 = 1}^n \mathbb{E}[X_{k_1} X_{k_2} X_{k_3} X_{k_4}] \\ = \frac{1}{n^4} \biggl(n \mathbb{E}[X_1^4] + 3 n (n-1) \mathbb{E}[X_1^2]^2\biggr) = O \biggl(\frac{1}{n^2}\biggr)\,, \end{gathered}\] where in the second step we classified the indices \(k_1, k_2, k_3, k_4\) according to their coincidences: since \(\mathbb{E}[X_k] = 0,\) any term in which some index appears exactly once vanishes, so a nonzero contribution requires either all four indices to coincide, or the indices to coincide in two pairs; these two cases give rise to the two terms in the second step.

We conclude, interchanging the sum and the expectation (which is allowed since all terms are nonnegative), that \[\sum_{n \geqslant 1} \mathbb{E}\Biggl[\biggl(\frac{1}{n} \sum_{k = 1}^n X_k\biggr)^4\Biggr] = \mathbb{E}\Biggl[ \sum_{n \geqslant 1} \biggl(\frac{1}{n} \sum_{k = 1}^n X_k\biggr)^4\Biggr] < \infty\,,\] where the finiteness follows from the bound \(O(1/n^2)\) above. Hence \[\sum_{n \geqslant 1} \biggl(\frac{1}{n} \sum_{k = 1}^n X_k\biggr)^4 < \infty\] almost surely, so that the terms of this series tend to zero, i.e. \(\frac{1}{n} \sum_{k = 1}^n X_k \to 0\) almost surely. This is an application of the (proof of the) Borel-Cantelli lemma (see also Remark 3.18).
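The almost sure statement can be observed on a single realisation. A sketch (the uniform distribution and the sample length are arbitrary choices): we generate one sequence \(X_1, X_2, \dots\) and watch the running averages approach \(\mathbb{E}[X_1] = \tfrac12.\)

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.random(1_000_000)                          # one realisation of X_1, X_2, ... ~ Uniform[0,1]
running_means = np.cumsum(x) / np.arange(1, x.size + 1)

for n in [10, 1_000, 100_000, 1_000_000]:
    print(f"n = {n:>9}:  (X_1 + ... + X_n)/n = {running_means[n - 1]:.5f}")
# On this single sample path the running average settles down to E[X_1] = 0.5.
```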

Remark 3.28

The condition that \(X_n \in L^4\) is not optimal. In fact, Proposition 3.27 remains true under the weaker assumption \(X_n \in L^1\) (see Proposition 7.5 in Section 7). This assumption is known to be optimal (in the sense that the conclusion must be wrong if \(X_n \notin L^1\)). This optimal result is usually known as the strong law of large numbers or just the law of large numbers. It turns out that, unlike the relatively straightforward proof of Proposition 3.27, the proof of Proposition 7.5 is somewhat involved, and it was a major achievement in 20th century probability, relying on some far-reaching and deep ideas of probability theory. If you are interested, you can read all about it in Section 7.

However, the \(L^4\) assumption from Proposition 3.27 is good enough for most applications, and in this course we shall restrict ourselves to it.

Let us discuss some applications of the strong law of large numbers.

Corollary 3.29

Let \((A_n)_{n \geqslant 1}\) be a family of independent events, all with the same probability. Then \[\frac{1}{n} \sum_{k = 1}^n \mathbf 1_{A_k} \overset{\text{a.s.}}{\longrightarrow} \mathbb{P}(A_1)\,.\]

This follows by applying Proposition 3.27 to the random variables \(X_k = \mathbf 1_{A_k},\) which are bounded and hence in \(L^4.\) We can use this result to bring Example 3.20 to a striking conclusion.

Example 3.30 • Example 3.20 continued

Let \(p \in \mathbb{N}^*\) and \(l \in \{1, \dots, p\}.\) Define \[Y^l_n :=(X_{(n-1)p + l}, \dots, X_{(n-1)p + l + p-1})\,.\] Explicitly, the sequence \(Y_1^l, Y_2^l, \dots\) is \[(X_{l}, \dots, X_{l + p-1}), (X_{p + l}, \dots, X_{p + l + p-1}), \dots\] By Example 3.20 and Remark 3.15, \((Y^l_n)_{n \in \mathbb{N}^*}\) is an independent family of random variables, for each \(l \in \{1, \dots, p\}.\) Hence, for any \(x = (x_1, \dots, x_p) \in \{0,1\}^p,\) applying Corollary 3.29 to the events \(A_n = \{Y^l_n = x\},\) we find that \[\frac{1}{n} \# \Bigl\{i \in \{1, \dots, n\} \,\colon Y^l_i = x\Bigr\} \overset{\text{a.s.}}{\longrightarrow} \frac{1}{2^p}\,.\] Since this holds for all \(l \in \{1, \dots, p\}\) (split the starting positions \(i\) according to their residue modulo \(p\)), we deduce that \[\frac{1}{n} \# \Bigl\{i \in \{1, \dots, n\} \,\colon(X_{i}, \dots, X_{i + p-1}) = x\Bigr\} \overset{\text{a.s.}}{\longrightarrow} \frac{1}{2^p}\,.\] Taking the countable union of the exceptional null events over \(p \in \mathbb{N}^*\) and \(x \in \{0,1\}^p,\) we conclude: almost surely, for all \(p \in \mathbb{N}^*,\) \(x \in \{0,1\}^p,\) \[\frac{1}{n} \# \Bigl\{i \in \{1, \dots, n\} \,\colon(X_{i}, \dots, X_{i + p-1}) = x\Bigr\} \longrightarrow \frac{1}{2^p}\,.\] In words: for almost every real number \(\omega \in [0,1),\) any finite sequence of length \(p\) appears with asymptotic frequency \(2^{-p}\) in the binary digits of \(\omega.\)

Note that it is difficult to construct a number \(\omega\) with this rather striking property. The easiest way to show that such a number exists is the preceding probabilistic argument. Not only does it show that such a number exists, but it shows that almost all numbers have this property.
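The conclusion of Example 3.30 is also easy to observe numerically. A sketch (the patterns and the number of digits are arbitrary; by Example 3.20, drawing the digits as independent fair coin flips is the same as drawing \(\omega\) uniformly from \([0,1)\)):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
digits = rng.integers(0, 2, size=n)               # X_1, X_2, ...: i.i.d. fair binary digits

for pattern in [(0,), (1, 1), (1, 0, 1)]:
    p = len(pattern)
    # windows[i] = (X_{i+1}, ..., X_{i+p}); count how often it equals the pattern
    windows = np.stack([digits[j:n - p + 1 + j] for j in range(p)], axis=1)
    frequency = np.mean(np.all(windows == np.array(pattern), axis=1))
    print(f"pattern {pattern}: frequency {frequency:.5f}, predicted {2.0 ** -p:.5f}")
```

Of course, such a simulation only exhibits the behaviour for one "typical" realisation; the theorem asserts it for Lebesgue-almost every \(\omega.\)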
