3.5 Sums of independent random variables

The following definition gives the law of the sum of independent random variables in terms of their laws.

Definition 3.21

Let \(\mu\) and \(\nu\) be probability measures on \(\mathbb{R}.\) The convolution of \(\mu\) and \(\nu\) is the probability measure \[\mu * \nu :=p_*(\mu \otimes \nu)\,,\] where \(p(x,y) :=x + y.\)

Remark 3.22

This definition is a generalisation of the usual convolution of functions that you have learned in analysis. Indeed, suppose that \(\mu\) and \(\nu\) both have densities, i.e. \(\mu(\mathrm dx) = f(x) \, \mathrm dx\) and \(\nu(\mathrm dx) = g(x) \, \mathrm dx.\) Then we have \((\mu * \nu)(\mathrm dx) = h(x) \, \mathrm dx,\) where \[h(x) = \int f(x - y) \, g(y) \, \mathrm dy\,.\] The right-hand side is also usually denoted by \(f * g\) and it is called the convolution of \(f\) and \(g.\) To verify this assertion, take a nonnegative measurable function \(\phi\) and calculate, using Definition 3.21, \[\int \phi(x) \, (\mu * \nu)(\mathrm dx) = \int \phi(x + y) \, f(x) \, g(y) \, \mathrm dx \, \mathrm dy = \int \phi(x) \, f(x - y) \, g(y) \, \mathrm dx\, \mathrm dy\] as desired, where in the last step we used the change of variables \(x \mapsto x - y.\)
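As a quick numerical sanity check of this remark (not part of the argument; the uniform densities, sample size, and grid below are illustrative choices, and NumPy is assumed to be available), the following sketch compares a Monte Carlo histogram of \(X + Y\) for independent \(X, Y \sim \mathrm{Uniform}(0,1)\) with the explicit convolution \(f * f,\) which is the triangular density on \([0,2].\)

```python
# Sketch: the density of X + Y for independent X, Y ~ Uniform(0, 1) is the
# convolution f * f, i.e. the triangular density h(x) = x on [0, 1], 2 - x on [1, 2].
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo histogram of X + Y.
n_samples = 10**6
s = rng.uniform(0, 1, n_samples) + rng.uniform(0, 1, n_samples)
hist, edges = np.histogram(s, bins=100, range=(0, 2), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Explicit convolution h = f * f for f = 1_{[0,1]}.
h = np.where(centers <= 1, centers, 2 - centers)

print("max |histogram - f*f| =", np.max(np.abs(hist - h)))  # of order 10^-2
```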

Remark 3.23

If \(X\) and \(Y\) are independent random variables then \(\mathbb{P}_{X+Y} = \mathbb{P}_X * \mathbb{P}_Y,\) since the law of \((X,Y)\) is \(\mathbb{P}_X \otimes \mathbb{P}_Y\) and \(X + Y = p(X,Y).\) Moreover, if \(X, Y \in L^2\) then \(\mathop{\mathrm{Var}}(X+Y) = \mathop{\mathrm{Var}}(X) + \mathop{\mathrm{Var}}(Y).\)
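A one-line numerical check of the variance identity (the two distributions below are arbitrary illustrative choices):

```python
# Sketch: for independent X and Y, Var(X + Y) = Var(X) + Var(Y).
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(2.0, 10**6)   # Var(X) = 4
y = rng.normal(0.0, 3.0, 10**6)   # Var(Y) = 9
print((x + y).var(), x.var() + y.var())   # both close to 13
```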

Next, we state a weak version of the most famous result in probability theory: the law of large numbers. It states that the average of a large number of independent copies of a random variable is close to its expectation. The following result is known as the weak law of large numbers because it falls short of the best possible statement, for two reasons: the assumption \(X_n \in L^2\) can be weakened to \(X_n \in L^1,\) and the convergence in fact holds almost surely instead of just in \(L^2.\) Later on, we shall see how to remedy both issues. The weak law has the advantage that it is very easy to prove.

Proposition 3.24 • Weak law of large numbers

Let \((X_n)_{n \geqslant 1}\) be independent random variables in \(L^2\) with the same law. Then \[\frac{1}{n} (X_1+ \cdots + X_n) \overset{L^2}{\longrightarrow} \mathbb{E}[X_1]\,.\]

Proof. The proof is a simple computation using Remark 3.23: \[\begin{gathered} \mathbb{E}\Biggl[\biggl(\frac{1}{n}(X_1 + \cdots + X_n) - \mathbb{E}[X_1]\biggr)^2\Biggr] = \mathop{\mathrm{Var}}\biggl(\frac{1}{n} (X_1 + \cdots + X_n)\biggr) \\ = \frac{1}{n^2} \mathop{\mathrm{Var}}(X_1 + \cdots + X_n) = \frac{1}{n^2} (\mathop{\mathrm{Var}}(X_1) + \cdots + \mathop{\mathrm{Var}}(X_n)) = \frac{1}{n} \mathop{\mathrm{Var}}(X_1)\,, \end{gathered}\] which tends to \(0\) as \(n \to \infty.\)

Note that, for this proof to work, it in fact suffices that the random variables \((X_n)\) be uncorrelated (i.e. \(\mathop{\mathrm{Cov}}(X_n, X_m) = 0\) for \(n \neq m\)) instead of requiring independence (recall Remark 3.13 (ii)).
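The proof predicts that the mean-square error equals \(\mathop{\mathrm{Var}}(X_1)/n\) exactly. The following simulation sketch (the exponential law, sample sizes, and number of repetitions are illustrative choices) checks this:

```python
# Sketch: for i.i.d. Exponential(1) variables (Var(X_1) = 1), the empirical
# mean-square error E[(S_n/n - E[X_1])^2] should be close to 1/n.
import numpy as np

rng = np.random.default_rng(2)
n_repeats = 10_000

for n in (10, 100, 1000):
    samples = rng.exponential(1.0, size=(n_repeats, n))
    errors = samples.mean(axis=1) - 1.0        # S_n / n - E[X_1]
    print(n, np.mean(errors**2), 1 / n)        # empirical MSE vs. Var(X_1)/n
```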

The following is a nice application of the law of large numbers to polynomial approximation in numerical analysis.

Example 3.25 • Polynomial approximation

The problem of polynomial approximation is one of the classical problems in numerical analysis: Suppose that we are given the values of an unknown function \(f\) at \(n+1\) points. How do we approximate \(f\) with a polynomial of degree at most \(n\)?

To be more precise, let \(f\) be continuous on \([0,1].\) Define the Bernstein polynomial \[f_n(x) :=\sum_{k = 0}^n \binom{n}{k} x^k (1 - x)^{n-k} f \biggl(\frac{k}{n}\biggr)\,.\] Then we claim that \(f_n\) converges to \(f\) uniformly on \([0,1].\)

To see why, take independent random variables \(X_1, \dots, X_n,\) each having a Bernoulli law with parameter \(p.\) Then \(S_n = X_1 + \cdots + X_n\) has binomial law (recall Exercise 2.2) \[\mathbb{P}(S_n = k) = \binom{n}{k} p^k (1 - p)^{n-k}\] and hence \[\tag{3.9} \mathbb{E}\biggl[f\biggl(\frac{S_n}{n}\biggr)\biggr] = f_n(p)\,.\] By the law of large numbers we have \(\frac{S_n}{n} \to p\) in \(L^2,\) so that we expect \(f_n(p) \to f(p).\) Let us carry out this argument carefully. Let \(\varepsilon > 0\) and choose \(\delta > 0\) such that \(\lvert x-y \rvert < \delta\) implies \(\lvert f(x) - f(y) \rvert < \varepsilon\) (such a \(\delta\) exists since \(f\) is continuous, and hence uniformly continuous, on the compact interval \([0,1]\)). Then we partition \[1 = \mathbf 1_{\lvert S_n/n - p \rvert < \delta} + \mathbf 1_{\lvert S_n/n - p \rvert \geqslant\delta}\] and use this to estimate \[\begin{gathered} \biggl\lvert \mathbb{E}\biggl[f\biggl(\frac{S_n}{n}\biggr)\biggr] - f(p) \biggr\rvert \leqslant\varepsilon+ 2 \lVert f \rVert_{L^\infty} \mathbb{P}\biggl(\biggl\lvert \frac{S_n}{n} - p \biggr\rvert \geqslant\delta\biggr) \\ \leqslant\varepsilon+ 2 \lVert f \rVert_{L^\infty} \frac{\mathop{\mathrm{Var}}(S_n / n)}{\delta^2} = \varepsilon+ 2 \lVert f \rVert_{L^\infty} \frac{p(1-p)}{n \delta^2} \leqslant\varepsilon+ \lVert f \rVert_{L^\infty} \frac{1}{2 n \delta^2}\,, \end{gathered}\] where in the second step we used Chebyshev’s inequality and in the last step the bound \(p(1-p) \leqslant 1/4.\) Since the right-hand side does not depend on \(p,\) recalling (3.9) we conclude that \(f_n \to f\) uniformly on \([0,1].\)
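The identity (3.9) also gives a direct way to compute \(f_n\) numerically. The sketch below (the test function, grid, and use of SciPy's binomial probability mass function are illustrative assumptions) evaluates the Bernstein polynomial as \(\mathbb{E}[f(S_n/n)]\) and checks that the sup-norm error on \([0,1]\) decreases with \(n\):

```python
# Sketch: evaluate the Bernstein polynomial f_n(x) = E[f(S_n/n)], S_n ~ Binomial(n, x),
# and monitor the uniform error sup_{[0,1]} |f_n - f|.
import numpy as np
from scipy.stats import binom

def bernstein(f, n, x):
    k = np.arange(n + 1)
    weights = binom.pmf(k, n, x[:, None])   # C(n, k) x^k (1 - x)^(n - k), shape (len(x), n+1)
    return weights @ f(k / n)

f = lambda t: np.abs(t - 0.5)               # continuous on [0, 1] but not smooth
x = np.linspace(0.0, 1.0, 501)
for n in (10, 100, 1000):
    print(n, np.max(np.abs(bernstein(f, n, x) - f(x))))   # sup-norm error shrinks with n
```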

As advertised, the convergence in \(L^2\) in Proposition 3.24 is often not strong enough, and we would need almost sure convergence. This issue is clarified in the following remark.

Remark 3.26

If \(Y_n \to Y\) in \(L^p\) for \(p \geqslant 1,\) it could well be that \(Y_n(\omega)\) fails to converge for every \(\omega \in \Omega.\) In other words, in no realisation of the randomness does \(Y_n\) converge. This is a major weakness of convergence in \(L^p.\)

For an example, we continue with Example 3.19. Suppose that \(b_n \to 0\) but \(\sum_n b_n = \infty\) (for example \(b_n = 1/n\)). Take the random variables \(Y_n :=\mathbf 1_{A_n}.\) Then \(\lVert Y_n \rVert_p = b_n^{1/p} \to 0,\) so that \(Y_n\) converges to \(Y = 0\) in \(L^p.\) However, for all \(\omega \in \Omega\) we have \(Y_n(\omega) = 1\) infinitely often. In particular, although \(Y_n\) converges to \(0\) in \(L^p\) for any \(p \geqslant 1,\) almost surely \(Y_n\) does not converge.

Note, however, that if \(\sum_{n} b_n < \infty\) then, by the Borel-Cantelli lemma, almost surely \(Y_n = 0\) for all but finitely many \(n,\) and hence \(Y_n \to 0\) almost surely.
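Since Example 3.19 is not reproduced here, the following sketch uses a standard "sliding interval" construction as a stand-in (an assumption, not necessarily the construction of Example 3.19): \(A_n\) is an arc of length \(b_n = 1/n\) on the circle \([0,1),\) placed directly after \(A_{n-1}.\) Then \(\lVert \mathbf 1_{A_n} \rVert_p = b_n^{1/p} \to 0,\) yet any fixed \(\omega\) falls into the arcs infinitely often because \(\sum_n b_n = \infty.\)

```python
# Sketch (assumed construction): consecutive arcs of length b_n = 1/n wrap around
# the circle [0, 1) infinitely many times, so a fixed omega is hit infinitely often,
# even though P(A_n) = b_n -> 0.
import numpy as np

N = 200_000
b = 1.0 / np.arange(1, N + 1)
right = np.cumsum(b)            # c_n = b_1 + ... + b_n
left = right - b                # c_{n-1}

omega = 0.3
hits = ((omega - left) % 1.0) < b      # omega lies in the arc starting at c_{n-1} mod 1
print("hits up to N:", hits.sum())     # grows like log N, i.e. infinitely often as N -> oo
print("L^2 norm of 1_{A_N}:", np.sqrt(b[-1]))   # b_N^{1/2} -> 0
```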

Using the Borel-Cantelli lemma, we can upgrade the convergence in \(L^2\) to almost sure convergence. The result is the following proposition, which is still not optimal because of the assumption that \(X_n \in L^4\) (instead of the optimal \(X_n \in L^1\)). It represents an archetypal use of the Borel-Cantelli lemma.

Proposition 3.27 • Law of large numbers in \(L^4\)

Let \((X_n)_{n \geqslant 1}\) be independent random variables in \(L^4\) with the same law. Then \[\frac{1}{n} (X_1+ \cdots + X_n) \overset{\text{a.s.}}{\longrightarrow} \mathbb{E}[X_1]\,.\]

Proof. The proof is instructive and consists of two main ideas. The first is that if we repeat the proof of Proposition 3.24 with a higher \(L^p\) norm, we get stronger decay in \(n\) of the error probabilities. The second is that if these error probabilities are summable over \(n,\) we can use the Borel-Cantelli lemma to conclude almost sure convergence.

To begin with, without loss of generality we can replace \(X_n\) with \(X_n - \mathbb{E}[X_n],\) and hence suppose that \(\mathbb{E}[X_n] = 0.\) Then, using the independence of the random variables \((X_n),\) we calculate \[\begin{gathered} \mathbb{E}\Biggl[\biggl(\frac{1}{n} \sum_{k = 1}^n X_k\biggr)^4\Biggr] = \frac{1}{n^4} \sum_{k_1, \dots, k_4 = 1}^n \mathbb{E}[X_{k_1} X_{k_2} X_{k_3} X_{k_4}] \\ = \frac{1}{n^4} \biggl(n \mathbb{E}[X_1^4] + 3 n (n-1) \mathbb{E}[X_1^2]^2\biggr) = O \biggl(\frac{1}{n^2}\biggr)\,, \end{gathered}\] where in the second step we classified the indices \(k_1, k_2, k_3, k_4\) according to their coincidences: since the \(X_k\) are independent and centred, every term in which some index appears exactly once vanishes, so a nonzero contribution requires either all four indices to coincide (\(n\) such terms, each equal to \(\mathbb{E}[X_1^4]\)) or the indices to coincide in two pairs (\(3 n (n-1)\) such terms, each equal to \(\mathbb{E}[X_1^2]^2\)).

We conclude that \[\sum_{n \geqslant 1} \mathbb{E}\Biggl[\biggl(\frac{1}{n} \sum_{k = 1}^n X_k\biggr)^4\Biggr] = \mathbb{E}\Biggl[ \sum_{n \geqslant 1} \biggl(\frac{1}{n} \sum_{k = 1}^n X_k\biggr)^4\Biggr] < \infty\,,\] where we exchanged the sum and the expectation by monotone convergence. Hence \[\sum_{n \geqslant 1} \biggl(\frac{1}{n} \sum_{k = 1}^n X_k\biggr)^4 < \infty\] almost surely, and since the terms of a convergent series tend to zero, \(\frac{1}{n} \sum_{k = 1}^n X_k \to 0\) almost surely, which is the claim. This argument is an application of the (proof of the) Borel-Cantelli lemma (see also Remark 3.18).
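The fourth-moment identity in the proof is exact, not only asymptotic, and can be checked numerically. In the sketch below, the uniform law on \([-1,1]\) and the number of repetitions are illustrative choices; for this law \(\mathbb{E}[X_1^2] = 1/3\) and \(\mathbb{E}[X_1^4] = 1/5.\)

```python
# Sketch: for centred i.i.d. X_k, E[(S_n/n)^4] = (n E[X_1^4] + 3 n (n-1) E[X_1^2]^2) / n^4,
# which is O(1/n^2); compare an empirical estimate with the exact value.
import numpy as np

rng = np.random.default_rng(3)
n_repeats = 10_000

for n in (10, 100, 1000):
    x = rng.uniform(-1.0, 1.0, size=(n_repeats, n))     # E[X^2] = 1/3, E[X^4] = 1/5
    empirical = np.mean(x.mean(axis=1) ** 4)            # estimate of E[(S_n/n)^4]
    exact = (n * (1 / 5) + 3 * n * (n - 1) * (1 / 3) ** 2) / n**4
    print(n, empirical, exact)
```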

For many applications, the non-optimal assumption \(X_n \in L^4\) in Proposition 3.27 does not pose a problem. Here is an example.

Corollary 3.28

Let \((A_n)_{n \geqslant 1}\) be independent events with the same probability. Then \[\frac{1}{n} \sum_{k = 1}^n \mathbf 1_{A_k} \overset{\text{a.s.}}{\longrightarrow} \mathbb{P}(A_1)\,.\] Indeed, the indicators \(\mathbf 1_{A_k}\) are independent, have the same law, and are bounded (hence in \(L^4\)), so Proposition 3.27 applies.

We can use this result to bring Example 3.20 to a striking conclusion.

Example 3.29 • Example 3.20 continued

Let \(p \in \mathbb{N}^*\) and \(l \in \{1, \dots, p\}.\) Define \[Y^l_n :=(X_{(n-1)p + l}, \dots, X_{(n-1)p + l + p-1})\,.\] Explicitly, the sequence \(Y_1^l, Y_2^l, \dots\) is \[(X_{l}, \dots, X_{l + p-1}), (X_{p + l}, \dots, X_{p + l + p-1}), \dots\] By Example 3.20 and Remark 3.15, \((Y^l_n)_{n \in \mathbb{N}^*}\) is an independent family of random variables, for each \(l \in \{1, \dots, p\}.\) Hence, for any \(x = (x_1, \dots, x_p) \in \{0,1\}^p,\) applying Corollary 3.28 to the events \(A_n = \{Y^l_n = x\},\) which have probability \(2^{-p},\) we find that \[\frac{1}{n} \# \Bigl\{i \in \{1, \dots, n\} \,\colon Y^l_i = x\Bigr\} \overset{\text{a.s.}}{\longrightarrow} \frac{1}{2^p}\,.\] Since this holds for all \(l \in \{1, \dots, p\},\) and every index \(i\) lies in exactly one of the \(p\) arithmetic progressions \(\{(n-1)p + l \,\colon n \in \mathbb{N}^*\},\) we deduce that \[\frac{1}{n} \# \Bigl\{i \in \{1, \dots, n\} \,\colon(X_{i}, \dots, X_{i + p-1}) = x\Bigr\} \overset{\text{a.s.}}{\longrightarrow} \frac{1}{2^p}\,.\] Since there are countably many pairs \((p, x)\) with \(p \in \mathbb{N}^*\) and \(x \in \{0,1\}^p,\) taking the union of the corresponding exceptional null sets we conclude: almost surely, for all \(p \in \mathbb{N}^*,\) \(x \in \{0,1\}^p,\) \[\frac{1}{n} \# \Bigl\{i \in \{1, \dots, n\} \,\colon(X_{i}, \dots, X_{i + p-1}) = x\Bigr\} \longrightarrow \frac{1}{2^p}\,.\] In words: for almost every real number \(\omega \in [0,1),\) every finite string of zeros and ones of length \(p\) appears with asymptotic frequency \(2^{-p}\) in the binary digits of \(\omega.\)

Note that it is difficult to construct an explicit number \(\omega\) with this rather striking property. The easiest way to show that such a number exists is the preceding probabilistic argument, which in fact shows that almost all numbers have this property.
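A small simulation sketch of this conclusion (the pattern, its length, and the number of simulated digits are illustrative choices; a recent NumPy with `sliding_window_view` is assumed): draw fair random bits, standing in for the binary digits of a uniform \(\omega \in [0,1),\) and compare the empirical frequency of a fixed pattern with \(2^{-p}.\)

```python
# Sketch: the empirical frequency of a fixed pattern of length p among the first
# million binary digits of a "random omega" should be close to 2^{-p}.
import numpy as np

rng = np.random.default_rng(4)
bits = rng.integers(0, 2, size=1_000_000)     # X_1, X_2, ...: i.i.d. fair bits

p = 3
x = (1, 0, 1)                                 # an arbitrary pattern of length p
windows = np.lib.stride_tricks.sliding_window_view(bits, p)
freq = np.mean(np.all(windows == x, axis=1))
print(freq, 1 / 2**p)                         # both close to 0.125
```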
