5.8 Asymptotic behaviour
In this section we study the following important question. Suppose that we take an irreducible and recurrent Markov chain. In Section 5.6 we saw that the chain has a stationary measure \(\mu,\) unique up to a multiplicative constant. Let \(f\) be a nonnegative function on \(S.\) How are the time average \(\frac{1}{n}\sum_{i = 0}^{n-1} f(X_i)\) and the space average \(\int f \, \mathrm d\mu\) related? The following result says that for large \(n\) they coincide almost surely (up to a rescaling), no matter where the chain starts from. Such results are usually known as ergodic theorems.
Proposition 5.45. Consider an irreducible and recurrent Markov chain with stationary measure \(\mu.\) Let \(f,g \,\colon S \to [0,\infty)\) be such that \(0 < \int g \, \mathrm d\mu < \infty.\) Then, for all \(x \in S,\) we have \(\mathbb{P}_x\)-a.s. \[\frac{\sum_{i = 0}^{n-1} f(X_i)}{\sum_{i = 0}^{n-1} g(X_i)} \longrightarrow \frac{\int f \, \mathrm d\mu}{\int g \, \mathrm d\mu}\] as \(n \to \infty.\)
As a consequence (taking \(g = 1\)), if the chain is positive recurrent, i.e. it has a stationary probability measure, the space average can be computed as the almost sure limit of the time average.
Corollary 5.46. If the Markov chain is irreducible and positive recurrent with stationary probability measure \(\mu,\) then, for all \(x \in S,\) we have \(\mathbb{P}_x\)-a.s. \[\tag{5.17} \frac{1}{n} \sum_{i = 0}^{n-1} f(X_i) \longrightarrow \int f \, \mathrm d\mu\] as \(n \to \infty.\)
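To see the corollary in action, here is a minimal Python sketch (purely illustrative and not part of the notes; the transition matrix \(Q\) and the function \(f\) are arbitrary choices): it simulates a small irreducible chain, computes the stationary distribution as the left eigenvector of \(Q\) for the eigenvalue \(1,\) and compares the time average in (5.17) with the space average.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small irreducible chain on S = {0, 1, 2} with an arbitrary transition matrix Q.
Q = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])

# Stationary probability measure mu: the left eigenvector of Q for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(Q.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
mu /= mu.sum()

f = np.array([1.0, 5.0, 2.0])   # an arbitrary nonnegative function on S

n, state = 100_000, 0           # run the chain for n steps, started at x = 0
time_avg = 0.0
for _ in range(n):
    time_avg += f[state] / n
    state = rng.choice(3, p=Q[state])

print(time_avg, f @ mu)         # time average vs space average: should be close
```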
Proof of Proposition 5.45. First, by monotone approximation, we may suppose that \(\int f \, \mathrm d\mu < \infty.\) (Otherwise, consider a sequence of functions \(f_m\) supported on finite subsets of \(S,\) that converges from below to \(f,\) and use the monotone convergence theorem.)
The main tool of the proof is the sequence of random times \(T_k,\) corresponding to the \(k\)th return of the chain to \(x.\) That is, \[T_0 :=0\,, \qquad T_1 :=H_x = \inf \{n > 0 \,\colon X_n = x\}\,, \qquad T_{k+1} :=\inf\{n > T_k \,\colon X_n = x\}\,.\] Since \((X_n)\) is recurrent, \(T_k < \infty\) a.s. for all \(k \in \mathbb{N}.\)
For \(k \in \mathbb{N}\) we define the sum over the \(k\)th excursion, \[Z_k :=\sum_{n = T_k}^{T_{k+1} - 1} f(X_n)\,.\] The main observation is that \((Z_k)_{k \in \mathbb{N}}\) are independent identically distributed random variables. Intuitively, this is clear from the Markov property, since each excursion from \(x\) back to \(x\) is a fresh start, the chain forgetting everything about its past except that it starts at \(x\) again.
More formally, by Proposition 3.12, it suffices to show that for any bounded and measurable functions \(g_0, \dots, g_k\) we have \[\tag{5.18} \mathbb{E}_x \Biggl[\prod_{i = 0}^k g_i(Z_i)\Biggr] = \prod_{i = 0}^k \mathbb{E}_x[g_i(Z_0)]\,.\] We show (5.18) by induction on \(k.\) The case \(k = 0\) is obvious. For the induction step, we condition on \(T_{k}\) and \(T_{k+1}\) to get \[\begin{aligned} &\mspace{-10mu}\mathbb{E}_x \Biggl[\prod_{i = 0}^k g_i(Z_i)\Biggr] \\ &= \sum_{m,n \in \mathbb{N}^*} \mathbb{E}_x \Biggl[g_k(Z_k) \mathbf 1_{T_{k+1} = n+ m} \mathbf 1_{T_{k} = n} \prod_{i = 0}^{k-1} g_i(Z_i)\Biggr] \\ &= \sum_{m,n \in \mathbb{N}^*} \mathbb{E}_x \Biggl[g_k\biggl(\sum_{l = n}^{n+m-1}f(X_l)\biggr) \mathbf 1_{T_{k+1} = n+m} \mathbf 1_{T_{k} = n} \prod_{i = 0}^{k-1} g_i(Z_i) \, \bigg|\, X_n = x\Biggr] \mathbb{P}_x(X_n = x) \\ &= \sum_{m,n \in \mathbb{N}^*} \mathbb{E}_x \Biggl[g_k\biggl(\sum_{l = 0}^{m-1}f(X_l)\biggr) \mathbf 1_{H_x = m} \Biggr] \, \mathbb{E}_x\Biggl[\mathbf 1_{T_{k} = n} \prod_{i = 0}^{k-1} g_i(Z_i) \, \bigg|\, X_n = x\Biggr] \mathbb{P}_x(X_n = x) \\ &= \sum_{m,n \in \mathbb{N}^*} \mathbb{E}_x \Biggl[g_k(Z_0) \mathbf 1_{H_x = m} \Biggr] \, \mathbb{E}_x\Biggl[\mathbf 1_{T_{k} = n} \prod_{i = 0}^{k-1} g_i(Z_i)\Biggr] \\ &= \mathbb{E}_x [g_k(Z_0)] \, \mathbb{E}_x\Biggl[\prod_{i = 0}^{k-1} g_i(Z_i)\Biggr]\,, \end{aligned}\] where in the third step we used Corollary 5.16 combined with the observation that on the event \(\{T_k = n\}\) we have \[\{T_{k+1} = n+m\} = \{X_{n+1} \neq x, \dots, X_{n+m-1} \neq x, X_{n+m} = x\}\,,\] and in the fourth step we used that \(X_n = x\) on the event \(\{T_k = n\}.\) This completes the induction step, and hence (5.18) follows.
Since the stationary measure of an irreducible recurrent chain is unique up to a multiplicative constant, and since \(y \mapsto \mathbb{E}_x \bigl[\sum_{n = 0}^{H_x - 1} \mathbf 1_{X_n = y}\bigr]\) is a stationary measure taking the value \(1\) at \(x\) (see Section 5.6), we find \[\mathbb{E}_x[Z_0] = \mathbb{E}_x \Biggl[\sum_{n = 0}^{H_x - 1} f(X_n)\Biggr] = \frac{1}{\mu(x)} \int f \, \mathrm d\mu < \infty\,.\] Hence, by the strong law of large numbers, we have \(\mathbb{P}_x\)-a.s. \[\tag{5.19} \frac{1}{N} \sum_{k = 0}^{N-1} Z_k \longrightarrow \frac{1}{\mu(x)} \int f \, \mathrm d\mu\] as \(N \to \infty.\) Here we need the optimal version of the strong law from Proposition 7.5, since, as just shown, we only know that \(Z_0\) has a finite expectation, i.e. is in \(L^1.\)
What remains is to go from such an average over an integer number of excursions to an average over a given time. To that end, for \(n \in \mathbb{N}\) we denote by \(N_x(n)\) the number of returns to \(x\) before time \(n,\) i.e. \[T_{N_x(n)} \leqslant n < T_{N_x(n) + 1}\,.\] Then we clearly have \[\sum_{i = 0}^{T_{N_x(n)} - 1} f(X_i) \leqslant\sum_{i = 0}^{n-1} f(X_i) \leqslant\sum_{i = 0}^{T_{N_x(n) + 1} - 1} f(X_i)\,,\] i.e. \[\sum_{k = 0}^{N_x(n) - 1} Z_k \leqslant\sum_{i = 0}^{n-1} f(X_i) \leqslant\sum_{k = 0}^{N_x(n)} Z_k\,.\] Since the chain is recurrent, \(N_x(n) \to \infty\) a.s., so dividing by \(N_x(n)\) and using (5.19) yields \[\frac{1}{N_x(n)} \sum_{i = 0}^{n-1} f(X_i) \longrightarrow \frac{1}{\mu(x)} \int f \, \mathrm d\mu\] \(\mathbb{P}_x\)-a.s.
The claim now follows by applying the same argument with \(g\) in place of \(f\) and taking the ratio of the two results; the factors \(\frac{1}{N_x(n)}\) cancel.
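The mechanism of the proof can also be checked numerically. The following sketch (again illustrative, with an arbitrary toy chain) cuts a long trajectory into excursions from \(x\) back to \(x,\) records the excursion sums \(Z_k,\) and compares their empirical mean with \(\frac{1}{\mu(x)} \int f \, \mathrm d\mu,\) the limit in (5.19).

```python
import numpy as np

rng = np.random.default_rng(1)
Q = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
f = np.array([1.0, 5.0, 2.0])
x = 0                                # reference state for the return times T_k

# Run the chain started at x; each return to x closes the current excursion.
state, Z, current = x, [], 0.0
for _ in range(1_000_000):
    current += f[state]
    state = rng.choice(3, p=Q[state])
    if state == x:
        Z.append(current)            # excursion sum Z_k as defined above
        current = 0.0

# Compare the empirical mean of the (i.i.d.) Z_k with (1/mu(x)) * int f dmu.
eigvals, eigvecs = np.linalg.eig(Q.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
mu /= mu.sum()
print(np.mean(Z), (f @ mu) / mu[x])
```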
From Proposition 5.45 and Corollary 5.46, we deduce by choosing \(f(y) = \mathbf 1_{x = y}\) that for an irreducible positive recurrent chain with arbitrary initial distribution and stationary probability distribution \(\mu\) we have a.s. for all \(x \in S\) \[\frac{1}{n} \sum_{i = 0}^{n-1} \mathbf 1_{X_i = x} \longrightarrow \mu(x)\] as \(n \to \infty.\) (A similar argument shows that if the chain is null recurrent then the limit is \(0.\)) Taking the expectation and using dominated convergence yields \[\frac{1}{n} \sum_{i = 0}^{n-1} \mathbb{P}(X_i = x) \longrightarrow \mu(x)\,.\] In words, the time averages of the laws converge to \(\mu.\) It is natural to ask whether this holds without averaging over time: does the law of \(X_n\) converge to \(\mu\) for any irreducible positive recurrent chain? In other words, does the measure \(\mu_n(x) :=\mathbb{P}(X_n = x)\) converge to \(\mu\) in some sense?
It is easy to see that in general the answer is no. Consider the very simple example \(S = \{1,2\}\) and \(Q = \bigl(\begin{smallmatrix} 0 & 1 \\ 1 & 0 \end{smallmatrix}\bigr).\) This chain is clearly irreducible and positive recurrent, with stationary distribution \(\mu = (\frac12, \frac12).\) Taking for instance the initial distribution \(\delta_2,\) we clearly have \[\mathbb{P}(X_n = x) = \mathbf 1_{x - n \text{ is even}}\,.\] In other words, the law of \(X_n\) jumps between \(\delta_1\) and \(\delta_2,\) and in particular does not converge to \(\mu.\)
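A few lines of Python make the oscillation visible (illustrative only): the law \(\mu_n\) alternates between \(\delta_1\) and \(\delta_2,\) while its time average converges to \((\frac12, \frac12),\) in accordance with the averaged convergence above.

```python
import numpy as np

Q = np.array([[0.0, 1.0],
              [1.0, 0.0]])
law = np.array([0.0, 1.0])   # initial distribution delta_2 (states indexed 0 and 1)
running = np.zeros(2)

for n in range(8):
    running += law
    print(n, law, running / (n + 1))   # mu_n oscillates; its time average -> (1/2, 1/2)
    law = law @ Q                      # the law of X_{n+1} is the law of X_n times Q
```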
The problem with the above example is periodicity: the chain has a period of two, meaning that one can only return to the initial state in an even number of steps. It turns out that if in addition we impose that the chain is aperiodic, i.e. has no nontrivial period, then \(\mu_n\) indeed converges to \(\mu\) in the total variation distance. We shall not go into further details in this course.
5.9 Markov chain Monte Carlo and the Metropolis–Hastings algorithm
We conclude this chapter with a remarkable application of the theory of Markov chains: the Metropolis–Hastings algorithm, the original and most important algorithm in so-called Markov chain Monte Carlo (often abbreviated as MCMC). MCMC methods are among the most important numerical algorithms ever devised, and they are used in countless applications, from the fundamental sciences to economics and weather forecasting.
Suppose that we are given a large finite set \(S\) and a probability measure \(\mu\) on \(S.\) We would like to compute the average of some function \(f,\) i.e. find \(\int f \, \mathrm d\mu.\) In most applications of interest the set \(S\) is huge and the measure \(\mu\) might also be hard to evaluate, thus rendering any simple-minded numerical evaluation of the integral hopeless.
The idea behind Monte Carlo methods is to draw a sample of random variables \(X_0, X_1, \dots\) that on average reproduce the law \(\mu,\) and hence we can hope that \[\tag{5.20} \frac{1}{n} \sum_{i = 0}^{n-1} f(X_i) \longrightarrow \int f \, \mathrm d\mu\] as \(n \to \infty.\) A simple way to guarantee convergence in (5.20) is to take the variables \(X_n\) to be independent with law \(\mu,\) in which case (5.20) clearly holds a.s. by the law of large numbers. However, in practice, even this is often either extremely difficult or impossible. Let us consider a typical and celebrated example.
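Before turning to the example, here is what this independent-sampling version of (5.20) looks like in a toy setting where sampling from \(\mu\) is easy (a minimal sketch; the measure \(\mu\) and the function \(f\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.1, 0.2, 0.3, 0.4])    # an arbitrary probability measure on S = {0, 1, 2, 3}
f = np.array([2.0, 1.0, 0.5, 3.0])     # an arbitrary function on S

X = rng.choice(4, size=100_000, p=mu)  # independent samples with law mu
print(f[X].mean(), f @ mu)             # Monte Carlo estimate vs the exact average
```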
The Ising model is the simplest and most famous model of ferromagnetism, the remarkable property of certain materials to exhibit spontaneous magnetisation (as used e.g. in fridge magnets). Despite being over a hundred years old, it is still actively studied and many important questions about it remain unsolved. Although seemingly simple, it is known to exhibit a remarkably intricate behaviour, in particular a host of phase transitions – abrupt transitions from one phase of matter to another – which can be studied theoretically (both analytically and numerically).
Let \((V,E)\) be a finite connected graph. Typically, one should think of \(V\) as a subset of the lattice \(\mathbb{Z}^d\) such as \(V = \{0, \dots, L-1\}^d,\) where two elements \(u,v \in V\) are adjacent if and only if they are nearest neighbours. At every vertex \(v\) sits a so-called spin, which can point up \(+1\) or down \(-1.\) More formally, we consider a spin configuration \(x = (x_v)_{v \in V} \in S :=\{\pm 1\}^V.\) With each spin configuration \(x \in S\) we associate the energy \[H(x) :=- J \sum_{\{u,v\} \in E} x_u x_v - \sum_{v \in V} h_v x_v\,,\] where \(J \in \mathbb{R}\) is a coupling constant and \(h \in \mathbb{R}^V\) is called an external magnetic field. (The physical intuition behind this definition is that, if \(J > 0,\) neighbouring spins like to align: each pair of aligned neighbouring spins lowers the energy by \(J,\) while each pair of neighbouring spins that point in opposite directions raises it by \(J.\) The material is called ferromagnetic. If \(J < 0\) the opposite behaviour holds, and neighbouring spins favour opposite orientations. The material is called antiferromagnetic. An external magnetic field introduces a bias in the orientation favoured by the spins.)
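For concreteness, the energy \(H(x)\) is straightforward to evaluate on a computer. Here is a minimal sketch (the grid size and all parameter values are arbitrary choices):

```python
import numpy as np

def ising_energy(x, edges, J, h):
    """Energy H(x) = -J * sum of x_u * x_v over edges {u,v} - sum of h_v * x_v."""
    return -J * sum(x[u] * x[v] for u, v in edges) - float(np.dot(h, x))

# Example: nearest-neighbour edges of a 3 x 3 grid, with a random spin configuration.
L = 3
edges = [(i * L + j, i * L + j + 1) for i in range(L) for j in range(L - 1)] \
      + [(i * L + j, (i + 1) * L + j) for i in range(L - 1) for j in range(L)]
x = np.random.default_rng(3).choice([-1, 1], size=L * L)
print(ising_energy(x, edges, J=1.0, h=np.zeros(L * L)))
```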
The probability of a spin configuration \(x \in S\) is given by the Boltzmann–Gibbs distribution from statistical mechanics: \[\mu(x) :=\frac{1}{Z} \, \mathrm e^{-\beta H(x)} \,, \qquad Z :=\sum_{x \in S} \mathrm e^{-\beta H(x)}\,,\] where \(\beta > 0\) is a fixed parameter that has the physical interpretation of the inverse temperature. (The physical intuition behind this definition is that configurations of low energy occur with higher probability than configurations of high energy. The stronger this imbalance between high and low energies is, the lower the temperature of the system. In fact, this parameter \(\beta\) can be regarded as one possible mathematically rigorous definition of the rather mysterious physical notion of temperature in statistical mechanics.)
One is typically interested in quantities such as the magnetisation at a given vertex \(v \in V,\) \[\tag{5.21} \int x_v \, \mu (\mathrm dx)\,,\] as well as the two-point correlation function at two vertices \(u,v,\) \[\tag{5.22} \int x_u \, x_v \, \mu (\mathrm dx)\,.\] The size of the set \(S\) is \(2^{\lvert V \rvert},\) which means that any direct numerical evaluation of such an integral is utterly hopeless for large \(V,\) no matter the computing power at one's disposal. Hence, some kind of Monte Carlo method is required for the numerical evaluation of (5.21) and (5.22). However, there is no known effective way of generating independent random samples \(X\) with law \(\mu.\)
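To appreciate the problem, note that for a very small \(V\) one can still evaluate (5.21) by brute force, summing over all \(2^{\lvert V \rvert}\) configurations; the following sketch does this on a path with four vertices (all parameter values arbitrary). Each additional vertex doubles the number of terms, which is exactly why this approach collapses for realistic system sizes.

```python
import itertools
import numpy as np

# Brute-force evaluation of the magnetisation (5.21) on a tiny path graph:
# feasible only because 2^|V| = 16 here.
V, beta, J = 4, 0.5, 1.0
edges = [(v, v + 1) for v in range(V - 1)]
h = np.zeros(V)

configs = list(itertools.product([-1, 1], repeat=V))
weights = np.array([np.exp(-beta * (-J * sum(x[u] * x[v] for u, v in edges)
                                    - np.dot(h, x))) for x in configs])
Z = weights.sum()                                  # the normalisation constant
mag = sum(w * x[0] for w, x in zip(weights, configs)) / Z
print(mag)   # magnetisation at vertex 0; equals 0 by symmetry since h = 0
```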
Instead of trying to generate independent samples with law \(\mu,\) which is typically impossible, the remarkable idea is the following.
Let \(\mu\) be a probability measure on a finite set \(S.\) In Markov chain Monte Carlo, one constructs an irreducible Markov chain with stationary measure \(\mu.\)
If we have constructed such a chain, then we can use it to sample the measure \(\mu\) through (5.17), choosing \(n\) large enough to obtain a good approximation. (Indeed, by Proposition 5.42, the chain is positive recurrent and we can apply Corollary 5.46.)
How does one construct such a chain? The original and most famous algorithm, still widely used today, is the following.
Let \(\mu\) be a probability measure on a finite set \(S.\) Suppose that \(\mu(x) = C \rho(x)\) for some constant \(C > 0\) (which does not need to be determined) and a known function \(\rho > 0.\)
Choose the initial state \(X_0\) in some arbitrary fashion.
Choose a stochastic matrix \(G\) on \(S\) satisfying the following conditions: \(G(x,x) = 0\) for all \(x\); the associated Markov chain is irreducible; \(G(x,y) > 0\) if and only if \(G(y,x) > 0.\) This matrix is called the proposal function.
Construct the chain by defining inductively \(X_{n+1}\) as a function of \(x = X_n\) as follows.
Choose a candidate \(y\) randomly according to the law \(G(x, \cdot).\)
Calculate the acceptance ratio \[\tag{5.23} A(x,y) :=1 \wedge \frac{\rho(y) \, G(y,x)}{\rho(x) \, G(x,y)}\,.\]
Set \(X_{n+1} :=y\) with probability \(A(x,y)\) (acceptance of proposal) or \(X_{n+1} :=x\) with probability \(1 - A(x,y)\) (rejection of proposal).
In practice, the latter step is performed by drawing a random variable \(U\) with uniform law on \([0,1],\) and then setting \[X_{n+1} := \begin{cases} y & \text{if } U \leqslant A(x,y) \\ x & \text{otherwise}\,. \end{cases}\] Explicitly, the transition matrix \(Q\) of the Metropolis–Hastings chain is \[Q(x,y) = \begin{cases} G(x,y) A(x,y) & \text{if } x \neq y \\ 1 - \sum_{z \in S} G(x,z) A(x,z) & \text{if } x = y\,. \end{cases}\] For many applications, it is crucial that the Metropolis–Hastings algorithm only requires the knowledge of \(\rho(x)\) and not of the constant \(C\) (and hence of the actual measure \(\mu\)); the latter can be prohibitively difficult to determine, while \(\rho\) is often known and simple. See for instance Example 5.52 below.
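The algorithm translates directly into code. Below is a minimal Python sketch (not a reference implementation; the helper names `rho`, `G_sample`, `G_prob` and the toy target measure are illustrative assumptions), together with a usage example on a five-state cycle with a symmetric nearest-neighbour proposal:

```python
import numpy as np

def metropolis_hastings(rho, G_sample, G_prob, x0, n, rng):
    """Run n steps of the Metropolis-Hastings chain.

    rho      -- unnormalised density: mu(x) = C * rho(x), with C unknown
    G_sample -- draws a proposal y from G(x, .)
    G_prob   -- evaluates G(x, y)
    """
    chain, x = [x0], x0
    for _ in range(n - 1):
        y = G_sample(x, rng)
        # Acceptance ratio (5.23).
        a = min(1.0, rho(y) * G_prob(y, x) / (rho(x) * G_prob(x, y)))
        x = y if rng.random() <= a else x   # accept with probability a, else stay
        chain.append(x)
    return chain

# Toy usage: sample from mu = C * rho on the cycle {0, ..., 4}.
rng = np.random.default_rng(4)
rho = lambda x: [1.0, 2.0, 4.0, 2.0, 1.0][x]
G_sample = lambda x, rng: (x + rng.choice([-1, 1])) % 5   # symmetric proposal
G_prob = lambda x, y: 0.5 if (x - y) % 5 in (1, 4) else 0.0
chain = metropolis_hastings(rho, G_sample, G_prob, 0, 200_000, rng)
print(np.bincount(chain, minlength=5) / len(chain))   # approaches rho / sum(rho)
```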
The Metropolis–Hastings algorithm is an instance of MCMC: it defines an irreducible Markov chain for which \(\mu\) is reversible (and hence stationary).
Proof. Let us first show that the chain is irreducible. By the assumptions on \(\rho\) and \(G,\) we have \(A(x,y) > 0\) whenever \(G(x,y) > 0,\) and hence \(Q(x,y) \geqslant G(x,y) A(x,y) > 0\) for all \(x \neq y\) with \(G(x,y) > 0.\) By the irreducibility assumption on \(G,\) for any two \(x,y \in S\) there exists \(n\) such that \(G^n(x,y) > 0,\) i.e. there is a path from \(x\) to \(y\) along which \(G\) is positive. Along the same path \(Q\) is positive, so \(Q^n(x,y) > 0,\) and the chain is irreducible.
What remains is to verify that \(\mu\) is reversible. The detailed balance equations (5.9) for the matrix \(Q\) read \[\mu(x) Q(x,y) = \mu(y) Q(y,x)\] for all \(x \neq y,\) which are equivalently written as \[\rho(x) G(x,y) A(x,y) = \rho(y) G(y,x) A(y,x)\] for all \(x \neq y.\) This is equivalent to the condition \[\tag{5.24} \frac{A(x,y)}{A(y,x)} = \frac{\rho(y) G(y,x)}{\rho(x) G(x,y)}\] for all \(x,y\) satisfying \(G(x,y) > 0.\) The choice (5.23) satisfies (5.24): abbreviating \(r :=\frac{\rho(y) G(y,x)}{\rho(x) G(x,y)},\) if \(r \leqslant 1\) then \(A(x,y) = r\) and \(A(y,x) = 1,\) while if \(r > 1\) then \(A(x,y) = 1\) and \(A(y,x) = 1/r;\) in both cases \(A(x,y)/A(y,x) = r.\)
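The detailed balance equations can also be verified numerically for a concrete chain. This sketch (toy values, symmetric proposal on a five-state cycle) builds the matrix \(Q\) explicitly and checks that \(\mu(x) Q(x,y)\) is symmetric in \(x\) and \(y\):

```python
import numpy as np

S = 5
rho = np.array([1.0, 2.0, 4.0, 2.0, 1.0])
G = np.zeros((S, S))
for x in range(S):                            # nearest-neighbour proposal on a cycle
    G[x, (x + 1) % S] = G[x, (x - 1) % S] = 0.5

Q = np.zeros((S, S))
for x in range(S):
    for y in range(S):
        if x != y and G[x, y] > 0:            # off-diagonal entries G(x,y) A(x,y)
            Q[x, y] = G[x, y] * min(1.0, rho[y] * G[y, x] / (rho[x] * G[x, y]))
    Q[x, x] = 1.0 - Q[x].sum()                # rejection mass stays at x

mu = rho / rho.sum()
D = mu[:, None] * Q                           # D[x, y] = mu(x) Q(x, y)
print(np.allclose(D, D.T))                    # True: detailed balance (5.9) holds
```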
It is common to choose the proposal function to be symmetric, \(G(x,y) = G(y,x),\) in which case the acceptance ratio (5.23) simplifies to \[\tag{5.25} A(x,y) = 1 \wedge \frac{\rho(y)}{\rho(x)}\,.\]
Example 5.52. Let us apply the Metropolis–Hastings algorithm to the Ising model. In this case, we have \[\rho(x) = \mathrm e^{- \beta H(x)}\,.\] Note that computing the constant \(C = \frac{1}{Z}\) (and hence determining the measure \(\mu\)) is practically impossible (it is a sum over \(2^{\lvert V \rvert}\) terms), but the function \(\rho\) itself is simple and easy to compute numerically.
We have a lot of freedom in choosing the proposal function, and different choices lead to different versions of the algorithm. We shall consider the simplest choice: choose a vertex \(v\) uniformly at random and flip the corresponding spin. That is, we map \(x \mapsto x(v),\) where \(x(v)\) is the configuration of spins obtained from \(x\) by flipping the spin at \(v\): \(x(v)_u = (-1)^{\delta_{uv}} x_u.\) More formally, \[G(x,y) = \begin{cases} \frac{1}{\lvert V \rvert} & \text{if } y = x(v) \text{ for some $v \in V$} \\ 0 & \text{otherwise}\,. \end{cases}\] It is easy to check that \(G\) is irreducible and symmetric and satisfies \(G(x,x) = 0\) for all \(x.\) The acceptance ratio (5.25) is \[A(x,x(v)) = 1 \wedge \mathrm e^{\beta H(x) - \beta H(x(v))} = 1 \wedge \mathrm e^{-2\beta J \sum_{u \in V} \mathbf 1_{\{u,v\} \in E} x_u x_v - 2 \beta h_v x_v}\,.\] Note that the right-hand side is trivial to evaluate numerically, as it involves just a sum over the neighbours of \(v.\)
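Putting everything together, here is a minimal sketch of the resulting single spin-flip Metropolis chain (the grid size, \(\beta,\) and the number of steps are arbitrary choices; a serious simulation would also discard an initial burn-in portion of the chain):

```python
import numpy as np

def ising_step(x, neighbours, J, h, beta, rng):
    """One step of the single spin-flip Metropolis chain for the Ising model."""
    v = rng.integers(len(x))               # proposal: flip the spin at a uniform vertex v
    s = sum(x[u] for u in neighbours[v])   # sum of the spins neighbouring v
    # Acceptance ratio (5.25): 1 ^ exp(-2 beta x_v (J s + h_v)).
    if rng.random() <= min(1.0, np.exp(-2.0 * beta * x[v] * (J * s + h[v]))):
        x[v] = -x[v]                       # accept: flip the spin at v
    return x

# Toy usage on an L x L grid: estimate the magnetisation by the time average
# (5.17), here additionally averaged over all vertices.
L, J, beta = 16, 1.0, 0.3
rng = np.random.default_rng(5)
h = np.zeros(L * L)
neighbours = [[] for _ in range(L * L)]
for i in range(L):
    for j in range(L):
        if j + 1 < L:
            neighbours[i * L + j].append(i * L + j + 1)
            neighbours[i * L + j + 1].append(i * L + j)
        if i + 1 < L:
            neighbours[i * L + j].append((i + 1) * L + j)
            neighbours[(i + 1) * L + j].append(i * L + j)

x = rng.choice([-1, 1], size=L * L)
mags = []
for _ in range(200_000):
    x = ising_step(x, neighbours, J, h, beta, rng)
    mags.append(x.mean())
print(np.mean(mags))   # close to 0 at this beta with h = 0, by symmetry
```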