0.3.2 Convergence rates and convergence guarantees
In this section we study the behavior of gradient descent theoretically. We start by investigating it in the simplest case possible, namely for a univariate (\(d=1\)) quadratic objective function.
Lemma 3.3 (GD for univariate quadratic \(f\) with non-degenerate minimum).
Let \(f:\mathbb{R}\to\mathbb{R}\) be given by \[f(x)=\frac{a}{2}x^{2}+bx+c,\] for some \(a,b,c\in\mathbb{R}\) with \(a>0,\) and let \(x_{*}=-\frac{b}{a}\) denote the unique global minimizer of \(f(x).\) Consider the numerical minimization of this quadratic using gradient descent (Algorithm [algo:GD]) with step-size \(\eta>0\) starting at \(x_{0}\in\mathbb{R}\setminus\{x_{*}\},\) and let \(x_{n},n\ge0\) denote the iterates. Then:
a) \[x_{n}\to x_{*}\quad\text{iff}\quad\eta<\frac{2}{a}.\]
b) If \(\eta<\frac{2}{a}\) and \(\eta\ne\frac{1}{a}\) then \[|x_{n}-x_{*}|=|x_{0}-x_{*}|e^{-\kappa n}\] where \[\kappa=-\log|1-a\eta|>0.\]
c) If \(\eta>\frac{2}{a}\) then \[|x_{n}-x_{*}|=|x_{0}-x_{*}|e^{\gamma n},\] where \[\gamma=\log|1-a\eta|>0.\]
d) If \(\eta=\frac{1}{a}\) then \(x_{n}=x_{*}\) for all \(n\ge1.\)
e) If \(\eta\le\frac{1}{a}\) then \(x_{n}\to x_{*}\) monotonically.
f) If \(\eta>\frac{1}{a}\) then \({\rm sign}(x_{n}-x_{*})={\rm sign}(x_{0}-x_{*})(-1)^{n}.\)
Remark 3.4. Note that \(f''(x)=a\) corresponds to the curvature of the function (smaller \(a\) means less curvature, i.e. a flatter function).
a) says that \[\text{GD converges to the global minimizer iff }\eta<\frac{2}{a},\] which means that for functions with stronger curvature one must take smaller steps to converge.
b) says that when \(\eta<\frac{2}{a}\) the convergence is exponentially fast, with rate \(\kappa\) governed by the curvature \(a\) and the step-size \(\eta.\) Smaller curvature or a smaller step-size leads to smaller \(\kappa\) (i.e. slower convergence).
c) says that when \(\eta>\frac{2}{a},\) GD instead diverges exponentially fast away from \(x_{*}.\)
d) says that if \(\eta=\frac{1}{a}\) then in fact one step of GD takes you directly to \(x_{*}.\)
e) says that if \(\eta\le\frac{1}{a},\) then \(x_{n}\) converges to \(x_{*}\) monotonically (i.e. from one side).
f) says that if \(\eta>\frac{1}{a},\) then the sign of \(x_{n}-x_{*}\) alternates. That is, the \(x_{n}\) jump back and forth between each side of \(x_{*}.\) Thus if \(\eta\in(\frac{1}{a},\frac{2}{a})\) the sequence \(x_{n}\) converges to \(x_{*}\) non-monotonically.
Proof. \(\:\)
a) Since \(f'(x)=ax+b\) the iterates of GD are given by \[x_{n+1}=x_{n}-\eta(ax_{n}+b),\quad n\ge0\tag{3.7}\] (see lines 2 and 3 of Algorithm [algo:GD]). The unique local and global minimizer of \(f\) is \(x_{*}=-\frac{b}{a}.\) Let \[y_{n}=x_{n}-x_{*}\] be the difference between the \(n\)-th iterate and this minimizer. Substituting \(y_{n}+x_{*}\) for \(x_{n}\) in (3.7) and using that \(ax_{*}+b=f'(x_{*})=0\) we obtain \[y_{n+1}=y_{n}-a\eta y_{n},\quad n\ge0.\] This can be rewritten as \(y_{n+1}=y_{n}(1-a\eta),\) from which it follows that \[y_{n}=(1-a\eta)^{n}y_{0}=\theta^{n}y_{0},\quad n\ge0,\tag{3.8}\] where \[\theta=1-a\eta.\]
We have \(\theta^{n}y_{0}\to0\) iff \(\theta\in(-1,1),\) and \(\theta\in(-1,1)\) iff \(\eta\in(0,2a^{-1}).\) Since \(y_{n}\to0\) iff \(x_{n}\to x_{*},\) this proves a).
b) We have \[|x_{n}-x_{*}|=|y_{n}|=|\theta|^{n}|y_{0}|=|\theta|^{n}|x_{0}-x_{*}|,\quad n\ge0.\] The claim then follows since if \(\eta\ne a^{-1}\) then \(|\theta|^{n}=\exp(n\log|\theta|)=\exp(-\kappa n),\) and if in addition \(\eta<2a^{-1}\) then \(\log|\theta|=\log|1-a\eta|<0,\) i.e. \(\kappa=-\log|1-a\eta|>0.\)
c) Similar to b).
d) Exercise.
e) If \(\eta\le a^{-1}\) then \(\theta\in[0,1),\) and then \(\theta^{n}y_{0}\) is a monotone sequence, so \(x_{n}=\theta^{n}y_{0}+x_{*}\) is also monotone.
f) If \(\eta>a^{-1}\) then \(\theta<0,\) and so \[{\rm sign}(y_{n})={\rm sign}(\theta)^{n}{\rm sign}(y_{0})=(-1)^{n}{\rm sign}(y_{0}).\]
If \(f:\mathbb{R}\to\mathbb{R}\) is twice continuously differentiable, then in a small enough neighborhood of any non-degenerate local minimum \(x_{*}\) Taylor expansion gives the quadratic approximation \(f(x)\approx\frac{1}{2}a(x-x_{*})^{2}+c\) with \(a=f''(x_{*})>0,\) \(c=f(x_{*}).\) If gradient descent is started close enough to \(x_{*},\) then the iterates will converge to \(x_{*}\) at a speed approximately as in Lemma 3.3.
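The behaviour described in Lemma 3.3 is easy to check numerically. The following minimal Python sketch (the particular values of \(a,b,\) the step-sizes and the variable names are illustrative choices, not part of the lemma) runs gradient descent on a univariate quadratic and compares \(|x_{n}-x_{*}|\) with the predicted value \(|1-a\eta|^{n}|x_{0}-x_{*}|\) from (3.8).

```python
# Gradient descent on f(x) = (a/2) x^2 + b x + c; the minimizer is x_* = -b/a (Lemma 3.3).
a, b = 4.0, -2.0          # the constant c does not affect the gradient f'(x) = a x + b
x_star = -b / a
x0, n_steps = 3.0, 30

for eta in [0.1, 0.2, 1 / a, 0.45, 0.55]:     # 2/a = 0.5 is the stability threshold
    x = x0
    for _ in range(n_steps):
        x = x - eta * (a * x + b)             # x_{n+1} = x_n - eta f'(x_n)
    err = abs(x - x_star)
    predicted = abs(1 - a * eta) ** n_steps * abs(x0 - x_star)
    print(f"eta = {eta:5.3f}   |x_30 - x_*| = {err:.3e}   |1 - a eta|^30 |x_0 - x_*| = {predicted:.3e}")
```

The two printed columns agree for every step-size; for \(\eta=1/a\) the error is exactly zero after one step (part d)), and for \(\eta=0.55>\frac{2}{a}\) both columns grow exponentially, illustrating part c).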
Next let us investigate the situation when the objective function has a degenerate global minimizer, which is a global minimizer \(x_{*}\) with \(f''(x_{*})=0.\) As a model we consider the functions \(f(x)=\frac{a}{k}|x|^{k}\) for \(k>2.\)
Lemma 3.9 (GD for univariate \(f\) with degenerate minimum).
Let \(k>2,a>0,\) and \(f:\mathbb{R}\to\mathbb{R}\) be given by \[f(x)=\frac{a}{k}|x|^{k},\] and let \(x_{*}=0\) denote the unique global and local minimizer of \(f.\) Let \(x_{n}\) denote the iterates of gradient descent (Algorithm [algo:GD]) applied to this \(f,\) starting at some \(x_{0}\ne x_{*}.\)
a) If \(\eta a|x_{0}|^{k-2}\in(0,1),\) then \(x_{n}\to x_{*}.\)
b) If \(\eta a|x_{0}|^{k-2}\in(0,1),\) then \[|x_{n}-x_{*}|\sim(s\eta a)^{-1/s}n^{-1/s}\text{ as }n\to\infty,\tag{3.10}\] where \[s=k-2.\]
Remark 3.6. The notation \(\sim\) in (3.10) means that \(|x_{n}-x_{*}|\) is asymptotic to \((s\eta a)^{-1/s}n^{-1/s},\) i.e. that \[\lim_{n\to\infty}\frac{|x_{n}-x_{*}|}{(s\eta a)^{-1/s}n^{-1/s}}=1.\]
The take home message is that the convergence is at rate \(n^{-1/s},\) which is much slower than the exponential rate for the non-degenerate local minimizer in Lemma 3.3: the very low curvature around the degenerate minimum \(x_{*}\) of this \(f\) slows gradient descent down significantly.
Below in (3.12) we give the iteration of gradient descent for this \(f(x).\) The convergence can be investigated graphically by considering the graphs of the functions \(g(x)=x-\eta a|x|^{k-1}{\rm sign}(x)\) and \(h(x)=x,\) see Figure 3.11.
Proof of Lemma 3.9 a). We have \(f'(x)=a|x|^{k-1}{\rm sign}(x).\) Thus the iterates of GD are given by \[x_{n+1}=x_{n}-\eta a|x_{n}|^{k-1}{\rm sign}(x_{n})=x_{n}(1-\eta a|x_{n}|^{s}),\quad n\ge0,\tag{3.12}\] where \(s=k-2.\) By assumption we have \[|x_{1}|=|x_{0}|\,|1-\eta a|x_{0}|^{s}|\le|x_{0}|,\] and \[\eta a|x_{1}|^{s}=\eta a|x_{0}|^{s}(1-\eta a|x_{0}|^{s})^{s}\in(0,\eta a|x_{0}|^{s})\subset(0,1).\] Also if for some \(n\ge1\) \[|x_{n}|\le|x_{n-1}|\quad\text{and}\quad\eta a|x_{n}|^{s}\in(0,1),\] then \[|x_{n+1}|=|x_{n}|(1-\eta a|x_{n}|^{s})\le|x_{n}|,\] and \[\eta a|x_{n+1}|^{s}=\eta a|x_{n}|^{s}(1-\eta a|x_{n}|^{s})^{s}\le\eta a|x_{n}|^{s}\in(0,1).\] Thus by induction \[|x_{n}|\le|x_{n-1}|\quad\forall n\ge1.\] Since the sequence \(|x_{n}|\) is monotone decreasing and bounded below, the limit \[\ell:=\lim_{n\to\infty}|x_{n}|\in[0,\infty)\] exists, and satisfies \[\eta a\ell^{s}\in[0,1).\] Assume for contradiction that \(\ell>0.\) Then \(|x_{n}|\ge\ell>0\) for all \(n,\) and \[|x_{n+1}|=|x_{n}|(1-\eta a|x_{n}|^{s})\le|x_{n}|(1-\eta a\ell^{s}).\] But this implies that \[|x_{n}|\le|x_{0}|(1-\eta a\ell^{s})^{n}\quad\forall n\ge0,\] which implies that \(\lim_{n\to\infty}|x_{n}|=0,\) which is a contradiction. Thus indeed \(\ell=0,\) i.e. \(x_{n}\to x_{*}=0.\)
Sketch of proof of Lemma 3.9 b). This is a heuristic argument for the rate \(n^{-1/s}\) of convergence, without rigorous proofs. Assume for simplicity that \(x_{0}>0.\) To investigate the speed of convergence we note that (3.12) implies that \[x_{n}=x_{0}(1-\eta ax_{0}^{s})(1-\eta ax_{1}^{s})\ldots(1-\eta ax_{n-1}^{s}),\quad\text{ for all }n\ge1,\] (where \(s=k-2\)). The product invites the application of a \(\log\) to both sides, yielding \[\log x_{n}=\log x_{n_{0}}+\sum_{m=n_{0}}^{n-1}\log(1-\eta ax_{m}^{s}),\quad\text{for all }n>n_{0},\tag{3.13}\] provided \(1-\eta ax_{m}^{s}>0\) for all \(m\ge0\) (so that all \(x_{m}>0\)), which holds here since \(\eta a|x_{m}|^{s}\in(0,1)\) for all \(m\) by the proof of a). We can furthermore use the approximation \(\log(1-x)\approx-x\) for \(n_{0}\) large enough to heuristically obtain \[\log x_{n}\approx\log x_{n_{0}}-\eta a\sum_{m=n_{0}}^{n-1}x_{m}^{s}.\tag{3.14}\] We expect the convergence to be slower than in Lemma 3.3, where it was exponential, and therefore make the ansatz “\(x_{n}\approx n^{-\alpha}\)” for some \(\alpha>0.\) Under this ansatz we would have \[\log x_{n}\approx-\alpha\log n,\tag{3.15}\] and \[\eta a\sum_{m=n_{0}}^{n-1}x_{m}^{s}\approx\eta a\sum_{m=n_{0}}^{n-1}m^{-\alpha s}.\tag{3.16}\] Recall that \[\sum_{m=n_{0}}^{n-1}m^{-\beta}<\sum_{m=n_{0}}^{\infty}m^{-\beta}<\infty\quad\text{if}\quad\beta>1,\] and \[\sum_{m=n_{0}}^{n-1}m^{-\beta}\approx\int_{n_{0}}^{n}m^{-\beta}dm\propto n^{1-\beta}\quad\text{if}\quad\beta<1.\] Thus (3.14), (3.15) and (3.16) are inconsistent if \(\alpha s\ne1,\) since they would imply that \(-\alpha\log n\approx\log x_{n_{0}}+O(1)\) (if \(\alpha s>1\)) resp. \(-\alpha\log n\approx\log x_{n_{0}}-cn^{1-\alpha s}\) for some constant \(c>0\) (if \(\alpha s<1\)). This indicates that the ansatz \(x_{n}\approx n^{-\alpha}\) is incorrect if \(\alpha\ne\frac{1}{s}.\) If \(\alpha=\frac{1}{s}\) on the other hand, the ansatz amounts to \(x_{m}^{s}\approx m^{-1},\) and since \(\sum_{m=n_{0}}^{n-1}m^{-1}\approx\log n\) this means that (3.14), (3.15) and (3.16) are consistent, in that all terms depending on \(n\) are of order \(\log n.\) This suggests that \(x_{n}\) is proportional to \(n^{-1/s},\) and the iterates of GD converge at rate \(n^{-1/s}.\) The constant of proportionality can be guessed by making the ansatz \(x_{n}\approx hn^{-1/s}\) for \(h>0,\) under which (3.14), (3.15) and (3.16) combine to \[-\frac{1}{s}\log n+\log h\approx\log x_{n_{0}}-\eta ah^{s}\log n.\] This is consistent if \(\eta ah^{s}=\frac{1}{s},\) so we expect that if convergent, the iterates of GD converge as \[|x_{n}-x_{*}|\approx\frac{1}{(s\eta a)^{1/s}}n^{-1/s},\] in agreement with (3.10).
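A quick numerical sanity check of the rate in Lemma 3.9 b): the sketch below (a minimal illustration; the parameters \(a,k,\eta,x_{0}\) are arbitrary admissible choices) iterates (3.12) and compares \(|x_{n}|\) with \((s\eta a)^{-1/s}n^{-1/s}.\)

```python
import numpy as np

# GD on f(x) = (a/k)|x|^k, k > 2, which has the degenerate minimizer x_* = 0; cf. (3.12).
a, k, eta = 1.0, 4.0, 0.1
s = k - 2
x = 0.5                                   # eta * a * |x_0|**s = 0.025 lies in (0, 1)
for n in range(1, 10**5 + 1):
    x = x - eta * a * abs(x) ** (k - 1) * np.sign(x)
    if n in (10**2, 10**3, 10**4, 10**5):
        predicted = (s * eta * a) ** (-1 / s) * n ** (-1 / s)
        print(f"n = {n:>6d}   |x_n| = {abs(x):.5e}   (s*eta*a)^(-1/s) n^(-1/s) = {predicted:.5e}")
```

The ratio of the two columns approaches 1 as \(n\) grows, in agreement with (3.10).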
Let us now investigate the case \(d\ge2.\) The next lemma deals with quadratic functions and generalizes Lemma 3.3 to higher dimension.
Note that any quadratic in \(d\) variables can be written as \(f(x)=\frac{1}{2}x^{T}Ax+b^{T}x+c\) for a symmetric matrix \(A\in\mathbb{R}^{d\times d},\) vector \(b\in\mathbb{R}^{d}\) and \(c\in\mathbb{R}.\) For such a matrix there always exists an orthonormal basis in which it is diagonal. If \(A\) has a negative eigenvalue the infimum of this \(f\) is \(-\infty.\) If \(A\) is positive definite (i.e. all eigenvalues positive), then the infimum is finite and is achieved at the unique local and global minimizer \(x_{*}=-A^{-1}b.\) If \(\lambda\) is an eigenvalue of \(A,\) the eigenspace of \(\lambda\) is the subspace of \(\mathbb{R}^{d}\) spanned by all eigenvectors with that eigenvalue (it can have any dimension between \(1\) and \(d\)).
Lemma 3.17 (GD for higher-dimensional quadratic \(f\) with non-degenerate minimum).
Let \(d\ge1\) and \(f:\mathbb{R}^{d}\to\mathbb{R}\) be given by \[f(x)=\frac{1}{2}x^{T}Ax+b^{T}x+c,\] where \(A\in\mathbb{R}^{d\times d}\) is positive definite and symmetric, and \(b\in\mathbb{R}^{d},c\in\mathbb{R}.\) Let \[0<\lambda_{\min}=\lambda_{1}\le\ldots\le\lambda_{d}=\lambda_{\max}\] denote the eigenvalues of \(A,\) and let \[x_{*}=-A^{-1}b\] denote the unique global and local minimizer of \(f.\) Let \(x_{n},n\ge0,\) denote the iterates of gradient descent with step-size \(\eta\) and starting at some \(x_{0}\in\mathbb{R}^{d}\setminus\{x_{*}\}.\)
a) If \(\eta<\frac{2}{\lambda_{\max}}\) then \[|x_{n}-x_{*}|\le e^{-\kappa n}|x_{0}-x_{*}|\quad\text{for all}\quad n\ge0,\tag{3.18}\] where \[\kappa=-\log\theta>0\quad\text{for}\quad\theta:=\max(\eta\lambda_{\max}-1,1-\eta\lambda_{\min}),\] (if \(\theta=0\) then \(\eta\lambda_{i}=1\) for all \(i,\) we interpret \(\kappa=\infty,\) and \(|x_{n}-x_{*}|=0\) \(\forall n\ge1\)).
b) If \(\eta\lambda_{\max}>2\) then \[|x_{n}-x_{*}|\ge e^{\gamma n}|\tilde{x}_{0}-x_{*}|,\tag{3.19}\] where \[\gamma=\log|1-\eta\lambda_{\max}|>0\] and where \(\tilde{x}_{0}\) denotes the projection of \(x_{0}\) onto the eigenspace of \(A\) corresponding to the eigenvalue \(\lambda_{\max}.\)
Remark 3.8. a) says that convergence is exponentially fast if \(\eta<\frac{2}{\lambda_{\max}}.\) The curvature of \(f\) is encoded by the eigenvalues \(\lambda_{1},\ldots,\lambda_{d}\) of \(A,\) so as in Lemma 3.3 higher curvature means a smaller step-size is required for convergence. The rate of convergence also depends on the curvature: for small \(\eta\) it is governed by \(\lambda_{\min}\) rather than \(\lambda_{\max}.\) So stability/convergence of the iteration is determined by the curvature in the most curved direction (\(\lambda_{\max}\)), while the convergence speed depends on the curvature in the flattest direction \((\lambda_{\min}).\) In particular, if the matrix \(A\) has a large condition number (the ratio \(\lambda_{\max}/\lambda_{\min}\)), then convergence will be slow.
Proof. Since \(\nabla f(x)=Ax+b\) the iterates of gradient descent are given by \[x_{n+1}=x_{n}-\eta(Ax_{n}+b),\quad n\ge0.\tag{3.20}\] Let \(y_{n}=x_{n}-x_{*}\) be the difference between iterate and minimizer. Substituting \(y_{n}+x_{*}\) for \(x_{n}\) in (3.20) and using that \(Ax_{*}+b=\nabla f(x_{*})=0\) we obtain \[y_{n+1}=y_{n}-\eta Ay_{n},\quad n\ge0.\] Let \(e_{1},\ldots,e_{d}\) be an orthonormal eigenbasis with corresponding eigenvalues \(\lambda_{\min}=\lambda_{1}\le\lambda_{2}\le\ldots\le\lambda_{d}=\lambda_{\max}.\) Letting \(z_{n}\) denote \(y_{n}\) written in this basis (i.e. \(z_{n,i}=y_{n}\cdot e_{i}\)), this turns into \[z_{n+1,i}=z_{n,i}-\eta\lambda_{i}z_{n,i},\quad i=1,\ldots,d,n\ge0.\] This puts us back in the setting of Lemma 3.3. As in the proof of that lemma, we obtain that \[z_{n,i}=(1-\eta\lambda_{i})^{n}z_{0,i},\quad i=1,\ldots,d,n\ge0.\] a) The sequence \(z_{n,i}\) converges to zero iff \(|1-\eta\lambda_{i}|<1.\) We have \[1-\eta\lambda_{\max}\le1-\eta\lambda_{i}\le1-\eta\lambda_{1},\] and thus \[|1-\eta\lambda_{i}|\le\max(1-\eta\lambda_{1},\eta\lambda_{\max}-1)=\theta.\] Thus all the sequences \(z_{n,i}\) converge to zero if \(\theta<1.\) If \(\eta<\frac{2}{\lambda_{\max}},\) then \(|1-\eta\lambda_{\max}|<1,\) so indeed \(\theta<1,\) and \(z_{n,i}\to0\) for all \(i=1,\ldots,d.\) Furthermore \[\begin{array}{ccccccl} |x_{n}-x_{*}|^{2} & = & |y_{n}|^{2} & = & |z_{n}|^{2} & = & \sum_{i=1}^{d}|1-\eta\lambda_{i}|^{2n}z_{0,i}^{2}\\ & & & & & \le & \sum_{i=1}^{d}\theta^{2n}z_{0,i}^{2}\\ & & & & & = & \theta^{2n}|z_{0}|^{2}\\ & & & & & = & e^{-2\kappa n}|z_{0}|^{2}\\ & & & & & = & e^{-2\kappa n}|x_{0}-x_{*}|^{2}. \end{array}\] Taking the square root of both sides we obtain (3.18).
b) If \(\eta\lambda_{\max}>2,\) then \(\theta=|1-\eta\lambda_{\max}|>1,\) so \[\begin{array}{cccclcl} |x_{n}-x_{*}|^{2} & = & |y_{n}|^{2} & = & |z_{n}|^{2} & = & \sum_{i=1}^{d}|1-\eta\lambda_{i}|^{2n}z_{0,i}^{2}\\ & & & & & \ge & \sum_{i:\lambda_{i}=\lambda_{\max}}|1-\eta\lambda_{\max}|^{2n}z_{0,i}^{2}\\ & & & & & = & \theta^{2n}\sum_{i:\lambda_{i}=\lambda_{\max}}z_{0,i}^{2}\\ & & & & & = & e^{2\gamma n}|\tilde{x}_{0}-x_{*}|^{2}. \end{array}\] Taking the square root of both sides we obtain (3.19).
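The interplay between \(\lambda_{\max}\) (stability) and \(\lambda_{\min}\) (speed) in Lemma 3.17 can be seen in a small experiment. In the sketch below (an illustration only; the matrix \(A\) and the step-sizes are arbitrary choices) GD is run on \(f(x)=\frac{1}{2}x^{T}Ax\) with an ill-conditioned diagonal \(A,\) and the error is compared with the bound \(\theta^{n}|x_{0}-x_{*}|.\)

```python
import numpy as np

# GD on f(x) = 0.5 x^T A x with b = 0, so x_* = 0; A has lambda_min = 1, lambda_max = 100.
A = np.diag([1.0, 100.0])
lam_min, lam_max = 1.0, 100.0
x0 = np.array([1.0, 1.0])
n_steps = 200

for eta in [0.009, 0.019, 0.021]:          # 2 / lambda_max = 0.02 is the stability threshold
    x = x0.copy()
    for _ in range(n_steps):
        x = x - eta * (A @ x)              # the gradient of f is A x
    theta = max(eta * lam_max - 1, 1 - eta * lam_min)
    bound = theta**n_steps * np.linalg.norm(x0)
    print(f"eta = {eta:.3f}   |x_200 - x_*| = {np.linalg.norm(x):.3e}   theta^200 |x_0 - x_*| = {bound:.3e}")
```

For \(\eta=0.021>\frac{2}{\lambda_{\max}}\) the iteration blows up, while for \(\eta=0.009\) it is stable but slow: the remaining error is dominated by the flat direction with eigenvalue \(\lambda_{\min}.\)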
This result extends to a more general class of \(f,\) namely those that have Lipschitz continuous gradient and are strongly convex. The meaning of strongly convex is as follows.
Definition (strong convexity).
A differentiable function \({f:\mathbb{R}^{d}\to\mathbb{R}},\) \(d\ge1,\) is \(\mu\)-strongly convex if \(\mu>0\) and \[f(x+p)\ge f(x)+p^{T}\nabla f(x)+\frac{\mu}{2}|p|^{2}\quad\text{for all }\quad x,p\in\mathbb{R}^{d}.\]
A function like \(f(x)=x^{4}\) is strictly convex but not strongly convex. However every strongly convex function is strictly convex. For a twice differentiable function \(\mu\)-strong convexity is equivalent to \(\nabla^{2}f(x)\ge\mu I\) for all \(x\in\mathbb{R}^{d}.\)
We will use the following lemmas for functions with \(L\)-Lipschitz gradient.
Lemma 3.21. If \(d\ge1\) and \(f:\mathbb{R}^{d}\to\mathbb{R}\) is differentiable with \(L\)-Lipschitz continuous gradient, then \[f(x+p)\le f(x)+p^{T}\nabla f(x)+\frac{L}{2}|p|^{2}\quad\text{for all }\quad x,p\in\mathbb{R}^{d}.\]
Proof. Since \(f\) is differentiable \[\begin{array}{ccl} f(x+p)-f(x) & = & \int_{0}^{1}\partial_{t}\{f(x+tp)\}dt\\ & = & \int_{0}^{1}\nabla f(x+tp)\cdot p\,dt\\ & = & \nabla f(x)\cdot p+p\cdot\int_{0}^{1}(\nabla f(x+tp)-\nabla f(x))dt. \end{array}\] Since \(\nabla f\) is \(L\)-Lipschitz, \[|\nabla f(x+tp)-\nabla f(x)|\le L|tp|=Lt|p|\quad\text{for }t\in[0,1].\] Thus \[\left|p\cdot\int_{0}^{1}(\nabla f(x+tp)-\nabla f(x))dt\right|\le L|p|^{2}\int_{0}^{1}tdt=\frac{L}{2}|p|^{2},\] and combining the last display with the first one gives \(f(x+p)\le f(x)+p^{T}\nabla f(x)+\frac{L}{2}|p|^{2}.\)
Lemma 3.22. If \(d\ge1\) and \(f:\mathbb{R}^{d}\to\mathbb{R}\) is differentiable with \(L\)-Lipschitz continuous gradient, and has a global minimizer \(x_{*},\) then \[f(x)-f(x_{*})\ge\frac{1}{2L}|\nabla f(x)|^{2}\quad\text{for all }\quad x\in\mathbb{R}^{d}.\]
Proof. Let \(x_{0}=x\) and \(x_{1}=x-\eta\nabla f(x),\) for arbitrary \(\eta>0.\) By Lemma 3.21 \[\begin{array}{ccl} f(x_{1}) & \le & f(x_{0})-\eta\nabla f(x)^{T}\nabla f(x)+\frac{L}{2}\eta^{2}|\nabla f(x)|^{2}\\ & = & f(x_{0})+\left(\frac{L}{2}\eta^{2}-\eta\right)|\nabla f(x)|^{2}. \end{array}\] Optimizing over \(\eta\) leads to the choice \(\eta=\frac{1}{L},\) giving \[f(x_{1})\le f(x_{0})-\frac{1}{2L}|\nabla f(x)|^{2},\tag{3.23}\] or equivalently \[f(x_{0})-f(x_{1})\ge\frac{1}{2L}|\nabla f(x)|^{2}.\] Since \(x_{*}\) is a global minimizer \(f(x_{0})-f(x_{*})\ge f(x_{0})-f(x_{1}),\) so the claim follows.
Assume \(d\ge1,\) and let \(f:\mathbb{R}^{d}\to\mathbb{R}\) be \(\mu\)-strongly convex and differentiable with \(L\)-Lipschitz continuous gradient. Let \(x_{*}\) denote the global minimizer of \(f.\) Let \(\eta>0,\) \(x_{0}\in\mathbb{R}^{d},\) and let \(x_{n},n\ge0,\) denote the iterates of gradient descent with step size \(\eta\) and starting at \(x_{0},\) i.e. \(x_{n+1}=x_{n}-\eta\nabla f(x_{n}),n\ge0.\) If \(\eta\le\min(\frac{1}{L},\frac{1}{\mu})\) then \[|x_{n}-x_{*}|\le|1-\eta\mu|^{\frac{n}{2}}|x_{0}-x_{*}|\quad\text{for all}\quad n\ge0,\] i.e. \(x_{n}\) converges exponentially fast to \(x_{*}.\)
Proof. We prove that \[|x_{n+1}-x_{*}|^{2}\le(1-\eta\mu)|x_{n}-x_{*}|^{2}\quad\text{ for all }\quad n\ge0.\tag{3.25}\] The claim then follows by induction and by taking square roots. To prove (3.25), note that \[\begin{array}{ccl} |x_{n+1}-x_{*}|^{2} & = & |x_{n}-\eta\nabla f(x_{n})-x_{*}|^{2}\\ & = & |x_{n}-x_{*}|^{2}-2\eta\nabla f(x_{n})^{T}(x_{n}-x_{*})+\eta^{2}|\nabla f(x_{n})|^{2}. \end{array}\tag{3.26}\] Consider \(\nabla f(x_{n})^{T}(x_{n}-x_{*}).\) By the strong convexity of \(f\) (applied with \(x=x_{n}\) and \(p=x_{*}-x_{n}\)) \[f(x_{*})\ge f(x_{n})-(x_{n}-x_{*})^{T}\nabla f(x_{n})+\frac{\mu}{2}|x_{n}-x_{*}|^{2},\] which can be rearranged to \[f(x_{*})-f(x_{n})-\frac{\mu}{2}|x_{n}-x_{*}|^{2}\ge-(x_{n}-x_{*})^{T}\nabla f(x_{n}).\] Thus (3.26) is at most \[\begin{array}{l} |x_{n}-x_{*}|^{2}+2\eta(f(x_{*})-f(x_{n}))-\mu\eta|x_{n}-x_{*}|^{2}+\eta^{2}|\nabla f(x_{n})|^{2}\\ =(1-\mu\eta)|x_{n}-x_{*}|^{2}+2\eta(f(x_{*})-f(x_{n}))+\eta^{2}|\nabla f(x_{n})|^{2}. \end{array}\] Using Lemma 3.22 to bound the final term we get that the last line is at most \[\begin{array}{l} (1-\mu\eta)|x_{n}-x_{*}|^{2}+2\eta(f(x_{*})-f(x_{n}))+2L\eta^{2}(f(x_{n})-f(x_{*}))\\ =(1-\mu\eta)|x_{n}-x_{*}|^{2}-2\eta(1-\eta L)(f(x_{n})-f(x_{*})). \end{array}\] Since \(f(x_{n})\ge f(x_{*}),\) the last term is nonpositive for \(\eta\le\frac{1}{L},\) and dropping it proves (3.25).
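The theorem is not restricted to quadratics. As a hedged illustration, the sketch below applies it to \(f(x)=\frac{\mu}{2}|x|^{2}+\log\sum_{i}e^{x_{i}},\) which is \(\mu\)-strongly convex and whose gradient is \(L\)-Lipschitz with \(L\le\mu+1\); this choice of \(f,\) the bound on \(L,\) and the reference minimizer obtained by simply running the iteration for a long time are assumptions of the sketch, not statements from these notes.

```python
import numpy as np

# f(x) = (mu/2)|x|^2 + log-sum-exp(x): mu-strongly convex, gradient Lipschitz with L <= mu + 1.
mu, d = 0.5, 5
rng = np.random.default_rng(0)

def grad_f(x):
    z = np.exp(x - x.max())            # numerically stable softmax
    return mu * x + z / z.sum()

eta = 1.0 / (mu + 1.0)                 # eta <= min(1/L, 1/mu)

# Approximate reference minimizer: run the iteration for a long time from the origin.
x_ref = np.zeros(d)
for _ in range(5000):
    x_ref = x_ref - eta * grad_f(x_ref)

x = rng.normal(size=d)
e0 = np.linalg.norm(x - x_ref)
for n in range(1, 51):
    x = x - eta * grad_f(x)
    if n % 10 == 0:
        bound = (1 - eta * mu) ** (n / 2) * e0
        print(f"n = {n:2d}   |x_n - x_*| = {np.linalg.norm(x - x_ref):.3e}   bound = {bound:.3e}")
```

The measured error stays below the guaranteed bound \((1-\eta\mu)^{n/2}|x_{0}-x_{*}|\); in practice it is typically much smaller, since the bound only uses the worst-case constants \(\mu\) and \(L.\)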
Without convexity, we cannot provide any guarantee of convergence to a global minimizer, nor of fast convergence (recall Figure 3.4). Proposition 3.32 in the appendix is an optional, non-examinable weaker result, which does not assume convexity but gives conditions under which at least \(|\nabla f(x_{n})|\) becomes small as \(n\to\infty.\)
0.3.3 Newton-Raphson method
The Newton-Raphson method is a variant of gradient descent that adapts the step-size to the flatness of the function. It uses the iteration \[x_{n+1}=x_{n}-\eta(\nabla^{2}f(x_{n}))^{-1}\nabla f(x_{n}),\quad n\ge0,\tag{3.28}\] see Algorithm 3.27.
Indeed, an important drawback of gradient descent is its sensitivity to step-size, both in terms of stability and convergence speed. There are many optimization methods that can be thought of as extensions of gradient descent with a varying, adaptive step-size. The first such method we consider is the Newton-Raphson method. It adapts the step-size based on the second derivatives of the objective function at the current point.
More broadly, the Newton-Raphson method is an algorithm to numerically solve the system of equations \[g(x)=0\quad\text{for a differentiable}\quad g:\mathbb{R}^{d}\to\mathbb{R}^{d}.\] Given a starting point \(x_{0}\in\mathbb{R}^{d}\) it uses the iteration \[x_{n+1}=x_{n}-(Dg(x_{n}))^{-1}g(x_{n}),\quad n\ge0,\] where \(Dg(x)\) denotes the Jacobian of \(g\) at \(x,\) i.e. the \(d\times d\) matrix \((\partial_{j}g_{i}(x))_{i,j=1,\ldots,d}.\) Under favorable circumstances this iteration converges to an \(x_{*}\) s.t. \(g(x_{*})=0.\) It can be used for optimization of a twice differentiable objective function \(f:\mathbb{R}^{d}\to\mathbb{R}\) by replacing \(g\) with \(\nabla f,\) so that the method approximates solutions of the critical point equation \(\nabla f(x)=0.\) Then the iteration reads \[x_{n+1}=x_{n}-(\nabla^{2}f(x_{n}))^{-1}\nabla f(x_{n}),\quad n\ge0,\tag{3.29}\] where \(\nabla^{2}f(x)\) is the Hessian of \(f\) at \(x\) (i.e. the symmetric matrix \((\partial_{ij}f(x))_{i,j=1,\ldots,d}).\) One would generally use it for objective functions such that \(\nabla^{2}f>0,\) at least in a region around the local minima. If \(\nabla^{2}f(x)\) is degenerate for some \(x,\) then in the (usually unlikely) event that \(x_{n}\) equals one of these \(x,\) one can take, say, a standard GD step for that \(n.\) Newton-Raphson is motivated by quadratic approximation of \(f\) in a neighborhood of \(x_{n}\): Taylor expansion gives \[f(x)\approx f(x_{n})+b\cdot(x-x_{n})+\frac{1}{2}(x-x_{n})^{T}A(x-x_{n})\] for \(x\) close to \(x_{n},\) where \[b=\nabla f(x_{n})\quad\text{and}\quad A=\nabla^{2}f(x_{n}).\] If \(A>0\) the quadratic \(x\mapsto f(x_{n})+b\cdot(x-x_{n})+\frac{1}{2}(x-x_{n})^{T}A(x-x_{n})\) has a unique global minimizer at \(x=x_{n}-A^{-1}b.\) Note that this is precisely the r.h.s. of (3.29). The Newton-Raphson iteration thus moves to the minimum of the best quadratic approximation of the function at the current point.
The method (3.29) can diverge for some starting points \(x_{0},\) even if \(f\) is a very nice function, say strongly convex. To increase the stability of the method one introduces a step-size \(\eta>0\) akin to that of GD, arriving at the damped Newton-Raphson method (3.28). For a univariate function \(f:\mathbb{R}\to\mathbb{R}\) the damped Newton-Raphson iteration reduces to \[x_{n+1}=x_{n}-\eta\frac{f'(x_{n})}{f''(x_{n})},\quad n\ge0.\tag{3.30}\] Note that each step of (3.30) can be thought of as a step of gradient descent with effective step-size \(\eta_{{\rm eff}}=\frac{\eta}{f''(x_{n})}.\) This effective step-size adapts to the flatness of the function, as measured by the second derivative: a flatter function has smaller second derivative, so the step-size is increased.
If one writes the higher-dimensional rule (3.28) in a basis \((e_{1},\ldots,e_{d})\) where \(A=\nabla^{2}f(x_{n})\) is a diagonal matrix with diagonal entries given by its eigenvalues \(\lambda_{i}(x_{n})=\partial_{e_{i}}^{2}f(x_{n})\) (assumed to be positive), it turns into \[x_{n+1,i}=x_{n,i}-\eta\frac{\partial_{e_{i}}f(x_{n})}{\partial_{e_{i}}^{2}f(x_{n})}=x_{n,i}-\eta\frac{\partial_{e_{i}}f(x_{n})}{\lambda_{i}(x_{n})},\quad i=1,\dots,d,n\ge0.\tag{3.31}\] The rule (3.31) can be thought of as GD in each direction \(e_{i},\) with effective step-size \(\eta_{{\rm eff},i}=\frac{\eta}{\partial_{e_{i}}^{2}f(x_{n})},\) which adapts to the flatness of the function in each direction.
Under suitable conditions the Newton-Raphson method converges faster than gradient descent. The price paid for this faster convergence is that one needs to compute the Hessian and solve the corresponding system of linear equations (e.g. using numpy.linalg.solve in the Python library NumPy, linsolve in Matlab/Octave, or LinearSolve in Mathematica/Mathics).
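The following minimal sketch implements the damped Newton-Raphson iteration (3.28), solving the linear system with numpy.linalg.solve rather than forming the inverse Hessian explicitly. The function names and the test quadratic are illustrative choices, not part of the notes.

```python
import numpy as np

def damped_newton(grad, hess, x0, eta=1.0, n_steps=50):
    """Damped Newton-Raphson (3.28): x_{n+1} = x_n - eta * hess(x_n)^{-1} grad(x_n),
    where the Newton step is obtained by solving a linear system."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_steps):
        step = np.linalg.solve(hess(x), grad(x))
        x = x - eta * step
    return x

# Test on the quadratic f(x) = 0.5 x^T A x + b^T x, whose minimizer is x_* = -A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x1 = damped_newton(lambda x: A @ x + b, lambda x: A, x0=np.zeros(2), eta=1.0, n_steps=1)
print(x1, -np.linalg.solve(A, b))      # with eta = 1, a single step already reaches x_*
```

With \(\eta=1\) a single Newton step on a quadratic lands exactly on the minimizer, which is the behaviour described in the lemmas below.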
Lemma 3.13 (damped Newton-Raphson for a univariate quadratic \(f\)).
Let \(f\) be given by \[f(x)=\frac{a}{2}x^{2}+bx+c,\] for some \(a>0,b,c\in\mathbb{R}.\) Denote the unique global minimizer of \(f\) by \(x_{*}=-\frac{b}{a}.\) Denote the iterates of the damped Newton-Raphson method with step-size \(\eta>0\) starting at some \(x_{0}\ne x_{*}\) by \(x_{n},n\ge0.\)
a) If \(\eta\in(0,1)\cup(1,2),\) then \[|x_{n}-x_{*}|=|x_{0}-x_{*}|e^{-\kappa n}\qquad\forall n\ge0,\] where \[\kappa=-\log|1-\eta|>0.\]
b) If \(\eta=1,\) then \(|x_{n}-x_{*}|=0\) for all \(n\ge1.\)
Remark 3.14. a) says that if \(\eta\in(0,1)\cup(1,2)\) then the iterates converge to \(x_{*}\) exponentially fast. Note that the rate of convergence depends only on \(\eta,\) not on the curvature \(f''(x)=a\) as for GD in Lemma 3.3 b). This reflects the fact that Newton-Raphson automatically adapts the step-size based on the curvature.
Proof. We have \(f'(x)=ax+b\) and \(f''(x)=a,\) so \[\frac{f'(x)}{f''(x)}=x+\frac{b}{a}=x-x_{*}.\] Thus the iterates satisfy \[x_{n+1}=x_{n}-\eta(x_{n}-x_{*}),\quad n\ge0.\] This can be written as \[x_{n+1}-x_{*}=(x_{n}-x_{*})(1-\eta),\quad n\ge0.\] Thus \[|x_{n}-x_{*}|=|x_{0}-x_{*}||1-\eta|^{n},\quad n\ge0.\] For \(\eta\in(0,1)\cup(1,2)\) we have \[\kappa=-\log|1-\eta|>0\] and \[|1-\eta|^{n}=e^{-\kappa n},\] giving the claim.
Lemma 3.32 (damped Newton-Raphson for a higher-dimensional quadratic \(f\)).
For \(d\ge1\) let \(f:\mathbb{R}^{d}\to\mathbb{R}\) be given by the general quadratic \[f(x)=\frac{1}{2}x^{T}Ax+b^{T}x+c,\] where \(A\in\mathbb{R}^{d\times d}\) is symmetric and positive definite, and \(b\in\mathbb{R}^{d},c\in\mathbb{R}.\) Denote the unique local and global minimizer of \(f\) by \(x_{*}=-A^{-1}b,\) and let \(x_{n},n\ge0,\) denote the iterates of the damped Newton-Raphson method with step-size \(\eta>0,\) starting at some \(x_{0}\in\mathbb{R}^{d}\setminus\{x_{*}\}.\)
a) If \(\eta\in(0,1)\cup(1,2)\) then \[|x_{n}-x_{*}|\le|x_{0}-x_{*}|e^{-\kappa n},\] where \[\kappa=-\log|1-\eta|>0.\] b) If \(\eta=1\) then \(|x_{n}-x_{*}|=0\) for all \(n\ge1.\)
Remark 3.16. a) says that if \(\eta\in(0,1)\cup(1,2)\) then the iterates converge exponentially fast, at a rate which depends only on the step-size and not on the curvature (i.e. not on the eigenvalues of \(A,\) as opposed to in Lemma 3.17). This reflects the per-dimension adaptive effective step-size discussed above (see (3.31)). The take-away from this calculation is that for objective functions with an ill-conditioned Hessian, the Newton-Raphson method can accommodate a much larger step-size, and will converge much faster than gradient descent.
Proof of Lemma 3.32. We have \(\nabla f(x)=Ax+b\) and \(\nabla^{2}f(x)=A.\) The iterates of damped Newton-Raphson are \[x_{n+1}=x_{n}-\eta A^{-1}(Ax_{n}+b)=x_{n}(1-\eta)+\eta x_{*},\quad n\ge0.\] This can be rewritten as \[x_{n+1}-x_{*}=(x_{n}-x_{*})(1-\eta),\quad n\ge0,\] from which it follows that for \(\eta\in(0,1)\cup(1,2)\) \[|x_{n}-x_{*}|\le|x_{0}-x_{*}||1-\eta|^{n}=|x_{0}-x_{*}|e^{-\kappa n},\quad n\ge0.\]
Lemma (damped Newton-Raphson for univariate \(f\) with degenerate minimum). For \(a>0,k>2\) let \(f:\mathbb{R}\to\mathbb{R}\) be given by \[f(x)=\frac{a}{k}|x|^{k},\] with the unique local and global minimizer \(x_{*}=0.\) Let \(x_{n},n\ge0,\) denote the iterates of the damped Newton-Raphson method with step-size \(\eta>0,\) starting from some \(x_{0}\ne x_{*}.\)
a) If \(\eta\in(0,2(k-1))\) and \(\eta\ne k-1,\) then \[|x_{n}-x_{*}|=e^{-\kappa n}|x_{0}-x_{*}|\] where \[\kappa=-\log\Big|1-\frac{\eta}{k-1}\Big|>0.\] b) If \(\eta=k-1\) then \(|x_{n}-x_{*}|=0\) \(\forall n\ge1.\)
Remark 3.18. a) says that damped Newton-Raphson converges exponentially fast for every \(\eta\in(0,2(k-1))\) with \(\eta\ne k-1\) (and reaches \(x_{*}\) in a single step if \(\eta=k-1\)), as opposed to gradient descent which converges at the much slower rate \(\sim n^{-\frac{1}{k-2}}\) (recall Lemma 3.9).
Proof. We have \(f'(x)=a|x|^{k-1}{\rm sign}(x)\) and \(f''(x)=a(k-1)|x|^{k-2},\) so the iterates of Newton-Raphson read \[x_{n+1}=x_{n}-\eta\frac{a|x_{n}|^{k-1}{\rm sign}(x_{n})}{a(k-1)|x_{n}|^{k-2}}=x_{n}-\frac{\eta}{k-1}x_{n}=\Big(1-\frac{\eta}{k-1}\Big)x_{n}.\] Thus \[|x_{n}-x_{*}|=\Big|1-\frac{\eta}{k-1}\Big|^{n}|x_{0}-x_{*}|=e^{-\kappa n}|x_{0}-x_{*}|.\]
The big drawback of the Newton-Raphson method is that it requires the computation of the \(d\times d\) Hessian \(\nabla^{2}f(x_{n}),\) and the solution of the linear equation of the form \(Ax=b.\) This does not scale well to large dimension \(d.\) For instance if \(d\) is in the millions, the Hessian is a matrix with millions of rows and columns, and \(\ge10^{12}\) entries. This is too large to feasibly compute (unless the Hessian has special structure that can be exploited to efficiently solve the equation \(Ax=b\)). In the following sections we present variants of GD that avoid using the Hessian, and make sure to adaptively set the step-size using only “first order” information (i.e. information about the gradient, rather than the Hessian).
0.3.4 Gradient descent with momentum
Algorithm 3.34 (gradient descent with momentum): initialize \(x_{-1}\leftarrow x_{0}\) and then iterate the update rule (3.35) below.
Gradient descent with momentum is a variant of the algorithm which aims to speed up optimization progress by the addition of a momentum term to the update rule. The update rule turns into \[x_{n+1}=x_{n}-\eta\nabla f(x_{n})+\beta(x_{n}-x_{n-1}),\quad n\ge0,\tag{3.35}\] where \(\beta\in[0,1)\) is a parameter governing the strength of the momentum effect, see Algorithm 3.34. We use the convention \(x_{-1}=x_{0},\) so that the first step is just standard GD. A common choice for the parameter is \(\beta=0.9.\) The following plot illustrates the smoothing and accelerating properties of momentum.
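A minimal sketch of gradient descent with momentum, implementing the update rule (3.35) with the convention \(x_{-1}=x_{0}\) (the helper name, the test matrix and the parameter values are illustrative assumptions, not part of Algorithm 3.34 itself):

```python
import numpy as np

def gd_momentum(grad, x0, eta, beta=0.9, n_steps=200):
    """Iterate x_{n+1} = x_n - eta * grad(x_n) + beta * (x_n - x_{n-1}), with x_{-1} = x_0."""
    x_prev = np.asarray(x0, dtype=float).copy()
    x = x_prev.copy()
    for _ in range(n_steps):
        x_next = x - eta * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Ill-conditioned quadratic f(x) = 0.5 x^T A x, minimizer x_* = 0.
A = np.diag([1.0, 50.0])
eta, beta = 0.05, 0.9                   # eta * lambda_max = 2.5 > 2: plain GD diverges here

print(np.linalg.norm(gd_momentum(lambda x: A @ x, np.array([1.0, 1.0]), eta, beta)))

x_gd = np.array([1.0, 1.0])             # plain GD with the same step-size, for comparison
for _ in range(200):
    x_gd = x_gd - eta * (A @ x_gd)
print(np.linalg.norm(x_gd))
```

With this step-size plain GD diverges along the stiff direction (\(\eta\lambda_{\max}=2.5>2\)), while the momentum iteration still converges, consistent with Lemma 3.20 and Remark 3.21 below.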
Lemma 3.19 (linear recurrences of order two).
Let \(u,v\in\mathbb{R}\) and let \(z_{n},n\ge0,\) be a real sequence such that \[z_{n+1}=uz_{n}+vz_{n-1}\quad\forall n\ge1,\tag{3.36}\] and let \(r_{1},r_{2}\) be the (possibly complex) roots of the quadratic \[r^{2}=ur+v.\tag{3.37}\]
a) If \(r_{2}\ne r_{1},\) then \[z_{n}=\frac{r_{2}z_{0}-z_{1}}{r_{2}-r_{1}}r_{1}^{n}+\frac{-r_{1}z_{0}+z_{1}}{r_{2}-r_{1}}r_{2}^{n},\quad\forall n\ge0.\tag{3.38}\]
b) If \(r_{1}=r_{2}=:r\) then \[z_{n}=z_{0}r^{n}+\left(\frac{z_{1}}{r}-z_{0}\right)nr^{n}\quad\forall n\ge0\tag{3.39}\]
Proof. The equation (3.36) is a linear difference equation, as encountered in the course M12 about ODEs. To find the general family of solutions of such an equation, we consider its characteristic equation (3.37). If its roots \(r_{1},r_{2}\) satisfy \(r_{1}\ne r_{2},\) then the sequence \(z_{n}\) must satisfy \[z_{n}=ar_{1}^{n}+br_{2}^{n},\quad n\ge0\tag{3.40}\] for some \(a,b\in\mathbb{C}.\) Even without assuming any results from M12, this can easily be proved using induction. From the cases \(n=0\) and \(n=1\) of (3.40) we derive the equations \[\begin{cases} a+b=z_{0},\\ ar_{1}+br_{2}=z_{1}, \end{cases}\] whose unique solution is \[a=\frac{r_{2}z_{0}-z_{1}}{r_{2}-r_{1}},\qquad b=\frac{-r_{1}z_{0}+z_{1}}{r_{2}-r_{1}}.\] This proves (3.38).
Similarly, if \(r_{1}=r_{2}=:r\) then the sequence \(z_{n}\) must satisfy \[z_{n}=ar^{n}+bnr^{n},\quad n\ge0,\tag{3.41}\] for some \(a,b\in\mathbb{C},\) which similarly implies that (3.39) must hold.
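The closed forms (3.38)–(3.39) are easy to verify numerically; the following sketch (with arbitrarily chosen \(u,v,z_{0},z_{1}\)) compares (3.38) with a direct evaluation of the recurrence (3.36).

```python
import numpy as np

u, v = 0.7, 0.2
z0, z1 = 1.0, -0.3
r1, r2 = np.roots([1.0, -u, -v])        # roots of r^2 - u r - v = 0, i.e. of r^2 = u r + v

z = [z0, z1]
for n in range(1, 30):
    z.append(u * z[n] + v * z[n - 1])   # z_{n+1} = u z_n + v z_{n-1}

n = np.arange(31)
closed = ((r2 * z0 - z1) / (r2 - r1)) * r1**n + ((-r1 * z0 + z1) / (r2 - r1)) * r2**n
print(np.max(np.abs(np.array(z) - closed.real)))   # agreement up to rounding error
```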
Let us investigate the iterates of GD with momentum for a univariate quadratic \(f.\)
Lemma 3.20 (GD with momentum for univariate quadratic \(f\)).
Let \(f:\mathbb{R}\to\mathbb{R}\) be the quadratic given by \[f(x)=\frac{a}{2}x^{2}+bx+c,\] for some \(a>0,\) \(b,c\in\mathbb{R}.\) Let \(x_{*}=-\frac{b}{a}\) denote the unique local and global minimizer of \(f,\) and let \(x_{n},n\ge0,\) denote the iterates of gradient descent with momentum, starting at some \(x_{0}\in\mathbb{R}\setminus\{x_{*}\},\) with step-size \(\eta>0\) and momentum parameter \(\beta\in(0,1).\)
a) If \(\eta\le a^{-1}(1+\sqrt{\beta})^{2},\) then there are constants \(c,\kappa>0\) (which depend on \(\eta a\) and \(\beta\)) s.t. \[|x_{n}-x_{*}|\le ce^{-\kappa n}|x_{0}-x_{*}|,\qquad\forall n\ge0.\tag{3.42}\]
b) If \(\eta<a^{-1}(1-\sqrt{\beta})^{2},\) then \[\kappa>-\log|1-\eta a|.\tag{3.43}\]
Remark 3.21. a) says that GD with momentum converges exponentially fast for step-sizes up to \(a^{-1}(1+\sqrt{\beta})^{2},\) which if \(\beta>(\sqrt{2}-1)^{2}\approx0.17\) includes step-sizes for which GD diverges (cf. Lemma 3.3 a), c)). This shows that momentum can have a stabilizing effect, by damping large oscillations.
b) says that if \(\eta<a^{-1}(1-\sqrt{\beta})^{2}\) then GD with momentum converges exponentially fast at a rate faster than that of GD without momentum (cf. Lemma 3.3 b)).
c) The conditions on \(\eta\) in a) and b) are not sharp: there are some larger \(\eta\) for which the conclusions also hold. The complete picture can be investigated graphically, see Figure 3.46.
Proof. We have \(f'(x)=ax+b,\) so the iterates satisfy \[x_{n+1}=x_{n}-\eta(ax_{n}+b)+\beta(x_{n}-x_{n-1}),\quad n\ge0.\] Letting \(y_{n}=x_{n}-x_{*}\) this turns into \[y_{n+1}=(1-a\eta)y_{n}+\beta(y_{n}-y_{n-1}),\quad n\ge0,\] where \(y_{0}=y_{-1},\) which in turn can be rewritten as \[y_{n+1}=\theta y_{n}-\beta y_{n-1},\quad n\ge0,\] for \[\theta:=1-a\eta+\beta.\] Let \[r_{1}=\frac{1}{2}(\theta-\sqrt{\theta^{2}-4\beta}),\quad r_{2}=\frac{1}{2}(\theta+\sqrt{\theta^{2}-4\beta}),\tag{3.44}\] denote the roots of the quadratic \(r^{2}=\theta r-\beta.\)
We have \[r_{1}=r_{2}\iff\theta^{2}=4\beta\iff\eta a\in\{(1-\sqrt{\beta})^{2},(1+\sqrt{\beta})^{2}\}.\] Thus if \(\eta a\notin\{(1-\sqrt{\beta})^{2},(1+\sqrt{\beta})^{2}\},\) then using Lemma 3.19 a) for \(z_{n}=y_{n-1},n\ge0\) (so that \(z_{0}=z_{1}=y_{0}\) and \(y_{n}=z_{n+1}\)) proves that \[y_{n}=y_{0}\left(\frac{1-r_{1}}{r_{2}-r_{1}}r_{2}^{n+1}-\frac{1-r_{2}}{r_{2}-r_{1}}r_{1}^{n+1}\right),\quad\forall n\ge0.\tag{3.45}\] From (3.45) we obtain \[|y_{n}|\le|y_{0}|\left(\frac{|1-r_{1}|}{|r_{2}-r_{1}|}|r_{2}|^{n+1}+\frac{|1-r_{2}|}{|r_{2}-r_{1}|}|r_{1}|^{n+1}\right)\le c\max(|r_{1}|,|r_{2}|)^{n}|y_{0}|,\] for the constant \(c=\frac{|1-r_{1}|+|1-r_{2}|}{|r_{2}-r_{1}|}\max(|r_{1}|,|r_{2}|).\) Thus (3.42) holds for \[\kappa=-\log\max(|r_{1}|,|r_{2}|).\] To show that \(\kappa>-\log|1-\eta a|>0\) for \(\eta a<(1-\sqrt{\beta})^{2},\) note that for such \(\eta a\) the roots \(r_{1},r_{2}\) are both real and \(0<r_{1}<r_{2},\) so \[\max(|r_{1}|,|r_{2}|)=|r_{2}|.\] Also \[|r_{2}|<\frac{1}{2}(\theta+\sqrt{\theta^{2}-4(1-\eta a)\beta})=\frac{1}{2}(1-\eta a+\beta+\sqrt{(1-\eta a-\beta)^{2}}).\] Furthermore \[1-\eta a-\beta\ge1-(1-\sqrt{\beta})^{2}-\beta=2\sqrt{\beta}-2\beta=2\sqrt{\beta}(1-\sqrt{\beta})>0,\] so \[\frac{1}{2}(1-\eta a+\beta+\sqrt{(1-\eta a-\beta)^{2}})=\frac{1}{2}(1-\eta a+\beta+1-\eta a-\beta)=1-\eta a.\] Thus \[|r_{2}|<|1-\eta a|<1.\] This proves (3.43) and that \(\kappa>0\) for \(\eta a\in(0,(1-\sqrt{\beta})^{2}).\)
If \(\eta a\in((1-\sqrt{\beta})^{2},(1+\sqrt{\beta})^{2}),\) then \(r_{1}\) and \(r_{2}\) are both complex and in fact \[r_{1}=\frac{1}{2}(\theta-i\sqrt{4\beta-\theta^{2}}),\quad r_{2}=\frac{1}{2}(\theta+i\sqrt{4\beta-\theta^{2}}).\] Then \[|r_{1}|=|r_{2}|=\frac{1}{2}\sqrt{\theta^{2}+(4\beta-\theta^{2})}=\sqrt{\beta}\in(0,1).\] Since \(-\log\sqrt{\beta}>0\) this proves that \(\kappa>0\) when \(a\eta\in((1-\sqrt{\beta})^{2},(1+\sqrt{\beta})^{2}).\)
If \(\eta a=(1-\sqrt{\beta})^{2},\) then \[r_{1}=r_{2}=r:=\frac{\theta}{2}=\frac{1-\eta a+\beta}{2}=\frac{1-(1-\sqrt{\beta})^{2}+\beta}{2}=\sqrt{\beta},\] and by Lemma 3.19 b) (again applied to \(z_{n}=y_{n-1}\)) we have \[y_{n}=y_{0}\left(r^{n+1}+(r^{-1}-1)(n+1)r^{n+1}\right),\quad\forall n\ge0.\] Thus \[|y_{n}|\le|y_{0}|\left(r^{n+1}+r^{-1}(n+1)r^{n+1}\right)\le2(n+1)(\sqrt{\beta})^{n}|y_{0}|.\] Since \(\sqrt{\beta}<\beta^{1/4}<1,\) there exists a constant \(c_{2}>0\) (depending on \(\beta\)) s.t. \((n+1)(\sqrt{\beta})^{n}\le c_{2}(\beta^{1/4})^{n}\) for all \(n\ge0.\) Thus \[|y_{n}|\le c(\beta^{1/4})^{n}|y_{0}|\quad\forall n\ge0,\] for \(c=2c_{2},\) so (3.42) holds with \(\kappa=-\frac{1}{4}\log\beta>0\) when \(\eta a=(1-\sqrt{\beta})^{2}.\) The boundary case \(\eta a=(1+\sqrt{\beta})^{2},\) where \(r_{1}=r_{2}=-\sqrt{\beta},\) is handled in the same way.
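The quantities appearing in the proof can be used to map out the convergence rate numerically: the per-step contraction factor of GD with momentum on this quadratic is \(\max(|r_{1}|,|r_{2}|)\) with \(r_{1},r_{2}\) as in (3.44). The sketch below (an illustration only; the grid of values of \(\eta a\) is an arbitrary choice) compares it with the plain-GD factor \(|1-\eta a|,\) which is one way to produce the kind of picture referred to in Remark 3.21 c).

```python
import numpy as np

def momentum_rate(eta_a, beta):
    """max(|r_1|, |r_2|) for the roots of r^2 = theta r - beta, theta = 1 - eta*a + beta,
    cf. (3.44); values >= 1 correspond to non-convergence."""
    theta = 1.0 - eta_a + beta
    r1, r2 = np.roots([1.0, -theta, beta])
    return max(abs(r1), abs(r2))

beta = 0.9
grid = [0.001, (1 - np.sqrt(beta))**2, 0.1, 0.5, 1.0, 2.0, 2.5, 3.5, (1 + np.sqrt(beta))**2, 4.0]
for eta_a in grid:
    print(f"eta*a = {eta_a:6.4f}   momentum factor = {momentum_rate(eta_a, beta):.4f}"
          f"   plain GD factor |1 - eta*a| = {abs(1 - eta_a):.4f}")
```

For \(\eta a\) below \((1-\sqrt{\beta})^{2}\) the momentum factor is smaller than \(|1-\eta a|\) (faster convergence), it equals \(\sqrt{\beta}\) on the whole interval \([(1-\sqrt{\beta})^{2},(1+\sqrt{\beta})^{2}],\) and it exceeds \(1\) beyond that.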
Lemma (GD with momentum for higher-dimensional quadratic \(f\)).
Let \(d\ge1\) and let \(f:\mathbb{R}^{d}\to\mathbb{R}\) be given by the quadratic \[f(x)=\frac{1}{2}x^{T}Ax+b^{T}x+c,\] where \(A\in\mathbb{R}^{d\times d}\) is symmetric and positive definite, and \(b\in\mathbb{R}^{d},c\in\mathbb{R}.\) Let \(0<\lambda_{\min}=\lambda_{1}\le\ldots\le\lambda_{d}=\lambda_{\max}\) denote the eigenvalues of \(A,\) and denote the unique global minimizer of \(f\) by \(x_{*}=-A^{-1}b.\) Let \(x_{n},n\ge0,\) denote the iterates of gradient descent with momentum, with step-size \(\eta>0\) and momentum parameter \(\beta\in(0,1),\) starting at some \(x_{0}\in\mathbb{R}^{d}\setminus\{x_{*}\}.\)
a) If \(\eta\in(0,\lambda_{\max}^{-1}(1+\sqrt{\beta})^{2}],\) then there are constants \(c,\kappa>0\) (which depend on \(\eta,\) \(\beta,\) \(d\) and the eigenvalues of \(A\)) s.t. \[|x_{n}-x_{*}|\le ce^{-\kappa n}|x_{0}-x_{*}|,\quad\forall n\ge0.\tag{3.47}\]
b) If \(\eta<\lambda_{\max}^{-1}(1-\sqrt{\beta})^{2},\) then \[\kappa>-\log|1-\eta\lambda_{\min}|>0.\tag{3.48}\]
Remark 3.23. a) says that convergence of GD with momentum on this \(f\) is guaranteed also for some \(\eta\) for which plain GD diverges (provided \((1+\sqrt{\beta})^{2}>2,\) cf. Lemma 3.17 and Remark 3.21 a)).
b) says that if \(\eta<\lambda_{\max}^{-1}(1-\sqrt{\beta})^{2},\) then the convergence rate of GD with momentum is higher than that of plain GD, cf. Lemma 3.17 and Remark 3.21 b).
In a) and b) we also see that convergence/stability is controlled by \(\lambda_{\max}\) (the curvature of the most curved direction), and the convergence rate is controlled by \(\lambda_{\min}\) (the curvature of the flattest direction), as for plain GD (recall Remark 3.8).
Proof. We have \(\nabla f(x)=Ax+b,\) so the iterates of GD with momentum are \[x_{n+1}=x_{n}-\eta(Ax_{n}+b)+\beta(x_{n}-x_{n-1}),\quad n\ge0,\] which in terms of \(y_{n}=x_{n}-x_{*}\) reads \[y_{n+1}=y_{n}-\eta Ay_{n}+\beta(y_{n}-y_{n-1}),\quad n\ge0.\tag{3.49}\] As in Lemma 3.32 and Lemma 3.17, we can diagonalize \(A\) with eigenvalues \(0<\lambda_{\min}=\lambda_{1}\le\ldots\le\lambda_{d}=\lambda_{\max}\) and express (3.49) in the diagonalizing basis. This yields \[y_{n+1,i}=y_{n,i}-\eta\lambda_{i}y_{n,i}+\beta(y_{n,i}-y_{n-1,i}),\quad i=1,\ldots,d,\;n\ge0,\] where the update in each dimension is decoupled. The updates in each dimension are described precisely as in Lemma 3.20 with \(a=\lambda_{i}.\) Thus if \(\eta\le\lambda_{\max}^{-1}(1+\sqrt{\beta})^{2},\) so that \(\eta\le\lambda_{i}^{-1}(1+\sqrt{\beta})^{2}\) for all \(i=1,\ldots,d,\) then there exist constants \(c_{1},\kappa_{1},c_{2},\kappa_{2},\ldots,c_{d},\kappa_{d}>0\) such that \[|y_{n,i}|\le c_{i}e^{-\kappa_{i}n}|y_{0,i}|\quad\forall n\ge0,\;i=1,\ldots,d.\] Thus \[|y_{n}|^{2}=\sum_{i=1}^{d}y_{n,i}^{2}\le\Big(\max_{i}c_{i}^{2}\Big)e^{-2\kappa n}\sum_{i=1}^{d}y_{0,i}^{2}=c^{2}e^{-2\kappa n}|y_{0}|^{2}\quad\forall n\ge0,\] where \(c=\max(c_{1},\ldots,c_{d})\) and \[\kappa=\min(\kappa_{1},\ldots,\kappa_{d})>0,\] which gives (3.47). If in addition \(\eta<\lambda_{\max}^{-1}(1-\sqrt{\beta})^{2},\) so that \(\eta<\lambda_{i}^{-1}(1-\sqrt{\beta})^{2}\) for all \(i,\) then \(\kappa_{i}>-\log|1-\lambda_{i}\eta|\ge-\log|1-\lambda_{\min}\eta|\) for all \(i,\) so indeed (3.48) holds.
Lemma (GD with momentum for univariate \(f\) with degenerate minimum).
For \(a>0,k>2\) let \(f:\mathbb{R}\to\mathbb{R}\) be given by \[f(x)=\frac{a}{k}|x|^{k},\] with the unique local and global minimizer \(x_{*}=0.\) Let \(x_{n},n\ge0,\) denote the iterates of gradient descent with momentum, with step-size \(\eta>0\) and momentum parameter \(\beta\in(0,1),\) starting from some \(x_{0}\ne x_{*}.\)
There is a constant \(c>0\) (depending on \(k\)), such that if \(|x_{0}-x_{*}|\le c,\) then \[|x_{n}-x_{*}|\sim\left(\frac{1-\beta}{s\eta a}\right)^{1/s}n^{-1/s},\] where \(s=k-2.\)
Remark 3.25. The rate \(n^{-1/s}\) is not faster than that of plain GD (recall Lemma 3.9), but the constant \(\left(\frac{1-\beta}{s\eta a}\right)^{1/s}\) in front is smaller. GD with momentum does not enjoy the fast convergence for this \(f\) of the Newton-Raphson method (recall Lemma 3.13).
Proof sketch. We have \(f'(x)=a|x|^{k-1}{\rm sign}(x),\) so the iterates of GD with momentum are \[x_{n+1}=x_{n}-\eta a|x_{n}|^{k-1}{\rm sign}(x_{n})+\beta(x_{n}-x_{n-1}),\quad n\ge0.\] As in the sketch of the proof of Lemma 3.9, let us assume that \(x_{n}\) converges to \(x_{*}\) (and stays positive), and heuristically guess the convergence rate. Like in (3.13) we have under this assumption that \[\log x_{n}=\log x_{n_{0}}+\sum_{m=n_{0}}^{n-1}\log\left(1-\eta ax_{m}^{s}+\beta\left(1-\frac{x_{m-1}}{x_{m}}\right)\right),\quad\text{for all }n>n_{0},\] for \(n_{0}\) large enough, where \(s=k-2.\) This leads to the approximation \[\log x_{n}\approx\log x_{n_{0}}-\eta a\sum_{m=n_{0}}^{n-1}x_{m}^{s}+\beta\sum_{m=n_{0}}^{n-1}\left(1-\frac{x_{m-1}}{x_{m}}\right),\quad\text{for all }n>n_{0}.\] With the ansatz \(x_{n}=hn^{-1/s}\) for some constant \(h>0,\) which we argued was the correct rate of \(x_{n}\downarrow0\) for GD without momentum in Lemma 3.9 (with \(h=(s\eta a)^{-1/s}\)), we have in this case \[1-\frac{x_{m-1}}{x_{m}}=1-\left(1-\frac{1}{m}\right)^{-1/s}=-\frac{1}{s}\frac{1}{m}+O(m^{-2}).\] This gives \[-\frac{1}{s}\log n\approx-\left(h^{s}\eta a+\frac{\beta}{s}\right)\log n\] after neglecting terms that are \(O(1).\) This is consistent provided \(\frac{1}{s}=h^{s}\eta a+\frac{\beta}{s}\iff h=\left(\frac{1-\beta}{s\eta a}\right)^{1/s}.\) Thus we expect the rate of convergence to be proportional to \(n^{-1/s},\) as for GD without momentum, but with an improved constant of proportionality, so that \(x_{n}\sim\left(\frac{1-\beta}{s\eta a}\right)^{1/s}n^{-1/s}.\)
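Finally, a numerical sanity check of this heuristic (again a sketch only, with arbitrary admissible parameters; the momentum parameter is kept moderate here so that the iterates stay positive): the ratio of \(|x_{n}|\) to the predicted \(\left(\frac{1-\beta}{s\eta a}\right)^{1/s}n^{-1/s}\) should slowly approach 1.

```python
import numpy as np

# GD with momentum on f(x) = (a/k)|x|^k, compared with the predicted asymptotics of Remark 3.25.
a, k, eta, beta = 1.0, 4.0, 0.1, 0.5
s = k - 2
x_prev = x = 0.5
for n in range(1, 10**5 + 1):
    x_next = x - eta * a * abs(x) ** (k - 1) * np.sign(x) + beta * (x - x_prev)
    x_prev, x = x, x_next
    if n in (10**3, 10**4, 10**5):
        predicted = ((1 - beta) / (s * eta * a)) ** (1 / s) * n ** (-1 / s)
        print(f"n = {n:>6d}   |x_n| = {abs(x):.5e}   predicted = {predicted:.5e}")
```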