0.3.5 ADAM

The ADAM method combines momentum with a step-size adapted to each coordinate of the iterates \(x_{n}.\) The latter is somewhat like the Newton-Raphson method, but the ADAM method uses only the gradient of the objective function, not the Hessian. ADAM is an evolution of its predecessors AdaGrad and RMSProp, and is best understood in this historical context.

The AdaGrad (adaptive gradient) method keeps track of the magnitude of the gradients in each dimension through the sums \[v_{n,i}=\sum_{m=0}^{n}(\partial_{i}f(x_{m}))^{2},\quad i=1,\ldots,d,n\ge0.\tag{3.51}\] They are used to adapt the step-size using the update rule \[x_{n+1}=x_{n}-\eta\nabla f(x_{n})\odot v_{n}^{-1/2}\tag{3.52}\] where the inverse square root is taken component-wise, and \(\odot\) denotes component-wise multiplication. The effect of AdaGrad is to decrease the effective step-size in directions with large gradients, which should increase stability, while increasing it in directions with small gradients, which should speed up convergence. Note that (3.51) sums over the entire past, so in principle \(v_{n,i}\) can grow without bound with increasing \(n,\) though if the iterates get close to a local minimum the derivatives should become small, thus putting a bound on this growth. Furthermore, a few very large derivatives early on in the iteration have an effect which lasts for all subsequent \(n.\)
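In code, the AdaGrad update can be sketched as follows (a minimal sketch, assuming a user-supplied gradient routine grad_f; the small constant eps is not part of (3.52), but is added to avoid division by zero before any gradient has accumulated):

import numpy as np

def adagrad(grad_f, x0, eta=0.1, num_steps=1000, eps=1e-8):
    # grad_f: function R^d -> R^d returning the gradient of f
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)                   # running sums v_{n,i} from (3.51)
    for n in range(num_steps):
        g = grad_f(x)
        v += g**2                          # accumulate over the entire past
        x -= eta * g / np.sqrt(v + eps)    # component-wise update (3.52)
    return x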

RMSProp (Root Mean Square Propagation) adds a “forgetting mechanism” to AdaGrad, which emphasizes recent derivatives in the computation of the effective step-size. For a parameter \(\gamma\in(0,1)\) it defines the vectors \(v_{n}\in\mathbb{R}^{d}\) by \[v_{n,i}=\gamma v_{n-1,i}+(1-\gamma)\partial_{i}f(x_{n})^{2},\quad i=1,\ldots,d,n\ge0,\tag{3.53}\] (with the convention \(v_{-1}=0\)), which means that (3.51) is replaced by \[v_{n,i}=(1-\gamma)\sum_{m=0}^{n}\gamma^{n-m}(\partial_{i}f(x_{m}))^{2},\quad i=1,\ldots,d,n\ge0.\] The update rule for \(x_{n}\) of RMSProp is the same as for AdaGrad, i.e. (3.52).
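In code, RMSProp differs from the AdaGrad sketch above only in how \(v_{n}\) is accumulated (again a sketch, with the same assumed gradient routine grad_f):

def rmsprop(grad_f, x0, eta=0.01, gamma=0.9, num_steps=1000, eps=1e-8):
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)                      # convention v_{-1} = 0
    for n in range(num_steps):
        g = grad_f(x)
        v = gamma * v + (1 - gamma) * g**2    # exponential forgetting (3.53)
        x -= eta * g / np.sqrt(v + eps)       # same update rule as (3.52)
    return x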

Finally ADAM combines the mechanism of RMSProp with momentum. The vector \(v_{n}\) is still defined as for RMSProp in (3.53). A normalized version \(\hat{v}_{n}\) is defined by \[\hat{v}_{n,i}=\frac{v_{n,i}}{1-\gamma^{n+1}}=\frac{\sum_{m=0}^{n}\gamma^{n-m}(\partial_{i}f(x_{m}))^{2}}{\sum_{m=0}^{n}\gamma^{n-m}},\quad i=1,\ldots,d,n\ge0.\] The number \(\hat{v}_{n,i}\) can be seen as a temporally smoothed estimate of the squared derivative \(\partial_{i}f(x_{n})^{2}.\) The momentum mechanism is implemented using the iteration \[\mu_{n,i}=\beta\mu_{n-1,i}+(1-\beta)\partial_{i}f(x_{n}),\quad i=1,\ldots,d,n\ge0,\] (with the convention \(\mu_{-1}=0\)), for which \[\mu_{n,i}=(1-\beta)\sum_{m=0}^{n}\beta^{n-m}\partial_{i}f(x_{m}),\quad i=1,\ldots,d,n\ge0.\] Its normalized version \[\hat{\mu}_{n,i}=\frac{\mu_{n,i}}{1-\beta^{n+1}}=\frac{\sum_{m=0}^{n}\beta^{n-m}\partial_{i}f(x_{m})}{\sum_{m=0}^{n}\beta^{n-m}},\quad i=1,\ldots,d,n\ge0,\] can be seen as a temporally smoothed estimate of the derivative \(\partial_{i}f(x_{n}).\) The update rule of ADAM is then \[x_{n+1}=x_{n}-\eta\odot\frac{\hat{\mu}_{n}}{\sqrt{\hat{v}_{n}}+\varepsilon},\quad n\ge0,\tag{3.54}\] where all operations are component-wise, and \(\varepsilon>0\) is used to prevent numerical issues that may arise from the division if \(\hat{v}_{n}\) has very small entries. This update uses the temporally smoothed estimate of \(\partial_{i}f(x_{n})\) to determine the direction of the update, which has a similar effect as momentum has for GD, and it uses essentially the same adaptive step-size mechanism as RMSProp, based on the temporally smoothed estimate of \(\partial_{i}f(x_{n})^{2}.\) Typical parameter values are \(\varepsilon=10^{-8},\beta=0.9,\gamma=0.999.\)
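A minimal sketch of the full ADAM iteration (3.53)-(3.54), with the typical parameter values above as defaults (grad_f is again an assumed user-supplied gradient routine):

def adam(grad_f, x0, eta=0.001, beta=0.9, gamma=0.999, eps=1e-8,
         num_steps=1000):
    x = np.array(x0, dtype=float)
    mu = np.zeros_like(x)                       # mu_{-1} = 0
    v = np.zeros_like(x)                        # v_{-1} = 0
    for n in range(num_steps):
        g = grad_f(x)
        mu = beta * mu + (1 - beta) * g         # smoothed gradient
        v = gamma * v + (1 - gamma) * g**2      # smoothed squared gradient
        mu_hat = mu / (1 - beta**(n + 1))       # normalized versions
        v_hat = v / (1 - gamma**(n + 1))
        x -= eta * mu_hat / (np.sqrt(v_hat) + eps)   # update rule (3.54)
    return x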

When training artificial neural networks in modern machine learning one essentially always uses either GD with momentum or ADAM (or rather their stochastic variants - see the next section).

0.3.6 Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a version of gradient descent for optimizing (the expectation of a) random objective function. One assumes that \(f_{0},f_{1},f_{2},\ldots\) is a sequence of random objective functions from \(\mathbb{R}^{d}\) to \(\mathbb{R}\) which are identically distributed, mutually independent (i.e. i.i.d.) and almost surely differentiable (i.e. \(\mathbb{P}(f_{m}\text{ is differentiable})=1\)). The goal is to minimize the expected objective function, i.e. to solve the optimization problem \[\text{minimize }\quad\mathbb{E}[f_{m}(x)]\quad\text{for}\quad x\in\mathbb{R}^{d},\tag{3.55}\] (the expectation is independent of \(m\) since the \(f_{m}\) are identically distributed). The update rule is as for gradient descent but using the random gradient \(\nabla f_{n}\) in the \(n\)-th step, i.e. \[x_{n+1}=x_{n}-\eta\nabla f_{n}(x_{n}),\quad n\ge0.\tag{3.56}\]
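As a sketch in code, where sample_grad is assumed to return an independent realization of the random gradient \(\nabla f_{n}(x)\) at each call:

import numpy as np

def sgd(sample_grad, x0, eta=0.1, num_steps=1000):
    # sample_grad(x) returns nabla f_n(x) for a fresh random f_n
    x = np.array(x0, dtype=float)
    for n in range(num_steps):
        x -= eta * sample_grad(x)   # update rule (3.56)
    return x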

0.3.6.1 Motivation

The SGD algorithm is extremely important in modern machine learning (ML), where it is used to train e.g. artificial neural networks. Some context is needed to understand how one ends up with a sequence of random objective functions in this setting. We will study this in detail in the ML part of the course, but anticipate the most relevant aspects here. In ML (more specifically so-called “supervised learning”) the learning step involves minimizing a (non-random) objective function \(\overline{f}(x),\) where \(x\in\mathbb{R}^{d}\) is a vector of parameters of the “model” (like the parameters \(a,b\) of the linear regression in Figure 2), which depends on a dataset \(z_{i},i=1,\ldots,p\) of vectors in \(\mathbb{R}^{\nu},\) for some \(\nu\ge1.\) More specifically, this objective function takes the form \[\overline{f}(x)=\frac{1}{p}\sum_{i=1}^{p}L(z_{i},x),\] where \(L:\mathbb{R}^{\nu}\times\mathbb{R}^{d}\to\mathbb{R}\) is a so-called “loss function”, which is differentiable and such that \(L(z_{i},x)\) measures how well the model with parameter vector \(x\) fits the datapoint \(z_{i}\) (for linear regression this is the squared distance between the line with parameters \(x=(a,b)\) and the datapoint \(z_{i}=(\tilde{x}_{i},y_{i})\)). One can apply gradient descent to minimize this objective function, with updates \[x_{n+1}=x_{n}-\eta\frac{1}{p}\sum_{i=1}^{p}\nabla_{x}L(z_{i},x_{n}).\] For a large dataset, say \(p=10^{6},\) one step of GD requires the computation of the gradient \(\nabla_{x}L(z_{i},x)\) of the loss function \(p=10^{6}\) times. This is a lot of computation for just taking one step of gradient descent. To speed up the computation, it is common to instead pick just one random datapoint out of the dataset and take a step based only on the gradient of the loss function for that datapoint, i.e. \[x_{n+1}=x_{n}-\eta\nabla_{x}L(z_{i_{n}},x_{n}),\tag{3.57}\] where \(i_{n},n\ge0,\) is an i.i.d. sequence of integer random variables that are uniformly distributed on \(\{1,\ldots,p\}.\) Note that for any fixed \(x\in\mathbb{R}^{d}\) \[\mathbb{E}\left[L(z_{i_{n}},x)\right]=\frac{1}{p}\sum_{i=1}^{p}L(z_{i},x)=\overline{f}(x),\] and \[\mathbb{E}\left[\nabla_{x}L(z_{i_{n}},x)\right]=\frac{1}{p}\sum_{i=1}^{p}\nabla_{x}L(z_{i},x)=\nabla\overline{f}(x).\] Thus the gradient used in the update (3.57) is \[\underset{\text{gradient of non-random obj. func.}}{\underbrace{\nabla\overline{f}(x)}}+\underset{\text{random noise with mean zero}}{\underbrace{E_{n}}}\] for \[E_{n}=\nabla_{x}L(z_{i_{n}},x)-\nabla\overline{f}(x).\] The sequence \(E_{n},n\ge0,\) is an i.i.d. sequence of mean-zero random noise that perturbs the descent direction \(-\nabla\overline{f}(x)\) for the non-random objective function. Thus (3.57) amounts to a noisy version of GD. It is also an instance of the SGD algorithm as defined by the update rule (3.56), for \[f_{n}(x)=L(z_{i_{n}},x),\quad n\ge0.\]
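A sketch of the update (3.57), assuming a routine grad_L(z, x) computing \(\nabla_{x}L(z,x)\) and a dataset stored as an array:

def sgd_one_datapoint(grad_L, data, x0, eta=0.1, num_steps=10000, seed=0):
    # data: array whose rows are the datapoints z_1, ..., z_p
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for n in range(num_steps):
        i = rng.integers(len(data))       # i_n uniform on {1,...,p}
        x -= eta * grad_L(data[i], x)     # update rule (3.57)
    return x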

3.26 • Continuation of Example [examp:ch4_GD_point_cloud_example]

In a similar manner, SGD can be applied to the optimization problem ([eq:ch4_GD_first_example_obj_fun])-([eq:ch4_GD_first_example_min_prob_statement]) of Example [examp:ch4_GD_point_cloud_example]. The objective function is \[f(x)=f(y_{1},\ldots,y_{20})=\frac{1}{20}\sum_{i=1}^{20}g(y_{i})+\frac{1}{20^{2}}\sum_{j,k=1}^{20}h(y_{j},y_{k}),\tag{3.59}\] where \(x=(y_{1},\ldots,y_{20}),\) and \[g(y)=g(a,b)=\exp(a^{10}+b^{10}),\text{ and }h(y,y')=\exp(-10|y-y'|^{2}),\] where \(y=(a,b).\) To minimize this using SGD, let \(i_{n},j_{n},k_{n},n\ge0\) be i.i.d. random variables uniformly distributed on \(\{1,\ldots,20\},\) and define the random functions \(f_{n}\) by \[f_{n}(x)=f_{n}(y_{1},\ldots,y_{20})=g(y_{i_{n}})+h(y_{j_{n}},y_{k_{n}}).\tag{3.60}\] Note that since \(\mathbb{P}(i_{n}=i)=\frac{1}{20}\) and \(\mathbb{P}(j_{n}=j,k_{n}=k)=\mathbb{P}(j_{n}=j)\mathbb{P}(k_{n}=k)=\frac{1}{20^{2}}\) we have \[\mathbb{E}[f_{n}(x)]=\sum_{i=1}^{20}\mathbb{P}(i_{n}=i)g(y_{i})+\sum_{j,k=1}^{20}\mathbb{P}(j_{n}=j,k_{n}=k)h(y_{j},y_{k})=f(x)\] for all \(n\ge0\) and \(x\in\mathbb{R}^{40},\) cf. (3.55). Thus SGD using the random functions \(f_{n}\) indeed minimizes \(f(x)\) from (3.59). The update rule is (3.56), for the \(f_{n}\) defined in (3.60). Let us compute it more concretely for this example. Let \[x_{n}=(a_{n,1},b_{n,1},\ldots,a_{n,20},b_{n,20}),\quad n\ge0,\] so that \(a_{n,i},b_{n,i}\) are the entries of the iterates \(x_{n}.\) Rewriting the rule (3.56) in terms of the entries we obtain \[\begin{array}{cccl} a_{n+1,i} & = & a_{n,i}-\eta\partial_{a_{i}}f_{n}(x_{n}), & \quad i=1,\ldots,20,n\ge0,\\ b_{n+1,i} & = & b_{n,i}-\eta\partial_{b_{i}}f_{n}(x_{n}), & \quad i=1,\ldots,20,n\ge0. \end{array}\tag{3.61}\] Let \[\begin{array}{ccccc} g_{a}(y) & := & \partial_{a}g(y) & = & 10a^{9}\exp(a^{10}+b^{10}),\\ g_{b}(y) & := & \partial_{b}g(y) & = & 10b^{9}\exp(a^{10}+b^{10}), \end{array}\] and \[\begin{array}{ccccl} h_{a}(y,y') & := & \partial_{a}h((a,b),(a',b')) & = & -20(a-a')\exp(-10|y-y'|^{2}),\\ h_{b}(y,y') & := & \partial_{b}h((a,b),(a',b')) & = & -20(b-b')\exp(-10|y-y'|^{2}), \end{array}\] where \(y'=(a',b').\) Note that \(\partial_{a'}h(y,y')=-h_{a}(y,y')\) and \(\partial_{b'}h(y,y')=-h_{b}(y,y').\) Thus \[\begin{array}{rcl} \partial_{a_{i}}f_{n}(x) & = & \delta_{i=i_{n}}g_{a}(y_{i})+\delta_{i=j_{n}}h_{a}(y_{i},y_{k_{n}})-\delta_{i=k_{n}}h_{a}(y_{j_{n}},y_{i}),\\ \partial_{b_{i}}f_{n}(x) & = & \delta_{i=i_{n}}g_{b}(y_{i})+\delta_{i=j_{n}}h_{b}(y_{i},y_{k_{n}})-\delta_{i=k_{n}}h_{b}(y_{j_{n}},y_{i}). \end{array}\tag{3.62}\] Therefore, even more concretely, (3.61) is equivalent to \[\begin{array}{ccl} a_{n+1,i} & = & a_{n,i}-\eta\left\{ \delta_{i=i_{n}}g_{a}(y_{i})+\delta_{i=j_{n}}h_{a}(y_{i},y_{k_{n}})-\delta_{i=k_{n}}h_{a}(y_{j_{n}},y_{i})\right\} ,\\ b_{n+1,i} & = & b_{n,i}-\eta\left\{ \delta_{i=i_{n}}g_{b}(y_{i})+\delta_{i=j_{n}}h_{b}(y_{i},y_{k_{n}})-\delta_{i=k_{n}}h_{b}(y_{j_{n}},y_{i})\right\} , \end{array}\] for \(i=1,\ldots,20,n\ge0,\) where the \(y_{i}=(a_{n,i},b_{n,i})\) on the right-hand side are the pairs of entries of \(x_{n}.\) The following pseudo-code implements this minimization algorithm.

Subroutine for computing gradient of first term for argument \(y\in \mathbb{R}^2,\) returning a vector in \(\mathbb{R}^2\):
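For instance, in Python (a sketch consistent with the formulas for \(g_{a},g_{b}\) above; the function name grad_g is a choice):

import numpy as np

def grad_g(y):
    # gradient of g(a,b) = exp(a^10 + b^10), i.e. the vector (g_a(y), g_b(y))
    a, b = y
    e = np.exp(a**10 + b**10)
    return np.array([10 * a**9 * e, 10 * b**9 * e])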

Subroutine for computing gradient of second term for arguments \(y, y'\in \mathbb{R}^2,\) returning a vector in \(\mathbb{R}^2\):
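For instance (a sketch; the gradient w.r.t. the second argument \(y'\) is the negative of the returned vector), together with a sketch of the main SGD loop, in which eta, num_steps and the helper names are illustrative choices:

def grad_h(y, yp):
    # gradient of h(y,y') = exp(-10|y-y'|^2) w.r.t. the first argument y,
    # i.e. the vector (h_a(y,y'), h_b(y,y'))
    d = y - yp
    return -20 * d * np.exp(-10 * np.dot(d, d))

def sgd_point_cloud(Y0, eta=0.01, num_steps=100000, seed=0):
    # Y0: 20 x 2 array whose rows are the initial points y_1, ..., y_20
    rng = np.random.default_rng(seed)
    Y = np.array(Y0, dtype=float)
    p = len(Y)
    for n in range(num_steps):
        i, j, k = rng.integers(p, size=3)   # i_n, j_n, k_n
        dg = grad_g(Y[i])                   # g-term, active for row i_n
        dh = grad_h(Y[j], Y[k])             # h-term
        # apply all updates simultaneously, as in (3.61)-(3.62)
        Y[i] -= eta * dg
        Y[j] -= eta * dh
        Y[k] += eta * dh                    # opposite sign for the k_n term
    return Y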

The aforementioned motivation for SGD is computational efficiency. Another possible motivation for SGD is to improve stability or speed up convergence. If an objective function \(f:\mathbb{R}^{d}\to\mathbb{R}\) is very rough it may be unsuitable for GD, for instance if it has many non-global local minima that can trap GD. Adding a suitable random noise \(g(x)\) to \(f(x)\) at each point \(x\) and using SGD may allow the stochasticity of the iterates to escape such “spurious” local minima. It may also encourage more exploration of the possible descent directions, and therefore the discovery of better directions that GD misses.

0.3.6.2 (Lack of) convergence and step-size schedules

An obvious difference between SGD and GD is that for SGD the iterates \(x_{n}\) are random variables. A very important difference between SGD and GD is that SGD with constant step-size (and non-trivial noise) never converges. Indeed, imagine starting GD at a global minimizer \(x_{*}\): then the iterates \(x_{n}\) all equal \(x_{*}\) and GD stays at \(x_{*}\) forever. But if SGD is started at \(x_{*}\) and \(\nabla f_{n}(x_{*})\) has non-trivial variance, then even though \(x_{0}=x_{*}\) the next iterate \(x_{1}\) will jump away from \(x_{*}.\) Instead of converging, SGD with constant step-size will in the best case settle into a region around a minimizer \(x_{*},\) where it will jump around randomly. This is illustrated in the following figure.

Figure 3.8: Non-convergence of SGD. The histograms show how for smaller step-size the statistical distribution of iterates is more tightly concentrated around the global minimizer.

In such a scenario, the iterates \(x_{n}\) will converge in distribution as \(n\to\infty\) to some limiting random variable \(x_{\infty},\) whose variance corresponds to the typical distance between \(x_{n}\) and \(x_{*}\) (for \(n\) large). This variance will depend on the variance of the increments \(-\eta\nabla f_{n}(x)\) in the neighborhood of \(x_{*},\) which in turn depends on the variance of \(\nabla f_{n}(x)\) and on \(\eta>0.\) For smaller \(\eta>0\) the variance of the limiting distribution will be smaller, and the region where \(x_{n}\) settles down will be smaller, so \(x_{n}\) will typically be closer to \(x_{*}.\) But it will never converge to \(x_{*}.\)

If convergence is desired, one can achieve it with a step-size schedule, i.e. an \(n\)-dependent decreasing step-size. The rule (3.56) is then replaced with \[x_{n+1}=x_{n}-\eta_{n}\nabla f_{n}(x_{n}),\quad n\ge0,\] for a decreasing sequence \(\eta_{n},n\ge0.\) One can for instance take \(\eta_{n}\) proportional to \(n^{-1}.\) Below we will prove the convergence of SGD for such a schedule and strongly convex functions \(f_{n}\) with Lipschitz gradients.

In practice, a common approach is to start with some initial step-size that does not cause divergence, e.g. \(\eta_{0}=\eta=1,\) and then adaptively decrease \(\eta\) once a certain condition is met. For instance, if the objective function value \(f_{n}(x_{n})\) averaged over the last \(100\) steps did not decrease compared to average over the \(100\) steps before that, then \(\eta\) is decreased by half.
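A sketch of such a halving heuristic, assuming a routine sample_f_and_grad(x) that returns the pair \((f_{n}(x),\nabla f_{n}(x))\) for a fresh random \(f_{n}\):

def sgd_with_halving(sample_f_and_grad, x0, eta=1.0, num_steps=10000,
                     window=100):
    x = np.array(x0, dtype=float)
    values = []
    for n in range(num_steps):
        fval, g = sample_f_and_grad(x)
        values.append(fval)
        x -= eta * g
        # every `window` steps, compare the last two window-averages
        if n >= 2 * window and (n + 1) % window == 0:
            recent = np.mean(values[-window:])
            previous = np.mean(values[-2 * window:-window])
            if recent >= previous:
                eta = eta / 2    # no improvement: halve the step-size
    return x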

0.3.6.3 Momentum and ADAM

Momentum and ADAM generalize straightforwardly from GD to SGD (in fact ADAM was introduced as a variant of SGD). By replacing the deterministic gradient \(\nabla f\) with the random gradient \(\nabla f_{n},\) the rule (3.35) for GD with momentum turns into the rule \[x_{n+1}=x_{n}-\eta\nabla f_{n}(x_{n})+\beta(x_{n}-x_{n-1}),\quad n\ge0,\] for SGD with momentum. Typically one uses \(\beta=0.9.\)
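In code (a sketch, with sample_grad as in the SGD sketch above):

def sgd_momentum(sample_grad, x0, eta=0.1, beta=0.9, num_steps=1000):
    x = np.array(x0, dtype=float)
    x_prev = x.copy()                 # convention x_{-1} = x_0
    for n in range(num_steps):
        x_next = x - eta * sample_grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x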

The ADAM rules for GD from (3.53)-(3.54) turn into \[v_{n,i}=\gamma v_{n-1,i}+(1-\gamma)\partial_{i}f_{n}(x_{n})^{2},\quad i=1,\ldots,d,n\ge0,\tag{3.64}\] \[\hat{v}_{n,i}=\frac{v_{n,i}}{1-\gamma^{n+1}},\quad i=1,\ldots,d,n\ge0,\] \[\mu_{n,i}=\beta\mu_{n-1,i}+(1-\beta)\partial_{i}f_{n}(x_{n}),\quad i=1,\ldots,d,n\ge0,\] \[\hat{\mu}_{n,i}=\frac{\mu_{n,i}}{1-\beta^{n+1}},\quad i=1,\ldots,d,n\ge0,\] \[x_{n+1}=x_{n}-\eta\odot\frac{\hat{\mu}_{n}}{\sqrt{\hat{v}_{n}}+\varepsilon},\quad n\ge0,\tag{3.65}\] with typical parameter values \(\varepsilon=10^{-8},\beta=0.9,\gamma=0.999.\)

These can of course also be combined with a step-size schedule by replacing \(\eta\) by an \(n\)-dependent \(\eta_{n}.\)

The rest of this section is concerned with the theoretical analysis of SGD.

0.3.6.4 Theoretical analysis

We start by investigating SGD in the simplest case possible, namely for a univariate (\(d=1\)) quadratic objective function with Gaussian noise.

3.27

(Fixed step-size SGD for univariate quadratic \(f\) with non-degenerate minimum). Let \(d=1,\) \(\sigma>0,a\in(0,\infty)\) and let \(f_{n}:\mathbb{R}\to\mathbb{R}\) be given by \(f_{n}(x)=\frac{a}{2}x^{2}+x\varepsilon_{n},\) where \(\varepsilon_{n}\) are i.i.d. Gaussian random variables with mean zero and variance \(\sigma^{2}.\) Consider the numerical minimization of the quadratic objective function \(\overline{f}(x)=\mathbb{E}[f_{n}(x)]=a\frac{x^{2}}{2},\) with global minimum \(x_{*}=0,\) using SGD with step-size \(\eta>0,\) starting at \(x_{0}\in\mathbb{R}.\) Since \(f_{n}'(x)=ax+\varepsilon_{n},\) the iterates of SGD are given by \[x_{n+1}=x_{n}-\eta(ax_{n}+\varepsilon_{n}),\quad n\ge0.\tag{3.66}\] Using induction one can verify that \[x_{n}=(1-\eta a)^{n}x_{0}-\eta\sum_{m=0}^{n-1}(1-a\eta)^{n-1-m}\varepsilon_{m},\quad n\ge0.\] Let \[\overline{x}_{n}:=(1-\eta a)^{n}x_{0}\] denote the deterministic first term, and \[\Delta_{n}=-\eta\sum_{m=0}^{n-1}(1-a\eta)^{n-1-m}\varepsilon_{m}\] the random second term, so that \[x_{n}=\overline{x}_{n}+\Delta_{n},\quad n\ge0.\] If \(\eta a\in(0,2),\) then the deterministic term \(\overline{x}_{n}\) converges exponentially fast to \(x_{*}=0.\) Since \(\varepsilon_{0},\ldots,\varepsilon_{n-1}\) are independent Gaussians, it follows that \(\Delta_{n}\) is a Gaussian random variable for each \(n.\) The mean of \(\Delta_{n}\) is \[\mathbb{E}[\Delta_{n}]=-\eta\sum_{m=0}^{n-1}(1-a\eta)^{n-1-m}\mathbb{E}[\varepsilon_{m}]=0.\] To compute the variance, first note that by independence and since the variance of each \(\varepsilon_{n}\) is \(\sigma^{2}\) we have \[\mathbb{E}[\varepsilon_{m}\varepsilon_{m'}]=\delta_{m=m'}\sigma^{2}\quad\text{for all}\quad m,m'\ge0,\] so that for any \(n\ge1\) and coefficients \(b_{0},\ldots,b_{n-1}\) it holds that \[\begin{array}{cll} \mathbb{E}\left[\left(\sum_{m=0}^{n-1}b_{m}\varepsilon_{m}\right)^{2}\right] & = & \mathbb{E}\left[\sum_{m,m'=0}^{n-1}b_{m}b_{m'}\varepsilon_{m}\varepsilon_{m'}\right]\\ & = & \sum_{m,m'=0}^{n-1}b_{m}b_{m'}\mathbb{E}[\varepsilon_{m}\varepsilon_{m'}]\\ & = & \sum_{m,m'=0}^{n-1}b_{m}b_{m'}\delta_{m=m'}\sigma^{2}\\ & = & \sigma^{2}\sum_{m=0}^{n-1}b_{m}^{2}. \end{array}\tag{3.67}\] Using this with \(b_{m}=(1-a\eta)^{n-1-m}\) we obtain the variance of \(\Delta_{n}\) as \[\begin{array}{cll} \mathbb{E}[\Delta_{n}^{2}] & = & \mathbb{E}\left[\left(\eta\sum_{m=0}^{n-1}(1-a\eta)^{n-1-m}\varepsilon_{m}\right)^{2}\right]\\ & = & \eta^{2}\sigma^{2}\sum_{m=0}^{n-1}(1-a\eta)^{2(n-1)-2m}\\ & = & \eta^{2}\sigma^{2}\sum_{l=0}^{n-1}(1-a\eta)^{2l}\\ & = & \eta^{2}\sigma^{2}\frac{1-(1-a\eta)^{2n}}{1-(1-a\eta)^{2}}. \end{array}\] Note that if \(\eta a\in(0,2)\) then \[\mathbb{E}[\Delta_{n}^{2}]\to\frac{\eta^{2}\sigma^{2}}{1-(1-a\eta)^{2}}\quad\text{as}\quad n\to\infty.\] This implies that \[\mathbb{E}[|x_{n}-x_{*}|^{2}]\to\frac{\eta^{2}\sigma^{2}}{1-(1-a\eta)^{2}},\quad\text{as }n\to\infty,\] (since \(|x_{n}-x_{*}|^{2}=|x_{n}|^{2}=|\overline{x}_{n}|^{2}+2\overline{x}_{n}\Delta_{n}+|\Delta_{n}|^{2},\) \(\mathbb{E}[\Delta_{n}]=0\) and \(\overline{x}_{n}\to0\)). Thus for \(n\) large, the average squared distance between \(x_{n}\) and the global minimizer \(x_{*}=0\) is governed by the variance \(\sigma^{2}>0\) and step-size \(\eta>0.\) The “typical” distance between \(x_{n}\) and \(x_{*}\) is described by the standard deviation of \(x_{n}-x_{*},\) which is \(\frac{\eta\sigma}{\sqrt{1-(1-a\eta)^{2}}}.\)
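The limit can be checked numerically; the following sketch runs many independent copies of the iteration (3.66) and compares the empirical mean of \(|x_{n}-x_{*}|^{2}\) with the formula (the parameter values are illustrative assumptions):

import numpy as np

a, sigma, eta, x0 = 1.0, 1.0, 0.1, 5.0
num_steps, num_runs = 500, 10000
rng = np.random.default_rng(0)
x = np.full(num_runs, x0)               # num_runs independent copies
for n in range(num_steps):
    eps = rng.normal(0.0, sigma, size=num_runs)
    x = x - eta * (a * x + eps)         # iteration (3.66)
print(np.mean(x**2))                    # empirical E[|x_n - x_*|^2]
print(eta**2 * sigma**2 / (1 - (1 - a * eta)**2))   # predicted limit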

3.28

Furthermore, for a sequence of Gaussian random variables the convergence of mean and variance implies convergence in distribution (a.k.a. weak convergence). Thus letting \(\Delta\) denote a Gaussian random variable with mean zero and variance \(\frac{\eta^{2}\sigma^{2}}{1-(1-a\eta)^{2}},\) we have \[\Delta_{n}\to\Delta\quad\text{in distribution},\quad\text{as }n\to\infty.\] Since \(\overline{x}_{n}\to0\) it follows that \[x_{n}\to\Delta\quad\text{in distribution},\quad\text{as }n\to\infty\] (the rigorous justification uses Slutsky’s theorem). Thus while the random sequence \(x_{n}\) itself does not converge, the probability distribution of \(x_{n}\) converges to a Gaussian with mean zero and variance \(\frac{\eta^{2}\sigma^{2}}{1-(1-a\eta)^{2}}.\)

Next we consider SGD and the same objective function, but with a decreasing step-size schedule.

3.29

(Decreasing step-size SGD for univariate quadratic \(f\) with non-degenerate minimum). Let \(d,a,\sigma,f_{n},\overline{f},x_{0}\) be as in Example 3.27. If the decreasing step-size schedule \[\eta_{n}=\frac{1}{a(n+1)},\quad n\ge0,\] is used, then the iterates of SGD are \[x_{n+1}=x_{n}-\eta_{n}(ax_{n}+\varepsilon_{n}),\quad n\ge0.\tag{3.69}\]

By induction \[x_{n}=\underset{=:\overline{x}_{n}}{\underbrace{x_{0}\prod_{m=0}^{n-1}(1-\eta_{m}a)}}\underset{=:\Delta_{n}}{\underbrace{-\sum_{m=0}^{n-1}\eta_{m}\left(\prod_{l=m+1}^{n-1}(1-\eta_{l}a)\right)\varepsilon_{m}}},\quad n\ge0.\] The first term \(\overline{x}_{n}\) is deterministic and can be bounded by \[|x_{0}|\prod_{m=0}^{n-1}|1-\eta_{m}a|=|x_{0}|\prod_{m=0}^{n-1}\left(1-\frac{1}{m+1}\right)\le|x_{0}|\prod_{m=0}^{n-1}e^{-\frac{1}{m+1}}=|x_{0}|e^{-\sum_{m=0}^{n-1}\frac{1}{m+1}}.\] Since \[\sum_{m=0}^{\infty}\frac{1}{m+1}=\sum_{m=1}^{\infty}\frac{1}{m}=\infty,\] it holds that \[\overline{x}_{n}\to0=x_{*},\quad\text{as }\quad n\to\infty.\] As in Example 3.27, the random term \(\Delta_{n}\) is Gaussian with mean zero for each \(n\ge0.\) By using (3.67) with \(b_{m}=\eta_{m}\prod_{l=m+1}^{n-1}(1-\eta_{l}a)\) we can compute its variance as \[\mathbb{E}[\Delta_{n}^{2}]=\sigma^{2}\sum_{m=0}^{n-1}\eta_{m}^{2}\prod_{l=m+1}^{n-1}(1-\eta_{l}a)^{2}.\] Since \(1-\eta_{l}a=\frac{l}{l+1},\) the product telescopes: \(\prod_{l=m+1}^{n-1}(1-\eta_{l}a)=\frac{m+1}{n},\) so that \[\eta_{m}^{2}\prod_{l=m+1}^{n-1}(1-\eta_{l}a)^{2}=\frac{1}{a^{2}(m+1)^{2}}\cdot\frac{(m+1)^{2}}{n^{2}}=\frac{1}{a^{2}n^{2}}\] for each \(m,\) and hence \[\mathbb{E}[\Delta_{n}^{2}]=\frac{\sigma^{2}}{a^{2}n},\] which implies that \[\mathbb{E}[\Delta_{n}^{2}]\to0\quad\text{as}\quad n\to\infty.\] This in turn implies that \[\Delta_{n}\to0\text{ in distribution, as }n\to\infty,\] (or equivalently \(\Delta_{n}\to0\) in probability), and that \[x_{n}\to x_{*}\text{ in distribution, as }n\to\infty.\] Thus for this decreasing step-size schedule the iterates do converge to the global minimizer (in distribution/probability).
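Replacing the constant step-size in the previous sketch by the schedule \(\eta_{n}=\frac{1}{a(n+1)}\) illustrates this convergence (with the same illustrative parameters a, sigma, x0, num_runs, rng as above):

x = np.full(num_runs, x0)
for n in range(num_steps):
    eta_n = 1.0 / (a * (n + 1))         # decreasing schedule
    eps = rng.normal(0.0, sigma, size=num_runs)
    x = x - eta_n * (a * x + eps)       # iteration (3.69)
print(np.mean(x**2))                    # ~ sigma^2/(a^2 n), tending to 0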

Next we consider a higher dimensional quadratic objective function with isotropic Gaussian noise. Isotropic means invariant to direction. For an \(\mathbb{R}^{d}\)-valued Gaussian random vector \(\varepsilon=(\varepsilon_{1},\ldots,\varepsilon_{d})\) it means that the entries \(\varepsilon_{i}\) are i.i.d. Gaussians with mean zero. Then the inner products \(\varepsilon\cdot e_{i}=\varepsilon_{i}\) with any of the standard basis vectors \(e_{1},\ldots,e_{d}\) of \(\mathbb{R}^{d}\) are Gaussians with the same variance and mean zero. More generally, for any fixed unit vector \(v\in\mathbb{R}^{d},\) the inner product \(\varepsilon\cdot v\) is Gaussian with that variance and mean zero. This follows from the fact that any linear combination of jointly Gaussian random variables is again Gaussian, and the following easy computation of the mean and variance of \(\varepsilon\cdot v\): \[\mathbb{E}[\varepsilon\cdot v]=\mathbb{E}\left[\sum_{i=1}^{d}\varepsilon_{i}v_{i}\right]=\sum_{i=1}^{d}\mathbb{E}[\varepsilon_{i}]v_{i}=\sum_{i=1}^{d}0\cdot v_{i}=0,\] (mean) and, using a computation as in (3.67) together with \(\sum_{i=1}^{d}v_{i}^{2}=1,\) \[\mathbb{E}[(\varepsilon\cdot v)^{2}]=\mathbb{E}\left[\left(\sum_{i=1}^{d}\varepsilon_{i}v_{i}\right)^{2}\right]=\mathbb{E}[\varepsilon_{1}^{2}]\sum_{i=1}^{d}v_{i}^{2}=\mathbb{E}[\varepsilon_{1}^{2}],\] (variance). Thus the distribution of the inner product \(\varepsilon\cdot v\) is independent of the direction of \(v,\) which is an expression of the isotropy of the distribution of \(\varepsilon.\) Moreover, if \(w_{1},\ldots,w_{d}\) is any orthonormal basis of \(\mathbb{R}^{d},\) then the inner products \(\varepsilon\cdot w_{1},\ldots,\varepsilon\cdot w_{d}\) are independent. One can verify this by showing that the covariances \(\mathbb{E}[(\varepsilon\cdot w_{i})(\varepsilon\cdot w_{j})]\) are zero for \(i\ne j,\) using a computation similar to (3.67). This means that the joint probability distribution of the coordinates of \(\varepsilon\) is the same regardless of the orthonormal basis used to compute them. This is an even stronger expression of the isotropy of \(\varepsilon.\)

3.30 • Higher dimensional quadratic

For \(d\ge1\) and \(f_{n}:\mathbb{R}^{d}\to\mathbb{R}\) consider the random quadratic \(f_{n}(x)=\frac{1}{2}x^{T}Ax+\varepsilon_{n}\cdot x,\) where \(A\in\mathbb{R}^{d\times d}\) is symmetric and positive definite, and \(\varepsilon_{n}\) is a sequence of i.i.d. random vectors, each with independent Gaussian entries with mean zero and variance \(\sigma^{2}>0.\) We have \[\overline{f}(x)=\mathbb{E}[f_{n}(x)]=\frac{1}{2}x^{T}Ax,\] which is our deterministic objective function, with unique global minimum at \(x_{*}=0.\) Since \(\nabla f_{n}(x)=Ax+\varepsilon_{n}\) the iterates of SGD are \[x_{n+1}=x_{n}-\eta(Ax_{n}+\varepsilon_{n}),\quad n\ge0.\tag{3.71}\] As in Example 3.32, Example 3.22 and Lemma 3.17, we can diagonalize \(A\) with eigenvalues \(0<\lambda_{1}\le\ldots\le\lambda_{d}\) and express (3.71) in the diagonalizing basis. This yields \[x_{n+1,i}=x_{n,i}-\eta(\lambda_{i}x_{n,i}+\varepsilon_{n,i}),\quad i=1,\ldots,d,n\ge0.\tag{3.72}\] By the discussion preceding the example, the coordinates \(\varepsilon_{n,i}\) are independent Gaussians with mean zero and variance \(\sigma^{2}.\) Thus in (3.72) each coordinate \(x_{n,i}\) evolves independently of the others. We can therefore apply the calculation in Example 3.27 to each dimension, and we see that provided \(\eta\in(0,\frac{1}{\lambda_{\max}})\) \[x_{n,i}\to\Delta_{i}\text{ in distribution, as }n\to\infty,i=1,\ldots,d,\] where \(\Delta_{i}\) is a mean zero Gaussian random variable with variance \(\frac{\sigma^{2}\eta^{2}}{1-(1-\lambda_{i}\eta)^{2}}.\) Since the \(x_{n,i}\) are independent for different \(i,\) it follows that \[(x_{n,1},\ldots,x_{n,d})\to\Delta\text{ in distribution, as }n\to\infty,\] where \(\Delta=(\Delta_{1},\ldots,\Delta_{d})\) is an \(\mathbb{R}^{d}\)-valued Gaussian vector with independent mean-zero entries, with \(i\)-th entry having variance \(\frac{\sigma^{2}\eta^{2}}{1-(1-\lambda_{i}\eta)^{2}}.\) One can also show that \[\mathbb{E}[|x_{n}-x_{*}|^{2}]\to\sigma^{2}\eta^{2}\sum_{i=1}^{d}\frac{1}{1-(1-\eta\lambda_{i})^{2}},\quad\text{ as }n\to\infty.\] Similarly applying the calculation from Example 3.29 to each coordinate, we obtain that with a decreasing step-size schedule of the form \[\eta_{n}=\frac{1}{\lambda_{\max}(n+1)}\] it holds that for all \(i=1,\ldots,d,\) \[x_{n,i}\to x_{*,i}=0\text{ in distribution, as }n\to\infty.\] This in turn implies that \[x_{n}\to x_{*}=0\in\mathbb{R}^{d}\text{ in distribution, as }n\to\infty.\]

Appendix

0.3.7 Proof of Lagrange Multiplier Theorem

Proof of Theorem 2.20. Assume \(x_{*}\) is a local extremum of \(f|_{I_{h}}\) - in particular \(h(x_{*})=0.\) By the condition of the theorem \(\nabla h_{1}(x_{*}),\ldots,\nabla h_{k}(x_{*})\in\mathbb{R}^{d}\) are linearly independent. Thus there exists a subset \(K\) of the coordinate indices \(\{1,\ldots,d\}\) of size \(k\) such that the \(k\times k\) matrix \((\partial_{j}h_{i}(x_{*}))_{i=1,\ldots,k,j\in K}\) has full rank. Write \(e_{1},\ldots,e_{d}\) for the standard basis vectors of \(\mathbb{R}^{d}.\) Let us change (permute) the basis so that the basis vectors \(e_{j},j\in K\) are mapped to \(e_{1},\ldots,e_{k}.\) In this new basis, it is the \(k\times k\) matrix \(A:=(\partial_{j}h_{i}(x_{*}))_{i,j=1,\ldots,k}\) that has full rank. In the rest of the proof all vectors and matrices are written in the new basis.

We now use the implicit function theorem to parameterize the constraint set in a neighborhood of \(x_{*}.\) Let \(m=d-k\) and consider the map \[H:\mathbb{R}^{k}\times\mathbb{R}^{m}\to\mathbb{R}^{k}\] given by \[H(u,v)=h(\underset{\in\mathbb{R}^{d}}{\underbrace{(u,v)}}),\quad\text{for }u\in\mathbb{R}^{k},v\in\mathbb{R}^{m}.\] Then \(H\) is continuously differentiable since \(h\) is. Denote by \(D_{u}H=(\partial_{j}h_{i}((u,v)))_{i,j=1,\ldots,k}\) the Jacobian of \(H\) w.r.t. \(u,\) which is a \(k\times k\) matrix whose \(i\)-th row consists of the first \(k\) coordinates of \(\nabla h_{i}((u,v)).\) Denote by \(D_{v}H=(\partial_{j}h_{i}((u,v)))_{i=1,\ldots,k,\ j=k+1,\ldots,d}\) the Jacobian w.r.t. \(v,\) which is a \(k\times m\) matrix whose \(i\)-th row consists of the last \(m\) coordinates of \(\nabla h_{i}((u,v)).\)

For fixed \(v\in\mathbb{R}^{m}\) consider the equation \[H(u,v)=0\quad\text{for the unknown}\quad u\in\mathbb{R}^{k}.\tag{3.73}\] Writing any point \(x\) on the constraint surface as \(x=(u',v')\) for \(u'\in\mathbb{R}^{k},v'\in\mathbb{R}^{m},\) the point \(u=u'\) is a solution to the equation (3.73) with \(v=v'.\) In particular, letting \(u_{*}\in\mathbb{R}^{k},v_{*}\in\mathbb{R}^{m}\) be s.t. \(x_{*}=(u_{*},v_{*})\) we have \[H(u_{*},v_{*})=0.\] Now consider solutions \(\overline{u}(v)\) to (3.73) for \(v\) close to \(v_{*}.\) Letting \(J_{u}:=D_{u}H|_{(u,v)=(u_{*},v_{*})}\) and \(J_{v}=D_{v}H|_{(u,v)=(u_{*},v_{*})}\) denote the Jacobians w.r.t. \(u\) and \(v\) at \((u_{*},v_{*}),\) Taylor expansion yields the approximation \[H(u,v)\approx\underset{=0\in\mathbb{R}^{k}}{\underbrace{H(u_{*},v_{*})}}+\underset{k\times m}{\underbrace{J_{v}}}\underset{\in\mathbb{R}^{m}}{\underbrace{(v-v_{*})}}+\underset{k\times k}{\underbrace{J_{u}}}\underset{\in\mathbb{R}^{k}}{\underbrace{(u-u_{*})}}\] for \(u,v\) close to \(u_{*},v_{*}.\) This suggests that \((u,v)\) close to \((u_{*},v_{*})\) such that \(H(u,v)=0\) are approximated by solutions to the linear equation \(J_{v}(v-v_{*})+J_{u}(u-u_{*})=0.\) Now \(J_{u}=A\) is non-degenerate, so the latter equation has the unique solution \(u-u_{*}=-J_{u}^{-1}J_{v}(v-v_{*}),\) indicating that solutions \(\overline{u}(v)\) of (3.73) exist for \(v\) close to \(v_{*},\) and that \(\overline{u}(v)\approx u_{*}-J_{u}^{-1}J_{v}(v-v_{*}).\) The rigorous version of this heuristic approximation uses the implicit function theorem. Since \(H\) is continuously differentiable and \(J_{u}\) is non-degenerate, the implicit function theorem implies the existence of open balls \(U,V\) s.t. \(u_{*}\in U\subset\mathbb{R}^{k},v_{*}\in V\subset\mathbb{R}^{m}\) and a continuously differentiable function \(\overline{u}:V\to U\) such that \[u\in U,v\in V,H(u,v)=0\iff u\in U,v\in V,u=\overline{u}(v).\] In particular, \[u_{*}=\overline{u}(v_{*})\quad\text{and}\quad x_{*}=(\overline{u}(v_{*}),v_{*}).\] Furthermore the Jacobian \(D\overline{u}=(\partial_{j}\overline{u}_{i}(v))_{i=1,\ldots,k,j=1,\ldots,m}\in\mathbb{R}^{k\times m}\) of \(\overline{u}\) evaluated at \(v=v_{*}\) satisfies \[\underset{k\times m}{\underbrace{D\overline{u}|_{v=v_{*}}}}=-\underset{k\times k}{\underbrace{J_{u}^{-1}}}\underset{k\times m}{\underbrace{J_{v}}}.\tag{3.74}\] (This is the natural formula from the point of view of the aforementioned heuristic approximation \(\overline{u}(v)\approx u_{*}-J_{u}^{-1}J_{v}(v-v_{*}).\))

Consider the continuously differentiable map \(\overline{f}:V\to\mathbb{R}\) given by \[\overline{f}(v)=f(\overline{u}(v),v).\] If \(x_{*}\) is a local minimum of \(f|_{I_{h}},\) then \(v_{*}\) must be a local minimum of \(\overline{f}.\) Similarly, if \(x_{*}\) is a local maximum then \(v_{*}\) is a local maximum of \(\overline{f}.\) In either case, Lemma 1.7 implies that \[\{\nabla_{v}\overline{f}\}(v_{*})=0.\tag{3.75}\] Write \(\partial_{u_{i}}f\) for \(\partial_{i}f\) for \(i=1,\ldots,k\) and \(\nabla_{u}f=(\partial_{u_{1}}f,\ldots,\partial_{u_{k}}f)\in\mathbb{R}^{k},\) and similarly \(\partial_{v_{i}}f\) for \(\partial_{k+i}f\) for \(i=1,\ldots,m\) and \(\nabla_{v}f=(\partial_{v_{1}}f,\ldots,\partial_{v_{m}}f)\in\mathbb{R}^{m}.\) Then \(\nabla f=(\nabla_{u}f,\nabla_{v}f).\) By the chain rule \[\partial_{i}\overline{f}(v)=\sum_{j=1}^{k}\{\partial_{i}\overline{u}_{j}\}(v)\{\partial_{u_{j}}f\}(\overline{u}(v),v)+\{\partial_{v_{i}}f\}(\overline{u}(v),v),\] for \(i=1,\ldots,m,\) or in vector notation \[\underset{\in\mathbb{R}^{m}}{\underbrace{\nabla_{v}\overline{f}(v)}}=\underset{m\times k}{\underbrace{(D\overline{u}|_{v})^{T}}}\underset{\in\mathbb{R}^{k}}{\underbrace{\{\nabla_{u}f\}(\overline{u}(v),v)}}+\underset{\in\mathbb{R}^{m}}{\underbrace{\{\nabla_{v}f\}(\overline{u}(v),v)}}.\] Thus (3.75) implies that \[\underset{m\times k}{\underbrace{(D\overline{u}|_{v=v_{*}})^{T}}}\underset{\in\mathbb{R}^{k}}{\underbrace{\{\nabla_{u}f\}(x_{*})}}=-\underset{\in\mathbb{R}^{m}}{\underbrace{\{\nabla_{v}f\}(x_{*})}},\] which by (3.74) in turn implies \[\underset{m\times k}{\underbrace{J_{v}^{T}}}\underset{k\times k}{\underbrace{(J_{u}^{-1})^{T}}}\underset{\in\mathbb{R}^{k}}{\underbrace{\{\nabla_{u}f\}(x_{*})}}=\underset{\in\mathbb{R}^{m}}{\underbrace{\{\nabla_{v}f\}(x_{*})}}.\] Letting \(\lambda=(J_{u}^{-1})^{T}\{\nabla_{u}f\}(x_{*})\in\mathbb{R}^{k}\) and using that the columns of \(J_{v}^{T}\) are the vectors \(\nabla_{v}h_{i}(x_{*}),i=1,\ldots,k\) it follows that \[\nabla_{v}f(x_{*})=\sum_{i=1}^{k}\nabla_{v}h_{i}(x_{*})\lambda_{i}.\tag{3.76}\] Furthermore \(\{\nabla_{u}f\}(x_{*})=J_{u}^{T}\lambda,\) which similarly implies that \[\nabla_{u}f(x_{*})=\sum_{i=1}^{k}\nabla_{u}h_{i}(x_{*})\lambda_{i}.\tag{3.77}\] Combining (3.76) and (3.77) shows that (2.21) holds with \(x=x_{*}.\) Since this holds for any local extremum of \(f|_{I_{h}}\) the proof is complete.

3.31. In the proof we essentially proved that \(I_{h}\) is an \(m\)-dimensional manifold, as studied in the course on differential geometry. In the language of differential geometry, the condition (2.21) is equivalent to the covariant gradient of the function vanishing at \(x.\)

0.3.8 Additional code for gradient descent Example 3.2

For a complicated objective function, it is easy to make a mistake when computing the derivatives, or when implementing the computation in code. This can lead to strange behavior, such as the objective function increasing as GD progresses. To find such errors it is convenient to compare the output of one's gradient computation routines with numerically computed approximate gradients. The code below gives an example of this.

import numpy as np

def g(a,b,x):
    return b*np.exp(-0.5*(x-a)**2)

# Gradient of g wrt to a
def grad_g_a(a,b,x):
    return (x-a)*g(a,b,x)

# Gradient of g wrt to b
def grad_g_b(a,b,x):
    return np.exp(-0.5*(x-a)**2)

def h(avec,bvec,x):
    ret = 0
    for i in range(len(avec)):
        ret += g(avec[i],bvec[i],x)
    return ret

# Gradient of h
def grad_h(avec,bvec,x):
    reta = np.zeros(len(avec))
    retb = np.zeros(len(bvec))
    for i in range(len(avec)):
        reta[i] = grad_g_a(avec[i],bvec[i],x)
        retb[i] = grad_g_b(avec[i],bvec[i],x)
    return (reta,retb)

def L(avec,bvec,xvec,yvec):
    ret = 0
    for k in range(len(xvec)):
        ret += (h(avec,bvec,xvec[k])-yvec[k])**2/2.0
    return ret

# Gradient of L
def grad_L(avec,bvec,xvec,yvec):
    reta = np.zeros(len(avec))
    retb = np.zeros(len(bvec))
    for k in range(len(xvec)):
        (ga,gb) = grad_h(avec,bvec,xvec[k])
        reta += (h(avec,bvec,xvec[k])-yvec[k])*ga
        retb += (h(avec,bvec,xvec[k])-yvec[k])*gb
    return (reta,retb)


xvec = np.array( [-1,0,1] )
yvec = xvec


# Code to debug gradient calculation:
# compare to numerically computed derivatives
num_bumps = 4  # number of bumps; matches the length of evec below
avec = np.random.uniform( -1, 1, num_bumps )
bvec = np.random.uniform( -1, 1, num_bumps )
evec = np.array( [0,1,0,0])
eps = 0.00001
# numerically computed derivative wrt to a_2 (evec perturbs the second entry)
print( (h(avec+eps*evec,bvec,xvec[0])-h(avec,bvec,xvec[0]))/eps )
# numerically computed derivative wrt to b_2
print( (h(avec,bvec+eps*evec,xvec[0])-h(avec,bvec,xvec[0]))/eps ) 
# exact gradient
print( grad_h(avec,bvec,xvec[0]) )

The following code can be used to create plots based on the output of gradient descent.

import matplotlib.pyplot as plt

# Data points
xvec = np.array( [-1,0,1] )
yvec = xvec

# Initial parameters
avec0 = np.random.uniform(-1, 1, num_bumps)
bvec0 = np.zeros(num_bumps)

# Run GD (GD_minimize and num_steps as in Example 3.2)
(histL,hista,histb) = GD_minimize(avec0,bvec0, \
                                 xvec,yvec,num_steps)



# Plot results

plt.plot(histL)
plt.title("Objective value $L(a_n,b_n)$ for each step of GD")
plt.xlabel("Step")
plt.ylabel("$L$")
plt.show()

plt.semilogy(histL)
plt.title("Objective value $L(a_n,b_n)$ for each step of GD")
plt.xlabel("Step")
plt.ylabel("$L$ (log scale)")
plt.show()


xs = np.linspace( -2, 2, 100 )
for n in [0,1,4,10,len(hista)-1]:
    plt.plot( xs,
             [h(hista[n],histb[n],x) for x in xs],
             label=f"$n={n}$" )
plt.scatter(xvec, yvec,color='black')
plt.title( "Function $x \\to h(a_n,b_n,x)$" + \
           "encoded by values $(a_n,b_n)$ at $n$-th step of GD" )
plt.show()

colors = ['blue', 'red', 'green', 'purple']
for i in range(4):
    plt.scatter([hista[0][i]],
                [histb[0][i]],
                marker='x', color=colors[i])
    plt.scatter([hista[-1][i]],
                [histb[-1][i]],
                marker='o', color=colors[i])
    plt.plot([avec[i] for avec in hista],
             [bvec[i] for bvec in histb],
             color=colors[i], label=f"$i={i+1}$")
last_n = len(hista)-1
plt.title(f"Trajectories of $(a_{{n,i}},b_{{n,i}})$" + \
          f"in plane throughout optimization\n" + \
          f"X: starting point $(a_{{0,i}},b_{{0,i}}),$" + \
          f"Dot: ending point " + \
          f"$(a_{{{last_n},i}},b_{{{last_n},i}})$")
plt.show()

0.3.9 SGD for non-convex function

For a general non-convex objective function we cannot hope to prove convergence to a global minimizer. The best we can do is the following proposition, which guarantees that the gradient \(\nabla f(x_{n})\) decays under appropriate conditions. For technical reasons it actually bounds the minimum of \(|\nabla f(x_{m})|^{2}\) over \(m=0,\ldots,n.\)

3.32

Assume \(d\ge1,\) and let \(f:\mathbb{R}^{d}\to\mathbb{R}\) be a function which is bounded below and differentiable with \(L\)-Lipschitz gradient. Let \(f_{*}=\inf_{x\in\mathbb{R}^{d}}f(x).\) Let \(\eta>0,\) \(x_{0}\in\mathbb{R}^{d},\) and let \(x_{n},n\ge0,\) denote the iterates of gradient descent with step size \(\eta\) and starting at \(x_{0}.\) If \(\eta\in(0,\frac{1}{L}]\) then \[\min_{m=0}^{n}|\nabla f(x_{m})|^{2}\le\frac{\frac{2}{\eta}(f(x_{0})-f_{*})}{n+1}.\]

Proof. As in (3.23) and using that \(\eta\le\frac{1}{L}\) we obtain \[\frac{\eta}{2}|\nabla f(x_{m})|^{2}\le f(x_{m})-f(x_{m+1})\quad\text{for all }\quad m\ge0.\] Summing this for \(m=0,\ldots,n\) yields \[\frac{\eta}{2}\sum_{m=0}^{n}|\nabla f(x_{m})|^{2}\le f(x_{0})-f(x_{n+1}).\] Since \(f(x_{n+1})\ge f_{*},\) and bounding each \(|\nabla f(x_{m})|^{2}\) from below by the minimum over \(m,\) we obtain \[\frac{\eta}{2}(n+1)\min_{m=0}^{n}|\nabla f(x_{m})|^{2}\le f(x_{0})-f_{*}.\] This proves the claim.
