
On the Convergence of Adaptive Gradient Methods for

Nonconvex Optimization

Dongruo Zhou∗† Jinghui Chen∗‡ Yuan Cao∗§ Yiqi Tang¶ Ziyan Yang‖ Quanquan Gu∗∗

Abstract

Adaptive gradient methods are workhorses in deep learning. However, the convergence

guarantees of adaptive gradient methods for nonconvex optimization have not been thoroughly

studied. In this paper, we provide a fine-grained convergence analysis for a general class of

adaptive gradient methods including AMSGrad, RMSProp and AdaGrad. For smooth nonconvex

functions, we prove that adaptive gradient methods in expectation converge to a first-order

stationary point. Our convergence rate is better than existing results for adaptive gradient

methods in terms of dimension, and is strictly faster than stochastic gradient descent (SGD)

when the stochastic gradients are sparse. To the best of our knowledge, this is the first result

showing the advantage of adaptive gradient methods over SGD in the nonconvex setting. In addition,

we also prove high probability bounds on the convergence rates of AMSGrad, RMSProp as

well as AdaGrad, which have not been established before. Our analyses shed light on better

understanding the mechanism behind adaptive gradient methods in optimizing nonconvex

objectives.

1 Introduction

Stochastic gradient descent (SGD) (Robbins and Monro, 1951) and its variants have been widely

used in training deep neural networks. Among those variants, adaptive gradient methods (AdaGrad)

(Duchi et al., 2011; McMahan and Streeter, 2010), which scale each coordinate of the gradient by a

function of past gradients, can achieve better performance than vanilla SGD in practice when the

gradients are sparse. An intuitive explanation for the success of AdaGrad is that it automatically

adjusts the learning rate for each feature based on the partial gradient, which accelerates the

convergence. However, AdaGrad was later found to demonstrate degraded performance especially in

∗Equal Contribution
†Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]
‡Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]
§Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]
¶Department of Computer Science, Ohio State University, Columbus, OH 43210, USA; e-mail: [email protected]
‖Department of Computer Science, University of Virginia, Charlottesville, VA 22904, USA; e-mail: [email protected]
∗∗Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]

arXiv:1808.05671v3 [cs.LG] 19 Oct 2020


cases where the loss function is nonconvex or the gradient is dense, due to rapid decay of learning

rate. This problem is especially exacerbated in deep learning due to the huge number of optimization

variables. To overcome this issue, RMSProp (Tieleman and Hinton, 2012) was proposed to use

exponential moving average rather than the arithmetic average to scale the gradient, which mitigates

the rapid decay of the learning rate. Kingma and Ba (2014) proposed an adaptive momentum

estimation method (Adam), which incorporates the idea of momentum (Polyak, 1964; Sutskever

et al., 2013) into RMSProp. Other related algorithms include AdaDelta (Zeiler, 2012) and Nadam

(Dozat, 2016), which combine the idea of exponential moving average of the historical gradients,

Polyak’s heavy ball (Polyak, 1964) and Nesterov’s accelerated gradient descent (Nesterov, 2013).

Recently, by revisiting the original convergence analysis of Adam, Reddi et al. (2018) found that for

some handcrafted simple convex optimization problem, Adam does not even converge to the global

minimizer. In order to address this convergence issue of Adam, Reddi et al. (2018) proposed a new

variant of the Adam algorithm named AMSGrad, which has guaranteed convergence in the convex

setting. The update rule of AMSGrad is as follows¹:
$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha_t \frac{\mathbf{m}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}, \quad \text{with } \hat{\mathbf{v}}_t = \max(\hat{\mathbf{v}}_{t-1}, \mathbf{v}_t), \tag{1.1}$$

where $\alpha_t > 0$ is the step size, $\epsilon$ is a small number to ensure numerical stability, $\mathbf{x}_t \in \mathbb{R}^d$ is the iterate in the $t$-th iteration, and $\mathbf{m}_t, \mathbf{v}_t \in \mathbb{R}^d$ are the exponential moving averages of the gradient and the squared gradient at the $t$-th iteration respectively²:
$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t, \tag{1.2}$$
$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2. \tag{1.3}$$

Here $\beta_1, \beta_2 \in [0,1]$ are algorithm hyperparameters, and $\mathbf{g}_t$ is the stochastic gradient at $\mathbf{x}_t$.
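To make the update concrete, here is a minimal NumPy sketch of one AMSGrad step implementing (1.1)-(1.3); the function name and default hyperparameters are our own illustrative choices, not from the paper.

```python
import numpy as np

def amsgrad_step(x, g, m, v, v_hat, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad step per (1.1)-(1.3); all operations are element-wise."""
    m = beta1 * m + (1 - beta1) * g        # (1.2): moving average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # (1.3): moving average of squared gradients
    v_hat = np.maximum(v_hat, v)           # element-wise max in (1.1)
    x = x - alpha * m / (np.sqrt(v_hat) + eps)
    return x, m, v, v_hat
```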

Despite the successes of adaptive gradient methods for training deep neural networks, the

convergence guarantees for these algorithms are mostly restricted to online convex optimization

(Duchi et al., 2011; Kingma and Ba, 2014; Reddi et al., 2018). Therefore, there is a huge gap

between existing online convex optimization guarantees for adaptive gradient methods and the

empirical successes of adaptive gradient methods in nonconvex optimization. In order to bridge this

gap, there are a few recent attempts to prove the nonconvex optimization guarantees for adaptive

gradient methods. More specifically, Basu et al. (2018) proved the convergence rate of RMSProp

and Adam when using deterministic gradient rather than stochastic gradient. Li and Orabona

(2018) proved the convergence rate of AdaGrad, assuming the gradient is L-Lipschitz continuous.

Ward et al. (2018) proved the convergence rate of AdaGrad-Norm where the moving average of

the norms of the gradient vectors is used to adjust the gradient vector in both deterministic and

stochastic settings for smooth nonconvex functions. Nevertheless, the convergence guarantees in

Basu et al. (2018); Ward et al. (2018) are still limited to simplified algorithms. Another attempt

to obtain the convergence rate in the stochastic setting was made recently by Zou and Shen

(2018), which focuses only on the case where the momentum vanishes. Chen et al. (2018a)

¹With slight abuse of notation, here we denote by $\sqrt{\mathbf{v}_t}$ the element-wise square root of the vector $\mathbf{v}_t$, by $\mathbf{m}_t/\sqrt{\mathbf{v}_t}$ the element-wise division between $\mathbf{m}_t$ and $\sqrt{\mathbf{v}_t}$, and by $\max(\mathbf{v}_{t-1}, \mathbf{v}_t)$ the element-wise maximum between $\mathbf{v}_{t-1}$ and $\mathbf{v}_t$.
²Here we denote by $\mathbf{g}_t^2$ the element-wise square of the vector $\mathbf{g}_t$.


studied the convergence properties of adaptive gradient methods in the nonconvex setting; however, their convergence rate has a quadratic dependence on the problem dimension $d$. Defossez et al. (2020) proved the convergence of Adam and AdaGrad in nonconvex smooth optimization under the assumption of an almost sure uniform bound on the $\ell_\infty$ norm of the gradients. In this paper,

we provide a fine-grained convergence analysis of the adaptive gradient methods. In particular,

we analyze several representative adaptive gradient methods, i.e., AMSGrad (Reddi et al., 2018),

which fixes the non-convergence issue in Adam, and the corrected version of RMSProp (Reddi et al., 2018), and prove their convergence rates for smooth nonconvex objective functions in the stochastic

optimization setting. Moreover, existing theoretical guarantees for adaptive gradient methods are

mostly bounds in expectation over the randomness of stochastic gradients, and are therefore only

on-average convergence guarantees. However, in practice the optimization algorithm is usually

only run once, and therefore the performance cannot be guaranteed by the in-expectation bounds.

To deal with this problem, we also provide high probability convergence rates for AMSGrad and

RMSProp, which can characterize the performance of the algorithms on single runs.

1.1 Our Contributions

The main contributions of our work are summarized as follows:

• We prove that the convergence rate of AMSGrad to a stationary point for stochastic nonconvex optimization is
$$O\bigg(\frac{d^{1/2}}{T^{3/4-s/2}} + \frac{d}{T}\bigg), \tag{1.4}$$
when $\|\mathbf{g}_{1:T,i}\|_2 \le G_\infty T^s$. Here $\mathbf{g}_{1:T,i} = [g_{1,i}, g_{2,i}, \ldots, g_{T,i}]^\top$ with $\{\mathbf{g}_t\}_{t=1}^T$ being the stochastic gradients satisfying $\|\mathbf{g}_t\|_\infty \le G_\infty$, and $s \in [0, 1/2]$ is a parameter that characterizes the growth rate of the cumulative stochastic gradient $\mathbf{g}_{1:T,i}$. When the stochastic gradients are sparse, i.e., $s < 1/2$, AMSGrad achieves a strictly faster convergence rate than vanilla SGD (Ghadimi and Lan, 2016) in terms of the iteration number $T$.

• Our result implies that the worst-case (i.e., $s = 1/2$) convergence rate for AMSGrad is
$$O\bigg(\sqrt{\frac{d}{T}} + \frac{d}{T}\bigg),$$
which has a better dependence on the dimension $d$ and $T$ than the convergence rate proved in Chen et al. (2018a), i.e.,
$$O\bigg(\frac{\log T + d^2}{\sqrt{T}}\bigg).$$

• We also establish high probability bounds for adaptive gradient methods. To the best of our

knowledge, these are the first high probability convergence guarantees for AMSGrad and RMSProp in the nonconvex stochastic optimization setting.


2 Additional Related Work

Here we briefly review other related work that has not been covered above.

Adaptive gradient methods: Mukkamala and Hein (2017) proposed SC-Adagrad and SC-RMSprop, and derived logarithmic regret bounds for strongly convex functions. Chen et al. (2018b)

proposed SADAGRAD for solving stochastic strongly convex optimization and more generally

stochastic convex optimization that satisfies the second order growth condition. Zaheer et al. (2018)

studied the effect of the adaptive denominator constant $\epsilon$ and the minibatch size on the convergence of

adaptive gradient methods. Chen et al. (2020) proposed a partially adaptive gradient method and

proved its convergence in nonconvex settings.

Nonconvex Stochastic Optimization: Ghadimi and Lan (2013) proposed a randomized

stochastic gradient (RSG) method, and proved its O(1/√T ) convergence rate to a stationary point.

Ghadimi and Lan (2016) proposed a randomized stochastic accelerated gradient (RSAG) method,

which achieves an $O(1/T + \sigma^2/\sqrt{T})$ convergence rate, where $\sigma^2$ is an upper bound on the variance of

the stochastic gradient. Motivated by the success of stochastic momentum methods in deep learning

(Sutskever et al., 2013), Yang et al. (2016) provided a unified convergence analysis for both the stochastic heavy-ball method and the stochastic variant of Nesterov's accelerated gradient method, and proved an $O(1/\sqrt{T})$ convergence rate to a stationary point for smooth nonconvex functions. Reddi et al. (2016);

Allen-Zhu and Hazan (2016) proposed variants of the stochastic variance-reduced gradient (SVRG) method (Johnson and Zhang, 2013) that are provably faster than gradient descent in the nonconvex

finite-sum setting. Lei et al. (2017) proposed the stochastically controlled stochastic gradient (SCSG) method, which further improves the convergence rate of SVRG for finite-sum smooth nonconvex optimization.

Recently, Zhou et al. (2018) proposed a new algorithm called stochastic nested variance-reduced

gradient (SNVRG), which achieves strictly better gradient complexity than both SVRG and SCSG

for finite-sum and stochastic smooth nonconvex optimization.

Stochastic Smooth Nonconvex Optimization: There is another line of research in stochastic

smooth nonconvex optimization, which makes use of the λ-nonconvexity of a nonconvex function

$f$ (i.e., $\nabla^2 f \succeq -\lambda\mathbf{I}$). More specifically, Natasha 1 (Allen-Zhu, 2017b) and Natasha 1.5 (Allen-

Zhu, 2017a) have been proposed, which solve a modified regularized problem and achieve faster

convergence rates to first-order stationary points than SVRG and SCSG in the finite-sum and

stochastic settings respectively. In addition, Allen-Zhu (2018) proposed an SGD4 algorithm, which

optimizes a series of regularized problems, and is able to achieve a faster convergence rate than

SGD.

High Probability Bounds: There are only a few works on high probability results. Kakade and Tewari (2009) proved high probability bounds for the PEGASOS algorithm via Freedman's inequality. Harvey et al. (2019a,b) proved convergence bounds for the non-smooth, strongly convex case via a generalized Freedman's inequality. Jain et al. (2019) made the last iterate of SGD information-theoretically optimal by providing a high probability bound. Li and Orabona (2020) presented a

high probability analysis for the Delayed AdaGrad algorithm with momentum in the smooth nonconvex

setting.


3 Notation

Scalars are denoted by lower case letters, vectors by lower case bold face letters, and matrices by upper case bold face letters. For a vector $\mathbf{x} = [x_i] \in \mathbb{R}^d$, we denote the $\ell_p$ norm ($p \ge 1$) of $\mathbf{x}$ by $\|\mathbf{x}\|_p = \big(\sum_{i=1}^d |x_i|^p\big)^{1/p}$, and the $\ell_\infty$ norm of $\mathbf{x}$ by $\|\mathbf{x}\|_\infty = \max_{i=1}^d |x_i|$. For a sequence of vectors $\{\mathbf{g}_j\}_{j=1}^t$, we denote by $g_{j,i}$ the $i$-th element in $\mathbf{g}_j$. We also denote $\mathbf{g}_{1:t,i} = [g_{1,i}, g_{2,i}, \ldots, g_{t,i}]^\top$. With a slight abuse of notation, for any two vectors $\mathbf{a}$ and $\mathbf{b}$, we denote by $\mathbf{a}^2$ the element-wise square, by $\mathbf{a}^p$ the element-wise power operation, by $\mathbf{a}/\mathbf{b}$ the element-wise division, and by $\max(\mathbf{a}, \mathbf{b})$ the element-wise maximum. For a matrix $\mathbf{A} = [A_{ij}] \in \mathbb{R}^{d \times d}$, we define $\|\mathbf{A}\|_{1,1} = \sum_{i,j=1}^d |A_{ij}|$. Given two sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n = O(b_n)$ if there exists a constant $0 < C < +\infty$ such that $a_n \le C b_n$. We use the notation $\widetilde{O}(\cdot)$ to hide logarithmic factors.

4 Algorithms

We mainly consider the following three algorithms: AMSGrad (Reddi et al., 2018), a corrected

version of RMSProp (Tieleman and Hinton, 2012; Reddi et al., 2018), and AdaGrad (Duchi et al.,

2011).

Algorithm 1 AMSGrad (Reddi et al., 2018)

Input: initial point $\mathbf{x}_1$, step sizes $\{\alpha_t\}_{t=1}^T$, $\beta_1$, $\beta_2$, $\epsilon$.
1: $\mathbf{m}_0 \leftarrow \mathbf{0}$, $\mathbf{v}_0 \leftarrow \mathbf{0}$, $\hat{\mathbf{v}}_0 \leftarrow \mathbf{0}$
2: for $t = 1$ to $T$ do
3:    $\mathbf{g}_t = \nabla f(\mathbf{x}_t; \xi_t)$
4:    $\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t$
5:    $\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)\mathbf{g}_t^2$
6:    $\hat{\mathbf{v}}_t = \max(\hat{\mathbf{v}}_{t-1}, \mathbf{v}_t)$
7:    $\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha_t \hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t$ with $\hat{\mathbf{V}}_t = \mathrm{diag}(\hat{\mathbf{v}}_t + \epsilon)$
8: end for
Output: Choose $\mathbf{x}_{\mathrm{out}}$ from $\{\mathbf{x}_t\}$, $2 \le t \le T$, with probability $\alpha_{t-1}/\sum_{i=1}^{T-1}\alpha_i$.

The AMSGrad algorithm was originally proposed by Reddi et al. (2018) to fix the non-convergence issue in the original Adam optimizer (Kingma and Ba, 2014). Specifically, in Algorithm 1, the effective learning rate of AMSGrad is $\alpha_t \hat{\mathbf{V}}_t^{-1/2}$ where $\hat{\mathbf{V}}_t = \mathrm{diag}(\hat{\mathbf{v}}_t)$, while in the original Adam, the effective learning rate is $\alpha_t \mathbf{V}_t^{-1/2}$ where $\mathbf{V}_t = \mathrm{diag}(\mathbf{v}_t)$. This choice of effective learning rate guarantees that it is non-increasing, and thus fixes the possible convergence issue. In Algorithm 2 we present the corrected version of RMSProp (Tieleman and Hinton, 2012), where the effective learning rate is also set as $\alpha_t \hat{\mathbf{V}}_t^{-1/2}$.

In Algorithm 3 we further present the AdaGrad algorithm (Duchi et al., 2011), which uses the sum of past squared stochastic gradients, instead of their running average, to compute the effective learning rate.


Algorithm 2 RMSProp (Tieleman and Hinton, 2012) (corrected version by Reddi et al. (2018))

Input: initial point $\mathbf{x}_1$, step sizes $\{\alpha_t\}_{t=1}^T$, $\beta$, $\epsilon$.
1: $\mathbf{v}_0 \leftarrow \mathbf{0}$, $\hat{\mathbf{v}}_0 \leftarrow \mathbf{0}$
2: for $t = 1$ to $T$ do
3:    $\mathbf{g}_t = \nabla f(\mathbf{x}_t; \xi_t)$
4:    $\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta)\mathbf{g}_t^2$
5:    $\hat{\mathbf{v}}_t = \max(\hat{\mathbf{v}}_{t-1}, \mathbf{v}_t)$
6:    $\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha_t \hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t$ with $\hat{\mathbf{V}}_t = \mathrm{diag}(\hat{\mathbf{v}}_t + \epsilon)$
7: end for
Output: Choose $\mathbf{x}_{\mathrm{out}}$ from $\{\mathbf{x}_t\}$, $2 \le t \le T$, with probability $\alpha_{t-1}/\sum_{i=1}^{T-1}\alpha_i$.

Algorithm 3 AdaGrad (Duchi et al., 2011)

Input: initial point $\mathbf{x}_1$, step sizes $\{\alpha_t\}_{t=1}^T$, $\epsilon$.
1: $\mathbf{v}_0 \leftarrow \mathbf{0}$
2: for $t = 1$ to $T$ do
3:    $\mathbf{g}_t = \nabla f(\mathbf{x}_t; \xi_t)$
4:    $\mathbf{v}_t = \mathbf{v}_{t-1} + \mathbf{g}_t^2$
5:    $\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha_t \mathbf{V}_t^{-1/2}\mathbf{g}_t$ with $\mathbf{V}_t = \mathrm{diag}(\mathbf{v}_t + \epsilon)$
6: end for
Output: Choose $\mathbf{x}_{\mathrm{out}}$ from $\{\mathbf{x}_t\}$, $2 \le t \le T$, with probability $\alpha_{t-1}/\sum_{i=1}^{T-1}\alpha_i$.
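To illustrate how the three algorithms relate in code, the following is a minimal sketch of Algorithm 1 together with the randomized output rule; the driver function `amsgrad`, the quadratic toy objective, and all hyperparameter values are our own assumptions for illustration. RMSProp (Algorithm 2) is obtained by replacing the momentum update with `m = g`, and AdaGrad (Algorithm 3) by accumulating `v = v + g**2` without the max step.

```python
import numpy as np

def amsgrad(grad, x1, T, alpha=1e-2, beta1=0.9, beta2=0.99, eps=1e-8, seed=0):
    """Sketch of Algorithm 1; grad(x, rng) returns a stochastic gradient at x."""
    rng = np.random.default_rng(seed)
    x = x1.astype(float).copy()
    m, v, v_hat = np.zeros_like(x), np.zeros_like(x), np.zeros_like(x)
    iterates = []
    for t in range(1, T + 1):
        g = grad(x, rng)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        v_hat = np.maximum(v_hat, v)
        x = x - alpha * m / np.sqrt(v_hat + eps)   # V_t = diag(v_hat + eps)
        iterates.append(x.copy())                  # iterates[0] is x_2, iterates[T-1] is x_{T+1}
    # Output rule: pick x_t, 2 <= t <= T, with probability alpha_{t-1} / sum_i alpha_i;
    # with a constant step size this is uniform over x_2, ..., x_T.
    return iterates[rng.integers(0, T - 1)]

# Toy usage on f(x) = ||x||_2^2 / 2 with additive Gaussian gradient noise (an assumption).
x_out = amsgrad(lambda x, rng: x + 0.1 * rng.normal(size=x.shape), np.ones(10), T=1000)
```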

5 Convergence of Adaptive Gradient Methods in Expectation

In this section we present our main theoretical results on the convergence of AMSGrad, RMSProp

as well as AdaGrad. We study the following stochastic nonconvex optimization problem
$$\min_{\mathbf{x}\in\mathbb{R}^d} f(\mathbf{x}) := \mathbb{E}_\xi\big[f(\mathbf{x}; \xi)\big],$$
where $\xi$ is a random variable following a certain distribution and $f(\mathbf{x}; \xi): \mathbb{R}^d \to \mathbb{R}$ is an $L$-smooth nonconvex function. In the stochastic setting, one cannot directly access the full gradient of $f(\mathbf{x})$. Instead, one can only obtain unbiased estimators of the gradient of $f(\mathbf{x})$, namely $\nabla f(\mathbf{x}; \xi)$. This setting has been studied in Ghadimi and Lan (2013, 2016).

Assumption 5.1 (Bounded Gradient). f(x) = Eξf(x; ξ) has G∞-bounded stochastic gradient.

That is, for any ξ, we assume that ‖∇f(x; ξ)‖∞ ≤ G∞.

It is worth mentioning that Assumption 5.1 is slightly weaker than the $\ell_2$-boundedness assumption $\|\nabla f(\mathbf{x}; \xi)\|_2 \le G_2$ used in Reddi et al. (2016); Chen et al. (2018a). Since $\|\nabla f(\mathbf{x}; \xi)\|_\infty \le \|\nabla f(\mathbf{x}; \xi)\|_2 \le \sqrt{d}\,\|\nabla f(\mathbf{x}; \xi)\|_\infty$, the $\ell_2$-boundedness assumption implies Assumption 5.1 with $G_\infty = G_2$. Meanwhile, $G_\infty$ can be tighter than $G_2$ by a factor of $\sqrt{d}$ when the coordinates of $\nabla f(\mathbf{x}; \xi)$ are almost equal to each other.
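As a concrete instance of the last claim (our own illustration): for a perfectly dense gradient whose coordinates are all equal to some $c > 0$,
$$\nabla f(\mathbf{x}; \xi) = (c, c, \ldots, c)^\top \quad\Longrightarrow\quad \|\nabla f(\mathbf{x}; \xi)\|_2 = \sqrt{d}\,c = \sqrt{d}\,\|\nabla f(\mathbf{x}; \xi)\|_\infty,$$
so one can take $G_\infty = c$, whereas the tightest $\ell_2$ bound is $G_2 = \sqrt{d}\,c$.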


Assumption 5.2 ($L$-smooth). $f(\mathbf{x}) = \mathbb{E}_\xi f(\mathbf{x}; \xi)$ is $L$-smooth: for any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, we have
$$\big|f(\mathbf{x}) - f(\mathbf{y}) - \langle\nabla f(\mathbf{y}), \mathbf{x} - \mathbf{y}\rangle\big| \le \frac{L}{2}\|\mathbf{x} - \mathbf{y}\|_2^2.$$

Assumption 5.2 is a standard assumption in the analysis of gradient-based algorithms. It is equivalent to the $L$-gradient Lipschitz condition, which is often written as $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2 \le L\|\mathbf{x} - \mathbf{y}\|_2$.

We are now ready to present our main result.

Theorem 5.3 (AMSGrad). Suppose $\beta_1 < \beta_2^{1/2}$, $\alpha_t = \alpha$ for $t = 1, \ldots, T$, and $\|\mathbf{g}_{1:T,i}\|_2 \le G_\infty T^s$ with $0 \le s \le 1/2$. Then under Assumptions 5.1 and 5.2, the iterates $\mathbf{x}_t$ of AMSGrad satisfy
$$\frac{1}{T-1}\sum_{t=2}^T \mathbb{E}\big[\|\nabla f(\mathbf{x}_t)\|_2^2\big] \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{\alpha M_3 d}{T^{1/2-s}}, \tag{5.1}$$
where $\{M_i\}_{i=1}^3$ are defined as follows:
$$M_1 = 2G_\infty\Delta_f, \qquad M_2 = \frac{2G_\infty^3\epsilon^{-1/2}}{1-\beta_1} + 2G_\infty^2, \qquad M_3 = \frac{2LG_\infty^2}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\beta_1/\beta_2^{1/2})}\bigg(1 + \frac{2\beta_1^2}{1-\beta_1}\bigg),$$
and $\Delta_f = f(\mathbf{x}_1) - \inf_{\mathbf{x}} f(\mathbf{x})$.

Remark 5.4. Note that in Theorem 5.3 we have the condition $\|\mathbf{g}_{1:T,i}\|_2 \le G_\infty T^s$. Here $s$ characterizes the growth rate condition (Liu et al., 2019) of $\mathbf{g}_{1:T,i}$, i.e., the cumulative stochastic gradient. In the worst case, where the stochastic gradients are not sparse at all, $s = 1/2$, while in practice, when the stochastic gradients are sparse, we have $s < 1/2$.
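The growth exponent can be estimated empirically from logged stochastic gradients. Below is a small sketch (our own construction, not from the paper) that computes the smallest $s$ satisfying $\max_i \|\mathbf{g}_{1:T,i}\|_2 \le G_\infty T^s$ for a run of $T$ gradients stacked row-wise:

```python
import numpy as np

def growth_exponent(G):
    """G has shape (T, d), row t being the stochastic gradient g_t.
    Returns the smallest s with max_i ||g_{1:T,i}||_2 <= G_inf * T**s,
    taking G_inf = max_t ||g_t||_inf."""
    T = G.shape[0]
    G_inf = np.abs(G).max()
    col_norms = np.linalg.norm(G, axis=0)   # ||g_{1:T,i}||_2 for each coordinate i
    return np.log(col_norms.max() / G_inf) / np.log(T)

rng = np.random.default_rng(0)
dense = rng.normal(size=(10000, 50))
sparse = dense * (rng.random((10000, 50)) < 0.01)   # ~1% of entries nonzero
print(growth_exponent(dense))    # larger s; approaches 1/2 as T grows
print(growth_exponent(sparse))   # substantially smaller s for sparse gradients
```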

Remark 5.5. If we choose $\alpha = \Theta\big(d^{1/2}T^{1/4+s/2}\big)^{-1}$, then (5.1) implies that AMSGrad achieves an
$$O\bigg(\frac{d^{1/2}}{T^{3/4-s/2}} + \frac{d}{T}\bigg)$$
convergence rate. In cases where the stochastic gradients are sparse, i.e., $s < 1/2$, we can see that the convergence rate of AMSGrad is strictly better than that of nonconvex SGD (Ghadimi and Lan, 2016), i.e., $O(\sqrt{d/T} + d/T)$. In the worst case, when $s = 1/2$, this result matches the convergence rate of nonconvex SGD (Ghadimi and Lan, 2016). Note that Chen et al. (2018a) also provided a similar bound for AMSGrad:
$$\frac{1}{T-1}\sum_{t=2}^T \mathbb{E}\big[\|\nabla f(\mathbf{x}_t)\|_2^2\big] = O\bigg(\frac{\log T + d^2}{\sqrt{T}}\bigg).$$
It can be seen that the dependence on $d$ in their bound is quadratic, which is worse than the linear dependence implied by (5.1). A very recent work (Defossez et al., 2020) discussed the convergence issue of Adam by showing that their bound contains a constant term that does not converge to zero. In comparison, our result for AMSGrad does not have such a constant term and converges to zero at a rate of $O(d^{1/2}T^{-(3/4-s/2)})$. This demonstrates that the convergence issue of Adam is indeed fixed in AMSGrad.
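To see where this choice of $\alpha$ comes from, here is a short check using only the quantities in (5.1): with $\alpha = \big(d^{1/2}T^{1/4+s/2}\big)^{-1}$,
$$\frac{M_1}{T\alpha} = M_1\,\frac{d^{1/2}T^{1/4+s/2}}{T} = M_1\,\frac{d^{1/2}}{T^{3/4-s/2}}, \qquad \frac{\alpha M_3 d}{T^{1/2-s}} = M_3\,\frac{d^{1/2}}{T^{1/4+s/2}\cdot T^{1/2-s}} = M_3\,\frac{d^{1/2}}{T^{3/4-s/2}},$$
so the two $\alpha$-dependent terms of (5.1) balance at the rate $d^{1/2}/T^{3/4-s/2}$, while the middle term contributes the remaining $d/T$.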

Corollary 5.6 (corrected version of RMSProp). Under the same conditions as in Theorem 5.3, if $\alpha_t = \alpha$ for $t = 1, \ldots, T$ and $\|\mathbf{g}_{1:T,i}\|_2 \le G_\infty T^s$ with $0 \le s \le 1/2$, then the iterates $\mathbf{x}_t$ of RMSProp satisfy
$$\frac{1}{T-1}\sum_{t=2}^T \mathbb{E}\big[\|\nabla f(\mathbf{x}_t)\|_2^2\big] \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{\alpha M_3 d}{T^{1/2-s}},$$
where $\{M_i\}_{i=1}^3$ are defined as follows:
$$M_1 = 2G_\infty\Delta_f, \qquad M_2 = 2G_\infty^3\epsilon^{-1/2} + 2G_\infty^2, \qquad M_3 = \frac{6LG_\infty^2}{\epsilon^{1/2}(1-\beta)^{1/2}}.$$

Corollary 5.7 (AdaGrad). Under the same conditions as in Theorem 5.3, if $\alpha_t = \alpha$ for $t = 1, \ldots, T$ and $\|\mathbf{g}_{1:T,i}\|_2 \le G_\infty T^s$ with $0 \le s \le 1/2$, then the iterates $\mathbf{x}_t$ of AdaGrad satisfy
$$\frac{1}{T-1}\sum_{t=2}^T \mathbb{E}\big[\|\nabla f(\mathbf{x}_t)\|_2^2\big] \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{\alpha M_3 d}{T^{1/2-s}},$$
where $\{M_i\}_{i=1}^3$ are defined as follows:
$$M_1 = 2G_\infty\Delta_f, \qquad M_2 = 2G_\infty^3\epsilon^{-1/2} + 2G_\infty^2, \qquad M_3 = 6LG_\infty^2\epsilon^{-1/2}.$$

Remark 5.8. Corollaries 5.6 and 5.7 imply that the RMSProp and AdaGrad algorithms achieve the same rate of convergence as AMSGrad. In the worst case, where $s = 1/2$, both algorithms again achieve an $O(\sqrt{d/T} + d/T)$ convergence rate, which matches the convergence rate of nonconvex SGD given by Ghadimi and Lan (2016).

Remark 5.9. Defossez et al. (2020) gave a bound of $O(\alpha^{-1}T^{-1/2} + (1+\alpha)dT^{-1/2})$ for AdaGrad, which gives the following rate with the optimal $\alpha$:
$$O\bigg(\frac{1}{\sqrt{T}} + \frac{d}{\sqrt{T}}\bigg).$$
For ease of comparison, we calculate the iteration complexity of both results. To converge to an $\epsilon_{\mathrm{target}}$-approximate first-order stationary point, the result of Defossez et al. (2020) suggests that AdaGrad requires $\Omega(\epsilon_{\mathrm{target}}^{-2}d^2)$ iterations. In sharp contrast, our result suggests that AdaGrad only requires $\Omega(\epsilon_{\mathrm{target}}^{-2}d)$ iterations. Evidently, our iteration complexity is better than theirs by a factor of $d$.
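As a short worked version of this comparison (our own calculation, under the convention that a point is $\epsilon_{\mathrm{target}}$-approximate once the bound on the averaged squared gradient norm falls below $\epsilon_{\mathrm{target}}$), the dominant worst-case terms give
$$\text{ours:}\;\; \sqrt{\frac{d}{T}} \le \epsilon_{\mathrm{target}} \iff T \ge \frac{d}{\epsilon_{\mathrm{target}}^2}, \qquad \text{Defossez et al. (2020):}\;\; \frac{d}{\sqrt{T}} \le \epsilon_{\mathrm{target}} \iff T \ge \frac{d^2}{\epsilon_{\mathrm{target}}^2}.$$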


6 Convergence of Adaptive Gradient Methods with High Probability

Theorem 5.3, Corollaries 5.6 and 5.7 bound the expectation of full gradients over the randomness of

stochastic gradients. In other words, these bounds can only guarantee the average performance of a

large number of trials of the algorithm, but cannot rule out extremely bad solutions. What’s more,

for practical applications such as training deep neural networks, we usually perform only a single run of the algorithm since the training time can be fairly long. Hence, it is essential to obtain high

probability bounds which guarantee the performance of the algorithm on single runs. To overcome

this limitation, in this section, we further establish high probability bounds of the convergence rate

for AMSGrad and RMSProp as well as AdaGrad. We make the following additional assumption.

Assumption 6.1. The stochastic gradients are sub-Gaussian random vectors (Jin et al., 2019):
$$\mathbb{E}_\xi\big[\exp\big(\langle\mathbf{v}, \nabla f(\mathbf{x}) - \nabla f(\mathbf{x}; \xi)\rangle\big)\big] \le \exp\big(\|\mathbf{v}\|_2^2\sigma^2/2\big)$$
for all $\mathbf{v} \in \mathbb{R}^d$ and all $\mathbf{x}$.

Remark 6.2. Sub-Gaussian gradient assumptions are commonly considered when studying high

probability bounds (Li and Orabona, 2020). Note that our Assumption 6.1 is weaker than Assumption B2 in Li and Orabona (2020): in the case when $\nabla f(\mathbf{x}) - \nabla f(\mathbf{x}; \xi)$ is a standard Gaussian vector, the $\sigma^2$ defined in Li and Orabona (2020) is of order $O(d)$, while $\sigma^2 = O(1)$ in our definition.
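For intuition, one can verify that Assumption 6.1 holds with $\sigma = 1$ in the standard Gaussian case: if $\mathbf{z} = \nabla f(\mathbf{x}) - \nabla f(\mathbf{x}; \xi) \sim N(\mathbf{0}, \mathbf{I}_d)$, then for any $\mathbf{v} \in \mathbb{R}^d$,
$$\mathbb{E}\big[\exp(\langle\mathbf{v}, \mathbf{z}\rangle)\big] = \prod_{i=1}^d \mathbb{E}\big[\exp(v_i z_i)\big] = \prod_{i=1}^d \exp(v_i^2/2) = \exp\big(\|\mathbf{v}\|_2^2/2\big),$$
which matches the right-hand side of Assumption 6.1 with $\sigma^2 = 1$, independently of the dimension $d$.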

Theorem 6.3 (AMSGrad). Suppose $\beta_1 < \beta_2^{1/2}$, $\alpha_t = \alpha \le 1/2$ for $t = 1, \ldots, T$, and $\|\mathbf{g}_{1:T,i}\|_2 \le G_\infty T^s$ with $0 \le s \le 1/2$. Then for any $\delta > 0$, under Assumptions 5.1, 5.2 and 6.1, with probability at least $1 - \delta$, the iterates $\mathbf{x}_t$ of AMSGrad satisfy
$$\frac{1}{T-1}\sum_{t=2}^T \|\nabla f(\mathbf{x}_t)\|_2^2 \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{\alpha M_3 d}{T^{1/2-s}}, \tag{6.1}$$
where $\{M_i\}_{i=1}^3$ are defined as follows:
$$M_1 = 4G_\infty\Delta_f + C'G_\infty^2\epsilon^{-1}\sigma^2\log(2/\delta), \qquad M_2 = \frac{4G_\infty^3\epsilon^{-1/2}}{1-\beta_1} + 4G_\infty^2,$$
$$M_3 = \frac{4LG_\infty^2}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\beta_1/\beta_2^{1/2})}\bigg(1 + \frac{2\beta_1^2}{1-\beta_1}\bigg),$$
and $\Delta_f = f(\mathbf{x}_1) - \inf_{\mathbf{x}} f(\mathbf{x})$.

Remark 6.4. Similar to the discussion in Remark 5.5, we can choose $\alpha = \Theta\big(d^{1/2}T^{1/4+s/2}\big)^{-1}$ to achieve an $O(d^{1/2}/T^{3/4-s/2} + d/T)$ convergence rate. When $s < 1/2$, this rate of AMSGrad is strictly better than that of nonconvex SGD (Ghadimi and Lan, 2016).

We also have the following corollaries characterizing the high probability bounds for RMSProp

and AdaGrad.


Corollary 6.5 (corrected version of RMSProp). Under the same conditions as in Theorem 6.3, if $\alpha_t = \alpha \le 1/2$ for $t = 1, \ldots, T$ and $\|\mathbf{g}_{1:T,i}\|_2 \le G_\infty T^s$ with $0 \le s \le 1/2$, then for any $\delta > 0$, with probability at least $1 - \delta$, the iterates $\mathbf{x}_t$ of RMSProp satisfy
$$\frac{1}{T-1}\sum_{t=2}^T \|\nabla f(\mathbf{x}_t)\|_2^2 \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{\alpha M_3 d}{T^{1/2-s}}, \tag{6.2}$$
where $\{M_i\}_{i=1}^3$ are defined as follows:
$$M_1 = 4G_\infty\Delta_f + C'G_\infty^2\epsilon^{-1}\sigma^2\log(2/\delta), \qquad M_2 = 4G_\infty^3\epsilon^{-1/2} + 4G_\infty^2, \qquad M_3 = \frac{4LG_\infty^2}{\epsilon^{1/2}(1-\beta)^{1/2}},$$
and $\Delta_f = f(\mathbf{x}_1) - \inf_{\mathbf{x}} f(\mathbf{x})$.

Corollary 6.6 (AdaGrad). Under the same conditions as in Theorem 6.3, if $\alpha_t = \alpha \le 1/2$ for $t = 1, \ldots, T$ and $\|\mathbf{g}_{1:T,i}\|_2 \le G_\infty T^s$ with $0 \le s \le 1/2$, then for any $\delta > 0$, with probability at least $1 - \delta$, the iterates $\mathbf{x}_t$ of AdaGrad satisfy
$$\frac{1}{T-1}\sum_{t=2}^T \|\nabla f(\mathbf{x}_t)\|_2^2 \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{\alpha M_3 d}{T^{1/2-s}}, \tag{6.3}$$
where $\{M_i\}_{i=1}^3$ are defined as follows:
$$M_1 = 4G_\infty\Delta_f + C'G_\infty^2\epsilon^{-1}\sigma^2\log(2/\delta), \qquad M_2 = 4G_\infty^3\epsilon^{-1/2} + 4G_\infty^2, \qquad M_3 = 4LG_\infty^2\epsilon^{-1/2},$$
and $\Delta_f = f(\mathbf{x}_1) - \inf_{\mathbf{x}} f(\mathbf{x})$.

7 Proof Sketch of the Main Theory

In this section, we provide a proof sketch of Theorems 5.3 and 6.3, and the complete proofs as well

as proofs for other corollaries and technical lemmas can be found in the appendix. Compared with

the analysis of standard stochastic gradient descent, the main difficulty of analyzing the convergence

rate of adaptive gradient methods is caused by the existence of the stochastic momentum $\mathbf{m}_t$ and the adaptive stochastic gradient $\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t$. To address this challenge, following Yang et al. (2016), we

define an auxiliary sequence $\{\mathbf{z}_t\}$: let $\mathbf{x}_0 = \mathbf{x}_1$, and for each $t \ge 1$,
$$\mathbf{z}_t = \mathbf{x}_t + \frac{\beta_1}{1-\beta_1}(\mathbf{x}_t - \mathbf{x}_{t-1}) = \frac{1}{1-\beta_1}\mathbf{x}_t - \frac{\beta_1}{1-\beta_1}\mathbf{x}_{t-1}. \tag{7.1}$$

The following lemma shows that $\mathbf{z}_{t+1} - \mathbf{z}_t$ can be represented in terms of $\mathbf{m}_t$, $\mathbf{g}_t$ and $\hat{\mathbf{V}}_t^{-1/2}$. This indicates that by considering the sequence $\{\mathbf{z}_t\}$, it is possible to analyze algorithms that include stochastic momentum, such as AMSGrad.

Lemma 7.1. Let $\mathbf{z}_t$ be defined in (7.1). Then for $t \ge 2$, we have the following expression for $\mathbf{z}_{t+1} - \mathbf{z}_t$:
$$\mathbf{z}_{t+1} - \mathbf{z}_t = \frac{\beta_1}{1-\beta_1}\Big[\mathbf{I} - \big(\alpha_t\hat{\mathbf{V}}_t^{-1/2}\big)\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big)^{-1}\Big](\mathbf{x}_{t-1} - \mathbf{x}_t) - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t.$$
We can also represent $\mathbf{z}_{t+1} - \mathbf{z}_t$ as follows:
$$\mathbf{z}_{t+1} - \mathbf{z}_t = \frac{\beta_1}{1-\beta_1}\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\big)\mathbf{m}_{t-1} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t.$$
For $t = 1$, we have $\mathbf{z}_2 - \mathbf{z}_1 = -\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1$.
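Since both representations are purely algebraic consequences of the update rule, they can be checked numerically. The sketch below (our own toy verification, with arbitrary parameters) runs a few AMSGrad steps and asserts the second form of Lemma 7.1:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 6
beta1, beta2, eps, alpha = 0.9, 0.99, 1e-8, 0.1

x = {1: rng.normal(size=d)}
x[0] = x[1]                                   # convention x_0 = x_1
m, v, v_hat = np.zeros(d), np.zeros(d), np.zeros(d)
g, M, W = {}, {}, {}                          # W[t] stores alpha_t * diag(Vhat_t)^{-1/2} as a vector
for t in range(1, T + 1):
    g[t] = rng.normal(size=d)                 # stand-in stochastic gradient
    m = beta1 * m + (1 - beta1) * g[t]
    v = beta2 * v + (1 - beta2) * g[t] ** 2
    v_hat = np.maximum(v_hat, v)
    M[t], W[t] = m.copy(), alpha / np.sqrt(v_hat + eps)
    x[t + 1] = x[t] - W[t] * M[t]             # Algorithm 1, line 7

c = beta1 / (1 - beta1)
z = {t: x[t] + c * (x[t] - x[t - 1]) for t in range(1, T + 2)}  # sequence (7.1)
for t in range(2, T + 1):
    rhs = c * (W[t - 1] - W[t]) * M[t - 1] - W[t] * g[t]        # second form in Lemma 7.1
    assert np.allclose(z[t + 1] - z[t], rhs)
assert np.allclose(z[2] - z[1], -W[1] * g[1])                   # the t = 1 case
print("Lemma 7.1 verified on a toy run")
```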

With Lemma 7.1, we have the following two lemmas giving upper bounds on $\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2$ and $\|\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\|_2$, which are useful for the proof of the main theorem.

Lemma 7.2. Let $\mathbf{z}_t$ be defined in (7.1). For $t \ge 2$, we have
$$\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2 \le \big\|\alpha\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2.$$

Lemma 7.3. Let $\mathbf{z}_t$ be defined in (7.1). For $t \ge 2$, we have
$$\|\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\|_2 \le L\Big(\frac{\beta_1}{1-\beta_1}\Big)\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2.$$

We also need the following lemma to bound $\|\nabla f(\mathbf{x})\|_\infty$, $\|\hat{\mathbf{v}}_t\|_\infty$ and $\|\mathbf{m}_t\|_\infty$. Basically, it shows that these quantities can be bounded by $G_\infty$.

Lemma 7.4. Let $\hat{\mathbf{v}}_t$ and $\mathbf{m}_t$ be as defined in Algorithm 1. Then under Assumption 5.1, we have $\|\nabla f(\mathbf{x})\|_\infty \le G_\infty$, $\|\hat{\mathbf{v}}_t\|_\infty \le G_\infty^2$ and $\|\mathbf{m}_t\|_\infty \le G_\infty$.

Last, we need the following lemma that provides upper bounds on $\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t\|_2$ and $\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\|_2$. More specifically, it shows that we can bound $\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t\|_2$ and $\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\|_2$ in terms of $\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2$.

Lemma 7.5. Let $\beta_1, \beta_2$ be the weight parameters and $\alpha_t$, $t = 1, \ldots, T$, be the step sizes in Algorithm 1. Denote $\gamma = \beta_1/\beta_2^{1/2}$. Suppose that $\alpha_t = \alpha$ and $\gamma \le 1$. Then under Assumption 5.1, we have the following two results:
$$\sum_{t=1}^T \alpha_t^2\big\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t\big\|_2^2 \le \frac{T^{1/2}\alpha^2(1-\beta_1)}{2\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2,$$
and
$$\sum_{t=1}^T \alpha_t^2\big\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 \le \frac{T^{1/2}\alpha^2}{2\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2.$$
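As a quick numeric sanity check of Lemma 7.5 (our own toy experiment, not from the paper), one can accumulate both sides along a simulated run and confirm the inequalities:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 8, 500
beta1, beta2, eps, alpha = 0.9, 0.99, 1e-3, 0.05
gamma = beta1 / np.sqrt(beta2)                     # requires gamma <= 1

G = rng.normal(size=(T, d))                        # stand-in stochastic gradients
m, v, v_hat = np.zeros(d), np.zeros(d), np.zeros(d)
lhs_m = lhs_g = 0.0
for t in range(T):
    m = beta1 * m + (1 - beta1) * G[t]
    v = beta2 * v + (1 - beta2) * G[t] ** 2
    v_hat = np.maximum(v_hat, v)
    lhs_m += alpha**2 * np.sum(m**2 / (v_hat + eps))      # alpha_t^2 ||Vhat_t^{-1/2} m_t||_2^2
    lhs_g += alpha**2 * np.sum(G[t]**2 / (v_hat + eps))   # alpha_t^2 ||Vhat_t^{-1/2} g_t||_2^2

col_sum = np.linalg.norm(G, axis=0).sum()          # sum_i ||g_{1:T,i}||_2
denom = 2 * np.sqrt(eps) * np.sqrt(1 - beta2) * (1 - gamma)
assert lhs_m <= np.sqrt(T) * alpha**2 * (1 - beta1) / denom * col_sum
assert lhs_g <= np.sqrt(T) * alpha**2 / denom * col_sum
print("Lemma 7.5 bounds hold on a toy run")
```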

With all the lemmas in place, we are now ready to give the proof sketch of Theorem 5.3.


Proof Sketch of Theorem 5.3. Since $f$ is $L$-smooth, we have
$$f(\mathbf{z}_{t+1}) \le f(\mathbf{z}_t) + \nabla f(\mathbf{z}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) + \frac{L}{2}\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2^2$$
$$= f(\mathbf{z}_t) + \underbrace{\nabla f(\mathbf{x}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t)}_{I_1} + \underbrace{(\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t))^\top(\mathbf{z}_{t+1} - \mathbf{z}_t)}_{I_2} + \underbrace{\frac{L}{2}\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2^2}_{I_3}. \tag{7.2}$$

In the following, we bound I1, I2 and I3 separately.

Bounding term $I_1$: We can prove that when $t = 1$,
$$\nabla f(\mathbf{x}_1)^\top(\mathbf{z}_2 - \mathbf{z}_1) = -\nabla f(\mathbf{x}_1)^\top\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1. \tag{7.3}$$
For $t \ge 2$, by Lemma 7.1, we can prove the following result:
$$\nabla f(\mathbf{x}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) \le \frac{G_\infty^2}{1-\beta_1}\Big(\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big\|_{1,1} - \big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\big\|_{1,1}\Big) - \nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t. \tag{7.4}$$

Bounding term $I_2$: For $t \ge 1$, by Lemma 7.1 and Lemma 7.2, we can prove that
$$\big(\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\big)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) \le L\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 2L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2^2. \tag{7.5}$$

Bounding term $I_3$: For $t \ge 1$, by Lemma 7.1, we have
$$\frac{L}{2}\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2^2 \le L\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 2L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2^2. \tag{7.6}$$

Now we get back to (7.2). We provide upper bounds on (7.2) for $t = 1$ and $t > 1$ separately. For $t = 1$, substituting (7.3), (7.5) and (7.6) into (7.2), taking expectation and rearranging terms, we have
$$\mathbb{E}[f(\mathbf{z}_2) - f(\mathbf{z}_1)] \le \mathbb{E}\Big[d\alpha_1 G_\infty + 2L\big\|\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1\big\|_2^2\Big]. \tag{7.7}$$

For $t \ge 2$, substituting (7.4), (7.5) and (7.6) into (7.2), taking expectation and rearranging terms, we have
$$\mathbb{E}\bigg[f(\mathbf{z}_{t+1}) + \frac{G_\infty^2\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\big\|_{1,1}}{1-\beta_1}\bigg] - \mathbb{E}\bigg[f(\mathbf{z}_t) + \frac{G_\infty^2\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big\|_{1,1}}{1-\beta_1}\bigg]$$
$$\le \mathbb{E}\bigg[-\alpha_{t-1}\big\|\nabla f(\mathbf{x}_t)\big\|_2^2 G_\infty^{-1} + 2L\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{m}_{t-1}\big\|_2^2\bigg]. \tag{7.8}$$

We now telescope (7.8) for $t = 2$ to $T$, and add it to (7.7). Rearranging, we have
$$G_\infty^{-1}\sum_{t=2}^T \alpha_{t-1}\mathbb{E}\big\|\nabla f(\mathbf{x}_t)\big\|_2^2 \le \mathbb{E}\bigg[\Delta_f + \frac{G_\infty^2\alpha_1\epsilon^{-1/2}d}{1-\beta_1} + d\alpha_1 G_\infty\bigg] + 2L\sum_{t=1}^T \mathbb{E}\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2$$
$$+ 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\sum_{t=1}^T \mathbb{E}\Big[\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t\big\|_2^2\Big]. \tag{7.9}$$

By using Lemma 7.5, we can further bound $\sum_{t=1}^T \mathbb{E}\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\|_2^2$ and $\sum_{t=1}^T \mathbb{E}\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t\|_2^2$ in (7.9) in terms of $\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2$, which yields
$$\mathbb{E}\|\nabla f(\mathbf{x}_{\mathrm{out}})\|_2^2 \le \frac{2G_\infty\Delta_f}{T\alpha} + \frac{2}{T}\bigg(\frac{G_\infty^3\epsilon^{-1/2}d}{1-\beta_1} + dG_\infty^2\bigg) + \frac{2G_\infty L\alpha}{T^{1/2}\epsilon^{1/2}(1-\gamma)(1-\beta_2)^{1/2}}\mathbb{E}\bigg(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\bigg)\bigg(1 + 2(1-\beta_1)\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\bigg). \tag{7.10}$$

Finally, rearranging (7.10) and using the theorem condition $\|\mathbf{g}_{1:T,i}\|_2 \le G_\infty T^s$, we obtain
$$\mathbb{E}\|\nabla f(\mathbf{x}_{\mathrm{out}})\|_2^2 \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{\alpha M_3 d}{T^{1/2-s}},$$
where $\{M_i\}_{i=1}^3$ are defined in Theorem 5.3. This completes the proof.

We then give the proof sketch for the high probability result, i.e., Theorem 6.3.

Proof Sketch of Theorem 6.3. We follow the same procedure as in the proof of Theorem 5.3 up to (7.6). For $t = 1$, substituting (7.3), (7.5) and (7.6) into (7.2) and rearranging terms, we have
$$f(\mathbf{z}_2) - f(\mathbf{z}_1) \le d\alpha_1 G_\infty + 2L\big\|\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1\big\|_2^2. \tag{7.11}$$

For $t \ge 2$, substituting (7.4), (7.5) and (7.6) into (7.2) and rearranging terms, we have
$$f(\mathbf{z}_{t+1}) + \frac{G_\infty^2\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\big\|_{1,1}}{1-\beta_1} - \bigg(f(\mathbf{z}_t) + \frac{G_\infty^2\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big\|_{1,1}}{1-\beta_1}\bigg)$$
$$\le -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t + 2L\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{m}_{t-1}\big\|_2^2. \tag{7.12}$$

We now telescope (7.12) for $t = 2$ to $T$ and add it to (7.11). Rearranging, we have
$$\sum_{t=2}^T \alpha_{t-1}\nabla f(\mathbf{x}_t)^\top\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t \le \Delta_f + \frac{G_\infty^2\alpha_1\epsilon^{-1/2}d}{1-\beta_1} + d\alpha_1 G_\infty + 2L\sum_{t=1}^T \big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\sum_{t=1}^T \big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t\big\|_2^2. \tag{7.13}$$

Now consider the filtration $\mathcal{F}_t = \sigma(\xi_1, \ldots, \xi_t)$. Since $\mathbf{x}_t$ and $\hat{\mathbf{V}}_{t-1}^{-1/2}$ only depend on $\xi_1, \ldots, \xi_{t-1}$, by Assumption 6.1 and a martingale concentration argument, we obtain
$$\bigg|\sum_{t=2}^T \alpha_{t-1}\nabla f(\mathbf{x}_t)^\top\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t - \sum_{t=2}^T \alpha_{t-1}\nabla f(\mathbf{x}_t)^\top\hat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_t)\bigg| \le G_\infty^{-1}\sum_{t=2}^T \alpha_{t-1}^2\|\nabla f(\mathbf{x}_t)\|_2^2 + C\epsilon^{-1}\sigma^2 G_\infty\log(2/\delta). \tag{7.14}$$

By using Lemma 7.5 and substituting (7.14) into (7.13), we have
$$\sum_{t=2}^T \alpha_{t-1}\nabla f(\mathbf{x}_t)^\top\hat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_t) \le \Delta_f + \frac{G_\infty^2\alpha_1\epsilon^{-1/2}d}{1-\beta_1} + d\alpha_1 G_\infty + G_\infty^{-1}\sum_{t=2}^T \alpha_{t-1}^2\|\nabla f(\mathbf{x}_t)\|_2^2$$
$$+ \frac{LT^{1/2}\alpha^2}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2 + C\epsilon^{-1}\sigma^2 G_\infty\log(2/\delta) + \Big(\frac{\beta_1}{1-\beta_1}\Big)^2\frac{2LT^{1/2}\alpha^2(1-\beta_1)}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2.$$

Moreover, by Lemma 7.4, we have $\nabla f(\mathbf{x}_t)^\top\hat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_t) \ge G_\infty^{-1}\|\nabla f(\mathbf{x}_t)\|_2^2$, and therefore, by choosing $\alpha_t = \alpha \le 1/2$ and rearranging terms, we have
$$\frac{1}{T-1}\sum_{t=2}^T \|\nabla f(\mathbf{x}_t)\|_2^2 \le \frac{4G_\infty\Delta_f}{T\alpha} + \frac{4G_\infty^3\epsilon^{-1/2}}{1-\beta_1}\cdot\frac{d}{T} + 4G_\infty^2\cdot\frac{d}{T} + \frac{4G_\infty L\alpha}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)T^{1/2}}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2$$
$$+ \Big(\frac{\beta_1}{1-\beta_1}\Big)^2\frac{8G_\infty L\alpha(1-\beta_1)}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)T^{1/2}}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2 + \frac{C'G_\infty^2\epsilon^{-1}\sigma^2\log(2/\delta)}{T\alpha}, \tag{7.15}$$

where $C'$ is an absolute constant. Finally, rearranging (7.15) and using the theorem condition $\|\mathbf{g}_{1:T,i}\|_2 \le G_\infty T^s$, we have
$$\frac{1}{T-1}\sum_{t=2}^T \|\nabla f(\mathbf{x}_t)\|_2^2 \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{\alpha M_3 d}{T^{1/2-s}},$$
where $\{M_i\}_{i=1}^3$ are defined in Theorem 6.3. This completes the proof.

8 Conclusions

In this paper, we provided a fine-grained analysis of a general class of adaptive gradient methods,

and proved their convergence rates for smooth nonconvex optimization. Our results provide faster


convergence rates of AMSGrad and the corrected version of RMSProp as well as AdaGrad for

smooth nonconvex optimization compared with previous works. In addition, we also prove high

probability bounds on the convergence rates of AMSGrad and RMSProp as well as AdaGrad, which

have not been established before.

A Detailed Proof of the Main Theory

Here we provide the detailed proofs of the main theorems.

A.1 Proof of Theorem 5.3

Let x0 = x1. To prove Theorem 5.3, we need the following lemmas:

Lemma A.1 (Restatement of Lemma 7.4). Let $\hat{\mathbf{v}}_t$ and $\mathbf{m}_t$ be as defined in Algorithm 1. Then under Assumption 5.1, we have $\|\nabla f(\mathbf{x})\|_\infty \le G_\infty$, $\|\hat{\mathbf{v}}_t\|_\infty \le G_\infty^2$ and $\|\mathbf{m}_t\|_\infty \le G_\infty$.

Lemma A.2 (Generalized version of Lemma 7.5). Let $\beta_1, \beta_2, \beta_1', \beta_2'$ be the weight parameters such that
$$\mathbf{m}_t = \beta_1\mathbf{m}_{t-1} + (1-\beta_1')\mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2\mathbf{v}_{t-1} + (1-\beta_2')\mathbf{g}_t^2,$$
and let $\alpha_t$, $t = 1, \ldots, T$, be the step sizes. Denote $\gamma = \beta_1/\beta_2^{1/2}$. Suppose that $\alpha_t = \alpha$ and $\gamma \le 1$. Then under Assumption 5.1, we have the following two results:
$$\sum_{t=1}^T \alpha_t^2\big\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t\big\|_2^2 \le \frac{T^{1/2}\alpha^2(1-\beta_1')}{2\epsilon^{1/2}(1-\beta_2')^{1/2}(1-\gamma)}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2,$$
and
$$\sum_{t=1}^T \alpha_t^2\big\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 \le \frac{T^{1/2}\alpha^2}{2\epsilon^{1/2}(1-\beta_2')^{1/2}(1-\gamma)}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2.$$

Note that Lemma A.2 is general and applicable to various algorithms. Specifically, setting $\beta_1' = \beta_1$ and $\beta_2' = \beta_2$, we recover the case of Algorithm 1. Further setting $\beta_1 = 0$, we recover the case of Algorithm 2. Setting $\beta_1' = \beta_1 = 0$ and $\beta_2 = 1$, $\beta_2' = 0$, we recover the case of Algorithm 3.

To deal with the stochastic momentum $\mathbf{m}_t$ and the stochastic weight $\hat{\mathbf{V}}_t^{-1/2}$, following Yang et al. (2016), we define an auxiliary sequence $\{\mathbf{z}_t\}$ as follows: let $\mathbf{x}_0 = \mathbf{x}_1$, and for each $t \ge 1$,
$$\mathbf{z}_t = \mathbf{x}_t + \frac{\beta_1}{1-\beta_1}(\mathbf{x}_t - \mathbf{x}_{t-1}) = \frac{1}{1-\beta_1}\mathbf{x}_t - \frac{\beta_1}{1-\beta_1}\mathbf{x}_{t-1}. \tag{A.1}$$

Lemma A.3 shows that zt+1 − zt can be represented in two different ways.


Lemma A.3 (Restatement of Lemma 7.1). Let $\mathbf{z}_t$ be defined in (A.1). For $t \ge 2$, we have
$$\mathbf{z}_{t+1} - \mathbf{z}_t = \frac{\beta_1}{1-\beta_1}\Big[\mathbf{I} - \big(\alpha_t\hat{\mathbf{V}}_t^{-1/2}\big)\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big)^{-1}\Big](\mathbf{x}_{t-1} - \mathbf{x}_t) - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t, \tag{A.2}$$
and
$$\mathbf{z}_{t+1} - \mathbf{z}_t = \frac{\beta_1}{1-\beta_1}\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\big)\mathbf{m}_{t-1} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t. \tag{A.3}$$
For $t = 1$, we have
$$\mathbf{z}_2 - \mathbf{z}_1 = -\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1. \tag{A.4}$$

By Lemma A.3, we connect $\mathbf{z}_{t+1} - \mathbf{z}_t$ with $\mathbf{x}_{t+1} - \mathbf{x}_t$ and $\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t$. The following two lemmas give bounds on $\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2$ and $\|\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\|_2$, which play important roles in our proof.

Lemma A.4 (Restatement of Lemma 7.2). Let $\mathbf{z}_t$ be defined in (A.1). For $t \ge 2$, we have
$$\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2 \le \big\|\alpha\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2.$$

Lemma A.5 (Restatement of Lemma 7.3). Let $\mathbf{z}_t$ be defined in (A.1). For $t \ge 2$, we have
$$\|\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\|_2 \le L\Big(\frac{\beta_1}{1-\beta_1}\Big)\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2.$$

Now we are ready to prove Theorem 5.3.

Proof of Theorem 5.3. Since $f$ is $L$-smooth, we have
$$f(\mathbf{z}_{t+1}) \le f(\mathbf{z}_t) + \nabla f(\mathbf{z}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) + \frac{L}{2}\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2^2$$
$$= f(\mathbf{z}_t) + \underbrace{\nabla f(\mathbf{x}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t)}_{I_1} + \underbrace{(\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t))^\top(\mathbf{z}_{t+1} - \mathbf{z}_t)}_{I_2} + \underbrace{\frac{L}{2}\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2^2}_{I_3}. \tag{A.5}$$

In the following, we bound I1, I2 and I3 separately.

Bounding term $I_1$: When $t = 1$, we have
$$\nabla f(\mathbf{x}_1)^\top(\mathbf{z}_2 - \mathbf{z}_1) = -\nabla f(\mathbf{x}_1)^\top\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1. \tag{A.6}$$

For $t \ge 2$, we have
$$\nabla f(\mathbf{x}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) = \nabla f(\mathbf{x}_t)^\top\Big[\frac{\beta_1}{1-\beta_1}\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\big)\mathbf{m}_{t-1} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\Big]$$
$$= \frac{\beta_1}{1-\beta_1}\nabla f(\mathbf{x}_t)^\top\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\big)\mathbf{m}_{t-1} - \nabla f(\mathbf{x}_t)^\top\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t, \tag{A.7}$$
where the first equality holds due to (A.3) in Lemma A.3. For $\nabla f(\mathbf{x}_t)^\top(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} - \alpha_t\hat{\mathbf{V}}_t^{-1/2})\mathbf{m}_{t-1}$ in (A.7), we have
$$\nabla f(\mathbf{x}_t)^\top\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\big)\mathbf{m}_{t-1} \le \|\nabla f(\mathbf{x}_t)\|_\infty\cdot\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\big\|_{1,1}\cdot\|\mathbf{m}_{t-1}\|_\infty$$
$$\le G_\infty^2\Big[\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big\|_{1,1} - \big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\big\|_{1,1}\Big]. \tag{A.8}$$

The first inequality holds because for a positive diagonal matrix $\mathbf{A}$, we have $\mathbf{x}^\top\mathbf{A}\mathbf{y} \le \|\mathbf{x}\|_\infty\cdot\|\mathbf{A}\|_{1,1}\cdot\|\mathbf{y}\|_\infty$. The second inequality holds due to $\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} \succeq \alpha_t\hat{\mathbf{V}}_t^{-1/2} \succeq 0$. Next we bound $-\nabla f(\mathbf{x}_t)^\top\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t$. We have
$$-\nabla f(\mathbf{x}_t)^\top\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t = -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t - \nabla f(\mathbf{x}_t)^\top\big(\alpha_t\hat{\mathbf{V}}_t^{-1/2} - \alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big)\mathbf{g}_t$$
$$\le -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t + \|\nabla f(\mathbf{x}_t)\|_\infty\cdot\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2} - \alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big\|_{1,1}\cdot\|\mathbf{g}_t\|_\infty$$
$$\le -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t + G_\infty^2\Big(\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big\|_{1,1} - \big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\big\|_{1,1}\Big). \tag{A.9}$$

The first inequality holds because for a positive diagonal matrix $\mathbf{A}$, we have $\mathbf{x}^\top\mathbf{A}\mathbf{y} \le \|\mathbf{x}\|_\infty\cdot\|\mathbf{A}\|_{1,1}\cdot\|\mathbf{y}\|_\infty$. The second inequality holds due to $\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} \succeq \alpha_t\hat{\mathbf{V}}_t^{-1/2} \succeq 0$. Substituting (A.8) and (A.9) into (A.7), we have
$$\nabla f(\mathbf{x}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) \le -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t + \frac{G_\infty^2}{1-\beta_1}\Big(\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big\|_{1,1} - \big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\big\|_{1,1}\Big). \tag{A.10}$$

Bounding term $I_2$: For $t \ge 1$, we have
$$\big(\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\big)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) \le \big\|\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\big\|_2\cdot\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2$$
$$\le \Big(\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2\Big)\cdot\frac{\beta_1}{1-\beta_1}\cdot L\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2$$
$$= \frac{L\beta_1}{1-\beta_1}\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2\cdot\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2 + L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2^2$$
$$\le L\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 2L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2^2, \tag{A.11}$$
where the second inequality holds because of Lemma A.4 and Lemma A.5, and the last inequality holds due to Young's inequality.


Bounding term $I_3$: For $t \ge 1$, we have
$$\frac{L}{2}\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2^2 \le \frac{L}{2}\Big[\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2\Big]^2 \le L\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 2L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2^2. \tag{A.12}$$
The first inequality is obtained by applying Lemma A.4.

For $t = 1$, substituting (A.6), (A.11) and (A.12) into (A.5), taking expectation and rearranging terms, we have
$$\mathbb{E}[f(\mathbf{z}_2) - f(\mathbf{z}_1)] \le \mathbb{E}\bigg[-\nabla f(\mathbf{x}_1)^\top\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1 + 2L\big\|\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_1 - \mathbf{x}_0\|_2^2\bigg]$$
$$= \mathbb{E}\Big[-\nabla f(\mathbf{x}_1)^\top\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1 + 2L\big\|\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1\big\|_2^2\Big] \le \mathbb{E}\Big[d\alpha_1 G_\infty + 2L\big\|\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1\big\|_2^2\Big], \tag{A.13}$$
where the last inequality holds because
$$-\nabla f(\mathbf{x}_1)^\top\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1 \le d\cdot\|\nabla f(\mathbf{x}_1)\|_\infty\cdot\|\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1\|_\infty \le dG_\infty.$$

For $t \ge 2$, substituting (A.10), (A.11) and (A.12) into (A.5), taking expectation and rearranging terms, we have
$$\mathbb{E}\bigg[f(\mathbf{z}_{t+1}) + \frac{G_\infty^2\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\big\|_{1,1}}{1-\beta_1} - \bigg(f(\mathbf{z}_t) + \frac{G_\infty^2\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big\|_{1,1}}{1-\beta_1}\bigg)\bigg]$$
$$\le \mathbb{E}\bigg[-\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t + 2L\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2^2\bigg]$$
$$= \mathbb{E}\bigg[-\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_t) + 2L\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{m}_{t-1}\big\|_2^2\bigg]$$
$$\le \mathbb{E}\bigg[-\alpha_{t-1}\big\|\nabla f(\mathbf{x}_t)\big\|_2^2 G_\infty^{-1} + 2L\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{m}_{t-1}\big\|_2^2\bigg], \tag{A.14}$$
where the equality holds because $\mathbb{E}[\mathbf{g}_t] = \nabla f(\mathbf{x}_t)$ conditioned on $\mathbf{x}_t$ and $\hat{\mathbf{V}}_{t-1}^{-1/2}$, and the second inequality holds because of Lemma A.1. Telescoping (A.14) for $t = 2$ to $T$ and adding (A.13),

inequality holds because of Lemma A.1. Telescoping (A.14) for t = 2 to T and adding with (A.13),

we have

G−1∞

T∑t=2

αt−1E∥∥∇f(xt)

∥∥22

≤ E[f(z1) +

G2∞∥∥α1V

−1/21

∥∥1,1

1− β1+ dα1G∞ −

(f(zT+1) +

G2∞∥∥αT v−1/2T

∥∥1

1− β1

)]

18

Page 19: On the Convergence of Adaptive Gradient Methods for ...convergence rate, where ˙2 is an upper bound on the variance of the stochastic gradient. Motivated by the success of stochastic

+ 2L

T∑t=1

E∥∥αtV−1/2t gt

∥∥22

+ 4L

(β1

1− β1

)2 T∑t=2

E[∥∥αt−1V−1/2t−1 mt−1

∥∥22

]≤ E

[∆f +

G2∞α1ε

−1/2d

1− β1+ dα1G∞

]+ 2L

T∑t=1

E∥∥αtV−1/2t gt

∥∥22

+ 4L

(β1

1− β1

)2 T∑t=1

E[∥∥αtV−1/2t mt

∥∥22

]. (A.15)

By Lemma A.2, we have
$$\sum_{t=1}^T \alpha_t^2\mathbb{E}\Big[\big\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t\big\|_2^2\Big] \le \frac{T^{1/2}\alpha^2(1-\beta_1)}{2\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)}\mathbb{E}\bigg(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\bigg), \tag{A.16}$$
where $\gamma = \beta_1/\beta_2^{1/2}$. We also have
$$\sum_{t=1}^T \alpha_t^2\mathbb{E}\Big[\big\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2\Big] \le \frac{T^{1/2}\alpha^2}{2\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)}\mathbb{E}\bigg(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\bigg). \tag{A.17}$$

Substituting (A.16) and (A.17) into (A.15) and rearranging, we have
$$\mathbb{E}\|\nabla f(\mathbf{x}_{\mathrm{out}})\|_2^2 = \frac{1}{\sum_{t=2}^T \alpha_{t-1}}\sum_{t=2}^T \alpha_{t-1}\mathbb{E}\big\|\nabla f(\mathbf{x}_t)\big\|_2^2$$
$$\le \frac{G_\infty}{\sum_{t=2}^T \alpha_{t-1}}\mathbb{E}\bigg[\Delta_f + \frac{G_\infty^2\alpha_1\epsilon^{-1/2}d}{1-\beta_1} + d\alpha_1 G_\infty\bigg] + \frac{2LG_\infty}{\sum_{t=2}^T \alpha_{t-1}}\cdot\frac{T^{1/2}\alpha^2}{2\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)}\mathbb{E}\bigg(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\bigg)$$
$$+ \frac{4LG_\infty}{\sum_{t=2}^T \alpha_{t-1}}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\frac{T^{1/2}\alpha^2(1-\beta_1)}{2\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)}\mathbb{E}\bigg(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\bigg)$$
$$\le \frac{2G_\infty\Delta_f}{T\alpha} + \frac{2}{T}\bigg(\frac{G_\infty^3\epsilon^{-1/2}d}{1-\beta_1} + dG_\infty^2\bigg) + \frac{2G_\infty L\alpha}{T^{1/2}\epsilon^{1/2}(1-\gamma)(1-\beta_2)^{1/2}}\mathbb{E}\bigg(\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2\bigg)\bigg(1 + 2(1-\beta_1)\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\bigg), \tag{A.18}$$
where the second inequality holds because $\alpha_t = \alpha$. Rearranging (A.18), and noting the theorem condition $\|\mathbf{g}_{1:T,i}\|_2 \le G_\infty T^s$, we obtain
$$\mathbb{E}\|\nabla f(\mathbf{x}_{\mathrm{out}})\|_2^2 \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{\alpha M_3 d}{T^{1/2-s}},$$


where $\{M_i\}_{i=1}^3$ are defined in Theorem 5.3. This completes the proof.

A.2 Proof of Corollary 5.6

Proof of Corollary 5.6. Following the proof of Theorem 5.3 and setting $\beta_1' = \beta_1 = 0$ and $\beta_2' = \beta_2 = \beta$ in Lemma A.2, we get the conclusion.

A.3 Proof of Corollary 5.7

Proof of Corollary 5.7. Following the proof of Theorem 5.3 and setting $\beta_1' = \beta_1 = 0$, $\beta_2 = 1$ and $\beta_2' = 0$ in Lemma A.2, we get the conclusion.

A.4 Proof of Theorem 6.3

Proof of Theorem 6.3. Since $f$ is $L$-smooth, we have
$$f(\mathbf{z}_{t+1}) \le f(\mathbf{z}_t) + \nabla f(\mathbf{z}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) + \frac{L}{2}\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2^2$$
$$= f(\mathbf{z}_t) + \underbrace{\nabla f(\mathbf{x}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t)}_{I_1} + \underbrace{(\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t))^\top(\mathbf{z}_{t+1} - \mathbf{z}_t)}_{I_2} + \underbrace{\frac{L}{2}\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2^2}_{I_3}. \tag{A.19}$$

In the following, we bound I1, I2 and I3 separately.

Bounding term $I_1$: When $t = 1$, we have
$$\nabla f(\mathbf{x}_1)^\top(\mathbf{z}_2 - \mathbf{z}_1) = -\nabla f(\mathbf{x}_1)^\top\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1. \tag{A.20}$$

For $t \ge 2$, we have
$$\nabla f(\mathbf{x}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) = \nabla f(\mathbf{x}_t)^\top\Big[\frac{\beta_1}{1-\beta_1}\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\big)\mathbf{m}_{t-1} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\Big]$$
$$= \frac{\beta_1}{1-\beta_1}\nabla f(\mathbf{x}_t)^\top\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\big)\mathbf{m}_{t-1} - \nabla f(\mathbf{x}_t)^\top\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t, \tag{A.21}$$
where the first equality holds due to (A.3) in Lemma A.3. For $\nabla f(\mathbf{x}_t)^\top(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} - \alpha_t\hat{\mathbf{V}}_t^{-1/2})\mathbf{m}_{t-1}$ in (A.21), we have
$$\nabla f(\mathbf{x}_t)^\top\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\big)\mathbf{m}_{t-1} \le \|\nabla f(\mathbf{x}_t)\|_\infty\cdot\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\big\|_{1,1}\cdot\|\mathbf{m}_{t-1}\|_\infty$$
$$\le G_\infty^2\Big[\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big\|_{1,1} - \big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\big\|_{1,1}\Big]. \tag{A.22}$$

The first inequality holds because for a positive diagonal matrix $\mathbf{A}$, we have $\mathbf{x}^\top\mathbf{A}\mathbf{y} \le \|\mathbf{x}\|_\infty\cdot\|\mathbf{A}\|_{1,1}\cdot\|\mathbf{y}\|_\infty$. The second inequality holds due to $\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} \succeq \alpha_t\hat{\mathbf{V}}_t^{-1/2} \succeq 0$. Next we bound $-\nabla f(\mathbf{x}_t)^\top\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t$. We have
$$-\nabla f(\mathbf{x}_t)^\top\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t = -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t - \nabla f(\mathbf{x}_t)^\top\big(\alpha_t\hat{\mathbf{V}}_t^{-1/2} - \alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big)\mathbf{g}_t$$
$$\le -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t + \|\nabla f(\mathbf{x}_t)\|_\infty\cdot\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2} - \alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big\|_{1,1}\cdot\|\mathbf{g}_t\|_\infty$$
$$\le -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t + G_\infty^2\Big(\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big\|_{1,1} - \big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\big\|_{1,1}\Big). \tag{A.23}$$

The first inequality holds because for a positive diagonal matrix $\mathbf{A}$, we have $\mathbf{x}^\top\mathbf{A}\mathbf{y} \le \|\mathbf{x}\|_\infty\cdot\|\mathbf{A}\|_{1,1}\cdot\|\mathbf{y}\|_\infty$. The second inequality holds due to $\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} \succeq \alpha_t\hat{\mathbf{V}}_t^{-1/2} \succeq 0$. Substituting (A.22) and (A.23) into (A.21), we have
$$\nabla f(\mathbf{x}_t)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) \le -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t + \frac{G_\infty^2}{1-\beta_1}\Big(\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big\|_{1,1} - \big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\big\|_{1,1}\Big). \tag{A.24}$$

Bounding term $I_2$: For $t \ge 1$, we have
$$\big(\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\big)^\top(\mathbf{z}_{t+1} - \mathbf{z}_t) \le \big\|\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\big\|_2\cdot\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2$$
$$\le \Big(\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2\Big)\cdot\frac{\beta_1}{1-\beta_1}\cdot L\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2$$
$$= \frac{L\beta_1}{1-\beta_1}\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2\cdot\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2 + L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2^2$$
$$\le L\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 2L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2^2, \tag{A.25}$$
where the second inequality holds because of Lemma A.4 and Lemma A.5, and the last inequality holds due to Young's inequality.

Bounding term $I_3$: For $t \ge 1$, we have
$$\frac{L}{2}\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2^2 \le \frac{L}{2}\Big[\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2\Big]^2 \le L\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 2L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2^2. \tag{A.26}$$
The first inequality is obtained by applying Lemma A.4.

For $t = 1$, substituting (A.20), (A.25) and (A.26) into (A.19) and rearranging terms, we have
$$f(\mathbf{z}_2) - f(\mathbf{z}_1) \le -\nabla f(\mathbf{x}_1)^\top\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1 + 2L\big\|\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_1 - \mathbf{x}_0\|_2^2$$
$$= -\nabla f(\mathbf{x}_1)^\top\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1 + 2L\big\|\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1\big\|_2^2 \le d\alpha_1 G_\infty + 2L\big\|\alpha_1\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1\big\|_2^2, \tag{A.27}$$
where the last inequality holds because
$$-\nabla f(\mathbf{x}_1)^\top\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1 \le d\cdot\|\nabla f(\mathbf{x}_1)\|_\infty\cdot\|\hat{\mathbf{V}}_1^{-1/2}\mathbf{g}_1\|_\infty \le dG_\infty.$$

For $t \ge 2$, substituting (A.24), (A.25) and (A.26) into (A.19) and rearranging terms, we have
$$f(\mathbf{z}_{t+1}) + \frac{G_\infty^2\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\big\|_{1,1}}{1-\beta_1} - \bigg(f(\mathbf{z}_t) + \frac{G_\infty^2\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big\|_{1,1}}{1-\beta_1}\bigg)$$
$$\le -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t + 2L\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2^2$$
$$= -\nabla f(\mathbf{x}_t)^\top\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t + 2L\big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{m}_{t-1}\big\|_2^2. \tag{A.28}$$

Telescoping (A.28) for $t = 2$ to $T$ and adding (A.27), we have
$$\sum_{t=2}^T \alpha_{t-1}\nabla f(\mathbf{x}_t)^\top\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t \le f(\mathbf{z}_1) + \frac{G_\infty^2\big\|\alpha_1\hat{\mathbf{V}}_1^{-1/2}\big\|_{1,1}}{1-\beta_1} + d\alpha_1 G_\infty - \bigg(f(\mathbf{z}_{T+1}) + \frac{G_\infty^2\big\|\alpha_T\hat{\mathbf{V}}_T^{-1/2}\big\|_{1,1}}{1-\beta_1}\bigg)$$
$$+ 2L\sum_{t=1}^T \big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\sum_{t=2}^T \big\|\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{m}_{t-1}\big\|_2^2$$
$$\le \Delta_f + \frac{G_\infty^2\alpha_1\epsilon^{-1/2}d}{1-\beta_1} + d\alpha_1 G_\infty + 2L\sum_{t=1}^T \big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 + 4L\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\sum_{t=1}^T \big\|\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t\big\|_2^2. \tag{A.29}$$

By Lemma A.2, we have
$$\sum_{t=1}^T \alpha_t^2\big\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t\big\|_2^2 \le \frac{T^{1/2}\alpha^2(1-\beta_1)}{2\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2, \tag{A.30}$$
where $\gamma = \beta_1/\beta_2^{1/2}$. We also have
$$\sum_{t=1}^T \alpha_t^2\big\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 \le \frac{T^{1/2}\alpha^2}{2\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2. \tag{A.31}$$

Moreover, consider the filtration $\mathcal{F}_t = \sigma(\xi_1, \ldots, \xi_t)$, and note that $\mathbf{x}_t$ and $\hat{\mathbf{V}}_{t-1}^{-1/2}$ only depend on $\xi_1, \ldots, \xi_{t-1}$. For any $\tau, \lambda > 0$, by Assumption 6.1 with $\mathbf{v} = \lambda\cdot\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_t)$, we have
$$\mathbb{E}\Big[\exp\big(\lambda\alpha_{t-1}\nabla f(\mathbf{x}_t)^\top\hat{\mathbf{V}}_{t-1}^{-1/2}(\mathbf{g}_t - \nabla f(\mathbf{x}_t))\big)\,\Big|\,\mathcal{F}_{t-1}\Big] \le \exp\big(\sigma^2\alpha_{t-1}^2\lambda^2\big\|\hat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_t)\big\|_2^2/2\big).$$


Denote $Z_t = \alpha_{t-1}\nabla f(\mathbf{x}_t)^\top\hat{\mathbf{V}}_{t-1}^{-1/2}(\mathbf{g}_t - \nabla f(\mathbf{x}_t))$. Then we have
$$\mathbb{P}(Z_t \ge \tau\,|\,\mathcal{F}_{t-1}) = \mathbb{P}\big[\exp(\lambda Z_t) \ge \exp(\lambda\tau)\,|\,\mathcal{F}_{t-1}\big] = \mathbb{E}\big[\mathbb{1}\{\exp(-\lambda\tau + \lambda Z_t) \ge 1\}\,|\,\mathcal{F}_{t-1}\big]$$
$$\le \exp(-\lambda\tau)\cdot\mathbb{E}[\exp(\lambda Z_t)\,|\,\mathcal{F}_{t-1}] \le \exp\big(-\lambda\tau + \sigma^2\alpha_{t-1}^2\lambda^2\big\|\hat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_t)\big\|_2^2/2\big).$$
With exactly the same argument, we also have
$$\mathbb{P}(Z_t \le -\tau\,|\,\mathcal{F}_{t-1}) \le \exp\big(-\lambda\tau + \sigma^2\alpha_{t-1}^2\lambda^2\big\|\hat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_t)\big\|_2^2/2\big),$$
and therefore
$$\mathbb{P}(|Z_t| \ge \tau\,|\,\mathcal{F}_{t-1}) \le 2\exp\big(-\lambda\tau + \sigma^2\alpha_{t-1}^2\lambda^2\big\|\hat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_t)\big\|_2^2/2\big).$$
Choosing $\lambda = \big[\sigma^2\alpha_{t-1}^2\big\|\hat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_t)\big\|_2^2\big]^{-1}\tau$, we finally obtain
$$\mathbb{P}(|Z_t| \ge \tau\,|\,\mathcal{F}_{t-1}) \le 2\exp\big(-\tau^2/(2\sigma_t^2)\big) \tag{A.32}$$

for all $\tau > 0$, where $\sigma_t = \sigma\alpha_{t-1}\big\|\hat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_t)\big\|_2$. The tail bound (A.32) enables the application of Lemma 6 in Jin et al. (2019), which gives that with probability at least $1 - \delta$,
$$\bigg|\sum_{t=2}^T Z_t\bigg| \le \sum_{t=2}^T \sigma_t^2 + C\log(2/\delta),$$
where $C$ is an absolute constant. Plugging in the definitions of $Z_t$ and $\sigma_t$, we obtain
$$\bigg|\sum_{t=2}^T \alpha_{t-1}\nabla f(\mathbf{x}_t)^\top\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{g}_t - \sum_{t=2}^T \alpha_{t-1}\nabla f(\mathbf{x}_t)^\top\hat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_t)\bigg|$$
$$\le (\epsilon\sigma^{-2}G_\infty^{-1})\cdot\sum_{t=2}^T \sigma^2\alpha_{t-1}^2\big\|\hat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_t)\big\|_2^2 + C(\epsilon\sigma^{-2}G_\infty^{-1})^{-1}\log(2/\delta)$$
$$\le G_\infty^{-1}\sum_{t=2}^T \alpha_{t-1}^2\|\nabla f(\mathbf{x}_t)\|_2^2 + C\epsilon^{-1}\sigma^2 G_\infty\log(2/\delta), \tag{A.33}$$
where the second inequality is by the fact that the diagonal entries of $\hat{\mathbf{V}}_{t-1}$ are all lower bounded by $\epsilon$. Substituting (A.30), (A.31) and (A.33) into (A.29), we have

$$\sum_{t=2}^T \alpha_{t-1}\nabla f(\mathbf{x}_t)^\top\hat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_t) \le \Delta_f + \frac{G_\infty^2\alpha_1\epsilon^{-1/2}d}{1-\beta_1} + d\alpha_1 G_\infty + \frac{LT^{1/2}\alpha^2}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2$$
$$+ \Big(\frac{\beta_1}{1-\beta_1}\Big)^2\frac{2LT^{1/2}\alpha^2(1-\beta_1)}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2 + G_\infty^{-1}\sum_{t=2}^T \alpha_{t-1}^2\|\nabla f(\mathbf{x}_t)\|_2^2 + C\epsilon^{-1}\sigma^2 G_\infty\log(2/\delta).$$

Moreover, by Lemma A.1, we have $\nabla f(\mathbf{x}_t)^\top\hat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_t) \ge G_\infty^{-1}\|\nabla f(\mathbf{x}_t)\|_2^2$, and therefore, by choosing $\alpha_t = \alpha$ and rearranging terms, we have
$$G_\infty^{-1}\sum_{t=2}^T \alpha(1-\alpha)\|\nabla f(\mathbf{x}_t)\|_2^2 \le \Delta_f + \frac{G_\infty^2\alpha\epsilon^{-1/2}d}{1-\beta_1} + d\alpha G_\infty + \frac{LT^{1/2}\alpha^2}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2$$
$$+ \Big(\frac{\beta_1}{1-\beta_1}\Big)^2\frac{2LT^{1/2}\alpha^2(1-\beta_1)}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2 + C\epsilon^{-1}\sigma^2 G_\infty\log(2/\delta).$$

Therefore, when $\alpha \le 1/2$, we have
$$\frac{1}{T-1}\sum_{t=2}^T \|\nabla f(\mathbf{x}_t)\|_2^2 \le \frac{4G_\infty\Delta_f}{T\alpha} + \frac{4G_\infty^3\epsilon^{-1/2}}{1-\beta_1}\cdot\frac{d}{T} + 4G_\infty^2\cdot\frac{d}{T} + \frac{4G_\infty L\alpha}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)T^{1/2}}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2$$
$$+ \Big(\frac{\beta_1}{1-\beta_1}\Big)^2\frac{8G_\infty L\alpha(1-\beta_1)}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\gamma)T^{1/2}}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2 + \frac{C'G_\infty^2\epsilon^{-1}\sigma^2\log(2/\delta)}{T\alpha},$$
where $C'$ is an absolute constant.

Now, by the theorem condition $\|\mathbf{g}_{1:T,i}\|_2 \le G_\infty T^s$, we have
$$\frac{1}{T-1}\sum_{t=2}^T \|\nabla f(\mathbf{x}_t)\|_2^2 \le \frac{M_1}{T\alpha} + \frac{M_2 d}{T} + \frac{\alpha M_3 d}{T^{1/2-s}},$$
where
$$M_1 = 4G_\infty\Delta_f + C'G_\infty^2\epsilon^{-1}\sigma^2\log(2/\delta), \qquad M_2 = \frac{4G_\infty^3\epsilon^{-1/2}}{1-\beta_1} + 4G_\infty^2, \qquad M_3 = \frac{4LG_\infty^2}{\epsilon^{1/2}(1-\beta_2)^{1/2}(1-\beta_1/\beta_2^{1/2})}\bigg(1 + \frac{2\beta_1^2}{1-\beta_1}\bigg)$$
are as defined in Theorem 6.3. This completes the proof.

A.5 Proof of Corollary 6.5

Proof of Corollary 6.5. Following the proof of Theorem 6.3 and setting $\beta_1' = \beta_1 = 0$ and $\beta_2' = \beta_2 = \beta$ in Lemma A.2, we get the conclusion.


A.6 Proof of Corollary 6.6

Proof of Corollary 6.6. Following the proof of Theorem 6.3 and setting $\beta_1' = \beta_1 = 0$, $\beta_2 = 1$ and $\beta_2' = 0$ in Lemma A.2, we get the conclusion.

B Proof of Technical Lemmas

B.1 Proof of Lemma A.1

Proof of Lemma A.1. Since $f$ has $G_\infty$-bounded stochastic gradient, for any $\mathbf{x}$ and $\xi$, $\|\nabla f(\mathbf{x}; \xi)\|_\infty \le G_\infty$. Thus, we have
$$\|\nabla f(\mathbf{x})\|_\infty = \|\mathbb{E}_\xi\nabla f(\mathbf{x}; \xi)\|_\infty \le \mathbb{E}_\xi\|\nabla f(\mathbf{x}; \xi)\|_\infty \le G_\infty.$$
Next we bound $\|\mathbf{m}_t\|_\infty$. We have $\|\mathbf{m}_0\|_\infty = 0 \le G_\infty$. Suppose that $\|\mathbf{m}_t\|_\infty \le G_\infty$; then for $\mathbf{m}_{t+1}$, we have
$$\|\mathbf{m}_{t+1}\|_\infty = \|\beta_1\mathbf{m}_t + (1-\beta_1)\mathbf{g}_{t+1}\|_\infty \le \beta_1\|\mathbf{m}_t\|_\infty + (1-\beta_1)\|\mathbf{g}_{t+1}\|_\infty \le \beta_1 G_\infty + (1-\beta_1)G_\infty = G_\infty.$$
Thus, for any $t \ge 0$, we have $\|\mathbf{m}_t\|_\infty \le G_\infty$. Finally, we bound $\|\hat{\mathbf{v}}_t\|_\infty$. First, we have $\|\mathbf{v}_0\|_\infty = \|\hat{\mathbf{v}}_0\|_\infty = 0 \le G_\infty^2$. Suppose that $\|\mathbf{v}_t\|_\infty \le G_\infty^2$ and $\|\hat{\mathbf{v}}_t\|_\infty \le G_\infty^2$. Note that we have
$$\|\mathbf{v}_{t+1}\|_\infty = \|\beta_2\mathbf{v}_t + (1-\beta_2)\mathbf{g}_{t+1}^2\|_\infty \le \beta_2\|\mathbf{v}_t\|_\infty + (1-\beta_2)\|\mathbf{g}_{t+1}^2\|_\infty \le \beta_2 G_\infty^2 + (1-\beta_2)G_\infty^2 = G_\infty^2,$$
and by definition, we have $\|\hat{\mathbf{v}}_{t+1}\|_\infty = \max\{\|\hat{\mathbf{v}}_t\|_\infty, \|\mathbf{v}_{t+1}\|_\infty\} \le G_\infty^2$. Thus, for any $t \ge 0$, we have $\|\hat{\mathbf{v}}_t\|_\infty \le G_\infty^2$.

B.2 Proof of Lemma A.2

Proof. Recall that $v_{t,j}, m_{t,j}, g_{t,j}$ denote the $j$-th coordinates of $\mathbf{v}_t, \mathbf{m}_t$ and $\mathbf{g}_t$ respectively. We have
$$\alpha_t^2\big\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t\big\|_2^2 = \alpha_t^2\sum_{i=1}^d \frac{m_{t,i}^2}{\hat{v}_{t,i}^{1/2}}\cdot\frac{\hat{v}_{t,i}^{1/2}}{\hat{v}_{t,i} + \epsilon} \le \alpha_t^2\sum_{i=1}^d \frac{m_{t,i}^2}{\hat{v}_{t,i}^{1/2}}\cdot\frac{\hat{v}_{t,i}^{1/2}}{2\hat{v}_{t,i}^{1/2}\epsilon^{1/2}}$$
$$\le \frac{\alpha_t^2}{2\epsilon^{1/2}}\sum_{i=1}^d \frac{m_{t,i}^2}{v_{t,i}^{1/2}} = \frac{\alpha_t^2}{2\epsilon^{1/2}}\sum_{i=1}^d \frac{\big(\sum_{j=1}^t(1-\beta_1')\beta_1^{t-j}g_{j,i}\big)^2}{\big(\sum_{j=1}^t(1-\beta_2')\beta_2^{t-j}g_{j,i}^2\big)^{1/2}}, \tag{B.1}$$
where the first inequality holds since $a + b \ge 2\sqrt{ab}$ and the second inequality holds because $\hat{v}_{t,i} \ge v_{t,i}$.

Next we have
$$\frac{\alpha_t^2}{2\epsilon^{1/2}}\sum_{i=1}^d \frac{\big(\sum_{j=1}^t(1-\beta_1')\beta_1^{t-j}g_{j,i}\big)^2}{\big(\sum_{j=1}^t(1-\beta_2')\beta_2^{t-j}g_{j,i}^2\big)^{1/2}} \le \frac{\alpha_t^2(1-\beta_1')^2}{2\epsilon^{1/2}(1-\beta_2')^{1/2}}\sum_{i=1}^d \frac{\big(\sum_{j=1}^t\beta_1^{t-j}\big)\big(\sum_{j=1}^t\beta_1^{t-j}|g_{j,i}|^2\big)}{\big(\sum_{j=1}^t\beta_2^{t-j}g_{j,i}^2\big)^{1/2}}$$
$$\le \frac{\alpha_t^2(1-\beta_1')}{2\epsilon^{1/2}(1-\beta_2')^{1/2}}\sum_{i=1}^d \frac{\sum_{j=1}^t\beta_1^{t-j}|g_{j,i}|^2}{\big(\sum_{j=1}^t\beta_2^{t-j}g_{j,i}^2\big)^{1/2}}, \tag{B.2}$$
where the first inequality holds due to the Cauchy-Schwarz inequality, and the last inequality holds because $\sum_{j=1}^t\beta_1^{t-j} \le (1-\beta_1)^{-1}$. Note that

$$\sum_{i=1}^d \frac{\sum_{j=1}^t\beta_1^{t-j}|g_{j,i}|^2}{\big(\sum_{j=1}^t\beta_2^{t-j}g_{j,i}^2\big)^{1/2}} \le \sum_{i=1}^d\sum_{j=1}^t \frac{\beta_1^{t-j}|g_{j,i}|^2}{\big(\beta_2^{t-j}g_{j,i}^2\big)^{1/2}} = \sum_{i=1}^d\sum_{j=1}^t \gamma^{t-j}|g_{j,i}|, \tag{B.3}$$
where the equality holds due to the definition of $\gamma$. Substituting (B.2) and (B.3) into (B.1), we have
$$\alpha_t^2\big\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t\big\|_2^2 \le \frac{\alpha_t^2(1-\beta_1')}{2\epsilon^{1/2}(1-\beta_2')^{1/2}}\sum_{i=1}^d\sum_{j=1}^t \gamma^{t-j}|g_{j,i}|. \tag{B.4}$$

Telescoping (B.4) for $t = 1$ to $T$, we have
$$\sum_{t=1}^T \alpha_t^2\big\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t\big\|_2^2 \le \frac{\alpha^2(1-\beta_1')}{2\epsilon^{1/2}(1-\beta_2')^{1/2}}\sum_{t=1}^T\sum_{i=1}^d\sum_{j=1}^t \gamma^{t-j}|g_{j,i}| = \frac{\alpha^2(1-\beta_1')}{2\epsilon^{1/2}(1-\beta_2')^{1/2}}\sum_{i=1}^d\sum_{j=1}^T |g_{j,i}|\sum_{t=j}^T \gamma^{t-j}$$
$$\le \frac{\alpha^2(1-\beta_1')}{2\epsilon^{1/2}(1-\beta_2')^{1/2}(1-\gamma)}\sum_{i=1}^d\sum_{j=1}^T |g_{j,i}|. \tag{B.5}$$

Finally, we have
$$\sum_{i=1}^d\sum_{j=1}^T |g_{j,i}| \le \sum_{i=1}^d\bigg(\sum_{j=1}^T g_{j,i}^2\bigg)^{1/2}\cdot T^{1/2} = T^{1/2}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2, \tag{B.6}$$

where the inequality holds due to Hölder's inequality. Substituting (B.6) into (B.5), we have
$$\sum_{t=1}^T \alpha_t^2\big\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t\big\|_2^2 \le \frac{T^{1/2}\alpha^2(1-\beta_1')}{2\epsilon^{1/2}(1-\beta_2')^{1/2}(1-\gamma)}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2.$$
Specifically, taking $\beta_1' = \beta_1 = 0$, we have $\mathbf{m}_t = \mathbf{g}_t$, and then
$$\sum_{t=1}^T \alpha_t^2\big\|\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2^2 \le \frac{T^{1/2}\alpha^2}{2\epsilon^{1/2}(1-\beta_2')^{1/2}(1-\gamma)}\sum_{i=1}^d \|\mathbf{g}_{1:T,i}\|_2.$$

B.3 Proof of Lemma A.3

Proof. By definition, we have
$$\mathbf{z}_{t+1} = \mathbf{x}_{t+1} + \frac{\beta_1}{1-\beta_1}(\mathbf{x}_{t+1} - \mathbf{x}_t) = \frac{1}{1-\beta_1}\mathbf{x}_{t+1} - \frac{\beta_1}{1-\beta_1}\mathbf{x}_t.$$
Then we have
$$\mathbf{z}_{t+1} - \mathbf{z}_t = \frac{1}{1-\beta_1}(\mathbf{x}_{t+1} - \mathbf{x}_t) - \frac{\beta_1}{1-\beta_1}(\mathbf{x}_t - \mathbf{x}_{t-1}) = \frac{1}{1-\beta_1}\big(-\alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{m}_t\big) + \frac{\beta_1}{1-\beta_1}\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{m}_{t-1}.$$
The equalities above follow from the definitions. Then we have
$$\mathbf{z}_{t+1} - \mathbf{z}_t = \frac{-\alpha_t\hat{\mathbf{V}}_t^{-1/2}}{1-\beta_1}\big[\beta_1\mathbf{m}_{t-1} + (1-\beta_1)\mathbf{g}_t\big] + \frac{\beta_1}{1-\beta_1}\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{m}_{t-1}$$
$$= \frac{\beta_1}{1-\beta_1}\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\big)\mathbf{m}_{t-1} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t$$
$$= \frac{\beta_1}{1-\beta_1}\Big[\mathbf{I} - \big(\alpha_t\hat{\mathbf{V}}_t^{-1/2}\big)\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big)^{-1}\Big]\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{m}_{t-1} - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t$$
$$= \frac{\beta_1}{1-\beta_1}\Big[\mathbf{I} - \big(\alpha_t\hat{\mathbf{V}}_t^{-1/2}\big)\big(\alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\big)^{-1}\Big](\mathbf{x}_{t-1} - \mathbf{x}_t) - \alpha_t\hat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t.$$
The equalities above follow by combining like terms, and the last equality uses $\mathbf{x}_{t-1} - \mathbf{x}_t = \alpha_{t-1}\hat{\mathbf{V}}_{t-1}^{-1/2}\mathbf{m}_{t-1}$, which follows from the update rule.

B.4 Proof of Lemma A.4

Proof. By Lemma A.3, we have
\begin{align*}
\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2 &= \bigg\|\frac{\beta_1}{1-\beta_1}\Big[\mathbf{I} - \big(\alpha_t\widehat{\mathbf{V}}_t^{-1/2}\big)\big(\alpha_{t-1}\widehat{\mathbf{V}}_{t-1}^{-1/2}\big)^{-1}\Big](\mathbf{x}_{t-1} - \mathbf{x}_t) - \alpha_t\widehat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\bigg\|_2 \\
&\le \frac{\beta_1}{1-\beta_1}\Big\|\mathbf{I} - \big(\alpha_t\widehat{\mathbf{V}}_t^{-1/2}\big)\big(\alpha_{t-1}\widehat{\mathbf{V}}_{t-1}^{-1/2}\big)^{-1}\Big\|_{\infty,\infty} \cdot \|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2 + \big\|\alpha_t\widehat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2,
\end{align*}
where the inequality follows from the triangle inequality and the fact that $\beta_1/(1-\beta_1) \ge 0$. Since $\alpha_t\hat v_{t,j}^{-1/2} \le \alpha_{t-1}\hat v_{t-1,j}^{-1/2}$ for every coordinate $j$ when $p > 0$, we have
\begin{align*}
\Big\|\mathbf{I} - \big(\alpha_t\widehat{\mathbf{V}}_t^{-1/2}\big)\big(\alpha_{t-1}\widehat{\mathbf{V}}_{t-1}^{-1/2}\big)^{-1}\Big\|_{\infty,\infty} \le 1.
\end{align*}
With this fact, the term above can be bounded as
\begin{align*}
\|\mathbf{z}_{t+1} - \mathbf{z}_t\|_2 \le \big\|\alpha_t\widehat{\mathbf{V}}_t^{-1/2}\mathbf{g}_t\big\|_2 + \frac{\beta_1}{1-\beta_1}\|\mathbf{x}_{t-1} - \mathbf{x}_t\|_2.
\end{align*}
This completes the proof.
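The key step is the $(\infty,\infty)$-norm bound: for a diagonal matrix this norm is the largest absolute diagonal entry, and each entry $1 - \alpha_t\hat v_{t,j}^{-1/2}\big/\big(\alpha_{t-1}\hat v_{t-1,j}^{-1/2}\big)$ lies in $[0,1]$ whenever $\alpha_t$ is non-increasing and $\widehat{\mathbf{v}}_t$ is coordinatewise non-decreasing. A small sketch of this check (with an assumed schedule $\alpha_t = \alpha_0/\sqrt{t}$, for illustration only):

```python
import numpy as np

# Check of the matrix-norm bound used in Lemma A.4 (illustrative): for the
# diagonal matrix D = I - (a_t V_t^{-1/2})(a_{t-1} V_{t-1}^{-1/2})^{-1}, the
# (inf, inf)-operator norm is max_j |D_jj|, which stays <= 1 when a_t is
# non-increasing and v_hat_t is coordinatewise non-decreasing.
rng = np.random.default_rng(3)
d, T, alpha0 = 8, 200, 0.1

v_hat = rng.uniform(0.1, 1.0, size=d)
prev = alpha0 / np.sqrt(v_hat)                     # alpha_1 * v_hat_1^{-1/2}
for t in range(2, T + 1):
    v_hat = v_hat + rng.uniform(0.0, 0.1, size=d)  # v_hat_t >= v_hat_{t-1}
    cur = (alpha0 / np.sqrt(t)) / np.sqrt(v_hat)   # alpha_t * v_hat_t^{-1/2}
    norm_inf = np.max(np.abs(1.0 - cur / prev))    # ||I - D_t D_{t-1}^{-1}||_{inf,inf}
    assert norm_inf <= 1.0 + 1e-12
    prev = cur
print("Matrix-norm bound <= 1 at every step.")
```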

B.5 Proof of Lemma A.5

Proof. For the term $\|\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\|_2$, we have
\begin{align*}
\|\nabla f(\mathbf{z}_t) - \nabla f(\mathbf{x}_t)\|_2 \le L\|\mathbf{z}_t - \mathbf{x}_t\|_2 = L\bigg\|\frac{\beta_1}{1-\beta_1}(\mathbf{x}_t - \mathbf{x}_{t-1})\bigg\|_2 = L \cdot \frac{\beta_1}{1-\beta_1}\|\mathbf{x}_t - \mathbf{x}_{t-1}\|_2,
\end{align*}
where the inequality holds by the $L$-smoothness of $f$, the first equality holds by the definition of $\mathbf{z}_t$, and the last equality holds because the scalar $\beta_1/(1-\beta_1)$ is nonnegative. This completes the proof.
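To make the smoothness step concrete, here is a minimal example on a quadratic objective (our own illustration; the Hessian $\mathbf{A}$ is an assumed example, not from the paper): for $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^\top\mathbf{A}\mathbf{x}$, the gradient is $\mathbf{A}\mathbf{x}$ and one may take $L = \|\mathbf{A}\|_2$.

```python
import numpy as np

# Lemma A.5 on a concrete L-smooth function (illustrative): f(x) = 0.5 x^T A x
# has gradient A x and smoothness constant L = ||A||_2, so the gradient gap
# between z_t and x_t is at most L * beta1 / (1 - beta1) * ||x_t - x_{t-1}||_2.
rng = np.random.default_rng(4)
d, beta1 = 6, 0.9
B = rng.normal(size=(d, d))
A = B @ B.T                                   # symmetric PSD Hessian
L = np.linalg.norm(A, 2)                      # spectral norm = smoothness constant

x_prev, x_t = rng.normal(size=d), rng.normal(size=d)
z_t = x_t + beta1 / (1 - beta1) * (x_t - x_prev)
lhs = np.linalg.norm(A @ z_t - A @ x_t)       # ||grad f(z_t) - grad f(x_t)||_2
rhs = L * beta1 / (1 - beta1) * np.linalg.norm(x_t - x_prev)
assert lhs <= rhs + 1e-9
print(f"{lhs:.4f} <= {rhs:.4f}")
```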
