ELE 522: Large-Scale Optimization for Data Science Accelerated gradient methods Yuxin Chen Princeton University, Fall 2019


Page 1:

ELE 522: Large-Scale Optimization for Data Science

Accelerated gradient methods

Yuxin Chen

Princeton University, Fall 2019

Page 2:

Outline

• Heavy-ball methods

• Nesterov’s accelerated gradient methods

• Accelerated proximal gradient methods (FISTA)

• Convergence analysis

• Lower bounds

Page 3:

(Proximal) gradient methods

Iteration complexities of (proximal) gradient methods:

• strongly convex and smooth problems: O(κ log(1/ε))

• convex and smooth problems: O(1/ε)

Can one still hope to further accelerate convergence?

Accelerated GD 7-3

Page 4:

Issues and possible solutions

Issues:

• GD focuses on improving the cost per iteration, which might sometimes be too “short-sighted”
• GD might sometimes zigzag or experience abrupt changes

Solutions:
• exploit information from the history (i.e., past iterates)
• add buffers (like momentum) to yield a smoother trajectory

Accelerated GD 7-4

Page 5:

Heavy-ball methods — Polyak ’64

Page 6:

Heavy-ball method

B. Polyak

minimize_{x∈R^n} f(x)

  x^{t+1} = x^t − η_t ∇f(x^t) + θ_t (x^t − x^{t−1}),  where θ_t (x^t − x^{t−1}) is the momentum term

• add inertia to the “ball” (i.e., include a momentum term) to mitigate zigzagging

Accelerated GD 7-6
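To make the update concrete, here is a minimal Python sketch of the heavy-ball iteration on a toy quadratic. The quadratic, the stepsize eta, and the momentum theta are illustrative choices, not prescribed by this slide; Theorem 7.1 later gives principled values for quadratic problems.

```python
import numpy as np

def heavy_ball(grad, x0, eta, theta, num_iters=200):
    """Heavy-ball iteration: x_{t+1} = x_t - eta * grad(x_t) + theta * (x_t - x_{t-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(num_iters):
        x_next = x - eta * grad(x) + theta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# illustrative quadratic f(x) = 0.5 * (x - x_star)^T Q (x - x_star)
Q = np.diag([1.0, 10.0])
x_star = np.array([1.0, -2.0])
grad = lambda x: Q @ (x - x_star)

x_hat = heavy_ball(grad, np.zeros(2), eta=0.1, theta=0.3)
print(np.linalg.norm(x_hat - x_star))   # small: the iterates approach x_star
```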

Page 7:

Heavy-ball method

[Figure: iterate trajectories of gradient descent vs. the heavy-ball method]

Accelerated GD 7-7

Page 8:

State-space models

  minimize_x (1/2)(x − x*)^⊤ Q (x − x*)

where Q ≻ 0 has condition number κ

One can understand heavy-ball methods through dynamical systems

Accelerated GD 7-8

Page 9:

State-space models

Consider the following dynamical system (here [A, B; C, D] denotes a 2×2 block matrix and [u; v] a stacked vector)

  [x^{t+1}; x^t] = [(1 + θ_t)I, −θ_t I; I, 0] [x^t; x^{t−1}] − [η_t ∇f(x^t); 0]

or equivalently,

  [x^{t+1} − x*; x^t − x*]   (the “state”)
    = [(1 + θ_t)I, −θ_t I; I, 0] [x^t − x*; x^{t−1} − x*] − [η_t ∇f(x^t); 0]
    = [(1 + θ_t)I − η_t Q, −θ_t I; I, 0] [x^t − x*; x^{t−1} − x*]   (the matrix in the last line is the “system matrix”)

Accelerated GD 7-9

Page 10:

System matrix

  [x^{t+1} − x*; x^t − x*] = [(1 + θ_t)I − η_t Q, −θ_t I; I, 0] [x^t − x*; x^{t−1} − x*]    (7.1)

where H_t := [(1 + θ_t)I − η_t Q, −θ_t I; I, 0] is the system matrix

implication: convergence of heavy-ball methods depends on the spectrum of the system matrix H_t

key idea: find appropriate stepsizes η_t and momentum coefficients θ_t to control the spectrum of H_t

Accelerated GD 7-10

Page 11:

Convergence for quadratic problems

Theorem 7.1 (Convergence of heavy-ball methods for quadratic functions)

Suppose f is an L-smooth and µ-strongly convex quadratic function. Set η_t ≡ 4/(√L + √µ)², θ_t ≡ max{|1 − √(η_t L)|, |1 − √(η_t µ)|}², and κ = L/µ. Then

  ‖[x^{t+1} − x*; x^t − x*]‖₂ ≲ ((√κ − 1)/(√κ + 1))^t ‖[x^1 − x*; x^0 − x*]‖₂

• iteration complexity: O(√κ log(1/ε))
• significant improvement over GD: O(√κ log(1/ε)) vs. O(κ log(1/ε))
• relies on knowledge of both L and µ

Accelerated GD 7-11

Page 12:

Proof of Theorem 7.1

In view of (7.1), it suffices to control the spectrum of H_t (which is time-invariant). Let λ_i be the ith eigenvalue of Q and set Λ := diag(λ_1, …, λ_n). Then the spectral radius ρ(·) of H_t obeys

  ρ(H_t) = ρ([(1 + θ_t)I − η_t Λ, −θ_t I; I, 0]) ≤ max_{1≤i≤n} ρ([1 + θ_t − η_t λ_i, −θ_t; 1, 0])

To finish the proof, it suffices to show

  max_i ρ([1 + θ_t − η_t λ_i, −θ_t; 1, 0]) ≤ (√κ − 1)/(√κ + 1)    (7.2)

Accelerated GD 7-12

Page 13:

Proof of Theorem 7.1

To show (7.2), note that the two eigenvalues of [1 + θ_t − η_t λ_i, −θ_t; 1, 0] are the roots of

  z² − (1 + θ_t − η_t λ_i) z + θ_t = 0    (7.3)

If (1 + θ_t − η_t λ_i)² ≤ 4θ_t, then the two roots have the same magnitude √θ_t (they are either a complex-conjugate pair or a repeated real root). In addition, one can easily check that (1 + θ_t − η_t λ_i)² ≤ 4θ_t is satisfied if

  θ_t ∈ [(1 − √(η_t λ_i))², (1 + √(η_t λ_i))²],    (7.4)

which holds if one picks θ_t = max{(1 − √(η_t L))², (1 − √(η_t µ))²}

Accelerated GD 7-13

Page 14:

Proof of Theorem 7.1

With this choice of θ_t, we have (from (7.3) and the fact that the two eigenvalues have identical magnitudes)

  ρ(H_t) ≤ √θ_t

Finally, setting η_t = 4/(√L + √µ)² ensures 1 − √(η_t L) = −(1 − √(η_t µ)), which yields

  θ_t = max{(1 − 2√L/(√L + √µ))², (1 − 2√µ/(√L + √µ))²} = ((√κ − 1)/(√κ + 1))²

This in turn establishes

  ρ(H_t) ≤ (√κ − 1)/(√κ + 1)

Accelerated GD 7-14
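As a sanity check on this argument (not part of the slides), one can evaluate the spectral radius of the 2×2 blocks numerically for the stepsize and momentum of Theorem 7.1; the values of L and µ below are an arbitrary illustrative choice.

```python
import numpy as np

L, mu = 100.0, 1.0
kappa = L / mu
eta = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
theta = max((1 - np.sqrt(eta * L)) ** 2, (1 - np.sqrt(eta * mu)) ** 2)

def block_radius(lam):
    """Spectral radius of the 2x2 block [[1 + theta - eta*lam, -theta], [1, 0]]."""
    H = np.array([[1 + theta - eta * lam, -theta],
                  [1.0, 0.0]])
    return max(abs(np.linalg.eigvals(H)))

lams = np.linspace(mu, L, 1000)                     # eigenvalues of Q lie in [mu, L]
bound = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
worst = max(block_radius(lam) for lam in lams)
print(worst, bound)                                  # the worst-case radius matches the bound
```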

Page 15:

Nesterov’s accelerated gradient methods

Page 16:

Convex case

minimize_{x∈R^n} f(x)

For a positive definite quadratic function f, including momentum terms allows one to improve the iteration complexity from O(κ log(1/ε)) to O(√κ log(1/ε))

Can we obtain improvement for more general convex cases as well?

Accelerated GD 7-16

Page 17:

Nesterov’s idea

Y. Nesterov

— Nesterov ’83

  x^{t+1} = y^t − η_t ∇f(y^t)
  y^{t+1} = x^{t+1} + (t/(t + 3)) (x^{t+1} − x^t)

• alternates between gradient updates and proper extrapolation
• each iteration takes nearly the same cost as GD
• not a descent method (i.e., we may not have f(x^{t+1}) ≤ f(x^t))
• one of the most beautiful and mysterious results in optimization ...

Accelerated GD 7-17
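A minimal sketch of this scheme in Python, applied to an illustrative smooth least-squares objective (the problem instance and iteration count are assumptions, not from the slide):

```python
import numpy as np

def nesterov_agd(grad, x0, eta, num_iters=200):
    """Nesterov '83: x_{t+1} = y_t - eta*grad(y_t); y_{t+1} = x_{t+1} + t/(t+3)*(x_{t+1} - x_t)."""
    x, y = x0.copy(), x0.copy()
    for t in range(num_iters):
        x_next = y - eta * grad(y)
        y = x_next + (t / (t + 3)) * (x_next - x)
        x = x_next
    return x

# illustrative convex, L-smooth objective: f(x) = 0.5 * ||Ax - b||^2
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
grad = lambda x: A.T @ (A @ x - b)
eta = 1.0 / np.linalg.norm(A, 2) ** 2        # 1/L with L = ||A||_2^2
x_hat = nesterov_agd(grad, np.zeros(20), eta)
```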

Page 18:

Convergence of Nesterov’s accelerated gradient method

Suppose f is convex and L-smooth. If η_t ≡ η = 1/L, then

  f(x^t) − f^opt ≤ 2L‖x^0 − x*‖₂² / (t + 1)²

• iteration complexity: O(1/√ε)
• much faster than gradient methods
• we’ll provide a proof for the (more general) proximal version later

Accelerated GD 7-18

Page 19:

Interpretation using differential equations

Nesterov’s momentum coefficient t/(t + 3) ≈ 1 − 3/t is particularly mysterious

Accelerated GD 7-19

Page 20:

Interpretation using differential equations

[Figure: the ODE interpreted as a damped oscillator, with a time-varying damping force −(3/τ)Ẋ and a spring (potential) force −∇f(X)]

To develop insight into why Nesterov’s method works so well, it’s helpful to look at its continuous-time limit (η_t → 0), which is given by the second-order ordinary differential equation (ODE)

  Ẍ(τ) + (3/τ) Ẋ(τ) + ∇f(X(τ)) = 0

where (3/τ) is the damping coefficient and ∇f(X(τ)) is the potential term

— Su, Boyd, Candes ’14

Accelerated GD 7-20

Page 21:

Heuristic derivation of the ODE

To begin with, Nesterov’s update rule is equivalent to

  (x^{t+1} − x^t)/√η = ((t − 1)/(t + 2)) · (x^t − x^{t−1})/√η − √η ∇f(y^t)    (7.5)

Let τ = t√η. Set X(τ) ≈ x^{τ/√η} = x^t and X(τ + √η) ≈ x^{t+1}. Then the Taylor expansion gives

  (x^{t+1} − x^t)/√η ≈ Ẋ(τ) + (1/2) Ẍ(τ) √η
  (x^t − x^{t−1})/√η ≈ Ẋ(τ) − (1/2) Ẍ(τ) √η

which combined with (7.5) yields

  Ẋ(τ) + (1/2) Ẍ(τ) √η ≈ (1 − 3√η/τ)(Ẋ(τ) − (1/2) Ẍ(τ) √η) − √η ∇f(X(τ))

  ⟹ Ẍ(τ) + (3/τ) Ẋ(τ) + ∇f(X(τ)) ≈ 0

Accelerated GD 7-21

Page 22:

Convergence rate of ODE

  Ẍ + (3/τ) Ẋ + ∇f(X) = 0    (7.6)

Standard ODE theory reveals that

  f(X(τ)) − f^opt ≤ O(1/τ²)    (7.7)

which somehow explains Nesterov’s O(1/t²) convergence

Accelerated GD 7-22
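The claim can be checked numerically by integrating the ODE; the quadratic objective, the initial condition, and the semi-implicit Euler discretization below are illustrative assumptions, not from the slides.

```python
import numpy as np

# Semi-implicit Euler integration of  X'' + (3/tau) X' + grad f(X) = 0
# for an illustrative quadratic f(X) = 0.5 * X^T Q X  (so f^opt = 0).
Q = np.diag([1.0, 5.0, 25.0])
grad = lambda X: Q @ X
f = lambda X: 0.5 * X @ Q @ X

dt = 1e-3
X = np.array([1.0, 1.0, 1.0])      # initial position
V = np.zeros(3)                    # initial velocity X'(tau_0) = 0
tau = dt                           # start just after 0 to avoid the 3/tau singularity
for _ in range(50_000):
    V += dt * (-(3.0 / tau) * V - grad(X))   # acceleration given by the ODE
    X += dt * V
    tau += dt
print(f(X), 1.0 / tau ** 2)        # f(X(tau)) sits below the O(1/tau^2) envelope
```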

Page 23:

Proof of (7.7)

Define the Lyapunov function (energy function)

  E(τ) := τ² (f(X) − f^opt) + 2 ‖X + (τ/2) Ẋ − X*‖₂²

This obeys

  Ė = 2τ (f(X) − f^opt) + τ² ⟨∇f(X), Ẋ⟩ + 4 ⟨X + (τ/2) Ẋ − X*, (3/2) Ẋ + (τ/2) Ẍ⟩
    (i)= 2τ (f(X) − f^opt) − 2τ ⟨X − X*, ∇f(X)⟩
    ≤ 0    (by convexity)

where (i) follows by replacing τẌ + 3Ẋ with −τ∇f(X)

This means E is non-increasing in τ, and hence

  f(X(τ)) − f^opt ≤ E(τ)/τ²    (defn of E)
                  ≤ E(0)/τ² = O(1/τ²)

Accelerated GD 7-23

Page 24:

Magic number 3

  Ẍ + (3/τ) Ẋ + ∇f(X) = 0

• 3 is the smallest constant that guarantees O(1/τ²) convergence, and can be replaced by any other α ≥ 3
• in some sense, 3 minimizes the pre-constant in the convergence bound O(1/τ²) (see Su, Boyd, Candes ’14)

Accelerated GD 7-24

Page 25:

Numerical example (taken from UCLA EE236C)

  minimize_x log(∑_{i=1}^m exp(a_i^⊤ x + b_i))

with randomly generated problems and m = 2000, n = 1000

[Figure (UCLA EE236C): two randomly generated instances with m = 2000, n = 1000, the same fixed stepsize for the gradient method and FISTA; plots of (f(x^(k)) − f*)/f* vs. k]


Accelerated GD 7-25
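A small-scale sketch of this experiment (much smaller m and n than on the slide, a conservative fixed stepsize, and plain Nesterov acceleration standing in for FISTA since h = 0 here):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 100                               # the slide uses m = 2000, n = 1000
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

f = lambda x: np.log(np.sum(np.exp(A @ x + b)))

def grad(x):
    p = np.exp(A @ x + b)
    return A.T @ (p / p.sum())                # softmax-weighted combination of the rows of A

eta = 1.0 / np.linalg.norm(A, 2) ** 2         # conservative 1/L (the Hessian is at most A^T A)

x_gd = np.zeros(n)
x, y = np.zeros(n), np.zeros(n)
for t in range(300):
    x_gd = x_gd - eta * grad(x_gd)            # plain gradient descent
    x_next = y - eta * grad(y)                # accelerated update
    y = x_next + (t / (t + 3)) * (x_next - x)
    x = x_next
print(f(x_gd), f(x))                          # the accelerated iterate is typically lower
```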

Page 26:

Extension to composite models

  minimize_x F(x) := f(x) + h(x)
  subject to x ∈ R^n

• f: convex and smooth
• h: convex (may not be differentiable)

let F^opt := min_x F(x) be the optimal cost

Accelerated GD 7-26

Page 27:

FISTA (Beck & Teboulle ’09)

Fast iterative shrinkage-thresholding algorithm

  x^{t+1} = prox_{η_t h}(y^t − η_t ∇f(y^t))
  y^{t+1} = x^{t+1} + ((θ_t − 1)/θ_{t+1}) (x^{t+1} − x^t)

where y^0 = x^0, θ_0 = 1, and θ_{t+1} = (1 + √(1 + 4θ_t²))/2

• adopts the momentum coefficients originally proposed by Nesterov ’83

Accelerated GD 7-27
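A minimal sketch of FISTA; the ℓ1 regularizer (with its soft-thresholding prox) and the LASSO instance below are illustrative assumptions, not part of the slide.

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista(grad_f, prox_h, x0, eta, num_iters=500):
    """FISTA with Nesterov's theta_t sequence (y0 = x0, theta_0 = 1)."""
    x, y, theta = x0.copy(), x0.copy(), 1.0
    for _ in range(num_iters):
        x_next = prox_h(y - eta * grad_f(y), eta)
        theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
        y = x_next + ((theta - 1.0) / theta_next) * (x_next - x)
        x, theta = x_next, theta_next
    return x

# illustrative LASSO instance: f(x) = 0.5 * ||Ax - b||^2, h(x) = lam * ||x||_1
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((80, 200)), rng.standard_normal(80), 0.1
grad_f = lambda x: A.T @ (A @ x - b)
prox_h = lambda v, eta: soft_threshold(v, eta * lam)   # prox of eta * h
x_hat = fista(grad_f, prox_h, np.zeros(200), eta=1.0 / np.linalg.norm(A, 2) ** 2)
```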

Page 28:

Momentum coefficient

[Figure: θ_t (left) and the momentum coefficient (θ_t − 1)/θ_{t+1} (right) as functions of t]

  θ_{t+1} = (1 + √(1 + 4θ_t²))/2  with θ_0 = 1

coefficient (θ_t − 1)/θ_{t+1} = 1 − 3/t + o(1/t)    (homework)

• asymptotically equivalent to t/(t + 3)

Fact 7.2

For all t ≥ 1, one has θ_t ≥ (t + 2)/2    (homework)

Accelerated GD 7-28
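A quick numerical check (not on the slide) of the recursion and the two facts stated above:

```python
import numpy as np

thetas = [1.0]                                      # theta_0 = 1
for _ in range(40):
    thetas.append((1.0 + np.sqrt(1.0 + 4.0 * thetas[-1] ** 2)) / 2.0)

for t in range(1, 40):
    assert thetas[t] >= (t + 2) / 2.0               # Fact 7.2
    coeff = (thetas[t] - 1.0) / thetas[t + 1]       # FISTA momentum coefficient
    print(t, round(coeff, 4), round(t / (t + 3.0), 4))   # asymptotically matches t/(t+3)
```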


Page 30:

Convergence analysis

Page 31:

Convergence for convex problems

Theorem 7.3 (Convergence of accelerated proximal gradient methods for convex problems)

Suppose f is convex and L-smooth. If η_t ≡ 1/L, then

  F(x^t) − F^opt ≤ 2L‖x^0 − x*‖₂² / (t + 1)²

• improved iteration complexity (i.e., O(1/√ε)) compared with the proximal gradient method (i.e., O(1/ε))
• fast if the prox can be efficiently implemented

Accelerated GD 7-30

Page 32:

Recap: the fundamental inequality for the proximal method

Recall the following fundamental inequality shown in the last lecture:

Lemma 7.4

Let y⁺ = prox_{(1/L)h}(y − (1/L)∇f(y)). Then

  F(y⁺) − F(x) ≤ (L/2)‖x − y‖₂² − (L/2)‖x − y⁺‖₂²

Accelerated GD 7-31

Page 33:

Proof of Theorem 7.3

1. build a discrete-time version of the “Lyapunov function”

2. magic happens!
  ◦ the “Lyapunov function” is non-increasing when Nesterov’s momentum coefficients are adopted

Accelerated GD 7-32

Page 34:

Proof of Theorem 7.3

Key lemma: monotonicity of a certain “Lyapunov function”

Lemma 7.5

Let u^t = θ_{t−1} x^t − (x* + (θ_{t−1} − 1) x^{t−1}), or equivalently u^t = θ_{t−1}(x^t − x*) − (θ_{t−1} − 1)(x^{t−1} − x*). Then

  ‖u^{t+1}‖₂² + (2/L) θ_t² (F(x^{t+1}) − F^opt) ≤ ‖u^t‖₂² + (2/L) θ_{t−1}² (F(x^t) − F^opt)

• quite similar to the Lyapunov function 2‖X + (τ/2)Ẋ − X*‖₂² + τ²(f(X) − f^opt) discussed before (think of θ_t ≈ t/2)

Accelerated GD 7-33

Page 35:

Proof of Theorem 7.3

With Lemma 7.5 in place, one has

  (2/L) θ_{t−1}² (F(x^t) − F^opt) ≤ ‖u^1‖₂² + (2/L) θ_0² (F(x^1) − F^opt)
                                   = ‖x^1 − x*‖₂² + (2/L)(F(x^1) − F^opt)

To bound the right-hand side of this inequality, we use Lemma 7.4 and y^0 = x^0 to get

  (2/L)(F(x^1) − F^opt) ≤ ‖y^0 − x*‖₂² − ‖x^1 − x*‖₂² = ‖x^0 − x*‖₂² − ‖x^1 − x*‖₂²
  ⟺ ‖x^1 − x*‖₂² + (2/L)(F(x^1) − F^opt) ≤ ‖x^0 − x*‖₂²

As a result,

  (2/L) θ_{t−1}² (F(x^t) − F^opt) ≤ ‖x^1 − x*‖₂² + (2/L)(F(x^1) − F^opt) ≤ ‖x^0 − x*‖₂²
  ⟹ F(x^t) − F^opt ≤ L‖x^0 − x*‖₂² / (2θ_{t−1}²) ≤ 2L‖x^0 − x*‖₂² / (t + 1)²    (by Fact 7.2)

Accelerated GD 7-34

Page 36:

Proof of Lemma 7.5

Take x = (1/θ_t) x* + (1 − 1/θ_t) x^t and y = y^t in Lemma 7.4 to get

  F(x^{t+1}) − F(θ_t^{-1} x* + (1 − θ_t^{-1}) x^t)    (7.8)
    ≤ (L/2)‖θ_t^{-1} x* + (1 − θ_t^{-1}) x^t − y^t‖₂² − (L/2)‖θ_t^{-1} x* + (1 − θ_t^{-1}) x^t − x^{t+1}‖₂²
    = (L/(2θ_t²))‖x* + (θ_t − 1) x^t − θ_t y^t‖₂² − (L/(2θ_t²))‖x* + (θ_t − 1) x^t − θ_t x^{t+1}‖₂²
      (the argument of the second norm equals −u^{t+1})
    (i)= (L/(2θ_t²)) (‖u^t‖₂² − ‖u^{t+1}‖₂²),    (7.9)

where (i) follows from the definition of u^t and y^t = x^t + ((θ_{t−1} − 1)/θ_t)(x^t − x^{t−1})

Accelerated GD 7-35

Page 37:

Proof of Lemma 7.5 (cont.)

We will also lower bound (7.8). By convexity of F,

  F(θ_t^{-1} x* + (1 − θ_t^{-1}) x^t) ≤ θ_t^{-1} F(x*) + (1 − θ_t^{-1}) F(x^t)
                                       = θ_t^{-1} F^opt + (1 − θ_t^{-1}) F(x^t)

  ⟺ F(θ_t^{-1} x* + (1 − θ_t^{-1}) x^t) − F(x^{t+1})
       ≤ (1 − θ_t^{-1})(F(x^t) − F^opt) − (F(x^{t+1}) − F^opt)

Combining this with (7.9) and θ_t² − θ_t = θ_{t−1}² yields

  (L/2)(‖u^t‖₂² − ‖u^{t+1}‖₂²) ≥ θ_t² (F(x^{t+1}) − F^opt) − (θ_t² − θ_t)(F(x^t) − F^opt)
                               = θ_t² (F(x^{t+1}) − F^opt) − θ_{t−1}² (F(x^t) − F^opt),

thus finishing the proof

Accelerated GD 7-36

Page 38:

Convergence for strongly convex problems

  x^{t+1} = prox_{η_t h}(y^t − η_t ∇f(y^t))
  y^{t+1} = x^{t+1} + ((√κ − 1)/(√κ + 1)) (x^{t+1} − x^t)

Theorem 7.6 (Convergence of accelerated proximal gradient methods for the strongly convex case)

Suppose f is µ-strongly convex and L-smooth. If η_t ≡ 1/L, then

  F(x^t) − F^opt ≤ (1 − 1/√κ)^t (F(x^0) − F^opt + µ‖x^0 − x*‖₂² / 2)

Accelerated GD 7-37
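A minimal sketch of this strongly convex variant; the ridge-regularized least-squares instance (with h = 0, so the prox is the identity) is an illustrative assumption, not part of the slide.

```python
import numpy as np

def acc_prox_grad_sc(grad_f, prox_h, x0, L, mu, num_iters=300):
    """Accelerated proximal gradient with constant momentum (sqrt(kappa)-1)/(sqrt(kappa)+1)."""
    beta = (np.sqrt(L / mu) - 1.0) / (np.sqrt(L / mu) + 1.0)
    eta = 1.0 / L
    x, y = x0.copy(), x0.copy()
    for _ in range(num_iters):
        x_next = prox_h(y - eta * grad_f(y), eta)
        y = x_next + beta * (x_next - x)
        x = x_next
    return x

# illustrative strongly convex f: ridge-regularized least squares, h = 0
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((100, 40)), rng.standard_normal(100), 1.0
grad_f = lambda x: A.T @ (A @ x - b) + lam * x
L = np.linalg.norm(A, 2) ** 2 + lam            # smoothness constant
mu = lam                                       # a valid (possibly loose) strong convexity constant
x_hat = acc_prox_grad_sc(grad_f, lambda v, eta: v, np.zeros(40), L, mu)
```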

Page 39:

A practical issue

Fast convergence requires knowledge of κ = L/µ

• in practice, estimating µ is typically very challenging

A common observation: ripples / bumps in the traces of cost values

Accelerated GD 7-38

Page 40:

Rippling behavior

Numerical example: take y^{t+1} = x^{t+1} + ((1 − √q)/(1 + √q)) (x^{t+1} − x^t), with q* = 1/κ

[Figure 1 (O’Donoghue & Candès ’12): f(x^k) − f* vs. k under different estimates q ∈ {0, q*/10, q*/3, q*, 3q*, 10q*, 1}; overestimating the momentum (q < q*) causes overshooting and rippling in the function values]


period of ripples is often proportional to √(L/µ)

O’Donoghue, Candes ’12

• when q > q*: we underestimate momentum −→ slower convergence
• when q < q*: we overestimate momentum ((1 − √q)/(1 + √q) is large) −→ overshooting / rippling behavior

Accelerated GD 7-39

Page 41:

Adaptive restart (O’Donoghue, Candes ’12)

When a certain criterion is met, restart running FISTA with

x0 ← xt

y0 ← xt

θ0 = 1

• take the current iterate as a new starting point
• erase all memory of previous iterates and reset the momentum back to zero

Accelerated GD 7-40
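A minimal sketch combining the accelerated iteration with the two restart criteria (function and gradient schemes) listed on the following slide; the ill-conditioned quadratic at the end is an illustrative assumption.

```python
import numpy as np

def agd_with_restart(f, grad, x0, eta, num_iters=2000, scheme="gradient"):
    """Accelerated gradient with adaptive restart (O'Donoghue & Candes '12):
    reset the momentum whenever the chosen restart criterion fires."""
    x, y, theta = x0.copy(), x0.copy(), 1.0
    for _ in range(num_iters):
        g = grad(y)
        x_next = y - eta * g
        if scheme == "function":
            restart = f(x_next) > f(x)                 # function scheme
        else:
            restart = g @ (x_next - x) > 0             # gradient scheme
        if restart:
            y, theta = x_next.copy(), 1.0              # x0 <- xt, y0 <- xt, theta_0 = 1
        else:
            theta_next = (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2)) / 2.0
            y = x_next + ((theta - 1.0) / theta_next) * (x_next - x)
            theta = theta_next
        x = x_next
    return x

# illustrative ill-conditioned quadratic
Q = np.diag(np.logspace(0, 3, 50))
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
x_hat = agd_with_restart(f, grad, np.ones(50), eta=1.0 / Q.max())
print(f(x_hat))     # close to the optimal value 0
```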

Page 42:

Numerical comparisons of adaptive restart schemes

[Figure 3 (O’Donoghue & Candès ’12): comparison of fixed and adaptive restart intervals, plotting f(x^k) − f* vs. k for gradient descent, no restart, restart every 100/400/1000 iterations, optimal fixed restart (700), q = µ/L, and the adaptive function/gradient restart schemes]

[Figure 4 (O’Donoghue & Candès ’12): sequence trajectories in the (x1, x2) plane under q = q*, q = 0, and adaptive restart]


• function scheme: restart when f(x^t) > f(x^{t−1})
• gradient scheme: restart when ⟨∇f(y^{t−1}), x^t − x^{t−1}⟩ > 0, i.e., when the momentum leads us in a bad direction

Accelerated GD 7-41

Page 43:

Illustration

[Figure 4 (O’Donoghue & Candès ’12), repeated: trajectories in the (x1, x2) plane under q = q*, q = 0, and adaptive restart]

• with overestimated momentum (e.g., q = 0), one sees a spiralling trajectory
• adaptive restart helps mitigate this issue

Accelerated GD 7-42

Page 44:

Lower bounds

Page 45:

Optimality of Nesterov’s method

Interestingly, no first-order method can improve upon Nesterov’s result in general

More precisely, there exists a convex and L-smooth function f s.t.

  f(x^t) − f^opt ≥ 3L‖x^0 − x*‖₂² / (32 (t + 1)²)

as long as x^k ∈ x^0 + span{∇f(x^0), · · · , ∇f(x^{k−1})} (the definition of first-order methods) for all 1 ≤ k ≤ t

— Nemirovski, Yudin ’83

Accelerated GD 7-44

Page 46:

Example

  minimize_{x∈R^{2n+1}} f(x) = (L/4) ((1/2) x^⊤ A x − e_1^⊤ x)

where A ∈ R^{(2n+1)×(2n+1)} is the tridiagonal matrix with 2 on the diagonal and −1 on the first off-diagonals

• f is convex and L-smooth
• the optimizer x* is given by x*_i = 1 − i/(2n + 2) (1 ≤ i ≤ 2n + 1), obeying

    f^opt = (L/8)(1/(2n + 2) − 1)  and  ‖x*‖₂² ≤ (2n + 2)/3

• ∇f(x) = (L/4) A x − (L/4) e_1
• span{∇f(x^0), · · · , ∇f(x^{k−1})} =: K_k = span{e_1, · · · , e_k} if x^0 = 0
  ◦ every iteration of a first-order method expands the search space by at most one dimension

Accelerated GD 7-45
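The closed-form claims above can be verified numerically; the small value of n below is an illustrative choice.

```python
import numpy as np

n, L = 5, 1.0
d = 2 * n + 1
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)      # tridiagonal matrix from the example
e1 = np.zeros(d); e1[0] = 1.0
f = lambda x: (L / 4) * (0.5 * x @ A @ x - e1 @ x)

x_star = 1.0 - np.arange(1, d + 1) / (2 * n + 2)          # x*_i = 1 - i/(2n+2)
print(np.allclose((L / 4) * (A @ x_star - e1), 0.0))      # gradient vanishes at x*
print(np.isclose(f(x_star), (L / 8) * (1.0 / (2 * n + 2) - 1.0)))   # matches f^opt
print(x_star @ x_star <= (2 * n + 2) / 3.0)               # norm bound holds
```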


Page 48:

Example (cont.)

If we start with x^0 = 0, then

  f(x^n) ≥ inf_{x∈K_n} f(x) = (L/8)(1/(n + 1) − 1)

  ⟹ (f(x^n) − f^opt) / ‖x^0 − x*‖₂² ≥ (L/8)(1/(n + 1) − 1/(2n + 2)) / ((2n + 2)/3) = 3L / (32 (n + 1)²)

Accelerated GD 7-46

Page 49:

Summary: accelerated proximal gradient

                                     stepsize rule    convergence rate     iteration complexity
  convex & smooth problems           η_t = 1/L        O(1/t²)              O(1/√ε)
  strongly convex & smooth problems  η_t = 1/L        O((1 − 1/√κ)^t)      O(√κ log(1/ε))

Accelerated GD 7-47

Page 50:

Reference

[1] “Some methods of speeding up the convergence of iteration methods,” B. Polyak, USSR Computational Mathematics and Mathematical Physics, 1964.

[2] “A method of solving a convex programming problem with convergence rate O(1/k²),” Y. Nesterov, Soviet Mathematics Doklady, 1983.

[3] “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” A. Beck and M. Teboulle, SIAM Journal on Imaging Sciences, 2009.

[4] “First-order methods in optimization,” A. Beck, Vol. 25, SIAM, 2017.

[5] “Mathematical optimization, MATH301 lecture notes,” E. Candes, Stanford.

[6] “Gradient methods for minimizing composite functions,” Y. Nesterov, Technical Report, 2007.

Accelerated GD 7-48

Page 51:

Reference

[7] “Large-scale numerical optimization, MS&E318 lecture notes,” M. Saunders, Stanford.

[8] “Analysis and design of optimization algorithms via integral quadratic constraints,” L. Lessard, B. Recht, A. Packard, SIAM Journal on Optimization, 2016.

[9] “Proximal algorithms,” N. Parikh and S. Boyd, Foundations and Trends in Optimization, 2013.

[10] “Convex optimization: algorithms and complexity,” S. Bubeck, Foundations and Trends in Machine Learning, 2015.

[11] “Optimization methods for large-scale systems, EE236C lecture notes,” L. Vandenberghe, UCLA.

[12] “Problem complexity and method efficiency in optimization,” A. Nemirovski and D. Yudin, Wiley, 1983.

Accelerated GD 7-49

Page 52:

Reference

[13] “Introductory lectures on convex optimization: a basic course,” Y. Nesterov, 2004.

[14] “A differential equation for modeling Nesterov’s accelerated gradient method,” W. Su, S. Boyd, E. Candes, NIPS, 2014.

[15] “On accelerated proximal gradient methods for convex-concave optimization,” P. Tseng, 2008.

[16] “Adaptive restart for accelerated gradient schemes,” B. O’Donoghue and E. Candes, Foundations of Computational Mathematics, 2012.

Accelerated GD 7-50