
Page 1

Two Facets of Stochastic Optimization: Continuous-time Dynamics and Discrete-time Algorithms

Quanquan Gu
Department of Computer Science
University of California, Los Angeles

Joint work with Pan Xu and Tianhao Wang

ACC Workshop on Interplay between Control, Optimization, and Machine Learning

Page 2

Optimization for Machine Learning

• Many machine learning methods can be formulated as an optimization problem

  $\min_{x \in \mathcal{X}} f(x)$

• $f : \mathbb{R}^d \to \mathbb{R}$ is a (strongly) convex function
• $\mathcal{X} \subseteq \mathbb{R}^d$ is a constraint set

• Stochastic optimization plays a central role in large-scale machine learning:
  • stochastic gradient descent
  • stochastic mirror descent
  • stochastic gradient Langevin dynamics
  • accelerated variants
  • …

Page 3

Outline

• Stochastic Mirror Descent

• Understanding Acceleration in Optimization

• Continuous-time Dynamics for Accelerated Stochastic Mirror Descent

• Discretization of SDEs and New ASMD Algorithms

• Experiments

Page 4

Stochastic Gradient Descent

SGD update: $x_{k+1} = \Pi_{\mathcal{X}}(x_k - \eta_k G(x_k; \xi_k))$, where $\eta_k$ is the step size and $G(x_k; \xi_k)$ is the stochastic gradient.

Unbiased estimator of the gradient: $\mathbb{E}_{\xi_k}[G(x_k; \xi_k)] = \nabla f(x_k)$

Convergence rates:

• convex & bounded gradient: $\mathbb{E}[f(x_k) - f(x^*)] = O\big(G/\sqrt{k}\big)$
• strongly convex & bounded gradient: $\mathbb{E}[f(x_k) - f(x^*)] = O\big(G^2 \log k/(\mu k)\big)$

where bounded gradient means $\|G(x; \xi)\|_2 \le G$, and $\mu$-strong convexity means $f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \mu/2\,\|x - y\|_2^2$.
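As a concrete reference point, here is a minimal NumPy sketch of projected SGD on a synthetic least-squares problem; the problem data, step-size schedule, and ball radius are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, R = 1000, 20, 5.0                 # illustrative sizes and ball radius
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
y = A @ x_true + 0.1 * rng.standard_normal(n)

def stochastic_grad(x):
    """Unbiased estimate of the gradient of f(x) = 1/(2n)||Ax - y||^2 from one random row."""
    i = rng.integers(n)
    return A[i] * (A[i] @ x - y[i])

def project_ball(x, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

x = np.zeros(d)
for k in range(1, 5001):
    eta = 0.1 / np.sqrt(k)              # O(1/sqrt(k)) step size for the convex case
    x = project_ball(x - eta * stochastic_grad(x), R)
```

Sampling a single row gives $\mathbb{E}_i[a_i(a_i^\top x - y_i)] = \frac{1}{n}A^\top(Ax - y) = \nabla f(x)$, so the estimator is unbiased as required above.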

Page 5

From Euclidean Space to Non-Euclidean Space: Stochastic Mirror Descent

Bregman divergence: $D_h(x, z) := h(z) - h(x) - \langle \nabla h(x), z - x \rangle$

Stochastic Mirror Descent (SMD) update:

$y_{k+1} = \nabla h(x_k) - \eta_k G(x_k; \xi_k)$
$x_{k+1} = \nabla h^*(y_{k+1})$

SMD is a descent method in the dual space: $\nabla h$ maps $x_k$ from the primal space to $\nabla h(x_k)$ in the dual space, the step $-\eta_k G(x_k; \xi_k)$ produces $y_{k+1}$ there, and $\nabla h^*$ maps $y_{k+1}$ back to $x_{k+1}$ in the primal space.

Convergence rates (for a strongly convex distance generating function $h$):

• convex & bounded gradient: $\mathbb{E}[f(x_k) - f(x^*)] = O\big(G/\sqrt{k}\big)$
• strongly convex & bounded gradient: $\mathbb{E}[f(x_k) - f(x^*)] = O\big(G^2 \log k/(\mu k)\big)$
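To make the primal-dual picture concrete, here is a minimal sketch of SMD on the probability simplex with the entropy distance generating function $h(x) = \sum_i x_i \log x_i$, for which $\nabla h(x) = 1 + \log x$ and $\nabla h^*$ reduces to a softmax; the objective and noise model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
c = rng.standard_normal(d)

def grad_f(x):
    # f(x) = 0.5*||x - p||^2 with a fixed target p on the simplex (illustrative objective)
    p = np.exp(c) / np.exp(c).sum()
    return x - p

def nabla_h(x):
    return 1.0 + np.log(x)            # gradient of h(x) = sum_i x_i log x_i

def nabla_h_star(y):
    # conjugate map back onto the simplex: a softmax (invariant to constant shifts)
    z = np.exp(y - y.max())           # max-shift for numerical stability
    return z / z.sum()

x = np.full(d, 1.0 / d)               # start at the uniform distribution
for k in range(1, 2001):
    eta = 0.5 / np.sqrt(k)
    g = grad_f(x) + 0.01 * rng.standard_normal(d)   # noisy (stochastic) gradient
    y = nabla_h(x) - eta * g          # gradient step in the dual space
    x = nabla_h_star(y)               # mirror back to the primal space
```

With this choice of $h$ the update is exactly exponentiated gradient: $x_{k+1} \propto x_k \odot \exp(-\eta_k g)$.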

Page 6

Accelerated Stochastic Mirror Descent

ASMD update [Lan, 2012; Ghadimi & Lan, 2012]:

$x_k^{md} = \beta_k^{-1} x_k + (1 - \beta_k^{-1}) x_k^{ag}$
$x_{k+1} = \nabla h^*(\nabla h(x_k) - \gamma_k G(x_k^{md}; \xi_k))$
$x_{k+1}^{ag} = \beta_k^{-1} x_{k+1} + (1 - \beta_k^{-1}) x_k^{ag}$

Hard to interpret!

Assumptions:
• Smooth: $f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + L/2\,\|x - y\|_2^2$
• Bounded variance: $\mathbb{E}[\|G(x; \xi) - \nabla f(x)\|_2^2] \le \sigma^2$

Convergence rates:

• convex & smooth: $\mathbb{E}[f(x_k) - f(x^*)] = O\Big(\frac{L}{k^2} + \frac{\sigma}{\sqrt{k}}\Big)$
• strongly convex & smooth: $\mathbb{E}[f(x_k) - f(x^*)] = O\Big(\frac{L}{k^2} + \frac{\sigma^2}{\mu k}\Big)$

When $\sigma = 0$, these match the optimal rate of deterministic accelerated mirror descent.

Page 7

We Want to …

• Better understand accelerated stochastic mirror descent

• Derive intuitive and simple accelerated stochastic mirror descent algorithms

• Deliver simple proofs of the convergence rates

Page 8

Outline

• Stochastic Mirror Descent

• Understanding Acceleration in Optimization

• Continuous-time Dynamics for Accelerated Stochastic Mirror Descent

• Discretization of SDEs and New ASMD Algorithms

• Experiments

Page 9

Interpretations of Nesterov's AGD/AMD

• Ordinary differential equation interpretation [Su et al., 2014; Krichene et al., 2015; Wibisono et al., 2016; Wilson et al., 2016; Diakonikolas & Orecchia, 2018]

• Other interpretations

• Linear Matrix Inequality [Lessard et al, 2016]

• Dissipativity Theory [Hu & Lessard, 2017]

• Linear Coupling [Allen-Zhu & Orecchia, 2017]

• Geometry [Bubeck et al, 2015]

• Game theory [Lan & Zhou, 2018]

Page 10

From ODE to SDE

[Figure: sample paths $X_t$ versus $t$ for the example ODE (left) and SDE (right) below]

Ordinary Differential Equation: $dX_t = u(X_t, t)\,dt$, e.g. $dX_t = (-0.5X_t + \sin(0.01t))\,dt$

Stochastic Differential Equation: $dX_t = u(X_t, t)\,dt + \sigma(X_t, t)\,dB_t$, where $B_t$ is a Brownian motion, e.g. $dX_t = (-0.5X_t + \sin(0.01t))\,dt + 0.2\,dB_t$
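The right panel can be reproduced with Euler–Maruyama, the stochastic analogue of forward Euler; a minimal sketch, where the step size, horizon, and initial condition are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, dt = 10.0, 0.01
steps = int(T / dt)

def drift(x, t):
    return -0.5 * x + np.sin(0.01 * t)     # u(X_t, t) from the example above

x_ode = x_sde = 3.0                         # common (illustrative) initial condition
for k in range(steps):
    t = k * dt
    x_ode += drift(x_ode, t) * dt                     # forward Euler for the ODE
    dB = np.sqrt(dt) * rng.standard_normal()          # Brownian increment ~ N(0, dt)
    x_sde += drift(x_sde, t) * dt + 0.2 * dB          # Euler-Maruyama for the SDE
```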

Page 11

SDE Interpretations of Stochastic Optimization

Stochastic Gradient Descent: $x_{k+1} = x_k - \eta_k \nabla \tilde{f}(x_k, \xi_k)$
Stochastic Gradient Flow: $dX_t = -\nabla f(X_t)\,dt + \sigma\,dB_t$

Stochastic Mirror Descent: $y_{k+1} = \nabla h(x_k) - \eta_k \nabla \tilde{f}(x_k, \xi_k)$, $x_{k+1} = \nabla h^*(y_{k+1})$
Stochastic Mirror Flow [Raginsky & Bouvrie, 2012; Mertikopoulos & Staudigl, 2016]: $d\,\nabla h(X_t) = -\nabla f(X_t)\,dt + \sigma\,dB_t$

Accelerated Stochastic Mirror Descent: $y_{k+1} = \nabla h(x_k) - \eta_k \nabla \tilde{f}(x_k, \xi_k)$, $x_{k+1} = \alpha_k \nabla h^*(y_{k+1}) + (1 - \alpha_k) x_k$
Accelerated Stochastic Mirror Flow [Krichene & Bartlett, 2017]: $dZ_t = -\eta_t[\nabla f(X_t)\,dt + \sigma(X_t, t)\,dB_t]$, $dX_t = a_t[\nabla h^*(Z_t/s_t) - X_t]\,dt$

Page 12

Yet …

[Table comparing prior continuous-time analyses, (Raginsky & Bouvrie, 2012), (Mertikopoulos & Staudigl, 2016), and (Krichene & Bartlett, 2017), against five criteria: Convex, Strongly Convex, Acceleration, Converge, Discrete-time Algorithm; each covers only a subset]

Page 13

Yet …

[The same comparison table with an added row, "This talk", covering all five criteria]

Page 14

Outline

• Stochastic Mirror Descent

• Understanding Acceleration in Optimization

• Continuous-time Dynamics for Accelerated Stochastic Mirror Descent

• Discretization of SDEs and New ASMD Algorithms

• Experiments

Page 15

Lagrangian Mechanics Behind Optimization

• Optimization: a mechanical/physical system with friction
• Undamped Lagrangian: $\mathcal{L}(X, V, t) = \frac{1}{2}\|V\|^2 - f(X)$, i.e. kinetic energy minus potential energy
• Principle of Least Action: the real-world motion $X_t$ minimizes the action $J(X) = \int_{\mathbb{T}} \mathcal{L}(X_t, \dot{X}_t, t)\,dt$
• Euler–Lagrange equation: $\frac{d}{dt}\Big\{\frac{\partial \mathcal{L}}{\partial \dot{X}_t}(X_t, \dot{X}_t, t)\Big\} = \frac{\partial \mathcal{L}}{\partial X_t}(X_t, \dot{X}_t, t)$
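For this undamped Lagrangian, the Euler–Lagrange equation reduces to Newtonian dynamics in the potential $f$, which is the starting point for the damped variants on the next slides; a one-line verification:

```latex
\frac{\partial \mathcal{L}}{\partial V} = V
\;\Longrightarrow\;
\frac{d}{dt}\Big\{\frac{\partial \mathcal{L}}{\partial \dot{X}_t}\Big\} = \ddot{X}_t ,
\qquad
\frac{\partial \mathcal{L}}{\partial X} = -\nabla f(X)
\;\Longrightarrow\;
\ddot{X}_t + \nabla f(X_t) = 0 .
```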

Page 16

Bregman Lagrangian for Mirror Descent: General Convex Functions

• Damped Lagrangian: $\mathcal{L}(X, V, t) = e^{\gamma_t}\big(\frac{1}{2}\|V\|^2 - f(X)\big)$
• Solution to the Euler–Lagrange equation: $\ddot{X}_t + \dot{\gamma}_t \dot{X}_t + \nabla f(X_t) = 0$
• Damped Bregman Lagrangian [Wibisono et al., 2016]: $\mathcal{L}(X, V, t) = e^{\alpha_t + \gamma_t}\big(D_h(X + e^{-\alpha_t}V, X) - e^{\beta_t} f(X)\big)$

Page 17

Continuous-time Dynamics of MD: General Convex Functions

By the Euler–Lagrange equation, choosing $e^{\alpha_t} = \dot{\beta}_t$ and $\gamma_t = \beta_t$:

$\frac{d}{dt} \nabla h\Big(X_t + \frac{1}{\dot{\beta}_t}\dot{X}_t\Big) = -\dot{\beta}_t e^{\beta_t} \nabla f(X_t)$ (a second-order ODE)

Define $Y_t = \nabla h\big(X_t + \frac{1}{\dot{\beta}_t}\dot{X}_t\big)$ and rewrite the ODE as the continuous-time dynamics of AMD [Wibisono et al., 2016]:

$dX_t = \dot{\beta}_t(\nabla h^*(Y_t) - X_t)\,dt$
$dY_t = -\dot{\beta}_t e^{\beta_t} \nabla f(X_t)\,dt$
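To see why the first-order system is equivalent to the second-order ODE: the first equation is just the definition of $Y_t$ rearranged, and the second restates the ODE; schematically,

```latex
\dot{X}_t = \dot{\beta}_t\big(\nabla h^*(Y_t) - X_t\big)
\;\Longleftrightarrow\;
Y_t = \nabla h\Big(X_t + \tfrac{1}{\dot{\beta}_t}\dot{X}_t\Big),
\qquad
dY_t = \Big[\tfrac{d}{dt}\,\nabla h\Big(X_t + \tfrac{1}{\dot{\beta}_t}\dot{X}_t\Big)\Big]dt
= -\dot{\beta}_t e^{\beta_t}\,\nabla f(X_t)\,dt .
```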

Page 18

Continuous-time Dynamics of SMD: General Convex Functions

Add a Brownian motion?

$dX_t = \dot{\beta}_t(\nabla h^*(Y_t) - X_t)\,dt$
$dY_t = -\dot{\beta}_t e^{\beta_t} \nabla f(X_t)\,dt + \delta\sigma(X_t, t)\,dB_t$

This does not converge, so we introduce an extra shrinkage parameter $s_t$:

$dX_t = \dot{\beta}_t(\nabla h^*(Y_t) - X_t)\,dt$
$dY_t = -\frac{\dot{\beta}_t e^{\beta_t}}{s_t}\big(\nabla f(X_t)\,dt + \delta\sigma(X_t, t)\,dB_t\big)$

This is the continuous-time dynamics of accelerated SMD for general convex functions.

Page 19

Convergence Rate of Continuous-time Dynamics: General Convex Functions

Stochastic differential equation (SDE):

$dX_t = \dot{\beta}_t(\nabla h^*(Y_t) - X_t)\,dt$ (drift term)
$dY_t = -\frac{\dot{\beta}_t e^{\beta_t}}{s_t}\big(\nabla f(X_t)\,dt + \delta\sigma(X_t, t)\,dB_t\big)$ (drift and diffusion terms)

๏ $t > 0$ is the time index
๏ $\delta$, $\beta_t$, $s_t$ are scaling parameters
๏ $B_t \in \mathbb{R}^d$ is the standard Brownian motion

Convergence of the proposed SDE, with the diffusion term bounded as $\|\sigma(X_t, t)\|_2 \le \sigma t^q$:

$\mathbb{E}[f(X_t) - f(x^*)] = O\Big(\frac{1}{t^2} + \frac{\sigma^2}{t^{1/2-q}}\Big)$

๏ When $q = 0$, this matches the optimal convergence rate of accelerated stochastic mirror descent for general convex functions, $O\big(1/k^2 + \sigma^2/\sqrt{k}\big)$ [Lan, 2012; Ghadimi & Lan, 2012].

Page 20

Roadmap of the Proof

• Lyapunov function: $E_t = e^{\beta_t}(f(X_t) - f(x^*)) + s_t D_{h^*}(Y_t, \nabla h(x^*))$

Rewrite the stochastic dynamics as the following SDE:

$d\begin{bmatrix} X_t \\ Y_t \end{bmatrix} = \begin{bmatrix} \dot{\beta}_t(\nabla h^*(Y_t) - X_t) \\ -\dot{\beta}_t e^{\beta_t}\nabla f(X_t)/s_t \end{bmatrix} dt + \begin{bmatrix} 0 \\ -\dot{\beta}_t e^{\beta_t}\delta\sigma(X_t, t)/s_t \end{bmatrix} dB_t .$

• Step 1: bounding $dE_t$. Applying Itô's lemma to $E_t$ with respect to the above SDE yields

$dE_t = \frac{\partial E_t}{\partial t}\,dt + \Big\langle \frac{\partial E_t}{\partial X_t}, dX_t \Big\rangle + \Big\langle \frac{\partial E_t}{\partial Y_t}, dY_t \Big\rangle + \frac{\delta^2\dot{\beta}_t^2 e^{2\beta_t}}{2s_t^2}\,\mathrm{tr}\Big(\sigma_t^\top \frac{\partial^2 E_t}{\partial Y_t^2}\,\sigma_t\Big)dt$

which can be bounded as

$dE_t \le \dot{s}_t M_{h,\mathcal{X}}\,dt + \frac{\delta^2\dot{\beta}_t^2 e^{2\beta_t}}{2s_t}\,\mathrm{tr}\big(\sigma_t^\top \nabla^2 h^*(Y_t)\,\sigma_t\big)\,dt - \delta\dot{\beta}_t e^{\beta_t}\big\langle \nabla h^*(Y_t) - x^*,\ \sigma_t\,dB_t\big\rangle .$

Page 21

Roadmap of the Proof

• Step 2: integrating and taking expectation

$\mathbb{E}[E_t] \le E_0 + (s_t - s_0)\,M_{h,\mathcal{X}} + \frac{1}{2}\,\mathbb{E}\left[\int_0^t \frac{\delta^2\dot{\beta}_r^2 e^{2\beta_r}}{s_r}\,\mathrm{tr}\big(\sigma_r^\top \nabla^2 h^*(Y_r)\,\sigma_r\big)\,dr\right],$

where $M_{h,\mathcal{X}}$ is the diameter of $\mathcal{X}$.

• Step 3: choosing parameters. Plugging in $\beta_t = 2\log t$ and $s_t = t^{3/2+q}$:

$\mathbb{E}[f(X_t) - f(x^*)] \le \frac{\mathbb{E}[E_t]}{e^{\beta_t}} = O\Big(\frac{1}{t^2} + \frac{1}{t^{1/2-q}}\Big).$
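The arithmetic behind Step 3: with $\beta_t = 2\log t$ we have $e^{\beta_t} = t^2$, and under $\|\sigma(X_t, t)\|_2 \le \sigma t^q$ the integral term in Step 2 also grows at most like $s_t = t^{3/2+q}$, so

```latex
\mathbb{E}[f(X_t) - f(x^*)] \le \frac{\mathbb{E}[E_t]}{e^{\beta_t}}
\lesssim \frac{E_0 + s_t\,M_{h,\mathcal{X}} + O(s_t)}{t^{2}}
= O\Big(\frac{1}{t^{2}} + \frac{t^{3/2+q}}{t^{2}}\Big)
= O\Big(\frac{1}{t^{2}} + \frac{1}{t^{1/2-q}}\Big).
```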

Page 22

Bregman Lagrangian for Mirror Descent: Strongly Convex Functions

• Damped Lagrangian: $\mathcal{L}(X, V, t) = e^{\gamma_t}\big(\frac{1}{2}\|V\|^2 - f(X)\big)$
• Solution to the Euler–Lagrange equation: $\ddot{X}_t + \dot{\gamma}_t \dot{X}_t + \nabla f(X_t) = 0$
• Damped Bregman Lagrangian [Xu et al., 2018]: $\mathcal{L}(X, V, t) = e^{\alpha_t + \beta_t + \gamma_t}\big(\mu D_h(X + e^{-\alpha_t}V, X) - f(X)\big)$

Page 23

Continuous-time Dynamics of MD: Strongly Convex Functions

By the Euler–Lagrange equation, choosing $e^{\alpha_t} = \dot{\beta}_t$ and $\dot{\gamma}_t = -e^{\alpha_t}$:

$\frac{d}{dt}\nabla h\Big(X_t + \frac{1}{\dot{\beta}_t}\dot{X}_t\Big) = -\dot{\beta}_t\Big(\nabla f(X_t)/\mu + \nabla h\Big(X_t + \frac{1}{\dot{\beta}_t}\dot{X}_t\Big) - \nabla h(X_t)\Big)$ (a second-order ODE)

Define $Y_t = \nabla h\big(X_t + \frac{1}{\dot{\beta}_t}\dot{X}_t\big)$ and add Brownian motion to the ODE:

$dX_t = \dot{\beta}_t(\nabla h^*(Y_t) - X_t)\,dt$
$dY_t = -\dot{\beta}_t\Big(\frac{1}{\mu}\nabla f(X_t)\,dt + (Y_t - \nabla h(X_t))\,dt + \frac{\delta\sigma(X_t, t)}{\mu}\,dB_t\Big)$

This is the continuous-time dynamics of accelerated SMD for strongly convex functions [Xu et al., 2018].

Page 24

Convergence Rate of Continuous-time Dynamics: Strongly Convex Functions

Stochastic differential equation (SDE):

$dX_t = \dot{\beta}_t(\nabla h^*(Y_t) - X_t)\,dt$ (drift term)
$dY_t = -\dot{\beta}_t\Big(\frac{1}{\mu}\nabla f(X_t)\,dt + (Y_t - \nabla h(X_t))\,dt + \frac{\delta\sigma(X_t, t)}{\mu}\,dB_t\Big)$ (drift and diffusion terms)

๏ $t > 0$ is the time index
๏ $\delta$, $\beta_t$ are scaling parameters
๏ $B_t \in \mathbb{R}^d$ is the standard Brownian motion

Convergence of the proposed SDE, with the diffusion term bounded as $\|\sigma(X_t, t)\|_2 \le \sigma t^q$:

$\mathbb{E}[f(X_t) - f(x^*)] = O\Big(\frac{1}{t^2} + \frac{\sigma^2}{\mu t^{1-2q}}\Big)$

๏ When $q = 0$, this matches the optimal convergence rate of accelerated stochastic mirror descent for strongly convex functions, $O\big(1/k^2 + \sigma^2/(\mu k)\big)$ [Ghadimi & Lan, 2012].

Page 25

Outline

• Stochastic Mirror Descent

• Understanding Acceleration in Optimization

• Continuous-time Dynamics for Accelerated Stochastic Mirror Descent

• Discretization of SDEs and New ASMD Algorithms

• Experiments

Page 26

Discretization of SDE

Continuous-time to discrete-time: the continuous index maps to the discrete index, so $X_t \to x_k$, $X_{t+\delta} \to x_{k+1}$, $Y_t \to y_k$, $Y_{t+\delta} \to y_{k+1}$.

Forward (Explicit) Euler discretization:

$\frac{x_{k+1} - x_k}{\delta} \approx \dot{X}_t, \qquad \frac{y_{k+1} - y_k}{\delta} \approx \dot{Y}_t$

Backward (Implicit) Euler discretization:

$\frac{x_k - x_{k-1}}{\delta} \approx \dot{X}_t, \qquad \frac{y_k - y_{k-1}}{\delta} \approx \dot{Y}_t$
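As a toy contrast between the two schemes (not from the slides), consider $dX/dt = -\lambda X$: forward Euler gives the explicit update $x_{k+1} = (1 - \lambda\delta)x_k$, while backward Euler requires solving $x_{k+1} = x_k - \lambda\delta x_{k+1}$ for $x_{k+1}$, which here has the closed form $x_k/(1 + \lambda\delta)$; a minimal sketch with illustrative constants:

```python
# Forward (explicit) vs backward (implicit) Euler on dX/dt = -lam * X.
# lam and delta are illustrative; lam * delta > 2 makes forward Euler diverge.
lam, delta, steps = 10.0, 0.25, 50
x_fwd = x_bwd = 1.0
for _ in range(steps):
    x_fwd = (1.0 - lam * delta) * x_fwd   # explicit: uses the current iterate only
    x_bwd = x_bwd / (1.0 + lam * delta)   # implicit: solve x_new = x_old - lam*delta*x_new
print(x_fwd, x_bwd)  # forward oscillates and blows up here; backward decays to 0
```

This is the trade-off the following slides exploit: implicit discretizations are more stable but each step requires solving an equation, while explicit ones are directly implementable.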

Page 27

New Discrete-time Algorithm (Implicit)

SDE for general convex functions:

$dX_t = \dot{\beta}_t(\nabla h^*(Y_t) - X_t)\,dt$
$dY_t = -\frac{\dot{\beta}_t e^{\beta_t}}{s_t}\big(\nabla f(X_t)\,dt + \delta\sigma(X_t, t)\,dB_t\big)$

Implicit discretization:

$\nabla h^*(y_{k+1}) = x_{k+1} + \frac{1}{\tau_k}(x_{k+1} - x_k)$
$y_{k+1} - y_k = -\frac{\tau_k}{s_k}\,G(x_{k+1}; \xi_{k+1})$

Convergence rate: $\mathbb{E}[f(x_k) - f(x^*)] = O\Big(\frac{1}{k^2} + \frac{\sigma}{\sqrt{k}}\Big)$

๏ Optimal rate [Ghadimi & Lan, 2012]
๏ Implicit update

Page 28

New Discrete-time Algorithm (ASMD)

SDE for general convex functions:

$dX_t = \dot{\beta}_t(\nabla h^*(Y_t) - X_t)\,dt$
$dY_t = -\frac{\dot{\beta}_t e^{\beta_t}}{s_t}\big(\nabla f(X_t)\,dt + \delta\sigma(X_t, t)\,dB_t\big)$

Hybrid discretization:

$\nabla h^*(y_k) = x_{k+1} + \frac{1}{\tau_k}(x_{k+1} - x_k)$
$y_{k+1} - y_k = -\frac{\tau_k}{s_k}\,G(x_{k+1}; \xi_{k+1})$

Convergence rate: $\mathbb{E}[f(x_k) - f(x^*)] = O\Big(\frac{1}{k^2} + \frac{\sigma^2 + 1}{\sqrt{k}}\Big)$

๏ Not the optimal rate [Ghadimi & Lan, 2012]
๏ Explicit (practical) algorithm

Page 29

New Discrete-time Algorithm (ASMD3)

SDE for general convex functions:

$dX_t = \dot{\beta}_t(\nabla h^*(Y_t) - X_t)\,dt$
$dY_t = -\frac{\dot{\beta}_t e^{\beta_t}}{s_t}\big(\nabla f(X_t)\,dt + \delta\sigma(X_t, t)\,dB_t\big)$

Explicit discretization with an additional sequence $z_{k+1}$:

$\nabla h^*(y_k) = x_k + \frac{1}{\tau_k}(z_{k+1} - x_k)$
$y_{k+1} - y_k = -\frac{\tau_k}{s_k}\,G(z_{k+1}; \xi_{k+1})$
$x_{k+1} = \arg\min_{x \in \mathcal{X}} \{\langle G(z_{k+1}; \xi_{k+1}), x\rangle + M_k D_h(z_{k+1}, x)\}$

Convergence rate: $\mathbb{E}[f(x_k) - f(x^*)] = O\Big(\frac{1}{k^2} + \frac{\sigma^2}{\sqrt{k}}\Big)$

๏ Optimal rate [Ghadimi & Lan, 2012]
๏ Explicit (practical) algorithm
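A minimal sketch of this ASMD3 recursion on the simplex with the entropy distance generating function; the objective, noise model, and the schedules $\tau_k$, $s_k$, $M_k$ are illustrative assumptions, and the arg-min step is solved in closed form for this $h$ (under the slide's Bregman convention $D_h(z, x) = h(x) - h(z) - \langle \nabla h(z), x - z\rangle$, the minimizer is $\nabla h^*(\nabla h(z) - g/M_k)$, an exponentiated-gradient update).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
p = rng.dirichlet(np.ones(d))               # target distribution; f(x) = 0.5*||x - p||^2

def G(z):
    """Stochastic gradient oracle (illustrative additive noise)."""
    return (z - p) + 0.01 * rng.standard_normal(d)

def nabla_h(x):
    return 1.0 + np.log(x)                   # h(x) = sum_i x_i log x_i

def nabla_h_star(y):
    z = np.exp(y - y.max())                  # softmax: conjugate map onto the simplex
    return z / z.sum()

x = np.full(d, 1.0 / d)
y = nabla_h(x)
for k in range(1, 2001):
    tau, s, M = 2.0 / (k + 1), (k + 1) ** 1.5, 1.0   # illustrative schedules
    z = x + tau * (nabla_h_star(y) - x)      # z_{k+1} from grad h*(y_k) = x_k + (z - x_k)/tau
    g = G(z)
    y = y - (tau / s) * g                    # dual update for y_{k+1}
    x = nabla_h_star(nabla_h(z) - g / M)     # closed-form arg-min of <g, x> + M * D_h(z, x)
```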

Page 30

New Discrete-time Algorithm (Implicit)

SDE for strongly convex functions:

$dX_t = \dot{\beta}_t(\nabla h^*(Y_t) - X_t)\,dt$
$dY_t = -\dot{\beta}_t\Big(\frac{1}{\mu}\nabla f(X_t)\,dt + (Y_t - \nabla h(X_t))\,dt + \frac{\delta\sigma(X_t, t)}{\mu}\,dB_t\Big)$

Implicit discretization:

$\nabla h^*(y_{k+1}) = x_{k+1} + \frac{1}{\tau_k}(x_{k+1} - x_k)$
$y_{k+1} - y_k = -\tau_k\big(G(x_{k+1}; \xi_{k+1})/\mu + y_{k+1} - \nabla h(x_{k+1})\big)$

Convergence rate: $\mathbb{E}[f(x_k) - f(x^*)] = O\Big(\frac{1}{k^2} + \frac{\sigma^2}{\mu k}\Big)$

๏ Optimal rate [Ghadimi & Lan, 2012]
๏ Implicit update

Page 31

New Discrete-time Algorithm (ASMD)

SDE for strongly convex functions:

$dX_t = \dot{\beta}_t(\nabla h^*(Y_t) - X_t)\,dt$
$dY_t = -\dot{\beta}_t\Big(\frac{1}{\mu}\nabla f(X_t)\,dt + (Y_t - \nabla h(X_t))\,dt + \frac{\delta\sigma(X_t, t)}{\mu}\,dB_t\Big)$

Hybrid discretization:

$\nabla h^*(y_k) = x_{k+1} + \frac{1}{\tau_k}(x_{k+1} - x_k)$
$y_{k+1} - y_k = -\tau_k\big(G(x_{k+1}; \xi_{k+1})/\mu + y_{k+1} - \nabla h(x_{k+1})\big)$

Convergence rate: $\mathbb{E}[f(x_k) - f(x^*)] = O\Big(\frac{1}{k^2} + \frac{\sigma^2 + 1}{\mu k}\Big)$

๏ Not the optimal rate [Ghadimi & Lan, 2012]
๏ Explicit (practical) algorithm

Page 32

New Discrete-time Algorithm (ASMD3)

SDE for strongly convex functions:

$dX_t = \dot{\beta}_t(\nabla h^*(Y_t) - X_t)\,dt$
$dY_t = -\dot{\beta}_t\Big(\frac{1}{\mu}\nabla f(X_t)\,dt + (Y_t - \nabla h(X_t))\,dt + \frac{\delta\sigma(X_t, t)}{\mu}\,dB_t\Big)$

Explicit discretization with an additional sequence $z_{k+1}$:

$\nabla h^*(y_k) = x_k + \frac{1}{\tau_k}(z_{k+1} - x_k)$
$y_{k+1} - y_k = -\tau_k\big(G(z_{k+1}; \xi_{k+1})/\mu + y_k - \nabla h(z_{k+1})\big)$
$x_{k+1} = \arg\min_{x \in \mathcal{X}}\{\langle G(z_{k+1}; \xi_{k+1}), x\rangle + M_k D_h(z_{k+1}, x)\}$

Convergence rate: $\mathbb{E}[f(x_k) - f(x^*)] = O\Big(\frac{1}{k^2} + \frac{\sigma^2}{\mu k}\Big)$

๏ Optimal rate [Ghadimi & Lan, 2012]
๏ Explicit (practical) algorithm

Page 33

Outline

• Stochastic Mirror Descent

• Understanding Acceleration in Optimization

• Continuous-time Dynamics for Accelerated Stochastic Mirror Descent

• Discretization of SDEs and New ASMD Algorithms

• Experiments

Page 34

Experiment Results: General Convex Case

Optimization problem: $\min_{x \in \mathcal{X}} \frac{1}{2n}\|Ax - y\|_2^2$

Baselines: SMD, SAGE [Hu et al., 2009], AC-SA [Ghadimi & Lan, 2012]

[Figure: convergence curves for SMD, SAGE, AC-SA, ASMD, and ASMD3 in the two settings below]

๏ Setting 1: constraint set $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_2 \le R\}$ with distance generating function $h(x) = \frac{1}{2}\|x\|_2^2$
๏ Setting 2: constraint set $\mathcal{X} = \{x \in \mathbb{R}^d : \sum_{i=1}^d x_i = 1,\ x_i \ge 0\}$ with distance generating function $h(x) = \sum_{i=1}^d x_i \log x_i$

Page 35

Experiment Results: Strongly Convex Case

Optimization problem: $\min_{x \in \mathcal{X}} \frac{1}{2n}\|Ax - y\|_2^2 + \lambda\|x\|_2^2$

Baselines: SMD, SAGE [Hu et al., 2009], AC-SA [Ghadimi & Lan, 2012]

[Figure: convergence curves for SMD, SAGE, AC-SA, ASMD, and ASMD3 in the two settings below]

๏ Setting 1: constraint set $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_2 \le R\}$ with distance generating function $h(x) = \frac{1}{2}\|x\|_2^2$
๏ Setting 2: constraint set $\mathcal{X} = \{x \in \mathbb{R}^d : \sum_{i=1}^d x_i = 1,\ x_i \ge 0\}$ with distance generating function $h(x) = \sum_{i=1}^d x_i \log x_i$

Page 36

Take Away

Continuous-time dynamics can help us

• better understand stochastic optimization

• derive new discrete-time algorithms based on various discretization schemes

• deliver a unified and simple proof of convergence rates

Page 37

Thank You

Page 38

References

• Allen-Zhu, Z. and Orecchia, L. Linear coupling: An ultimate unification of gradient and mirror descent. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
• Bubeck, S., Lee, Y. T., and Singh, M. A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.
• Diakonikolas, J. and Orecchia, L. Accelerated extra-gradient descent: A novel accelerated first-order method. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), Vol. 94, 2018.
• Ghadimi, S. and Lan, G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
• Hu, B. and Lessard, L. Dissipativity theory for Nesterov's accelerated method. In Proceedings of the 34th International Conference on Machine Learning, pp. 1549–1557, 2017.
• Hu, C., Pan, W., and Kwok, J. T. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, pp. 781–789, 2009.

Page 39

References

• Krichene, W., Bayen, A., and Bartlett, P. L. Accelerated mirror descent in continuous and discrete time. In Advances in Neural Information Processing Systems, pp. 2845–2853, 2015.
• Krichene, W. and Bartlett, P. L. Acceleration and averaging in stochastic descent dynamics. In Advances in Neural Information Processing Systems, pp. 6796–6806, 2017.
• Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.
• Lan, G. and Zhou, Y. An optimal randomized incremental gradient method. Mathematical Programming, 171(1-2):167–215, 2018.
• Lessard, L., Recht, B., and Packard, A. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
• Mertikopoulos, P. and Staudigl, M. On the convergence of gradient-like flows with noisy gradient input. arXiv preprint arXiv:1611.06730, 2016.

Page 40

References

• Raginsky, M. and Bouvrie, J. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In 2012 IEEE 51st Annual Conference on Decision and Control (CDC), pp. 6793–6800, 2012.
• Su, W., Boyd, S., and Candes, E. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pp. 2510–2518, 2014.
• Wibisono, A., Wilson, A. C., and Jordan, M. I. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.
• Wilson, A. C., Recht, B., and Jordan, M. I. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.
• Xu, P., Wang, T., and Gu, Q. Accelerated stochastic mirror descent: From continuous-time dynamics to discrete-time algorithms. In International Conference on Artificial Intelligence and Statistics, pp. 1087–1096, 2018.
• Xu, P., Wang, T., and Gu, Q. Continuous and discrete-time accelerated stochastic mirror descent for strongly convex functions. In International Conference on Machine Learning, pp. 5488–5497, 2018.