Two Facets of Stochastic Optimization: Continuous-time Dynamics and Discrete-time Algorithms
Quanquan Gu Department of Computer Science
University of California, Los Angeles
Joint work with Pan Xu and Tianhao Wang
ACC Workshop on Interplay between Control, Optimization, and Machine Learning
Optimization for Machine Learning
• Many machine learning methods can be formulated as an optimization problem: min_{x ∈ 𝒳} f(x)
• f : ℝ^d → ℝ is a (strongly) convex function
• 𝒳 ⊆ ℝ^d is a constraint set
• Stochastic optimization plays a central role in large-scale machine learning: stochastic gradient descent, stochastic mirror descent, stochastic gradient Langevin dynamics, accelerated variants, …
Outline
• Stochastic Mirror Descent
• Understanding Acceleration in Optimization
• Continuous-time Dynamics for Accelerated Stochastic Mirror Descent
• Discretization of SDEs and New ASMD Algorithms
• Experiments
Stochastic Gradient Descent
SGD update: x_{k+1} = Π_𝒳(x_k − η_k G(x_k; ξ_k)), where η_k is the step size and G(x_k; ξ_k) is the stochastic gradient.
Unbiased estimator of the gradient: 𝔼_{ξ_k}[G(x_k; ξ_k)] = ∇f(x_k)
Convergence rate:
• convex & bounded gradient: 𝔼[f(x_k) − f(x*)] = O(G/√k)
• strongly convex & bounded gradient: 𝔼[f(x_k) − f(x*)] = O(G² log k/(μk))
Bounded gradient: ∥G(x; ξ)∥₂ ≤ G
μ-strongly convex: f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (μ/2)∥x − y∥₂²
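As a concrete (illustrative, not from the talk) rendering of the update, here is projected SGD in NumPy on a least-squares objective; the objective, the ball radius R, and the step size η_k = 1/√k are assumptions for the sketch:

```python
import numpy as np

def project_ball(x, R=1.0):
    """Euclidean projection onto the constraint set {x : ||x||_2 <= R}."""
    n = np.linalg.norm(x)
    return x if n <= R else x * (R / n)

def sgd(A, y, steps=2000, R=1.0, seed=0):
    """Projected SGD on f(x) = 1/(2n) ||Ax - y||_2^2, sampling one row per step."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for k in range(1, steps + 1):
        i = rng.integers(n)
        g = (A[i] @ x - y[i]) * A[i]              # unbiased stochastic gradient
        x = project_ball(x - g / np.sqrt(k), R)   # eta_k = 1/sqrt(k), then project
    return x
```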
From Euclidean Space to Non-Euclidean Space: Stochastic Mirror Descent
Bregman divergence: D_h(x, z) := h(z) − h(x) − ⟨∇h(x), z − x⟩
Stochastic Mirror Descent (SMD) update:
y_{k+1} = ∇h(x_k) − η_k G(x_k; ξ_k)
x_{k+1} = ∇h*(y_{k+1})
A descent method in the dual space: ∇h maps the primal iterate x_k to the dual point ∇h(x_k), the gradient step −η_k G(x_k; ξ_k) is taken there, and ∇h* maps y_{k+1} back to the primal iterate x_{k+1}.
Convergence rate (with a strongly convex distance generating function h):
• convex & bounded gradient: 𝔼[f(x_k) − f(x*)] = O(G/√k)
• strongly convex & bounded gradient: 𝔼[f(x_k) − f(x*)] = O(G² log k/(μk))
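For intuition, here is a minimal sketch (my own, not from the talk) of one SMD step with the entropy distance generating function h(x) = Σ_i x_i log x_i on the probability simplex, where ∇h*(y) is a softmax, so the update becomes multiplicative:

```python
import numpy as np

def smd_step(x, g, eta):
    """One stochastic mirror descent step with h(x) = sum_i x_i log x_i.
    Then grad h(x) = 1 + log x and grad h*(y) = softmax(y), so the update
    y = grad h(x) - eta * g, x+ = grad h*(y) multiplies x by exp(-eta * g)
    and renormalizes; the iterate never leaves the simplex."""
    y = np.log(x) - eta * g      # the additive constant 1 cancels on normalization
    w = np.exp(y - y.max())      # numerically stabilized exponentiation
    return w / w.sum()
```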
Accelerated Stochastic Mirror Descent
ASMD update [Lan, 2012; Ghadimi & Lan, 2012]:
x^{md}_k = β_k^{−1} x_k + (1 − β_k^{−1}) x^{ag}_k
x_{k+1} = ∇h*(∇h(x_k) − γ_k G(x^{md}_k; ξ_k))
x^{ag}_{k+1} = β_k^{−1} x_{k+1} + (1 − β_k^{−1}) x^{ag}_k
Hard to interpret!
Convergence rate:
• convex & smooth: 𝔼[f(x_k) − f(x*)] = O(L/k² + σ/√k)
• strongly convex & smooth: 𝔼[f(x_k) − f(x*)] = O(L/k² + σ²/(μk))
L-smooth: f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)∥x − y∥₂²
Bounded variance: 𝔼[∥G(x; ξ) − ∇f(x)∥₂²] ≤ σ²
When σ = 0, it matches the optimal rate of deterministic accelerated mirror descent.
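One way to make the three-line update concrete is to specialize it to h(x) = ½∥x∥₂², where ∇h and ∇h* are both the identity; the schedules β_k = (k+1)/2 and γ_k = k/(4L) below are illustrative choices, not the tuned ones from [Ghadimi & Lan, 2012]:

```python
import numpy as np

def asmd(stoch_grad, x0, L=1.0, steps=300):
    """Sketch of the ASMD/AC-SA style update with h(x) = 1/2 ||x||^2,
    so grad h and grad h* are the identity (unconstrained Euclidean case).
    `stoch_grad` returns a stochastic gradient at a query point."""
    x = x0.copy()
    x_ag = x0.copy()
    for k in range(1, steps + 1):
        beta_inv = 2.0 / (k + 1)                        # 1 / beta_k
        x_md = beta_inv * x + (1 - beta_inv) * x_ag     # middle point
        g = stoch_grad(x_md)                            # stochastic gradient
        x = x - (k / (4.0 * L)) * g                     # mirror (gradient) step
        x_ag = beta_inv * x + (1 - beta_inv) * x_ag     # aggregated iterate
    return x_ag
```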
We Want to …
• Better understand accelerated stochastic mirror descent
• Derive intuitive and simple accelerated stochastic mirror descent algorithms
• Deliver simple proof of the convergence rates
Outline
• Stochastic Mirror Descent
• Understanding Acceleration in Optimization
• Continuous-time Dynamics for Accelerated Stochastic Mirror Descent
• Discretization of SDEs and New ASMD Algorithms
• Experiments
Interpretations of Nesterov’s AGD/AMD
• Ordinary Differential Equation interpretation[Su et al, 2014] [Krichene et al, 2015] [Wibisono et al, 2016] [Wilson et al, 2016] [Diakonikolas & Orecchia, 2018]
• Other interpretations
• Linear Matrix Inequality [Lessard et al, 2016]
• Dissipativity Theory [Hu & Lessard, 2017]
• Linear Coupling [Allen-Zhu & Orecchia, 2017]
• Geometry [Bubeck et al, 2015]
• Game theory [Lan & Zhou, 2018]
From ODE to SDE
[Figure: sample paths X_t on t ∈ [0, 10], without and with Brownian motion]
dX_t = (−0.5 X_t + sin(0.01t))dt
dX_t = (−0.5 X_t + sin(0.01t))dt + 0.2 dB_t
Ordinary Differential Equation: dX_t = u(X_t, t)dt
Stochastic Differential Equation: dX_t = u(X_t, t)dt + σ(X_t, t)dB_t
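Sample paths like the ones above come from an Euler(–Maruyama) simulation; this is a generic sketch (assumed setup, not the talk's code) of discretizing dX_t = u(X_t, t)dt + σ(X_t, t)dB_t:

```python
import numpy as np

def euler_maruyama(u, sigma, x0, T=10.0, n=1000, seed=0):
    """Simulate dX_t = u(X_t,t)dt + sigma(X_t,t)dB_t on [0, T] with n steps.
    With sigma identically 0 this reduces to forward Euler for the ODE."""
    rng = np.random.default_rng(seed)
    dt = T / n
    xs = np.empty(n + 1)
    xs[0] = x0
    for k in range(n):
        t = k * dt
        dB = np.sqrt(dt) * rng.standard_normal()    # Brownian increment
        xs[k + 1] = xs[k] + u(xs[k], t) * dt + sigma(xs[k], t) * dB
    return xs

u = lambda x, t: -0.5 * x + np.sin(0.01 * t)              # drift from the example
ode_path = euler_maruyama(u, lambda x, t: 0.0, x0=3.0)    # the ODE
sde_path = euler_maruyama(u, lambda x, t: 0.2, x0=3.0)    # + 0.2 dB_t
```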
SDE Interpretations of Stochastic Optimization
Stochastic Gradient Descent: x_{k+1} = x_k − η_k ∇f̃(x_k, ξ_k)
Stochastic Gradient Flow: dX_t = −∇f(X_t)dt + σdB_t
Stochastic Mirror Descent: y_{k+1} = ∇h(x_k) − η_k ∇f̃(x_k, ξ_k), x_{k+1} = ∇h*(y_{k+1})
Stochastic Mirror Flow: d∇h(X_t) = −∇f(X_t)dt + σdB_t [Raginsky & Bouvrie, 2012] [Mertikopoulos & Staudigl, 2016]
Accelerated Stochastic Mirror Descent: y_{k+1} = ∇h(x_k) − η_k ∇f̃(x_k, ξ_k), x_{k+1} = α_k ∇h*(y_{k+1}) + (1 − α_k)x_k
Accelerated Stochastic Mirror Flow [Krichene & Bartlett, 2017]:
dZ_t = −η_t[∇f(X_t)dt + σ(X_t, t)dB_t]
dX_t = a_t[∇h*(Z_t/s_t) − X_t]dt
Yet …
[Comparison table: (Raginsky & Bouvrie, 2012), (Mertikopoulos & Staudigl, 2016), (Krichene & Bartlett, 2017), and this talk, compared on Convex, Strongly Convex, Acceleration, Converge, and Discrete-time Algorithm]
Outline
• Stochastic Mirror Descent
• Understanding Acceleration in Optimization
• Continuous-time Dynamics for Accelerated Stochastic Mirror Descent
• Discretization of SDEs and New ASMD Algorithms
• Experiments
Lagrangian Mechanics Behind Optimization
• Optimization: a mechanical/physical system with friction
• Undamped Lagrangian: ℒ(X, V, t) = (1/2)∥V∥² − f(X)  (kinetic energy minus potential energy)
• Principle of Least Action: the real-world motion X_t minimizes the action J(X) = ∫_𝕋 ℒ(X_t, Ẋ_t, t)dt
• Euler–Lagrange equation: (d/dt){∂ℒ/∂Ẋ(X_t, Ẋ_t, t)} = (∂ℒ/∂X)(X_t, Ẋ_t, t)
Bregman Lagrangian for Mirror Descent: General Convex Functions
• Damped Lagrangian: ℒ(X, V, t) = e^{γ_t}((1/2)∥V∥² − f(X))
• Euler–Lagrange equation gives: Ẍ_t + γ̇_t Ẋ_t + ∇f(X_t) = 0
• Damped Bregman Lagrangian [Wibisono et al, 2016]: ℒ(X, V, t) = e^{α_t + γ_t}(D_h(X + e^{−α_t}V, X) − e^{β_t} f(X))
Continuous-time Dynamics of MD: General Convex Functions
By the Euler–Lagrange equation, choosing e^{α_t} = β̇_t and γ_t = β_t:
(d/dt) ∇h(X_t + Ẋ_t/β̇_t) = −β̇_t e^{β_t} ∇f(X_t)  (a second-order ODE)
Define Y_t = ∇h(X_t + Ẋ_t/β̇_t) and rewrite the ODE:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −β̇_t e^{β_t} ∇f(X_t)dt
This is the continuous-time dynamics of AMD [Wibisono et al, 2016].
Continuous-time Dynamics of SMD: General Convex Functions
Add a Brownian motion?
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −β̇_t e^{β_t} ∇f(X_t)dt + δσ(X_t, t)dB_t
Does not converge. So we introduce an extra shrinkage parameter s_t:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −(β̇_t e^{β_t}/s_t)(∇f(X_t)dt + δσ(X_t, t)dB_t)
This is the continuous-time dynamics of accelerated SMD for general convex functions.
Convergence Rate of Continuous-time Dynamics: General Convex Functions
Stochastic differential equation (SDE) with drift term and diffusion term:
๏ t > 0 is the time index
๏ δ, β_t, s_t are scaling parameters
๏ B_t ∈ ℝ^d is the standard Brownian motion
๏ diffusion term: ∥σ(X_t, t)∥₂ ≤ σt^q
Convergence of the proposed SDE:
𝔼[f(X_t) − f(x*)] = O(1/t² + σ²/t^{1/2−q})
๏ optimal convergence rate of ASMD: O(1/k² + σ²/√k)
๏ when q = 0, it matches the optimal rate of accelerated stochastic mirror descent for general convex functions [Lan, 2012; Ghadimi & Lan, 2012]
Roadmap of the Proof
Rewrite the stochastic dynamics as the following SDE:
d(X_t, Y_t) = (β̇_t(∇h*(Y_t) − X_t), −β̇_t e^{β_t} ∇f(X_t)/s_t)dt + (0, −β̇_t e^{β_t} √δ σ(X_t, t)/s_t)dB_t
• Lyapunov function: E_t = e^{β_t}(f(X_t) − f(x*)) + s_t D_{h*}(Y_t, ∇h(x*))
• Step 1: bounding dE_t. Applying Itô's lemma to E_t with respect to the above SDE yields
dE_t = (∂E_t/∂t)dt + ⟨∂E_t/∂X_t, dX_t⟩ + ⟨∂E_t/∂Y_t, dY_t⟩ + (δβ̇_t² e^{2β_t}/(2s_t²)) tr(σ_t^⊤ (∂²E_t/∂Y_t²) σ_t)dt
≤ ṡ_t M_{h,𝒳} dt + (δβ̇_t² e^{2β_t}/(2s_t)) tr(σ_t^⊤ ∇²h*(Y_t) σ_t)dt − β̇_t e^{β_t}⟨∇h*(Y_t) − x*, σ_t dB_t⟩
Roadmap of the Proof
• Step 2: integrating and taking expectation:
𝔼[E_t] ≤ E_0 + (s_t − s_0)M_{h,𝒳} + (1/2)𝔼[∫_0^t (δβ̇_r² e^{2β_r}/s_r) tr(σ_r^⊤ ∇²h*(Y_r) σ_r)dr],
where M_{h,𝒳} is the diameter of 𝒳.
• Step 3: choosing parameters. Plugging in β_t = 2 log t and s_t = t^{3/2+q}:
𝔼[f(X_t) − f(x*)] ≤ 𝔼[E_t] e^{−β_t} = O(1/t² + 1/t^{1/2−q}).
Bregman Lagrangian for Mirror Descent: Strongly Convex Functions
• Damped Lagrangian: ℒ(X, V, t) = e^{γ_t}((1/2)∥V∥² − f(X))
• Euler–Lagrange equation gives: Ẍ_t + γ̇_t Ẋ_t + ∇f(X_t) = 0
• Damped Bregman Lagrangian [Xu et al., 2018]: ℒ(X, V, t) = e^{α_t + β_t + γ_t}(μ D_h(X + e^{−α_t}V, X) − f(X))
Continuous-time Dynamics of MD: Strongly Convex Functions
By the Euler–Lagrange equation, choosing e^{α_t} = β̇_t and γ̇_t = −e^{α_t}:
(d/dt) ∇h(X_t + Ẋ_t/β̇_t) = −β̇_t(∇f(X_t)/μ + ∇h(X_t + Ẋ_t/β̇_t) − ∇h(X_t))  (a second-order ODE)
Define Y_t = ∇h(X_t + Ẋ_t/β̇_t) and add Brownian motion to the ODE:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −β̇_t((1/μ)∇f(X_t)dt + (Y_t − ∇h(X_t))dt + (δσ(X_t, t)/μ)dB_t)
This is the continuous-time dynamics of accelerated SMD for strongly convex functions [Xu et al., 2018].
Convergence Rate of Continuous-time Dynamics: Strongly Convex Functions
Stochastic differential equation (SDE) with drift term and diffusion term:
๏ t > 0 is the time index
๏ δ, β_t are scaling parameters
๏ B_t ∈ ℝ^d is the standard Brownian motion
๏ diffusion term: ∥σ(X_t, t)∥₂ ≤ σt^q
Convergence of the proposed SDE:
𝔼[f(X_t) − f(x*)] = O(1/t² + σ²/(μt^{1−2q}))
๏ optimal convergence rate of ASMD: O(1/k² + σ²/(μk))
๏ when q = 0, it matches the optimal rate of accelerated stochastic mirror descent for strongly convex functions [Ghadimi & Lan, 2012]
Outline
• Stochastic Mirror Descent
• Understanding Acceleration in Optimization
• Continuous-time Dynamics for Accelerated Stochastic Mirror Descent
• Discretization of SDEs and New ASMD Algorithms
• Experiments
Discretization of SDE
Continuous-time to discrete-time: the continuous index t maps to the discrete index k, so X_t, X_{t+δ} correspond to the sequence x_k, x_{k+1} and Y_t, Y_{t+δ} to y_k, y_{k+1}.
Forward (Explicit) Euler Discretization:
(x_{k+1} − x_k)/δ ≈ Ẋ_t
(y_{k+1} − y_k)/δ ≈ Ẏ_t
Backward (Implicit) Euler Discretization:
(x_k − x_{k−1})/δ ≈ Ẋ_t
(y_k − y_{k−1})/δ ≈ Ẏ_t
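On the scalar test equation ẋ = −λx (a standard illustration, not from the slides), the contrast between the two schemes is a one-liner each: the forward step is explicit, while the backward step solves an equation in the new iterate and remains stable for large λδ:

```python
def forward_euler(x, lam, dt):
    """Explicit: x_{k+1} = x_k + dt * f(x_k) with f(x) = -lam * x."""
    return x * (1 - lam * dt)

def backward_euler(x, lam, dt):
    """Implicit: x_{k+1} = x_k + dt * f(x_{k+1}); solvable in closed form here."""
    return x / (1 + lam * dt)
```

With λ = 10 and δ = 0.3 the explicit step multiplies by −2 and blows up, while the implicit step divides by 4 and decays, which is why implicit discretizations below admit stronger guarantees at the price of an implicit update.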
New Discrete-time Algorithm (Implicit)
SDEs for general convex functions:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −(β̇_t e^{β_t}/s_t)(∇f(X_t)dt + δσ(X_t, t)dB_t)
Implicit discretization:
∇h*(y_{k+1}) = x_{k+1} + (1/τ_k)(x_{k+1} − x_k)
y_{k+1} − y_k = −(τ_k/s_k)G(x_{k+1}; ξ_{k+1})
Convergence rate: 𝔼[f(x_k) − f(x*)] = O(1/k² + σ/√k)
๏ Optimal rate [Ghadimi & Lan, 2012]
๏ Implicit update
New Discrete-time Algorithm (ASMD)
SDEs for general convex functions:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −(β̇_t e^{β_t}/s_t)(∇f(X_t)dt + δσ(X_t, t)dB_t)
Hybrid discretization:
∇h*(y_k) = x_{k+1} + (1/τ_k)(x_{k+1} − x_k)
y_{k+1} − y_k = −(τ_k/s_k)G(x_{k+1}; ξ_{k+1})
Convergence rate: 𝔼[f(x_k) − f(x*)] = O(1/k² + (σ² + 1)/√k)
๏ Not the optimal rate [Ghadimi & Lan, 2012]
๏ Explicit (practical) algorithm
New Discrete-time Algorithm (ASMD3)
SDEs for general convex functions:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −(β̇_t e^{β_t}/s_t)(∇f(X_t)dt + δσ(X_t, t)dB_t)
Explicit discretization with an additional sequence z_k:
∇h*(y_k) = x_k + (1/τ_k)(z_{k+1} − x_k)
y_{k+1} − y_k = −(τ_k/s_k)G(z_{k+1}; ξ_{k+1})
x_{k+1} = argmin_{x∈𝒳} {⟨G(z_{k+1}; ξ_{k+1}), x⟩ + M_k D_h(z_{k+1}, x)}
Convergence rate: 𝔼[f(x_k) − f(x*)] = O(1/k² + σ²/√k)
๏ Optimal rate [Ghadimi & Lan, 2012]
๏ Explicit (practical) algorithm
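A sketch of this three-sequence scheme, specialized to h(x) = ½∥x∥₂² (so ∇h = ∇h* = identity and D_h is half the squared Euclidean distance) with unconstrained 𝒳; the schedules τ_k = 2/(k+1), τ_k/s_k = (k+1)/(2L), and M_k = L are illustrative choices, not the ones analyzed in the paper:

```python
import numpy as np

def asmd3(stoch_grad, x0, L=1.0, steps=500):
    """Three-sequence explicit scheme with h(x) = 1/2 ||x||^2, unconstrained.
    The first relation grad h*(y_k) = x_k + (z_{k+1} - x_k)/tau_k becomes
    z_{k+1} = (1 - tau_k) x_k + tau_k y_k; the argmin step becomes a plain
    gradient step z_{k+1} - g/M_k."""
    x = x0.copy()
    y = x0.copy()
    for k in range(1, steps + 1):
        tau = 2.0 / (k + 1)
        z = (1 - tau) * x + tau * y              # additional sequence z_{k+1}
        g = stoch_grad(z)                        # stochastic gradient at z_{k+1}
        y = y - (k + 1) / (2.0 * L) * g          # dual update, tau_k/s_k = (k+1)/(2L)
        x = z - g / L                            # argmin step with M_k = L
    return x
```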
New Discrete-time Algorithm (Implicit)
SDEs for strongly convex functions:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −β̇_t((1/μ)∇f(X_t)dt + (Y_t − ∇h(X_t))dt + (δσ(X_t, t)/μ)dB_t)
Implicit discretization:
∇h*(y_{k+1}) = x_{k+1} + (1/τ_k)(x_{k+1} − x_k)
y_{k+1} − y_k = −τ_k(G(x_{k+1}; ξ_{k+1})/μ + y_{k+1} − ∇h(x_{k+1}))
Convergence rate: 𝔼[f(x_k) − f(x*)] = O(1/k² + σ²/(μk))
๏ Optimal rate [Ghadimi & Lan, 2012]
๏ Implicit update
New Discrete-time Algorithm (ASMD)
SDEs for strongly convex functions:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −β̇_t((1/μ)∇f(X_t)dt + (Y_t − ∇h(X_t))dt + (δσ(X_t, t)/μ)dB_t)
Hybrid discretization:
∇h*(y_k) = x_{k+1} + (1/τ_k)(x_{k+1} − x_k)
y_{k+1} − y_k = −τ_k(G(x_{k+1}; ξ_{k+1})/μ + y_{k+1} − ∇h(x_{k+1}))
Convergence rate: 𝔼[f(x_k) − f(x*)] = O(1/k² + (σ² + 1)/(μk))
๏ Not the optimal rate [Ghadimi & Lan, 2012]
๏ Explicit (practical) algorithm
New Discrete-time Algorithm (ASMD3)
SDEs for strongly convex functions:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −β̇_t((1/μ)∇f(X_t)dt + (Y_t − ∇h(X_t))dt + (δσ(X_t, t)/μ)dB_t)
Explicit discretization with an additional sequence z_k:
∇h*(y_k) = x_k + (1/τ_k)(z_{k+1} − x_k)
y_{k+1} − y_k = −τ_k(G(z_{k+1}; ξ_{k+1})/μ + y_k − ∇h(z_{k+1}))
x_{k+1} = argmin_{x∈𝒳} {⟨G(z_{k+1}; ξ_{k+1}), x⟩ + M_k D_h(z_{k+1}, x)}
Convergence rate: 𝔼[f(x_k) − f(x*)] = O(1/k² + σ²/(μk))
๏ Optimal rate [Ghadimi & Lan, 2012]
๏ Explicit (practical) algorithm
Outline
• Stochastic Mirror Descent
• Understanding Acceleration in Optimization
• Continuous-time Dynamics for Accelerated Stochastic Mirror Descent
• Discretization of SDEs and New ASMD Algorithms
• Experiments
Experiment Results: General Convex Case
Optimization problem: min_{x∈𝒳} (1/(2n))∥Ax − y∥₂²
Baselines: SMD, SAGE [Hu et al., 2009], AC-SA [Ghadimi & Lan, 2012]
Setting 1:
๏ constraint set: 𝒳 = {x ∈ ℝ^d : ∥x∥₂ ≤ R}
๏ distance generating function: h(x) = (1/2)∥x∥₂²
Setting 2:
๏ constraint set: 𝒳 = {x ∈ ℝ^d : Σ_{i=1}^d x_i = 1, x_i ≥ 0}
๏ distance generating function: h(x) = Σ_{i=1}^d x_i log x_i
[Figure: objective-gap curves for SMD, SAGE, AC-SA, ASMD, and ASMD3 in both settings]
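The general convex experiment can be reproduced in outline as follows (a sketch with assumed problem sizes, noise level, and step sizes, running only plain SMD with the entropy mirror map on the simplex setting):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
A = rng.normal(size=(n, d))
x_true = rng.dirichlet(np.ones(d))            # ground truth on the simplex
y = A @ x_true + 0.1 * rng.normal(size=n)     # noisy measurements

def stoch_grad(x):
    """Single-row stochastic gradient of f(x) = 1/(2n) ||Ax - y||_2^2."""
    i = rng.integers(n)
    return (A[i] @ x - y[i]) * A[i]

# SMD with h(x) = sum_i x_i log x_i; iterates stay on the simplex
x = np.ones(d) / d
for k in range(1, 3001):
    w = np.log(x) - stoch_grad(x) / np.sqrt(k)   # dual step, eta_k = 1/sqrt(k)
    w = np.exp(w - w.max())                      # grad h* = softmax (stabilized)
    x = w / w.sum()

f = lambda v: np.mean((A @ v - y) ** 2) / 2      # objective value
```

The accelerated variants (ASMD, ASMD3) would replace the single-sequence loop with the two- and three-sequence updates from the preceding slides.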
Experiment Results: Strongly Convex Case
Optimization problem: min_{x∈𝒳} (1/(2n))∥Ax − y∥₂² + λ∥x∥₂²
Baselines: SMD, SAGE [Hu et al., 2009], AC-SA [Ghadimi & Lan, 2012]
Setting 1:
๏ constraint set: 𝒳 = {x ∈ ℝ^d : ∥x∥₂ ≤ R}
๏ distance generating function: h(x) = (1/2)∥x∥₂²
Setting 2:
๏ constraint set: 𝒳 = {x ∈ ℝ^d : Σ_{i=1}^d x_i = 1, x_i ≥ 0}
๏ distance generating function: h(x) = Σ_{i=1}^d x_i log x_i
[Figure: objective-gap curves for SMD, SAGE, AC-SA, ASMD, and ASMD3 in both settings]
Take Away
Continuous-time dynamics can help us
• better understand stochastic optimization
• derive new discrete-time algorithms based on various discretization schemes
• deliver a unified and simple proof of convergence rates
Thank You
Reference
• Allen-Zhu, Z., & Orecchia, L. (2017). Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
• Bubeck, S., Lee, Y. T., and Singh, M. A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.
• Diakonikolas, J. and Orecchia, L. Accelerated Extra-Gradient Descent: A Novel Accelerated First-Order Method. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), 2018 (Vol. 94).
• Ghadimi, S. and Lan, G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
• Hu, B., & Lessard, L. (2017, August). Dissipativity theory for Nesterov's accelerated method. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 1549-1557). JMLR. org.
• Hu, C., Pan, W., and Kwok, J. T. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, pp. 781–789, 2009.
• Krichene, W., Bayen, A., and Bartlett, P. L. Accelerated mirror descent in continuous and discrete time. In Advances in neural information processing systems, pp. 2845–2853, 2015.
• Krichene, W. and Bartlett, P.L., 2017. Acceleration and averaging in stochastic descent dynamics. In Advances in Neural Information Processing Systems (pp. 6796-6806).
• Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.
• Lan, G. and Zhou, Y., 2018. An optimal randomized incremental gradient method. Mathematical programming, 171(1-2), pp.167-215.
• Lessard, L., Recht, B., and Packard, A. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
• Mertikopoulos, P. and Staudigl, M. On the convergence of gradient-like flows with noisy gradient input. arXiv preprint arXiv:1611.06730, 2016.
• Raginsky, M. and Bouvrie, J. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pp. 6793–6800. IEEE, 2012.
• Su, W., Boyd, S., and Candes, E. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pp. 2510–2518, 2014.
• Wibisono, A., Wilson, A. C., and Jordan, M. I. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, pp. 201614734, 2016.
• Wilson, A. C., Recht, B., and Jordan, M. I. A lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.
• Xu, P., Wang, T. and Gu, Q., 2018, March. Accelerated stochastic mirror descent: From continuous-time dynamics to discrete-time algorithms. In International Conference on Artificial Intelligence and Statistics (pp. 1087-1096).
• Xu, P., Wang, T. and Gu, Q., 2018, July. Continuous and discrete-time accelerated stochastic mirror descent for strongly convex functions. In International Conference on Machine Learning (pp. 5488-5497).