Two Facets of Stochastic Optimization: Continuous-time Dynamics and Discrete-time Algorithms
Quanquan Gu Department of Computer Science
University of California, Los Angeles
Joint work with Pan Xu and Tianhao Wang
ACC Workshop on Interplay between Control, Optimization, and Machine Learning
Optimization for Machine Learning
• Many machine learning methods can be formulated as an optimization problem: min_{x ∈ 𝒳} f(x)
• f : ℝ^d → ℝ is a (strongly) convex function
• 𝒳 ⊆ ℝ^d is a constraint set
• Stochastic optimization plays a central role in large-scale machine learning: stochastic gradient descent, stochastic mirror descent, stochastic gradient Langevin dynamics, accelerated variants, …
Outline
• Stochastic Mirror Descent
• Understanding Acceleration in Optimization
• Continuous-time Dynamics for Accelerated Stochastic Mirror Descent
• Discretization of SDEs and New ASMD Algorithms
• Experiments
Stochastic Gradient Descent
SGD update: x_{k+1} = Π_𝒳(x_k − η_k G(x_k; ξ_k)), where η_k is the step size and G(x_k; ξ_k) is the stochastic gradient.
Unbiased estimator of the gradient: 𝔼_{ξ_k}[G(x_k; ξ_k)] = ∇f(x_k)
Convergence rate:
• convex & bounded gradient: 𝔼[f(x_k) − f(x*)] = O(G/√k)
• strongly convex & bounded gradient: 𝔼[f(x_k) − f(x*)] = O(G² log k/(μk))
Bounded gradient: ∥G(x; ξ)∥₂ ≤ G
μ-strongly convex: f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (μ/2)∥x − y∥₂²
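As a concrete (illustrative, not from the talk) rendering of the update, here is projected SGD in NumPy on a least-squares objective; the objective, the ball radius R, and the step size η_k = 1/√k are assumptions for the sketch:

```python
import numpy as np

def project_ball(x, R=1.0):
    """Euclidean projection onto the constraint set {x : ||x||_2 <= R}."""
    n = np.linalg.norm(x)
    return x if n <= R else x * (R / n)

def sgd(A, y, steps=2000, R=1.0, seed=0):
    """Projected SGD on f(x) = 1/(2n) ||Ax - y||_2^2, sampling one row per step."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for k in range(1, steps + 1):
        i = rng.integers(n)
        g = (A[i] @ x - y[i]) * A[i]              # unbiased stochastic gradient
        x = project_ball(x - g / np.sqrt(k), R)   # eta_k = 1/sqrt(k), then project
    return x
```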
From Euclidean Space to Non-Euclidean Space: Stochastic Mirror Descent
Bregman divergence: D_h(x, z) := h(z) − h(x) − ⟨∇h(x), z − x⟩
Stochastic Mirror Descent (SMD) update:
y_{k+1} = ∇h(x_k) − η_k G(x_k; ξ_k)
x_{k+1} = ∇h*(y_{k+1})
A descent method in the dual space: ∇h maps the primal iterate x_k to the dual point ∇h(x_k), the gradient step −η_k G(x_k; ξ_k) is taken there, and ∇h* maps y_{k+1} back to the primal iterate x_{k+1}.
Convergence rate (with a strongly convex distance generating function h):
• convex & bounded gradient: 𝔼[f(x_k) − f(x*)] = O(G/√k)
• strongly convex & bounded gradient: 𝔼[f(x_k) − f(x*)] = O(G² log k/(μk))
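For intuition, here is a minimal sketch (my own, not from the talk) of one SMD step with the entropy distance generating function h(x) = Σ_i x_i log x_i on the probability simplex, where ∇h*(y) is a softmax, so the update becomes multiplicative:

```python
import numpy as np

def smd_step(x, g, eta):
    """One stochastic mirror descent step with h(x) = sum_i x_i log x_i.
    Then grad h(x) = 1 + log x and grad h*(y) = softmax(y), so the update
    y = grad h(x) - eta * g, x+ = grad h*(y) multiplies x by exp(-eta * g)
    and renormalizes; the iterate never leaves the simplex."""
    y = np.log(x) - eta * g      # the additive constant 1 cancels on normalization
    w = np.exp(y - y.max())      # numerically stabilized exponentiation
    return w / w.sum()
```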
Accelerated Stochastic Mirror Descent
ASMD update [Lan, 2012; Ghadimi & Lan, 2012]:
x^{md}_k = β_k^{−1} x_k + (1 − β_k^{−1}) x^{ag}_k
x_{k+1} = ∇h*(∇h(x_k) − γ_k G(x^{md}_k; ξ_k))
x^{ag}_{k+1} = β_k^{−1} x_{k+1} + (1 − β_k^{−1}) x^{ag}_k
Hard to interpret!
Convergence rate:
• convex & smooth: 𝔼[f(x_k) − f(x*)] = O(L/k² + σ/√k)
• strongly convex & smooth: 𝔼[f(x_k) − f(x*)] = O(L/k² + σ²/(μk))
L-smooth: f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)∥x − y∥₂²
Bounded variance: 𝔼[∥G(x; ξ) − ∇f(x)∥₂²] ≤ σ²
When σ = 0, it matches the optimal rate of deterministic accelerated mirror descent.
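One way to make the three-line update concrete is to specialize it to h(x) = ½∥x∥₂², where ∇h and ∇h* are both the identity; the schedules β_k = (k+1)/2 and γ_k = k/(4L) below are illustrative choices, not the tuned ones from [Ghadimi & Lan, 2012]:

```python
import numpy as np

def asmd(stoch_grad, x0, L=1.0, steps=300):
    """Sketch of the ASMD/AC-SA style update with h(x) = 1/2 ||x||^2,
    so grad h and grad h* are the identity (unconstrained Euclidean case).
    `stoch_grad` returns a stochastic gradient at a query point."""
    x = x0.copy()
    x_ag = x0.copy()
    for k in range(1, steps + 1):
        beta_inv = 2.0 / (k + 1)                        # 1 / beta_k
        x_md = beta_inv * x + (1 - beta_inv) * x_ag     # middle point
        g = stoch_grad(x_md)                            # stochastic gradient
        x = x - (k / (4.0 * L)) * g                     # mirror (gradient) step
        x_ag = beta_inv * x + (1 - beta_inv) * x_ag     # aggregated iterate
    return x_ag
```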
We Want to …
• Better understand accelerated stochastic mirror descent
• Derive intuitive and simple accelerated stochastic mirror descent algorithms
• Deliver simple proof of the convergence rates
Outline
• Stochastic Mirror Descent
• Understanding Acceleration in Optimization
• Continuous-time Dynamics for Accelerated Stochastic Mirror Descent
• Discretization of SDEs and New ASMD Algorithms
• Experiments
Interpretations of Nesterov’s AGD/AMD
• Ordinary Differential Equation interpretation[Su et al, 2014] [Krichene et al, 2015] [Wibisono et al, 2016] [Wilson et al, 2016] [Diakonikolas & Orecchia, 2018]
• Other interpretations
• Linear Matrix Inequality [Lessard et al, 2016]
• Dissipativity Theory [Hu & Lessard, 2017]
• Linear Coupling [Allen-Zhu & Orecchia, 2017]
• Geometry [Bubeck et al, 2015]
• Game theory [Lan & Zhou, 2018]
From ODE to SDE
[Figure: sample paths X_t on t ∈ [0, 10], without and with Brownian motion]
dX_t = (−0.5 X_t + sin(0.01t))dt
dX_t = (−0.5 X_t + sin(0.01t))dt + 0.2 dB_t
Ordinary Differential Equation: dX_t = u(X_t, t)dt
Stochastic Differential Equation: dX_t = u(X_t, t)dt + σ(X_t, t)dB_t
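Sample paths like the ones above come from an Euler(–Maruyama) simulation; this is a generic sketch (assumed setup, not the talk's code) of discretizing dX_t = u(X_t, t)dt + σ(X_t, t)dB_t:

```python
import numpy as np

def euler_maruyama(u, sigma, x0, T=10.0, n=1000, seed=0):
    """Simulate dX_t = u(X_t,t)dt + sigma(X_t,t)dB_t on [0, T] with n steps.
    With sigma identically 0 this reduces to forward Euler for the ODE."""
    rng = np.random.default_rng(seed)
    dt = T / n
    xs = np.empty(n + 1)
    xs[0] = x0
    for k in range(n):
        t = k * dt
        dB = np.sqrt(dt) * rng.standard_normal()    # Brownian increment
        xs[k + 1] = xs[k] + u(xs[k], t) * dt + sigma(xs[k], t) * dB
    return xs

u = lambda x, t: -0.5 * x + np.sin(0.01 * t)              # drift from the example
ode_path = euler_maruyama(u, lambda x, t: 0.0, x0=3.0)    # the ODE
sde_path = euler_maruyama(u, lambda x, t: 0.2, x0=3.0)    # + 0.2 dB_t
```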
SDE Interpretations of Stochastic Optimization
Stochastic Gradient Descent: x_{k+1} = x_k − η_k ∇f̃(x_k, ξ_k)
Stochastic Gradient Flow: dX_t = −∇f(X_t)dt + σdB_t
Stochastic Mirror Descent: y_{k+1} = ∇h(x_k) − η_k ∇f̃(x_k, ξ_k), x_{k+1} = ∇h*(y_{k+1})
Stochastic Mirror Flow: d∇h(X_t) = −∇f(X_t)dt + σdB_t [Raginsky & Bouvrie, 2012] [Mertikopoulos & Staudigl, 2016]
Accelerated Stochastic Mirror Descent: y_{k+1} = ∇h(x_k) − η_k ∇f̃(x_k, ξ_k), x_{k+1} = α_k ∇h*(y_{k+1}) + (1 − α_k)x_k
Accelerated Stochastic Mirror Flow [Krichene & Bartlett, 2017]:
dZ_t = −η_t[∇f(X_t)dt + σ(X_t, t)dB_t]
dX_t = a_t[∇h*(Z_t/s_t) − X_t]dt
Yet …
[Comparison table: (Raginsky & Bouvrie, 2012), (Mertikopoulos & Staudigl, 2016), (Krichene & Bartlett, 2017), and this talk, compared on Convex, Strongly Convex, Acceleration, Converge, and Discrete-time Algorithm]
Outline
• Stochastic Mirror Descent
• Understanding Acceleration in Optimization
• Continuous-time Dynamics for Accelerated Stochastic Mirror Descent
• Discretization of SDEs and New ASMD Algorithms
• Experiments
Lagrangian Mechanics Behind Optimization
• Optimization: a mechanical/physical system with friction
• Undamped Lagrangian: ℒ(X, V, t) = (1/2)∥V∥² − f(X)  (kinetic energy minus potential energy)
• Principle of Least Action: the real-world motion X_t minimizes the action J(X) = ∫_𝕋 ℒ(X_t, Ẋ_t, t)dt
• Euler–Lagrange equation: (d/dt){∂ℒ/∂Ẋ(X_t, Ẋ_t, t)} = (∂ℒ/∂X)(X_t, Ẋ_t, t)
Bregman Lagrangian for Mirror Descent: General Convex Functions
• Damped Lagrangian: ℒ(X, V, t) = e^{γ_t}((1/2)∥V∥² − f(X))
• Euler–Lagrange equation gives: Ẍ_t + γ̇_t Ẋ_t + ∇f(X_t) = 0
• Damped Bregman Lagrangian [Wibisono et al, 2016]: ℒ(X, V, t) = e^{α_t + γ_t}(D_h(X + e^{−α_t}V, X) − e^{β_t} f(X))
Continuous-time Dynamics of MD: General Convex Functions
By the Euler–Lagrange equation, choosing e^{α_t} = β̇_t and γ_t = β_t:
(d/dt) ∇h(X_t + Ẋ_t/β̇_t) = −β̇_t e^{β_t} ∇f(X_t)  (a second-order ODE)
Define Y_t = ∇h(X_t + Ẋ_t/β̇_t) and rewrite the ODE:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −β̇_t e^{β_t} ∇f(X_t)dt
This is the continuous-time dynamics of AMD [Wibisono et al, 2016].
Continuous-time Dynamics of SMD: General Convex Functions
Add a Brownian motion?
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −β̇_t e^{β_t} ∇f(X_t)dt + δσ(X_t, t)dB_t
Does not converge. So we introduce an extra shrinkage parameter s_t:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −(β̇_t e^{β_t}/s_t)(∇f(X_t)dt + δσ(X_t, t)dB_t)
This is the continuous-time dynamics of accelerated SMD for general convex functions.
Convergence Rate of Continuous-time Dynamics: General Convex Functions
Stochastic differential equation (SDE) with drift term and diffusion term:
๏ t > 0 is the time index
๏ δ, β_t, s_t are scaling parameters
๏ B_t ∈ ℝ^d is the standard Brownian motion
๏ diffusion term: ∥σ(X_t, t)∥₂ ≤ σt^q
Convergence of the proposed SDE:
𝔼[f(X_t) − f(x*)] = O(1/t² + σ²/t^{1/2−q})
๏ optimal convergence rate of ASMD: O(1/k² + σ²/√k)
๏ when q = 0, it matches the optimal rate of accelerated stochastic mirror descent for general convex functions [Lan, 2012; Ghadimi & Lan, 2012]
Roadmap of the Proof
Rewrite the stochastic dynamics as the following SDE:
d(X_t, Y_t) = (β̇_t(∇h*(Y_t) − X_t), −β̇_t e^{β_t} ∇f(X_t)/s_t)dt + (0, −β̇_t e^{β_t} √δ σ(X_t, t)/s_t)dB_t
• Lyapunov function: E_t = e^{β_t}(f(X_t) − f(x*)) + s_t D_{h*}(Y_t, ∇h(x*))
• Step 1: bounding dE_t. Applying Itô's lemma to E_t with respect to the above SDE yields
dE_t = (∂E_t/∂t)dt + ⟨∂E_t/∂X_t, dX_t⟩ + ⟨∂E_t/∂Y_t, dY_t⟩ + (δβ̇_t² e^{2β_t}/(2s_t²)) tr(σ_t^⊤ (∂²E_t/∂Y_t²) σ_t)dt
≤ ṡ_t M_{h,𝒳} dt + (δβ̇_t² e^{2β_t}/(2s_t)) tr(σ_t^⊤ ∇²h*(Y_t) σ_t)dt − β̇_t e^{β_t}⟨∇h*(Y_t) − x*, σ_t dB_t⟩
Roadmap of the Proof
• Step 2: integrating and taking expectation:
𝔼[E_t] ≤ E_0 + (s_t − s_0)M_{h,𝒳} + (1/2)𝔼[∫_0^t (δβ̇_r² e^{2β_r}/s_r) tr(σ_r^⊤ ∇²h*(Y_r) σ_r)dr],
where M_{h,𝒳} is the diameter of 𝒳.
• Step 3: choosing parameters. Plugging in β_t = 2 log t and s_t = t^{3/2+q}:
𝔼[f(X_t) − f(x*)] ≤ 𝔼[E_t] e^{−β_t} = O(1/t² + 1/t^{1/2−q}).
Bregman Lagrangian for Mirror Descent: Strongly Convex Functions
• Damped Lagrangian: ℒ(X, V, t) = e^{γ_t}((1/2)∥V∥² − f(X))
• Euler–Lagrange equation gives: Ẍ_t + γ̇_t Ẋ_t + ∇f(X_t) = 0
• Damped Bregman Lagrangian [Xu et al., 2018]: ℒ(X, V, t) = e^{α_t + β_t + γ_t}(μ D_h(X + e^{−α_t}V, X) − f(X))
Continuous-time Dynamics of MD: Strongly Convex Functions
By the Euler–Lagrange equation, choosing e^{α_t} = β̇_t and γ̇_t = −e^{α_t}:
(d/dt) ∇h(X_t + Ẋ_t/β̇_t) = −β̇_t(∇f(X_t)/μ + ∇h(X_t + Ẋ_t/β̇_t) − ∇h(X_t))  (a second-order ODE)
Define Y_t = ∇h(X_t + Ẋ_t/β̇_t) and add Brownian motion to the ODE:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −β̇_t((1/μ)∇f(X_t)dt + (Y_t − ∇h(X_t))dt + (δσ(X_t, t)/μ)dB_t)
This is the continuous-time dynamics of accelerated SMD for strongly convex functions [Xu et al., 2018].
Convergence Rate of Continuous-time Dynamics: Strongly Convex Functions
Stochastic differential equation (SDE) with drift term and diffusion term:
๏ t > 0 is the time index
๏ δ, β_t are scaling parameters
๏ B_t ∈ ℝ^d is the standard Brownian motion
๏ diffusion term: ∥σ(X_t, t)∥₂ ≤ σt^q
Convergence of the proposed SDE:
𝔼[f(X_t) − f(x*)] = O(1/t² + σ²/(μt^{1−2q}))
๏ optimal convergence rate of ASMD: O(1/k² + σ²/(μk))
๏ when q = 0, it matches the optimal rate of accelerated stochastic mirror descent for strongly convex functions [Ghadimi & Lan, 2012]
Outline
• Stochastic Mirror Descent
• Understanding Acceleration in Optimization
• Continuous-time Dynamics for Accelerated Stochastic Mirror Descent
• Discretization of SDEs and New ASMD Algorithms
• Experiments
Discretization of SDE
Continuous-time to discrete-time: the continuous index t maps to the discrete index k, so X_t, X_{t+δ} correspond to the sequence x_k, x_{k+1} and Y_t, Y_{t+δ} to y_k, y_{k+1}.
Forward (Explicit) Euler Discretization:
(x_{k+1} − x_k)/δ ≈ Ẋ_t
(y_{k+1} − y_k)/δ ≈ Ẏ_t
Backward (Implicit) Euler Discretization:
(x_k − x_{k−1})/δ ≈ Ẋ_t
(y_k − y_{k−1})/δ ≈ Ẏ_t
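On the scalar test equation ẋ = −λx (a standard illustration, not from the slides), the contrast between the two schemes is a one-liner each: the forward step is explicit, while the backward step solves an equation in the new iterate and remains stable for large λδ:

```python
def forward_euler(x, lam, dt):
    """Explicit: x_{k+1} = x_k + dt * f(x_k) with f(x) = -lam * x."""
    return x * (1 - lam * dt)

def backward_euler(x, lam, dt):
    """Implicit: x_{k+1} = x_k + dt * f(x_{k+1}); solvable in closed form here."""
    return x / (1 + lam * dt)
```

With λ = 10 and δ = 0.3 the explicit step multiplies by −2 and blows up, while the implicit step divides by 4 and decays, which is why implicit discretizations below admit stronger guarantees at the price of an implicit update.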
New Discrete-time Algorithm (Implicit)
SDEs for general convex functions:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −(β̇_t e^{β_t}/s_t)(∇f(X_t)dt + δσ(X_t, t)dB_t)
Implicit discretization:
∇h*(y_{k+1}) = x_{k+1} + (1/τ_k)(x_{k+1} − x_k)
y_{k+1} − y_k = −(τ_k/s_k)G(x_{k+1}; ξ_{k+1})
Convergence rate: 𝔼[f(x_k) − f(x*)] = O(1/k² + σ/√k)
๏ Optimal rate [Ghadimi & Lan, 2012]
๏ Implicit update
New Discrete-time Algorithm (ASMD)
SDEs for general convex functions:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −(β̇_t e^{β_t}/s_t)(∇f(X_t)dt + δσ(X_t, t)dB_t)
Hybrid discretization:
∇h*(y_k) = x_{k+1} + (1/τ_k)(x_{k+1} − x_k)
y_{k+1} − y_k = −(τ_k/s_k)G(x_{k+1}; ξ_{k+1})
Convergence rate: 𝔼[f(x_k) − f(x*)] = O(1/k² + (σ² + 1)/√k)
๏ Not the optimal rate [Ghadimi & Lan, 2012]
๏ Explicit (practical) algorithm
New Discrete-time Algorithm (ASMD3)
SDEs for general convex functions:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −(β̇_t e^{β_t}/s_t)(∇f(X_t)dt + δσ(X_t, t)dB_t)
Explicit discretization with an additional sequence z_k:
∇h*(y_k) = x_k + (1/τ_k)(z_{k+1} − x_k)
y_{k+1} − y_k = −(τ_k/s_k)G(z_{k+1}; ξ_{k+1})
x_{k+1} = argmin_{x∈𝒳} {⟨G(z_{k+1}; ξ_{k+1}), x⟩ + M_k D_h(z_{k+1}, x)}
Convergence rate: 𝔼[f(x_k) − f(x*)] = O(1/k² + σ²/√k)
๏ Optimal rate [Ghadimi & Lan, 2012]
๏ Explicit (practical) algorithm
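A sketch of this three-sequence scheme, specialized to h(x) = ½∥x∥₂² (so ∇h = ∇h* = identity and D_h is half the squared Euclidean distance) with unconstrained 𝒳; the schedules τ_k = 2/(k+1), τ_k/s_k = (k+1)/(2L), and M_k = L are illustrative choices, not the ones analyzed in the paper:

```python
import numpy as np

def asmd3(stoch_grad, x0, L=1.0, steps=500):
    """Three-sequence explicit scheme with h(x) = 1/2 ||x||^2, unconstrained.
    The first relation grad h*(y_k) = x_k + (z_{k+1} - x_k)/tau_k becomes
    z_{k+1} = (1 - tau_k) x_k + tau_k y_k; the argmin step becomes a plain
    gradient step z_{k+1} - g/M_k."""
    x = x0.copy()
    y = x0.copy()
    for k in range(1, steps + 1):
        tau = 2.0 / (k + 1)
        z = (1 - tau) * x + tau * y              # additional sequence z_{k+1}
        g = stoch_grad(z)                        # stochastic gradient at z_{k+1}
        y = y - (k + 1) / (2.0 * L) * g          # dual update, tau_k/s_k = (k+1)/(2L)
        x = z - g / L                            # argmin step with M_k = L
    return x
```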
New Discrete-time Algorithm (Implicit)
SDEs for strongly convex functions:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −β̇_t((1/μ)∇f(X_t)dt + (Y_t − ∇h(X_t))dt + (δσ(X_t, t)/μ)dB_t)
Implicit discretization:
∇h*(y_{k+1}) = x_{k+1} + (1/τ_k)(x_{k+1} − x_k)
y_{k+1} − y_k = −τ_k(G(x_{k+1}; ξ_{k+1})/μ + y_{k+1} − ∇h(x_{k+1}))
Convergence rate: 𝔼[f(x_k) − f(x*)] = O(1/k² + σ²/(μk))
๏ Optimal rate [Ghadimi & Lan, 2012]
๏ Implicit update
New Discrete-time Algorithm (ASMD)
SDEs for strongly convex functions:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −β̇_t((1/μ)∇f(X_t)dt + (Y_t − ∇h(X_t))dt + (δσ(X_t, t)/μ)dB_t)
Hybrid discretization:
∇h*(y_k) = x_{k+1} + (1/τ_k)(x_{k+1} − x_k)
y_{k+1} − y_k = −τ_k(G(x_{k+1}; ξ_{k+1})/μ + y_{k+1} − ∇h(x_{k+1}))
Convergence rate: 𝔼[f(x_k) − f(x*)] = O(1/k² + (σ² + 1)/(μk))
๏ Not the optimal rate [Ghadimi & Lan, 2012]
๏ Explicit (practical) algorithm
New Discrete-time Algorithm (ASMD3)
SDEs for strongly convex functions:
dX_t = β̇_t(∇h*(Y_t) − X_t)dt
dY_t = −β̇_t((1/μ)∇f(X_t)dt + (Y_t − ∇h(X_t))dt + (δσ(X_t, t)/μ)dB_t)
Explicit discretization with an additional sequence z_k:
∇h*(y_k) = x_k + (1/τ_k)(z_{k+1} − x_k)
y_{k+1} − y_k = −τ_k(G(z_{k+1}; ξ_{k+1})/μ + y_k − ∇h(z_{k+1}))
x_{k+1} = argmin_{x∈𝒳} {⟨G(z_{k+1}; ξ_{k+1}), x⟩ + M_k D_h(z_{k+1}, x)}
Convergence rate: 𝔼[f(x_k) − f(x*)] = O(1/k² + σ²/(μk))
๏ Optimal rate [Ghadimi & Lan, 2012]
๏ Explicit (practical) algorithm
Outline
• Stochastic Mirror Descent
• Understanding Acceleration in Optimization
• Continuous-time Dynamics for Accelerated Stochastic Mirror Descent
• Discretization of SDEs and New ASMD Algorithms
• Experiments
Experiment Results: General Convex Case
Optimization problem: min_{x∈𝒳} (1/(2n))∥Ax − y∥₂²
Baselines: SMD, SAGE [Hu et al., 2009], AC-SA [Ghadimi & Lan, 2012]
Setting 1:
๏ constraint set: 𝒳 = {x ∈ ℝ^d : ∥x∥₂ ≤ R}
๏ distance generating function: h(x) = (1/2)∥x∥₂²
Setting 2:
๏ constraint set: 𝒳 = {x ∈ ℝ^d : Σ_{i=1}^d x_i = 1, x_i ≥ 0}
๏ distance generating function: h(x) = Σ_{i=1}^d x_i log x_i
[Figure: objective-gap curves for SMD, SAGE, AC-SA, ASMD, and ASMD3 in both settings]
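The general convex experiment can be reproduced in outline as follows (a sketch with assumed problem sizes, noise level, and step sizes, running only plain SMD with the entropy mirror map on the simplex setting):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
A = rng.normal(size=(n, d))
x_true = rng.dirichlet(np.ones(d))            # ground truth on the simplex
y = A @ x_true + 0.1 * rng.normal(size=n)     # noisy measurements

def stoch_grad(x):
    """Single-row stochastic gradient of f(x) = 1/(2n) ||Ax - y||_2^2."""
    i = rng.integers(n)
    return (A[i] @ x - y[i]) * A[i]

# SMD with h(x) = sum_i x_i log x_i; iterates stay on the simplex
x = np.ones(d) / d
for k in range(1, 3001):
    w = np.log(x) - stoch_grad(x) / np.sqrt(k)   # dual step, eta_k = 1/sqrt(k)
    w = np.exp(w - w.max())                      # grad h* = softmax (stabilized)
    x = w / w.sum()

f = lambda v: np.mean((A @ v - y) ** 2) / 2      # objective value
```

The accelerated variants (ASMD, ASMD3) would replace the single-sequence loop with the two- and three-sequence updates from the preceding slides.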
Experiment Results: Strongly Convex Case
Optimization problem: min_{x∈𝒳} (1/(2n))∥Ax − y∥₂² + λ∥x∥₂²
Baselines: SMD, SAGE [Hu et al., 2009], AC-SA [Ghadimi & Lan, 2012]
Setting 1:
๏ constraint set: 𝒳 = {x ∈ ℝ^d : ∥x∥₂ ≤ R}
๏ distance generating function: h(x) = (1/2)∥x∥₂²
Setting 2:
๏ constraint set: 𝒳 = {x ∈ ℝ^d : Σ_{i=1}^d x_i = 1, x_i ≥ 0}
๏ distance generating function: h(x) = Σ_{i=1}^d x_i log x_i
[Figure: objective-gap curves for SMD, SAGE, AC-SA, ASMD, and ASMD3 in both settings]
Take Away
Continuous-time dynamics can help us
• better understand stochastic optimization
• derive new discrete-time algorithms based on various discretization schemes
• deliver a unified and simple proof of convergence rates
Thank You
Reference
• Allen-Zhu, Z., & Orecchia, L. (2017). Linear Coupling: An Ultimate Unification of Gradient and Mirror Descent. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
• Bubeck, S., Lee, Y. T., and Singh, M. A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.
• Diakonikolas, J. and Orecchia, L. Accelerated Extra-Gradient Descent: A Novel Accelerated First-Order Method. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), 2018 (Vol. 94).
• Ghadimi, S. and Lan, G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.
• Hu, B., & Lessard, L. (2017, August). Dissipativity theory for Nesterov's accelerated method. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 1549-1557). JMLR. org.
• Hu, C., Pan, W., and Kwok, J. T. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, pp. 781–789, 2009.
• Krichene, W., Bayen, A., and Bartlett, P. L. Accelerated mirror descent in continuous and discrete time. In Advances in neural information processing systems, pp. 2845–2853, 2015.
• Krichene, W. and Bartlett, P.L., 2017. Acceleration and averaging in stochastic descent dynamics. In Advances in Neural Information Processing Systems (pp. 6796-6806).
• Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.
• Lan, G. and Zhou, Y., 2018. An optimal randomized incremental gradient method. Mathematical programming, 171(1-2), pp.167-215.
• Lessard, L., Recht, B., and Packard, A. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
• Mertikopoulos, P. and Staudigl, M. On the convergence of gradient-like flows with noisy gradient input. arXiv preprint arXiv:1611.06730, 2016.
• Raginsky, M. and Bouvrie, J. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pp. 6793–6800. IEEE, 2012.
• Su, W., Boyd, S., and Candes, E. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pp. 2510–2518, 2014.
• Wibisono, A., Wilson, A. C., and Jordan, M. I. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, pp. 201614734, 2016.
• Wilson, A. C., Recht, B., and Jordan, M. I. A lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.
• Xu, P., Wang, T. and Gu, Q., 2018, March. Accelerated stochastic mirror descent: From continuous-time dynamics to discrete-time algorithms. In International Conference on Artificial Intelligence and Statistics (pp. 1087-1096).
• Xu, P., Wang, T. and Gu, Q., 2018, July. Continuous and discrete-time accelerated stochastic mirror descent for strongly convex functions. In International Conference on Machine Learning (pp. 5488-5497).