
Continuous-time Models for Stochastic Optimization Algorithms

Antonio Orvieto, Aurelien Lucchi

Slides: people.inf.ethz.ch/orvietoa/SDE_models_slides.pdf (2019-10-28)


Unconstrained non-convex optimization

For some regular f : ℝᵈ → ℝ, find x* := argmin_{x ∈ ℝᵈ} f(x).

[Figure: training-loss landscape of ResNet-110 without skip connections on CIFAR-10; for more details, check [Li et al., 2018].]

Usual assumption: f is L-smooth, i.e. ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.

A recent trend is to model the dynamics of iterative gradient-based optimization algorithms with differential equations.


Tutorial: how is an ODE model constructed?

xₖ₊₁ = xₖ − h∇f(xₖ)      (GD)

Define the curve y(t) as a smooth interpolation of the iterates: y(kh) = xₖ.

[Figure: GD iterates x₀, x₁, …, x₅ and the interpolating curve y(0), y(h), …, y(5h).]

What is the law for y(t)?
1) By construction: y(t + h) = y(t) − h∇f(y(t));
2) Thanks to smoothness: y(t + h) = y(t) + h ẏ(t) + O(h²).

In the limit h → 0: ẏ(t) = −∇f(y(t)).

*The solution y ∈ C¹(ℝ₊, ℝᵈ) exists and is unique since ∇f is globally Lipschitz.
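As a quick sanity check, here is a minimal numerical sketch (mine, not from the slides) comparing the GD iterates with the limiting ODE on a 1-d quadratic, where the flow ẏ = −ay has the closed-form solution y(t) = x₀e^{−at}; the curvature a, step size h, and horizon are arbitrary choices.

```python
# Minimal sketch (not from the slides): GD iterates vs. the GD-ODE on f(x) = 0.5*a*x**2,
# where the flow y'(t) = -a*y(t) has the closed-form solution y(t) = x0*exp(-a*t).
# The constants a, h, x0 and the number of steps are arbitrary choices.
import numpy as np

a, h, x0, n_steps = 2.0, 0.01, 1.0, 100
grad_f = lambda x: a * x

x = x0
for k in range(n_steps):
    x = x - h * grad_f(x)                      # GD: x_{k+1} = x_k - h*grad_f(x_k)

t = n_steps * h                                # the iterate x_k tracks y(kh)
print(f"x_k = {x:.6f}   vs.   y(kh) = {x0 * np.exp(-a * t):.6f}")  # gap shrinks as h -> 0
```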


Some important ODE/SDE models

Algorithm — Model — Perturbed model (stochastic grads)

(GD)  xₖ₊₁ = xₖ − h∇f(xₖ)
  Model:  Ẋ = −∇f(X)
  Perturbed model:  dY = −∇f(Y) dt + σ dB   [Mertikopoulos and Staudigl, 2016]

(HB)  xₖ₊₁ = xₖ + β(xₖ − xₖ₋₁) − h∇f(xₖ)   [Polyak, 1964]
  Model [Polyak, 1964]:  ÿ = −αẏ − ∇f(y),  or  { v̇ = −αv − ∇f(y),  ẏ = v }
  Perturbed model [Orvieto et al., 2019]:  { dV = −αV dt − ∇f(Y) dt + σ dB,  dY = V dt }

(NAG)  xₖ₊₁ = uₖ − h∇f(uₖ),  uₖ₊₁ = xₖ₊₁ + (k/(k+3)) (xₖ₊₁ − xₖ)   [Nesterov, 1983]
  Model [Su et al., 2016]:  ÿ = −(3/t)ẏ − ∇f(y),  or  { v̇ = −(3/t)v − ∇f(y),  ẏ = v }
  Perturbed model [Krichene and Bartlett, 2017]:  { dV = −(3/t)V dt − ∇f(Y) dt + σ dB,  dY = V dt }

and many more: primal-dual algorithms, adaptive methods, etc.
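As a concrete illustration of the middle column, the sketch below (my own, not from the slides) integrates the NAG-ODE of [Su et al., 2016] on a 1-d quadratic and compares it with Nesterov's iterates under the identification xₖ ≈ y(k√h); the quadratic, the step size h, and the iteration count are assumptions.

```python
# Minimal sketch (my illustration): NAG iterates vs. the NAG-ODE  y'' = -(3/t)y' - grad_f(y)
# on f(x) = 0.5*x**2, with the identification x_k ~ y(k*sqrt(h)) from [Su et al., 2016].
# The function f, the step size h, and the iteration count are assumptions.
import numpy as np

grad_f = lambda x: x
h, n_iters, x0 = 1e-3, 300, 1.0

# Nesterov's iterates, as written on the previous slide
x, u = x0, x0
for k in range(n_iters):
    x_next = u - h * grad_f(u)
    u = x_next + k / (k + 3) * (x_next - x)
    x = x_next

# Forward-Euler integration of the NAG-ODE up to T = n_iters*sqrt(h), started at t0 = sqrt(h)
dt = np.sqrt(h) / 50.0                          # fine step, so the integrator error is small
t, y, v = np.sqrt(h), x0, 0.0
while t < n_iters * np.sqrt(h):
    v += dt * (-(3.0 / t) * v - grad_f(y))
    y += dt * v
    t += dt

# The two trajectories should roughly agree, with a gap that shrinks as h -> 0.
print(f"NAG iterate x_k = {x:+.4f}   NAG-ODE y(T) = {y:+.4f}")
```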


Why should we care about SDE models?
Do we really need to introduce these objects? What is the gain?

• GD-ODE/SDE is the basis for many seminal contributions to the theory of SGD:
  1. asymptotic behavior [Kushner and Yin, 2003];
  2. connection to Bayesian inference [Mandt et al., 2017];
  3. generalization and the width of minima [Jastrzebski et al., 2017].

• The NAG-ODE recently provided us with some novel insights into the acceleration phenomenon¹:
  1. [Su et al., 2016] studied acceleration with Bessel functions;
  2. [Wibisono et al., 2016] connected NAG to meta-learning and physics via the minimum action principle;
  3. [Krichene and Bartlett, 2017] studied the non-trivial interplay between noise and acceleration in NAG using stochastic analysis on the NAG-SDE;
  4. [Orvieto et al., 2019] showed NAG is equivalent to a linear gradient-averaging system after the time stretch τ = t²/8.

¹ For convex functions, there is a method (NAG) that is strictly faster than GD.


In this paper, inspired by this success…

• We build SDE models for SVRG and mini-batch SGD, which include the effect of decaying learning rates and increasing batch sizes.

• We derive convergence rates for our models. We focus on non-convex functions relevant for machine learning.

• We derive equivalent novel results for the algorithmic counterparts, using the same Lyapunov functions. This proves the effectiveness of our SDE models.

• We provide a new interpretation for the distribution induced by SGD with decreasing stepsizes, which reveals an underlying time warping that can be used for designing Lyapunov functions.

• We provide a dual interpretation of this last phenomenon as landscape stretching.


SDEs description

Below are the two SDEs corresponding to mini-batch SGD (MB-PGF) and SVRG (VR-PGF).

dX(t) = −ψ(t)∇f(X(t)) dt + ψ(t)√(h/b(t)) σ_MB(X(t)) dB(t)      (MB-PGF)

dX(t) = −ψ(t)∇f(X(t)) dt + ψ(t)√(h/b(t)) σ_VR(X(t), X(t − ξ(t))) dB(t)      (VR-PGF)

where

• ξ : ℝ₊ → [0, T] is the staleness function (linked to the pivot update frequency m in SVRG);

• ψ(·) ∈ C¹(ℝ₊, [0, 1]) is the adjustment function (encodes the relative decrease in the learning rate);

• b(·) ∈ C¹(ℝ₊, ℝ₊) is the mini-batch size function;

• {B(t)}_{t≥0} is a d-dimensional Brownian motion on some filtered probability space.
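To make the notation concrete, here is a minimal Euler–Maruyama sketch of MB-PGF (my own illustration, not from the paper); the quadratic f, the constant diffusion σ_MB(x) = σI_d, and the choices ψ(t) = 1/(t+1), b(t) = 1 are assumptions made for the example.

```python
# Minimal Euler-Maruyama sketch of MB-PGF (illustration only; f, sigma_MB, psi, b are assumed).
import numpy as np

rng = np.random.default_rng(0)
d, h, sigma = 2, 0.1, 1.0                      # dimension, reference step size, noise level
T, dt = 10.0, 1e-3                             # horizon and integrator step

grad_f = lambda x: x                           # f(x) = 0.5 * ||x||^2
psi = lambda t: 1.0 / (t + 1.0)                # adjustment function (decaying learning rate)
b = lambda t: 1.0                              # mini-batch size function
sigma_MB = lambda x: sigma * np.eye(d)         # constant diffusion matrix (assumption)

X, t = np.ones(d), 0.0
while t < T:
    dB = np.sqrt(dt) * rng.standard_normal(d)  # Brownian increment
    X = X - psi(t) * grad_f(X) * dt + psi(t) * np.sqrt(h / b(t)) * sigma_MB(X) @ dB
    t += dt
print("X(T) =", X)
```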


Conditions → Rate (continuous time)

• (∼), (H-), (Hσ):
  (f(x₀) − f(x*))/ϕ(t) + (h d L σ∗²)/(2ϕ(t)) · ∫₀ᵗ ψ(s)²/b(s) ds

• (∼), (H-), (Hσ), (HWQC):
  ‖x₀ − x*‖²/(2τϕ(t)) + (h d σ∗²)/(2τϕ(t)) · ∫₀ᵗ ψ(s)²/b(s) ds

• (H-), (Hσ), (HWQC):
  ‖x₀ − x*‖²/(2τϕ(t)) + (h d σ∗²)/(2τϕ(t)) · ∫₀ᵗ (Lτϕ(s) + 1) ψ(s)²/b(s) ds

• (H-), (Hσ), (HPŁ):
  e^(−2μϕ(t)) (f(x₀) − f(x*)) + (h d L σ∗²/2) · ∫₀ᵗ ψ(s)²/b(s) · e^(−2μ(ϕ(t)−ϕ(s))) ds

• (H-), (HRSI):
  ((1 + 2hL²T)/(T(μ − 2hL²)))^j · ‖x₀ − x*‖²   (under variance reduction)

Conditions → Rate (discrete time, no Gaussian assumption)

• (∼), (H-), (Hσ):
  2(f(x₀) − f(x*))/(hϕₖ₊₁) + (h d L σ∗²)/(hϕₖ₊₁) · Σᵢ₌₀ᵏ (ψᵢ²/bᵢ) h

• (∼), (H-), (Hσ), (HWQC):
  ‖x₀ − x*‖²/(τ(hϕₖ₊₁)) + (d h σ∗²)/(τ(hϕₖ₊₁)) · Σᵢ₌₀ᵏ (ψᵢ²/bᵢ) h

• (H-), (Hσ), (HWQC):
  ‖x₀ − x*‖²/(2τ(hϕₖ₊₁)) + (h d σ∗²)/(2τ(hϕₖ₊₁)) · Σᵢ₌₀ᵏ (1 + τϕᵢ₊₁L)(ψᵢ²/bᵢ) h

• (H-), (Hσ), (HPŁ):
  Πᵢ₌₀ᵏ (1 − μhψᵢ) · (f(x₀) − f(x*)) + (h d L σ∗²/2) · Σᵢ₌₀ᵏ [Πₗ₌₀ᵏ (1 − μhψₗ) / Πⱼ₌₀ⁱ (1 − μhψⱼ)] (ψᵢ²/bᵢ) h

• (H-), (HRSI):
  ((1 + 2L²h²m)/(hm(μ − 2L²h)))^j · ‖x₀ − x*‖²   (under variance reduction)

We derive matching convergence rates in continuous and discrete time, using the same Lyapunov functions. This proves the effectiveness of our SDE models.


Insight 1: time stretching

Using the SDE models, we can transform an algorithm into an equivalent one that is easier to study.

Theorem. Let {X(t)}_{t≥0} satisfy PGF and define τ(·) = ϕ⁻¹(·), where ϕ(t) = ∫₀ᵗ ψ(s) ds. For all t ≥ 0, X(τ(t)) = Y(t) in distribution, where {Y(t)}_{t≥0} has the stochastic differential

dY(t) = −∇f(Y(t)) dt + √(h ψ(τ(t))/b(τ(t))) σ(τ(t)) dB(t).

Example. Take b(t) = 1, σ(s) = σI_d and ψ(t) = 1/(t + 1); then ϕ(t) = log(t + 1) and τ(t) = eᵗ − 1. The SDE

dX(t) = −(1/(t + 1)) ∇f(X(t)) dt + (√h σ/(t + 1)) dB(t)

is such that the sped-up solution Y(t) = X(eᵗ − 1) satisfies

dY(t) = −∇f(Y(t)) dt + σ√(h e⁻ᵗ) dB(t).

[Figure: verification of the theorem on a 1-d quadratic (100 samples); empirically X(t) = Y(ϕ(t)) in distribution.]
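The figure's check can be reproduced with a short Monte-Carlo simulation. The sketch below (my own, not from the slides) applies Euler–Maruyama to both SDEs on f(x) = ½x² and compares the mean and standard deviation of X(T) with those of Y(ϕ(T)); the values of σ, h, the horizon, and the number of sample paths are arbitrary choices.

```python
# Minimal Monte-Carlo sketch of the time-stretching theorem on f(x) = 0.5*x**2
# (my illustration; sigma, h, horizon and sample count are assumptions).
# We check that X(T) and Y(phi(T)) have roughly the same mean and std, where
#   dX = -(1/(t+1)) X dt + (sqrt(h)*sigma/(t+1)) dB,
#   dY = -Y dt + sigma*sqrt(h*exp(-t)) dB,    phi(t) = log(t+1).
import numpy as np

rng = np.random.default_rng(0)
n_paths, h, sigma, x0 = 2000, 0.1, 1.0, 1.0
T, dt = 5.0, 1e-3

# X on [0, T]
X, t = np.full(n_paths, x0), 0.0
while t < T:
    dB = np.sqrt(dt) * rng.standard_normal(n_paths)
    X += -(1.0 / (t + 1.0)) * X * dt + (np.sqrt(h) * sigma / (t + 1.0)) * dB
    t += dt

# Y on [0, phi(T)] = [0, log(T+1)]
Y, s, S = np.full(n_paths, x0), 0.0, np.log(T + 1.0)
while s < S:
    dB = np.sqrt(dt) * rng.standard_normal(n_paths)
    Y += -Y * dt + sigma * np.sqrt(h * np.exp(-s)) * dB
    s += dt

print(f"X(T):      mean {X.mean():+.3f}, std {X.std():.3f}")
print(f"Y(phi(T)): mean {Y.mean():+.3f}, std {Y.std():.3f}")
```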


Insight 2: landscape stretching

For the sake of simplicity, let f(x) = ½‖x‖². PGF with b(t) = 1, σ(s) = σI_d and ψ(t) = 1/(t + 1) is

dX(t) = −(1/(t + 1)) X(t) dt + (√h σ/(t + 1)) dB(t).

Using solution feedback (only possible with a continuous-time formulation: plugging the noiseless solution X(t) = x₀/(t + 1) back into the drift, so that 1/(t + 1) ≈ X(t)/x₀), we find that in expectation

E[dX] = C X² dt,   i.e.   dE[X]/dt = C ∇(X³/3).

Hence, PGF on the quadratic ½‖x‖² with a learning rate decreasing as 1/t behaves in expectation like PGF with a constant learning rate on a cubic.

That is, we lose strong convexity, hence we converge more slowly!
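In the noiseless case this correspondence can be checked directly. The sketch below (my own illustration) integrates the decaying-rate flow ẋ = −x/(t + 1) on the quadratic alongside the constant-rate gradient flow on the cubic g(z) = z³/(3x₀); the scaling 1/x₀ is an assumption chosen so that both flows have the closed-form solution x₀/(t + 1).

```python
# Minimal sketch of the "landscape stretching" picture (my illustration, noiseless case):
# the flow  x' = -x/(t+1)  on the quadratic 0.5*x**2 coincides with the constant-rate
# gradient flow  z' = -grad g(z)  on the (assumed) cubic g(z) = z**3 / (3*x0).
x0, dt, T = 1.0, 1e-4, 20.0
x = z = x0
t = 0.0
while t < T:
    x += dt * (-x / (t + 1.0))        # decaying learning rate on the quadratic
    z += dt * (-z**2 / x0)            # constant learning rate on the cubic
    t += dt

# Both should be close to the closed-form solution x0 / (T + 1).
print(f"x(T) = {x:.5f}, z(T) = {z:.5f}, x0/(T+1) = {x0 / (T + 1.0):.5f}")
```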


References


Anosov, D. V. (1967). Geodesic flows on closed Riemannian manifolds of negative curvature. Trudy Matematicheskogo Instituta Imeni V. A. Steklova, 90:3–210.

Hairer, E., Lubich, C., and Wanner, G. (2006). Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations, volume 31. Springer Science & Business Media.

Jastrzebski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. (2017). Three factors influencing minima in SGD. ArXiv e-prints.

Krichene, W. and Bartlett, P. L. (2017). Acceleration and averaging in stochastic mirror descent dynamics. ArXiv e-prints.

Kushner, H. and Yin, G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability. Springer New York.

Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. (2018). Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389–6399.

Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907.

Mertikopoulos, P. and Staudigl, M. (2016). On the convergence of gradient-like flows with noisy gradient input. ArXiv e-prints.

Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547.

Orvieto, A., Kohler, J., and Lucchi, A. (2019). The role of memory in stochastic optimization. arXiv preprint arXiv:1907.01678.

Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17.

Su, W., Boyd, S., and Candes, E. J. (2016). A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Journal of Machine Learning Research, 17(153):1–43.

Wibisono, A., Wilson, A. C., and Jordan, M. I. (2016). A variational perspective on accelerated methods in optimization. ArXiv e-prints.