
Page 1

On the fast convergence of random perturbations of the gradient flow.

Wenqing Hu 1

(Joint work with Chris Junchi Li 2.)

1. Department of Mathematics and Statistics, Missouri S&T.
2. Department of Operations Research and Financial Engineering, Princeton University.

Page 2

Stochastic Gradient Descent Algorithm.

▶ We aim to find a local minimum point x* of the expectation function F(x) ≡ E[F(x; ζ)] :

  x* = argmin_{x ∈ U ⊂ R^d} E[F(x; ζ)] .

▶ The index random variable ζ follows some prescribed distribution D.

▶ We consider a general nonconvex stochastic loss function F(x; ζ) that is twice differentiable with respect to x. Together with some additional regularity assumptions 3, this guarantees that ∇E[F(x; ζ)] = E[∇F(x; ζ)].

3. Such as control of the growth of the gradient in expectation.

Page 3

Stochastic Gradient Descent Algorithm.

▶ Gradient Descent (GD) has the iteration :

  x^(t) = x^(t−1) − η E[∇F(x^(t−1); ζ)] .

▶ Stochastic Gradient Descent (SGD) has the iteration :

  x^(t) = x^(t−1) − η ∇F(x^(t−1); ζ_t) ,

where {ζ_t} are i.i.d. random variables with the same distribution as ζ ∼ D. (A minimal numerical sketch of both iterations follows.)
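As a concrete illustration (my own sketch, not from the slides), the two iterations can be compared on a toy quadratic loss F(x; ζ) = ½‖x − ζ‖² with ζ ∼ N(µ, I), for which the expected gradient x − µ is available in closed form; the choice of objective, µ, η, and step count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy loss F(x; zeta) = 0.5 * ||x - zeta||^2 with zeta ~ N(mu, I), so that
# F(x) = E[F(x; zeta)] is minimized at x* = mu and E[grad F(x; zeta)] = x - mu.
mu = np.array([1.0, -2.0])  # illustrative minimizer
eta = 0.05                  # learning rate

x_gd = np.zeros(2)   # GD: uses the exact expected gradient
x_sgd = np.zeros(2)  # SGD: uses one random sample zeta_t per step
for t in range(500):
    x_gd = x_gd - eta * (x_gd - mu)
    zeta_t = rng.normal(mu, 1.0)
    x_sgd = x_sgd - eta * (x_sgd - zeta_t)

print(x_gd)   # converges to mu
print(x_sgd)  # fluctuates around mu at a noise scale governed by eta
```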

Page 4

Machine Learning Background : DNN.

▶ Deep Neural Network (DNN). The goal is to solve the following stochastic optimization problem

  min_{x ∈ R^d} f(x) ≡ (1/M) ∑_{i=1}^{M} f_i(x) ,

where each component f_i corresponds to the loss function for data point i ∈ {1, ..., M}, and x is the vector of weights being optimized.

Page 5

Machine Learning Background : DNN.

▶ Let B be a minibatch of prescribed size sampled uniformly from {1, ..., M} ; then the objective function can further be written as the expectation of a stochastic function

  (1/M) ∑_{i=1}^{M} f_i(x) = E_B [ (1/|B|) ∑_{i ∈ B} f_i(x) ] .

▶ SGD updates as

  x^(t) = x^(t−1) − η (1/|B_t|) ∑_{i ∈ B_t} ∇f_i(x^(t−1)) ,

which is the classical mini-batch version of SGD. (A sketch of this update follows.)
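A minimal sketch of the mini-batch update on a toy finite sum (my illustration; the data A, batch size, and step count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy finite sum: f_i(x) = 0.5 * ||x - a_i||^2, so (1/M) sum_i f_i(x)
# is minimized at the empirical mean of the data points a_i.
M, d = 1000, 2
A = rng.normal(size=(M, d))  # illustrative data points a_i (one per row)

eta, batch_size = 0.1, 32
x = np.zeros(d)
for t in range(300):
    B_t = rng.choice(M, size=batch_size, replace=False)  # uniform minibatch
    grad = np.mean(x - A[B_t], axis=0)  # (1/|B_t|) * sum of per-sample grads
    x = x - eta * grad

print(x, A.mean(axis=0))  # the iterate approaches the empirical mean
```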

Page 6

Machine Learning Background : Online learning.

▶ Online learning via SGD : a standard sequential prediction problem where, for i = 1, 2, ... :
  1. An unlabeled example a_i arrives ;
  2. We make a prediction b_i based on the current weights x_i = [x_i^1, ..., x_i^d] ∈ R^d ;
  3. We observe b_i, let ζ_i = (a_i, b_i), and incur some known loss L(x_i, ζ_i) which is convex in x_i ;
  4. We update the weights according to some rule x_{i+1} ← f(x_i).

▶ SGD rule :

  f(x_i) = x_i − η ∇_1 L(x_i, ζ_i) ,

where ∇_1 L(u, v) is a subgradient of L(u, v) with respect to the first variable u, and the parameter η > 0 is often referred to as the learning rate. (A sketch of this loop follows.)
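A small sketch of the online protocol on a least-squares stream (my illustration; the linear model, noise level, and target weights are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

d, eta = 3, 0.01
x = np.zeros(d)                      # current weights x_i
x_true = np.array([0.5, -1.0, 2.0])  # illustrative unknown target

for i in range(5000):
    a_i = rng.normal(size=d)                 # 1. unlabeled example arrives
    b_pred = a_i @ x                         # 2. prediction from the weights
    b_i = a_i @ x_true + 0.1 * rng.normal()  # 3. observe; zeta_i = (a_i, b_i)
    grad = (b_pred - b_i) * a_i              #    gradient of 0.5*(pred-b)^2 in x
    x = x - eta * grad                       # 4. SGD update rule f(x_i)

print(x)  # drifts toward x_true as examples stream in
```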

Page 7

Closer look at SGD.

▶ SGD :

  x^(t) = x^(t−1) − η ∇F(x^(t−1); ζ_t) .

▶ Set F(x) = E_{ζ∼D} F(x; ζ).

▶ Let

  e_t = ∇F(x^(t−1); ζ_t) − ∇F(x^(t−1)) ,

and we can rewrite the SGD as

  x^(t) = x^(t−1) − η (∇F(x^(t−1)) + e_t) .

Page 8

Local statistical characteristics of the SGD path.

▶ In difference form it reads

  x^(t) − x^(t−1) = −η ∇F(x^(t−1)) − η e_t .

▶ We see that

  E(x^(t) − x^(t−1) | x^(t−1)) = −η ∇F(x^(t−1))

and

  [Cov(x^(t) − x^(t−1) | x^(t−1))]^(1/2) = η [Cov(∇F(x^(t−1); ζ))]^(1/2) .

(A Monte Carlo check of these one-step statistics follows.)
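These one-step statistics can be verified by Monte Carlo on the toy quadratic loss from before (my sketch; x, µ, η, and the sample size are illustrative): for F(x; ζ) = ½‖x − ζ‖² with ζ ∼ N(µ, I), one has ∇F(x) = x − µ and Cov(∇F(x; ζ)) = I.

```python
import numpy as np

rng = np.random.default_rng(3)

mu, eta = np.array([1.0, -2.0]), 0.05
x = np.array([3.0, 0.5])  # the current iterate x^(t-1)

# Many independent realizations of the one-step increment x^(t) - x^(t-1).
zetas = rng.normal(mu, 1.0, size=(200_000, 2))
increments = -eta * (x - zetas)

print(increments.mean(axis=0), -eta * (x - mu))  # matches -eta * grad F(x)
print(np.cov(increments.T))                      # approx eta^2 * I
```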

Page 9

SGD approximating diffusion process.

▶ Very roughly speaking, we can approximate x^(t) by a diffusion process X_t driven by the stochastic differential equation

  dX_t = −η ∇F(X_t) dt + η σ(X_t) dW_t , X_0 = x^(0) ,

where σ(x) = [Cov(∇F(x; ζ))]^(1/2) and W_t is a standard Brownian motion in R^d.

▶ Slogan : continuous Markov processes are characterized by their local statistical characteristics through the first and second moments only (conditional mean and (co)variance). (A simulation sketch follows.)
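An Euler–Maruyama sketch of this SDE for the same toy loss (my illustration, not from the slides; there σ(x) = I, and the step size dt = 1 is chosen to mirror one SGD step):

```python
import numpy as np

rng = np.random.default_rng(4)

# Euler-Maruyama discretization of
#   dX_t = -eta * grad F(X_t) dt + eta * sigma(X_t) dW_t
# for F(x) = 0.5 * ||x - mu||^2, where grad F(x) = x - mu and sigma = I.
mu, eta, dt = np.array([1.0, -2.0]), 0.05, 1.0
X = np.array([3.0, 0.5])

for t in range(500):
    dW = np.sqrt(dt) * rng.normal(size=2)   # Brownian increment
    X = X - eta * (X - mu) * dt + eta * dW  # drift step + diffusion step

print(X)  # hovers near mu, mimicking the SGD iterates above
```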

Page 10

Diffusion Approximation of SGD : Justification.

▶ Such an approximation has been justified in the weak sense in much of the classical literature 4 5 6.

▶ It can also be thought of as a normal deviation result.

▶ One can call the continuous process X_t the “continuous SGD” 7.

▶ I will come back to this topic at the end of the talk.

4. Benveniste, A., Métivier, M., Priouret, P., Adaptive Algorithms and Stochastic Approximations, Applications of Mathematics, 22, Springer, 1990.

5. Borkar, V.S., Stochastic Approximation : A Dynamical Systems Viewpoint, Cambridge University Press, 2008.

6. Kushner, H.J., Yin, G., Stochastic Approximation and Recursive Algorithms and Applications, Stochastic Modeling and Applied Probability, 35, Second Edition, Springer, 2003.

7. Sirignano, J., Spiliopoulos, K., Stochastic Gradient Descent in Continuous Time. SIAM Journal on Financial Mathematics, to appear.

Page 11

SGD approximating diffusion process : convergence time.

▶ Recall that we have set F(x) = E F(x; ζ).

▶ Thus we can formulate the original optimization problem as

  x* = argmin_{x ∈ U} F(x) .

▶ Instead of SGD, in this work let us consider its approximating diffusion process

  dX_t = −η ∇F(X_t) dt + η σ(X_t) dW_t , X_0 = x^(0) .

▶ The dynamics of X_t is used as an alternative optimization procedure to find x*.

▶ Remark : if there is no noise, then

  dX_t = −η ∇F(X_t) dt , X_0 = x^(0)

is just the gradient flow S^t x^(0). Thus it approaches x* in the strictly convex case.

Page 12

SGD approximating diffusion process : convergence time.

▶ Hitting time

  τ^η = inf{t ≥ 0 : F(X_t) ≤ F(x*) + e}

for some small e > 0.

▶ Asymptotic of E τ^η as η → 0 ?

Page 13

SGD approximating diffusion process : convergence time.

▶ Approximating diffusion process

  dX_t = −η ∇F(X_t) dt + η σ(X_t) dW_t , X_0 = x^(0) .

▶ Let Y_t = X_{t/η} (a rescaling of time) ; then

  dY_t = −∇F(Y_t) dt + √η σ(Y_t) dW_t , Y_0 = x^(0) .

(Random perturbations of the gradient flow !)

▶ Hitting time

  T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x*) + e}

for some small e > 0.

▶ Consequently,

  τ^η = η⁻¹ T^η .

(A short verification of the time change follows.)
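A short verification of the time change (my reconstruction of the standard computation, not spelled out on the slides):

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
Substituting $s = t/\eta$ into
$dX_s = -\eta\,\nabla F(X_s)\,ds + \eta\,\sigma(X_s)\,dW_s$
gives, for $Y_t = X_{t/\eta}$,
\[
  dY_t = -\nabla F(Y_t)\,dt + \eta\,\sigma(Y_t)\,dW_{t/\eta},
\]
and since $(W_{t/\eta})_{t \ge 0}$ has the law of
$(\eta^{-1/2}\,\widetilde W_t)_{t \ge 0}$ for a standard Brownian motion
$\widetilde W$, the noise term is $\sqrt{\eta}\,\sigma(Y_t)\,d\widetilde W_t$
in distribution. Moreover $F(Y_t) = F(X_{t/\eta})$ first drops below
$F(x^*) + e$ at $t = \eta\,\tau^{\eta}$, hence
$T^{\eta} = \eta\,\tau^{\eta}$, i.e.\ $\tau^{\eta} = \eta^{-1} T^{\eta}$.
\end{document}
```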

Page 14

Random perturbations of the gradient flow : convergence time.

▶ Let Y_t = X_{t/η} ; then

  dY_t = −∇F(Y_t) dt + √η σ(Y_t) dW_t , Y_0 = x^(0) .

▶ Hitting time

  T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x*) + e}

for some small e > 0.

▶ Asymptotic of E T^η as η → 0 ?

Page 15

Where is the difficulty ?

Figure 1: Various critical points and the landscape of F .

Page 16

Where is the difficulty ?

Figure 2: Higher order critical points.

Page 17

Strict saddle property.

▶ Definition (strict saddle property) 8 9. Given fixed γ1 > 0 and γ2 > 0, we say a Morse function F defined on R^n satisfies the “strict saddle property” if each point x ∈ R^n belongs to one of the following : (i) |∇F(x)| ≥ γ2 > 0 ; (ii) |∇F(x)| < γ2 and λ_min(∇²F(x)) ≤ −γ1 < 0 ; (iii) |∇F(x)| < γ2 and λ_min(∇²F(x)) ≥ γ1 > 0.

8. Ge, R., Huang, F., Jin, C., Yuan, Y., Escaping from saddle points — online stochastic gradient descent for tensor decomposition, arXiv:1503.02101v1 [cs.LG].

9. Sun, J., Qu, Q., Wright, J., When are nonconvex problems not scary ? arXiv:1510.06096 [math.DC].

Page 18

Strict saddle property.

Figure 3: Types of local landscape geometry with only strict saddles.

Page 19

Strong saddle property.

▶ Definition (strong saddle property)

Let the Morse function F(•) satisfy the strict saddle property with parameters γ1 > 0 and γ2 > 0. We say the Morse function F(•) satisfies the “strong saddle property” if there exists γ3 > 0 such that, at every x ∈ R^n with ∇F(x) = 0 that satisfies (ii) in Definition 1, all eigenvalues λ_i, i = 1, 2, ..., n of the Hessian ∇²F(x) are bounded away from zero by γ3 in absolute value, i.e., |λ_i| ≥ γ3 > 0 for any 1 ≤ i ≤ n.

Page 20

Escape from saddle points.

Figure 4: Escape from a strong saddle point.

Page 21

Escape from saddle points : behavior of the process near one specific saddle.

▶ The problem was first studied by Kifer in 1981 10.

▶ Recall that

  dY_t = −∇F(Y_t) dt + √η σ(Y_t) dW_t , Y_0 = x^(0) ,

is a random perturbation of the gradient flow. Kifer needs to assume further that the diffusion matrix σ(•)σ^T(•) = Cov(∇F(•; ζ)) is strictly uniformly positive definite.

▶ Roughly speaking, Kifer's result states that the exit from a neighborhood of a strong saddle point happens along the “most unstable” direction of the Hessian matrix, and that the exit time is asymptotically ∼ C log(η⁻¹). (A toy simulation follows the footnote.)

10. Kifer, Y., The exit problem for small random perturbations of dynamical systems with a hyperbolic fixed point. Israel Journal of Mathematics, 40(1), 1981, pp. 74–96.
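To see the logarithmic scaling numerically (my own toy experiment, not from the slides), one can perturb the gradient flow of the model saddle F(x, y) = (y² − x²)/2, whose Hessian at the origin has eigenvalues −1 and +1 (so λ1 = 1, and the noise is taken to be the identity, consistent with Kifer's positive-definiteness assumption), and time the exit from the unit ball; the predicted constant is 1/(2λ1) = 1/2:

```python
import numpy as np

rng = np.random.default_rng(5)

# Model saddle F(x, y) = (y^2 - x^2)/2: grad F = (-x, y), unstable direction x.
# Simulate dY = -grad F(Y) dt + sqrt(eta) dW starting at the saddle and
# measure the mean exit time from the unit ball as eta shrinks.
def mean_exit_time(eta, dt=1e-3, reps=200):
    times = []
    for _ in range(reps):
        y, t = np.zeros(2), 0.0
        while np.linalg.norm(y) < 1.0:
            drift = np.array([y[0], -y[1]])  # -grad F(y)
            y = y + drift * dt + np.sqrt(eta * dt) * rng.normal(size=2)
            t += dt
        times.append(t)
    return np.mean(times)

for eta in [1e-1, 1e-2, 1e-3]:
    # compare against (1/(2*lambda_1)) * ln(1/eta) with lambda_1 = 1
    print(eta, mean_exit_time(eta), 0.5 * np.log(1.0 / eta))
```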

Page 22

Escape from saddle points : behavior of the process near one specific saddle.

Figure 5: Escape from a strong saddle point : Kifer’s result.

Page 23

Escape from saddle points : behavior of the process near one specific saddle.

▶ G ∪ ∂G = {0} ∪ A1 ∪ A2 ∪ A3, where 0 denotes the saddle point.

▶ If x ∈ A2 ∪ A3, then for the deterministic gradient flow there is a finite

  t(x) = inf{t > 0 : S^t x ∈ ∂G} .

▶ In this case, as η → 0, the expected first exit time converges to t(x), and the first exit position converges to S^{t(x)} x.

▶ Recall that the gradient flow S^t x^(0) is

  dX_t = −η ∇F(X_t) dt , X_0 = x^(0) .

Page 24

Escape from saddle points : behavior of the process near one specific saddle.

Figure 6: Escape from a strong saddle point : Kifer’s result.

Page 25

Escape from saddle points : behavior of the process near one specific saddle.

▶ If x ∈ ({0} ∪ A1) \ ∂G, the situation is more interesting.

▶ The first exit occurs along the direction pointed out by the most negative eigenvalue(s) of the Hessian ∇²F(0).

▶ ∇²F(0) has spectrum −λ1 = −λ2 = ... = −λq < −λ_{q+1} ≤ ... ≤ −λp < 0 < λ_{p+1} ≤ ... ≤ λd.

▶ The exit time will be asymptotically (1/(2λ1)) ln(η⁻¹) as η → 0.

▶ The exit position will converge (in probability) to the intersection of ∂G with the invariant manifold corresponding to −λ1, ..., −λq.

▶ Slogan : when the process Y_t comes close to 0, it will “choose” the most unstable directions and move along them.

Page 26

Escape from saddle points : behavior of the process near one specific saddle.

Figure 7: Escape from a strong saddle point : Kifer’s result.

Page 27

More technical aspects of Kifer’s result...

I In Kifer’s result all convergence are point–wise with respect toinitial point x . They are not uniform convergence when η → 0with respect to all initial point x.

I The “exit along most unstable directions” happens only wheninitial point x strictly stands on A1.

I Does not allow small perturbations with respect to initialpoint x . Run into messy calculations...

Page 28

Why do these technicalities matter ?

▶ Our landscape may have a “chain of saddles”.

▶ Recall that

  dY_t = −∇F(Y_t) dt + √η σ(Y_t) dW_t , Y_0 = x^(0)

is a random perturbation of the gradient flow.

Page 29

Global landscape : chain of saddle points.

Figure 8: Chain of saddle points.

Page 30

Linearization of the gradient flow near a strong saddle point.

▶ Hartman–Grobman Theorem : for any strong saddle point O that we consider, there exist an open neighborhood U of O and a C^0 homeomorphism h : U → R^d such that, under h, the gradient flow is mapped into a linear flow.

▶ Linearization Assumption 11 : the homeomorphism h provided by the Hartman–Grobman theorem can be taken to be C^2.

▶ A sufficient condition for the validity of the C^2 linearization assumption is the so-called non-resonance condition (Sternberg's linearization theorem).

▶ I will also come back to this topic at the end of the talk.

11. Bakhtin, Y., Noisy heteroclinic networks, Probability Theory and RelatedFields, 150(2011), no.1–2, pp. 1–42.

Page 31

Linearization of the gradient flow near a strong saddle point.

Figure 9: Linearization of the gradient flow near a strong saddle point.

Page 32

Uniform version of Kifer’s exit asymptotic.

▶ Let U ⊂ G be an open neighborhood of the saddle point O. Take the initial point x ∈ U ∪ ∂U.

▶ Assume dist(U ∪ ∂U, ∂G) > 0. Let t(x) = inf{t > 0 : S^t x ∈ ∂G}.

▶ W_max is the invariant manifold corresponding to the most negative eigenvalues −λ1, ..., −λq of ∇²F(O).

▶ Define Q_max = W_max ∩ ∂G.

▶ Set

  ∂G_{U∪∂U→out} = {S^{t(x)} x for some x ∈ U ∪ ∂U with finite t(x)} ∪ Q_max .

▶ For small µ > 0 let

  Q_µ = {x ∈ ∂G : dist(x, ∂G_{U∪∂U→out}) < µ} .

▶ τ^η_x is the first exit time to ∂G for the process Y_t starting from x.

Page 33

Uniform version of Kifer’s exit asymptotic.

Figure 10: Uniform exit dynamics near a specific strong saddle point under the linearization assumption.

Page 34

Uniform version of Kifer’s exit asymptotic.

▶ Theorem
For any r > 0, there exists some η0 > 0 so that for all x ∈ U ∪ ∂U and all 0 < η < η0 we have

  E_x τ^η_x / ln(η⁻¹) ≤ 1/(2λ1) + r .

For any small µ > 0 and any ρ > 0, there exists some η0 > 0 so that for all x ∈ U ∪ ∂U and all 0 < η < η0 we have

  P_x(Y_{τ^η_x} ∈ Q_µ) ≥ 1 − ρ .

Page 35

Convergence analysis in a basin containing a local minimum : Sequence of stopping times.

▶ To demonstrate our analysis, we will assume that x* is a local minimum point of F(•) in the sense that, for some open neighborhood U(x*) of x*, we have

  x* = argmin_{x ∈ U(x*)} F(x) .

▶ Assume that there are k strong saddle points O1, ..., Ok in U(x*) such that F(O1) > F(O2) > ... > F(Ok) > F(x*).

▶ Start the process with initial point Y_0 = x ∈ U(x*).

▶ Hitting time

  T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x*) + e}

for some small e > 0.

▶ Asymptotic of E T^η as η → 0 ?

Page 36

Convergence analysis in a basin containing a local minimum : Sequence of stopping times.

Figure 11: Sequence of stopping times.

Page 37

Convergence analysis in a basin containing a local minimum : Sequence of stopping times.

▶ Standard Markov cycle type argument.

▶ But the geometry is a little different from the classical arguments for elliptic equilibria found in the Freidlin–Wentzell book 12.

▶ E T^η ≲ (k/(2γ1)) ln(η⁻¹) conditioned upon convergence. (A heuristic accounting follows the footnote.)

12. Freidlin, M., Wentzell, A., Random perturbations of dynamical systems,2nd edition, Springer, 1998.
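A heuristic accounting of where the bound comes from (my reconstruction; the actual proof runs through the Markov cycle argument above):

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
Near each strong saddle $O_j$ the exit time is
$\sim \tfrac{1}{2\lambda_1(O_j)}\ln(\eta^{-1})$, and the strict saddle
property forces $\lambda_1(O_j) \ge \gamma_1$, so each passage costs at most
$\tfrac{1}{2\gamma_1}\ln(\eta^{-1})$. The transit times between saddles along
the gradient flow are $O(1)$, hence of lower order, so traversing the chain
$O_1, \dots, O_k$ gives
\[
  \mathbb{E}\,T^{\eta} \;\lesssim\; \frac{k}{2\gamma_1}\,\ln(\eta^{-1})
  \quad\text{conditioned upon convergence,}
\]
and, via $\tau^{\eta} = \eta^{-1} T^{\eta}$,
$\mathbb{E}\,\tau^{\eta} \lesssim \tfrac{k}{2\gamma_1}\,\eta^{-1}\ln(\eta^{-1})$.
\end{document}
```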

Page 38

SGD approximating diffusion process : convergence time.

▶ Recall that we have set F(x) = E F(x; ζ).

▶ Thus we can formulate the original optimization problem as

  x* = argmin_{x ∈ U(x*)} F(x) .

▶ Instead of SGD, in this work let us consider its approximating diffusion process

  dX_t = −η ∇F(X_t) dt + η σ(X_t) dW_t , X_0 = x^(0) .

▶ The dynamics of X_t is used as an alternative optimization procedure to find x*.

Page 39

SGD approximating diffusion process : convergence time.

▶ Hitting time

  τ^η = inf{t ≥ 0 : F(X_t) ≤ F(x*) + e}

for some small e > 0.

▶ Asymptotic of E τ^η as η → 0 ?

Page 40

SGD approximating diffusion process : convergence time.

▶ Approximating diffusion process

  dX_t = −η ∇F(X_t) dt + η σ(X_t) dW_t , X_0 = x^(0) .

▶ Let Y_t = X_{t/η} ; then

  dY_t = −∇F(Y_t) dt + √η σ(Y_t) dW_t , Y_0 = x^(0) .

(Random perturbations of the gradient flow !)

▶ Hitting time

  T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x*) + e}

for some small e > 0.

▶ τ^η = η⁻¹ T^η .

Page 41

Convergence analysis in a basin containing a local minimum.

▶ Theorem
(i) For any small ρ > 0, with probability at least 1 − ρ, the SGD approximating diffusion process X_t converges to the minimizer x* for sufficiently small η, after passing through all k saddle points O1, ..., Ok ;
(ii) Consider the stopping time τ^η. Then, as η ↓ 0, conditioned on the above convergence of the SGD approximating diffusion process X_t, we have

  lim_{η→0} E τ^η / (η⁻¹ ln η⁻¹) ≤ k/(2γ1) .

Page 42

Epilogue : Problems remaining.

▶ Approximating diffusion process X_t : in what sense does it approximate x^(t) ? There are various ways 13, and one may need a correction term 14 from numerical SDE theory.

▶ Linearization assumption : typical in dynamical systems, but it should be removable with much harder work.

13. Li, Q., Tai, C., E, W., Stochastic modified equations and adaptive stochastic gradient algorithms, arXiv:1511.06251v3.

14. Milstein, G.N., Approximate integration of stochastic differential equations, Theory of Probability and its Applications, 19(3), pp. 557–562, 1975.

Page 43

Thank you for your attention !