
Page 1

On the fast convergence of random perturbations of the gradient flow.

Wenqing Hu 1

(Joint work with Chris Junchi Li 2.)

1. Department of Mathematics and Statistics, Missouri S&T.
2. Department of Operations Research and Financial Engineering, Princeton University.

Page 2

Stochastic Gradient Descent Algorithm.

▶ We aim to find a local minimum point x* of the expectation function F(x) ≡ E[F(x; ζ)] :

  x* = argmin_{x ∈ U ⊂ R^d} E[F(x; ζ)] .

▶ The index random variable ζ follows some prescribed distribution D.

▶ We consider a general nonconvex stochastic loss function F(x; ζ) that is twice differentiable with respect to x. Together with some additional regularity assumptions 3, this guarantees that ∇E[F(x; ζ)] = E[∇F(x; ζ)].

3. Such as control of the growth of the gradient in expectation.

Page 3

Stochastic Gradient Descent Algorithm.

▶ Gradient Descent (GD) has the iteration :

  x^(t) = x^(t−1) − η E[∇F(x^(t−1); ζ)] .

▶ Stochastic Gradient Descent (SGD) has the iteration :

  x^(t) = x^(t−1) − η ∇F(x^(t−1); ζ_t) ,

where {ζ_t} are i.i.d. random variables with the same distribution as ζ ∼ D. (A minimal numerical sketch of both iterations follows.)
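As a concrete illustration (my own sketch, not from the slides), the two iterations can be compared on a toy quadratic loss F(x; ζ) = ½‖x − ζ‖² with ζ ∼ N(µ, I), for which the expected gradient x − µ is available in closed form; the choice of objective, µ, η, and step count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy loss F(x; zeta) = 0.5 * ||x - zeta||^2 with zeta ~ N(mu, I), so that
# F(x) = E[F(x; zeta)] is minimized at x* = mu and E[grad F(x; zeta)] = x - mu.
mu = np.array([1.0, -2.0])  # illustrative minimizer
eta = 0.05                  # learning rate

x_gd = np.zeros(2)   # GD: uses the exact expected gradient
x_sgd = np.zeros(2)  # SGD: uses one random sample zeta_t per step
for t in range(500):
    x_gd = x_gd - eta * (x_gd - mu)
    zeta_t = rng.normal(mu, 1.0)
    x_sgd = x_sgd - eta * (x_sgd - zeta_t)

print(x_gd)   # converges to mu
print(x_sgd)  # fluctuates around mu at a noise scale governed by eta
```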

Page 4

Machine Learning Background : DNN.

▶ Deep Neural Network (DNN). The goal is to solve the following stochastic optimization problem

  min_{x ∈ R^d} f(x) ≡ (1/M) ∑_{i=1}^{M} f_i(x) ,

where each component f_i corresponds to the loss function for data point i ∈ {1, ..., M}, and x is the vector of weights being optimized.

Page 5

Machine Learning Background : DNN.

▶ Let B be a minibatch of prescribed size sampled uniformly from {1, ..., M} ; then the objective function can further be written as the expectation of a stochastic function

  (1/M) ∑_{i=1}^{M} f_i(x) = E_B [ (1/|B|) ∑_{i ∈ B} f_i(x) ] .

▶ SGD updates as

  x^(t) = x^(t−1) − η (1/|B_t|) ∑_{i ∈ B_t} ∇f_i(x^(t−1)) ,

which is the classical mini-batch version of SGD. (A sketch of this update follows.)
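A minimal sketch of the mini-batch update on a toy finite sum (my illustration; the data A, batch size, and step count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy finite sum: f_i(x) = 0.5 * ||x - a_i||^2, so (1/M) sum_i f_i(x)
# is minimized at the empirical mean of the data points a_i.
M, d = 1000, 2
A = rng.normal(size=(M, d))  # illustrative data points a_i (one per row)

eta, batch_size = 0.1, 32
x = np.zeros(d)
for t in range(300):
    B_t = rng.choice(M, size=batch_size, replace=False)  # uniform minibatch
    grad = np.mean(x - A[B_t], axis=0)  # (1/|B_t|) * sum of per-sample grads
    x = x - eta * grad

print(x, A.mean(axis=0))  # the iterate approaches the empirical mean
```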

Page 6

Machine Learning Background : Online learning.

▶ Online learning via SGD : a standard sequential prediction problem where, for i = 1, 2, ... :
  1. An unlabeled example a_i arrives ;
  2. We make a prediction b_i based on the current weights x_i = [x_i^1, ..., x_i^d] ∈ R^d ;
  3. We observe b_i, let ζ_i = (a_i, b_i), and incur some known loss L(x_i, ζ_i) which is convex in x_i ;
  4. We update the weights according to some rule x_{i+1} ← f(x_i).

▶ SGD rule :

  f(x_i) = x_i − η ∇_1 L(x_i, ζ_i) ,

where ∇_1 L(u, v) is a subgradient of L(u, v) with respect to the first variable u, and the parameter η > 0 is often referred to as the learning rate. (A sketch of this loop follows.)
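A small sketch of the online protocol on a least-squares stream (my illustration; the linear model, noise level, and target weights are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

d, eta = 3, 0.01
x = np.zeros(d)                      # current weights x_i
x_true = np.array([0.5, -1.0, 2.0])  # illustrative unknown target

for i in range(5000):
    a_i = rng.normal(size=d)                 # 1. unlabeled example arrives
    b_pred = a_i @ x                         # 2. prediction from the weights
    b_i = a_i @ x_true + 0.1 * rng.normal()  # 3. observe; zeta_i = (a_i, b_i)
    grad = (b_pred - b_i) * a_i              #    gradient of 0.5*(pred-b)^2 in x
    x = x - eta * grad                       # 4. SGD update rule f(x_i)

print(x)  # drifts toward x_true as examples stream in
```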

Page 7

Closer look at SGD.

▶ SGD :

  x^(t) = x^(t−1) − η ∇F(x^(t−1); ζ_t) .

▶ Set F(x) = E_{ζ∼D} F(x; ζ).

▶ Let

  e_t = ∇F(x^(t−1); ζ_t) − ∇F(x^(t−1)) ,

and we can rewrite the SGD as

  x^(t) = x^(t−1) − η (∇F(x^(t−1)) + e_t) .

Page 8

Local statistical characteristics of the SGD path.

▶ In difference form it reads

  x^(t) − x^(t−1) = −η ∇F(x^(t−1)) − η e_t .

▶ We see that

  E(x^(t) − x^(t−1) | x^(t−1)) = −η ∇F(x^(t−1))

and

  [Cov(x^(t) − x^(t−1) | x^(t−1))]^(1/2) = η [Cov(∇F(x^(t−1); ζ))]^(1/2) .

(A Monte Carlo check of these one-step statistics follows.)
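These one-step statistics can be verified by Monte Carlo on the toy quadratic loss from before (my sketch; x, µ, η, and the sample size are illustrative): for F(x; ζ) = ½‖x − ζ‖² with ζ ∼ N(µ, I), one has ∇F(x) = x − µ and Cov(∇F(x; ζ)) = I.

```python
import numpy as np

rng = np.random.default_rng(3)

mu, eta = np.array([1.0, -2.0]), 0.05
x = np.array([3.0, 0.5])  # the current iterate x^(t-1)

# Many independent realizations of the one-step increment x^(t) - x^(t-1).
zetas = rng.normal(mu, 1.0, size=(200_000, 2))
increments = -eta * (x - zetas)

print(increments.mean(axis=0), -eta * (x - mu))  # matches -eta * grad F(x)
print(np.cov(increments.T))                      # approx eta^2 * I
```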

Page 9

SGD approximating diffusion process.

▶ Very roughly speaking, we can approximate x^(t) by a diffusion process X_t driven by the stochastic differential equation

  dX_t = −η ∇F(X_t) dt + η σ(X_t) dW_t , X_0 = x^(0) ,

where σ(x) = [Cov(∇F(x; ζ))]^(1/2) and W_t is a standard Brownian motion in R^d.

▶ Slogan : continuous Markov processes are characterized by their local statistical characteristics through the first and second moments only (conditional mean and (co)variance). (A simulation sketch follows.)
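An Euler–Maruyama sketch of this SDE for the same toy loss (my illustration, not from the slides; there σ(x) = I, and the step size dt = 1 is chosen to mirror one SGD step):

```python
import numpy as np

rng = np.random.default_rng(4)

# Euler-Maruyama discretization of
#   dX_t = -eta * grad F(X_t) dt + eta * sigma(X_t) dW_t
# for F(x) = 0.5 * ||x - mu||^2, where grad F(x) = x - mu and sigma = I.
mu, eta, dt = np.array([1.0, -2.0]), 0.05, 1.0
X = np.array([3.0, 0.5])

for t in range(500):
    dW = np.sqrt(dt) * rng.normal(size=2)   # Brownian increment
    X = X - eta * (X - mu) * dt + eta * dW  # drift step + diffusion step

print(X)  # hovers near mu, mimicking the SGD iterates above
```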

Page 10

Diffusion Approximation of SGD : Justification.

▶ Such an approximation has been justified in the weak sense in much of the classical literature 4 5 6.

▶ It can also be thought of as a normal deviation result.

▶ One can call the continuous process X_t the “continuous SGD” 7.

▶ I will come back to this topic at the end of the talk.

4. Benveniste, A., Métivier, M., Priouret, P., Adaptive Algorithms and Stochastic Approximations, Applications of Mathematics, 22, Springer, 1990.

5. Borkar, V.S., Stochastic Approximation : A Dynamical Systems Viewpoint, Cambridge University Press, 2008.

6. Kushner, H.J., Yin, G., Stochastic Approximation and Recursive Algorithms and Applications, Stochastic Modeling and Applied Probability, 35, Second Edition, Springer, 2003.

7. Sirignano, J., Spiliopoulos, K., Stochastic Gradient Descent in Continuous Time. SIAM Journal on Financial Mathematics, to appear.

Page 11

SGD approximating diffusion process : convergence time.

▶ Recall that we have set F(x) = E F(x; ζ).

▶ Thus we can formulate the original optimization problem as

  x* = argmin_{x ∈ U} F(x) .

▶ Instead of SGD, in this work let us consider its approximating diffusion process

  dX_t = −η ∇F(X_t) dt + η σ(X_t) dW_t , X_0 = x^(0) .

▶ The dynamics of X_t is used as an alternative optimization procedure to find x*.

▶ Remark : if there is no noise, then

  dX_t = −η ∇F(X_t) dt , X_0 = x^(0)

is just the gradient flow S^t x^(0). Thus it approaches x* in the strictly convex case.

Page 12

SGD approximating diffusion process : convergence time.

▶ Hitting time

  τ^η = inf{t ≥ 0 : F(X_t) ≤ F(x*) + e}

for some small e > 0.

▶ Asymptotic of E τ^η as η → 0 ?

Page 13

SGD approximating diffusion process : convergence time.

▶ Approximating diffusion process

  dX_t = −η ∇F(X_t) dt + η σ(X_t) dW_t , X_0 = x^(0) .

▶ Let Y_t = X_{t/η} (a rescaling of time) ; then

  dY_t = −∇F(Y_t) dt + √η σ(Y_t) dW_t , Y_0 = x^(0) .

(Random perturbations of the gradient flow !)

▶ Hitting time

  T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x*) + e}

for some small e > 0.

▶ Consequently,

  τ^η = η⁻¹ T^η .

(A short verification of the time change follows.)
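A short verification of the time change (my reconstruction of the standard computation, not spelled out on the slides):

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
Substituting $s = t/\eta$ into
$dX_s = -\eta\,\nabla F(X_s)\,ds + \eta\,\sigma(X_s)\,dW_s$
gives, for $Y_t = X_{t/\eta}$,
\[
  dY_t = -\nabla F(Y_t)\,dt + \eta\,\sigma(Y_t)\,dW_{t/\eta},
\]
and since $(W_{t/\eta})_{t \ge 0}$ has the law of
$(\eta^{-1/2}\,\widetilde W_t)_{t \ge 0}$ for a standard Brownian motion
$\widetilde W$, the noise term is $\sqrt{\eta}\,\sigma(Y_t)\,d\widetilde W_t$
in distribution. Moreover $F(Y_t) = F(X_{t/\eta})$ first drops below
$F(x^*) + e$ at $t = \eta\,\tau^{\eta}$, hence
$T^{\eta} = \eta\,\tau^{\eta}$, i.e.\ $\tau^{\eta} = \eta^{-1} T^{\eta}$.
\end{document}
```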

Page 14

Random perturbations of the gradient flow : convergence time.

▶ Let Y_t = X_{t/η} ; then

  dY_t = −∇F(Y_t) dt + √η σ(Y_t) dW_t , Y_0 = x^(0) .

▶ Hitting time

  T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x*) + e}

for some small e > 0.

▶ Asymptotic of E T^η as η → 0 ?

Page 15

Where is the difficulty ?

Figure 1: Various critical points and the landscape of F .

Page 16

Where is the difficulty ?

Figure 2: Higher order critical points.

Page 17

Strict saddle property.

▶ Definition (strict saddle property) 8 9. Given fixed γ1 > 0 and γ2 > 0, we say a Morse function F defined on R^n satisfies the “strict saddle property” if each point x ∈ R^n belongs to one of the following : (i) |∇F(x)| ≥ γ2 > 0 ; (ii) |∇F(x)| < γ2 and λ_min(∇²F(x)) ≤ −γ1 < 0 ; (iii) |∇F(x)| < γ2 and λ_min(∇²F(x)) ≥ γ1 > 0.

8. Ge, R., Huang, F., Jin, C., Yuan, Y., Escaping from saddle points — online stochastic gradient descent for tensor decomposition, arXiv:1503.02101v1 [cs.LG].

9. Sun, J., Qu, Q., Wright, J., When are nonconvex problems not scary ? arXiv:1510.06096 [math.DC].

Page 18

Strict saddle property.

Figure 3: Types of local landscape geometry with only strict saddles.

Page 19

Strong saddle property.

▶ Definition (strong saddle property)

Let the Morse function F(•) satisfy the strict saddle property with parameters γ1 > 0 and γ2 > 0. We say the Morse function F(•) satisfies the “strong saddle property” if there exists γ3 > 0 such that, at every x ∈ R^n with ∇F(x) = 0 that satisfies (ii) in Definition 1, all eigenvalues λ_i, i = 1, 2, ..., n of the Hessian ∇²F(x) are bounded away from zero by γ3 in absolute value, i.e., |λ_i| ≥ γ3 > 0 for any 1 ≤ i ≤ n.

Page 20

Escape from saddle points.

Figure 4: Escape from a strong saddle point.

Page 21

Escape from saddle points : behavior of the process near one specific saddle.

▶ The problem was first studied by Kifer in 1981 10.

▶ Recall that

  dY_t = −∇F(Y_t) dt + √η σ(Y_t) dW_t , Y_0 = x^(0) ,

is a random perturbation of the gradient flow. Kifer needs to assume further that the diffusion matrix σ(•)σ^T(•) = Cov(∇F(•; ζ)) is strictly uniformly positive definite.

▶ Roughly speaking, Kifer's result states that the exit from a neighborhood of a strong saddle point happens along the “most unstable” direction of the Hessian matrix, and that the exit time is asymptotically ∼ C log(η⁻¹). (A toy simulation follows the footnote.)

10. Kifer, Y., The exit problem for small random perturbations of dynamical systems with a hyperbolic fixed point. Israel Journal of Mathematics, 40(1), 1981, pp. 74–96.
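To see the logarithmic scaling numerically (my own toy experiment, not from the slides), one can perturb the gradient flow of the model saddle F(x, y) = (y² − x²)/2, whose Hessian at the origin has eigenvalues −1 and +1 (so λ1 = 1, and the noise is taken to be the identity, consistent with Kifer's positive-definiteness assumption), and time the exit from the unit ball; the predicted constant is 1/(2λ1) = 1/2:

```python
import numpy as np

rng = np.random.default_rng(5)

# Model saddle F(x, y) = (y^2 - x^2)/2: grad F = (-x, y), unstable direction x.
# Simulate dY = -grad F(Y) dt + sqrt(eta) dW starting at the saddle and
# measure the mean exit time from the unit ball as eta shrinks.
def mean_exit_time(eta, dt=1e-3, reps=200):
    times = []
    for _ in range(reps):
        y, t = np.zeros(2), 0.0
        while np.linalg.norm(y) < 1.0:
            drift = np.array([y[0], -y[1]])  # -grad F(y)
            y = y + drift * dt + np.sqrt(eta * dt) * rng.normal(size=2)
            t += dt
        times.append(t)
    return np.mean(times)

for eta in [1e-1, 1e-2, 1e-3]:
    # compare against (1/(2*lambda_1)) * ln(1/eta) with lambda_1 = 1
    print(eta, mean_exit_time(eta), 0.5 * np.log(1.0 / eta))
```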

Page 22

Escape from saddle points : behavior of the process near one specific saddle.

Figure 5: Escape from a strong saddle point : Kifer’s result.

Page 23

Escape from saddle points : behavior of the process near one specific saddle.

▶ G ∪ ∂G = {0} ∪ A1 ∪ A2 ∪ A3, where 0 denotes the saddle point.

▶ If x ∈ A2 ∪ A3, then for the deterministic gradient flow there is a finite

  t(x) = inf{t > 0 : S^t x ∈ ∂G} .

▶ In this case, as η → 0, the expected first exit time converges to t(x), and the first exit position converges to S^{t(x)} x.

▶ Recall that the gradient flow S^t x^(0) is

  dX_t = −η ∇F(X_t) dt , X_0 = x^(0) .

Page 24

Escape from saddle points : behavior of the process near one specific saddle.

Figure 6: Escape from a strong saddle point : Kifer’s result.

Page 25

Escape from saddle points : behavior of the process near one specific saddle.

▶ If x ∈ ({0} ∪ A1) \ ∂G, the situation is more interesting.

▶ The first exit occurs along the direction pointed out by the most negative eigenvalue(s) of the Hessian ∇²F(0).

▶ ∇²F(0) has spectrum −λ1 = −λ2 = ... = −λq < −λ_{q+1} ≤ ... ≤ −λp < 0 < λ_{p+1} ≤ ... ≤ λd.

▶ The exit time will be asymptotically (1/(2λ1)) ln(η⁻¹) as η → 0.

▶ The exit position will converge (in probability) to the intersection of ∂G with the invariant manifold corresponding to −λ1, ..., −λq.

▶ Slogan : when the process Y_t comes close to 0, it will “choose” the most unstable directions and move along them.

Page 26

Escape from saddle points : behavior of the process near one specific saddle.

Figure 7: Escape from a strong saddle point : Kifer’s result.

Page 27

More technical aspects of Kifer’s result...

I In Kifer’s result all convergence are point–wise with respect toinitial point x . They are not uniform convergence when η → 0with respect to all initial point x.

I The “exit along most unstable directions” happens only wheninitial point x strictly stands on A1.

I Does not allow small perturbations with respect to initialpoint x . Run into messy calculations...

Page 28

Why do these technicalities matter ?

▶ Our landscape may have a “chain of saddles”.

▶ Recall that

  dY_t = −∇F(Y_t) dt + √η σ(Y_t) dW_t , Y_0 = x^(0)

is a random perturbation of the gradient flow.

Page 29

Global landscape : chain of saddle points.

Figure 8: Chain of saddle points.

Page 30

Linearization of the gradient flow near a strong saddle point.

▶ Hartman–Grobman Theorem : for any strong saddle point O that we consider, there exist an open neighborhood U of O and a C^0 homeomorphism h : U → R^d such that, under h, the gradient flow is mapped into a linear flow.

▶ Linearization Assumption 11 : the homeomorphism h provided by the Hartman–Grobman theorem can be taken to be C^2.

▶ A sufficient condition for the validity of the C^2 linearization assumption is the so-called non-resonance condition (Sternberg's linearization theorem).

▶ I will also come back to this topic at the end of the talk.

11. Bakhtin, Y., Noisy heteroclinic networks, Probability Theory and RelatedFields, 150(2011), no.1–2, pp. 1–42.

Page 31

Linearization of the gradient flow near a strong saddle point.

Figure 9: Linearization of the gradient flow near a strong saddle point.

Page 32

Uniform version of Kifer’s exit asymptotic.

▶ Let U ⊂ G be an open neighborhood of the saddle point O. Take the initial point x ∈ U ∪ ∂U.

▶ Assume dist(U ∪ ∂U, ∂G) > 0. Let t(x) = inf{t > 0 : S^t x ∈ ∂G}.

▶ W_max is the invariant manifold corresponding to the most negative eigenvalues −λ1, ..., −λq of ∇²F(O).

▶ Define Q_max = W_max ∩ ∂G.

▶ Set

  ∂G_{U∪∂U→out} = {S^{t(x)} x for some x ∈ U ∪ ∂U with finite t(x)} ∪ Q_max .

▶ For small µ > 0 let

  Q_µ = {x ∈ ∂G : dist(x, ∂G_{U∪∂U→out}) < µ} .

▶ τ^η_x is the first exit time to ∂G for the process Y_t starting from x.

Page 33

Uniform version of Kifer’s exit asymptotic.

Figure 10: Uniform exit dynamics near a specific strong saddle point under the linearization assumption.

Page 34

Uniform version of Kifer’s exit asymptotic.

▶ Theorem
For any r > 0, there exists some η0 > 0 so that for all x ∈ U ∪ ∂U and all 0 < η < η0 we have

  E_x τ^η_x / ln(η⁻¹) ≤ 1/(2λ1) + r .

For any small µ > 0 and any ρ > 0, there exists some η0 > 0 so that for all x ∈ U ∪ ∂U and all 0 < η < η0 we have

  P_x(Y_{τ^η_x} ∈ Q_µ) ≥ 1 − ρ .

Page 35

Convergence analysis in a basin containing a local minimum : Sequence of stopping times.

▶ To demonstrate our analysis, we will assume that x* is a local minimum point of F(•) in the sense that, for some open neighborhood U(x*) of x*, we have

  x* = argmin_{x ∈ U(x*)} F(x) .

▶ Assume that there are k strong saddle points O1, ..., Ok in U(x*) such that F(O1) > F(O2) > ... > F(Ok) > F(x*).

▶ Start the process with initial point Y_0 = x ∈ U(x*).

▶ Hitting time

  T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x*) + e}

for some small e > 0.

▶ Asymptotic of E T^η as η → 0 ?

Page 36

Convergence analysis in a basin containing a local minimum : Sequence of stopping times.

Figure 11: Sequence of stopping times.

Page 37

Convergence analysis in a basin containing a local minimum : Sequence of stopping times.

▶ Standard Markov cycle type argument.

▶ But the geometry is a little different from the classical arguments for elliptic equilibria found in the Freidlin–Wentzell book 12.

▶ E T^η ≲ (k/(2γ1)) ln(η⁻¹) conditioned upon convergence. (A heuristic accounting follows the footnote.)

12. Freidlin, M., Wentzell, A., Random perturbations of dynamical systems,2nd edition, Springer, 1998.
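A heuristic accounting of where the bound comes from (my reconstruction; the actual proof runs through the Markov cycle argument above):

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
Near each strong saddle $O_j$ the exit time is
$\sim \tfrac{1}{2\lambda_1(O_j)}\ln(\eta^{-1})$, and the strict saddle
property forces $\lambda_1(O_j) \ge \gamma_1$, so each passage costs at most
$\tfrac{1}{2\gamma_1}\ln(\eta^{-1})$. The transit times between saddles along
the gradient flow are $O(1)$, hence of lower order, so traversing the chain
$O_1, \dots, O_k$ gives
\[
  \mathbb{E}\,T^{\eta} \;\lesssim\; \frac{k}{2\gamma_1}\,\ln(\eta^{-1})
  \quad\text{conditioned upon convergence,}
\]
and, via $\tau^{\eta} = \eta^{-1} T^{\eta}$,
$\mathbb{E}\,\tau^{\eta} \lesssim \tfrac{k}{2\gamma_1}\,\eta^{-1}\ln(\eta^{-1})$.
\end{document}
```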

Page 38

SGD approximating diffusion process : convergence time.

▶ Recall that we have set F(x) = E F(x; ζ).

▶ Thus we can formulate the original optimization problem as

  x* = argmin_{x ∈ U(x*)} F(x) .

▶ Instead of SGD, in this work let us consider its approximating diffusion process

  dX_t = −η ∇F(X_t) dt + η σ(X_t) dW_t , X_0 = x^(0) .

▶ The dynamics of X_t is used as an alternative optimization procedure to find x*.

Page 39

SGD approximating diffusion process : convergence time.

▶ Hitting time

  τ^η = inf{t ≥ 0 : F(X_t) ≤ F(x*) + e}

for some small e > 0.

▶ Asymptotic of E τ^η as η → 0 ?

Page 40

SGD approximating diffusion process : convergence time.

▶ Approximating diffusion process

  dX_t = −η ∇F(X_t) dt + η σ(X_t) dW_t , X_0 = x^(0) .

▶ Let Y_t = X_{t/η} ; then

  dY_t = −∇F(Y_t) dt + √η σ(Y_t) dW_t , Y_0 = x^(0) .

(Random perturbations of the gradient flow !)

▶ Hitting time

  T^η = inf{t ≥ 0 : F(Y_t) ≤ F(x*) + e}

for some small e > 0.

▶ τ^η = η⁻¹ T^η .

Page 41

Convergence analysis in a basin containing a local minimum.

▶ Theorem
(i) For any small ρ > 0, with probability at least 1 − ρ, the SGD approximating diffusion process X_t converges to the minimizer x* for sufficiently small η, after passing through all k saddle points O1, ..., Ok ;
(ii) Consider the stopping time τ^η. Then, as η ↓ 0, conditioned on the above convergence of the SGD approximating diffusion process X_t, we have

  lim_{η→0} E τ^η / (η⁻¹ ln η⁻¹) ≤ k/(2γ1) .

Page 42

Epilogue : Problems remaining.

▶ Approximating diffusion process X_t : in what sense does it approximate x^(t) ? There are various ways 13, and one may need a correction term 14 from numerical SDE theory.

▶ Linearization assumption : typical in dynamical systems, but it should be removable with much harder work.

13. Li, Q., Tai, C., E, W., Stochastic modified equations and adaptive stochastic gradient algorithms, arXiv:1511.06251v3.

14. Milstein, G.N., Approximate integration of stochastic differential equations, Theory of Probability and its Applications, 19(3), pp. 557–562, 1975.

Page 43

Thank you for your attention !