
Stochastic Approximation Theory

Yingzhen Li and Mark Rowland

November 26, 2015


Plan

- History and modern formulation of stochastic approximation theory

- In-depth look at stochastic gradient descent (SGD)

- Introduction to key ideas in stochastic approximation theory, such as Lyapunov functions, quasimartingales, and numerical solutions to differential equations

Bibliography

Stochastic Approximation and Recursive Algorithms and Applications, Kushner & Yin (2003)

Online Learning and Stochastic Approximations, Léon Bottou (1998)

A Stochastic Approximation Method, Robbins & Monro (1951)

Stochastic Estimation of the Maximum of a Regression Function, Kiefer & Wolfowitz (1952)

Introduction to Stochastic Search and Optimization, Spall (2003)

A brief history of numerical analysis

Numerical Analysis is a well-established discipline.

c. 1800-1600 BC: Babylonians attempted to calculate $\sqrt{2}$, or in modern terms, find the roots of $x^2 - 2 = 0$.¹

[Image: clay tablet, Yale Babylonian Collection]

1. http://www.math.ubc.ca/~cass/Euclid/ybc/ybc.html

A brief history of numerical analysis

C17-C19: Huge number of numerical techniques developed as tools for the natural sciences:

- Root-finding methods (e.g. Newton-Raphson)
- Numerical integration (e.g. Gaussian Quadrature)
- ODE solution (e.g. Euler method)
- Interpolation (e.g. Lagrange polynomials)

Focus is on situations where function evaluation is deterministic.

A brief history of numerical analysis

1951: Robbins and Monro publish "A Stochastic Approximation Method", describing how to find the root of an increasing function $f: \mathbb{R} \to \mathbb{R}$ when only noisy estimates of the function's value at a given point are available.

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context, and prove convergence of the resulting estimate in $L^2$ and in probability.

Note also that if we treat $f$ as the gradient of some function $F$, the Robbins-Monro algorithm can be viewed as a minimisation procedure.

The Robbins-Monro paper establishes stochastic approximation as an area of numerical analysis in its own right.
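In modern notation the scheme is the iteration $x_{n+1} = x_n - a_n Y_n$, where $Y_n$ is a noisy observation of $f(x_n)$ and the step sizes satisfy $\sum_n a_n = \infty$, $\sum_n a_n^2 < \infty$. A minimal sketch (the tanh test function, noise level, and $a_n = 1/n$ schedule are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_f(x):
    """Noisy evaluation of the increasing function f(x) = tanh(x), with root at 0."""
    return np.tanh(x) + rng.normal(scale=0.5)

x = 2.0
for n in range(1, 10_001):
    a_n = 1.0 / n            # Robbins-Monro step sizes: sum a_n = inf, sum a_n**2 < inf
    x -= a_n * noisy_f(x)    # step against the noisy function value

print(x)  # close to the true root 0
```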

A brief history of numerical analysis

1952: Motivated by the Robbins-Monro paper the year before, Kiefer and Wolfowitz publish "Stochastic Estimation of the Maximum of a Regression Function".

Their algorithm is phrased as a maximisation procedure, in contrast to the Robbins-Monro paper, and uses central-difference approximations of the gradient to update the optimum estimator.
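A minimal sketch of the idea (the quadratic regression function and the step/probe schedules $a_n = 1/n$, $c_n = n^{-1/3}$ are our own illustrative choices, consistent with the paper's conditions):

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_F(x):
    """Noisy evaluation of a regression function with maximum at x = 1."""
    return -(x - 1.0)**2 + rng.normal(scale=0.1)

x = 3.0
for n in range(1, 5_001):
    a_n, c_n = 1.0 / n, n**(-1.0/3.0)
    grad_est = (noisy_F(x + c_n) - noisy_F(x - c_n)) / (2 * c_n)  # central difference
    x += a_n * grad_est      # ascent step: Kiefer-Wolfowitz maximises

print(x)  # close to the maximiser 1
```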

A brief history of numerical analysis

Present day: Stochastic approximation is widely researched and used in practice.

- Original applications in root-finding and optimisation, often in the guise of stochastic gradient descent:
  - Neural networks
  - K-Means
  - ...
- Related ideas are used in:
  - Stochastic variational inference (Hoffman et al. (2013))
  - Pseudo-marginal Metropolis-Hastings (Beaumont (2003), Andrieu & Roberts (2009))
  - Stochastic Gradient Langevin Dynamics (Welling & Teh (2011))

Textbooks:
Stochastic Approximation and Recursive Algorithms and Applications, Kushner & Yin (2003)
Introduction to Stochastic Search and Optimization, Spall (2003)

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysis, and the general problem it seeks to solve has the following form:

Minimise $f(w) = \mathbb{E}[F(w, \xi)]$

(over $w$ in some domain $\mathcal{W}$, and some random variable $\xi$).

This is an immensely flexible framework:

- $F(w, \xi) = f(w) + \xi$ models experimental/measurement error.
- $F(w, \xi) = (w^\top\phi(\xi_X) - \xi_Y)^2$ corresponds to (least squares) linear regression.
- If $f(w) = \frac{1}{N}\sum_{i=1}^N g(w, x_i)$ and $F(w, \xi) = \frac{1}{K}\sum_{x \in \xi} g(w, x)$, with $\xi$ a randomly-selected subset (of size $K$) of a large data set, this corresponds to "stochastic" machine learning algorithms.
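To make the last bullet concrete, here is a toy instance of the framework; the quadratic per-datapoint cost $g$ is our own illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 10_000, 32
data = rng.normal(size=N)                 # a large data set

def g(w, x):
    """Per-datapoint cost; minimised in expectation at the data mean."""
    return (w - x)**2

def f(w):
    """Full objective f(w) = (1/N) sum_i g(w, x_i)."""
    return np.mean(g(w, data))

def F(w):
    """Stochastic estimate F(w, xi) from a random size-K subset xi."""
    xi = rng.choice(data, size=K, replace=False)
    return np.mean(g(w, xi))

print(f(0.0), F(0.0))   # F fluctuates around f, with E[F(w, xi)] = f(w)
```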

Stochastic gradient descent

We'll focus on proof techniques for stochastic gradient descent (SGD).

We'll derive conditions for SGD to converge almost surely. We broadly follow the structure of Léon Bottou's paper.²

Plan:

- Continuous gradient descent
- Discrete gradient descent
- Stochastic gradient descent

2. Online Learning and Stochastic Approximations, Léon Bottou (1998)

Continuous Gradient Descent

Let $C: \mathbb{R}^k \to \mathbb{R}$ be differentiable. How do solutions $s: \mathbb{R} \to \mathbb{R}^k$ to the following ODE behave?

$$\frac{d}{dt}s(t) = -\nabla C(s(t))$$

Example: $C(x) = 5(x_1 + x_2)^2 + (x_1 - x_2)^2$

We can analytically solve the gradient descent equation for this example:

$$\nabla C(x) = (12x_1 + 8x_2,\; 8x_1 + 12x_2)$$

So $(s_1', s_2') = -(12s_1 + 8s_2,\; 8s_1 + 12s_2)$, and solving with $(s_1(0), s_2(0)) = (1.5, 0.5)$ gives

$$\begin{pmatrix} s_1(t) \\ s_2(t) \end{pmatrix} = \begin{pmatrix} e^{-20t} + 0.5e^{-4t} \\ e^{-20t} - 0.5e^{-4t} \end{pmatrix}$$

[Animation: exact solution of the gradient descent ODE]
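As a sanity check on the closed form, a crude numerical integration of the ODE (the step size and horizon are our own choices) reproduces it:

```python
import numpy as np

def grad_C(x):
    """Gradient of C(x) = 5*(x1 + x2)**2 + (x1 - x2)**2."""
    return np.array([12*x[0] + 8*x[1], 8*x[0] + 12*x[1]])

s, dt, T = np.array([1.5, 0.5]), 1e-4, 1.0
for _ in range(int(T / dt)):
    s = s - dt * grad_C(s)               # integrate ds/dt = -grad C(s)

analytic = np.array([np.exp(-20*T) + 0.5*np.exp(-4*T),
                     np.exp(-20*T) - 0.5*np.exp(-4*T)])
print(s, analytic)                       # agree to several decimal places
```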

Continuous Gradient Descent

We'll start with some simplifying assumptions:

1. $C$ has a unique minimiser $x^\star$
2. $\forall \varepsilon > 0,\; \inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$

(The second condition is weaker than convexity.)

Proposition: If $s: [0, \infty) \to \mathcal{X}$ satisfies the differential equation

$$\frac{ds(t)}{dt} = -\nabla C(s(t))$$

then $s(t) \to x^\star$ as $t \to \infty$.

[Figure: example of a non-convex function satisfying these conditions]

Continuous Gradient Descent

Proof sketch:

1. Define Lyapunov function $h(t) = \|s(t) - x^\star\|_2^2$
2. $h(t)$ is a decreasing function
3. $h(t)$ converges to a limit, and $h'(t)$ converges to 0
4. The limit of $h(t)$ must be 0
5. $s(t) \to x^\star$

$h(t)$ is a decreasing function:

$$\frac{d}{dt}h(t) = \frac{d}{dt}\|s(t) - x^\star\|_2^2 = \frac{d}{dt}\langle s(t) - x^\star,\, s(t) - x^\star \rangle = 2\left\langle \frac{ds(t)}{dt},\, s(t) - x^\star \right\rangle = -2\langle s(t) - x^\star,\, \nabla C(s(t)) \rangle \le 0$$

The limit of $h(t)$ must be 0:

If not, there is some $K > 0$ such that $h(t) > K$ for all $t$. By assumption,

$$\inf_{\|x - x^\star\|_2^2 > K} \langle x - x^\star, \nabla C(x) \rangle > 0,$$

which bounds $h'(t)$ above by a strictly negative constant, contradicting the convergence of $h'(t)$ to 0.

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!

Discrete Gradient Descent

An intuitive, straightforward discretisation of

$$\frac{d}{dt}s(t) = -\nabla C(s(t))$$

is

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n) \qquad \text{(forward Euler)}$$

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method, which is (A-)stable, is the implicit Euler method:

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})$$

So why settle for explicit discretisation? Implicit requires the solution of a non-linear system at each step; practically, explicit discretisation works!
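For the quadratic example above the implicit step is just a linear solve, which makes the stability contrast easy to demonstrate; the step size 0.11, chosen just above forward Euler's stability limit 2/20 = 0.1 for this problem, is our own choice:

```python
import numpy as np

H = np.array([[12.0, 8.0], [8.0, 12.0]])     # Hessian of the quadratic example C
eps = 0.11                                    # above forward Euler's stability limit 2/20

s_explicit = s_implicit = np.array([1.5, 0.5])
for _ in range(200):
    s_explicit = s_explicit - eps * (H @ s_explicit)              # forward Euler
    s_implicit = np.linalg.solve(np.eye(2) + eps * H, s_implicit) # implicit Euler

print(s_explicit)   # blows up: the factor |1 - 20*eps| exceeds 1
print(s_implicit)   # converges to the minimiser (0, 0)
```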

Discrete Gradient Descent

Also have multistep methods, e.g.

$$s_{n+2} = s_{n+1} - \varepsilon_{n+2}\left(\frac{3}{2}\nabla C(s_{n+1}) - \frac{1}{2}\nabla C(s_n)\right)$$

This is the 2nd-order Adams-Bashforth method.

Similar (in)stability properties to forward Euler, but discretisation error of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).

Discrete Gradient Descent

[Animations: examples of discretisation with ε = 0.1, ε = 0.07, and ε = 0.01]

Discrete Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n\nabla C(s_n)$ and $C$ satisfies:

1. $C$ has a unique minimiser $x^\star$
2. $\forall \varepsilon > 0,\; \inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$
3. $\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
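A direct implementation of the proposition's setting on the earlier quadratic example (the $\varepsilon_n = 0.1/n$ schedule is one illustrative Robbins-Monro choice):

```python
import numpy as np

def grad_C(x):
    return np.array([12*x[0] + 8*x[1], 8*x[0] + 12*x[1]])

s = np.array([1.5, 0.5])
for n in range(1, 100_001):
    eps_n = 0.1 / n                 # sum eps_n = inf, sum eps_n**2 < inf
    s = s - eps_n * grad_C(s)

print(s)   # approaches the minimiser (0, 0), slowly for a 1/n schedule
```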

Discrete Gradient Descent

Proof sketch:

1. Define Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^\star$

Define Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$:

Because of the error that the discretisation introduced, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof. Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. [Illustration credit: Online Learning and Stochastic Approximations, Léon Bottou (1998)]

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$:

With some algebra, note that

$$\begin{aligned} h_{n+1} - h_n &= \langle s_{n+1} - x^\star, s_{n+1} - x^\star \rangle - \langle s_n - x^\star, s_n - x^\star \rangle \\ &= \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2\langle s_{n+1} - s_n, x^\star \rangle \\ &= \langle s_n - \varepsilon_n\nabla C(s_n), s_n - \varepsilon_n\nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2\varepsilon_n\langle \nabla C(s_n), x^\star \rangle \\ &= -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2\|\nabla C(s_n)\|_2^2 \end{aligned}$$

Show that $\sum_{n=1}^\infty h_n^+ < \infty$:

Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2$, we get

$$h_{n+1} - h_n \le -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2(A + Bh_n)$$
$$\Rightarrow\; h_{n+1} - (1 + \varepsilon_n^2 B)h_n \le -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$$

Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get

$$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A.$$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable.

Show that this implies that $h_n$ converges:

If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n)$ also converges (why? $h_n \ge 0$).

But $h_{n+1} = h_1 + \sum_{k=1}^n \left(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\right)$.

So $(h_n)_{n=1}^\infty$ converges.

Show that $h_n$ must converge to 0:

Assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2\|\nabla C(s_n)\|_2^2.$$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0,$$

contradicting $h_n$ converging away from 0.

Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N}\sum_{n=1}^N f_n(x)$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K}\sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
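Putting the pieces together: SGD on a toy cost, with the minibatch estimate playing the role of the unbiased gradient estimator (the quadratic cost and the step schedule are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 10_000, 32
data = rng.normal(loc=1.0, size=N)

def grad_f(w, x):
    """Per-datapoint gradient of the cost (w - x)**2 / 2."""
    return w - x

w = 5.0
for n in range(1, 20_001):
    eps_n = 1.0 / n
    batch = data[rng.integers(0, N, size=K)]     # I_K: random size-K subsample
    g_hat = np.mean(grad_f(w, batch))            # unbiased estimate of grad C(w)
    w -= eps_n * g_hat

print(w)   # close to the data mean, the minimiser of C
```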

Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if

- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\left[(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\,\mathbb{1}_{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0}\right] < \infty,$$

then $X_n \to X_\infty$ almost surely.

3. On a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.

Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies:

1. $C$ has a unique minimiser $x^\star$
2. $\forall \varepsilon > 0,\; \inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$
3. $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

Stochastic Gradient Descent

Proof sketch:

1. Define Lyapunov process $h_n = \|s_n - x^\star\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^\star$

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$:

By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n\langle s_n - x^\star, H_n(s_n) \rangle + \varepsilon_n^2\|H_n(s_n)\|_2^2$$

So

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2\,\mathbb{E}\left[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\right]$$

Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

- assume $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^\star\|_2^2$
- introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

Get

$$\mathbb{E}\left[h'_{n+1} - h'_n \mid \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$$
$$\Rightarrow\; \mathbb{E}\left[(h'_{n+1} - h'_n)\,\mathbb{1}_{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0} \,\middle|\, \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\Rightarrow$ quasimartingale convergence $\Rightarrow$ $(h'_n)_{n=1}^\infty$ converges a.s. $\Rightarrow$ $(h_n)_{n=1}^\infty$ converges a.s.

Show that $h_n$ must converge to 0 almost surely. From previous calculations,

$$\mathbb{E}\left[h_{n+1} - (1 + \varepsilon_n^2 B)h_n \mid \mathcal{F}_n\right] \le -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A$$

$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the right term is summable a.s., and hence the remaining term is also summable a.s.:

$$\sum_{n=1}^\infty \varepsilon_n\langle s_n - x^\star, \nabla C(s_n) \rangle < \infty \quad \text{almost surely.}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely,}$$

and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above is to assume:

- $C$ is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- the gradient estimate $H$ satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$

Idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.

Motivation

So far we've told you why SGD is "safe" :)

...but Robbins-Monro is just a sufficient condition.

Then how to choose learning rates to achieve

- fast convergence?
- a better local optimum?

Motivation (cont)

[Figure from Bergstra and Bengio, 2012]

Idea 1: search for the best learning rate schedule:

- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time!

Motivation (cont)

Idea 2: let the learning rates adapt themselves:

- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

...and some practical comparisons.

Online learning

From batch to online learning

Batch learning often assumes:

- we've got the whole dataset $\mathcal{D}$
- the cost function $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD/mini-batch learning accelerates training in real time by:

- processing one / a mini-batch of datapoints each iteration
- considering gradients with per-datapoint cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$

Online learning

From batch to online learning (cont)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

- each iteration $t$, we receive a loss $f_t(w)$
- and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD the received gradient is defined by

$$g_t = \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$
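A minimal OGD loop on a stream of squared losses (the data stream, true weights, and constant $\eta$ are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
w_true = np.array([1.0, -2.0])
eta, w = 0.05, np.zeros(2)

for t in range(1, 2_001):
    x = rng.normal(size=2)                       # data point arriving at time t
    y = x @ w_true + 0.1 * rng.normal()
    g_t = 2 * (w @ x - y) * x                    # gradient of f_t(w) = (w.x - y)**2
    w = w - eta * g_t                            # OGD update

print(w)   # close to w_true
```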


Online learning

From batch to online learning (cont)

Online learning assumes:

- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$$R^\star_T = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \qquad (1)$$

General definition: $R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization

Online learning

From batch to online learning (cont)

Follow the Leader (FTL):

$$w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t f_i(w) \qquad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL. Then $\forall u \in S$,

$$R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right] \qquad (3)$$

Online learning

From batch to online learning (cont)

A game that fools FTL:

Let $w \in S = [-1, 1]$, and the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled!
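A short simulation of this game (choosing $w_1 = 0$ for the empty history, since any point of $S$ minimises it): FTL pays about 1 per round, while the best fixed point pays O(1) in total.

```python
def f(t, w):
    """The fooling loss sequence: -0.5w, then alternating w (even t), -w (odd t > 1)."""
    if t == 1:
        return -0.5 * w
    return w if t % 2 == 0 else -w

def cum(w, t):
    return sum(f(i, w) for i in range(1, t + 1))

T, endpoints = 20, (-1.0, 1.0)
ws = []
for t in range(1, T + 1):
    # FTL: cumulative loss is linear in w on S = [-1, 1], so an endpoint minimises it
    w_t = 0.0 if t == 1 else min(endpoints, key=lambda w: cum(w, t - 1))
    ws.append(w_t)

ftl_loss = sum(f(t, w) for t, w in zip(range(1, T + 1), ws))
best_fixed = min(cum(w, T) for w in endpoints)
print(ftl_loss, best_fixed)   # FTL's regret grows linearly with T
```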

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL. Then $\forall u \in S$,

$$\sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right]$$

Online learning

From batch to online learning (cont)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)$$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g_t \rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$$

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (upper bound on the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$$

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$, with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$. Then for all $u \in S$,

$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta\sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$$

Keep in mind for the later discussion of adaGrad:

- I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$
- then the L2 norm will be changed to the dual norm $\|\cdot\|_\star$
- and we use Hölder's inequality in the proof

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$:

- $\psi_t$ is a strongly-convex function, and
- $B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t \rangle$

Primal-dual sub-gradient update:

$$w_{t+1} = \operatorname*{argmin}_{w}\; \eta\langle w, g_t \rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \operatorname*{argmin}_{w}\; \eta\langle w, g_t \rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^\top, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$w_{t+1} = \operatorname*{argmin}_{w \in S}\; \eta\langle w, g_t \rangle + \eta\varphi(w) + \frac{1}{2t}w^\top H_t w \qquad (8)$$

$$w_{t+1} = \operatorname*{argmin}_{w \in S}\; \eta\langle w, g_t \rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^\top H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2}w^\top H_t w$$

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \operatorname*{argmin}_{w}\; \eta\langle w, g_t \rangle + \frac{1}{2}(w - w_t)^\top \left[\delta I + \mathrm{diag}(G_t)^{1/2}\right](w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \qquad (10)$$

(with the division taken elementwise)
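Equation (10) is a per-coordinate update and is straightforward to implement; a minimal sketch (the toy objective and the constants $\eta$, $\delta$ are our own choices):

```python
import numpy as np

def adagrad_step(w, g, G, eta=0.5, delta=1e-8):
    """One adaGrad step (eq. 10): per-coordinate steps scaled by accumulated g**2."""
    G = G + g**2
    w = w - eta * g / (delta + np.sqrt(G))
    return w, G

# usage on the toy objective f(w) = 0.5 * ||w||**2, whose gradient is w
w, G = np.array([1.5, -2.0]), np.zeros(2)
for t in range(500):
    w, G = adagrad_step(w, w, G)

print(w)   # approaches the minimiser (0, 0)
```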

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont)

WHY? Time-varying learning rate vs. time-varying proximal term:

I want the contribution of $\|g_t\|_\star^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$ terms:

- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot \rangle}$
- $\psi_t(w) = \frac{1}{2}w^\top H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^\star}^2 \le 2\sum_{d=1}^D \|g_{1:T,d}\|_2$$

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont)

Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8), with $\delta \ge \max_t \|g_t\|_\infty$. Then for any $u \in S$,

$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$$

For $w_t$ generated using the composite mirror descent update (9),

$$R_T(u) \le \frac{1}{2\eta}\max_{t \le T}\|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$$

Advanced adaptive learning rates

Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:

- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop (improving direction: use a running average in place of the full sum):

$$G_t = \rho G_{t-1} + (1 - \rho)g_t g_t^\top, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \operatorname*{argmin}_{w}\; \eta\langle w, g_t \rangle + \frac{1}{2}(w - w_t)^\top H_t (w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$$
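The same update written as code (a sketch; the $\rho$, $\eta$, $\delta$ defaults are our own, typical choices):

```python
import numpy as np

def rmsprop_step(w, g, G, eta=0.01, rho=0.9, delta=1e-8):
    """One RMSprop step (eq. 11): an exponential running average of g**2
    replaces adaGrad's ever-growing sum."""
    G = rho * G + (1 - rho) * g**2
    w = w - eta * g / np.sqrt(delta + G)
    return w, G
```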

Advanced adaptive learning rates

Improving adaGrad (cont)

Approximate second-order information from the ratio of running averages:

$$\Delta w \propto H^{-1}g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2} \;\Rightarrow\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w} \;\Rightarrow\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \operatorname*{argmin}_{w}\; \langle w, g_t \rangle + \frac{1}{2}(w - w_t)^\top H_t (w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12)$$
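And as code (a sketch; the $\rho$ and $\delta$ defaults are our own, typical choices). Note there is no global $\eta$: the running RMS of past updates supplies the scale:

```python
import numpy as np

def adadelta_step(w, g, G, D, rho=0.95, delta=1e-6):
    """One adaDelta step (eq. 12)."""
    G = rho * G + (1 - rho) * g**2                     # running average of g**2
    dw = -np.sqrt(D + delta) / np.sqrt(G + delta) * g  # RMS[dw] / RMS[g] scaling
    D = rho * D + (1 - rho) * dw**2                    # running average of dw**2
    return w + dw, G, D
```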

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM (improving directions: momentum, plus bias correction of the running averages):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13)$$
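And as code (a sketch; the defaults follow the commonly used $\beta_1 = 0.9$, $\beta_2 = 0.999$):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
    """One ADAM step (eq. 13), with bias-corrected running averages."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)        # bias correction; t starts at 1
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + delta)
    return w, m, v
```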

Advanced adaptive learning rates

Demo time!

Wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization

Advanced adaptive learning rates

Learn the learning rates

Idea 3: learn the (global) learning rates as well:

$$w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1})$$

$w_t$ is a function of $\eta$, so learn $\eta$ (with gradient descent) by

$$\eta = \operatorname*{argmin}_\eta \sum_{s \le t} f_s(w(s, \eta))$$

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
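A crude one-step version of the 'on the fly' idea (this specific rule is our own illustrative simplification, not the algorithm of either cited paper): since $w_t = w_{t-1} - \eta g_{t-1}$, we have $\partial f_t(w_t)/\partial \eta = -\langle g_t, g_{t-1} \rangle$, so $\eta$ itself can be updated by gradient descent:

```python
import numpy as np

def grad_C(w):
    return w                           # toy quadratic objective 0.5 * ||w||**2

w, eta, beta = np.array([1.5, 0.5]), 0.01, 1e-3
g_prev = grad_C(w)
for t in range(500):
    g = grad_C(w)
    eta += beta * (g @ g_prev)         # hypergradient step on eta (our simplification)
    w = w - eta * g                    # ordinary gradient step with the adapted eta
    g_prev = g

print(eta, w)                          # eta grows from its small initial value; w -> 0
```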

Advanced adaptive learning rates

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce labour on tuning hyper-parameters.

Advanced adaptive learning rates

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Page 2: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Plan

I History and modern formulation of stochastic approximation theory

I In-depth look at stochastic gradient descent (SGD)

I Introduction to key ideas in stochastic approximation theory suchas Lyapunov functions quasimartingales and also numericalsolutions to differential equations

1 of 27

Bibliography

Stochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Lin (2003)

Online Learning and Stochastic Approximation Leon Bottou (1998)

A Stochastic Approximation Algorithm Robbins amp Monro (1951)

Stochastic Estimation of the Maximisation of a Regression FunctionKiefer amp Wolfowitz (1952)

Introduction to Stochastic Search and Optimization Spall (2003)

2 of 27

A brief history of numerical analysis

Numerical Analysis is a well-established discipline

c 1800 - 1600 BC

Babyloniansattempted tocalculate

radic2 or in

modern terms findthe roots ofx2 minus 2 = 0

1

1httpwwwmathubcca˜cassEuclidybcybchtml

Yale Babylonian Collection

3 of 27

A brief history of numerical analysis

C17-C19Huge number of numerical techniques developed as tools for thenatural sciences

I Root-finding methods (eg Newton-Raphson)

I Numerical integration (eg Gaussian Quadrature)

I ODE solution (eg Euler method)

I Interpolation (eg Lagrange polynomials)

Focus is on situations where function evaluation is deterministic

4 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo

Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator

7 of 27

A brief history of numerical analysis

Present dayStochastic approximation widely researched and used in practice

I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent

I Neural networksI K-MeansI

I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu

amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))

TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)

8 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27


Continuous Gradient Descent

1. Define Lyapunov function $h(t) = \|s(t) - x^*\|_2^2$
2. $h(t)$ is a decreasing function
3. $h(t)$ converges to a limit, and $h'(t)$ converges to 0
4. The limit of $h(t)$ must be 0
5. $s(t) \to x^*$

$h(t)$ is a decreasing function:

$$\frac{d}{dt}h(t) = \frac{d}{dt}\|s(t) - x^*\|_2^2 = \frac{d}{dt}\langle s(t) - x^*, s(t) - x^*\rangle = 2\left\langle \frac{ds(t)}{dt},\ s(t) - x^*\right\rangle = -2\,\langle s(t) - x^*, \nabla C(s(t))\rangle \le 0$$

The limit of $h(t)$ must be 0: if not, there is some $K > 0$ such that $h(t) > K$ for all $t$. By assumption,

$$\inf_{\|x - x^*\|_2^2 > K} \langle x - x^*, \nabla C(x)\rangle > 0,$$

which means that $\sup_t h'(t) < 0$, contradicting the convergence of $h'(t)$ to 0.

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!


Discrete Gradient Descent

An intuitive, straightforward discretisation of

$$\frac{d}{dt}s(t) = -\nabla C(s(t))$$

is

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n) \qquad \text{(forward Euler)}$$

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method, which is (A-)stable, is the implicit Euler method:

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})$$

So why settle for explicit discretisation? The implicit method requires the solution of a non-linear system at each step, and practically, explicit discretisation works!

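To illustrate the stability difference (our addition, not from the slides): for the quadratic example, $\nabla C$ is linear, so the implicit Euler step reduces to solving a linear system. With a step size just past forward Euler's stability threshold, the explicit iterates blow up while the implicit ones stay stable.

```python
import numpy as np

A = np.array([[12., 8.], [8., 12.]])       # for this example, grad_C(x) = A @ x

def forward_euler_step(s, eps):
    return s - eps * (A @ s)               # explicit update

def implicit_euler_step(s, eps):
    # solve s' = s - eps * grad_C(s'), i.e. (I + eps * A) s' = s  (linear here)
    return np.linalg.solve(np.eye(2) + eps * A, s)

s_exp = s_imp = np.array([1.5, 0.5])
eps = 0.11             # just past forward Euler's stability limit 2/20 = 0.1
for _ in range(100):
    s_exp = forward_euler_step(s_exp, eps)
    s_imp = implicit_euler_step(s_imp, eps)

print(s_exp)   # oscillates with growing magnitude (unstable)
print(s_imp)   # stable, tends towards the minimiser (0, 0)
```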

Discrete Gradient Descent

We also have multistep methods, e.g.

$$s_{n+2} = s_{n+1} - \varepsilon_{n+2}\left(\frac{3}{2}\nabla C(s_{n+1}) - \frac{1}{2}\nabla C(s_n)\right)$$

This is the 2nd-order Adams-Bashforth method.

It has similar (in)stability properties to forward Euler, but a discretisation error of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).
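A minimal sketch of the two-step scheme (ours, not from the slides; it bootstraps the first step with a single forward Euler step, which is one common choice, not necessarily what the slides' animations used):

```python
import numpy as np

def grad_C(x):
    return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

def adams_bashforth2(s0, eps, n_steps):
    """Two-step Adams-Bashforth applied to ds/dt = -grad_C(s)."""
    s_prev, s = s0, s0 - eps * grad_C(s0)   # one Euler step to get started
    for _ in range(n_steps):
        s_prev, s = s, s - eps * (1.5 * grad_C(s) - 0.5 * grad_C(s_prev))
    return s

print(adams_bashforth2(np.array([1.5, 0.5]), eps=0.01, n_steps=1000))  # -> near (0, 0)
```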

Discrete Gradient Descent

[Videos: examples of discretisation with ε = 0.1, ε = 0.07 and ε = 0.01]

Discrete Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies

1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x)\rangle > 0$
3. $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.

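A small demonstration of the proposition (our sketch, not from the slides): run the recursion on the earlier quadratic example with step sizes $\varepsilon_n = 1/(20 + n)$, which satisfy the two summability conditions.

```python
import numpy as np

def grad_C(x):
    return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

s = np.array([1.5, 0.5])
for n in range(1, 10001):
    eps_n = 1.0 / (20.0 + n)   # satisfies sum eps_n = inf and sum eps_n^2 < inf
    s = s - eps_n * grad_C(s)

print(s)                        # approaches the unique minimiser x* = (0, 0)
```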

Discrete Gradient Descent

1. Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^*$

Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$. Because of the error that the discretisation introduces, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. [Illustration credit: Online Learning and Stochastic Approximations, Léon Bottou (1998)]

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. With some algebra, note that

$$\begin{aligned}
h_{n+1} - h_n &= \langle s_{n+1} - x^*, s_{n+1} - x^*\rangle - \langle s_n - x^*, s_n - x^*\rangle \\
&= \langle s_{n+1}, s_{n+1}\rangle - \langle s_n, s_n\rangle - 2\,\langle s_{n+1} - s_n, x^*\rangle \\
&= \langle s_n - \varepsilon_n\nabla C(s_n),\ s_n - \varepsilon_n\nabla C(s_n)\rangle - \langle s_n, s_n\rangle + 2\varepsilon_n\,\langle \nabla C(s_n), x^*\rangle \\
&= -2\varepsilon_n\,\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\,\|\nabla C(s_n)\|_2^2
\end{aligned}$$

Show that $\sum_{n=1}^\infty h_n^+ < \infty$: assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get

$$h_{n+1} - h_n \le -2\varepsilon_n\,\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2(A + Bh_n)$$
$$\implies h_{n+1} - (1 + \varepsilon_n^2 B)h_n \le -2\varepsilon_n\,\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$$

Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get

$$h'_{n+1} - h'_n \le \varepsilon_n^2\,\mu_n A$$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable.

Show that this implies that $h_n$ converges: if $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$ then, since $h_n \ge 0$, also $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why?).

But $h_{n+1} = h_1 + \sum_{k=1}^{n}\left(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\right)$.

So $(h_n)_{n=1}^\infty$ converges.

Show that $h_n$ must converge to 0: assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2\varepsilon_n\,\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\,\|\nabla C(s_n)\|_2^2$$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$$\langle s_n - x^*, \nabla C(s_n)\rangle \to 0,$$

contradicting $h_n$ converging away from 0.

Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N}\sum_{n=1}^N f_n(x)$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K}\sum_{k \in I_K} f_k(x) \qquad (I_K \sim \mathrm{Unif}(\text{subsets of size } K))$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

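To make the subsampling estimator concrete, here is a toy sketch (ours, not from the slides; the quadratic per-datum cost is an arbitrary stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                 # toy data set, N = 1000

def f_n(x, x_n):
    """Per-datum gradient term: here the gradient of 0.5 * ||x - x_n||^2."""
    return x - x_n

def true_gradient(x):
    return np.mean([f_n(x, xi) for xi in X], axis=0)

def subsampled_gradient(x, K=32):
    idx = rng.choice(len(X), size=K, replace=False)   # I_K ~ Unif(subsets of size K)
    return np.mean([f_n(x, X[i]) for i in idx], axis=0)

x = np.zeros(2)
print(true_gradient(x))         # the exact average over all N terms
print(subsampled_gradient(x))   # an unbiased but noisy estimate of it
```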

Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if

- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\left[\left(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n\right)\mathbb{1}_{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0}\right] < \infty$$

then $X_n \to X_\infty$ almost surely.

³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$


Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x)\rangle > 0$
3. $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

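A toy run of the proposition's setting (our sketch, not from the slides): the unbiased estimator here is the true gradient plus independent Gaussian noise, which satisfies condition 3.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_C(x):
    return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

def H_n(x):
    """Unbiased gradient estimate: the true gradient plus independent noise."""
    return grad_C(x) + rng.normal(scale=1.0, size=2)

s = np.array([1.5, 0.5])
for n in range(1, 100001):
    s = s - (1.0 / (20.0 + n)) * H_n(s)    # Robbins-Monro step sizes
print(s)                                    # close to x* = (0, 0)
```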

Stochastic Gradient Descent

1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n\,\langle s_n - x^*, H_n(s_n)\rangle + \varepsilon_n^2\,\|H_n(s_n)\|_2^2$$

So

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n\,\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\,\mathbb{E}\left[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\right]$$

Show that $h_n$ converges almost surely: exactly the same as for discrete gradient descent,

- assume $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^*\|_2^2$
- introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

Get

$$\mathbb{E}\left[h'_{n+1} - h'_n \mid \mathcal{F}_n\right] \le \varepsilon_n^2\,\mu_n A$$
$$\implies \mathbb{E}\left[(h'_{n+1} - h'_n)\,\mathbb{1}_{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0} \,\middle|\, \mathcal{F}_n\right] \le \varepsilon_n^2\,\mu_n A$$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h'_n)_{n=1}^\infty$ converges a.s. $\implies$ $(h_n)_{n=1}^\infty$ converges a.s.

Show that $h_n$ must converge to 0 almost surely: from previous calculations,

$$\mathbb{E}\left[h_{n+1} - (1 + \varepsilon_n^2 B)h_n \mid \mathcal{F}_n\right] \le -2\varepsilon_n\,\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A$$

$(h_n)_{n=1}^\infty$ converges, so the left-hand sequence is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the right term is summable a.s.; hence the remaining term is also summable a.s.:

$$\sum_{n=1}^\infty \varepsilon_n\,\langle s_n - x^*, \nabla C(s_n)\rangle < \infty \quad \text{almost surely}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^*, \nabla C(s_n)\rangle \to 0 \quad \text{almost surely,}$$

and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above is to assume:

- $C$ is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- the gradient estimate $H$ satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.


Motivation

So far we've told you why SGD is "safe" :)

...but Robbins-Monro is just a sufficient condition.

So how do we choose learning rates to achieve

- fast convergence?
- a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule

- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time on it!

Motivation (cont.)

idea 2: let the learning rates adapt themselves

- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

...and give some practical comparisons.

Online learning

From batch to online learning

Batch learning often assumes:

- we've got the whole dataset $\mathcal{D}$
- the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by

- processing one datapoint / a mini-batch of datapoints each iteration
- considering gradients built from data-point cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$

Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

- each iteration $t$, we receive a loss $f_t(w)$
- and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD the received gradient is defined by

$$g_t = \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$
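A minimal OGD loop (our sketch, not from the slides; the absolute-loss stream is an arbitrary example):

```python
import numpy as np

def ogd(subgradient_oracle, w0, eta, T):
    """Online gradient descent: at each round t, receive g_t and take a step."""
    w = w0
    for t in range(T):
        g_t = subgradient_oracle(w, t)
        w = w - eta * g_t
    return w

# Example: losses f_t(w) = |w - z_t| for a stream z_t; sign(w - z_t) is a subgradient.
z = np.sin(0.1 * np.arange(200))
w_final = ogd(lambda w, t: np.sign(w - z[t]), w0=0.0, eta=0.05, T=200)
print(w_final)
```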


Online learning

From batch to online learning (cont.)

Online learning assumes:

- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S}\sum_{t=1}^T f_t(w) \qquad (1)$$

General definition: $R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):

$$w_{t+1} = \operatorname*{argmin}_{w \in S}\ \sum_{i=1}^t f_i(w) \qquad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \dots$ be the sequence produced by FTL. Then $\forall u \in S$:

$$R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right] \qquad (3)$$

Online learning

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$, and let the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled!
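The game can be simulated directly (our sketch; the tie-breaking rule at slope zero is our arbitrary choice): FTL's cumulative loss grows linearly in $T$, while the fixed comparator $u = 0$ suffers zero loss.

```python
def loss(t, w):
    """The loss sequence from the slide (t is 1-indexed)."""
    if t == 1:
        return -0.5 * w
    return w if t % 2 == 0 else -w

def ftl_play(t):
    """FTL: minimise the cumulative (linear) loss over S = [-1, 1] after round t-1."""
    slope = sum(-0.5 if i == 1 else (1.0 if i % 2 == 0 else -1.0) for i in range(1, t))
    return -1.0 if slope > 0 else 1.0   # tie-breaking at slope 0 is arbitrary

T, total, w = 20, 0.0, 0.0              # w = 0 is an arbitrary first play
for t in range(1, T + 1):
    if t > 1:
        w = ftl_play(t)
    total += loss(t, w)

print(total)   # grows like T - 1, while u = 0 has zero total loss
```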

Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \operatorname*{argmin}_{w \in S}\ \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \dots$ be the sequence produced by FTRL. Then $\forall u \in S$:

$$\sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right]$$

Online learning

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S:\quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle, \qquad \forall g(w) \in \partial f_t(w)$$

Use the linearisation $\tilde f_t(w) = f_t(w_t) + \langle w, g(w_t)\rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t\rangle - \langle u, g_t\rangle, \qquad \forall g_t \in \partial f_t(w_t)$$

Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \qquad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (it upper-bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \operatorname*{argmin}_{w \in S}\ \sum_{i=1}^t \langle w, g_i\rangle + \varphi(w)$$

Online learning

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \dots, f_T$, with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$. Then for all $u \in S$:

$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta\sum_{t=1}^T \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t)$$

Keep in mind for the later discussion of adaGrad:

- we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$
- then the L2 norm above is changed to the dual norm $\|\cdot\|_*$
- and we use Hölder's inequality in the proof

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$:

- $\psi_t$ is a strongly-convex function, and
- $B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle$

Primal-dual sub-gradient update:

$$w_{t+1} = \operatorname*{argmin}_w\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \operatorname*{argmin}_w\ \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$w_{t+1} = \operatorname*{argmin}_{w \in S}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t}\,w^T H_t w \qquad (8)$$

$$w_{t+1} = \operatorname*{argmin}_{w \in S}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time: $\psi_t(w) = \frac{1}{2}w^T H_t w$

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \operatorname*{argmin}_w\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T\left[\delta I + \mathrm{diag}(G_t)^{1/2}\right](w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \qquad (10)$$

(with the division applied elementwise)
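Equation (10) in code (our sketch, not from the slides, with $\varphi = 0$ and our own function names):

```python
import numpy as np

def adagrad(grad, w0, eta=0.5, delta=1e-8, n_steps=500):
    """Diagonal adaGrad in composite mirror descent form, eq. (10)."""
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)                  # running sum of squared gradients
    for t in range(n_steps):
        g = grad(w)
        G += g * g
        w -= eta * g / (delta + np.sqrt(G))   # elementwise division
    return w

grad_C = lambda w: np.array([12 * w[0] + 8 * w[1], 8 * w[0] + 12 * w[1]])
print(adagrad(grad_C, [1.5, 0.5]))        # -> towards (0, 0)
```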

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? A time-varying learning rate vs. a time-varying proximal term:

- we want the contribution of the dual norms $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot,\ H_t\,\cdot\rangle}$
- $\psi_t(w) = \frac{1}{2}w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t,*}^2 \le 2\sum_{d=1}^D \|g_{1:T,d}\|_2$$

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$. Then for any $u \in S$:

$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$$

For $w_t$ generated using the composite mirror descent update (9):

$$R_T(u) \le \frac{1}{2\eta}\max_{t \le T}\|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$$

Advanced adaptive learning rates

Improving adaGrad

Possible problems with adaGrad:

- the global learning rate $\eta$ still needs hand tuning
- it is sensitive to initial conditions

Possible improving directions:

- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Advanced adaptive learning rates

Improving adaGrad (cont.)

RMSprop (improving direction: replace the growing sum with a running average):

$$G_t = \rho G_{t-1} + (1 - \rho)\,g_t g_t^T, \qquad H_t = \left(\delta I + \mathrm{diag}(G_t)\right)^{1/2}$$

$$w_{t+1} = \operatorname*{argmin}_w\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta\, g_t}{RMS[g]_t}, \qquad RMS[g]_t = \mathrm{diag}(H_t) \qquad (11)$$
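A sketch of the RMSprop update (11) (ours, not from the slides; the hyper-parameter values are illustrative):

```python
import numpy as np

def rmsprop(grad, w0, eta=0.01, rho=0.9, delta=1e-8, n_steps=2000):
    """RMSprop: adaGrad's growing sum replaced by an exponential running average."""
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)                    # running average of squared gradients
    for t in range(n_steps):
        g = grad(w)
        G = rho * G + (1 - rho) * g * g
        w -= eta * g / np.sqrt(delta + G)   # denominator = diag(H_t) in eq. (11)
    return w

print(rmsprop(lambda w: np.array([12 * w[0] + 8 * w[1], 8 * w[0] + 12 * w[1]]), [1.5, 0.5]))
```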

Advanced adaptive learning rates

Improving adaGrad (cont.)

A second-order motivation (improving direction: use (approximate) second-order information):

$$\Delta w \propto H^{-1}g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2} \implies \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w} \implies \mathrm{diag}(H) \approx \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}$$

Advanced adaptive learning rates

Improving adaGrad (cont.)

adaDelta:

$$\mathrm{diag}(H_t) = \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}$$

$$w_{t+1} = \operatorname*{argmin}_w\ \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{RMS[\Delta w]_{t-1}}{RMS[g]_t}\, g_t \qquad (12)$$
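A sketch of the adaDelta update (12) (ours, not from the slides; placing the smoothing constant inside both RMS terms follows the usual convention, an assumption on our part):

```python
import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, n_steps=5000):
    """adaDelta: no global learning rate; units are fixed by the RMS of past updates."""
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)                   # running average of g^2
    D = np.zeros_like(w)                   # running average of (delta_w)^2
    for t in range(n_steps):
        g = grad(w)
        G = rho * G + (1 - rho) * g * g
        dw = -(np.sqrt(D + delta) / np.sqrt(G + delta)) * g   # RMS[dw]/RMS[g] * g
        w += dw
        D = rho * D + (1 - rho) * dw * dw
    return w

print(adadelta(lambda w: np.array([12 * w[0] + 8 * w[1], 8 * w[0] + 12 * w[1]]), [1.5, 0.5]))
```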

Advanced adaptive learning rates

Improving adaGrad (cont.)

ADAM (improving directions: add momentum; correct the bias of the running averages):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,g_t^2$$

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \delta} \qquad (13)$$
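And equation (13) in code (our sketch, not from the slides; the default $\beta$ values follow the ADAM paper's suggestions):

```python
import numpy as np

def adam(grad, w0, eta=0.1, beta1=0.9, beta2=0.999, delta=1e-8, n_steps=1000):
    """ADAM: momentum on g and g^2 with bias correction, eq. (13)."""
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)                   # first-moment running average
    v = np.zeros_like(w)                   # second-moment running average
    for t in range(1, n_steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)       # bias-corrected, since m_0 = 0
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w

print(adam(lambda w: np.array([12 * w[0] + 8 * w[1], 8 * w[0] + 12 * w[1]]), [1.5, 0.5]))
```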

Advanced adaptive learning rates

Demo time!

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Advanced adaptive learning rates

Learn the learning rates

idea 3: learn the (global) learning rates as well

$$w_{t+1} = w_t - \eta g_t \implies w_t = w(t, \eta;\ w_1, \{g_i\}_{i \le t-1})$$

- $w_t$ is a function of $\eta$
- learn $\eta$ (with gradient descent) by

$$\eta^* = \operatorname*{argmin}_\eta \sum_{s \le t} f_s(w(s, \eta))$$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]

Advanced adaptive learning rates

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.

Advanced adaptive learning rates

References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J. et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P.-Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 3: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Bibliography

Stochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Lin (2003)

Online Learning and Stochastic Approximation Leon Bottou (1998)

A Stochastic Approximation Algorithm Robbins amp Monro (1951)

Stochastic Estimation of the Maximisation of a Regression FunctionKiefer amp Wolfowitz (1952)

Introduction to Stochastic Search and Optimization Spall (2003)

2 of 27

A brief history of numerical analysis

Numerical Analysis is a well-established discipline

c 1800 - 1600 BC

Babyloniansattempted tocalculate

radic2 or in

modern terms findthe roots ofx2 minus 2 = 0

1

1httpwwwmathubcca˜cassEuclidybcybchtml

Yale Babylonian Collection

3 of 27

A brief history of numerical analysis

C17-C19Huge number of numerical techniques developed as tools for thenatural sciences

I Root-finding methods (eg Newton-Raphson)

I Numerical integration (eg Gaussian Quadrature)

I ODE solution (eg Euler method)

I Interpolation (eg Lagrange polynomials)

Focus is on situations where function evaluation is deterministic

4 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo

Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator

7 of 27

A brief history of numerical analysis

Present dayStochastic approximation widely researched and used in practice

I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent

I Neural networksI K-MeansI

I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu

amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))

TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)

8 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

With some algebra, note that

$h_{n+1} - h_n = \langle s_{n+1} - x^*, s_{n+1} - x^* \rangle - \langle s_n - x^*, s_n - x^* \rangle$
$\quad = \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2\langle s_{n+1} - s_n, x^* \rangle$
$\quad = \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2\varepsilon_n \langle \nabla C(s_n), x^* \rangle$
$\quad = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$

Show that $\sum_{n=1}^\infty h_n^+ < \infty$.

Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get

$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + B h_n)$
$\Rightarrow\ h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$

Writing $\mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$, we get

$h_{n+1}' - h_n' \le \varepsilon_n^2 \mu_n A$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the right-hand side is summable, and hence $\sum_{n=1}^\infty h_n^+ < \infty$.

Show that this implies that $h_n$ converges.

If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why?).

But $h_{n+1} = h_1 + \sum_{k=1}^{n} \big( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \big)$,

so $(h_n)_{n=1}^\infty$ converges.

Show that $h_n$ must converge to 0.

Assume $h_n$ converges to some positive number. Previously we had

$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$

Since $h_n$ converges, the differences are summable, and the term $\varepsilon_n^2 \|\nabla C(s_n)\|_2^2$ is summable too, so $\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty$. If we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0$,

contradicting $h_n$ converging away from 0 (by the second assumption on $C$). So $h_n \to 0$, i.e. $s_n \to x^*$.
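To make the convergence statement concrete, here is a minimal numerical check (my addition, not part of the slides): run the recursion $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ with Robbins-Monro rates $\varepsilon_n = 1/n$ on the quadratic $C(x) = 5(x_1+x_2)^2 + (x_1-x_2)^2$ from the continuous-descent example, and watch the Lyapunov sequence vanish.

```python
import numpy as np

# Discrete gradient descent on C(x) = 5(x1+x2)^2 + (x1-x2)^2 (minimiser x* = 0)
# with step sizes eps_n = 1/n: sum eps_n diverges, sum eps_n^2 converges.
def grad_C(x):
    return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

s = np.array([1.5, 0.5])
for n in range(1, 10001):
    s = s - (1.0 / n) * grad_C(s)

print(np.sum(s ** 2))   # Lyapunov value h_n = ||s_n - x*||_2^2, ~0
```

The early iterates overshoot badly (the first steps are far larger than the stable range for this quadratic), yet the decaying step sizes still drag $h_n$ to 0, which is exactly what the proposition promises.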

Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$\nabla C(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x)$

and an approximation is formed by subsampling:

$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
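As a quick sanity check on the subsampling step (my sketch; the per-datapoint gradient terms $f_n$ below are hypothetical stand-ins), averaging many minibatch gradients recovers the full-batch gradient, i.e. the estimator is unbiased:

```python
import numpy as np

# Minibatch estimate of grad C(x) = (1/N) sum_n f_n(x) by uniform subsampling.
rng = np.random.default_rng(0)
N, K, D = 1000, 10, 3
data = rng.normal(size=(N, D))
x = rng.normal(size=D)

def f(n, x):                      # per-datapoint gradient term f_n(x)
    return 2 * (x - data[n])

full = np.mean([f(n, x) for n in range(N)], axis=0)
est = np.mean([
    np.mean([f(k, x) for k in rng.choice(N, size=K, replace=False)], axis=0)
    for _ in range(5000)          # average many independent minibatch draws
], axis=0)
print(np.allclose(full, est, atol=0.05))   # True
```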

Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if

- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$\sum_{n=1}^\infty \mathbb{E}\big[ (\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\, \mathbf{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}} \big] < \infty$

then $X_n \to X_\infty$ almost surely.

³ On a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.

Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement give the control we need over the unbiased gradient estimators.

Stochastic Gradient Descent

1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$

So

$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2\, \mathbb{E}[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n]$

Show that $h_n$ converges almost surely.

Exactly the same as for discrete gradient descent:

- Assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$
- Introduce $\mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$

We get

$\mathbb{E}[h_{n+1}' - h_n' \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A$
$\Rightarrow\ \mathbb{E}\big[ (h_{n+1}' - h_n')\, \mathbf{1}_{\{\mathbb{E}[h_{n+1}' - h_n' \mid \mathcal{F}_n] > 0\}} \big] \le \varepsilon_n^2 \mu_n A$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\Rightarrow$ quasimartingale convergence $\Rightarrow$ $(h_n')_{n=1}^\infty$ converges a.s. $\Rightarrow$ $(h_n)_{n=1}^\infty$ converges a.s.

Show that $h_n$ must converge to 0 almost surely.

From previous calculations,

$\mathbb{E}[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A$

$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable almost surely. Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, the right term $\varepsilon_n^2 A$ is summable, so the remaining term is also summable almost surely:

$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty$ almost surely.

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0$ almost surely,

and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely. Hence $s_n \to x^*$ almost surely.
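A small simulation (my addition) in exactly the proposition's setting: unbiased noisy gradients $H_n$, rates $\varepsilon_n = 1/n$, and a strongly convex quadratic, so $h_n = \|s_n - x^*\|_2^2$ is driven to 0 almost surely.

```python
import numpy as np

# SGD s_{n+1} = s_n - eps_n * H_n(s_n) with H_n an unbiased noisy gradient
# of C(x) = 0.5 * ||x - x_star||^2, and Robbins-Monro rates eps_n = 1/n.
rng = np.random.default_rng(1)
x_star = np.array([2.0, -1.0])
s = np.zeros(2)
for n in range(1, 200001):
    H = (s - x_star) + rng.normal(size=2)   # E[H | s] = grad C(s): unbiased
    s = s - (1.0 / n) * H

print(np.sum((s - x_star) ** 2))            # h_n, of order 1e-5 here
```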

Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- $C$ is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- the gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.

Motivation

(Yingzhen Li, University of Cambridge: Adaptive learning rates, Nov 26 2015)

So far we've told you why SGD is "safe" :)

- but Robbins-Monro is just a sufficient condition
- then: how do we choose learning rates to achieve fast convergence and a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

Idea 1: search for the best learning rate schedule

- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time!
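For concreteness, a minimal random-search sketch (my addition; `validation_loss` is a hypothetical stand-in for training a model at a given rate and reporting its validation loss):

```python
import numpy as np

# Random search over learning rates, sampled log-uniformly.
rng = np.random.default_rng(0)

def validation_loss(lr):          # stand-in: pretend the best rate is ~3e-3
    return (np.log10(lr) + 2.5) ** 2 + 0.1 * rng.random()

candidates = 10 ** rng.uniform(-5, 0, size=20)
best_lr = min(candidates, key=validation_loss)
print(f"best learning rate found: {best_lr:.2e}")
```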

Motivation (cont.)

Idea 2: let the learning rates adapt themselves

- pre-conditioning with $|\mathrm{diag}(H)|$ [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and some practical comparisons.

Online learning: from batch to online learning

Batch learning often assumes:

- we've got the whole dataset $\mathcal{D}$
- the cost function is $C(w; \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by

- processing one datapoint / a mini-batch of datapoints each iteration
- considering gradients of the per-datapoint cost functions:

$\nabla C(w; \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

- each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by

$g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$
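A minimal OGD loop (my sketch, with a made-up stream of noisy quadratic losses $f_t(w) = \frac{1}{2}\|w - x_t\|_2^2$):

```python
import numpy as np

# Online gradient descent: receive f_t, take g_t in its subdifferential at the
# current weights, then step w <- w - eta * g_t.
rng = np.random.default_rng(0)
eta, w = 0.05, np.zeros(2)
for t in range(1000):
    x_t = np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=2)
    g_t = w - x_t                 # gradient of f_t(w) = 0.5 * ||w - x_t||^2
    w = w - eta * g_t

print(w)                          # hovers near the common minimiser [1, -2]
```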


From batch to online learning (cont.)

Online learning assumes:

- each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$R_T^* = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w) \qquad (1)$

General definition: $R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

From batch to online learning (cont.)

Follow the Leader (FTL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) \qquad (2)$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL. Then $\forall u \in S$:

$R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})] \qquad (3)$

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$, and define the loss function at time $t$ by

$f_t(w) = \begin{cases} -0.5\, w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$

FTL is easily fooled!
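To see the failure concretely, this small script (my addition; the first play $w_1 = 0$ is an arbitrary choice the slides leave open) runs FTL against the sequence above. FTL suffers loss 1 on every round after the first, while the fixed comparator $u = 0$ suffers 0, so the regret grows linearly in $T$:

```python
# The game that fools Follow-the-Leader on S = [-1, 1], with f_t(w) = c_t * w.
def loss_coeff(t):
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T, cum, total, w = 20, 0.0, 0.0, 0.0   # cum tracks sum_i c_i; w_1 = 0
for t in range(1, T + 1):
    total += loss_coeff(t) * w         # suffer f_t(w_t)
    cum += loss_coeff(t)
    w = -1.0 if cum > 0 else 1.0       # FTL: minimise cum * w over [-1, 1]

print(total)                           # T - 1 = 19; u = 0 has total loss 0
```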

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w) \qquad (4)$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL. Then $\forall u \in S$:

$\sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})]$

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle \quad \forall g(w) \in \partial f_t(w)$

Use the linearisation $\tilde f_t(w) = f_t(w_t) + \langle w, g_t \rangle$, $g_t \in \partial f_t(w_t)$, as a surrogate:

$\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (it upper bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} \langle w, g_i \rangle + \varphi(w)$

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent (5) on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$. Then for all $u \in S$:

$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$

Keep in mind for the later discussion of adaGrad: if we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$, then the L2 norm of the gradients is changed to the dual norm $\|\cdot\|_*$, and we use Hölder's inequality in the proof.

Advanced adaptive learning rates: adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$

Primal-dual subgradient update:

$w_{t+1} = \arg\min_{w}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \qquad (6)$

Proximal gradient / composite mirror descent:

$w_{t+1} = \arg\min_{w}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$G_t = \sum_{i=1}^{t} g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$

$w_{t+1} = \arg\min_{w \in S}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w \qquad (8)$

$w_{t+1} = \arg\min_{w \in S}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \qquad (9)$

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time: $\psi_t(w) = \frac{1}{2} w^T H_t w$

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$w_{t+1} = \arg\min_{w}\ \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T \big[ \delta I + \mathrm{diag}(G_t)^{1/2} \big] (w - w_t)$

$\Rightarrow\ w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}} \quad \text{(elementwise)} \qquad (10)$
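A sketch of the diagonal adaGrad update (10) in the unconstrained case (my addition; the poorly scaled quadratic is just an illustrative choice):

```python
import numpy as np

# Diagonal adaGrad: per-coordinate step eta / (delta + sqrt(running sum g^2)).
def adagrad(grad, w0, eta=0.5, delta=1e-8, steps=500):
    w, G = w0.astype(float), np.zeros_like(w0, dtype=float)
    for _ in range(steps):
        g = grad(w)
        G += g ** 2                          # accumulate squared gradients
        w -= eta * g / (delta + np.sqrt(G))  # eq. (10), elementwise
    return w

# Minimise C(w) = 10 * w1^2 + 0.05 * w2^2: gradients differ by a factor 200,
# but the per-coordinate normalisation makes both coordinates progress.
print(adagrad(lambda w: np.array([20 * w[0], 0.1 * w[1]]),
              np.array([1.0, 1.0])))        # both coordinates near 0
```

The per-coordinate denominators are what make adaGrad robust to badly scaled problems: each coordinate effectively gets its own decaying learning rate.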

adaGrad [Duchi et al., 2010] (cont.)

WHY? A time-varying learning rate vs. a time-varying proximal term:

- we want the contribution of the gradients, measured in the dual norms $\|g_t\|_{\psi_t, *}^2$ induced by the $\psi_t$, to be upper bounded in terms of the observed gradients
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot \rangle}$; then $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$\sum_{t=1}^{T} \|g_t\|_{\psi_t, *}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T, d}\|_2$

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$. Then for any $u \in S$:

$R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T, d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T, d}\|_2$

For $w_t$ generated using the composite mirror descent update (9):

$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T, d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T, d}\|_2$

Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:

- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont.)

RMSprop:

$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$

$w_{t+1} = \arg\min_{w}\ \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$

$\Rightarrow\ w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$

(Improving direction used here: a running average in place of the full sum.)
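The corresponding loop for RMSprop, eq. (11) (my sketch; the $\rho$, $\eta$, $\delta$ values are conventional defaults, not from the slides):

```python
import numpy as np

# RMSprop: like adaGrad, but with an exponential moving average of g^2,
# so the effective step size does not decay to zero by construction.
def rmsprop(grad, w0, eta=0.01, rho=0.9, delta=1e-8, steps=2000):
    w, G = w0.astype(float), np.zeros_like(w0, dtype=float)
    for _ in range(steps):
        g = grad(w)
        G = rho * G + (1 - rho) * g ** 2
        w -= eta * g / np.sqrt(delta + G)    # RMS[g]_t = sqrt(delta + G)
    return w

print(rmsprop(lambda w: np.array([20 * w[0], 0.1 * w[1]]),
              np.array([1.0, 1.0])))         # ends near [0, 0], within O(eta)
```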

Improving adaGrad (cont.)

A Newton-like step motivates adaDelta's rescaling:

$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \;\Rightarrow\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \;\Rightarrow\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

(Improving direction used here: approximate second-order information.)

Improving adaGrad (cont.)

adaDelta:

$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

$w_{t+1} = \arg\min_{w}\ \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$

$\Rightarrow\ w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12)$
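A sketch of adaDelta, eq. (12) (my addition); note there is no global learning rate $\eta$ to tune:

```python
import numpy as np

# adaDelta: step = RMS of recent updates / RMS of recent gradients, times -g.
def adadelta(grad, w0, rho=0.95, delta=1e-6, steps=5000):
    w = w0.astype(float)
    G = np.zeros_like(w)    # running average of g^2
    X = np.zeros_like(w)    # running average of (delta w)^2
    for _ in range(steps):
        g = grad(w)
        G = rho * G + (1 - rho) * g ** 2
        dw = -np.sqrt(X + delta) / np.sqrt(G + delta) * g
        X = rho * X + (1 - rho) * dw ** 2
        w += dw
    return w

print(adadelta(lambda w: np.array([20 * w[0], 0.1 * w[1]]),
               np.array([1.0, 1.0])))   # heads toward [0, 0], slowly at first
```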

Improving adaGrad (cont.)

ADAM:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13)$

(Improving directions used here: momentum, and bias correction of the running averages.)
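And a sketch of ADAM, eq. (13) (my addition; $\beta_1 = 0.9$, $\beta_2 = 0.999$ follow the paper's suggested defaults):

```python
import numpy as np

# ADAM: momentum on g (m), running second moment (v), and bias correction
# so that early estimates are not biased toward the zero initialisation.
def adam(grad, w0, eta=0.05, b1=0.9, b2=0.999, delta=1e-8, steps=2000):
    w = w0.astype(float)
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)       # bias-corrected first moment
        v_hat = v / (1 - b2 ** t)       # bias-corrected second moment
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w

print(adam(lambda w: np.array([20 * w[0], 0.1 * w[1]]),
           np.array([1.0, 1.0])))       # close to [0, 0]
```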

Demo time

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
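In the spirit of the demo, here is a toy race (my addition) on the Rosenbrock function from that Wikipedia page, $f(x, y) = (1-x)^2 + 100(y - x^2)^2$ with minimiser $(1, 1)$; step counts and rates are illustrative and may need tuning for tight convergence:

```python
import numpy as np

def grad(w):                              # gradient of the Rosenbrock function
    x, y = w
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                     200 * (y - x ** 2)])

def run(update, steps=20000):
    w, state = np.array([-1.5, 1.5]), {}
    for t in range(1, steps + 1):
        w = update(w, grad(w), state, t)
    return w

def adagrad(w, g, s, t, eta=0.1, d=1e-8):
    s['G'] = s.get('G', 0) + g ** 2
    return w - eta * g / (d + np.sqrt(s['G']))

def adam(w, g, s, t, eta=0.01, b1=0.9, b2=0.999, d=1e-8):
    s['m'] = b1 * s.get('m', 0) + (1 - b1) * g
    s['v'] = b2 * s.get('v', 0) + (1 - b2) * g ** 2
    return w - eta * (s['m'] / (1 - b1 ** t)) / (np.sqrt(s['v'] / (1 - b2 ** t)) + d)

for name, upd in [('adaGrad', adagrad), ('ADAM', adam)]:
    print(name, run(upd))                 # both should head toward (1, 1)
```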

Learn the learning rates

Idea 3: learn the (global) learning rates as well.

$w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1})$

i.e. $w_t$ is a function of $\eta$, so we can learn $\eta$ (with gradient descent) by

$\eta = \arg\min_{\eta} \sum_{s \le t} f_s(w(s, \eta))$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
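A rough sketch of the on-the-fly idea (my addition; a one-step truncation, not the exact algorithm of either paper): since $w_t = w_{t-1} - \eta g_{t-1}$, the chain rule gives $\partial f_t(w_t) / \partial \eta = -\langle g_t, g_{t-1} \rangle$, so $\eta$ can itself be updated by gradient descent alongside $w$.

```python
import numpy as np

# Learn the global learning rate eta on the fly (one-step hypergradient).
rng = np.random.default_rng(0)
w, eta, beta = np.zeros(2), 1e-3, 1e-4
g_prev = np.zeros(2)
for t in range(2000):
    target = np.array([3.0, -1.0]) + rng.normal(scale=0.1, size=2)
    g = w - target                        # gradient of 0.5 * ||w - target||^2
    eta = max(eta + beta * (g @ g_prev), 0.0)  # descend on eta's hypergradient
    w = w - eta * g
    g_prev = g

print(eta, w)   # eta grows from its tiny seed while gradients stay correlated
```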

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.

References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1): 281-305, 2012.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12: 2121-2159, 2011.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
- Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2): 107-194, 2011.
- Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 4: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

A brief history of numerical analysis

Numerical Analysis is a well-established discipline

c 1800 - 1600 BC

Babyloniansattempted tocalculate

radic2 or in

modern terms findthe roots ofx2 minus 2 = 0

1

1httpwwwmathubcca˜cassEuclidybcybchtml

Yale Babylonian Collection

3 of 27

A brief history of numerical analysis

C17-C19Huge number of numerical techniques developed as tools for thenatural sciences

I Root-finding methods (eg Newton-Raphson)

I Numerical integration (eg Gaussian Quadrature)

I ODE solution (eg Euler method)

I Interpolation (eg Lagrange polynomials)

Focus is on situations where function evaluation is deterministic

4 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo

Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator

7 of 27

A brief history of numerical analysis

Present dayStochastic approximation widely researched and used in practice

I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent

I Neural networksI K-MeansI

I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu

amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))

TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)

8 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

From batch to online learning (cont)

Online learning assumes:
- each iteration t we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$$R_T^* = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w) \qquad (1)$$

General definition: $R_T(u) = \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(u) \right]$

SGD ⇔ "follow the regularized leader" with L2 regularization.


From batch to online learning (cont)

Follow the Leader (FTL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) \qquad (2)$$

Lemma (upper bound of regret)
Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$:

$$R_T(u) = \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(u) \right] \le \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(w_{t+1}) \right] \qquad (3)$$


From batch to online learning (cont)

A game that fools FTL:

Let $w \in S = [-1, 1]$ and the loss function at time t be

$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled: its iterates alternate between the endpoints $-1$ and $+1$, suffering loss 1 every round, so its regret against the fixed point $u = 0$ grows linearly in $T$.
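A quick simulation of this game (a sketch; over $S = [-1, 1]$ the FTL argmin of the accumulated linear loss is just the negative sign of the accumulated slope, and the start point $w_1 = 0$ is an assumption):

```python
import numpy as np

def loss_slope(t):
    # f_t(w) = a_t * w with a_1 = -0.5, a_t = 1 (t even), a_t = -1 (t odd, t > 1)
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T = 20
w, cum_slope, ftl_loss = 0.0, 0.0, 0.0
for t in range(1, T + 1):
    a = loss_slope(t)
    ftl_loss += a * w                    # suffer f_t(w_t)
    cum_slope += a
    # FTL: w_{t+1} = argmin_{w in [-1,1]} (sum_i a_i) * w = -sign(cumulative slope)
    w = -np.sign(cum_slope) if cum_slope != 0 else 0.0

print(ftl_loss)   # grows like T, while the fixed point u = 0 suffers 0: linear regret
```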


From batch to online learning (cont)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret)
Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$:

$$\sum_{t=1}^{T} \left[ f_t(w_t) - f_t(u) \right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(w_{t+1}) \right]$$


From batch to online learning (cont)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle \quad \forall g(w) \in \partial f_t(w)$$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g_t \rangle$ as a surrogate; then

$$\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t \rangle - \langle u, g_t \rangle \qquad \forall g_t \in \partial f_t(w_t)$$


From batch to online learning (cont)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (it upper-bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} \langle w, g_i \rangle + \varphi(w)$$
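To see why this recovers update (5) in the unconstrained case $S = \mathbb{R}^D$ (a one-line step the slides leave implicit), set the gradient of the FTRL objective to zero:

$$0 = \sum_{i=1}^{t} g_i + \frac{1}{\eta}(w_{t+1} - w_1) \quad \Rightarrow \quad w_{t+1} = w_1 - \eta \sum_{i=1}^{t} g_i = w_t - \eta g_t,$$

so FTRL on the linearised losses with this L2 regularizer, unrolled one step at a time, is exactly online gradient descent.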


From batch to online learning (cont)

Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$; then for all $u \in S$:

$$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t)$$
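A standard consequence (not on the slide, but useful for calibration): if $\|g_t\|_2 \le G$ for all $t$ and $\|u - w_1\|_2 \le D$, balancing the two terms with the fixed learning rate $\eta = D / (G\sqrt{2T})$ gives

$$R_T(u) \le \frac{D^2}{2\eta} + \eta T G^2 = DG\sqrt{2T},$$

i.e. sublinear $O(\sqrt{T})$ regret, so the average regret per round vanishes.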


From batch to online learning (cont)

Keep in mind for the later discussion of adaGrad:
- if we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$,
- then the L2 norm of the gradients in the regret bound is replaced by the dual norm $\|\cdot\|_*$,
- and we use Hölder's inequality in the proof.


Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$:
$\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$$

Primal-dual sub-gradient update:

$$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$


adaGrad [Duchi et al., 2010] (cont)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^{t} g_i g_i^\top, \qquad H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$$

$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^\top H_t w \qquad (8)$$

$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^\top H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2} w^\top H_t w$$


adaGrad [Duchi et al., 2010] (cont)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^\top \left[ \delta I + \mathrm{diag}(G_t)^{1/2} \right] (w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta \, g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}} \qquad (10)$$

(with the square root and division taken element-wise)
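A minimal NumPy sketch of update (10), diagonal adaGrad; the toy quadratic objective and the constants are illustrative assumptions:

```python
import numpy as np

def adagrad(grad, w0, eta=0.5, delta=1e-8, steps=200):
    """Diagonal adaGrad, update (10): w <- w - eta * g / (delta + sqrt(sum of g^2))."""
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)                  # running sum of squared gradients, diag(G_t)
    for _ in range(steps):
        g = grad(w)
        G += g * g
        w -= eta * g / (delta + np.sqrt(G))
    return w

# usage on a toy quadratic C(w) = 0.5 * w^T A w with A = diag(10, 1)
A = np.array([10.0, 1.0])
print(adagrad(lambda w: A * w, w0=[1.5, 0.5]))   # approaches the minimiser at 0
```

Note how each coordinate gets its own effective step size: directions with persistently large gradients are damped, rare/small directions keep larger steps.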


adaGrad [Duchi et al., 2010] (cont)

WHY? Time-varying learning rate vs. time-varying proximal term:
- we want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \, \cdot \rangle}$
- $\psi_t(w) = \frac{1}{2} w^\top H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$$\sum_{t=1}^{T} \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2$$


adaGrad [Duchi et al., 2010] (cont)

Theorem (Thm. 5 in the paper)
Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$:

$$R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$$

For $\{w_t\}$ generated using the composite mirror descent update (9):

$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$$


Improving adaGrad

Possible problems of adaGrad:
- the global learning rate η still needs hand tuning
- sensitive to initial conditions

Possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]


Improving adaGrad (cont)

RMSprop:

$$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^\top, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^\top H_t (w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta \, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$$

Improving direction used here: replace adaGrad's ever-growing sum with a running average of squared gradients.
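A matching sketch of update (11), element-wise; ρ = 0.9 and the other constants are illustrative assumptions:

```python
import numpy as np

def rmsprop(grad, w0, eta=0.01, rho=0.9, delta=1e-8, steps=500):
    """RMSprop, update (11): divide by a running RMS of the gradient."""
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)                      # running average of squared gradients
    for _ in range(steps):
        g = grad(w)
        G = rho * G + (1 - rho) * g * g
        w -= eta * g / np.sqrt(delta + G)     # RMS[g]_t = sqrt(delta + G_t)
    return w

A = np.array([10.0, 1.0])                     # same toy quadratic as before
print(rmsprop(lambda w: A * w, w0=[1.5, 0.5]))
```

Because old gradients decay geometrically, the effective step size no longer shrinks to zero the way adaGrad's does.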


Improving adaGrad (cont)

A Newton-style step satisfies

$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \quad \Rightarrow \quad \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \quad \Rightarrow \quad \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

Improving direction used here: approximate second-order information from quantities we already track.


Improving adaGrad (cont)

adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \arg\min_{w} \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^\top H_t (w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t \qquad (12)$$

Improving directions used here: running averages plus approximate second-order information; note that no global learning rate η is left to tune.
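A sketch of update (12); the step is built entirely from the two running averages (ρ, δ, and the toy problem are illustrative assumptions):

```python
import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, steps=2000):
    """adaDelta, update (12): no global learning rate, unit-consistent steps."""
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)      # running average of g^2
    D = np.zeros_like(w)      # running average of (delta w)^2
    for _ in range(steps):
        g = grad(w)
        G = rho * G + (1 - rho) * g * g
        dw = -np.sqrt(D + delta) / np.sqrt(G + delta) * g   # RMS[dw]_{t-1} / RMS[g]_t * g
        D = rho * D + (1 - rho) * dw * dw
        w += dw
    return w

A = np.array([10.0, 1.0])
print(adadelta(lambda w: A * w, w0=[1.5, 0.5]))   # slowly approaches the minimiser at 0
```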


Improving adaGrad (cont)

ADAM:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13)$$

Improving directions used here: running averages plus momentum, with bias correction of both running averages.
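A sketch of update (13); the defaults β₁ = 0.9, β₂ = 0.999 follow the ADAM paper, while η and the toy problem are illustrative assumptions:

```python
import numpy as np

def adam(grad, w0, eta=0.1, beta1=0.9, beta2=0.999, delta=1e-8, steps=500):
    """ADAM, update (13): bias-corrected momentum and second-moment estimates."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w

A = np.array([10.0, 1.0])
print(adam(lambda w: A * w, w0=[1.5, 0.5]))
```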


Demo time

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
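In the spirit of the demo, here is a sketch racing plain gradient descent against the `adam` sketch above on the Rosenbrock function from that page (minimum at (1, 1)); step sizes and step counts are illustrative assumptions:

```python
import numpy as np

def rosen_grad(w):
    # gradient of the Rosenbrock function f(x, y) = (1 - x)^2 + 100 (y - x^2)^2
    x, y = w
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                     200 * (y - x ** 2)])

w_sgd = np.array([-1.0, 1.0])
for _ in range(5000):
    w_sgd -= 1e-3 * rosen_grad(w_sgd)        # plain GD crawls along the curved valley

print("GD  :", w_sgd)
print("ADAM:", adam(rosen_grad, [-1.0, 1.0], eta=0.02, steps=5000))  # typically closer to (1, 1)
```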


Learn the learning rates

Idea 3: learn the (global) learning rates as well.

$$w_{t+1} = w_t - \eta g_t \quad \Rightarrow \quad w_t = w(t, \eta; \, w_1, \{g_i\}_{i \le t-1})$$

i.e. $w_t$ is a function of η, so we can learn η (with gradient descent) by

$$\eta^* = \arg\min_{\eta} \sum_{s \le t} f_s(w(s, \eta))$$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn η on the fly [Masse and Ollivier, 2015]
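As a flavour of the "on the fly" variant, here is a one-step hypergradient sketch: differentiate $f_t(w_t)$ w.r.t. η through only the most recent update $w_t = w_{t-1} - \eta g_{t-1}$, which gives $\partial f_t / \partial \eta \approx -\langle g_t, g_{t-1} \rangle$. This truncated-history approximation and the meta-rate `alpha` are assumptions for illustration, not the exact algorithms of the cited papers.

```python
import numpy as np

def sgd_learned_eta(grad, w0, eta0=0.01, alpha=1e-4, steps=500):
    """SGD that adapts its own global learning rate by one-step hypergradient descent."""
    w = np.asarray(w0, dtype=float).copy()
    eta = eta0
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        # d f_t(w_t) / d eta through w_t = w_{t-1} - eta * g_{t-1} is -<g_t, g_{t-1}>,
        # so gradient descent on eta adds alpha * <g_t, g_{t-1}>
        eta = max(eta + alpha * np.dot(g, g_prev), 0.0)
        w -= eta * g
        g_prev = g
    return w, eta

A = np.array([10.0, 1.0])
w, eta = sgd_learned_eta(lambda w: A * w, w0=[1.5, 0.5])
print(w, eta)
```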


Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.


References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.

Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.

Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.

Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 5: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

A brief history of numerical analysis

C17-C19Huge number of numerical techniques developed as tools for thenatural sciences

I Root-finding methods (eg Newton-Raphson)

I Numerical integration (eg Gaussian Quadrature)

I ODE solution (eg Euler method)

I Interpolation (eg Lagrange polynomials)

Focus is on situations where function evaluation is deterministic

4 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo

Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator

7 of 27

A brief history of numerical analysis

Present dayStochastic approximation widely researched and used in practice

I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent

I Neural networksI K-MeansI

I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu

amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))

TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)

8 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 6: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic


A brief history of numerical analysis

1952: Motivated by the Robbins-Monro paper the year before, Kiefer and Wolfowitz publish "Stochastic Estimation of the Maximisation of a Regression Function".

Their algorithm is phrased as a maximisation procedure, in contrast to the Robbins-Monro paper, and uses central-difference approximations of the gradient to update the optimum estimator.

A brief history of numerical analysis

Present day: Stochastic approximation is widely researched and used in practice.

I Original applications in root-finding and optimisation, often in the guise of stochastic gradient descent:
I Neural networks
I K-Means
I ...

I Related ideas are used in:
I Stochastic variational inference (Hoffman et al. (2013))
I Pseudo-marginal Metropolis-Hastings (Beaumont (2003), Andrieu & Roberts (2009))
I Stochastic Gradient Langevin Dynamics (Welling & Teh (2011))

Textbooks:
Stochastic Approximation and Recursive Algorithms and Applications, Kushner & Yin (2003)
Introduction to Stochastic Search and Optimization, Spall (2003)

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysis, and the general problem it seeks to solve has the following form:

Minimise $f(w) = \mathbb{E}[F(w, \xi)]$

(over $w$ in some domain $\mathcal{W}$, and some random variable $\xi$).

This is an immensely flexible framework:

I $F(w, \xi) = f(w) + \xi$ models experimental/measurement error.
I $F(w, \xi) = (w^\top \phi(\xi_X) - \xi_Y)^2$ corresponds to (least squares) linear regression.
I If $f(w) = \frac{1}{N}\sum_{i=1}^N g(w, x_i)$ and $F(w, \xi) = \frac{1}{K}\sum_{x \in \xi} g(w, x)$, with $\xi$ a randomly-selected subset (of size $K$) of a large data set, this corresponds to "stochastic" machine learning algorithms.


Stochastic gradient descent

We'll focus on proof techniques for stochastic gradient descent (SGD).

We'll derive conditions for SGD to converge almost surely. We broadly follow the structure of Léon Bottou's paper².

Plan:

I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent

²Online Learning and Stochastic Approximations, Léon Bottou (1998)


Continuous Gradient Descent

Let $C: \mathbb{R}^k \to \mathbb{R}$ be differentiable. How do solutions $s: \mathbb{R} \to \mathbb{R}^k$ to the following ODE behave?

$$\frac{\mathrm{d}}{\mathrm{d}t} s(t) = -\nabla C(s(t))$$

Example: $C(x) = 5(x_1 + x_2)^2 + (x_1 - x_2)^2$.

We can analytically solve the gradient descent equation for this example:

$$\nabla C(x) = (12x_1 + 8x_2,\; 8x_1 + 12x_2)$$

So $(s_1', s_2') = -(12s_1 + 8s_2,\; 8s_1 + 12s_2)$, and solving with $(s_1(0), s_2(0)) = (1.5, 0.5)$ gives

$$\begin{pmatrix} s_1(t) \\ s_2(t) \end{pmatrix} = \begin{pmatrix} e^{-20t} + 0.5\,e^{-4t} \\ e^{-20t} - 0.5\,e^{-4t} \end{pmatrix}$$


Exact solution of gradient descent ODE (animation in the original slides).
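To make the example concrete, here is a minimal sketch (ours, not from the slides) that integrates the gradient flow for this $C$ with small forward Euler steps and checks it against the exact solution above; only numpy is assumed.

```python
import numpy as np

def grad_C(x):
    # Gradient of C(x) = 5(x1 + x2)^2 + (x1 - x2)^2
    return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

def exact(t):
    # Exact gradient-flow solution with s(0) = (1.5, 0.5)
    return np.array([np.exp(-20 * t) + 0.5 * np.exp(-4 * t),
                     np.exp(-20 * t) - 0.5 * np.exp(-4 * t)])

s, dt = np.array([1.5, 0.5]), 1e-4
for _ in range(int(1.0 / dt)):       # integrate up to t = 1
    s = s - dt * grad_C(s)           # fine-step Euler approximates the flow

print(s, exact(1.0))                 # both are close to (0.0092, -0.0092)
```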

Continuous Gradient Descent

We'll start with some simplifying assumptions:

1. $C$ has a unique minimiser $x^*$;
2. $\forall \varepsilon > 0$: $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$.

(The second condition is weaker than convexity.)

Proposition: If $s: [0, \infty) \to \mathcal{X}$ satisfies the differential equation

$$\frac{\mathrm{d}s(t)}{\mathrm{d}t} = -\nabla C(s(t))$$

then $s(t) \to x^*$ as $t \to \infty$.

[Figure in original slides: example of a non-convex function satisfying these conditions.]


Continuous Gradient Descent

1. Define Lyapunov function $h(t) = \|s(t) - x^*\|_2^2$
2. $h(t)$ is a decreasing function
3. $h(t)$ converges to a limit, and $h'(t)$ converges to 0
4. The limit of $h(t)$ must be 0
5. $s(t) \to x^*$


$h(t)$ is a decreasing function:

$$\frac{\mathrm{d}}{\mathrm{d}t} h(t) = \frac{\mathrm{d}}{\mathrm{d}t} \|s(t) - x^*\|_2^2 = \frac{\mathrm{d}}{\mathrm{d}t} \langle s(t) - x^*, s(t) - x^* \rangle = 2 \left\langle \frac{\mathrm{d}s(t)}{\mathrm{d}t},\, s(t) - x^* \right\rangle = -2 \langle s(t) - x^*, \nabla C(s(t)) \rangle \le 0$$


The limit of $h(t)$ must be 0: if not, there is some $K > 0$ such that $h(t) > K$ for all $t$. By assumption,

$$\inf_{\|x - x^*\|_2^2 > K} \langle x - x^*, \nabla C(x) \rangle > 0,$$

which means that $h'(t)$ is bounded away from 0 (indeed $\sup_t h'(t) < 0$), contradicting the convergence of $h'(t)$ to 0.


Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!


Discrete Gradient Descent

An intuitive, straightforward discretisation of

$$\frac{\mathrm{d}}{\mathrm{d}t} s(t) = -\nabla C(s(t))$$

is

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n) \qquad \text{(forward Euler)}$$

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method which is (A-)stable is the implicit Euler method:

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})$$

So why settle for explicit discretisation? Implicit Euler requires the solution of a non-linear system at each step, and practically, explicit discretisation works.
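For a quadratic cost such as the running example, the implicit Euler step reduces to a single linear solve, which makes the stability contrast easy to demonstrate; a sketch (our illustration, with a deliberately too-large step $\varepsilon$):

```python
import numpy as np

A = np.array([[12., 8.], [8., 12.]])    # grad C(x) = A x for the quadratic example
eps = 0.11                               # eigenvalues of A are 20 and 4; eps > 2/20
s_explicit = np.array([1.5, 0.5])
s_implicit = np.array([1.5, 0.5])

for _ in range(100):
    # forward Euler: s_{n+1} = s_n - eps * grad C(s_n)
    s_explicit = s_explicit - eps * (A @ s_explicit)
    # implicit Euler: (I + eps * A) s_{n+1} = s_n, a linear solve here
    s_implicit = np.linalg.solve(np.eye(2) + eps * A, s_implicit)

print(s_explicit)   # blows up: |1 - 20 * eps| > 1 along the stiff eigendirection
print(s_implicit)   # decays to the minimiser (0, 0) for any eps > 0
```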


Discrete Gradient Descent

We also have multistep methods, e.g.

$$s_{n+2} = s_{n+1} - \varepsilon_{n+2} \left( \frac{3}{2} \nabla C(s_{n+1}) - \frac{1}{2} \nabla C(s_n) \right)$$

This is the 2nd-order Adams-Bashforth method (applied to the descent ODE $\dot{s} = -\nabla C(s)$).

Similar (in)stability properties to forward Euler, but discretisation error of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).
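A sketch of one possible implementation of this two-step scheme, bootstrapped with a single Euler step (ours, not from the slides):

```python
import numpy as np

def grad_C(x):
    return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

eps = 0.01
s_prev = np.array([1.5, 0.5])
s_curr = s_prev - eps * grad_C(s_prev)   # one Euler step to start the 2-step method

for _ in range(1000):
    # 2nd-order Adams-Bashforth step for ds/dt = -grad_C(s)
    s_next = s_curr - eps * (1.5 * grad_C(s_curr) - 0.5 * grad_C(s_prev))
    s_prev, s_curr = s_curr, s_next

print(s_curr)   # close to (0, 0), the unique minimiser
```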

Discrete Gradient Descent

[Animations in original slides: examples of discretisation with ε = 0.1, ε = 0.07, and ε = 0.01.]

Discrete Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies

1. $C$ has a unique minimiser $x^*$;
2. $\forall \varepsilon > 0$: $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$;
3. $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
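The canonical schedule satisfying both step-size conditions is $\varepsilon_n = c/n$; as a quick check,

$$\sum_{n=1}^\infty \frac{c}{n} = \infty \quad \text{(harmonic series)}, \qquad \sum_{n=1}^\infty \frac{c^2}{n^2} = \frac{c^2 \pi^2}{6} < \infty,$$

and more generally $\varepsilon_n = c/n^\alpha$ works for any $\alpha \in (1/2, 1]$.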


Discrete Gradient Descent

1. Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^*$

Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.

Because of the error that the discretisation introduces, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. (Illustration credit: Online Learning and Stochastic Approximations, Léon Bottou (1998).)

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

With some algebra, note that

$$\begin{aligned}
h_{n+1} - h_n &= \langle s_{n+1} - x^*, s_{n+1} - x^* \rangle - \langle s_n - x^*, s_n - x^* \rangle \\
&= \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2\langle s_{n+1} - s_n, x^* \rangle \\
&= \langle s_n - \varepsilon_n \nabla C(s_n),\, s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2\varepsilon_n \langle \nabla C(s_n), x^* \rangle \\
&= -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2
\end{aligned}$$

Show that $\sum_{n=1}^\infty h_n^+ < \infty$:

Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get

$$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + Bh_n)$$
$$\implies h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$$

Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$, we get

$$h_{n+1}' - h_n' \le \varepsilon_n^2 \mu_n A$$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable, and hence $\sum_{n=1}^\infty h_n^+ < \infty$.

Show that this implies that $h_n$ converges:

If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why? $h_n \ge 0$).

But $h_{n+1} = h_1 + \sum_{k=1}^n \big( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \big)$.

So $(h_n)_{n=1}^\infty$ converges.

Show that $h_n$ must converge to 0: assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0,$$

contradicting $h_n$ converging away from 0.


Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x)$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x) \qquad (I_K \sim \mathrm{Unif}(\text{subsets of size } K))$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
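As a sanity check on the subsampled estimator, a minimal sketch (ours) that compares the averaged mini-batch gradient with the full gradient on a toy least-squares cost; all names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 1000, 32
X = rng.normal(size=(N, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=N)

def grad_full(w):
    # (1/N) * sum over n of the squared-error gradient on datapoint n
    return 2 * X.T @ (X @ w - y) / N

def grad_minibatch(w):
    # unbiased estimate: average over a random size-K subset I_K
    idx = rng.choice(N, size=K, replace=False)
    return 2 * X[idx].T @ (X[idx] @ w - y[idx]) / K

w = np.zeros(2)
avg = np.mean([grad_minibatch(w) for _ in range(5000)], axis=0)
print(avg, grad_full(w))   # the two agree closely, as unbiasedness predicts
```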


Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if

I $\mathbb{E}[|X_n|] < \infty$ for all $n$;
I $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$.

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\left[ \left( \mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n \right) \mathbf{1}_{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0} \right] < \infty$$

then $X_n \to X_\infty$ almost surely.

³on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$


Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^*$;
2. $\forall \varepsilon > 0$: $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$;
3. $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
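Before the proof, a quick simulation (ours) of exactly this setting: unbiased noisy gradients of the quadratic example plus Robbins-Monro step sizes drive $s_n$ to the minimiser.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[12., 8.], [8., 12.]])

def H(s):
    # unbiased gradient estimate: grad C(s) = A s, plus zero-mean noise
    return A @ s + rng.normal(size=2)

s = np.array([1.5, 0.5])
for n in range(1, 100001):
    s = s - (1.0 / n) * H(s)   # eps_n = 1/n: sum eps_n = inf, sum eps_n^2 < inf

print(s)   # close to the minimiser (0, 0)
```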


Stochastic Gradient Descent

1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$


Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$$

So

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \, \mathbb{E}\left[ \|H_n(s_n)\|_2^2 \mid \mathcal{F}_n \right]$$

Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

I Assume $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^*\|_2^2$
I Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$

Get

$$\mathbb{E}\left[h_{n+1}' - h_n' \mid \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$$
$$\implies \mathbb{E}\left[ (h_{n+1}' - h_n') \mathbf{1}_{\mathbb{E}[h_{n+1}' - h_n' \mid \mathcal{F}_n] > 0} \,\middle|\, \mathcal{F}_n \right] \le \varepsilon_n^2 \mu_n A$$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h_n')_{n=1}^\infty$ converges a.s. $\implies$ $(h_n)_{n=1}^\infty$ converges a.s.

Show that $h_n$ must converge to 0 almost surely. From previous calculations,

$$\mathbb{E}\left[ h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n \right] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A$$

$(h_n)_{n=1}^\infty$ converges a.s., so the left-hand sequence is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the right term is summable a.s., and hence the left term is also summable a.s.:

$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty \quad \text{almost surely}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely}$$

And by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.


Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

I $C$ is a non-negative, three-times differentiable function;
I The Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$;
I The gradient estimate $H$ satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$;
I There exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.


Motivation

So far we've told you why SGD is "safe" :)

but Robbins-Monro is just a sufficient condition;

then how to choose learning rates to achieve

I fast convergence
I a better local optimum?

Motivation (cont.)

[Figure from [Bergstra and Bengio, 2012]]

Idea 1: search for the best learning rate schedule:

I grid search, random search, cross validation
I Bayesian optimization

But I'm lazy and I don't want to spend too much time!
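A minimal sketch (ours) of what idea 1 looks like with random search over log-uniform learning rates, in the spirit of [Bergstra and Bengio, 2012]; the toy objective and sampling range are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[12., 8.], [8., 12.]])

def final_cost(eta, steps=200):
    # noisy gradient descent on C(x) = 5(x1+x2)^2 + (x1-x2)^2
    w = np.array([1.5, 0.5])
    for _ in range(steps):
        w = w - eta * (A @ w + rng.normal(scale=0.1, size=2))
    return 5 * (w[0] + w[1]) ** 2 + (w[0] - w[1]) ** 2

candidates = 10 ** rng.uniform(-4, -0.7, size=20)   # log-uniform samples
best_eta = min(candidates, key=final_cost)
print(best_eta)
```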

Motivation (cont.)

Idea 2: let the learning rates adapt themselves:

I pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
I adaGrad [Duchi et al., 2010]
I adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
I ADAM [Kingma and Ba, 2014]
I Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

I a good tutorial [Shalev-Shwartz, 2012]
I milestone paper [Zinkevich, 2003]

and some practical comparisons.

Online learning

From batch to online learning

Batch learning often assumes:

I we've got the whole dataset $\mathcal{D}$;
I the cost function $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$.

SGD/mini-batch learning accelerates training in real time by

I processing one / a mini-batch of datapoints each iteration;
I considering gradients with data-point cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

I each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$;
I then we update the weights with $w \leftarrow w - \eta g_t$.

For SGD the received gradient is defined by

$$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$


From batch to online learning (cont.)

Online learning assumes:

I each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$);
I then we learn $w_{t+1}$ following some rules.

Performance is measured by regret:

$$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \tag{1}$$

General definition: $R_T(u) = \sum_{t=1}^T \left[ f_t(w_t) - f_t(u) \right]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

From batch to online learning (cont.)

Follow the Leader (FTL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) \tag{2}$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL. Then $\forall u \in S$:

$$R_T(u) = \sum_{t=1}^T \left[ f_t(w_t) - f_t(u) \right] \le \sum_{t=1}^T \left[ f_t(w_t) - f_t(w_{t+1}) \right] \tag{3}$$

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\,w, & t = 1 \\ w, & t \text{ is even} \\ -w, & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled!

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \tag{4}$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL. Then $\forall u \in S$:

$$\sum_{t=1}^T \left[ f_t(w_t) - f_t(u) \right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[ f_t(w_t) - f_t(w_{t+1}) \right]$$

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)$$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w - w_t, g_t \rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$$

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \tag{5}$$

From FTRL to online gradient descent:

I use the linearisation as a surrogate loss (upper bound the LHS);
I use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$;
I apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$$
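A sketch (ours) of projected OGD, i.e. update (5) followed by projection onto $S$, on the linear loss sequence that fooled FTL; with $\eta \propto 1/\sqrt{T}$ the regret stays $O(\sqrt{T})$:

```python
import numpy as np

def coef(t):
    return -0.5 if t == 1 else (1.0 if t % 2 == 0 else -1.0)

T = 10000
eta = 1.0 / np.sqrt(T)      # step size tuned to the horizon
w, total = 0.0, 0.0
for t in range(1, T + 1):
    total += coef(t) * w
    g = coef(t)                                   # (sub)gradient of f_t at w
    w = float(np.clip(w - eta * g, -1.0, 1.0))    # gradient step + projection onto S

best_fixed = min(sum(coef(t) for t in range(1, T + 1)) * u for u in (-1.0, 1.0))
print(total - best_fixed)   # O(sqrt(T)), tiny compared with FTL's ~T regret
```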

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$. Then for all $u \in S$:

$$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t)$$

Keep in mind for the later discussion of adaGrad:

I we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$;
I then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_*$;
I and we use Hölder's inequality in the proof.

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$:

I $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$$

Primal-dual sub-gradient update:

$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \tag{6}$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \tag{7}$$

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^\top, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^\top H_t w \tag{8}$$

$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^\top H_t (w - w_t) \tag{9}$$

To get adaGrad from online gradient descent:

I add a proximal term or a Bregman divergence to the objective;
I adapt the proximal term through time: $\psi_t(w) = \frac{1}{2} w^\top H_t w$.

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^\top \left[ \delta I + \mathrm{diag}(G_t^{1/2}) \right] (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta \, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \tag{10}$$

(with the division applied elementwise).
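Update (10) in code (ours), on a toy least-squares problem; $\eta$ and $\delta$ are illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X @ np.array([1.0, -2.0])

eta, delta = 0.5, 1e-8
w = np.zeros(2)
G = np.zeros(2)    # per-coordinate running sum of squared gradients

for t in range(500):
    idx = rng.choice(1000, size=32, replace=False)
    g = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / 32
    G += g * g
    w = w - eta * g / (delta + np.sqrt(G))   # adaGrad update (10), elementwise

print(w)   # approaches (1, -2)
```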

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs time-varying proximal term:

I we want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$ terms;
I define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \cdot \rangle}$;
I $\psi_t(w) = \frac{1}{2} w^\top H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$;
I a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2$$

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$. Then for any $u \in S$:

$$R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$

For $w_t$ generated using the composite mirror descent update (9):

$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$

Improving adaGrad

Possible problems of adaGrad:

I the global learning rate $\eta$ still needs hand tuning;
I sensitive to initial conditions.

Possible improving directions:

I use a truncated sum / running average;
I use (approximate) second-order information;
I add momentum;
I correct the bias of the running average.

Existing methods:

I RMSprop [Tieleman and Hinton, 2012]
I adaDelta [Zeiler, 2012]
I ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont.)

RMSprop:

$$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^\top, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^\top H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta \, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \tag{11}$$

(This takes the "running average" improving direction: a truncated, exponentially-weighted sum replaces adaGrad's full sum.)
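A compact sketch (ours) of the resulting diagonal update, applied to the quadratic example; $\rho$, $\eta$, $\delta$ are illustrative defaults:

```python
import numpy as np

def rmsprop_step(w, g, G, rho=0.9, eta=0.01, delta=1e-8):
    # update (11) with diagonal G_t: exponential running average of g^2
    G = rho * G + (1 - rho) * g * g
    return w - eta * g / np.sqrt(delta + G), G

A = np.array([[12., 8.], [8., 12.]])
w, G = np.array([1.5, 0.5]), np.zeros(2)
for _ in range(5000):
    w, G = rmsprop_step(w, A @ w, G)

print(w)   # near (0, 0); steps have size about eta regardless of gradient scale
```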

Improving adaGrad (cont.)

Approximate second-order information:

$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \implies \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \implies \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

Improving adaGrad (cont.)

adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \arg\min_w \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^\top H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t \tag{12}$$

(This combines the running-average and approximate second-order directions, and removes the global learning rate.)
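A sketch (ours) of the adaDelta recursion; note there is no global learning rate, only the decay $\rho$ and the small constant $\delta$:

```python
import numpy as np

def adadelta_step(w, g, Eg2, Edw2, rho=0.95, delta=1e-6):
    # update (12): RMS[dw]_{t-1} / RMS[g]_t plays the role of the learning rate
    Eg2 = rho * Eg2 + (1 - rho) * g * g
    dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
    Edw2 = rho * Edw2 + (1 - rho) * dw * dw
    return w + dw, Eg2, Edw2

A = np.array([[12., 8.], [8., 12.]])
w, Eg2, Edw2 = np.array([1.5, 0.5]), np.zeros(2), np.zeros(2)
for _ in range(10000):
    w, Eg2, Edw2 = adadelta_step(w, A @ w, Eg2, Edw2)

print(w)   # approaches (0, 0) with no hand-set global step size
```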

Improving adaGrad (cont.)

ADAM:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \tag{13}$$

(This adds momentum and corrects the bias of the running averages.)
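Update (13) as a sketch (ours), again on the quadratic example; the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ follow the paper, while $\eta$ is illustrative:

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=0.05, b1=0.9, b2=0.999, delta=1e-8):
    # update (13): momentum plus RMS normalisation, with bias correction
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - eta * m_hat / (np.sqrt(v_hat) + delta), m, v

A = np.array([[12., 8.], [8., 12.]])
w, m, v = np.array([1.5, 0.5]), np.zeros(2), np.zeros(2)
for t in range(1, 3001):
    w, m, v = adam_step(w, A @ w, m, v, t)

print(w)   # close to (0, 0)
```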

Demo time!

Wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

Idea 3: learn the (global) learning rates as well:

$$w_{t+1} = w_t - \eta g_t \implies w_t = w(t, \eta, \{w_1, g_i\}_{i \le t-1})$$

$w_t$ is a function of $\eta$, so learn $\eta$ (with gradient descent) by

$$\eta^* = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))$$

I Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
I Stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]

Summary

I Recent successes of machine learning rely on stochastic approximation and optimisation methods.
I Theoretical analysis guarantees good behaviour.
I Adaptive learning rates provide good performance in practice.
I Future work will reduce the labour of tuning hyper-parameters.

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.

Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.

Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Tieleman, T., Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.

Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.

Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 7: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo

Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator

7 of 27

A brief history of numerical analysis

Present dayStochastic approximation widely researched and used in practice

I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent

I Neural networksI K-MeansI

I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu

amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))

TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)

8 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient Descent

An intuitive, straightforward discretisation of

    d/dt s(t) = −∇C(s(t))

is

    (s_{n+1} − s_n) / ε_n = −∇C(s_n)    (forward Euler)

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method which is (A-)stable is the implicit Euler method:

    (s_{n+1} − s_n) / ε_n = −∇C(s_{n+1})

So why settle for explicit discretisation? The implicit method requires the solution of a non-linear system at each step, and in practice the explicit discretisation works.
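To make the two discretisations concrete, here is a minimal sketch (ours, not the slides') for the running example C(x) = 5(x₁ + x₂)² + (x₁ − x₂)²; the fixed-point loop for the implicit step is an assumed solver that is only adequate for small ε:

    import numpy as np

    def grad_C(x):
        # Gradient of the example C(x) = 5(x1 + x2)^2 + (x1 - x2)^2
        return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

    def explicit_euler_step(s, eps):
        # Forward Euler: s_{n+1} = s_n - eps * grad C(s_n)
        return s - eps * grad_C(s)

    def implicit_euler_step(s, eps, n_iters=50):
        # Backward Euler: solve s_{n+1} = s_n - eps * grad C(s_{n+1})
        # by fixed-point iteration (an assumption; converges for small eps)
        s_next = s.copy()
        for _ in range(n_iters):
            s_next = s - eps * grad_C(s_next)
        return s_next

    s = np.array([1.5, 0.5])
    for _ in range(200):
        s = explicit_euler_step(s, 0.07)
    print(s)  # approaches the minimiser x* = (0, 0)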

Discrete Gradient Descent

We also have multistep methods, e.g.

    s_{n+2} = s_{n+1} − ε_{n+2} ( (3/2) ∇C(s_{n+1}) − (1/2) ∇C(s_n) )

This is the 2nd-order Adams-Bashforth method.

It has similar (in)stability properties to forward Euler, but discretisation error of a smaller order of magnitude at each step (O(ε³) compared with O(ε²)).
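As a short companion, a hedged sketch of this two-step update for the same example (grad_C as defined above); bootstrapping the first step with forward Euler is our assumption, since the slides do not specify the start-up step:

    def adams_bashforth2(s0, eps, n_steps):
        # Two-step Adams-Bashforth applied to ds/dt = -grad C(s),
        # bootstrapped with one forward Euler step (assumption)
        s_prev, s = s0, s0 - eps * grad_C(s0)
        for _ in range(n_steps - 1):
            s_prev, s = s, s - eps * (1.5 * grad_C(s) - 0.5 * grad_C(s_prev))
        return s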

Discrete Gradient Descent

Examples of discretisation with ε = 0.1, ε = 0.07 and ε = 0.01 (animated demos in the original slides).

Discrete Gradient Descent

Proposition: If s_{n+1} = s_n − ε_n ∇C(s_n) and C satisfies

1. C has a unique minimiser x*
2. ∀ε > 0, inf_{‖x − x*‖₂² > ε} ⟨x − x*, ∇C(x)⟩ > 0
3. ‖∇C(x)‖₂² ≤ A + B‖x − x*‖₂² for some A, B ≥ 0

then, subject to Σ_n ε_n = ∞ and Σ_n ε_n² < ∞, we have s_n → x*.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
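A minimal sketch of this scheme with the step sizes ε_n = 1/(10 + n), which satisfy both conditions Σ ε_n = ∞ and Σ ε_n² < ∞ (the particular schedule and offset are our choice, not the slides'); grad_C is the example gradient from earlier:

    import numpy as np

    def grad_C(x):
        return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

    def discrete_gd(s0, n_steps):
        # s_{n+1} = s_n - eps_n * grad C(s_n); sum eps_n diverges,
        # sum eps_n^2 converges, as the proposition requires
        s = s0
        for n in range(1, n_steps + 1):
            s = s - (1.0 / (10 + n)) * grad_C(s)
        return s

    print(discrete_gd(np.array([1.5, 0.5]), 5000))  # close to x* = (0, 0)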

Discrete Gradient Descent

1. Define Lyapunov sequence h_n = ‖s_n − x*‖₂²
2. Consider the positive variations h_n⁺ = max(0, h_{n+1} − h_n)
3. Show that Σ_{n=1}^∞ h_n⁺ < ∞
4. Show that this implies that h_n converges
5. Show that h_n must converge to 0
6. s_n → x*

Define Lyapunov sequence h_n = ‖s_n − x*‖₂².
Because of the error that the discretisation introduces, h_n is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: h_n may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of h_n. (Illustration credit: Online Learning and Stochastic Approximations, Léon Bottou (1998).)

Consider the positive variations h_n⁺ = max(0, h_{n+1} − h_n).
With some algebra, note that

    h_{n+1} − h_n = ⟨s_{n+1} − x*, s_{n+1} − x*⟩ − ⟨s_n − x*, s_n − x*⟩
                  = ⟨s_{n+1}, s_{n+1}⟩ − ⟨s_n, s_n⟩ − 2 ⟨s_{n+1} − s_n, x*⟩
                  = ⟨s_n − ε_n∇C(s_n), s_n − ε_n∇C(s_n)⟩ − ⟨s_n, s_n⟩ + 2ε_n ⟨∇C(s_n), x*⟩
                  = −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n² ‖∇C(s_n)‖₂²

Show that Σ_{n=1}^∞ h_n⁺ < ∞.
Assuming ‖∇C(x)‖₂² ≤ A + B‖x − x*‖₂², we get

    h_{n+1} − h_n ≤ −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n²(A + Bh_n)
    ⟹ h_{n+1} − (1 + ε_n²B)h_n ≤ −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n²A ≤ ε_n²A

Writing μ_n = Π_{i=1}^n 1/(1 + ε_i²B) and h′_n = μ_n h_n, we get

    h′_{n+1} − h′_n ≤ ε_n² μ_n A

Assuming Σ_{n=1}^∞ ε_n² < ∞, μ_n converges away from 0, so the RHS is summable.

Show that this implies that h_n converges.
If Σ_{n=1}^∞ max(0, h_{n+1} − h_n) < ∞ then, since h_n ≥ 0, the downward moves are also controlled: Σ_{n=1}^∞ min(0, h_{n+1} − h_n) > −∞.

But h_{n+1} = h_1 + Σ_{k=1}^n (max(0, h_{k+1} − h_k) + min(0, h_{k+1} − h_k)).

So (h_n)_{n=1}^∞ converges.

Show that h_n must converge to 0. Assume h_n converges to some positive number. Previously we had

    h_{n+1} − h_n = −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n² ‖∇C(s_n)‖₂²

This is summable, and if we assume further that Σ_{n=1}^∞ ε_n = ∞, then we get

    ⟨s_n − x*, ∇C(s_n)⟩ → 0,

contradicting h_n converging away from 0.
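As a quick numerical companion (our construction, not the slides'), we can track h_n and accumulate its positive variations for the example objective, whose minimiser is x* = 0:

    import numpy as np

    def grad_C(x):
        return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

    s = np.array([1.5, 0.5])
    h_prev = float(np.dot(s, s))        # h_n = ||s_n - x*||_2^2 with x* = 0
    pos_var_sum = 0.0
    for n in range(1, 2001):
        s = s - (1.0 / (10 + n)) * grad_C(s)
        h = float(np.dot(s, s))
        pos_var_sum += max(0.0, h - h_prev)   # accumulate h_n^+
        h_prev = h
    print(h_prev, pos_var_sum)   # h_n -> 0; the cumulative 'up' moves stay finite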

Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If C is a cost function averaged across a data set, we often have a true gradient of the form

    ∇C(x) = (1/N) Σ_{n=1}^N f_n(x)

and an approximation is formed by subsampling:

    ∇̂C(x) = (1/K) Σ_{k∈I_K} f_k(x),    I_K ∼ Unif(subsets of size K)

We'll treat the more general case where the gradient estimates ∇̂C(x) are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
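A hedged sketch of the subsampled estimator (the function and variable names are ours). Sampling a size-K subset uniformly without replacement includes each index with probability K/N, so the average below is unbiased for the full-data gradient:

    import numpy as np

    def minibatch_grad(per_point_grads, x, K, rng):
        # Unbiased estimate of (1/N) sum_n f_n(x), from a uniformly
        # drawn size-K subset of the N per-datapoint gradients
        idx = rng.choice(len(per_point_grads), size=K, replace=False)
        return sum(per_point_grads[k](x) for k in idx) / K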

Measure-theory-free martingale theory

Let (X_n)_{n≥0} be a stochastic process.

F_n denotes "the information describing the stochastic process up to time n", denoted mathematically by F_n = σ(X_m | m ≤ n).

Definition: A stochastic process (X_n)_{n=0}^∞ is a martingale³ if

- E[|X_n|] < ∞ for all n
- E[X_n | F_m] = X_m for n ≥ m

Martingale convergence theorem (Doob, 1953): If (X_n)_{n=1}^∞ is a martingale and sup_n E[|X_n|] < ∞, then X_n → X_∞ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If (X_n)_{n=1}^∞ is a positive stochastic process and

    Σ_{n=1}^∞ E[(E[X_{n+1} | F_n] − X_n) 1{E[X_{n+1} | F_n] − X_n > 0}] < ∞

then X_n → X_∞ almost surely.

³ on a filtered probability space (Ω, F, (F_n)_{n=0}^∞, P), with F_n = σ(X_m | m ≤ n) ∀n

Stochastic Gradient Descent

Proposition: If s_{n+1} = s_n − ε_n H_n(s_n), with H_n(s_n) an unbiased estimator for ∇C(s_n), and C satisfies

1. C has a unique minimiser x*
2. ∀ε > 0, inf_{‖x − x*‖₂² > ε} ⟨x − x*, ∇C(x)⟩ > 0
3. E[‖H_n(x)‖₂²] ≤ A + B‖x − x*‖₂² for some A, B ≥ 0 independent of n

then, subject to Σ_n ε_n = ∞ and Σ_n ε_n² < ∞, we have s_n → x*.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
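A minimal SGD sketch under these assumptions; the synthetic unbiased estimator (Gaussian noise added to the true example gradient) is our stand-in, not the slides':

    import numpy as np

    def grad_C(x):
        return np.array([12 * x[0] + 8 * x[1], 8 * x[1] * 0 + 8 * x[0] + 12 * x[1]])

    def sgd(s0, n_steps, rng):
        # s_{n+1} = s_n - eps_n * H_n(s_n), eps_n = 1/(10+n) (Robbins-Monro)
        s = s0
        for n in range(1, n_steps + 1):
            H_n = grad_C(s) + rng.normal(size=s.shape)   # unbiased estimate
            s = s - (1.0 / (10 + n)) * H_n
        return s

    print(sgd(np.array([1.5, 0.5]), 20000, np.random.default_rng(0)))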

Stochastic Gradient Descent

1. Define Lyapunov process h_n = ‖s_n − x*‖₂²
2. Consider the variations h_{n+1} − h_n
3. Show that h_n converges almost surely
4. Show that h_n must converge to 0 almost surely
5. s_n → x*

Consider the positive variations h_n⁺ = max(0, h_{n+1} − h_n).
By exactly the same calculation as in the deterministic discrete case,

    h_{n+1} − h_n = −2ε_n ⟨s_n − x*, H_n(s_n)⟩ + ε_n² ‖H_n(s_n)‖₂²

So

    E[h_{n+1} − h_n | F_n] = −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n² E[‖H_n(s_n)‖₂² | F_n]

Show that h_n converges almost surely.
Exactly the same as for discrete gradient descent:

- assume E[‖H_n(x)‖₂²] ≤ A + B‖x − x*‖₂²
- introduce μ_n = Π_{i=1}^n 1/(1 + ε_i²B) and h′_n = μ_n h_n

Get

    E[h′_{n+1} − h′_n | F_n] ≤ ε_n² μ_n A
    ⟹ E[(h′_{n+1} − h′_n) 1{E[h′_{n+1} − h′_n | F_n] > 0} | F_n] ≤ ε_n² μ_n A

Σ_{n=1}^∞ ε_n² < ∞ ⟹ quasimartingale convergence ⟹ (h′_n)_{n=1}^∞ converges a.s. ⟹ (h_n)_{n=1}^∞ converges a.s.

Show that h_n must converge to 0 almost surely.
From previous calculations,

    E[h_{n+1} − (1 + ε_n²B)h_n | F_n] ≤ −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n²A

(h_n)_{n=1}^∞ converges, so the left side is summable a.s. Assuming Σ_{n=1}^∞ ε_n² < ∞, the right term is summable a.s., so the inner-product term is also summable a.s.:

    Σ_{n=1}^∞ ε_n ⟨s_n − x*, ∇C(s_n)⟩ < ∞    almost surely

If we assume in addition that Σ_{n=1}^∞ ε_n = ∞, this forces

    ⟨s_n − x*, ∇C(s_n)⟩ → 0    almost surely.

And by our initial assumptions about C, (h_n)_{n=1}^∞ must converge to 0 almost surely.

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- C is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: Σ_{n=1}^∞ ε_n = ∞, Σ_{n=1}^∞ ε_n² < ∞
- the gradient estimate H satisfies E[‖H_n(x)‖^k] ≤ A_k + B_k‖x‖^k for k = 2, 3, 4
- there exists D > 0 such that inf_{‖x‖₂ > D} ⟨x, ∇C(x)⟩ > 0

The idea for the proof is then:

1. For a given start position, the sequence (s_n)_{n=1}^∞ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that ∇C(s_n) necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.

Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. How, then, to choose learning rates to achieve

- fast convergence?
- a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

Idea 1: search for the best learning rate schedule

- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time on it!

Motivation (cont.)

Idea 2: let the learning rates adapt themselves

- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning

- a good tutorial [Shalev-Shwartz, 2012]
- milestone paper [Zinkevich, 2003]

and some practical comparisons.

Online learning

From batch to online learning

Batch learning often assumes:

- we've got the whole dataset D
- the cost function is C(w, D) = E_{x∼D}[c(w, x)]

SGD/mini-batch learning accelerates training in real time by

- processing one/a mini-batch of data points each iteration
- considering gradients of per-data-point cost functions:

    ∇C(w, D) ≈ (1/M) Σ_{m=1}^M ∇c(w, x_m)

Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

- each iteration t, we receive a loss f_t(w)
- and a (noisy) (sub-)gradient g_t ∈ ∂f_t(w)
- then we update the weights with w ← w − ηg_t

For SGD the received gradient is defined by

    g_t = (1/M) Σ_{m=1}^M ∇c(w, x_m)


Online learning

From batch to online learning (cont.)

Online learning assumes:

- each iteration t, we receive a loss f_t(w) (in the batch learning context, f_t(w) = c(w, x_t))
- then we learn w_{t+1} following some rules

Performance is measured by regret:

    R*_T = Σ_{t=1}^T f_t(w_t) − inf_{w∈S} Σ_{t=1}^T f_t(w)    (1)

General definition: R_T(u) = Σ_{t=1}^T [f_t(w_t) − f_t(u)]

SGD ⟺ "follow the regularized leader" with L2 regularization.

Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):

    w_{t+1} = argmin_{w∈S} Σ_{i=1}^t f_i(w)    (2)

Lemma (upper bound on regret): Let w_1, w_2, ... be the sequence produced by FTL. Then ∀u ∈ S,

    R_T(u) = Σ_{t=1}^T [f_t(w_t) − f_t(u)] ≤ Σ_{t=1}^T [f_t(w_t) − f_t(w_{t+1})]    (3)

Online learning

From batch to online learning (cont.)

A game that fools FTL: let w ∈ S = [−1, 1] and the loss function at time t be

    f_t(w) =  −0.5w   if t = 1
              w       if t is even
              −w      if t > 1 and t is odd

FTL is easily fooled!
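A small simulation of this game (our construction): FTL's cumulative loss grows linearly, while the fixed comparator u = 0 incurs zero loss, so regret is Ω(T). The first play is arbitrary since the slides don't specify it:

    def loss_coeff(t):
        # f_t(w) = a_t * w with the adversarial coefficients above
        if t == 1:
            return -0.5
        return 1.0 if t % 2 == 0 else -1.0

    w, cum_coeff, ftl_loss = 0.0, 0.0, 0.0   # w: arbitrary first play
    for t in range(1, 101):
        a = loss_coeff(t)
        ftl_loss += a * w
        cum_coeff += a
        # FTL: minimise the cumulative linear loss (sum_i a_i) * w over [-1, 1]
        w = -1.0 if cum_coeff > 0 else 1.0
    print(ftl_loss)   # grows like T/2, while u = 0 pays 0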

Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

    w_{t+1} = argmin_{w∈S} Σ_{i=1}^t f_i(w) + φ(w)    (4)

Lemma (upper bound on regret): Let w_1, w_2, ... be the sequence produced by FTRL. Then ∀u ∈ S,

    Σ_{t=1}^T [f_t(w_t) − f_t(u)] ≤ φ(u) − φ(w_1) + Σ_{t=1}^T [f_t(w_t) − f_t(w_{t+1})]

Online learning

From batch to online learning (cont.)

Let's assume the losses f_t are convex functions:

    ∀w, u ∈ S: f_t(u) − f_t(w) ≥ ⟨u − w, g(w)⟩,  ∀g(w) ∈ ∂f_t(w)

Use the linearisation f̃_t(w) = f_t(w_t) + ⟨w, g(w_t)⟩ as a surrogate:

    Σ_{t=1}^T f_t(w_t) − f_t(u) ≤ Σ_{t=1}^T ⟨w_t, g_t⟩ − ⟨u, g_t⟩,  ∀g_t ∈ ∂f_t(w_t)

Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):

    w_{t+1} = w_t − ηg_t,  g_t ∈ ∂f_t(w_t)    (5)

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (it upper bounds the LHS)
- use the L2 regularizer φ(w) = (1/2η)‖w − w_1‖₂²
- apply the FTRL regret bound to the surrogate loss:

    w_{t+1} = argmin_{w∈S} Σ_{i=1}^t ⟨w, g_i⟩ + φ(w)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions f_1, f_2, ..., f_T with regularizer φ(w) = (1/2η)‖w − w_1‖₂². Then for all u ∈ S,

    R_T(u) ≤ (1/2η)‖u − w_1‖₂² + η Σ_{t=1}^T ‖g_t‖₂²,  g_t ∈ ∂f_t(w_t)
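A hedged OGD sketch of update (5); the optional projection back onto a ball-shaped domain S is our addition to keep iterates feasible:

    import numpy as np

    def ogd(w0, subgrad_oracles, eta, radius=None):
        # w_{t+1} = w_t - eta * g_t, one subgradient oracle per round;
        # optionally project onto S = {w : ||w||_2 <= radius}
        w = np.asarray(w0, dtype=float)
        for subgrad_t in subgrad_oracles:
            w = w - eta * subgrad_t(w)
            if radius is not None and np.linalg.norm(w) > radius:
                w = w * (radius / np.linalg.norm(w))
        return w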

Online learning

From batch to online learning (cont.)

Keep in mind for the later discussion of adaGrad:

- we want φ to be strongly convex w.r.t. a (semi-)norm ‖·‖
- then the L2 norm in the bound is replaced by the dual norm ‖·‖*
- and we use Hölder's inequality in the proof

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms ψ_t:

- ψ_t is a strongly-convex function, and
- B_{ψ_t}(w, w_t) = ψ_t(w) − ψ_t(w_t) − ⟨∇ψ_t(w_t), w − w_t⟩

Primal-dual sub-gradient update:

    w_{t+1} = argmin_w η⟨w, g_t⟩ + ηφ(w) + (1/t)ψ_t(w)    (6)

Proximal gradient/composite mirror descent update:

    w_{t+1} = argmin_w η⟨w, g_t⟩ + ηφ(w) + B_{ψ_t}(w, w_t)    (7)

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

    G_t = Σ_{i=1}^t g_i g_i^T,  H_t = δI + diag(G_t)^{1/2}

    w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηφ(w) + (1/2t) w^T H_t w    (8)

    w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηφ(w) + (1/2)(w − w_t)^T H_t (w − w_t)    (9)

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time: ψ_t(w) = (1/2) w^T H_t w

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on S = R^D with φ(w) = 0:

    w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2)(w − w_t)^T [δI + diag(G_t)^{1/2}](w − w_t)

    ⟹ w_{t+1} = w_t − η g_t / (δ + √(Σ_{i=1}^t g_i²))    (10)

(with the division and square root taken elementwise)
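A minimal diagonal-adaGrad sketch implementing update (10); the names and default δ are ours:

    import numpy as np

    def adagrad(w0, subgrad_oracles, eta, delta=1e-8):
        # Per-coordinate step: eta / (delta + sqrt(running sum of g^2))
        w = np.asarray(w0, dtype=float)
        G = np.zeros_like(w)                 # running sum of squared gradients
        for subgrad_t in subgrad_oracles:
            g = subgrad_t(w)
            G += g * g
            w = w - eta * g / (delta + np.sqrt(G))
        return w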

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:

- we want the contribution of the dual-norm terms ‖g_t‖*² induced by ψ_t to be upper bounded in terms of ‖g_t‖₂
- define ‖·‖_{ψ_t} = √⟨·, H_t ·⟩
- ψ_t(w) = (1/2) w^T H_t w is strongly convex w.r.t. ‖·‖_{ψ_t}
- a little math can show that

    Σ_{t=1}^T ‖g_t‖²_{ψ_t,*} ≤ 2 Σ_{d=1}^D ‖g_{1:T,d}‖₂

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence w_t is generated by adaGrad using the primal-dual update (8) with δ ≥ max_t ‖g_t‖_∞. Then for any u ∈ S,

    R_T(u) ≤ (δ/η)‖u‖₂² + (1/η)‖u‖_∞² Σ_{d=1}^D ‖g_{1:T,d}‖₂ + η Σ_{d=1}^D ‖g_{1:T,d}‖₂

For w_t generated using the composite mirror descent update (9),

    R_T(u) ≤ (1/2η) max_{t≤T} ‖u − w_t‖₂² Σ_{d=1}^D ‖g_{1:T,d}‖₂ + η Σ_{d=1}^D ‖g_{1:T,d}‖₂

Advanced adaptive learning rates

Improving adaGrad

Possible problems of adaGrad:

- the global learning rate η still needs hand tuning
- sensitive to initial conditions

Possible improving directions:

- use a truncated sum/running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Advanced adaptive learning rates

Improving adaGrad (cont.)

RMSprop (improving direction used: a running average instead of the full sum):

    G_t = ρG_{t−1} + (1 − ρ)g_t g_t^T,  H_t = (δI + diag(G_t))^{1/2}

    w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2)(w − w_t)^T H_t (w − w_t)

    ⟹ w_{t+1} = w_t − η g_t / RMS[g]_t,  RMS[g]_t = diag(H_t)    (11)
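A minimal RMSprop sketch implementing update (11); names and defaults are ours:

    import numpy as np

    def rmsprop(w0, subgrad_oracles, eta, rho=0.9, delta=1e-8):
        # Divide by a running root-mean-square of recent gradient magnitudes
        w = np.asarray(w0, dtype=float)
        G = np.zeros_like(w)                 # EMA of squared gradients
        for subgrad_t in subgrad_oracles:
            g = subgrad_t(w)
            G = rho * G + (1 - rho) * g * g
            w = w - eta * g / np.sqrt(delta + G)
        return w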

Advanced adaptive learning rates

Improving adaGrad (cont.)

Heuristic (improving direction used: (approximate) second-order information):

    ∆w ∝ H⁻¹g ∝ (∂f/∂w) / (∂²f/∂w²)
    ⟹ ∂²f/∂w² ∝ (∂f/∂w) / ∆w
    ⟹ diag(H) ≈ RMS[g]_t / RMS[∆w]_{t−1}

Advanced adaptive learning rates

Improving adaGrad (cont.)

adaDelta:

    diag(H_t) = RMS[g]_t / RMS[∆w]_{t−1}

    w_{t+1} = argmin_w ⟨w, g_t⟩ + (1/2)(w − w_t)^T H_t (w − w_t)

    ⟹ w_{t+1} = w_t − (RMS[∆w]_{t−1} / RMS[g]_t) g_t    (12)
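A minimal adaDelta sketch implementing update (12); note there is no global learning rate. Placing the small constant δ inside both RMS terms follows Zeiler's paper and is an assumption here:

    import numpy as np

    def adadelta(w0, subgrad_oracles, rho=0.95, delta=1e-6):
        # Per-coordinate step RMS[dw]_{t-1} / RMS[g]_t, no global eta
        w = np.asarray(w0, dtype=float)
        Eg2 = np.zeros_like(w)               # EMA of squared gradients
        Edw2 = np.zeros_like(w)              # EMA of squared updates
        for subgrad_t in subgrad_oracles:
            g = subgrad_t(w)
            Eg2 = rho * Eg2 + (1 - rho) * g * g
            dw = -(np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta)) * g
            Edw2 = rho * Edw2 + (1 - rho) * dw * dw
            w = w + dw
        return w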

Advanced adaptive learning rates

Improving adaGrad (cont.)

ADAM (improving directions used: momentum, plus correcting the bias of the running averages):

    m_t = β₁m_{t−1} + (1 − β₁)g_t,  v_t = β₂v_{t−1} + (1 − β₂)g_t²

    m̂_t = m_t / (1 − β₁ᵗ),  v̂_t = v_t / (1 − β₂ᵗ)

    w_{t+1} = w_t − η m̂_t / (√v̂_t + δ)    (13)
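A minimal ADAM sketch implementing update (13); names and default hyper-parameters are ours:

    import numpy as np

    def adam(w0, subgrad_oracles, eta, beta1=0.9, beta2=0.999, delta=1e-8):
        # Bias-corrected first- and second-moment estimates of the gradient
        w = np.asarray(w0, dtype=float)
        m = np.zeros_like(w)
        v = np.zeros_like(w)
        for t, subgrad_t in enumerate(subgrad_oracles, start=1):
            g = subgrad_t(w)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
            w = w - eta * m_hat / (np.sqrt(v_hat) + delta)
        return w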

Advanced adaptive learning rates

Demo time!

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Advanced adaptive learning rates

Learn the learning rates

Idea 3: learn the (global) learning rates as well.

    w_{t+1} = w_t − ηg_t  ⟹  w_t = w(t, η; w_1, {g_i}_{i≤t−1})

w_t is a function of η, so learn η (with gradient descent) by

    η = argmin_η Σ_{s≤t} f_s(w(s, η))

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn η on the fly [Masse and Ollivier, 2015]

Advanced adaptive learning rates

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour spent on tuning hyper-parameters.

Advanced adaptive learning rates

References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1): 281-305, 2012.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12: 2121-2159, 2011.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
- Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2): 107-194, 2011.
- Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 8: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo

Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator

7 of 27

A brief history of numerical analysis

Present dayStochastic approximation widely researched and used in practice

I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent

I Neural networksI K-MeansI

I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu

amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))

TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)

8 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

Step 3: show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

- assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^\star\|_2^2$;
- introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.

We get

$$\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A$$
$$\implies \mathbb{E}\big[(h'_{n+1} - h'_n)\,\mathbb{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \,\big|\, \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A.$$

Since $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, the quasimartingale convergence theorem applies, so $(h'_n)_{n=1}^\infty$ converges almost surely, and hence $(h_n)_{n=1}^\infty$ converges almost surely.
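The one delicate ingredient is that $\mu_n = \prod_{i=1}^n \frac{1}{1+\varepsilon_i^2 B}$ stays bounded away from 0 whenever $\sum_n \varepsilon_n^2 < \infty$, so convergence of $h'_n$ really does transfer to $h_n$. A two-line numerical check (my addition, with an arbitrary value of $B$):

import numpy as np

B = 10.0
eps = 1.0 / np.arange(1, 100001)        # Robbins-Monro rates eps_n = 1/n
mu = np.cumprod(1.0 / (1.0 + eps**2 * B))
print(mu[-1])                           # converges to a strictly positive limit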

Stochastic Gradient Descent

Step 4: show that $h_n$ must converge to 0 almost surely. From the previous calculations,

$$\mathbb{E}\big[h_{n+1} - (1 + \varepsilon_n^2 B)\, h_n \,\big|\, \mathcal{F}_n\big] \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A.$$

Since $(h_n)_{n=1}^\infty$ converges, the left-hand side is summable almost surely. Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, the $\varepsilon_n^2 A$ term is summable too, so the remaining term must also be summable almost surely:

$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle < \infty \quad \text{almost surely}.$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely},$$

and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.


Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this only guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.

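A sketch of what this weaker guarantee looks like in practice (my toy example, not from the slides): on the one-dimensional double-well cost $C(x) = (x^2 - 1)^2$, whose critical points are the local maximum $x = 0$ and the global minima $x = \pm 1$, SGD with Robbins-Monro rates settles at a critical point; which one depends on the noise realisation.

import numpy as np

rng = np.random.default_rng(2)

def grad_C(x):
    return 4.0 * x * (x**2 - 1.0)      # C(x) = (x^2 - 1)^2

x = 0.1                                 # start near the local maximum at 0
for n in range(1, 50001):
    g = grad_C(x) + rng.normal(scale=0.3)   # unbiased noisy gradient
    x -= (0.1 / n) * g                      # Robbins-Monro step sizes
print(x)    # ends near one of the minima x = +-1; the sign depends on the noise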

Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. How, then, should we choose learning rates to achieve

- fast convergence,
- a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012.]

Idea 1: search for the best learning rate schedule:

- grid search, random search, cross validation;
- Bayesian optimization.

But I'm lazy and I don't want to spend too much time!

Motivation (cont.)

Idea 2: let the learning rates adapt themselves:

- pre-conditioning with $|\mathrm{diag}(H)|$ [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and then give some practical comparisons.

Online learning

From batch to online learning

Batch learning often assumes:

- we've got the whole dataset $\mathcal{D}$;
- the cost function is $C(w; \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$.

SGD/mini-batch learning accelerates training in real time by

- processing one datapoint or a mini-batch of datapoints each iteration,
- using gradients of the per-datapoint cost functions (see the sketch below):

$$\nabla C(w; \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m).$$
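A minimal sketch of the subsampled gradient (my toy example, assuming the per-datapoint cost $c(w, x) = \tfrac{1}{2}\|w - x\|^2$): drawing the mini-batch uniformly at random makes the estimate unbiased for the batch gradient.

import numpy as np

rng = np.random.default_rng(3)
D = rng.normal(size=(1000, 2))            # toy dataset of N = 1000 points
w = np.array([0.3, -0.2])

# c(w, x) = ||w - x||^2 / 2, so grad c(w, x) = w - x.
full_grad = w - D.mean(axis=0)            # grad C(w; D), the full-batch gradient
batch = D[rng.choice(len(D), size=32, replace=False)]
mini_grad = w - batch.mean(axis=0)        # unbiased: E[mini_grad] = full_grad
print(full_grad, mini_grad)               # mini_grad fluctuates around full_grad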

Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

- each iteration $t$, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$;
- then we update the weights with $w \leftarrow w - \eta g_t$.

For SGD, the received gradient is defined by

$$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m).$$


Online learning

From batch to online learning (cont.)

Online learning assumes:

- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$);
- then we learn $w_{t+1}$ following some rules.

Performance is measured by regret:

$$R^\star_T = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w). \tag{1}$$

General definition: $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):

$$w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t f_i(w). \tag{2}$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,

$$R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]. \tag{3}$$

Online learning

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$, and let the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\,w & t = 1, \\ w & t \text{ is even}, \\ -w & t > 1 \text{ and } t \text{ is odd}. \end{cases}$$

FTL is easily fooled! (See the simulation below.)
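Simulating this game (a sketch of mine) shows why: FTL's iterates flip between $\pm 1$ and pay a loss of about 1 per round, while the fixed comparator $u = 0$ pays 0, so the regret grows linearly in $T$.

import numpy as np

def f(t, w):                      # the losses from the slide (t starts at 1)
    if t == 1:
        return -0.5 * w
    return w if t % 2 == 0 else -w

T = 100
w, regret = 0.0, 0.0              # w_1 arbitrary in S = [-1, 1]
cum = lambda t, v: sum(f(i, v) for i in range(1, t + 1))
for t in range(1, T + 1):
    regret += f(t, w) - f(t, 0.0)             # compare against u = 0
    # FTL: the cumulative loss is linear in w, so its minimiser over
    # S = [-1, 1] is attained at an endpoint.
    w = min((-1.0, 1.0), key=lambda v: cum(t, v))
print(regret)                     # ~ T - 1: regret grows linearly, FTL is fooled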

Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w). \tag{4}$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,

$$\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})].$$

Online learning

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S,\quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle \quad \forall g(w) \in \partial f_t(w).$$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g_t \rangle$ as a surrogate; then

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t).$$

Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t). \tag{5}$$

From FTRL to online gradient descent:

- use the linearisation as the surrogate loss (its regret upper-bounds the true regret);
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$;
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \operatorname*{argmin}_{w \in S}\ \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w).$$

Online learning

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,

$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t).$$
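A quick numerical check of this bound (my sketch, using the quadratic losses $f_t(w) = \tfrac{1}{2}\|w - z_t\|^2$, which are convex so the linearisation argument applies):

import numpy as np

rng = np.random.default_rng(4)
T, eta = 500, 0.05
targets = rng.normal(size=(T, 2))     # z_t defining f_t(w) = ||w - z_t||^2 / 2
w1 = np.zeros(2)
w, ws, gs = w1.copy(), [], []
for z in targets:
    ws.append(w.copy())
    g = w - z                         # gradient of f_t at w_t
    gs.append(g)
    w = w - eta * g                   # OGD update (5)

u = targets.mean(axis=0)              # a strong fixed comparator
f = lambda v, z: 0.5 * np.sum((v - z) ** 2)
regret = sum(f(wt, z) for wt, z in zip(ws, targets)) - sum(f(u, z) for z in targets)
bound = np.sum((u - w1) ** 2) / (2 * eta) + eta * sum(np.sum(g ** 2) for g in gs)
print(regret, bound)                  # the realised regret stays below the bound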

Keep in mind for the later discussion of adaGrad:

- we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$;
- then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_*$;
- and we use Hölder's inequality in the proof.

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle.$$

Primal-dual sub-gradient update:

$$w_{t+1} = \operatorname*{argmin}_w\ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w). \tag{6}$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \operatorname*{argmin}_w\ \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t). \tag{7}$$

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2}),$$

$$w_{t+1} = \operatorname*{argmin}_{w \in S}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w, \tag{8}$$

$$w_{t+1} = \operatorname*{argmin}_{w \in S}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t). \tag{9}$$

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective;
- adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2} w^T H_t w.$$

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \operatorname*{argmin}_w\ \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T \big[\delta I + \mathrm{diag}(G_t)^{1/2}\big] (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \quad \text{(elementwise)}. \tag{10}$$
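A minimal sketch of update (10) (my code, on the toy cost $C(w) = \tfrac{1}{2}\|w\|^2$ with noisy gradients); note the per-coordinate effective step size $\eta / (\delta + \sqrt{\sum_i g_i^2})$:

import numpy as np

rng = np.random.default_rng(5)
w = np.array([1.5, 0.5])
eta, delta = 0.5, 1e-8
G = np.zeros(2)                        # running sum of squared gradients, diag(G_t)

def grad(w):
    return w + rng.normal(scale=0.1, size=2)   # noisy gradient of ||w||^2 / 2

for t in range(1000):
    g = grad(w)
    G += g * g
    w -= eta * g / (delta + np.sqrt(G))        # per-coordinate step, eq. (10)
print(w)                               # close to the minimiser at the origin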

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:

- we want the contribution of the dual norms $\|g_t\|_*^2$ induced by $\psi_t$ to be upper-bounded in terms of $\|g_t\|_2$;
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot,\, H_t\, \cdot \rangle}$;
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$;
- a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2.$$

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.$$

For $\{w_t\}$ generated using the composite mirror descent update (9),

$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.$$

Advanced adaptive learning rates

Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand tuning;
- sensitive to initial conditions.

Possible improving directions:

- use a truncated sum / running average;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Advanced adaptive learning rates

Improving adaGrad (cont.): RMSprop

$$G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2},$$

$$w_{t+1} = \operatorname*{argmin}_w\ \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t). \tag{11}$$

Improving direction used here: replace the full sum with a running average.

Advanced adaptive learning rates

Improving adaGrad (cont.)

A second-order heuristic:

$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \implies \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \implies \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}.$$

Improving direction used here: (approximate) second-order information.

Advanced adaptive learning rates

Improving adaGrad (cont.): adaDelta

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}},$$

$$w_{t+1} = \operatorname*{argmin}_w\ \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t. \tag{12}$$

Improving directions used here: running average plus (approximate) second-order information.

Advanced adaptive learning rates

Improving adaGrad (cont.): ADAM

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t},$$

$$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}. \tag{13}$$

Improving directions used here: running average, momentum, and bias correction of the running average.
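For concreteness, here is a compact sketch of the three update rules (11)-(13) (my code, with arbitrarily chosen hyper-parameter values) on the Rosenbrock function, one of the standard test functions for optimization; its global minimiser is $(1, 1)$.

import numpy as np

def grad_rosenbrock(w):
    # f(x, y) = (1 - x)^2 + 100 (y - x^2)^2
    x, y = w
    return np.array([-2*(1 - x) - 400*x*(y - x**2), 200*(y - x**2)])

def run(update, w0, T=5000):
    w, state = np.array(w0, float), {}
    for t in range(1, T + 1):
        w = update(w, grad_rosenbrock(w), state, t)
    return w

def rmsprop(w, g, s, t, eta=1e-3, rho=0.9, d=1e-8):
    s['G'] = rho * s.get('G', 0.0) + (1 - rho) * g * g   # running average, eq. (11)
    return w - eta * g / (np.sqrt(s['G']) + d)

def adadelta(w, g, s, t, rho=0.95, d=1e-6):
    s['G'] = rho * s.get('G', 0.0) + (1 - rho) * g * g
    step = -np.sqrt(s.get('D', 0.0) + d) / np.sqrt(s['G'] + d) * g   # eq. (12)
    s['D'] = rho * s.get('D', 0.0) + (1 - rho) * step * step
    return w + step

def adam(w, g, s, t, eta=1e-2, b1=0.9, b2=0.999, d=1e-8):
    s['m'] = b1 * s.get('m', 0.0) + (1 - b1) * g
    s['v'] = b2 * s.get('v', 0.0) + (1 - b2) * g * g
    mhat, vhat = s['m'] / (1 - b1**t), s['v'] / (1 - b2**t)   # bias correction
    return w - eta * mhat / (np.sqrt(vhat) + d)               # eq. (13)

for name, upd in [('rmsprop', rmsprop), ('adadelta', adadelta), ('adam', adam)]:
    # Iterates after T steps; each moves toward (1, 1) at a method-dependent speed.
    print(name, run(upd, [-1.0, 1.0]))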

Advanced adaptive learning rates

Demo time!

See the Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Advanced adaptive learning rates

Learn the learning rates

Idea 3: learn the (global) learning rates as well.

$$w_{t+1} = w_t - \eta g_t \implies w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1}),$$

i.e. $w_t$ is a function of $\eta$. Learn $\eta$ (with gradient descent) by

$$\eta^* = \operatorname*{argmin}_\eta \sum_{s \le t} f_s(w(s, \eta)).$$

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015].
- Stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015].
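A deliberately crude sketch of idea 3 (my illustration; the papers above use reverse-mode differentiation or on-the-fly SGD rather than finite differences): treat the loss after $T$ steps of gradient descent as a function of $\eta$, and descend on $\log \eta$ with a finite-difference hypergradient.

import numpy as np

def loss_after(eta, T=100):
    # Run T steps of GD on f(w) = ||w||^2 / 2 with rate eta; return the final loss.
    w = np.array([5.0, -3.0])
    for _ in range(T):
        w = w - eta * w                # gradient of f is w
    return 0.5 * np.sum(w ** 2)

log_eta = np.log(0.001)                # a poor initial global learning rate
for _ in range(50):
    h = 1e-4                           # finite-difference hypergradient wrt log eta
    d = (loss_after(np.exp(log_eta + h)) - loss_after(np.exp(log_eta - h))) / (2 * h)
    log_eta -= 0.5 * d                 # descend on the hyper-objective
# eta grows until the final loss is negligible, then the hypergradient stalls.
print(np.exp(log_eta))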

Advanced adaptive learning rates

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour spent tuning hyper-parameters.

Advanced adaptive learning rates

References

- Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 9: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

A brief history of numerical analysis

1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available

5 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo

Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator

7 of 27

A brief history of numerical analysis

Present dayStochastic approximation widely researched and used in practice

I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent

I Neural networksI K-MeansI

I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu

amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))

TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)

8 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)



Demo time

wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
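The original animated demo cannot be recovered from the slide text. As a stand-in, one can race the sketches above on one of those test functions, for instance Beale's function; this assumes the adagrad, rmsprop and adam helpers defined earlier.

```python
import numpy as np

def beale_grad(w):
    """Gradient of Beale's function; its global minimum sits at (3, 0.5)."""
    x, y = w
    t1 = 1.5 - x + x * y
    t2 = 2.25 - x + x * y ** 2
    t3 = 2.625 - x + x * y ** 3
    dx = 2 * t1 * (y - 1) + 2 * t2 * (y ** 2 - 1) + 2 * t3 * (y ** 3 - 1)
    dy = 2 * t1 * x + 2 * t2 * 2 * x * y + 2 * t3 * 3 * x * y ** 2
    return np.array([dx, dy])

for opt in (adagrad, rmsprop, adam):          # sketched earlier in this section
    print(opt.__name__, opt(beale_grad, [1.0, 1.0], n_steps=5000))
# prints where each method lands after the same gradient budget
```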


Learn the learning rates

idea 3: learn the (global) learning rate as well.

w_{t+1} = w_t - \eta g_t \quad \Rightarrow \quad w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1}),

so w_t is a function of \eta, and we can learn \eta (with gradient descent) by

\eta^* = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta)).

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn \eta on the fly [Masse and Ollivier, 2015]

A toy version of the idea is sketched below.
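Only a crude, self-contained stand-in for the cited methods (not Maclaurin et al.'s reverse-mode algorithm): treat the loss after T gradient steps as a function of η and descend on η with a finite-difference hypergradient. All names and constants here are our own.

```python
import numpy as np

def loss_after_sgd(eta, w0, grad_fn, loss_fn, T=50):
    """Run T gradient steps with rate eta; return the final loss f(w(T, eta))."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(T):
        w -= eta * grad_fn(w)
    return loss_fn(w)

# same quadratic as before: C(x) = 5(x1+x2)^2 + (x1-x2)^2
loss_C = lambda w: 5 * (w[0] + w[1]) ** 2 + (w[0] - w[1]) ** 2
grad_C = lambda w: np.array([12 * w[0] + 8 * w[1], 8 * w[0] + 12 * w[1]])

eta, h, lr = 0.01, 1e-4, 1e-3
for _ in range(100):                          # gradient descent on eta itself
    up = loss_after_sgd(eta + h, [1.5, 0.5], grad_C, loss_C)
    down = loss_after_sgd(eta - h, [1.5, 0.5], grad_C, loss_C)
    eta -= lr * (up - down) / (2 * h)         # finite-difference hypergradient
print(eta)                                    # a tuned global learning rate
```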


Summary

Recent successes of machine learning rely on stochastic approximation and optimisation methods.

Theoretical analysis guarantees good behaviour.

Adaptive learning rates provide good performance in practice.

Future work will reduce the labour spent tuning hyper-parameters.

References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281-305, 2012.

Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011.

Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.

Maclaurin, D., Duvenaud, D., and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Masse, P.-Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 10: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo

Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator

7 of 27

A brief history of numerical analysis

Present dayStochastic approximation widely researched and used in practice

I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent

I Neural networksI K-MeansI

I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu

amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))

TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)

8 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 11: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo

Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator

7 of 27

A brief history of numerical analysis

Present dayStochastic approximation widely researched and used in practice

I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent

I Neural networksI K-MeansI

I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu

amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))

TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)

8 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!

Discrete Gradient Descent

An intuitive, straightforward discretisation of

  $\frac{\mathrm{d}}{\mathrm{d}t} s(t) = -\nabla C(s(t))$

is

  $\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n).$   (Forward Euler)

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method which is (A-)stable is the implicit Euler method:

  $\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1}).$

So why settle for explicit discretisation? The implicit method requires the solution of a non-linear system at each step, and practically, explicit discretisation works.
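A minimal sketch of the stability difference (illustrative, not from the slides): on the assumed quadratic cost $C(x) = \frac{1}{2}\lambda x^2$, forward Euler diverges once $\varepsilon > 2/\lambda$, while implicit Euler contracts for every $\varepsilon > 0$. For a quadratic the implicit step has a closed form; in general it needs a non-linear solve.

# Gradient flow for the assumed cost C(x) = 0.5 * lam * x**2, so grad C(x) = lam * x.
lam, eps, x0, steps = 10.0, 0.25, 1.0, 50   # eps > 2/lam = 0.2, so explicit is unstable

x_explicit = x0
x_implicit = x0
for _ in range(steps):
    # Forward Euler: x <- x - eps * grad C(x) = (1 - eps*lam) * x
    x_explicit = x_explicit - eps * lam * x_explicit
    # Implicit Euler: x_next = x - eps * grad C(x_next)  =>  x_next = x / (1 + eps*lam)
    x_implicit = x_implicit / (1.0 + eps * lam)

print(abs(x_explicit))  # blows up: |1 - eps*lam| = 1.5 > 1
print(abs(x_implicit))  # decays towards 0 for any eps > 0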

Discrete Gradient Descent

We also have multistep methods, e.g.

  $s_{n+2} = s_{n+1} - \varepsilon_{n+2} \left( \frac{3}{2} \nabla C(s_{n+1}) - \frac{1}{2} \nabla C(s_n) \right).$

This is the 2nd-order Adams-Bashforth method.

It has similar (in)stability properties to forward Euler, but a discretisation error of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).
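A sketch of the two-step scheme (assumed cost and step size; the first step is bootstrapped with one forward Euler step, a standard choice the slides don't specify):

import numpy as np

def grad_C(x):                       # assumed quadratic cost C(x) = ||x||^2
    return 2.0 * x

eps = 0.05
s_prev = np.array([1.5, 0.5])            # s_0
s_curr = s_prev - eps * grad_C(s_prev)   # bootstrap s_1 with forward Euler
for _ in range(200):
    # 2nd-order Adams-Bashforth: combine the two most recent gradients
    s_next = s_curr - eps * (1.5 * grad_C(s_curr) - 0.5 * grad_C(s_prev))
    s_prev, s_curr = s_curr, s_next

print(s_curr)                        # close to the minimiser at the origin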

Discrete Gradient Descent

[Animations: examples of discretisation with $\varepsilon = 0.1$, $\varepsilon = 0.07$ and $\varepsilon = 0.01$.]

Discrete Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies

1. $C$ has a unique minimiser $x^*$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$;
3. $\|\nabla C(x)\|_2^2 \le A + B \|x - x^*\|_2^2$ for some $A, B \ge 0$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
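A minimal sketch of the proposition's setting (assumed cost and constants): the step sizes $\varepsilon_n = 1/n$ satisfy both conditions, $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$.

import numpy as np

x_star = np.array([1.0, -1.0])

def grad_C(x):                                  # assumed cost with minimiser x_star
    return 2.0 * (x - x_star)

s = np.array([10.0, 10.0])
for n in range(1, 100001):
    eps_n = 1.0 / n                             # Robbins-Monro: sum eps = inf, sum eps^2 < inf
    s = s - eps_n * grad_C(s)

print(np.linalg.norm(s - x_star))               # ~0: s_n -> x*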

Discrete Gradient Descent

Proof outline:

1. Define the Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$.
4. Show that this implies that $h_n$ converges.
5. Show that $h_n$ must converge to 0.
6. $s_n \to x^*$.

Step 1: define the Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.

Because of the error that the discretisation introduces, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. (Illustration credit: Online Learning and Stochastic Approximations, Leon Bottou (1998).)

Step 2: consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

With some algebra, note that

  $h_{n+1} - h_n = \langle s_{n+1} - x^*, s_{n+1} - x^* \rangle - \langle s_n - x^*, s_n - x^* \rangle$
  $\qquad = \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2 \langle s_{n+1} - s_n, x^* \rangle$
  $\qquad = \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2 \varepsilon_n \langle \nabla C(s_n), x^* \rangle$
  $\qquad = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2.$

Step 3: show that $\sum_{n=1}^\infty h_n^+ < \infty$.

Assuming $\|\nabla C(x)\|_2^2 \le A + B \|x - x^*\|_2^2$, we get

  $h_{n+1} - h_n \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + B h_n)$
  $\implies h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A.$

Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get

  $h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A.$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable; hence the positive variations of $h'_n$ are summable, and (since $\mu_n$ converges away from 0 and $h'_n$ stays bounded) so are those of $h_n$.

Step 4: show that this implies that $h_n$ converges.

If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$ then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$. (Why? Because $h_n \ge 0$, the cumulative 'down' movements cannot exceed the cumulative 'up' movements by more than $h_1$.)

But $h_{n+1} = h_1 + \sum_{k=1}^n \big( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \big)$.

So $(h_n)_{n=1}^\infty$ converges.

Step 5: show that $h_n$ must converge to 0. Assume $h_n$ converges to some positive number. Previously we had

  $h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2.$

The left side is summable, as is the $\varepsilon_n^2$ term, so $\sum_n \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty$. If we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

  $\langle s_n - x^*, \nabla C(s_n) \rangle \to 0$ (at least along a subsequence),

contradicting $h_n$ converging away from 0, which would keep the inner product bounded away from 0.

Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

  $\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x)$

and an approximation formed by subsampling:

  $\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K).$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
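A small sketch of the subsampled estimator (synthetic data; all names and constants are illustrative). Averaging the minibatch estimator over many draws recovers the full gradient, matching the unbiasedness assumption:

import numpy as np

rng = np.random.default_rng(0)
N, K, D = 1000, 10, 3
data = rng.normal(size=(N, D))

def f_n(x, n):                      # per-datapoint gradient term f_n(x)
    return 2.0 * (x - data[n])      # from the assumed cost (1/N) sum ||x - data_n||^2

x = rng.normal(size=D)
full_grad = np.mean([f_n(x, n) for n in range(N)], axis=0)

# Subsampled estimate: average f_k over a uniformly random size-K subset.
estimates = []
for _ in range(5000):
    idx = rng.choice(N, size=K, replace=False)
    estimates.append(np.mean([f_n(x, k) for k in idx], axis=0))

print(full_grad)
print(np.mean(estimates, axis=0))   # close to full_grad: the estimator is unbiased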

Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale(*) if

- $\mathbb{E}[|X_n|] < \infty$ for all $n$;
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$.

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

  $\sum_{n=1}^\infty \mathbb{E}\left[ (\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n) \, \mathbf{1}_{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0} \right] < \infty,$

then $X_n \to X_\infty$ almost surely.

(*) on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$ with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.

Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^*$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$;
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$ almost surely.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
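A minimal sketch of the proposition in action (assumed quadratic cost; zero-mean Gaussian noise makes $H_n$ unbiased with the required second-moment bound):

import numpy as np

rng = np.random.default_rng(1)
x_star = np.array([1.0, -1.0])

def H_n(x):
    # Unbiased gradient estimate: true gradient plus zero-mean noise,
    # so E||H_n(x)||^2 <= A + B ||x - x*||^2 holds with finite A, B.
    return 2.0 * (x - x_star) + rng.normal(scale=1.0, size=2)

s = np.array([10.0, 10.0])
for n in range(1, 200001):
    s = s - (1.0 / n) * H_n(s)      # Robbins-Monro step sizes

print(np.linalg.norm(s - x_star))   # small: s_n -> x* almost surely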

Stochastic Gradient Descent

Proof outline:

1. Define the Lyapunov process $h_n = \|s_n - x^*\|_2^2$.
2. Consider the variations $h_{n+1} - h_n$.
3. Show that $h_n$ converges almost surely.
4. Show that $h_n$ must converge to 0 almost surely.
5. $s_n \to x^*$.

Step 2: consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

  $h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2.$

So

  $\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \, \mathbb{E}\left[ \|H_n(s_n)\|_2^2 \mid \mathcal{F}_n \right].$

Step 3: show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

- assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2$;
- introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.

We get

  $\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A$
  $\implies \mathbb{E}\left[ (h'_{n+1} - h'_n) \, \mathbf{1}_{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0} \right] \le \varepsilon_n^2 \mu_n A.$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty \implies$ quasimartingale convergence $\implies (h'_n)_{n=1}^\infty$ converges a.s. $\implies (h_n)_{n=1}^\infty$ converges a.s.

Step 4: show that $h_n$ must converge to 0 almost surely. From previous calculations,

  $\mathbb{E}\left[ h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n \right] \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A.$

$(h_n)_{n=1}^\infty$ converges a.s., so the left-hand terms are summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the right term is summable a.s., and hence the remaining term is also summable a.s.:

  $\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty$ almost surely.

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

  $\langle s_n - x^*, \nabla C(s_n) \rangle \to 0$ almost surely.

And by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.

Motivation

So far we've told you why SGD is "safe" :)

...but Robbins-Monro is just a sufficient condition. So how should we choose learning rates to achieve

- fast convergence?
- a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012.]

Idea 1: search for the best learning rate schedule:
- grid search, random search, cross validation;
- Bayesian optimization.

But I'm lazy, and I don't want to spend too much time on this.

Motivation (cont.)

Idea 2: let the learning rates adapt themselves:
- pre-conditioning with $|\mathrm{diag}(H)|$ [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:
- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

...and then some practical comparisons.

Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$;
- the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$.

SGD / mini-batch learning accelerates training in real time by
- processing one datapoint / a mini-batch of datapoints each iteration,
- considering gradients of the data-point cost functions:

  $\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m).$

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD): each iteration $t$,
- we receive a loss $f_t(w)$
- and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$;
- then we update the weights with $w \leftarrow w - \eta g_t$.

For SGD, the received gradient is defined by

  $g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m).$


From batch to online learning (cont.)

Online learning assumes:
- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$);
- then we learn $w_{t+1}$ following some rules.

Performance is measured by the regret

  $R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w).$   (1)

General definition: $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

From batch to online learning (cont.)

Follow the Leader (FTL):

  $w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w).$   (2)

Lemma (upper bound on regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL. Then $\forall u \in S$:

  $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})].$   (3)

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and take the loss function at time $t$ to be

  $f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd.} \end{cases}$

FTL is easily fooled: it keeps jumping between $w = 1$ and $w = -1$, paying loss 1 every round, so its regret grows linearly in $T$.
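A small simulation of this game (a sketch; the argmin over $[-1, 1]$ of a cumulative linear loss is just a sign check):

import numpy as np

def loss_coeff(t):                   # f_t(w) = c_t * w
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T = 1000
cum = 0.0                            # cumulative linear coefficient
ftl_loss = 0.0
coeffs = []
for t in range(1, T + 1):
    # FTL plays argmin_{w in [-1,1]} cum * w, i.e. w = -sign(cum) (w_1 arbitrary, say 0)
    w_t = 0.0 if t == 1 else -np.sign(cum)
    c_t = loss_coeff(t)
    ftl_loss += c_t * w_t
    cum += c_t
    coeffs.append(c_t)

best_fixed = min(sum(coeffs) * w for w in (-1.0, 1.0))  # best fixed action in hindsight
print(ftl_loss - best_fixed)         # regret ~ T: grows linearly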

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

  $w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w).$   (4)

Lemma (upper bound on regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL. Then $\forall u \in S$:

  $\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})].$

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

  $\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle \quad \forall g(w) \in \partial f_t(w).$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle$ as a surrogate:

  $\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \qquad \forall g_t \in \partial f_t(w_t).$

From batch to online learning (cont.)

Online Gradient Descent (OGD):

  $w_{t+1} = w_t - \eta g_t, \qquad g_t \in \partial f_t(w_t).$   (5)

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (it upper-bounds the LHS);
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$;
- apply the FTRL regret bound to the surrogate loss:

  $w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w).$

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$. Then for all $u \in S$:

  $R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t).$

Keep in mind for the later discussion of adaGrad: if instead we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$, then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_*$, and Holder's inequality is used in the proof.
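A minimal OGD sketch on linear losses (illustrative synthetic stream; for simplicity the iterates are unprojected and regret is measured against the best fixed point in the unit ball). With $\eta \propto 1/\sqrt{T}$, the bound above gives $O(\sqrt{T})$ regret:

import numpy as np

rng = np.random.default_rng(2)
T, D = 10000, 5
eta = 1.0 / np.sqrt(T)              # fixed step tuned to the horizon, as in the bound

w = np.zeros(D)
total = 0.0
gs = []
for t in range(T):
    g_t = rng.normal(size=D)        # (sub-)gradient of the round-t linear loss f_t(w) = <g_t, w>
    total += g_t @ w                # incur loss f_t(w_t)
    gs.append(g_t)
    w = w - eta * g_t               # OGD update

G = np.sum(gs, axis=0)
best_fixed = -np.linalg.norm(G)     # best u in the unit ball for linear losses: u = -G/||G||
print(total - best_fixed)           # regret, O(sqrt(T)) up to constants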

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$:
- $\psi_t$ is a strongly-convex function, and
- $B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$ is the associated Bregman divergence.

Primal-dual sub-gradient update:

  $w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w).$   (6)

Proximal gradient / composite mirror descent update:

  $w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t).$   (7)

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

  $G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t)^{1/2},$

  $w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w$   (8)

  $w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t).$   (9)

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective;
- adapt the proximal term through time: $\psi_t(w) = \frac{1}{2} w^T H_t w$.

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

  $w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T \left[ \delta I + \mathrm{diag}(G_t)^{1/2} \right] (w - w_t)$

  $\Rightarrow \quad w_{t+1} = w_t - \frac{\eta \, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}}$   (elementwise).   (10)
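Update (10) in code (a sketch; the noisy quadratic toy objective and all constants are assumed):

import numpy as np

rng = np.random.default_rng(3)
D, eta, delta = 5, 0.5, 1e-8
w = rng.normal(size=D)
G_diag = np.zeros(D)                 # running sum of squared gradients, diag(G_t)

def grad(w):                         # assumed objective 0.5 * ||w||^2, with noise
    return w + 0.1 * rng.normal(size=D)

for t in range(2000):
    g = grad(w)
    G_diag += g ** 2                 # accumulate g_i^2 elementwise
    w = w - eta * g / (delta + np.sqrt(G_diag))   # adaGrad update (10)

print(np.linalg.norm(w))             # near the minimiser at 0, up to noise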

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs time-varying proximal term.

We want the cumulative contribution of the dual norms $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded:
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \, \cdot \rangle}$;
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$;
- a little math can show that

  $\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2.$

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$. Then for any $u \in S$:

  $R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.$

For $w_t$ generated using the composite mirror descent update (9):

  $R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.$

Improving adaGrad

Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning;
- it is sensitive to initial conditions.

Possible improving directions:
- use a truncated sum / running average;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods: RMSprop [Tieleman and Hinton, 2012], adaDelta [Zeiler, 2012], ADAM [Kingma and Ba, 2014].

Improving adaGrad (cont.)

RMSprop (improving direction used: replace the cumulative sum by a running average):

  $G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2},$

  $w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$

  $\Rightarrow \quad w_{t+1} = w_t - \frac{\eta \, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t).$   (11)
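Update (11) as code (a sketch with assumed constants; `rms` is the elementwise square root of the running second moment plus $\delta$):

import numpy as np

rng = np.random.default_rng(4)
D, eta, rho, delta = 5, 0.01, 0.9, 1e-8
w = rng.normal(size=D)
G_diag = np.zeros(D)                 # running average of squared gradients

def grad(w):
    return w + 0.1 * rng.normal(size=D)   # assumed noisy quadratic objective

for t in range(5000):
    g = grad(w)
    G_diag = rho * G_diag + (1.0 - rho) * g ** 2
    rms = np.sqrt(delta + G_diag)    # RMS[g]_t
    w = w - eta * g / rms            # RMSprop update (11)

print(np.linalg.norm(w))             # shrinks toward the minimiser at 0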

Improving adaGrad (cont.)

Heuristic behind adaDelta (improving direction used: approximate second-order information). For a Newton-like step,

  $\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}$
  $\Rightarrow \quad \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}$
  $\Rightarrow \quad \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}.$

Improving adaGrad (cont.)

adaDelta:

  $\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}},$

  $w_{t+1} = \arg\min_w \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$

  $\Rightarrow \quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t.$   (12)
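Update (12) in code (a sketch; running averages are kept for both $g^2$ and $(\Delta w)^2$, with $\rho$ and $\delta$ values assumed to match Zeiler's defaults):

import numpy as np

rng = np.random.default_rng(5)
D, rho, delta = 5, 0.95, 1e-6
w = rng.normal(size=D)
Eg2 = np.zeros(D)                    # running average of g^2
Edw2 = np.zeros(D)                   # running average of (delta w)^2

def grad(w):
    return w + 0.1 * rng.normal(size=D)   # assumed noisy quadratic objective

for t in range(5000):
    g = grad(w)
    Eg2 = rho * Eg2 + (1.0 - rho) * g ** 2
    dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g   # RMS[dw]_{t-1} / RMS[g]_t * g
    w = w + dw                       # adaDelta update (12): note, no global learning rate
    Edw2 = rho * Edw2 + (1.0 - rho) * dw ** 2

print(np.linalg.norm(w))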

Improving adaGrad (cont.)

ADAM (improving directions used: momentum plus bias-corrected running averages):

  $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2,$

  $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t},$

  $w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}.$   (13)
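Update (13) in code (a sketch; constants follow the defaults in Kingma and Ba, the toy objective is assumed):

import numpy as np

rng = np.random.default_rng(6)
D, eta, beta1, beta2, delta = 5, 0.01, 0.9, 0.999, 1e-8
w = rng.normal(size=D)
m = np.zeros(D)
v = np.zeros(D)

def grad(w):
    return w + 0.1 * rng.normal(size=D)   # assumed noisy quadratic objective

for t in range(1, 5001):
    g = grad(w)
    m = beta1 * m + (1.0 - beta1) * g          # first-moment running average
    v = beta2 * v + (1.0 - beta2) * g ** 2     # second-moment running average
    m_hat = m / (1.0 - beta1 ** t)             # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + delta)   # ADAM update (13)

print(np.linalg.norm(w))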

Demo time

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

Idea 3: learn the (global) learning rates as well.

  $w_{t+1} = w_t - \eta g_t \quad \Rightarrow \quad w_t = w(t, \eta; w_1, \{g_i\}_{i \le t-1}),$

i.e. $w_t$ is a function of $\eta$, so we can learn $\eta$ (with gradient descent) by

  $\eta^* = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta)).$

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015].
- Stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015].
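A toy sketch of the "on the fly" idea, under a strong simplifying assumption: only the dependence of $w_t$ on $\eta$ through the single most recent update is kept, so $\mathrm{d}w_t/\mathrm{d}\eta \approx -g_{t-1}$ and the hypergradient is $\mathrm{d}f_t/\mathrm{d}\eta \approx -\langle g_t, g_{t-1} \rangle$. This one-step truncation is an illustrative simplification, not the actual algorithm of Masse and Ollivier; the step size `beta` on $\eta$ is also assumed.

import numpy as np

rng = np.random.default_rng(7)
D, beta = 5, 1e-4                    # beta: assumed step size on eta itself
w = rng.normal(size=D)
eta = 1e-3
g_prev = np.zeros(D)

def grad(w):
    return w + 0.1 * rng.normal(size=D)   # assumed noisy quadratic objective

for t in range(5000):
    g = grad(w)
    # One-step hypergradient: d f_t(w_t)/d eta ~= <g_t, dw_t/d eta> = -<g_t, g_{t-1}>,
    # so gradient descent on eta increases it when successive gradients align.
    eta = max(eta + beta * (g @ g_prev), 1e-6)
    w = w - eta * g
    g_prev = g

print(eta, np.linalg.norm(w))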

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour spent on tuning hyper-parameters.

References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(1):281-305, 2012.

Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.

Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.

Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.

Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Masse, P.-Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 12: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability

Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure

The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right

6 of 27

A brief history of numerical analysis

1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo

Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator

7 of 27

A brief history of numerical analysis

Present dayStochastic approximation widely researched and used in practice

I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent

I Neural networksI K-MeansI

I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu

amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))

TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)

8 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,
$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2.$$
So
$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2\, \mathbb{E}\left[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\right].$$

26 of 27

Stochastic Gradient Descent

Step 3: Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

I Assume $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^\star\|_2^2$.
I Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.

Get
$$\mathbb{E}\left[h'_{n+1} - h'_n \mid \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$$
$$\implies \mathbb{E}\left[(h'_{n+1} - h'_n)\,\mathbb{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A.$$
Then $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h'_n)_{n=1}^\infty$ converges a.s. $\implies$ $(h_n)_{n=1}^\infty$ converges a.s.

26 of 27

Stochastic Gradient Descent

Step 4: Show that $h_n$ must converge to 0 almost surely. From the previous calculations,
$$\mathbb{E}\left[h_{n+1} - (1 + \varepsilon_n^2 B)h_n \mid \mathcal{F}_n\right] \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A.$$
$(h_n)_{n=1}^\infty$ converges, so the sequence $h_{n+1} - (1 + \varepsilon_n^2 B)h_n$ is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the term $\varepsilon_n^2 A$ is summable a.s.; hence the inner-product term must be too:
$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle < \infty \quad \text{almost surely}.$$
If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces
$$\langle s_n - x^\star, \nabla C(s_n) \rangle \rightarrow 0 \quad \text{almost surely},$$
and by our initial assumptions about $C$ (if $h_n$ converged to a positive limit, condition 2 would bound this inner product away from 0), $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

26 of 27


Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above is to assume:

I $C$ is a non-negative, three-times differentiable function.
I The Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$.
I The gradient estimate $H$ satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$.
I There exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.

27 of 27


Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. So how should we choose learning rates to achieve:

- fast convergence?
- a better local optimum?

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 1 / 27

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule
- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time!

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 2 / 27

Motivation (cont.)

idea 2: let the learning rates adapt themselves
- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:
- a good tutorial [Shalev-Shwartz, 2012]
- milestone paper [Zinkevich, 2003]

and some practical comparisons.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 3 / 27

Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function is $C(w; \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD/mini-batch learning accelerates training in real time by:
- processing one datapoint / a mini-batch of datapoints each iteration
- considering gradients with data-point cost functions:
$$\nabla C(w; \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 4 / 27
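A small sketch of this approximation (an addition; the data, batch size, and noise level are illustrative) for a least-squares cost, checking that a uniformly sampled mini-batch gradient fluctuates around the full-batch gradient:

```python
import numpy as np

# Mini-batch gradient for c(w, (x, y)) = 0.5 * (w @ x - y)^2: sampling a batch
# uniformly without replacement gives an unbiased estimate of the full gradient.
rng = np.random.default_rng(4)
N, d, M = 1000, 3, 32
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

def grad(w, idx):
    r = X[idx] @ w - y[idx]          # residuals on the chosen datapoints
    return X[idx].T @ r / len(idx)

w = np.zeros(d)
full = grad(w, np.arange(N))                          # full-batch gradient
mini = grad(w, rng.choice(N, size=M, replace=False))  # mini-batch estimate
print(full, mini)   # mini fluctuates around full
```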

Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):
- at each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD the received gradient is defined by
$$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 5 / 27


Online learning

From batch to online learning (cont.)

Online learning assumes:
- at each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:
$$R^*_T = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \quad (1)$$
General definition: $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 7 / 27

Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):
$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) \quad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,
$$R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})] \quad (3)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 8 / 27

Online learning

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be
$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$
FTL is easily fooled!

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 9 / 27
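A quick simulation of this game (an addition; the horizon T is an illustrative choice) shows FTL incurring loss roughly 1 per round while the fixed comparator u = 0 incurs none, i.e. linear regret:

```python
import numpy as np

# The losses are linear, f_t(w) = c_t * w on S = [-1, 1], so the FTL minimiser
# of the cumulative loss always sits at an endpoint of the interval, and the
# adversary flips it back and forth.
T = 100
coeffs = [-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, T + 1)]

w_t = 0.0                          # arbitrary initial play
cum, ftl_loss = 0.0, 0.0
for c in coeffs:
    ftl_loss += c * w_t            # incur this round's loss
    cum += c                       # FTL: minimise (sum_i c_i) * w over [-1, 1]
    w_t = -1.0 if cum > 0 else 1.0 # minimiser is an endpoint
print(ftl_loss)   # 99.0 for T = 100: regret against u = 0 grows linearly in T
```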

Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):
$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \quad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,
$$\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 10 / 27

Online learning

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:
$$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)$$
Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle$ as a surrogate:
$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 11 / 27

Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):
$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \quad (5)$$

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (upper bound on the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:
$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 12 / 27

Online learning

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,
$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 12 / 27

Online learning

From batch to online learning (cont.)

Keep in mind for the later discussion of adaGrad:
- I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$;
- then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_*$,
- and we use Hölder's inequality in the proof.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 13 / 27
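A numerical sanity check of the regret bound (an addition, assuming random linear losses $f_t(w) = \langle c_t, w \rangle$ on the unit ball, with projection after each step; the dimension and $\eta$ are illustrative):

```python
import numpy as np

# Run projected OGD on random linear losses and compare the realised regret
# against the bound ||u - w_1||^2 / (2 eta) + eta * sum_t ||g_t||^2.
rng = np.random.default_rng(1)
T, d, eta = 1000, 5, 0.01
cs = rng.normal(size=(T, d))          # loss vectors: g_t = c_t

w, ws = np.zeros(d), []
for c in cs:
    ws.append(w.copy())
    w = w - eta * c                   # OGD step
    norm = np.linalg.norm(w)
    if norm > 1:                      # project back onto the unit ball S
        w /= norm

ws = np.array(ws)
losses = np.einsum('td,td->t', cs, ws)
u = -cs.sum(axis=0) / np.linalg.norm(cs.sum(axis=0))  # best fixed point in hindsight
regret = losses.sum() - cs.sum(axis=0) @ u
bound = 1.0 / (2 * eta) + eta * (cs ** 2).sum()       # ||u|| = 1, w_1 = 0
print(regret, bound)                  # regret stays below the bound
```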

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and
$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$$

Primal-dual sub-gradient update:
$$w_{t+1} = \arg\min_w\; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t}\psi_t(w) \quad (6)$$

Proximal gradient / composite mirror descent:
$$w_{t+1} = \arg\min_w\; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \quad (7)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 14 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:
$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$
$$w_{t+1} = \arg\min_{w \in S}\; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w \quad (8)$$
$$w_{t+1} = \arg\min_{w \in S}\; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \quad (9)$$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time: $\psi_t(w) = \frac{1}{2} w^T H_t w$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 15 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:
$$w_{t+1} = \arg\min_w\; \eta \langle w, g_t \rangle + \frac{1}{2}(w - w_t)^T \left[\delta I + \mathrm{diag}(G_t^{1/2})\right](w - w_t)$$
$$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \quad \text{(elementwise)} \quad (10)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 16 / 27
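A minimal sketch of update (10) (an addition, reusing the toy quadratic from before; $\eta$ and $\delta$ are illustrative choices):

```python
import numpy as np

# Diagonal adaGrad, update (10): a per-coordinate accumulator of squared
# gradients plays the role of diag(G_t); delta avoids division by zero.
rng = np.random.default_rng(2)
x_star = np.array([1.0, -2.0, 0.5])

def noisy_grad(w):  # unbiased gradient of C(w) = 0.5 * ||w - x_star||^2
    return (w - x_star) + 0.1 * rng.normal(size=w.shape)

w = np.zeros(3)
G = np.zeros(3)            # running sum of squared gradients, per coordinate
eta, delta = 0.5, 1e-8
for t in range(2000):
    g = noisy_grad(w)
    G += g ** 2
    w -= eta * g / (delta + np.sqrt(G))   # update (10), elementwise

print(w)   # close to x_star
```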

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:
- I want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded in terms of $\|g_t\|_2$
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot \rangle}$
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that
$$\sum_{t=1}^T \|g_t\|_{\psi_t, *}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 17 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,
$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.$$
For $\{w_t\}$ generated using the composite mirror descent update (9),
$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 18 / 27

Advanced adaptive learning rates

Improving adaGrad

Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 19 / 27

Advanced adaptive learning rates

Improving adaGrad (cont.)

RMSprop (direction used: truncated sum / running average):
$$G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$
$$w_{t+1} = \arg\min_w\; \eta \langle w, g_t \rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$
$$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \quad (11)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 20 / 27
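A sketch of update (11) as a reusable step function (an addition; the default hyper-parameters are common illustrative choices, not taken from the slides):

```python
import numpy as np

# RMSprop: an exponential moving average of squared gradients replaces
# adaGrad's ever-growing sum, so old gradients are gradually forgotten.
def rmsprop_step(w, g, G, eta=1e-2, rho=0.9, delta=1e-8):
    G = rho * G + (1 - rho) * g ** 2              # running average of g^2
    return w - eta * g / np.sqrt(delta + G), G    # divide by RMS[g]_t

# usage: minimise 0.5 * ||w||^2, whose gradient is w
w, G = np.ones(2), np.zeros(2)
for _ in range(500):
    w, G = rmsprop_step(w, w, G)
print(w)   # near 0, oscillating at a scale set by eta
```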

Advanced adaptive learning rates

Improving adaGrad (cont.)

A second-order motivation (direction used: approximate second-order information): a Newton-style step satisfies
$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \quad \Rightarrow \quad \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \quad \Rightarrow \quad \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 21 / 27

Advanced adaptive learning rates

Improving adaGrad (cont.)

adaDelta:
$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$
$$w_{t+1} = \arg\min_w\; \langle w, g_t \rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$
$$\Rightarrow \quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \quad (12)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 21 / 27
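A sketch of update (12) in the same style (an addition; $\rho = 0.95$ and $\delta = 10^{-6}$ are typical choices, used here illustratively):

```python
import numpy as np

# adaDelta: the RMS of past *updates* replaces the hand-tuned global learning
# rate, matching the units of Delta-w and g.
def adadelta_step(w, g, G, D, rho=0.95, delta=1e-6):
    G = rho * G + (1 - rho) * g ** 2                      # running average of g^2
    step = -np.sqrt(D + delta) / np.sqrt(G + delta) * g   # RMS[dw]_{t-1} / RMS[g]_t * g_t
    D = rho * D + (1 - rho) * step ** 2                   # running average of (dw)^2
    return w + step, G, D

# usage: minimise 0.5 * ||w||^2, whose gradient is w
w, G, D = np.ones(2), np.zeros(2), np.zeros(2)
for _ in range(2000):
    w, G, D = adadelta_step(w, w, G, D)
print(w)   # has moved toward 0 (adaDelta starts slowly, then accelerates)
```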

Advanced adaptive learning rates

Improving adaGrad (cont.)

ADAM (directions used: momentum and bias correction of the running averages):
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \quad (13)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 22 / 27
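A sketch of the ADAM update (13) with bias-corrected moment estimates (an addition; the default $\beta_1$, $\beta_2$, $\delta$ are the values commonly used with ADAM, and the toy problem is illustrative):

```python
import numpy as np

# One ADAM step: momentum (first moment) plus a per-coordinate scale
# (second moment), both bias-corrected for the zero initialisation.
def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
    m = beta1 * m + (1 - beta1) * g           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment (scale)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    return w - eta * m_hat / (np.sqrt(v_hat) + delta), m, v

# usage on the toy quadratic from before
rng = np.random.default_rng(3)
x_star = np.array([1.0, -2.0, 0.5])
w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 5001):
    g = (w - x_star) + 0.1 * rng.normal(size=3)
    w, m, v = adam_step(w, g, m, v, t)
print(w)   # approximately x_star, up to the gradient noise
```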

Advanced adaptive learning rates

Demo time!

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 23 / 27

Advanced adaptive learning rates

Learn the learning rates

idea 3: learn the (global) learning rates as well
$$w_{t+1} = w_t - \eta g_t \quad \Rightarrow \quad w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1})$$
$w_t$ is a function of $\eta$, so learn $\eta$ (with gradient descent) by
$$\eta^\star = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))$$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 24 / 27
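A toy forward-mode sketch of idea 3 (an addition, deliberately simpler than either referenced method): for a quadratic loss the Jacobian of the gradient is the identity, so $\mathrm{d}w_t / \mathrm{d}\eta$ can be propagated forward alongside SGD and used to take gradient steps on $\eta$ itself. All names and constants here are illustrative assumptions.

```python
import numpy as np

# Learn a global learning rate eta by gradient descent on the summed training
# loss, propagating d(w_t)/d(eta) forward through the SGD recursion.
# Toy objective: f(w) = 0.5 * ||w - x_star||^2, so grad f(w) = w - x_star.
x_star = np.array([2.0, -1.0])
eta, hyper_lr = 0.05, 1e-4

for outer in range(200):
    w = np.zeros(2)
    dw_deta = np.zeros(2)              # d w_t / d eta, propagated forward
    hypergrad = 0.0
    for t in range(20):                # a short inner SGD run
        g = w - x_star                 # grad f at w_t
        hypergrad += g @ dw_deta       # d f(w_t) / d eta by the chain rule
        # differentiate w_{t+1} = w_t - eta * g(w_t) w.r.t. eta
        # (here the Jacobian of g w.r.t. w is the identity):
        dw_deta = dw_deta - g - eta * dw_deta
        w = w - eta * g
    eta -= hyper_lr * hypergrad        # gradient step on eta itself
print(eta)   # eta has grown from 0.05 toward larger, better values
```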

Advanced adaptive learning rates

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 25 / 27

Advanced adaptive learning rates

References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 28 / 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 13: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

A brief history of numerical analysis

1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo

Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator

7 of 27

A brief history of numerical analysis

Present dayStochastic approximation widely researched and used in practice

I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent

I Neural networksI K-MeansI

I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu

amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))

TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)

8 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.

Advanced adaptive learning rates

References

- Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T., Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D., Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.


A brief history of numerical analysis

Present day: stochastic approximation is widely researched and used in practice.

- Original applications in root-finding and optimisation, often in the guise of stochastic gradient descent:
  - neural networks
  - K-means
  - ...
- Related ideas are used in:
  - stochastic variational inference (Hoffman et al. (2013))
  - pseudo-marginal Metropolis-Hastings (Beaumont (2003), Andrieu & Roberts (2009))
  - stochastic gradient Langevin dynamics (Welling & Teh (2011))

Textbooks:
- Stochastic Approximation and Recursive Algorithms and Applications, Kushner & Yin (2003)
- Introduction to Stochastic Search and Optimization, Spall (2003)

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysis, and the general problem it seeks to solve has the following form:

Minimise $f(w) = \mathbb{E}[F(w, \xi)]$

(over $w$ in some domain $\mathcal{W}$, and some random variable $\xi$).

This is an immensely flexible framework:

- $F(w, \xi) = f(w) + \xi$ models experimental / measurement error.
- $F(w, \xi) = (w^\top \phi(\xi_X) - \xi_Y)^2$ corresponds to (least-squares) linear regression.
- If $f(w) = \frac{1}{N}\sum_{i=1}^N g(w, x_i)$ and $F(w, \xi) = \frac{1}{K}\sum_{x \in \xi} g(w, x)$, with $\xi$ a randomly-selected subset (of size $K$) of a large data set, this corresponds to "stochastic" machine learning algorithms.


Stochastic gradient descent

We'll focus on proof techniques for stochastic gradient descent (SGD).

We'll derive conditions for SGD to converge almost surely. We broadly follow the structure of Léon Bottou's paper².

Plan:
- Continuous gradient descent
- Discrete gradient descent
- Stochastic gradient descent

²Online Learning and Stochastic Approximations - Léon Bottou (1998)


Continuous Gradient Descent

Let $C: \mathbb{R}^k \to \mathbb{R}$ be differentiable. How do solutions $s: \mathbb{R} \to \mathbb{R}^k$ to the following ODE behave?

$$\frac{d}{dt} s(t) = -\nabla C(s(t))$$

Example: $C(x) = 5(x_1 + x_2)^2 + (x_1 - x_2)^2$.

We can analytically solve the gradient descent equation for this example:

$$\nabla C(x) = (12x_1 + 8x_2,\; 8x_1 + 12x_2).$$

So $(s_1', s_2') = -(12 s_1 + 8 s_2,\; 8 s_1 + 12 s_2)$, and solving with $(s_1(0), s_2(0)) = (1.5, 0.5)$ gives

$$\begin{pmatrix} s_1(t) \\ s_2(t) \end{pmatrix} = \begin{pmatrix} e^{-20t} + 0.5\, e^{-4t} \\ e^{-20t} - 0.5\, e^{-4t} \end{pmatrix}.$$
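The closed-form solution can be sanity-checked numerically by integrating the ODE with very small forward-Euler steps; this is a sketch, with step size and horizon chosen arbitrarily by us:

```python
import numpy as np

grad_C = lambda x: np.array([12*x[0] + 8*x[1], 8*x[0] + 12*x[1]])
exact = lambda t: np.array([np.exp(-20*t) + 0.5*np.exp(-4*t),
                            np.exp(-20*t) - 0.5*np.exp(-4*t)])

dt = 1e-5
s = np.array([1.5, 0.5])
for _ in range(int(0.5 / dt)):   # integrate ds/dt = -grad C(s) up to t = 0.5
    s = s - dt * grad_C(s)
print(s, exact(0.5))             # the two should agree to several decimal places
```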


[Animation: exact solution of the gradient descent ODE]

Continuous Gradient Descent

We'll start with some simplifying assumptions:

1. $C$ has a unique minimiser $x^*$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x)\rangle > 0$.

(The second condition is weaker than convexity.)

Proposition: If $s: [0, \infty) \to \mathcal{X}$ satisfies the differential equation

$$\frac{ds(t)}{dt} = -\nabla C(s(t)),$$

then $s(t) \to x^*$ as $t \to \infty$.

[Figure: example of a non-convex function satisfying these conditions]


Continuous Gradient Descent

Proof outline:

1. Define the Lyapunov function $h(t) = \|s(t) - x^*\|_2^2$.
2. $h(t)$ is a decreasing function.
3. $h(t)$ converges to a limit, and $h'(t)$ converges to 0.
4. The limit of $h(t)$ must be 0.
5. $s(t) \to x^*$.


Step 2: $h(t)$ is a decreasing function:

$$\frac{d}{dt} h(t) = \frac{d}{dt}\|s(t) - x^*\|_2^2 = \frac{d}{dt}\langle s(t) - x^*, s(t) - x^*\rangle = 2\left\langle \frac{ds(t)}{dt},\, s(t) - x^*\right\rangle = -2\langle s(t) - x^*, \nabla C(s(t))\rangle \le 0.$$


Step 4: the limit of $h(t)$ must be 0. If not, there is some $K > 0$ such that $h(t) > K$ for all $t$. By assumption,

$$\inf_{\|x - x^*\|_2^2 > K} \langle x - x^*, \nabla C(x)\rangle > 0,$$

which means that $\sup_t h'(t) < 0$, contradicting the convergence of $h'(t)$ to 0.


Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible. To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!


Discrete Gradient Descent

An intuitive, straightforward discretisation of

$$\frac{d}{dt} s(t) = -\nabla C(s(t))$$

is

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n). \qquad \text{(forward / explicit Euler)}$$

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method, which is (A-)stable, is the implicit Euler method:

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1}).$$

So why settle for explicit discretisation? Implicit requires the solution of a non-linear system at each step; and practically, explicit discretisation works!
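For the running quadratic example the gradient is linear, $\nabla C(x) = Hx$, so the implicit step reduces to a linear solve. This sketch, with a step size we chose to sit just beyond explicit Euler's stability limit, shows the two behaviours:

```python
import numpy as np

H = np.array([[12., 8.], [8., 12.]])   # Hessian of C; eigenvalues 20 and 4
eps = 0.12                             # explicit Euler unstable: |1 - 0.12*20| > 1

s_exp = np.array([1.5, 0.5])
s_imp = np.array([1.5, 0.5])
for _ in range(100):
    s_exp = s_exp - eps * (H @ s_exp)                     # explicit: s - eps*grad C(s)
    s_imp = np.linalg.solve(np.eye(2) + eps * H, s_imp)   # implicit: (I + eps*H) s' = s
print(s_exp)   # blows up along the stiff eigendirection
print(s_imp)   # decays stably towards the minimiser (0, 0)
```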


Discrete Gradient Descent

We also have multistep methods, e.g.

$$s_{n+2} = s_{n+1} - \varepsilon_{n+2}\left(\frac{3}{2}\nabla C(s_{n+1}) - \frac{1}{2}\nabla C(s_n)\right).$$

This is the 2nd-order Adams-Bashforth method. It has similar (in)stability properties to forward Euler, but a discretisation error of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).
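A sketch of the two-step scheme on the same quadratic, bootstrapped with a single Euler step; the step size is again our own choice:

```python
import numpy as np

H = np.array([[12., 8.], [8., 12.]])
grad_C = lambda x: H @ x
eps = 0.01

s_prev = np.array([1.5, 0.5])
s_curr = s_prev - eps * grad_C(s_prev)   # one forward-Euler step to start
for _ in range(1000):
    # 2nd-order Adams-Bashforth step for ds/dt = -grad C(s)
    s_next = s_curr - eps * (1.5 * grad_C(s_curr) - 0.5 * grad_C(s_prev))
    s_prev, s_curr = s_curr, s_next
print(s_curr)   # -> approaches (0, 0)
```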

Discrete Gradient Descent

[Animations: examples of discretisation with $\varepsilon = 0.1$, $\varepsilon = 0.07$, and $\varepsilon = 0.01$]

Discrete Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies

1. $C$ has a unique minimiser $x^*$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x)\rangle > 0$;
3. $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: the proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation. (A small numerical illustration of the step-size conditions follows.)
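A quick illustration on the running example with $\varepsilon_n = c/n$, which satisfies both $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$; the constant $c$ is our choice:

```python
import numpy as np

grad_C = lambda x: np.array([12*x[0] + 8*x[1], 8*x[0] + 12*x[1]])
s = np.array([1.5, 0.5])
for n in range(1, 100001):
    s = s - (0.05 / n) * grad_C(s)   # eps_n = c/n: divergent sum, summable squares
print(s)   # drifts (slowly) towards the unique minimiser (0, 0)
```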


Discrete Gradient Descent

Proof outline:

1. Define the Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$.
4. Show that this implies that $h_n$ converges.
5. Show that $h_n$ must converge to 0.
6. $s_n \to x^*$.

Because of the error that the discretisation introduced, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. [Illustration credit: Online Learning and Stochastic Approximations, Léon Bottou (1998)]

Step 2: consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. With some algebra, note that

$$\begin{aligned}
h_{n+1} - h_n &= \langle s_{n+1} - x^*, s_{n+1} - x^*\rangle - \langle s_n - x^*, s_n - x^*\rangle \\
&= \langle s_{n+1}, s_{n+1}\rangle - \langle s_n, s_n\rangle - 2\langle s_{n+1} - s_n, x^*\rangle \\
&= \langle s_n - \varepsilon_n \nabla C(s_n),\, s_n - \varepsilon_n \nabla C(s_n)\rangle - \langle s_n, s_n\rangle + 2\varepsilon_n \langle \nabla C(s_n), x^*\rangle \\
&= -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2.
\end{aligned}$$

Step 3(a): show that $\sum_{n=1}^\infty h_n^+ < \infty$. Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get

$$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 (A + B h_n)$$
$$\Rightarrow\; h_{n+1} - (1 + \varepsilon_n^2 B)\, h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A.$$

Step 3(b): from $h_{n+1} - (1 + \varepsilon_n^2 B)\, h_n \le \varepsilon_n^2 A$, writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$, we get

$$h_{n+1}' - h_n' \le \varepsilon_n^2 \mu_n A.$$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable.

Step 4: show that this implies that $h_n$ converges. If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why? because $h_n \ge 0$). But

$$h_{n+1} = h_0 + \sum_{k=0}^n \left(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\right),$$

so $(h_n)_{n=1}^\infty$ converges.

Step 5: show that $h_n$ must converge to 0. Assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2.$$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$$\langle s_n - x^*, \nabla C(s_n)\rangle \to 0,$$

contradicting $h_n$ converging away from 0.


Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm. If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N}\sum_{n=1}^N f_n(x),$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K}\sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K).$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent. The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
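A sketch of the subsampled estimator on a least-squares problem; the data, batch size and step-size constant are our own choices, and the uniformly random subset is what makes the gradient estimate unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 1000, 5, 32
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

# C(w) = (1/N) sum_i 0.5*(w^T x_i - y_i)^2;  f_i(w) = x_i (w^T x_i - y_i)
w = np.zeros(D)
for n in range(1, 5001):
    idx = rng.choice(N, size=K, replace=False)   # I_K ~ Unif(subsets of size K)
    g = X[idx].T @ (X[idx] @ w - y[idx]) / K     # unbiased estimate of grad C(w)
    w = w - (0.5 / n) * g                        # Robbins-Monro step sizes
print(np.linalg.norm(w - w_true))                # -> small
```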


Measure-theory-free martingale theory

Let $(X_n)_{n\ge 0}$ be a stochastic process. $\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if

- $\mathbb{E}[|X_n|] < \infty$ for all $n$;
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$.

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\left[(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\, \mathbf{1}_{\{\mathbb{E}[X_{n+1}\mid\mathcal{F}_n] - X_n > 0\}}\right] < \infty,$$

then $X_n \to X_\infty$ almost surely.

³on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ $\forall n$


Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^*$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x)\rangle > 0$;
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: the proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.


Stochastic Gradient Descent

Proof outline:

1. Define the Lyapunov process $h_n = \|s_n - x^*\|_2^2$.
2. Consider the variations $h_{n+1} - h_n$.
3. Show that $h_n$ converges almost surely.
4. Show that $h_n$ must converge to 0 almost surely.
5. $s_n \to x^*$.


Step 2: consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n)\rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2,$$

so

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\, \mathbb{E}\left[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\right].$$

Step 3: show that $h_n$ converges almost surely. Exactly as for discrete gradient descent:

- assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$;
- introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$.

We get

$$\mathbb{E}[h_{n+1}' - h_n' \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A$$
$$\Rightarrow\; \mathbb{E}\left[(h_{n+1}' - h_n')\,\mathbf{1}_{\{\mathbb{E}[h_{n+1}' - h_n' \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A.$$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\Rightarrow$ quasimartingale convergence $\Rightarrow$ $(h_n')_{n=1}^\infty$ converges a.s. $\Rightarrow$ $(h_n)_{n=1}^\infty$ converges a.s.

Step 4: show that $h_n$ must converge to 0 almost surely. From previous calculations,

$$\mathbb{E}\left[h_{n+1} - (1 + \varepsilon_n^2 B)\, h_n \mid \mathcal{F}_n\right] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A.$$

$(h_n)_{n=1}^\infty$ converges, so the left-hand sequence is summable a.s.; assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, the right term is summable a.s., so the remaining term is also summable a.s.:

$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle < \infty \quad \text{almost surely.}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^*, \nabla C(s_n)\rangle \to 0 \quad \text{almost surely,}$$

and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.


Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above is to assume:

- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.


Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. How, then, to choose learning rates to achieve

- fast convergence?
- a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule

- grid search, random search, cross-validation
- Bayesian optimization

But I'm lazy, and I don't want to spend too much time...

Motivation (cont.)

idea 2: let the learning rates adapt themselves

- pre-conditioning with $|\mathrm{diag}(H)|$ [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning (a good tutorial: [Shalev-Shwartz, 2012]; milestone paper: [Zinkevich, 2003]), and then some practical comparisons.

Online learning

From batch to online learning

Batch learning often assumes:

- we've got the whole dataset $\mathcal{D}$;
- the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$.

SGD / mini-batch learning accelerates training in real time by:

- processing one datapoint / a mini-batch of datapoints each iteration;
- considering gradients of the per-datapoint cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m).$$

Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD): each iteration $t$,

- we receive a loss $f_t(w)$,
- and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$,
- then we update the weights with $w \leftarrow w - \eta g_t$.

For SGD, the received gradient is defined by

$$g_t = \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m).$$


Online learning

From batch to online learning (cont.)

Online learning assumes:

- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$);
- then we learn $w_{t+1}$ following some rules.

Performance is measured by regret:

$$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S}\sum_{t=1}^T f_t(w) \tag{1}$$

General definition: $R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):

$$w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t f_i(w) \tag{2}$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,

$$R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right]. \tag{3}$$

Online learning

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$, and let the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled (a small simulation follows)!
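A tiny simulation of this game; $w_1 = 0$ is an arbitrary first play (FTL's first iterate is unconstrained), and since the comparator $u = 0$ suffers zero loss, FTL's regret grows linearly in $T$:

```python
# f_t(w) = coef(t) * w on S = [-1, 1]
def coef(t):
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T, w, cum, loss = 100, 0.0, 0.0, 0.0
for t in range(1, T + 1):
    loss += coef(t) * w                  # play w_t, pay f_t(w_t)
    cum += coef(t)                       # coefficient of the cumulative loss
    w = -1.0 if cum > 0 else 1.0         # FTL: minimise cum * w over [-1, 1]
print(loss)   # ~ T - 1: linear regret against u = 0
```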

Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \tag{4}$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,

$$\sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right].$$

Online learning

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle, \quad \forall g(w) \in \partial f_t(w).$$

Use the linearisation $\tilde f_t(w) = f_t(w_t) + \langle w, g(w_t)\rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t\rangle - \langle u, g_t\rangle, \quad \forall g_t \in \partial f_t(w_t).$$

Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \qquad g_t \in \partial f_t(w_t) \tag{5}$$

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (upper-bounding the LHS);
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$;
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t \langle w, g_i\rangle + \varphi(w).$$

Online learning

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,

$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t).$$

Keep in mind for the later discussion of adaGrad: if we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$, then the L2 norm above is replaced by the dual norm $\|\cdot\|_*$, and we use Hölder's inequality in the proof.

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$, where $\psi_t$ is a strongly-convex function and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle.$$

Primal-dual sub-gradient update:

$$w_{t+1} = \operatorname*{argmin}_{w}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \tag{6}$$

Proximal gradient / composite mirror descent update:

$$w_{t+1} = \operatorname*{argmin}_{w}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \tag{7}$$

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$$

$$w_{t+1} = \operatorname*{argmin}_{w \in S}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t}\, w^T H_t w \tag{8}$$

$$w_{t+1} = \operatorname*{argmin}_{w \in S}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \tag{9}$$

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective;
- adapt the proximal term through time: $\psi_t(w) = \frac{1}{2} w^T H_t w$.

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \operatorname*{argmin}_{w}\; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T \left[\delta I + \mathrm{diag}(G_t)^{1/2}\right](w - w_t)$$

$$\Rightarrow\; w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \quad \text{(elementwise)} \tag{10}$$
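A minimal sketch of the diagonal update (10); the test function, $\eta$ and $\delta$ are our own choices:

```python
import numpy as np

def adagrad(grad, w0, eta=0.5, delta=1e-8, n_steps=2000):
    """Diagonal adaGrad, eq. (10): per-coordinate steps
    eta / (delta + sqrt of the sum of squared past gradients)."""
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)
    for _ in range(n_steps):
        g = grad(w)
        G = G + g**2                            # accumulate squared gradients
        w = w - eta * g / (delta + np.sqrt(G))
    return w

grad_C = lambda x: np.array([12*x[0] + 8*x[1], 8*x[0] + 12*x[1]])
print(adagrad(grad_C, [1.5, 0.5]))              # -> near (0, 0)
```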

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:

- we want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper-bounded by $\|g_t\|_2$ terms;
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot\rangle}$; then $\psi_t(w) = \frac{1}{2}w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$;
- a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2\sum_{d=1}^D \|g_{1:T,d}\|_2.$$

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 15: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proof outline:

1. Define the Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$.
4. Show that this implies that $h_n$ converges.
5. Show that $h_n$ must converge to 0.
6. $s_n \to x^*$.

Step 1: Define the Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.

Because of the error introduced by discretisation, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. (Illustration credit: Online Learning and Stochastic Approximations, Leon Bottou (1998).)

Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

With some algebra, note that

$h_{n+1} - h_n = \langle s_{n+1} - x^*, s_{n+1} - x^* \rangle - \langle s_n - x^*, s_n - x^* \rangle$
$\quad = \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2\langle s_{n+1} - s_n, x^* \rangle$
$\quad = \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2\varepsilon_n \langle \nabla C(s_n), x^* \rangle$
$\quad = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$

Step 3: Show that $\sum_{n=1}^\infty h_n^+ < \infty$.

Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get

$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + B h_n)$
$\Rightarrow\; h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$

Continuing from $h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le \varepsilon_n^2 A$: writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get

$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable; hence the positive variations of $h'_n$, and therefore of $h_n$, are summable.

Step 4: Show that this implies that $h_n$ converges.

If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$ then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$. (Why? Because $h_n \ge 0$, the total downward movement is bounded by $h_0$ plus the total upward movement.)

But $h_{n+1} = h_0 + \sum_{k=0}^{n} \big( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \big)$,

so $(h_n)_{n=1}^\infty$ converges.

Step 5: Show that $h_n$ must converge to 0.

Assume $h_n$ converges to some positive number. Previously we had

$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0$,

contradicting $h_n$ converging away from 0. Hence $h_n \to 0$, and so $s_n \to x^*$.

22 of 27

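A quick numerical check of the Lyapunov argument on the quadratic example ($x^* = 0$), reusing grad_C and np from the sketches above; c is chosen small enough that the iteration contracts from the first step:

    s = np.array([1.5, 0.5])
    h = [float(s @ s)]                    # Lyapunov sequence h_n
    for n in range(1, 2001):
        s = s - (0.05 / n) * grad_C(s)    # eps_n = c / n with c = 0.05
        h.append(float(s @ s))
    print(h[0], h[-1])                    # h_n decreases towards 0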

Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If C is a cost function averaged across a data set, we often have a true gradient of the form

$\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x)$

and an approximation is formed by subsampling:

$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x)$,   $I_K \sim \mathrm{Unif}(\text{subsets of size } K)$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

23 of 27

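A sketch of the subsampled estimator, assuming the per-example gradient functions $f_n$ are given as a list of callables:

    import random

    def minibatch_grad(fs, x, K):
        # Average over a uniformly random size-K subset; every subset is
        # equally likely, so the estimate is unbiased for grad C(x).
        I_K = random.sample(range(len(fs)), K)
        return sum(fs[k](x) for k in I_K) / K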

Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time n", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale[3] if

- $\mathbb{E}[|X_n|] < \infty$ for all n
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

(For example, a random walk with i.i.d. mean-zero, integrable steps is a martingale.)

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$\sum_{n=1}^\infty \mathbb{E}\big[ (\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\, \mathbb{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}} \big] < \infty$,

then $X_n \to X_\infty$ almost surely.

[3] on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all n.

24 of 27


Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and C satisfies

1. C has a unique minimiser $x^*$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$;
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of n;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$ almost surely.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

25 of 27

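A sketch of the scheme in the proposition; the toy objective and the names are illustrative:

    import numpy as np

    def sgd(H, s0, c=1.0, n_steps=20000, seed=0):
        # H(s, rng) should return an unbiased estimate of grad C(s).
        rng = np.random.default_rng(seed)
        s = s0
        for n in range(1, n_steps + 1):
            s = s - (c / n) * H(s, rng)   # Robbins-Monro schedule eps_n = c/n
        return s

    # Example: C(x) = 0.5 ||x||^2 (unique minimiser x* = 0), Gaussian noise.
    noisy_grad = lambda s, rng: s + rng.normal(size=s.shape)
    print(sgd(noisy_grad, np.array([1.5, 0.5])))   # close to (0, 0)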

Stochastic Gradient Descent

Proof outline:

1. Define the Lyapunov process $h_n = \|s_n - x^*\|_2^2$.
2. Consider the variations $h_{n+1} - h_n$.
3. Show that $h_n$ converges almost surely.
4. Show that $h_n$ must converge to 0 almost surely.
5. $s_n \to x^*$.


Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$

So

$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2\, \mathbb{E}[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n]$

Step 3: Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

- Assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$.
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.

We get

$\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A$
$\Rightarrow\; \mathbb{E}\big[(h'_{n+1} - h'_n)\, \mathbb{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}}\big] \le \varepsilon_n^2 \mu_n A$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\Rightarrow$ quasimartingale convergence $\Rightarrow$ $(h'_n)_{n=1}^\infty$ converges a.s. $\Rightarrow$ $(h_n)_{n=1}^\infty$ converges a.s.

Step 4: Show that $h_n$ must converge to 0 almost surely. From previous calculations,

$\mathbb{E}[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A$

$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable a.s. Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, the $\varepsilon_n^2 A$ term is summable a.s., so the remaining term is also summable a.s.:

$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty$ almost surely.

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0$ almost surely,

and by our initial assumptions about C, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

26 of 27


Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- C is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$;
- the gradient estimate H satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k$ for k = 2, 3, 4;
- there exists D > 0 such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.

27 of 27


Motivation

So far we've told you why SGD is "safe" :)

... but Robbins-Monro is just a sufficient condition. So how do we choose learning rates to achieve

- fast convergence?
- a better local optimum?

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 / 27

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012.]

Idea 1: search for the best learning rate schedule:
- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time...

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 / 27

Motivation (cont.)

Idea 2: let the learning rates adapt themselves:
- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:
- a good tutorial: [Shalev-Shwartz, 2012]
- a milestone paper: [Zinkevich, 2003]

... and some practical comparisons.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 / 27

Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset D
- the cost function is $C(w, D) = \mathbb{E}_{x \sim D}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by
- processing one / a mini-batch of datapoints each iteration
- considering gradients of the per-datapoint cost functions:

$\nabla C(w, D) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 / 27

Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):
- each iteration t, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by

$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 / 27
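A minimal sketch of the OGD loop (names illustrative):

    def ogd(subgradients, w0, eta):
        # subgradients: one callable per round t, returning some
        # g_t in the subdifferential of f_t at the supplied point.
        w = w0
        for g_t in subgradients:
            w = w - eta * g_t(w)
        return w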

Online learning

From batch to online learning (cont.)

[Figure-only slide; the image is not recoverable from the extraction.]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 / 27

Online learning

From batch to online learning (cont.)

Online learning assumes:
- each iteration t, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w)$   (1)

General definition: $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 / 27

Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w)$   (2)

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$:

$R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]$   (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 / 27

Online learning

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$, and let the loss function at time t be

$f_t(w) = \begin{cases} -0.5\, w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$

FTL is easily fooled!

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 / 27
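A quick simulation of the game (cumulative losses are linear in w, so FTL always plays an endpoint of [-1, 1]):

    def f(t, w):
        if t == 1:
            return -0.5 * w
        return w if t % 2 == 0 else -w

    regret, slope = 0.0, 0.0        # cumulative loss so far is slope * w
    for t in range(1, 11):
        w_t = -1.0 if slope > 0 else 1.0   # FTL: argmin of slope*w on [-1, 1]
        regret += f(t, w_t) - f(t, 0.0)    # compare against the fixed u = 0
        slope += -0.5 if t == 1 else (1.0 if t % 2 == 0 else -1.0)
    print(regret)   # grows by ~1 per round: linear regret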

Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w)$   (4)

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$:

$\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 / 27

Online learning

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$\forall w, u \in S$: $f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle$, $\forall g(w) \in \partial f_t(w)$

Use the linearisation $\tilde f_t(w) = f_t(w_t) + \langle w, g_t \rangle$ (with $g_t \in \partial f_t(w_t)$) as a surrogate:

$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 / 27

Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$w_{t+1} = w_t - \eta g_t$, $g_t \in \partial f_t(w_t)$   (5)

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (it upper-bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 / 27

Online learning

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$; then for all $u \in S$:

$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2$,  $g_t \in \partial f_t(w_t)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 / 27

Keep in mind for the later discussion of adaGrad: if we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$, then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_*$, and Holder's inequality is used in the proof.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$

Primal-dual sub-gradient update:

$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w)$   (6)

Proximal gradient / composite mirror descent:

$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t)$   (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$G_t = \sum_{i=1}^t g_i g_i^T$,  $H_t = \delta I + \mathrm{diag}(G_t^{1/2})$

$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w$   (8)

$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$   (9)

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time: $\psi_t(w) = \frac{1}{2} w^T H_t w$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T [\delta I + \mathrm{diag}(G_t^{1/2})] (w - w_t)$

$\Rightarrow\; w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}}$  (elementwise)   (10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 / 27
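A sketch of the elementwise update (10):

    import numpy as np

    def adagrad(grad_fn, w0, eta=0.1, delta=1e-8, n_steps=1000):
        w = w0.astype(float)
        G = np.zeros_like(w)          # running sum of squared gradients
        for _ in range(n_steps):
            g = grad_fn(w)
            G += g * g
            w -= eta * g / (delta + np.sqrt(G))
        return w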

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs time-varying proximal term: we want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded in terms of the observed gradients. Define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot \rangle}$:

- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$:

$R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$

For $w_t$ generated using the composite mirror descent update (9):

$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 / 27

Advanced adaptive learning rates

Improving adaGrad

Possible problems of adaGrad:
- the global learning rate η still needs hand-tuning
- it is sensitive to initial conditions

Possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods: RMSprop [Tieleman and Hinton, 2012], adaDelta [Zeiler, 2012], ADAM [Kingma and Ba, 2014].

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 / 27

Advanced adaptive learning rates

Improving adaGrad (cont.)

RMSprop (picks up the running-average direction):

$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T$,  $H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$

$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$

$\Rightarrow\; w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}$,  $\mathrm{RMS}[g]_t = \mathrm{diag}(H_t)$   (11)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 / 27
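A sketch of the elementwise update (11):

    import numpy as np

    def rmsprop(grad_fn, w0, eta=1e-3, rho=0.9, delta=1e-8, n_steps=1000):
        w = w0.astype(float)
        G = np.zeros_like(w)          # exponential moving average of g^2
        for _ in range(n_steps):
            g = grad_fn(w)
            G = rho * G + (1 - rho) * g * g
            w -= eta * g / np.sqrt(delta + G)
        return w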

Advanced adaptive learning rates

Improving adaGrad (cont.)

A second-order heuristic (the approximate second-order direction): for a diagonal Hessian approximation H,

$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \;\Rightarrow\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \;\Rightarrow\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 / 27

Advanced adaptive learning rates

Improving adaGrad (cont.)

adaDelta:

$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

$w_{t+1} = \arg\min_w \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$

$\Rightarrow\; w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t$   (12)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 / 27

Advanced adaptive learning rates

Improving adaGrad (cont.)

ADAM (adds momentum and corrects the bias of the running averages):

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$,  $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

$\hat m_t = \frac{m_t}{1 - \beta_1^t}$,  $\hat v_t = \frac{v_t}{1 - \beta_2^t}$

$w_{t+1} = w_t - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \delta}$   (13)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 / 27
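A sketch of update (13):

    import numpy as np

    def adam(grad_fn, w0, eta=1e-3, b1=0.9, b2=0.999, delta=1e-8, n_steps=1000):
        w = w0.astype(float)
        m, v = np.zeros_like(w), np.zeros_like(w)
        for t in range(1, n_steps + 1):
            g = grad_fn(w)
            m = b1 * m + (1 - b1) * g        # 1st moment (momentum)
            v = b2 * v + (1 - b2) * g * g    # 2nd moment (running avg of g^2)
            m_hat = m / (1 - b1 ** t)        # bias correction
            v_hat = v / (1 - b2 ** t)
            w -= eta * m_hat / (np.sqrt(v_hat) + delta)
        return w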

Advanced adaptive learning rates

Demo time

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 / 27

Advanced adaptive learning rates

Learn the learning rates

Idea 3: learn the (global) learning rates as well. Since

$w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1})$,

$w_t$ is a function of η. Learn η (with gradient descent) by

$\eta = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn η on the fly [Masse and Ollivier, 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 / 27

Advanced adaptive learning rates

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 / 27

Advanced adaptive learning rates

References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(1):281-305, 2012.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.
Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T. and Hinton, G. Lecture 6.5 - RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
Maclaurin, D., Duvenaud, D., and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P.-Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 / 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 16: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Modern-day formulation

Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form

Minimise f (w) = E [F (w ξ)]

(over w in some domain W and some random variable ξ)

This is an immensely flexible framework

I F (w ξ) = f (w) + ξ models experimentalmeasurement error

I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression

I If f (w) = 1N

sumNi=1 g(w xi ) and F (w ξ) = 1

Ks

sumxisinξ g(w x) with

ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms

9 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

From batch to online learning (cont)

Follow the Leader (FTL):

  $\mathbf{w}_{t+1} = \mathrm{argmin}_{\mathbf{w} \in S} \sum_{i=1}^{t} f_i(\mathbf{w})$   (2)

Lemma (upper bound of regret)
Let $\mathbf{w}_1, \mathbf{w}_2, \ldots$ be the sequence produced by FTL; then $\forall \mathbf{u} \in S$:

  $R_T(\mathbf{u}) = \sum_{t=1}^{T} [f_t(\mathbf{w}_t) - f_t(\mathbf{u})] \leq \sum_{t=1}^{T} [f_t(\mathbf{w}_t) - f_t(\mathbf{w}_{t+1})]$   (3)

From batch to online learning (cont)

A game that fools FTL:

Let $w \in S = [-1, 1]$ and the loss function at time t be

  $f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$

FTL is easily fooled!
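To see the "fooling" concretely, here is a small simulation of this game (a sketch; the tie-breaking choice w = 0 when the cumulative loss is flat is an assumption):

    import numpy as np

    # coefficients a_t with f_t(w) = a_t * w: a_1 = -0.5,
    # then +1 on even t, -1 on odd t > 1
    T = 100
    a = np.array([-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, T + 1)])

    w, cum_a, ftl_loss = 0.0, 0.0, 0.0
    for t in range(T):
        ftl_loss += a[t] * w
        cum_a += a[t]
        # FTL: argmin over [-1, 1] of cum_a * w is the endpoint of opposite sign
        w = -np.sign(cum_a) if cum_a != 0 else 0.0

    best_fixed = min(np.sum(a) * u for u in (-1.0, 1.0))
    print(ftl_loss - best_fixed)  # regret grows linearly, roughly T

FTL keeps jumping to the endpoint that the next loss punishes, incurring loss +1 every round, while the best fixed point in hindsight does far better.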

From batch to online learning (cont)

Follow the Regularized Leader (FTRL):

  $\mathbf{w}_{t+1} = \mathrm{argmin}_{\mathbf{w} \in S} \sum_{i=1}^{t} f_i(\mathbf{w}) + \varphi(\mathbf{w})$   (4)

Lemma (upper bound of regret)
Let $\mathbf{w}_1, \mathbf{w}_2, \ldots$ be the sequence produced by FTRL; then $\forall \mathbf{u} \in S$:

  $\sum_{t=1}^{T} [f_t(\mathbf{w}_t) - f_t(\mathbf{u})] \leq \varphi(\mathbf{u}) - \varphi(\mathbf{w}_1) + \sum_{t=1}^{T} [f_t(\mathbf{w}_t) - f_t(\mathbf{w}_{t+1})]$

From batch to online learning (cont)

Let's assume the losses $f_t$ are convex functions:

  $\forall \mathbf{w}, \mathbf{u} \in S: \; f_t(\mathbf{u}) - f_t(\mathbf{w}) \geq \langle \mathbf{u} - \mathbf{w}, g(\mathbf{w}) \rangle, \quad \forall g(\mathbf{w}) \in \partial f_t(\mathbf{w})$

and use the linearisation $\tilde{f}_t(\mathbf{w}) = f_t(\mathbf{w}_t) + \langle \mathbf{w}, g(\mathbf{w}_t) \rangle$ as a surrogate:

  $\sum_{t=1}^{T} f_t(\mathbf{w}_t) - f_t(\mathbf{u}) \leq \sum_{t=1}^{T} \langle \mathbf{w}_t, g_t \rangle - \langle \mathbf{u}, g_t \rangle, \quad \forall g_t \in \partial f_t(\mathbf{w}_t)$

From batch to online learning (cont)

Online Gradient Descent (OGD):

  $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta g_t, \quad g_t \in \partial f_t(\mathbf{w}_t)$   (5)

From FTRL to online gradient descent:

  use the linearisation as a surrogate loss (it upper-bounds the LHS)
  use the L2 regularizer $\varphi(\mathbf{w}) = \frac{1}{2\eta} \|\mathbf{w} - \mathbf{w}_1\|_2^2$
  apply the FTRL regret bound to the surrogate loss:

  $\mathbf{w}_{t+1} = \mathrm{argmin}_{\mathbf{w} \in S} \sum_{i=1}^{t} \langle \mathbf{w}, g_i \rangle + \varphi(\mathbf{w})$

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(\mathbf{w}) = \frac{1}{2\eta} \|\mathbf{w} - \mathbf{w}_1\|_2^2$; then for all $\mathbf{u} \in S$:

  $R_T(\mathbf{u}) \leq \frac{1}{2\eta} \|\mathbf{u} - \mathbf{w}_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \quad g_t \in \partial f_t(\mathbf{w}_t)$

keep in mind for the later discussion of adaGrad:

I we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$;
  then the L2 norm above is replaced by the dual norm $\|\cdot\|_*$,
  and we use Hölder's inequality in the proof
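A minimal sketch of update (5) with regret tracking, on illustrative absolute losses $f_t(\mathbf{w}) = |\langle \mathbf{w}, \mathbf{x}_t \rangle - y_t|$; the data, the comparator u, and the step size are assumptions for the example:

    import numpy as np

    rng = np.random.default_rng(0)
    D, T, eta = 5, 2000, 0.05
    u = rng.normal(size=D)          # comparator: f_t(u) = 0 by construction

    w = np.zeros(D)
    regret = 0.0
    for t in range(T):
        x = rng.normal(size=D)
        y = u @ x
        err = w @ x - y
        g = np.sign(err) * x        # subgradient of |<w, x> - y| at w
        regret += abs(err)          # f_t(w_t) - f_t(u), since f_t(u) = 0
        w -= eta * g                # OGD update (5)

    print(regret / T)  # average regret stays small, as the theorem guarantees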

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$:

  $\psi_t$ is a strongly-convex function, and
  $B_{\psi_t}(\mathbf{w}, \mathbf{w}_t) = \psi_t(\mathbf{w}) - \psi_t(\mathbf{w}_t) - \langle \nabla \psi_t(\mathbf{w}_t), \mathbf{w} - \mathbf{w}_t \rangle$

Primal-dual sub-gradient update:

  $\mathbf{w}_{t+1} = \mathrm{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, g_t \rangle + \eta \varphi(\mathbf{w}) + \frac{1}{t} \psi_t(\mathbf{w})$   (6)

Proximal gradient / composite mirror descent:

  $\mathbf{w}_{t+1} = \mathrm{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, g_t \rangle + \eta \varphi(\mathbf{w}) + B_{\psi_t}(\mathbf{w}, \mathbf{w}_t)$   (7)

adaGrad [Duchi et al., 2010] (cont)

adaGrad with diagonal matrices:

  $G_t = \sum_{i=1}^{t} g_i g_i^{\top}, \quad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$

  $\mathbf{w}_{t+1} = \mathrm{argmin}_{\mathbf{w} \in S} \; \eta \langle \mathbf{w}, g_t \rangle + \eta \varphi(\mathbf{w}) + \frac{1}{2t} \mathbf{w}^{\top} H_t \mathbf{w}$   (8)

  $\mathbf{w}_{t+1} = \mathrm{argmin}_{\mathbf{w} \in S} \; \eta \langle \mathbf{w}, g_t \rangle + \eta \varphi(\mathbf{w}) + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^{\top} H_t (\mathbf{w} - \mathbf{w}_t)$   (9)

To get adaGrad from online gradient descent:

  add a proximal term or a Bregman divergence to the objective
  adapt the proximal term through time:

  $\psi_t(\mathbf{w}) = \frac{1}{2} \mathbf{w}^{\top} H_t \mathbf{w}$

adaGrad [Duchi et al., 2010] (cont)

example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(\mathbf{w}) = 0$:

  $\mathbf{w}_{t+1} = \mathrm{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, g_t \rangle + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^{\top} [\delta I + \mathrm{diag}(G_t)^{1/2}] (\mathbf{w} - \mathbf{w}_t)$

  $\Rightarrow \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}}$   (10)

(the division is per coordinate)
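A minimal sketch of the coordinate-wise update (10); the quadratic objective (with minimiser at the origin) and all constants are illustrative assumptions:

    import numpy as np

    def adagrad(grad, w0, eta=0.5, delta=1e-8, n_iters=300):
        w = np.asarray(w0, dtype=float).copy()
        sum_sq = np.zeros_like(w)       # running sum of squared gradients
        for _ in range(n_iters):
            g = grad(w)
            sum_sq += g ** 2
            w -= eta * g / (delta + np.sqrt(sum_sq))   # per-coordinate step
        return w

    # illustrative quadratic with minimiser at the origin
    grad_C = lambda w: np.array([12*w[0] + 8*w[1], 8*w[0] + 12*w[1]])
    print(adagrad(grad_C, [1.5, 0.5]))  # tends towards (0, 0)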

adaGrad [Duchi et al., 2010] (cont)

WHY does this work?

time-varying learning rate vs. time-varying proximal term:

I we want the contribution of the $\|g_t\|_*^2$ terms induced by $\psi_t$ to be upper-bounded by the $\|g_t\|_2$ terms

  define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \, \cdot \rangle}$;
  $\psi_t(\mathbf{w}) = \frac{1}{2} \mathbf{w}^{\top} H_t \mathbf{w}$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$

a little math can show that

  $\sum_{t=1}^{T} \|g_t\|_{\psi_t, *}^2 \leq 2 \sum_{d=1}^{D} \|g_{1:T, d}\|_2$

adaGrad [Duchi et al., 2010] (cont)

Theorem (Thm. 5 in the paper)
Assume the sequence $\mathbf{w}_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \geq \max_t \|g_t\|_{\infty}$; then for any $\mathbf{u} \in S$:

  $R_T(\mathbf{u}) \leq \frac{\delta}{\eta} \|\mathbf{u}\|_2^2 + \frac{1}{\eta} \|\mathbf{u}\|_{\infty}^2 \sum_{d=1}^{D} \|g_{1:T, d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T, d}\|_2$

For $\mathbf{w}_t$ generated using the composite mirror descent update (9):

  $R_T(\mathbf{u}) \leq \frac{1}{2\eta} \max_{t \leq T} \|\mathbf{u} - \mathbf{w}_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T, d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T, d}\|_2$

Improving adaGrad

possible problems of adaGrad:

  the global learning rate $\eta$ still needs hand tuning
  sensitive to initial conditions

possible improving directions:

  use a truncated sum / running average
  use (approximate) second-order information
  add momentum
  correct the bias of the running average

existing methods:

  RMSprop [Tieleman and Hinton, 2012]
  adaDelta [Zeiler, 2012]
  ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont)

RMSprop (takes the first direction above: a running average of squared gradients):

  $G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^{\top}, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$

  $\mathbf{w}_{t+1} = \mathrm{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, g_t \rangle + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^{\top} H_t (\mathbf{w} - \mathbf{w}_t)$

  $\Rightarrow \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta g_t}{\mathrm{RMS}[g]_t}, \quad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t)$   (11)
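A minimal sketch of the diagonal RMSprop update (11); the decay rate rho and the reuse of the earlier illustrative quadratic are assumptions:

    import numpy as np

    def rmsprop(grad, w0, eta=0.05, rho=0.9, delta=1e-8, n_iters=300):
        w = np.asarray(w0, dtype=float).copy()
        G = np.zeros_like(w)                    # running average of g^2
        for _ in range(n_iters):
            g = grad(w)
            G = rho * G + (1 - rho) * g ** 2
            w -= eta * g / np.sqrt(delta + G)   # divide by RMS[g]_t
        return w

    grad_C = lambda w: np.array([12*w[0] + 8*w[1], 8*w[0] + 12*w[1]])
    print(rmsprop(grad_C, [1.5, 0.5]))  # tends towards (0, 0)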

Improving adaGrad (cont)

A second-order heuristic (the route to adaDelta):

  $\Delta \mathbf{w} \propto H^{-1} g \propto \frac{\partial f / \partial \mathbf{w}}{\partial^2 f / \partial \mathbf{w}^2}$

  $\Rightarrow \; \partial^2 f / \partial \mathbf{w}^2 \propto \frac{\partial f / \partial \mathbf{w}}{\Delta \mathbf{w}}$

  $\Rightarrow \; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta \mathbf{w}]_{t-1}}$

Improving adaGrad (cont)

adaDelta (approximate second-order information; note there is no global learning rate):

  $\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta \mathbf{w}]_{t-1}}$

  $\mathbf{w}_{t+1} = \mathrm{argmin}_{\mathbf{w}} \; \langle \mathbf{w}, g_t \rangle + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^{\top} H_t (\mathbf{w} - \mathbf{w}_t)$

  $\Rightarrow \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\mathrm{RMS}[\Delta \mathbf{w}]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t$   (12)
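A sketch of update (12), with the running RMS estimates kept by the same exponential averaging as in RMSprop; the rho and eps values are assumptions:

    import numpy as np

    def adadelta(grad, w0, rho=0.95, eps=1e-6, n_iters=500):
        w = np.asarray(w0, dtype=float).copy()
        avg_g2, avg_dw2 = np.zeros_like(w), np.zeros_like(w)
        for _ in range(n_iters):
            g = grad(w)
            avg_g2 = rho * avg_g2 + (1 - rho) * g ** 2
            # no global learning rate: the scale comes from RMS[dw] / RMS[g]
            dw = -np.sqrt(avg_dw2 + eps) / np.sqrt(avg_g2 + eps) * g
            avg_dw2 = rho * avg_dw2 + (1 - rho) * dw ** 2
            w += dw
        return w

    grad_C = lambda w: np.array([12*w[0] + 8*w[1], 8*w[0] + 12*w[1]])
    print(adadelta(grad_C, [1.5, 0.5]))  # drifts towards (0, 0)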

Improving adaGrad (cont)

ADAM (adds momentum and corrects the bias of the running averages):

  $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

  $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

  $\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}$   (13)
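A sketch of update (13) using Kingma and Ba's default moment constants; the step size and the toy quadratic objective are assumptions for the example:

    import numpy as np

    def adam(grad, w0, eta=0.1, beta1=0.9, beta2=0.999, delta=1e-8, n_iters=500):
        w = np.asarray(w0, dtype=float).copy()
        m, v = np.zeros_like(w), np.zeros_like(w)
        for t in range(1, n_iters + 1):
            g = grad(w)
            m = beta1 * m + (1 - beta1) * g        # momentum (first moment)
            v = beta2 * v + (1 - beta2) * g ** 2   # second moment
            m_hat = m / (1 - beta1 ** t)           # bias correction
            v_hat = v / (1 - beta2 ** t)
            w -= eta * m_hat / (np.sqrt(v_hat) + delta)
        return w

    grad_C = lambda w: np.array([12*w[0] + 8*w[1], 8*w[0] + 12*w[1]])
    print(adam(grad_C, [1.5, 0.5]))  # tends towards (0, 0)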

Demo time!

See the Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

idea 3: learn the (global) learning rates as well

  $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta g_t \; \Rightarrow \; \mathbf{w}_t = \mathbf{w}(t, \eta; \mathbf{w}_1, \{g_i\}_{i \leq t-1})$

  so $\mathbf{w}_t$ is a function of $\eta$, and we can learn $\eta$ (with gradient descent) by

  $\eta = \mathrm{argmin}_{\eta} \sum_{s \leq t} f_s(\mathbf{w}(s, \eta))$

I reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]

I stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]

Summary

  Recent successes of machine learning rely on stochastic approximation and optimisation methods.
  Theoretical analysis guarantees good behaviour.
  Adaptive learning rates provide good performance in practice.
  Future work will reduce the labour spent on tuning hyper-parameters.

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.



Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 18: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Stochastic gradient descent

Wersquoll focus on proof techniques for stochastic gradient descent (SGD)

Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2

Plan

I Continuous gradient descent

I Discrete gradient descent

I Stochastic gradient descent

2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!


Discrete Gradient Descent

An intuitive, straightforward discretisation of

$$\frac{\mathrm{d}}{\mathrm{d}t}s(t) = -\nabla C(s(t))$$

is

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n). \qquad \text{(Forward Euler)}$$

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method which is (A-)stable is the implicit Euler method:

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1}).$$

So why settle for explicit discretisation? Implicit requires the solution of a non-linear system at each step, and practically, explicit discretisation works!
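To make the two updates concrete, here is a minimal sketch on the quadratic example from earlier; the fixed-point solver for the implicit step is an illustrative choice (it only contracts for small step sizes; in general one would use a Newton-type solver).

```python
import numpy as np

def grad_C(x):
    # gradient of C(x) = 5(x1+x2)^2 + (x1-x2)^2, i.e. (12x1 + 8x2, 8x1 + 12x2)
    return np.array([12*x[0] + 8*x[1], 8*x[0] + 12*x[1]])

def forward_euler_step(s, eps):
    # explicit update: s_{n+1} = s_n - eps * grad C(s_n)
    return s - eps * grad_C(s)

def implicit_euler_step(s, eps, iters=50):
    # implicit update: solve s_next = s - eps * grad C(s_next)
    # by fixed-point iteration (a contraction here since eps * 20 < 1)
    s_next = s.copy()
    for _ in range(iters):
        s_next = s - eps * grad_C(s_next)
    return s_next

s_exp = s_imp = np.array([1.5, 0.5])
for _ in range(200):
    s_exp = forward_euler_step(s_exp, eps=0.01)
    s_imp = implicit_euler_step(s_imp, eps=0.01)
print(s_exp, s_imp)  # both approach the unique minimiser (0, 0)
```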


Discrete Gradient Descent

Also have multistep methods, e.g.

$$s_{n+2} = s_{n+1} - \varepsilon_{n+2}\left(\frac{3}{2}\nabla C(s_{n+1}) - \frac{1}{2}\nabla C(s_n)\right).$$

This is the 2nd-order Adams-Bashforth method.

Similar (in)stability properties to forward Euler, but discretisation error of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).
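A sketch of the two-step update, under the same illustrative quadratic-gradient assumption as in the snippet above (the bootstrap Euler step is one standard way to obtain the second starting value):

```python
import numpy as np

def grad_C(x):
    return np.array([12*x[0] + 8*x[1], 8*x[0] + 12*x[1]])

def adams_bashforth2(s0, eps, steps):
    # two-step method: needs the previous two iterates; bootstrap with one Euler step
    s_prev, s = s0, s0 - eps * grad_C(s0)
    for _ in range(steps):
        s_prev, s = s, s + eps * (-1.5 * grad_C(s) + 0.5 * grad_C(s_prev))
    return s

print(adams_bashforth2(np.array([1.5, 0.5]), eps=0.01, steps=500))
```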

Discrete Gradient Descent

[Animations: examples of discretisation with ε = 0.1, ε = 0.07, and ε = 0.01]

Discrete Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies

1. $C$ has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x)\rangle > 0$;
3. $\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.

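A minimal sketch of the scheme with a step-size sequence satisfying both summability conditions; the gradient and the constant 20 in the schedule are illustrative assumptions, not from the slides.

```python
import numpy as np

def grad_C(x):
    # illustrative gradient satisfying the conditions (quadratic example)
    return np.array([12*x[0] + 8*x[1], 8*x[0] + 12*x[1]])

s = np.array([1.5, 0.5])
for n in range(1, 10001):
    eps_n = 1.0 / (20.0 + n)   # sum eps_n diverges, sum eps_n^2 converges
    s = s - eps_n * grad_C(s)
print(s)  # tends to the unique minimiser (0, 0)
```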

Discrete Gradient Descent

Proof sketch:

1. Define the Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$.
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$.
4. Show that this implies that $h_n$ converges.
5. Show that $h_n$ must converge to 0.
6. $s_n \to x^\star$.

Step 1: Define the Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$.

Because of the error that the discretisation introduced, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. (Illustration credit: Online learning and stochastic approximation, Leon Bottou (1998).)

Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

With some algebra, note that

$$h_{n+1} - h_n = \langle s_{n+1} - x^\star, s_{n+1} - x^\star\rangle - \langle s_n - x^\star, s_n - x^\star\rangle$$
$$= \langle s_{n+1}, s_{n+1}\rangle - \langle s_n, s_n\rangle - 2\,\langle s_{n+1} - s_n, x^\star\rangle$$
$$= \langle s_n - \varepsilon_n \nabla C(s_n),\, s_n - \varepsilon_n \nabla C(s_n)\rangle - \langle s_n, s_n\rangle + 2\varepsilon_n \langle \nabla C(s_n), x^\star\rangle$$
$$= -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2.$$

Step 3: Show that $\sum_{n=1}^\infty h_n^+ < \infty$.

Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2$, we get

$$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2 (A + B h_n)$$
$$\implies h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A.$$

Continuing: from

$$h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le \varepsilon_n^2 A,$$

writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get

$$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A.$$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable, and hence so is $\sum_n h_n^+$.

Step 4: Show that this implies that $h_n$ converges.

If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$. (Why? $h_n \ge 0$, so the cumulative downward movements cannot exceed $h_0$ plus the cumulative upward movements, which are finite.)

But $h_{n+1} = h_0 + \sum_{k=0}^{n} \big(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\big)$.

So $(h_n)_{n=1}^\infty$ converges.

Step 5: Show that $h_n$ must converge to 0. Assume instead that $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2.$$

This is summable, and so is the second term on the right, so $\sum_n \varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle < \infty$. If we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$$\langle s_n - x^\star, \nabla C(s_n)\rangle \to 0,$$

contradicting $h_n$ converging away from 0 (by assumption 2, the inner products would then be bounded away from 0).


Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x),$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K).$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

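A small sketch of such a subsampled estimator; the least-squares per-datapoint gradient below is a hypothetical choice for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_n(x, a, b):
    # per-datapoint gradient of 0.5 * (a.x - b)^2  (hypothetical example)
    return (a @ x - b) * a

def full_grad(x, A, y):
    # true gradient: average of per-datapoint gradients over all N points
    return np.mean([f_n(x, a, b) for a, b in zip(A, y)], axis=0)

def minibatch_grad(x, A, y, K):
    # unbiased estimate: average over a uniformly chosen subset I_K of size K
    idx = rng.choice(len(y), size=K, replace=False)
    return np.mean([f_n(x, A[i], y[i]) for i in idx], axis=0)
```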

Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if

- $\mathbb{E}[|X_n|] < \infty$ for all $n$;
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$.

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\left[(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\,\mathbb{1}_{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0}\right] < \infty,$$

then $X_n \to X_\infty$ almost surely.

³ On a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.


Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x)\rangle > 0$;
3. $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.


Stochastic Gradient Descent

Proof sketch:

1. Define the Lyapunov process $h_n = \|s_n - x^\star\|_2^2$.
2. Consider the variations $h_{n+1} - h_n$.
3. Show that $h_n$ converges almost surely.
4. Show that $h_n$ must converge to 0 almost surely.
5. $s_n \to x^\star$.


Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, H_n(s_n)\rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2.$$

So

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2\, \mathbb{E}\left[\|H_n(s_n)\|_2^2 \,\middle|\, \mathcal{F}_n\right].$$

Step 3: Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

- Assume $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^\star\|_2^2$.
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.

We get

$$\mathbb{E}\left[h'_{n+1} - h'_n \,\middle|\, \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$$
$$\implies \mathbb{E}\left[(h'_{n+1} - h'_n)\,\mathbb{1}_{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0} \,\middle|\, \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A.$$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h'_n)_{n=1}^\infty$ converges a.s. $\implies$ $(h_n)_{n=1}^\infty$ converges a.s.

Step 4: Show that $h_n$ must converge to 0 almost surely. From previous calculations,

$$\mathbb{E}\left[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \,\middle|\, \mathcal{F}_n\right] \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2 A.$$

$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the right term is summable a.s., and hence the middle term is also summable a.s.:

$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle < \infty \quad \text{almost surely.}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^\star, \nabla C(s_n)\rangle \to 0 \quad \text{almost surely,}$$

and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

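A minimal simulation of the proposition's setting; the additive zero-mean Gaussian noise on the quadratic example's gradient is a hypothetical choice satisfying condition 3.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_C(x):
    return np.array([12*x[0] + 8*x[1], 8*x[0] + 12*x[1]])

def H_n(x):
    # unbiased estimator of grad C(x): true gradient plus zero-mean noise,
    # so E[||H_n(x)||^2] <= A + B ||x - x*||^2 holds for suitable A, B
    return grad_C(x) + rng.normal(scale=1.0, size=2)

s = np.array([1.5, 0.5])
for n in range(1, 100001):
    eps_n = 1.0 / (20.0 + n)   # Robbins-Monro: sum eps = inf, sum eps^2 < inf
    s = s - eps_n * H_n(s)
print(s)  # close to x* = (0, 0) with high probability
```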

Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.


Motivation

So far we've told you why SGD is "safe" :)

...but Robbins-Monro is just a sufficient condition. So how do we choose learning rates to achieve fast convergence and a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

Idea 1: search for the best learning rate schedule: grid search, random search, cross validation, Bayesian optimization.

But I'm lazy and I don't want to spend too much time!

Motivation (cont.)

Idea 2: let the learning rates adapt themselves:

- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

- a good tutorial [Shalev-Shwartz, 2012]
- milestone paper [Zinkevich, 2003]

...and some practical comparisons.

Online learning

From batch to online learning

Batch learning often assumes:

- we've got the whole dataset $\mathcal{D}$;
- the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$.

SGD/mini-batch learning accelerates training in real time by:

- processing one data point or a mini-batch each iteration;
- approximating the gradient with per-data-point cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m).$$

Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

- each iteration $t$, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$;
- then we update the weights with $w \leftarrow w - \eta g_t$.

For SGD, the received gradient is defined by

$$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m).$$


Online learning

From batch to online learning (cont.)

Online learning assumes:

- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$);
- then we learn $w_{t+1}$ following some rules.

Performance is measured by regret:

$$R^\star_T = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w). \qquad (1)$$

General definition: $R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w). \qquad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,

$$R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right]. \qquad (3)$$

Online learning

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\,w, & t = 1 \\ w, & t \text{ is even} \\ -w, & t > 1 \text{ and } t \text{ is odd.} \end{cases}$$

FTL is easily fooled!
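A quick sketch of this game (a hypothetical simulation, just to see the oscillation): FTL plays the minimiser of the cumulative linear loss over $[-1,1]$, so it jumps between the endpoints and pays roughly 1 per round, while the best fixed point pays about $-0.5$ in total.

```python
import numpy as np

def f_coeff(t):
    # loss f_t(w) = c_t * w, with coefficients from the adversarial game above
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T = 20
cum = 0.0            # cumulative coefficient sum_{i <= t} c_i
w, losses = 0.0, []  # FTL may start anywhere in [-1, 1]; take w_1 = 0
for t in range(1, T + 1):
    c = f_coeff(t)
    losses.append(c * w)             # suffer f_t(w_t)
    cum += c
    w = -1.0 if cum > 0 else 1.0     # FTL: argmin over [-1, 1] of cum * w

print(sum(losses))  # grows like T: w_t oscillates and regret is linear
```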

Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w). \qquad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,

$$\sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right].$$

Online learning

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle, \quad \forall g(w) \in \partial f_t(w).$$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w - w_t, g_t\rangle$ as a surrogate; then

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t\rangle - \langle u, g_t\rangle, \quad \forall g_t \in \partial f_t(w_t).$$

Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t). \qquad (5)$$

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (upper-bounding the LHS);
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$;
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i\rangle + \varphi(w).$$

Online learning

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,

$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t).$$

Keep in mind for the later discussion of adaGrad: if instead we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$, then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_*$, and we use Holder's inequality in the proof.
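A generic OGD loop as a sketch; the projection and the (sub-)gradient callback are placeholders to be supplied by the problem at hand.

```python
import numpy as np

def ogd(subgrad, w1, eta, T, project=lambda w: w):
    # online gradient descent: w_{t+1} = Proj_S(w_t - eta * g_t), g_t in df_t(w_t)
    w = np.array(w1, dtype=float)
    ws = [w.copy()]
    for t in range(1, T + 1):
        g = subgrad(t, w)          # receive a (sub-)gradient of f_t at w_t
        w = project(w - eta * g)   # gradient step, then project back onto S
        ws.append(w.copy())
    return ws

# example: the FTL-fooling game above; with a small eta, OGD stays near 0
coeff = lambda t: -0.5 if t == 1 else (1.0 if t % 2 == 0 else -1.0)
ws = ogd(lambda t, w: np.array([coeff(t)]), [0.0], eta=0.1, T=20,
         project=lambda w: np.clip(w, -1.0, 1.0))
```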

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle.$$

Primal-dual sub-gradient update:

$$w_{t+1} = \arg\min_w\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w). \qquad (6)$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \arg\min_w\; \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t). \qquad (7)$$

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t)^{1/2},$$

$$w_{t+1} = \arg\min_{w \in S}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t}\, w^T H_t w, \qquad (8)$$

$$w_{t+1} = \arg\min_{w \in S}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t). \qquad (9)$$

To get adaGrad from online gradient descent: add a proximal term or a Bregman divergence to the objective, and adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2}\, w^T H_t w.$$

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \arg\min_w\; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T\left[\delta I + \mathrm{diag}(G_t)\right](w - w_t)$$

$$\Rightarrow\; w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \quad \text{(elementwise).} \qquad (10)$$
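Update (10) in code, as a minimal sketch (elementwise operations; the gradient stream is whatever the problem supplies, and the default $\eta$, $\delta$ values are illustrative):

```python
import numpy as np

def adagrad(grad_stream, w0, eta=0.1, delta=1e-8):
    # diagonal adaGrad, update (10):
    # w <- w - eta * g / (delta + sqrt(sum of squared past gradients))
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)       # per-coordinate sum of squared gradients, diag(G_t)
    for g in grad_stream:
        G += g * g
        w -= eta * g / (delta + np.sqrt(G))
    return w
```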

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term: we want the contribution of the dual-norm terms $\|g_t\|_*^2$ induced by $\psi_t$ to be upper-bounded in terms of $\|g_t\|_2$. Define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot,\, H_t\, \cdot\rangle}$; then $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$, and a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2.$$

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.$$

For $w_t$ generated using the composite mirror descent update (9),

$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T}\|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.$$

Advanced adaptive learning rates

Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand tuning;
- sensitive to initial conditions.

Possible improving directions:

- use a truncated sum / running average;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods: RMSprop [Tieleman and Hinton, 2012], adaDelta [Zeiler, 2012], ADAM [Kingma and Ba, 2014].

Advanced adaptive learning rates

Improving adaGrad (cont.)

RMSprop (a running average instead of the full sum):

$$G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2},$$

$$w_{t+1} = \arg\min_w\; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow\; w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t). \qquad (11)$$
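Update (11) as a sketch, with the running average kept per coordinate (the $\rho$, $\delta$, $\eta$ defaults are illustrative):

```python
import numpy as np

def rmsprop(grad_stream, w0, eta=0.01, rho=0.9, delta=1e-8):
    # RMSprop, update (11): divide the gradient by a running RMS of its magnitude
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)                   # running average of squared gradients
    for g in grad_stream:
        G = rho * G + (1.0 - rho) * g * g
        w -= eta * g / np.sqrt(delta + G)  # RMS[g]_t = sqrt(delta + G)
    return w
```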

Advanced adaptive learning rates

Improving adaGrad (cont.)

A second-order heuristic (a units argument): for Newton-like steps,

$$\Delta w \propto H^{-1} g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2} \;\Rightarrow\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w} \;\Rightarrow\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}.$$

adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}},$$

$$w_{t+1} = \arg\min_w\; \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow\; w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t. \qquad (12)$$
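Update (12) as a sketch; both running averages use the same decay $\rho$, following the scheme above, and note that no global learning rate $\eta$ appears.

```python
import numpy as np

def adadelta(grad_stream, w0, rho=0.95, delta=1e-6):
    # adaDelta, update (12): step = RMS of past updates over RMS of gradients
    w = np.array(w0, dtype=float)
    Eg2 = np.zeros_like(w)    # running average of g^2
    Edw2 = np.zeros_like(w)   # running average of (delta w)^2
    for g in grad_stream:
        Eg2 = rho * Eg2 + (1.0 - rho) * g * g
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
        Edw2 = rho * Edw2 + (1.0 - rho) * dw * dw
        w += dw
    return w
```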

Advanced adaptive learning rates

Improving adaGrad (cont.)

ADAM (momentum on the gradient, plus bias correction of both running averages):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t},$$

$$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}. \qquad (13)$$
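Update (13) as a sketch (the default $\beta_1$, $\beta_2$, $\delta$ values follow the paper's suggestions):

```python
import numpy as np

def adam(grad_stream, w0, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    # ADAM, update (13): bias-corrected first and second moment estimates
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t, g in enumerate(grad_stream, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w
```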

Advanced adaptive learning rates

Demo time!

Wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization

Advanced adaptive learning rates

Learn the learning rates

Idea 3: learn the (global) learning rates as well.

$$w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1}),$$

i.e. $w_t$ is a function of $\eta$, so learn $\eta$ (with gradient descent) by

$$\eta^\star = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta)).$$

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015].
- Stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015].

Advanced adaptive learning rates

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce labour on tuning hyper-parameters.

Advanced adaptive learning rates

References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281-305, 2012.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
- Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
- Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.


Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 20: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Continuous Gradient Descent

Let $C : \mathbb{R}^k \to \mathbb{R}$ be differentiable. How do solutions $s : \mathbb{R} \to \mathbb{R}^k$ to the following ODE behave?
$$\frac{d}{dt}s(t) = -\nabla C(s(t))$$

Example: $C(x) = 5(x_1 + x_2)^2 + (x_1 - x_2)^2$.

We can analytically solve the gradient descent equation for this example:
$$\nabla C(x) = (12x_1 + 8x_2,\; 8x_1 + 12x_2)$$
So $(s_1', s_2') = -(12s_1 + 8s_2,\; 8s_1 + 12s_2)$, and solving with $(s_1(0), s_2(0)) = (1.5, 0.5)$ gives
$$\begin{pmatrix} s_1(t) \\ s_2(t) \end{pmatrix} = \begin{pmatrix} e^{-20t} + 0.5e^{-4t} \\ e^{-20t} - 0.5e^{-4t} \end{pmatrix}$$
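As a quick sanity check (not part of the original slides), we can integrate the ODE numerically and compare against the closed form above; the step size `dt` and the horizon are arbitrary illustrative choices:

```python
import numpy as np

# Gradient of C(x) = 5(x1 + x2)^2 + (x1 - x2)^2
def grad_C(x):
    return np.array([12*x[0] + 8*x[1], 8*x[0] + 12*x[1]])

# Closed-form solution derived above
def s_exact(t):
    return np.array([np.exp(-20*t) + 0.5*np.exp(-4*t),
                     np.exp(-20*t) - 0.5*np.exp(-4*t)])

# Integrate ds/dt = -grad_C(s) with small forward-Euler steps up to t = 1
dt, s = 1e-5, np.array([1.5, 0.5])
for _ in range(int(1.0 / dt)):
    s = s - dt * grad_C(s)

print(s)             # numerical solution at t = 1 ...
print(s_exact(1.0))  # ... should agree closely with the closed form
```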

[Video: exact solution of the gradient descent ODE.]

Continuous Gradient Descent

We'll start with some simplifying assumptions:

1. $C$ has a unique minimiser $x^\star$
2. $\forall \varepsilon > 0$: $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$

(The second condition is weaker than convexity.)

Proposition: If $s : [0, \infty) \to \mathcal{X}$ satisfies the differential equation
$$\frac{ds(t)}{dt} = -\nabla C(s(t))$$
then $s(t) \to x^\star$ as $t \to \infty$.

[Figure: an example of a non-convex function satisfying these conditions.]

Continuous Gradient Descent

Proof sketch:

1. Define the Lyapunov function $h(t) = \|s(t) - x^\star\|_2^2$
2. $h(t)$ is a decreasing function
3. $h(t)$ converges to a limit, and $h'(t)$ converges to 0
4. The limit of $h(t)$ must be 0
5. $s(t) \to x^\star$

$h(t)$ is a decreasing function:
$$\frac{d}{dt}h(t) = \frac{d}{dt}\|s(t) - x^\star\|_2^2 = \frac{d}{dt}\langle s(t) - x^\star,\, s(t) - x^\star \rangle = 2\left\langle \frac{ds(t)}{dt},\, s(t) - x^\star \right\rangle = -2\langle s(t) - x^\star, \nabla C(s(t)) \rangle \le 0$$

The limit of $h(t)$ must be 0: if not, there is some $K > 0$ such that $h(t) > K$ for all $t$. By assumption,
$$\inf_{x \,:\, \|x - x^\star\|_2^2 > K} \langle x - x^\star, \nabla C(x) \rangle > 0,$$
so $h'(t)$ is bounded above by a negative constant, contradicting the convergence of $h'(t)$ to 0.

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible. To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!

Discrete Gradient Descent

An intuitive, straightforward discretisation of
$$\frac{d}{dt}s(t) = -\nabla C(s(t))$$
is
$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n) \qquad \text{(forward Euler)}$$
Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method which is (A-)stable is the implicit Euler method:
$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})$$
So why settle for explicit discretisation? The implicit method requires the solution of a non-linear system at each step, and practically, explicit discretisation works!
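A minimal sketch of the two discretisations on the quadratic example from earlier (the step size and iteration count are illustrative assumptions, not values from the slides). Because the gradient is linear here, the implicit step reduces to a linear solve rather than a general non-linear system:

```python
import numpy as np

A = np.array([[12., 8.], [8., 12.]])   # grad_C(s) = A s for the running example

def forward_euler_step(s, eps):
    # explicit update: exactly a gradient descent step
    return s - eps * (A @ s)

def implicit_euler_step(s, eps):
    # solve s_next = s - eps * A s_next  <=>  (I + eps*A) s_next = s
    return np.linalg.solve(np.eye(2) + eps * A, s)

eps = 0.12            # unstable for forward Euler here: |1 - eps*20| > 1
s_fw = s_im = np.array([1.5, 0.5])
for _ in range(100):
    s_fw = forward_euler_step(s_fw, eps)
    s_im = implicit_euler_step(s_im, eps)
print(s_fw)           # blows up: the stiff eigen-direction oscillates and grows
print(s_im)           # decays towards the minimiser (0, 0) for any eps > 0
```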

Discrete Gradient Descent

There are also multistep methods, e.g.
$$s_{n+2} = s_{n+1} - \varepsilon_{n+2}\left(\frac{3}{2}\nabla C(s_{n+1}) - \frac{1}{2}\nabla C(s_n)\right)$$
This is the 2nd-order Adams-Bashforth method. It has similar (in)stability properties to forward Euler, but a discretisation error of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).
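A sketch of the same method in code, assuming the first step is bootstrapped with a forward Euler step (a standard choice for starting two-step methods, not specified in the slides):

```python
import numpy as np

def adams_bashforth2(s0, grad_C, eps, n_steps):
    # two-step method: bootstrap the first step with forward Euler
    s_prev, s = s0, s0 - eps * grad_C(s0)
    for _ in range(n_steps - 1):
        s, s_prev = s + eps * (-1.5 * grad_C(s) + 0.5 * grad_C(s_prev)), s
    return s

# usage on the running quadratic example: converges towards (0, 0)
A = np.array([[12., 8.], [8., 12.]])
print(adams_bashforth2(np.array([1.5, 0.5]), lambda s: A @ s, 0.01, 500))
```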

Discrete Gradient Descent

[Videos: examples of discretisation with $\varepsilon = 0.1$, $\varepsilon = 0.07$, and $\varepsilon = 0.01$.]

Discrete Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies

1. $C$ has a unique minimiser $x^\star$
2. $\forall \varepsilon > 0$: $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$
3. $\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.

Discrete Gradient Descent

Proof sketch:

1. Define the Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^\star$

Define the Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$. Because of the error that the discretisation introduced, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof. Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. (Illustration credit: Online Learning and Stochastic Approximation, Leon Bottou (1998).)

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. With some algebra, using $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$, note that
$$h_{n+1} - h_n = \langle s_{n+1} - x^\star, s_{n+1} - x^\star \rangle - \langle s_n - x^\star, s_n - x^\star \rangle = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$$

Show that $\sum_{n=1}^\infty h_n^+ < \infty$: assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2$, we get
$$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + Bh_n)$$
$$\implies h_{n+1} - (1 + \varepsilon_n^2 B)h_n \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$$
Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$, we get
$$h_{n+1}' - h_n' \le \varepsilon_n^2 \mu_n A$$
Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the right-hand side is summable.

Show that this implies that $h_n$ converges: if $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (since $h_n \ge 0$ is bounded below). But
$$h_{n+1} = h_1 + \sum_{k=1}^n \big(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\big),$$
so $(h_n)_{n=1}^\infty$ converges.

Show that $h_n$ must converge to 0: assume instead that $h_n$ converges to some positive number. Previously we had
$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$$
This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get
$$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0,$$
contradicting $h_n$ converging away from 0.

Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm. If $C$ is a cost function averaged across a data set, we often have a true gradient of the form
$$\nabla C(x) = \frac{1}{N}\sum_{n=1}^N f_n(x)$$
and an approximation is formed by subsampling:
$$\widehat{\nabla C}(x) = \frac{1}{K}\sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$$
We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent. The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
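To make the subsampling construction concrete, here is a small sketch (the toy data and the squared-error cost are illustrative assumptions) that checks empirically that the minibatch estimate is unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))            # toy dataset, N = 1000

def full_gradient(w):
    # gradient of C(w) = mean over the data of ||x - w||^2 / 2
    return np.mean(w - X, axis=0)

def minibatch_gradient(w, K=32):
    # unbiased estimate: average per-datapoint gradients over a
    # uniformly subsampled index set of size K
    idx = rng.choice(len(X), size=K, replace=False)
    return np.mean(w - X[idx], axis=0)

w = np.array([1.0, -1.0])
estimates = np.stack([minibatch_gradient(w) for _ in range(5000)])
print(estimates.mean(axis=0))   # ≈ full_gradient(w): the estimator is unbiased
print(full_gradient(w))
```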

Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process. $\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale (on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$) if

- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and
$$\sum_{n=1}^\infty \mathbb{E}\left[(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\,\mathbf{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}}\right] < \infty$$
then $X_n \to X_\infty$ almost surely.

Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^\star$
2. $\forall \varepsilon > 0$: $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$
3. $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$ almost surely.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement give the control we need over the unbiased gradient estimators.
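A toy illustration of the proposition (the quadratic cost, the noise model, and the $\varepsilon_n = 1/n$ schedule are assumptions chosen to satisfy the stated conditions):

```python
import numpy as np

rng = np.random.default_rng(1)

def H(s):
    # unbiased estimate of grad C(s) = 2s for C(s) = ||s||_2^2 (so x* = 0),
    # corrupted by zero-mean Gaussian noise
    return 2 * s + rng.normal(size=s.shape)

s = np.array([5.0, -3.0])
for n in range(1, 100_001):
    s = s - (1.0 / n) * H(s)   # eps_n = 1/n: sum eps_n = inf, sum eps_n^2 < inf
print(s)                       # close to the minimiser (0, 0)
```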

Stochastic Gradient Descent

Proof sketch:

1. Define the Lyapunov process $h_n = \|s_n - x^\star\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^\star$

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. By exactly the same calculation as in the deterministic discrete case,
$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$$
So
$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2\, \mathbb{E}\left[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\right]$$

Show that $h_n$ converges almost surely — exactly the same as for discrete gradient descent:

- assume $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^\star\|_2^2$
- introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$

We get
$$\mathbb{E}[h_{n+1}' - h_n' \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A \implies \mathbb{E}\left[(h_{n+1}' - h_n')\,\mathbf{1}_{\{\mathbb{E}[h_{n+1}' - h_n' \mid \mathcal{F}_n] > 0\}} \,\middle|\, \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$$
$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h_n')_{n=1}^\infty$ converges a.s. $\implies$ $(h_n)_{n=1}^\infty$ converges a.s.

Show that $h_n$ must converge to 0 almost surely. From the previous calculations,
$$\mathbb{E}\left[h_{n+1} - (1 + \varepsilon_n^2 B)h_n \mid \mathcal{F}_n\right] \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A$$
$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the $\varepsilon_n^2 A$ term is summable a.s., and hence the remaining term is also summable a.s.:
$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle < \infty \quad \text{almost surely}$$
If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces $\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0$ almost surely, and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely. $\square$

Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above — assume:

- $C$ is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- the gradient estimate $H$ satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
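A small illustration of this weaker guarantee (a sketch only: the double-well cost and noise scale are assumptions chosen for the demo, and the growth conditions above are about controlling the tails, which we sidestep here by keeping the noise bounded). Different runs settle near different critical points depending on the start position:

```python
import numpy as np

rng = np.random.default_rng(2)

# Non-convex double-well cost C(x) = (x^2 - 1)^2 / 4,
# with critical points at x = -1 (min), 0 (max), +1 (min)
def grad_C(x):
    return x**3 - x

ends = []
for trial in range(10):
    x = rng.uniform(-2, 2)
    for n in range(1, 20_001):
        g = grad_C(x) + 0.1 * rng.normal()   # unbiased gradient estimate
        x -= (1.0 / n) * g
    ends.append(round(x, 2))
# each run settles near a critical point (here ±1, which depends on the
# initialisation; the unstable point at 0 is escaped by the noise)
print(ends)
```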

Motivation

So far we've told you why SGD is "safe" :) But Robbins-Monro is just a sufficient condition — how should we choose the learning rates to achieve fast convergence and a better local optimum?

Motivation (cont.)

[Figure from [Bergstra and Bengio, 2012].]

Idea 1: search for the best learning rate schedule — grid search, random search, cross-validation, Bayesian optimization. But I'm lazy and I don't want to spend too much time!

Motivation (cont.)

Idea 2: let the learning rates adapt themselves:

- pre-conditioning with $|\mathrm{diag}(H)|$ [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning (a good tutorial: [Shalev-Shwartz, 2012]; milestone paper: [Zinkevich, 2003]), and some practical comparisons.

Online learning: from batch to online learning

Batch learning often assumes:

- we've got the whole dataset $\mathcal{D}$
- the cost function is $C(w; \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by:

- processing one / a mini-batch of datapoint(s) each iteration
- considering gradients with data-point cost functions:
$$\nabla C(w; \mathcal{D}) \approx \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

- each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by
$$g_t = \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$

From batch to online learning (cont.)

Online learning assumes:

- each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:
$$R_T^\star = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \qquad (1)$$
General definition: $R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

From batch to online learning (cont.)

Follow the Leader (FTL):
$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) \qquad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,
$$R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right] \qquad (3)$$

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$, and let the loss function at time $t$ be
$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$
FTL is easily fooled!
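A sketch simulating this game (the horizon $T$ and the endpoint-enumeration argmin are implementation assumptions): FTL's cumulative loss grows linearly in $T$, while the fixed comparator $u = 0$ pays nothing, so the regret is linear:

```python
def f(t, w):
    # the adversarial loss sequence from the slide
    if t == 1:
        return -0.5 * w
    return w if t % 2 == 0 else -w

T, S = 20, (-1.0, 1.0)
cum = lambda w, t: sum(f(i, w) for i in range(1, t + 1))

w, total_ftl = 0.0, 0.0            # start anywhere in S
for t in range(1, T + 1):
    total_ftl += f(t, w)
    # FTL: minimise the cumulative loss so far over S = [-1, 1];
    # the cumulative loss is linear in w, so an endpoint is optimal
    w = min(S, key=lambda v: cum(v, t))

print(total_ftl)    # grows like T: FTL keeps jumping to the wrong endpoint
print(cum(0.0, T))  # = 0: the fixed point u = 0 incurs no loss, so regret ~ T
```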

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):
$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,
$$\sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right]$$

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:
$$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)$$
Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle$ as a surrogate:
$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$$

From batch to online learning (cont.)

Online Gradient Descent (OGD):
$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (it upper-bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:
$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$$

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,
$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$$

Keep in mind for the later discussion of adaGrad: if we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$, then the L2 norm above is replaced by the dual norm $\|\cdot\|_\star$, and we use Hölder's inequality in the proof.
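A numerical sanity check of the bound (the quadratic losses, data distribution, and $\eta = 0.1$ are illustrative assumptions; the subgradients in the bound are the ones observed at the iterates):

```python
import numpy as np

rng = np.random.default_rng(3)
zs = rng.normal(loc=[1.0, -2.0], scale=1.0, size=(500, 2))

eta, w = 0.1, np.zeros(2)
losses_ogd, gsum = [], 0.0
for z in zs:
    g = w - z                       # gradient of f_t(w) = ||w - z_t||^2 / 2
    losses_ogd.append(0.5 * np.sum((w - z)**2))
    gsum += np.sum(g**2)
    w = w - eta * g                 # OGD update (5)

u = zs.mean(axis=0)                 # best fixed comparator in hindsight
regret = sum(losses_ogd) - np.sum(0.5 * np.sum((u - zs)**2, axis=1))
bound = np.sum(u**2) / (2 * eta) + eta * gsum   # w_1 = 0 here
print(regret, bound)                # regret <= bound, as the theorem predicts
```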

Advanced adaptive learning rates: adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and
$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t \rangle$$

Primal-dual sub-gradient update:
$$w_{t+1} = \arg\min_w \; \eta\langle w, g_t \rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent:
$$w_{t+1} = \arg\min_w \; \eta\langle w, g_t \rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:
$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$
$$w_{t+1} = \arg\min_{w \in S} \; \eta\langle w, g_t \rangle + \eta\varphi(w) + \frac{1}{2t} w^T H_t w \qquad (8)$$
$$w_{t+1} = \arg\min_{w \in S} \; \eta\langle w, g_t \rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent: add a proximal term or a Bregman divergence to the objective, and adapt the proximal term through time:
$$\psi_t(w) = \frac{1}{2} w^T H_t w$$

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:
$$w_{t+1} = \arg\min_w \; \eta\langle w, g_t \rangle + \frac{1}{2}(w - w_t)^T \left[\delta I + \mathrm{diag}(G_t^{1/2})\right](w - w_t)$$
$$\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \quad \text{(elementwise)} \qquad (10)$$
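A minimal numpy sketch of update (10) (the learning rate, $\delta$, and the test problem are illustrative assumptions):

```python
import numpy as np

def adagrad(grad, w0, eta=0.5, delta=1e-8, n_steps=500):
    # sketch of update (10): per-coordinate step sizes scaled by the
    # root of the accumulated squared gradients
    w, G = w0.astype(float), np.zeros_like(w0, dtype=float)
    for _ in range(n_steps):
        g = grad(w)
        G += g**2                              # running sum of g_i^2
        w -= eta * g / (delta + np.sqrt(G))    # elementwise division
    return w

# usage: a badly scaled quadratic, where per-coordinate scaling helps
grad = lambda w: np.array([100.0 * w[0], 0.01 * w[1]])
print(adagrad(grad, np.array([1.0, 1.0])))     # approaches (0, 0)
```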

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term: we want the contribution of the $\|g_t\|_\star^2$ terms induced by $\psi_t$ to be upper-bounded in terms of the observed gradients. Define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot \rangle}$; then $\psi_t(w) = \frac{1}{2}w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$, and a little math can show that
$$\sum_{t=1}^T \|g_t\|_{\psi_t,\star}^2 \le 2\sum_{d=1}^D \|g_{1:T,d}\|_2$$

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,
$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$$
For $w_t$ generated using the composite mirror descent update (9),
$$R_T(u) \le \frac{1}{2\eta}\max_{t \le T}\|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$$

Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand-tuning
- sensitive to initial conditions

Possible improving directions:

- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods: RMSprop [Tieleman and Hinton, 2012], adaDelta [Zeiler, 2012], ADAM [Kingma and Ba, 2014].

Improving adaGrad (cont.)

RMSprop:
$$G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$
$$w_{t+1} = \arg\min_w \; \eta\langle w, g_t \rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \implies w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \quad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$$
(Improving direction used: replace adaGrad's ever-growing sum with a running average.)

Improving adaGrad (cont.)

Approximate second-order information from the ratio of RMS quantities:
$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \implies \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \implies \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

Improving adaGrad (cont.)

adaDelta:
$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$
$$w_{t+1} = \arg\min_w \; \langle w, g_t \rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12)$$
(Improving directions used: running averages plus approximate second-order information; note that no global learning rate appears in the update.)

Improving adaGrad (cont.)

ADAM:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13)$$
(Improving directions used: running averages, momentum, and bias correction of the running averages.)
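Minimal sketches of updates (11) and (13) side by side (the hyper-parameter values are the usual defaults, assumed for illustration; adaDelta's update (12) follows the same pattern with an extra running average of $\Delta w$):

```python
import numpy as np

def rmsprop_step(w, g, G, eta=0.01, rho=0.9, delta=1e-8):
    # update (11): running average of squared gradients
    G = rho * G + (1 - rho) * g**2
    return w - eta * g / np.sqrt(delta + G), G

def adam_step(w, g, m, v, t, eta=0.01, b1=0.9, b2=0.999, delta=1e-8):
    # update (13): momentum + running average, both bias-corrected
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    return w - eta * m_hat / (np.sqrt(v_hat) + delta), m, v

# usage on the badly scaled quadratic from the adaGrad example
grad = lambda w: np.array([100.0 * w[0], 0.01 * w[1]])
w_r = w_a = np.array([1.0, 1.0])
G, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    w_r, G = rmsprop_step(w_r, grad(w_r), G)
    w_a, m, v = adam_step(w_a, grad(w_a), m, v, t)
print(w_r, w_a)   # both drive the iterates towards the minimiser (0, 0)
```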

Demo time

Wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

Idea 3: learn the (global) learning rates as well. Since
$$w_{t+1} = w_t - \eta g_t \implies w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1}),$$
$w_t$ is a function of $\eta$, so we can learn $\eta$ (with gradient descent) by
$$\eta^\star = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))$$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour spent on tuning hyper-parameters.

References

- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(1):281-305, 2012.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.
- Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Maclaurin, D., Duvenaud, D., and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P.-Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
- Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 21: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)

11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Proof sketch:
1. Define the Lyapunov function h(t) = ‖s(t) − x*‖_2^2
2. h(t) is a decreasing function
3. h(t) converges to a limit, and h′(t) converges to 0
4. The limit of h(t) must be 0
5. s(t) → x*

Step 2: h(t) is a decreasing function:

  d/dt h(t) = d/dt ‖s(t) − x*‖_2^2
            = d/dt ⟨s(t) − x*, s(t) − x*⟩
            = 2⟨ds(t)/dt, s(t) − x*⟩
            = −2⟨s(t) − x*, ∇C(s(t))⟩ ≤ 0

Step 4: The limit of h(t) must be 0. If not, there is some K > 0 such that h(t) > K for all t. By assumption,

  inf_{‖x − x*‖_2^2 > K} ⟨x − x*, ∇C(x)⟩ > 0,

which means that sup_t h′(t) < 0, contradicting the convergence of h′(t) to 0.

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the proposition.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!

Discrete Gradient Descent

An intuitive, straightforward discretisation of

  ds(t)/dt = −∇C(s(t))

is

  (s_{n+1} − s_n)/ε_n = −∇C(s_n)    (Forward Euler)

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method which is (A-)stable is the implicit Euler method:

  (s_{n+1} − s_n)/ε_n = −∇C(s_{n+1})

So why settle for explicit discretisation? Implicit discretisation requires the solution of a non-linear system at each step, and practically, explicit discretisation works!
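To see the stability issue concretely, here is a small sketch of ours on the earlier quadratic example, where ∇C(s) = As with A = [[12, 8], [8, 12]] and λ_max = 20: forward Euler requires ε < 2/λ_max = 0.1, while implicit Euler (here just a linear solve per step, since ∇C is linear) is stable for any ε > 0.

```python
import numpy as np

A = np.array([[12.0, 8.0], [8.0, 12.0]])
I = np.eye(2)

def forward_euler(s, eps):
    return s - eps * (A @ s)               # s_{n+1} = s_n − ε∇C(s_n)

def implicit_euler(s, eps):
    # solve (I + εA) s_{n+1} = s_n
    return np.linalg.solve(I + eps * A, s)

for eps in [0.05, 0.15]:
    se = si = np.array([1.5, 0.5])
    for _ in range(100):
        se, si = forward_euler(se, eps), implicit_euler(si, eps)
    print(eps, np.linalg.norm(se), np.linalg.norm(si))
# ε = 0.05: both iterates → 0;  ε = 0.15: forward Euler blows up,
# implicit Euler still → 0
```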

Discrete Gradient Descent

We also have multistep methods, e.g.

  s_{n+2} = s_{n+1} − ε_{n+2} ( (3/2)∇C(s_{n+1}) − (1/2)∇C(s_n) )

This is the 2nd-order Adams-Bashforth method.

It has similar (in)stability properties to forward Euler, but a discretisation error of a smaller order of magnitude at each step (O(ε³) compared with O(ε²)).
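A sketch of ours of the two-step update on the same quadratic example (the forward Euler bootstrap step and the constant ε are our choices; a two-step method needs two starting values):

```python
import numpy as np

A = np.array([[12.0, 8.0], [8.0, 12.0]])   # ∇C(s) = A s
eps = 0.01

s_prev = np.array([1.5, 0.5])
s_curr = s_prev - eps * (A @ s_prev)       # bootstrap with one forward Euler step
for _ in range(1000):
    # 2nd-order Adams-Bashforth step
    s_next = s_curr - eps * (1.5 * (A @ s_curr) - 0.5 * (A @ s_prev))
    s_prev, s_curr = s_curr, s_next
print(s_curr)                              # ≈ 0
```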

Discrete Gradient Descent

[Animations: examples of discretisation with ε = 0.1, ε = 0.07, and ε = 0.01]

Discrete Gradient Descent

Proposition: If s_{n+1} = s_n − ε_n ∇C(s_n) and C satisfies

1. C has a unique minimiser x*
2. ∀ε > 0, inf_{‖x − x*‖_2^2 > ε} ⟨x − x*, ∇C(x)⟩ > 0
3. ‖∇C(x)‖_2^2 ≤ A + B‖x − x*‖_2^2 for some A, B ≥ 0

then, subject to Σ_n ε_n = ∞ and Σ_n ε_n^2 < ∞, we have s_n → x*.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.

Discrete Gradient Descent

Proof sketch:
1. Define the Lyapunov sequence h_n = ‖s_n − x*‖_2^2
2. Consider the positive variations h_n⁺ = max(0, h_{n+1} − h_n)
3. Show that Σ_{n=1}^∞ h_n⁺ < ∞
4. Show that this implies that h_n converges
5. Show that h_n must converge to 0
6. s_n → x*

Step 1: Define the Lyapunov sequence h_n = ‖s_n − x*‖_2^2.

Because of the error that the discretisation introduced, h_n is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: h_n may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of h_n. [Illustration credit: Online Learning and Stochastic Approximation, Léon Bottou (1998)]

Step 2: Consider the positive variations h_n⁺ = max(0, h_{n+1} − h_n).

With some algebra, note that

  h_{n+1} − h_n = ⟨s_{n+1} − x*, s_{n+1} − x*⟩ − ⟨s_n − x*, s_n − x*⟩
                = ⟨s_{n+1}, s_{n+1}⟩ − ⟨s_n, s_n⟩ − 2⟨s_{n+1} − s_n, x*⟩
                = ⟨s_n − ε_n∇C(s_n), s_n − ε_n∇C(s_n)⟩ − ⟨s_n, s_n⟩ + 2ε_n⟨∇C(s_n), x*⟩
                = −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n^2 ‖∇C(s_n)‖_2^2

Step 3: Show that Σ_{n=1}^∞ h_n⁺ < ∞.

Assuming ‖∇C(x)‖_2^2 ≤ A + B‖x − x*‖_2^2, we get

  h_{n+1} − h_n ≤ −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n^2 (A + Bh_n)
  ⟹ h_{n+1} − (1 + ε_n^2 B)h_n ≤ −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n^2 A ≤ ε_n^2 A

Writing μ_n = Π_{i=1}^n 1/(1 + ε_i^2 B) and h_n′ = μ_n h_n, we get

  h′_{n+1} − h′_n ≤ ε_n^2 μ_n A

Assuming Σ_{n=1}^∞ ε_n^2 < ∞, μ_n converges away from 0, so the RHS is summable.

Step 4: Show that this implies that h_n converges.

If Σ_{n=1}^∞ max(0, h_{n+1} − h_n) < ∞, then Σ_{n=1}^∞ min(0, h_{n+1} − h_n) > −∞ (why?).

But h_{n+1} = h_0 + Σ_{k=0}^n (max(0, h_{k+1} − h_k) + min(0, h_{k+1} − h_k)).

So (h_n)_{n=1}^∞ converges.

Step 5: Show that h_n must converge to 0. Assume h_n converges to some positive number. Previously we had

  h_{n+1} − h_n = −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n^2 ‖∇C(s_n)‖_2^2

This is summable, and if we assume further that Σ_{n=1}^∞ ε_n = ∞, then we get

  ⟨s_n − x*, ∇C(s_n)⟩ → 0,

contradicting h_n converging away from 0.
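To make the step-size conditions concrete, here is a sketch of ours on the same quadratic example: the schedule ε_n = 1/(10 + n) satisfies Σ_n ε_n = ∞ and Σ_n ε_n^2 < ∞, and the iterates converge to the minimiser x* = 0.

```python
import numpy as np

A = np.array([[12.0, 8.0], [8.0, 12.0]])   # ∇C(s) = A s, minimiser x* = 0
s = np.array([1.5, 0.5])
for n in range(1, 10001):
    s = s - (1.0 / (10 + n)) * (A @ s)     # s_{n+1} = s_n − ε_n ∇C(s_n)
print(np.linalg.norm(s))                   # ≈ 0: iterates converge to x*
```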

Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If C is a cost function averaged across a data set, we often have a true gradient of the form

  ∇C(x) = (1/N) Σ_{n=1}^N f_n(x),

and an approximation is formed by subsampling:

  ∇̂C(x) = (1/K) Σ_{k∈I_K} f_k(x),   I_K ∼ Unif(subsets of size K)

We'll treat the more general case where the gradient estimates ∇̂C(x) are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
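Below is a minimal sketch of this subsampled estimator in action (our own illustration: the least-squares cost, x_true, the batch size K, and the step-size schedule are all hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 1000, 32
x_true = np.array([1.0, -2.0])
data = rng.normal(size=(N, 2))
targets = data @ x_true + 0.1 * rng.normal(size=N)

def grad_estimate(x):
    # (1/K) Σ_{k∈I_K} f_k(x), with f_k the gradient of ½(⟨a_k, x⟩ − b_k)²
    idx = rng.choice(N, size=K, replace=False)
    residuals = data[idx] @ x - targets[idx]
    return (residuals[:, None] * data[idx]).mean(axis=0)

x = np.zeros(2)
for n in range(1, 5001):
    x = x - (1.0 / (10 + n)) * grad_estimate(x)   # s_{n+1} = s_n − ε_n H_n(s_n)
print(x)   # ≈ [1.0, -2.0] = x_true, the minimiser of the averaged cost
```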

Measure-theory-free martingale theory

Let (X_n)_{n≥0} be a stochastic process.

F_n denotes "the information describing the stochastic process up to time n", denoted mathematically by F_n = σ(X_m | m ≤ n).

Definition: A stochastic process (X_n)_{n=0}^∞ is a martingale³ if
- E[|X_n|] < ∞ for all n
- E[X_n | F_m] = X_m for n ≥ m

Martingale convergence theorem (Doob, 1953): If (X_n)_{n=1}^∞ is a martingale and sup_n E[|X_n|] < ∞, then X_n → X_∞ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If (X_n)_{n=1}^∞ is a positive stochastic process and

  Σ_{n=1}^∞ E[(E[X_{n+1} | F_n] − X_n) 1{E[X_{n+1}|F_n] − X_n > 0}] < ∞,

then X_n → X_∞ almost surely.

³ on a filtered probability space (Ω, F, (F_n)_{n=0}^∞, P), with F_n = σ(X_m | m ≤ n) ∀n

Stochastic Gradient Descent

Proposition: If s_{n+1} = s_n − ε_n H_n(s_n), with H_n(s_n) an unbiased estimator for ∇C(s_n), and C satisfies

1. C has a unique minimiser x*
2. ∀ε > 0, inf_{‖x − x*‖_2^2 > ε} ⟨x − x*, ∇C(x)⟩ > 0
3. E[‖H_n(x)‖_2^2] ≤ A + B‖x − x*‖_2^2 for some A, B ≥ 0 independent of n

then, subject to Σ_n ε_n = ∞ and Σ_n ε_n^2 < ∞, we have s_n → x*.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

Stochastic Gradient Descent

Proof sketch:
1. Define the Lyapunov process h_n = ‖s_n − x*‖_2^2
2. Consider the variations h_{n+1} − h_n
3. Show that h_n converges almost surely
4. Show that h_n must converge to 0 almost surely
5. s_n → x*

Step 2: Consider the positive variations h_n⁺ = max(0, h_{n+1} − h_n).

By exactly the same calculation as in the deterministic discrete case,

  h_{n+1} − h_n = −2ε_n ⟨s_n − x*, H_n(s_n)⟩ + ε_n^2 ‖H_n(s_n)‖_2^2

So

  E[h_{n+1} − h_n | F_n] = −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n^2 E[‖H_n(s_n)‖_2^2 | F_n]

Step 3: Show that h_n converges almost surely. Exactly the same as for discrete gradient descent:

- Assume E[‖H_n(x)‖_2^2] ≤ A + B‖x − x*‖_2^2
- Introduce μ_n = Π_{i=1}^n 1/(1 + ε_i^2 B) and h_n′ = μ_n h_n

We get

  E[h′_{n+1} − h′_n | F_n] ≤ ε_n^2 μ_n A
  ⟹ E[(h′_{n+1} − h′_n) 1{E[h′_{n+1} − h′_n | F_n] > 0} | F_n] ≤ ε_n^2 μ_n A

Σ_{n=1}^∞ ε_n^2 < ∞ ⟹ quasimartingale convergence ⟹ (h′_n)_{n=1}^∞ converges a.s. ⟹ (h_n)_{n=1}^∞ converges a.s.

Step 4: Show that h_n must converge to 0 almost surely. From the previous calculations,

  E[h_{n+1} − (1 + ε_n^2 B) h_n | F_n] ≤ −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n^2 A

(h_n)_{n=1}^∞ converges, so the left-hand side is summable a.s. Assuming Σ_{n=1}^∞ ε_n^2 < ∞, the right term is summable a.s., so the inner-product term is also summable a.s.:

  Σ_{n=1}^∞ ε_n ⟨s_n − x*, ∇C(s_n)⟩ < ∞  almost surely

If we assume in addition that Σ_{n=1}^∞ ε_n = ∞, this forces

  ⟨s_n − x*, ∇C(s_n)⟩ → 0  almost surely,

and by our initial assumptions about C, (h_n)_{n=1}^∞ must converge to 0 almost surely.

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above is to assume:

- C is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: Σ_{n=1}^∞ ε_n = ∞, Σ_{n=1}^∞ ε_n^2 < ∞
- the gradient estimate H satisfies E[‖H_n(x)‖^k] ≤ A_k + B_k‖x‖^k for k = 2, 3, 4
- there exists D > 0 such that inf_{‖x‖_2 > D} ⟨x, ∇C(x)⟩ > 0

The idea for the proof is then:

1. For a given start position, the sequence (s_n)_{n=1}^∞ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that ∇C(s_n) necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.

Motivation

So far we've told you why SGD is "safe" :)

- but Robbins-Monro is just a sufficient condition
- so how do we choose learning rates to achieve
  - fast convergence?
  - a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

Idea 1: search for the best learning rate schedule
- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time!

Motivation (cont.)

Idea 2: let the learning rates adapt themselves
- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning
- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and some practical comparisons.

Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset D
- the cost function C(w, D) = E_{x∼D}[c(w, x)]

SGD/mini-batch learning accelerates training in real time by:
- processing one/a mini-batch of datapoint(s) each iteration
- considering gradients with data-point cost functions:

  ∇C(w, D) ≈ (1/M) Σ_{m=1}^M ∇c(w, x_m)

Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):
- each iteration t, we receive a loss f_t(w) and a (noisy) (sub-)gradient g_t ∈ ∂f_t(w)
- then we update the weights with w ← w − ηg_t

For SGD, the received gradient is defined by

  g_t = (1/M) Σ_{m=1}^M ∇c(w, x_m)


Online learning

From batch to online learning (cont.)

Online learning assumes:
- each iteration t, we receive a loss f_t(w) (in the batch learning context, f_t(w) = c(w, x_t))
- then we learn w_{t+1} following some rules

Performance is measured by regret:

  R*_T = Σ_{t=1}^T f_t(w_t) − inf_{w∈S} Σ_{t=1}^T f_t(w)    (1)

General definition: R_T(u) = Σ_{t=1}^T [f_t(w_t) − f_t(u)]

SGD ⇔ "follow the regularized leader" with L2 regularization

Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):

  w_{t+1} = argmin_{w∈S} Σ_{i=1}^t f_i(w)    (2)

Lemma (upper bound on regret): Let w_1, w_2, ... be the sequence produced by FTL; then ∀u ∈ S:

  R_T(u) = Σ_{t=1}^T [f_t(w_t) − f_t(u)] ≤ Σ_{t=1}^T [f_t(w_t) − f_t(w_{t+1})]    (3)

Online learning

From batch to online learning (cont.)

A game that fools FTL: let w ∈ S = [−1, 1], and define the loss function at time t as

  f_t(w) = −0.5w   if t = 1
  f_t(w) = w       if t is even
  f_t(w) = −w      if t > 1 and t is odd

FTL is easily fooled!
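Simulating this game (a sketch of ours; w_1 = 0 is an arbitrary start) shows FTL suffering loss at every round from t = 2 onwards, i.e. regret linear in T, while the fixed comparator u = 0 accumulates zero loss:

```python
import numpy as np

# losses f_t(w) = a_t * w with a_1 = -0.5, a_t = 1 (t even), a_t = -1 (t odd > 1)
T = 1000
a = np.array([-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, T + 1)])

w, cum, ftl_loss = 0.0, 0.0, 0.0
for t in range(T):
    ftl_loss += a[t] * w       # suffer f_t(w_t)
    cum += a[t]                # cumulative slope Σ a_i
    w = -np.sign(cum)          # FTL: argmin over [-1, 1] of cum * w
print(ftl_loss)                # ≈ T − 1, so the regret vs u = 0 grows linearly
```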

Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

  w_{t+1} = argmin_{w∈S} Σ_{i=1}^t f_i(w) + φ(w)    (4)

Lemma (upper bound on regret): Let w_1, w_2, ... be the sequence produced by FTRL; then ∀u ∈ S:

  Σ_{t=1}^T [f_t(w_t) − f_t(u)] ≤ φ(u) − φ(w_1) + Σ_{t=1}^T [f_t(w_t) − f_t(w_{t+1})]

Online learning

From batch to online learning (cont.)

Let's assume the losses f_t are convex functions:

  ∀w, u ∈ S: f_t(u) − f_t(w) ≥ ⟨u − w, g(w)⟩, ∀g(w) ∈ ∂f_t(w)

Use the linearisation f̃_t(w) = f_t(w_t) + ⟨w, g_t⟩ as a surrogate:

  Σ_{t=1}^T f_t(w_t) − f_t(u) ≤ Σ_{t=1}^T ⟨w_t, g_t⟩ − ⟨u, g_t⟩, ∀g_t ∈ ∂f_t(w_t)

Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):

  w_{t+1} = w_t − ηg_t,  g_t ∈ ∂f_t(w_t)    (5)

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (it upper bounds the LHS)
- use the L2 regularizer φ(w) = (1/2η)‖w − w_1‖_2^2
- apply the FTRL regret bound to the surrogate loss:

  w_{t+1} = argmin_{w∈S} Σ_{i=1}^t ⟨w, g_i⟩ + φ(w)

Online learning

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions f_1, f_2, ..., f_T, with regularizer φ(w) = (1/2η)‖w − w_1‖_2^2. Then for all u ∈ S:

  R_T(u) ≤ (1/2η)‖u − w_1‖_2^2 + η Σ_{t=1}^T ‖g_t‖_2^2,  g_t ∈ ∂f_t(w_t)

Keep in mind for the later discussion of adaGrad:
- I want φ to be strongly convex w.r.t. a (semi-)norm ‖·‖
- then the L2 norm above is replaced by the dual norm ‖·‖*
- and we use Hölder's inequality in the proof

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms ψ_t, where ψ_t is a strongly-convex function and

  B_{ψ_t}(w, w_t) = ψ_t(w) − ψ_t(w_t) − ⟨∇ψ_t(w_t), w − w_t⟩

Primal-dual sub-gradient update:

  w_{t+1} = argmin_w η⟨w, g_t⟩ + ηφ(w) + (1/t)ψ_t(w)    (6)

Proximal gradient / composite mirror descent update:

  w_{t+1} = argmin_w η⟨w, g_t⟩ + ηφ(w) + B_{ψ_t}(w, w_t)    (7)

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

  G_t = Σ_{i=1}^t g_i g_i^T,  H_t = δI + diag(G_t)^{1/2}

  w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηφ(w) + (1/2t) w^T H_t w    (8)

  w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηφ(w) + (1/2)(w − w_t)^T H_t (w − w_t)    (9)

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:

  ψ_t(w) = (1/2) w^T H_t w

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on S = R^D with φ(w) = 0:

  w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2)(w − w_t)^T [δI + diag(G_t)^{1/2}](w − w_t)

  ⟹ w_{t+1} = w_t − η g_t / (δ + √(Σ_{i=1}^t g_i^2))    (10)

(all operations taken elementwise)
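A minimal sketch of the elementwise update (10) (not the paper's reference code; the quadratic test gradient and the hyperparameter values are our own choices):

```python
import numpy as np

def adagrad(grad, w0, eta=0.5, delta=1e-8, steps=100):
    w, G = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        g = grad(w)
        G += g * g                          # accumulate squared gradients
        w -= eta * g / (delta + np.sqrt(G)) # update (10), elementwise
    return w

print(adagrad(lambda w: w, np.array([1.5, 0.5])))
# ≈ 0: per-coordinate step sizes shrink as squared gradients accumulate
```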

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:

- I want the contribution of the dual norms ‖g_t‖*^2 induced by ψ_t to be upper bounded by ‖g_t‖_2 terms
- define ‖·‖_{ψ_t} = √⟨·, H_t ·⟩
- ψ_t(w) = (1/2)w^T H_t w is strongly convex w.r.t. ‖·‖_{ψ_t}
- a little math can show that

  Σ_{t=1}^T ‖g_t‖_{ψ_t,*}^2 ≤ 2 Σ_{d=1}^D ‖g_{1:T,d}‖_2

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence w_t is generated by adaGrad using the primal-dual update (8) with δ ≥ max_t ‖g_t‖_∞. Then for any u ∈ S,

  R_T(u) ≤ (δ/η)‖u‖_2^2 + (1/η)‖u‖_∞^2 Σ_{d=1}^D ‖g_{1:T,d}‖_2 + η Σ_{d=1}^D ‖g_{1:T,d}‖_2

For w_t generated using the composite mirror descent update (9),

  R_T(u) ≤ (1/2η) max_{t≤T} ‖u − w_t‖_2^2 Σ_{d=1}^D ‖g_{1:T,d}‖_2 + η Σ_{d=1}^D ‖g_{1:T,d}‖_2

Advanced adaptive learning rates

Improving adaGrad

Possible problems of adaGrad:
- the global learning rate η still needs hand tuning
- sensitive to initial conditions

Possible improvement directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods: RMSprop [Tieleman and Hinton, 2012], adaDelta [Zeiler, 2012], ADAM [Kingma and Ba, 2014]

Advanced adaptive learning rates

Improving adaGrad (cont.)

RMSprop (a running average of squared gradients):

  G_t = ρG_{t−1} + (1 − ρ)g_t g_t^T,  H_t = (δI + diag(G_t))^{1/2}

  w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2)(w − w_t)^T H_t (w − w_t)

  ⟹ w_{t+1} = w_t − η g_t / RMS[g]_t,  RMS[g]_t = diag(H_t)    (11)
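A corresponding sketch of update (11) (ρ, η, δ and the test gradient are illustrative choices of ours):

```python
import numpy as np

def rmsprop(grad, w0, eta=0.1, rho=0.9, delta=1e-8, steps=200):
    w, G = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        g = grad(w)
        G = rho * G + (1 - rho) * g * g    # running average of squared gradients
        w -= eta * g / np.sqrt(delta + G)  # divide by RMS[g]_t
    return w

print(rmsprop(lambda w: w, np.array([1.5, 0.5])))
# heads toward 0, then hovers at O(η) scale: the global rate still matters
```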

Advanced adaptive learning rates

Improving adaGrad (cont.)

Approximating second-order information:

  Δw ∝ H^{−1}g ∝ (∂f/∂w) / (∂²f/∂w²)
  ⟹ ∂²f/∂w² ∝ (∂f/∂w) / Δw
  ⟹ diag(H) ≈ RMS[g]_t / RMS[Δw]_{t−1}

Advanced adaptive learning rates

Improving adaGrad (cont.)

adaDelta:

  diag(H_t) = RMS[g]_t / RMS[Δw]_{t−1}

  w_{t+1} = argmin_w ⟨w, g_t⟩ + (1/2)(w − w_t)^T H_t (w − w_t)

  ⟹ w_{t+1} = w_t − (RMS[Δw]_{t−1} / RMS[g]_t) g_t    (12)
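A sketch of update (12) (ρ and δ are illustrative choices; note there is no global learning rate):

```python
import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, steps=500):
    w = w0.copy()
    Eg2, Edw2 = np.zeros_like(w0), np.zeros_like(w0)
    for _ in range(steps):
        g = grad(w)
        Eg2 = rho * Eg2 + (1 - rho) * g * g        # RMS[g]_t estimate
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
        Edw2 = rho * Edw2 + (1 - rho) * dw * dw    # RMS[Δw]_t estimate
        w += dw
    return w

print(adadelta(lambda w: w, np.array([1.5, 0.5])))
# steps are scaled by RMS[Δw]_{t−1} / RMS[g]_t, so no η needs tuning
```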

Advanced adaptive learning rates

Improving adaGrad (cont.)

ADAM (momentum plus bias-corrected running averages):

  m_t = β_1 m_{t−1} + (1 − β_1) g_t,    v_t = β_2 v_{t−1} + (1 − β_2) g_t^2

  m̂_t = m_t / (1 − β_1^t),    v̂_t = v_t / (1 − β_2^t)

  w_{t+1} = w_t − η m̂_t / (√v̂_t + δ)    (13)
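A sketch of update (13) (β_1 = 0.9 and β_2 = 0.999 follow the paper's defaults; η and the test gradient are our own choices):

```python
import numpy as np

def adam(grad, w0, eta=0.1, beta1=0.9, beta2=0.999, delta=1e-8, steps=300):
    w = w0.copy()
    m, v = np.zeros_like(w0), np.zeros_like(w0)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first-moment running average
        v = beta2 * v + (1 - beta2) * g * g    # second-moment running average
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w

print(adam(lambda w: w, np.array([1.5, 0.5])))
# approaches 0; η would still need to decay for exact convergence
```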

Advanced adaptive learning rates

Demo time!

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Advanced adaptive learning rates

Learn the learning rates

Idea 3: learn the (global) learning rates as well.

  w_{t+1} = w_t − ηg_t ⟹ w_t = w(t, η; w_1, {g_i}_{i≤t−1})

- w_t is a function of η
- learn η (with gradient descent) by

  η = argmin_η Σ_{s≤t} f_s(w(s, η))

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn η on the fly [Massé and Ollivier, 2015]

Advanced adaptive learning rates

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods
- Theoretical analysis guarantees good behaviour
- Adaptive learning rates provide good performance in practice
- Future work will reduce the labour spent on tuning hyper-parameters

Advanced adaptive learning rates

References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.

Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.

Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.

Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Massé, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 22: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition; so how do we choose learning rates to achieve

I fast convergence?
I a better local optimum?

Motivation (cont.)

Figure from [Bergstra and Bengio, 2012]

Idea 1: search for the best learning rate schedule:

I grid search, random search, cross validation
I Bayesian optimization

But I'm lazy and I don't want to spend too much time!

Motivation (cont.)

Idea 2: let the learning rates adapt themselves:

I pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
I adaGrad [Duchi et al., 2010]
I adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
I ADAM [Kingma and Ba, 2014]
I Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

I a good tutorial: [Shalev-Shwartz, 2012]
I a milestone paper: [Zinkevich, 2003]

and then some practical comparisons.

Online learning

From batch to online learning

Batch learning often assumes:

I we've got the whole dataset $\mathcal{D}$
I the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD/mini-batch learning accelerates training in real time by

I processing one datapoint / a mini-batch of datapoints each iteration
I considering gradients of the data-point cost functions:

$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

I each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
I then we update the weights with $w \leftarrow w - \eta g_t$

For SGD the received gradient is defined by

$g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$
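As a concrete picture of this protocol, here is a minimal sketch (illustrative, not from the talk) in which each round's mini-batch of a toy least-squares problem defines $f_t$, and its gradient plays the role of $g_t$:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 5))                  # toy dataset D
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])    # targets from a linear model

    w, eta, M = np.zeros(5), 0.05, 32               # weights, step size, batch size
    for t in range(2000):
        idx = rng.choice(len(X), size=M)            # mini-batch defining f_t
        g_t = X[idx].T @ (X[idx] @ w - y[idx]) / M  # (sub-)gradient of squared loss
        w = w - eta * g_t                           # OGD update: w <- w - eta g_t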

From batch to online learning (cont.)

Online learning assumes:

I each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
I then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$R_T^* = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w)$   (1)

General definition: $R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization

From batch to online learning (cont.)

Follow the Leader (FTL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w)$   (2)

Lemma (upper bound on the regret). Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then for all $u \in S$,

$R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})]$   (3)

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and take the loss function at time $t$ to be

$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ even} \\ -w & t > 1 \text{ and } t \text{ odd} \end{cases}$

FTL is easily fooled!
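A quick simulation (illustrative) makes the failure visible: FTL alternates between the endpoints of $S$ and suffers loss roughly 1 per round, while a fixed comparator pays almost nothing, so the regret grows linearly in $T$:

    import numpy as np

    def f(t, w):
        # the loss sequence from the slide above
        if t == 1:
            return -0.5 * w
        return w if t % 2 == 0 else -w

    T, S = 20, np.linspace(-1.0, 1.0, 201)    # decision set [-1, 1], discretised
    w_t, total, cum = 0.0, 0.0, np.zeros_like(S)
    for t in range(1, T + 1):
        total += f(t, w_t)                     # loss FTL actually suffers
        cum += np.array([f(t, w) for w in S])  # cumulative losses over S
        w_t = S[np.argmin(cum)]                # FTL: play the minimiser so far

    best = min(sum(f(t, u) for t in range(1, T + 1)) for u in S)
    print(total - best)                        # regret grows linearly in T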

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w)$   (4)

Lemma (upper bound on the regret). Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then for all $u \in S$,

$\sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})]$

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$\forall w, u \in S:\ f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle \quad \forall g(w) \in \partial f_t(w)$

and use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t)\rangle$ as a surrogate:

$\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t\rangle - \langle u, g_t\rangle \quad \forall g_t \in \partial f_t(w_t)$

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t)$   (5)

From FTRL to online gradient descent:

I use the linearisation as a surrogate loss (it upper-bounds the LHS)
I use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
I apply the FTRL regret bound to the surrogate loss:

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} \langle w, g_i\rangle + \varphi(w)$
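For the unconstrained linearised case this equivalence can be checked directly: the FTRL minimiser is $w_1 - \eta \sum_i g_i$, which is exactly the point reached by running OGD on the surrogate losses. A small numerical check (illustrative) follows:

    import numpy as np

    rng = np.random.default_rng(2)
    eta, t = 0.1, 5
    w1 = np.zeros(3)
    gs = rng.normal(size=(t, 3))          # past linearised gradients g_1..g_t

    # closed form of argmin_w sum_i <w, g_i> + ||w - w1||^2 / (2 eta)
    w_ftrl = w1 - eta * gs.sum(axis=0)

    # running OGD on the same linearised losses gives the same point
    w_ogd = w1.copy()
    for g in gs:
        w_ogd = w_ogd - eta * g

    print(np.allclose(w_ftrl, w_ogd))     # True: FTRL with L2 = OGD here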

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent). Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$; then for all $u \in S$,

$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$

Keep in mind for the later discussion of adaGrad:

I if we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$,
I then the L2 norm of the gradients in the bound is changed to the dual norm $\|\cdot\|_*$,
I and we use Hölder's inequality in the proof.

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$, where $\psi_t$ is a strongly-convex function and

$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle$

Primal-dual sub-gradient update:

$w_{t+1} = \arg\min_{w}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w)$   (6)

Proximal gradient / composite mirror descent:

$w_{t+1} = \arg\min_{w}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t)$   (7)

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$G_t = \sum_{i=1}^{t} g_i g_i^T, \quad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$

$w_{t+1} = \arg\min_{w \in S}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t} w^T H_t w$   (8)

$w_{t+1} = \arg\min_{w \in S}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$   (9)

To get adaGrad from online gradient descent:

I add a proximal term or a Bregman divergence to the objective
I adapt the proximal term through time:

$\psi_t(w) = \frac{1}{2} w^T H_t w$

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$w_{t+1} = \arg\min_{w}\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T [\delta I + \mathrm{diag}(G_t^{1/2})](w - w_t)$

$\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}}$   (10)

(the square root and division are taken elementwise)
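Update (10) is easy to state in code; the following is a minimal sketch (parameter values illustrative):

    import numpy as np

    def adagrad_step(w, g, G_diag, eta=0.1, delta=1e-8):
        # accumulate squared gradients: the diagonal of G_t in update (10)
        G_diag = G_diag + g ** 2
        w = w - eta * g / (delta + np.sqrt(G_diag))  # elementwise update (10)
        return w, G_diag

Each call consumes one gradient $g_t$ and returns the new iterate together with the accumulated state.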

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs time-varying proximal term:

I we want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper-bounded by $\|g_t\|_2$
I define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\,\cdot\rangle}$
I $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
I a little math can show that

$\sum_{t=1}^{T} \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2$

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper). Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$

For $w_t$ generated using the composite mirror descent update (9),

$R_T(u) \le \frac{1}{2\eta}\max_{t\le T}\|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$

Improving adaGrad

Possible problems of adaGrad:

I the global learning rate $\eta$ still needs hand tuning
I it is sensitive to initial conditions

Possible improving directions:

I use a truncated sum / running average
I use (approximate) second-order information
I add momentum
I correct the bias of the running average

Existing methods:

I RMSprop [Tieleman and Hinton, 2012]
I adaDelta [Zeiler, 2012]
I ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont.)

RMSprop (improving direction used: replace the full sum by a running average):

$G_t = \rho G_{t-1} + (1-\rho) g_t g_t^T, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$

$w_{t+1} = \arg\min_{w}\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$

$\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \quad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t)$   (11)
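A minimal sketch of update (11) (parameter values illustrative; $\rho$ is the decay rate of the running average):

    import numpy as np

    def rmsprop_step(w, g, G_diag, eta=0.001, rho=0.9, delta=1e-8):
        # exponential running average of squared gradients (diagonal of G_t)
        G_diag = rho * G_diag + (1 - rho) * g ** 2
        rms_g = np.sqrt(delta + G_diag)      # RMS[g]_t = diag(H_t)
        w = w - eta * g / rms_g              # update (11)
        return w, G_diag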

Improving adaGrad (cont.)

A second-order heuristic (improving direction used: approximate second-order information):

$\Delta w \propto H^{-1} g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2} \implies \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w} \implies \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

Improving adaGrad (cont.)

adaDelta (improving directions used: running average and approximate second-order information):

$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

$w_{t+1} = \arg\min_{w}\ \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$

$\implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t$   (12)
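A minimal sketch of update (12), following the running-average form above (parameter values illustrative):

    import numpy as np

    def adadelta_step(w, g, Eg2, Edw2, rho=0.95, delta=1e-6):
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2                      # running avg of g^2
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g    # update (12)
        Edw2 = rho * Edw2 + (1 - rho) * dw ** 2                   # running avg of dw^2
        return w + dw, Eg2, Edw2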

Improving adaGrad (cont.)

ADAM (improving directions used: running average, momentum, and bias correction of the running average):

$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$

$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$

$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}$   (13)
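A minimal sketch of update (13) (parameter values illustrative):

    import numpy as np

    def adam_step(w, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
        # exponential running averages of the gradient and its square
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + delta)   # update (13)
        return w, m, v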

Demo time!

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
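In the same spirit as the demo, here is a small illustrative run (reusing the adam_step sketch above, and with an arbitrary step size) on the Rosenbrock test function from that page:

    import numpy as np

    def grad_rosenbrock(w):
        # gradient of f(x, y) = (1 - x)^2 + 100 (y - x^2)^2
        x, y = w
        return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                         200 * (y - x ** 2)])

    w, m, v = np.array([-1.0, 1.0]), np.zeros(2), np.zeros(2)
    for t in range(1, 20001):
        w, m, v = adam_step(w, grad_rosenbrock(w), m, v, t, eta=0.01)
    print(w)   # should move toward the global minimum at (1, 1)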

Learn the learning rates

Idea 3: learn the (global) learning rates as well. Since

$w_{t+1} = w_t - \eta g_t \implies w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1})$,

$w_t$ is a function of $\eta$, so we can learn $\eta$ (with gradient descent) by

$\eta^* = \arg\min_{\eta} \sum_{s \le t} f_s(w(s, \eta))$

I reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
I stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
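As a caricature of idea 3 (illustrative only; the cited methods differentiate through the whole learning process rather than using finite differences), one can unroll training as a function of $\eta$ and descend on $\eta$ itself:

    import numpy as np

    def loss_after_training(eta, T=50):
        # unroll SGD on C(w) = 0.5 w^2 and return the final loss as a function of eta
        w = 5.0
        for t in range(T):
            w -= eta * w                     # gradient of 0.5 w^2 is w
        return 0.5 * w ** 2

    eta, lr = 0.1, 1e-3
    for step in range(100):
        # finite-difference stand-in for d(loss)/d(eta)
        d = (loss_after_training(eta + 1e-4) - loss_after_training(eta - 1e-4)) / 2e-4
        eta -= lr * d
    print(eta)                               # the global learning rate itself adapts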

Summary

I Recent successes of machine learning rely on stochastic approximation and optimisation methods.
I Theoretical analysis guarantees good behaviour.
I Adaptive learning rates provide good performance in practice.
I Future work will reduce the labour spent on tuning hyper-parameters.

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T., Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 23: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave

d

dts(t) = minusnablaC (s(t))

Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2

We can analytically solve the gradient descent equation for thisexample

nablaC (x) = (12x1 + 8x2 8x1 + 12x2)

So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with

(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)

)=

(eminus20t + 15eminus4t

eminus20t minus 05eminus4t

)11 of 27

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 24: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Exact solution of gradient descent ODE

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods, e.g.

$$s_{n+2} = s_{n+1} - \varepsilon_{n+2}\left(\frac{3}{2}\nabla C(s_{n+1}) - \frac{1}{2}\nabla C(s_n)\right)$$

This is the 2nd-order Adams-Bashforth method (applied to $\mathrm{d}s/\mathrm{d}t = -\nabla C(s)$, hence the leading minus sign).

Similar (in)stability properties to forward Euler, but a discretisation error of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).
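A sketch of the two-step scheme in the same spirit (an illustration, not the speakers' code); note it needs a single bootstrap Euler step, and the test gradient and constant step size are assumptions:

```python
def ab2_descent(grad_C, s0, eps, n_steps):
    """2nd-order Adams-Bashforth for ds/dt = -grad_C(s), bootstrapped
    with one forward-Euler step."""
    s_prev = float(s0)
    s = s_prev - eps * grad_C(s_prev)          # bootstrap step
    for _ in range(n_steps - 1):
        s_next = s - eps * (1.5 * grad_C(s) - 0.5 * grad_C(s_prev))
        s_prev, s = s, s_next
    return s

print(ab2_descent(lambda x: x, s0=2.0, eps=0.1, n_steps=100))  # -> near 0
```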

Discrete Gradient Descent

[Animated examples of the discretisation with ε = 0.1, ε = 0.07 and ε = 0.01; videos not reproduced here.]

Discrete Gradient Descent

Proposition. If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies:

1. $C$ has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$;
3. $\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
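The step-size conditions are easy to meet with $\varepsilon_n = c/n$; a minimal sketch of the resulting scheme (the quadratic test cost and the constant $c$ are assumptions for illustration):

```python
def decaying_gd(grad_C, s0, n_steps, c=0.5):
    """Gradient descent with Robbins-Monro steps eps_n = c/n:
    sum eps_n diverges while sum eps_n^2 stays finite."""
    s = float(s0)
    for n in range(1, n_steps + 1):
        s = s - (c / n) * grad_C(s)
    return s

print(decaying_gd(lambda x: x, s0=2.0, n_steps=10_000))  # slowly -> x* = 0
```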

Discrete Gradient Descent

1. Define Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^\star$

Define Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$.

Because of the error that the discretisation introduced, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. (Illustration credit: Online Learning and Stochastic Approximation, Léon Bottou, 1998.)

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

With some algebra, note that

$$\begin{aligned}
h_{n+1} - h_n &= \langle s_{n+1} - x^\star, s_{n+1} - x^\star \rangle - \langle s_n - x^\star, s_n - x^\star \rangle \\
&= \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2\,\langle s_{n+1} - s_n, x^\star \rangle \\
&= \langle s_n - \varepsilon_n \nabla C(s_n),\, s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2\varepsilon_n \langle \nabla C(s_n), x^\star \rangle \\
&= -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2
\end{aligned}$$

Show that $\sum_{n=1}^\infty h_n^+ < \infty$.

Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2$, we get

$$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + B h_n)$$

$$\implies h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$$

From the above,

$$h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le \varepsilon_n^2 A$$

Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get

$$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A$$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable.

Show that this implies that $h_n$ converges.

If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$. (Why? $h_n \ge 0$, so in total the negative variations can exceed the positive ones by at most $h_0$.)

But $h_{n+1} = h_0 + \sum_{k=1}^n \big(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\big)$.

So $(h_n)_{n=1}^\infty$ converges.

Show that $h_n$ must converge to 0.

Assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2.$$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0,$$

contradicting $h_n$ converging away from 0.


Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x)$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
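A small NumPy sketch of the subsampled estimator (the toy dataset, batch size and quadratic cost are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 5
data = rng.normal(size=(N, D))
# For C(x) = (1/N) sum_n ||x - data_n||^2 / 2 we have f_n(x) = x - data_n.

def full_grad(x):
    return x - data.mean(axis=0)

def minibatch_grad(x, K=32):
    idx = rng.choice(N, size=K, replace=False)  # I_K ~ Unif(size-K subsets)
    return x - data[idx].mean(axis=0)           # unbiased for full_grad(x)
```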

Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition. A stochastic process $(X_n)_{n=0}^\infty$ is a martingale (3) if

- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953). If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965). If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\left[(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\, \mathbb{1}_{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0}\right] < \infty,$$

then $X_n \to X_\infty$ almost surely.

(3) On a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.
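A quick numerical illustration of almost-sure convergence (a standard toy example, not from the slides): a product of i.i.d. positive mean-one factors is a positive martingale, and it converges almost surely (here to 0, since $\mathbb{E}[\log(\text{factor})] < 0$):

```python
import numpy as np

rng = np.random.default_rng(0)
factors = rng.choice([0.5, 1.5], size=10_000)  # mean 1 => martingale steps
X = np.cumprod(factors)                        # X_n = product of first n factors
print(X[-1])                                   # the a.s. limit here is 0
```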

Stochastic Gradient Descent

Proposition. If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies:

1. $C$ has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$;
3. $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
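A minimal sketch of the scheme in the proposition, with Robbins-Monro steps $\varepsilon_n = c/n$ and a toy unbiased estimator (Gaussian noise added to an exact quadratic gradient); all constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(H, s0, n_steps, c=1.0):
    """s_{n+1} = s_n - eps_n * H_n(s_n) with eps_n = c/n."""
    s = np.asarray(s0, dtype=float)
    for n in range(1, n_steps + 1):
        s = s - (c / n) * H(s)
    return s

# Unbiased noisy gradient of C(x) = ||x||^2 / 2: H(x) = x + noise.
H = lambda x: x + rng.normal(scale=0.5, size=x.shape)
print(sgd(H, s0=[3.0, -2.0], n_steps=100_000))  # close to x* = (0, 0)
```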

Stochastic Gradient Descent

1. Define Lyapunov process $h_n = \|s_n - x^\star\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^\star$

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$$

So

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2\, \mathbb{E}\left[\|H_n(s_n)\|_2^2 \,\middle|\, \mathcal{F}_n\right]$$

Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

- Assume $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^\star\|_2^2$
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

Get

$$\mathbb{E}\left[h'_{n+1} - h'_n \,\middle|\, \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$$

$$\implies \mathbb{E}\left[(h'_{n+1} - h'_n)\, \mathbb{1}_{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0} \,\middle|\, \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h'_n)_{n=1}^\infty$ converges a.s. $\implies$ $(h_n)_{n=1}^\infty$ converges a.s.

Show that $h_n$ must converge to 0 almost surely. From previous calculations,

$$\mathbb{E}\left[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \,\middle|\, \mathcal{F}_n\right] \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A$$

$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the $\varepsilon_n^2 A$ term is summable a.s., and hence the remaining term is also summable a.s.:

$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle < \infty \quad \text{almost surely.}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely,}$$

and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

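To see this behaviour numerically, one can track the Lyapunov process $h_n$ along a noisy run (same toy setup and assumptions as the earlier sketch): $h_n$ moves up on some steps, but the cumulative 'up' moves stay bounded and $h_n \to 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
s = np.array([3.0, -2.0])
h_prev, up_moves = float(s @ s), 0.0
for n in range(1, 200_000):
    g = s + rng.normal(scale=0.5, size=2)   # unbiased estimate of grad C(s)
    s = s - (1.0 / n) * g
    h = float(s @ s)
    up_moves += max(0.0, h - h_prev)        # cumulative positive variations
    h_prev = h
print(h_prev, up_moves)  # h_n is near 0; the total of the 'up' moves is bounded
```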

Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above is to assume:

- $C$ is a non-negative, three-times differentiable function
- The Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- The gradient estimate $H$ satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$
- There exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.

Motivation

So far we've told you why SGD is "safe" :)

...but Robbins-Monro is just a sufficient condition. So how do we choose learning rates to achieve:

- fast convergence?
- a better local optimum?

Motivation (cont.)

[Figure from [Bergstra and Bengio, 2012].]

Idea 1: search for the best learning rate schedule: grid search, random search, cross validation, Bayesian optimization.

But I'm lazy and I don't want to spend too much time!

Motivation (cont.)

Idea 2: let the learning rates adapt themselves.

- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

...and some practical comparisons.

Online learning

From batch to online learning

Batch learning often assumes:

- we've got the whole dataset $\mathcal{D}$
- the cost function $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD/mini-batch learning accelerates training in real time by:

- processing one/a mini-batch of data points each iteration
- considering gradients with data-point cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

- each iteration $t$, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD the received gradient is defined by

$$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$


From batch to online learning (cont.)

Online learning assumes:

- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$$R^\star_T = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \tag{1}$$

General definition: $R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

From batch to online learning (cont.)

Follow the Leader (FTL):

$$w_{t+1} = \underset{w \in S}{\mathrm{argmin}} \sum_{i=1}^t f_i(w) \tag{2}$$

Lemma (upper bound of regret). Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$:

$$R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right] \tag{3}$$

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$, and let the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\,w, & t = 1 \\ w, & t \text{ is even} \\ -w, & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled!
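A quick simulation of this game (an illustrative sketch, not from the talk): FTL keeps jumping between the endpoints of $[-1, 1]$ and pays roughly 1 per round, while the fixed comparator $u = 0$ pays 0, so the regret grows linearly in $T$:

```python
def loss_coeff(t):           # f_t(w) = coeff * w
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T, cum, w, ftl_loss = 100, 0.0, 0.0, 0.0
for t in range(1, T + 1):
    c = loss_coeff(t)
    ftl_loss += c * w        # play the current FTL iterate, then observe f_t
    cum += c
    w = -1.0 if cum > 0 else 1.0   # argmin over [-1, 1] of cum * w
print(ftl_loss)              # grows linearly in T => linear regret
```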

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \underset{w \in S}{\mathrm{argmin}} \sum_{i=1}^t f_i(w) + \varphi(w) \tag{4}$$

Lemma (upper bound of regret). Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$:

$$\sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right]$$

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)$$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$$

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \tag{5}$$

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (it upper bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \underset{w \in S}{\mathrm{argmin}} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$$

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent). Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$, with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$:

$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$$

Keep in mind for the later discussion of adaGrad:

- I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$
- then the L2 norm will be changed to the dual norm $\|\cdot\|_\star$
- and we use Hölder's inequality for the proof
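A minimal OGD sketch with regret measured against the best fixed point in hindsight (the quadratic round losses, the targets, and the value of $\eta$ are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(100, 3))      # round-t loss: f_t(w) = ||w - x_t||^2 / 2

w, eta = np.zeros(3), 0.1
ogd_loss = 0.0
for x in xs:
    ogd_loss += 0.5 * np.sum((w - x) ** 2)   # pay f_t(w_t) ...
    w = w - eta * (w - x)                    # ... then step along g_t

u = xs.mean(axis=0)                 # best fixed comparator for these losses
best_loss = sum(0.5 * np.sum((u - x) ** 2) for x in xs)
print(ogd_loss - best_loss)         # regret R_T(u) against the best fixed u
```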

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$$

Primal-dual sub-gradient update:

$$w_{t+1} = \underset{w}{\mathrm{argmin}}\; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \tag{6}$$

Proximal gradient/composite mirror descent:

$$w_{t+1} = \underset{w}{\mathrm{argmin}}\; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \tag{7}$$

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$w_{t+1} = \underset{w \in S}{\mathrm{argmin}}\; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w \tag{8}$$

$$w_{t+1} = \underset{w \in S}{\mathrm{argmin}}\; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \tag{9}$$

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2} w^T H_t w$$

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \underset{w}{\mathrm{argmin}}\; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T \left[\delta I + \mathrm{diag}(G_t^{1/2})\right] (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \tag{10}$$
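A NumPy sketch of the diagonal update (10); the toy gradient and hyper-parameter values are assumptions:

```python
import numpy as np

def adagrad(grad, w0, eta=0.5, delta=1e-8, n_steps=1000):
    """Diagonal adaGrad, update (10): per-coordinate steps
    eta / (delta + sqrt(sum of past squared gradients))."""
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)            # running sum of squared gradients
    for _ in range(n_steps):
        g = grad(w)
        G += g * g
        w = w - eta * g / (delta + np.sqrt(G))
    return w

print(adagrad(lambda w: w, w0=[2.0, -1.5]))   # -> near the minimiser 0
```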

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs time-varying proximal term:

- I want the contribution of $\|g_t\|_\star^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot \rangle}$
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^\star}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2$$

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper). Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$:

$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$

For $w_t$ generated using the composite mirror descent update (9):

$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$

Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:

- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont.)

RMSprop:

$$G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \underset{w}{\mathrm{argmin}}\; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \tag{11}$$

(Improving direction used here: a running average in place of the full sum.)
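A sketch of update (11); the hyper-parameter values are common defaults, assumed here:

```python
import numpy as np

def rmsprop(grad, w0, eta=0.01, rho=0.9, delta=1e-8, n_steps=1000):
    """RMSprop, update (11): divide the gradient by a running RMS of
    its recent magnitudes."""
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)            # running average of squared gradients
    for _ in range(n_steps):
        g = grad(w)
        G = rho * G + (1 - rho) * g * g
        w = w - eta * g / np.sqrt(delta + G)
    return w
```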

Improving adaGrad (cont.)

A heuristic second-order motivation:

$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \;\implies\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \;\implies\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

(Improving direction used here: (approximate) second-order information.)

Improving adaGrad (cont.)

adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \underset{w}{\mathrm{argmin}}\; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \tag{12}$$
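A sketch of update (12); placing the small constant $\delta$ inside both RMS terms follows Zeiler's paper, and the decay rate is an assumed default:

```python
import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, n_steps=1000):
    """adaDelta, update (12): scale steps by RMS[dw]_{t-1} / RMS[g]_t,
    removing the global learning rate eta entirely."""
    w = np.asarray(w0, dtype=float)
    Eg2 = np.zeros_like(w)      # running average of squared gradients
    Edw2 = np.zeros_like(w)     # running average of squared updates
    for _ in range(n_steps):
        g = grad(w)
        Eg2 = rho * Eg2 + (1 - rho) * g * g
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
        Edw2 = rho * Edw2 + (1 - rho) * dw * dw
        w = w + dw
    return w
```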

Improving adaGrad (cont.)

ADAM:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \tag{13}$$

(Improving directions used here: momentum, plus bias correction of the running averages.)
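A sketch of update (13) with the bias-corrected moments; the defaults from the ADAM paper are assumed:

```python
import numpy as np

def adam(grad, w0, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8,
         n_steps=1000):
    """ADAM, update (13): bias-corrected first and second moment
    estimates of the gradient."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, n_steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + delta)
    return w
```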

Demo time

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

Idea 3: learn the (global) learning rates as well.

$$w_{t+1} = w_t - \eta g_t \implies w_t = w(t, \eta; w_1, \{g_i\}_{i \le t-1})$$

$w_t$ is a function of $\eta$, so learn $\eta$ (with gradient descent) by

$$\eta = \underset{\eta}{\mathrm{argmin}} \sum_{s \le t} f_s(w(s, \eta))$$

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- Stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.

References

- Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 25: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f_1, f_2, ..., f_T with regularizer φ(w) = (1/2η)||w − w_1||²₂; then for all u ∈ S,

R_T(u) ≤ (1/2η)||u − w_1||²₂ + η ∑_{t=1}^{T} ||g_t||²₂,  g_t ∈ ∂f_t(w_t)

Keep in mind for the later discussion of adaGrad: if we want φ to be strongly convex with respect to a (semi-)norm ||·||, then the L2 norm in the bound is replaced by the dual norm ||·||*, and Hölder's inequality is used in the proof.
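A minimal sketch of update (5) with projection back onto S (assuming, purely for illustration, that S is a Euclidean ball of radius R, so the projection has a closed form):

import numpy as np

def ogd_step(w, g, eta, R):
    # Gradient step of update (5): w <- w - eta * g_t.
    w = w - eta * g
    # Project onto S = {w : ||w||_2 <= R}.
    norm = np.linalg.norm(w)
    return w if norm <= R else (R / norm) * w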

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψ_t: ψ_t is a strongly-convex function, and

B_{ψ_t}(w, w_t) = ψ_t(w) − ψ_t(w_t) − ⟨∇ψ_t(w_t), w − w_t⟩

Primal-dual sub-gradient update:

w_{t+1} = argmin_w η⟨w, g_t⟩ + ηφ(w) + (1/t) ψ_t(w)   (6)

Proximal gradient / composite mirror descent:

w_{t+1} = argmin_w η⟨w, g_t⟩ + ηφ(w) + B_{ψ_t}(w, w_t)   (7)

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices:

G_t = ∑_{i=1}^{t} g_i g_i^T,  H_t = δI + diag(G_t)^{1/2}

w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηφ(w) + (1/2t) w^T H_t w   (8)

w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηφ(w) + (1/2)(w − w_t)^T H_t (w − w_t)   (9)

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time: ψ_t(w) = (1/2) w^T H_t w

adaGrad [Duchi et al 2010] (cont)

Example: composite mirror descent on S = R^D with φ(w) = 0:

w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2)(w − w_t)^T [δI + diag(G_t)^{1/2}] (w − w_t)

⇒ w_{t+1} = w_t − η g_t / (δ + √(∑_{i=1}^{t} g_i²))   (10, elementwise)
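Update (10) as a stateful sketch (grad_fn stands in for whatever supplies g_t, possibly a stochastic estimate; the eta, delta, T values are illustrative):

import numpy as np

def adagrad(w0, grad_fn, eta=0.1, delta=1e-8, T=1000):
    # Diagonal adaGrad in the composite-mirror-descent form with phi(w) = 0.
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)                       # diag(G_t): running sum of g_i^2
    for _ in range(T):
        g = grad_fn(w)
        G += g * g
        w -= eta * g / (delta + np.sqrt(G))    # per-coordinate step sizes
    return w

Coordinates that have seen large gradients accumulate a large G and take small steps; rarely-updated coordinates keep large steps.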

adaGrad [Duchi et al 2010] (cont)

WHY? Time-varying learning rate vs time-varying proximal term:
- we want the contribution of ||g_t||²_* induced by ψ_t to be upper bounded by ||g_t||₂
- define ||·||_{ψ_t} = √⟨·, H_t ·⟩; then ψ_t(w) = (1/2) w^T H_t w is strongly convex w.r.t. ||·||_{ψ_t}
- a little math can show that

∑_{t=1}^{T} ||g_t||²_{ψ_t,*} ≤ 2 ∑_{d=1}^{D} ||g_{1:T,d}||₂

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)
Assume the sequence {w_t} is generated by adaGrad using the primal-dual update (8) with δ ≥ max_t ||g_t||_∞; then for any u ∈ S,

R_T(u) ≤ (δ/η)||u||²₂ + (1/η)||u||²_∞ ∑_{d=1}^{D} ||g_{1:T,d}||₂ + η ∑_{d=1}^{D} ||g_{1:T,d}||₂

For {w_t} generated using the composite mirror descent update (9),

R_T(u) ≤ (1/2η) max_{t≤T} ||u − w_t||²₂ ∑_{d=1}^{D} ||g_{1:T,d}||₂ + η ∑_{d=1}^{D} ||g_{1:T,d}||₂

Improving adaGrad

Possible problems of adaGrad:
- the global learning rate η still needs hand tuning
- sensitive to initial conditions

Possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods: RMSprop [Tieleman and Hinton 2012], adaDelta [Zeiler 2012], ADAM [Kingma and Ba 2014]

Improving adaGrad (cont)

RMSprop:

G_t = ρG_{t−1} + (1 − ρ) g_t g_t^T,  H_t = (δI + diag(G_t))^{1/2}

w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2)(w − w_t)^T H_t (w − w_t)

⇒ w_{t+1} = w_t − η g_t / RMS[g]_t,  RMS[g]_t = diag(H_t)   (11)

(improving direction used: a running average in place of the full sum)
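Update (11) is one line of state plus one step (a sketch; the rho, eta, delta defaults are common choices, not values fixed by the slide):

import numpy as np

def rmsprop_step(w, g, G, eta=1e-3, rho=0.9, delta=1e-8):
    # G: running average of squared gradients, i.e. the diagonal of G_t.
    G = rho * G + (1 - rho) * g * g
    w = w - eta * g / np.sqrt(delta + G)   # divide by RMS[g]_t = sqrt(delta + G)
    return w, G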

Improving adaGrad (cont)

Δw ∝ H⁻¹g ∝ (∂f/∂w) / (∂²f/∂w²)

⇒ ∂²f/∂w² ∝ (∂f/∂w) / Δw

⇒ diag(H) ≈ RMS[g]_t / RMS[Δw]_{t−1}

(improving direction used: approximate second-order information)

Improving adaGrad (cont)

adaDelta:

diag(H_t) = RMS[g]_t / RMS[Δw]_{t−1}

w_{t+1} = argmin_w ⟨w, g_t⟩ + (1/2)(w − w_t)^T H_t (w − w_t)

⇒ w_{t+1} = w_t − (RMS[Δw]_{t−1} / RMS[g]_t) g_t   (12)
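Update (12) in the same style (a sketch; rho = 0.95 and delta = 1e-6 are illustrative defaults, not values fixed by the slide):

import numpy as np

def adadelta_step(w, g, Eg2, Edw2, rho=0.95, delta=1e-6):
    Eg2 = rho * Eg2 + (1 - rho) * g * g                        # RMS[g]_t state
    dw = -(np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta)) * g   # update (12)
    Edw2 = rho * Edw2 + (1 - rho) * dw * dw                    # RMS[dw]_t state
    return w + dw, Eg2, Edw2

Note that no global learning rate η appears: the RMS[Δw] numerator supplies the scale instead.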

Improving adaGrad (cont)

ADAM:

m_t = β₁m_{t−1} + (1 − β₁)g_t,  v_t = β₂v_{t−1} + (1 − β₂)g_t²

m̂_t = m_t / (1 − β₁ᵗ),  v̂_t = v_t / (1 − β₂ᵗ)

w_{t+1} = w_t − η m̂_t / (√v̂_t + δ)   (13)

(improving directions used: momentum, plus bias correction of the running averages)
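Update (13) as a sketch (the beta1, beta2, delta defaults follow common practice; the slide leaves them unspecified):

import numpy as np

def adam(w0, grad_fn, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8, T=1000):
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)                  # running average of gradients (momentum)
    v = np.zeros_like(w)                  # running average of squared gradients
    for t in range(1, T + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)      # bias correction of the running averages
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w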

Demo time

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

idea 3: learn the (global) learning rates as well

w_{t+1} = w_t − ηg_t ⇒ w_t = w(t, η; w_1, {g_i}_{i≤t−1})

i.e., w_t is a function of η; learn η (with gradient descent) by

η = argmin_η ∑_{s≤t} f_s(w(s, η))

- reverse-mode differentiation to back-track the learning process [Maclaurin et al 2015]
- stochastic gradient descent to learn η on the fly [Masse and Ollivier 2015]

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.

References

- Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T., Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.


Continuous Gradient Descent

We'll start with some simplifying assumptions:

1. C has a unique minimiser x*
2. ∀ε > 0: inf_{||x − x*||₂ > ε} ⟨x − x*, ∇C(x)⟩ > 0

(The second condition is weaker than convexity.)

Proposition: If s : [0, ∞) → X satisfies the differential equation

ds(t)/dt = −∇C(s(t)),

then s(t) → x* as t → ∞.

[Figure: example of a non-convex function satisfying these conditions]

Proof outline:

1. Define the Lyapunov function h(t) = ||s(t) − x*||²₂
2. h(t) is a decreasing function
3. h(t) converges to a limit, and h′(t) converges to 0
4. The limit of h(t) must be 0
5. s(t) → x*

Step 2: h(t) is a decreasing function:

(d/dt) h(t) = (d/dt) ||s(t) − x*||²₂
            = (d/dt) ⟨s(t) − x*, s(t) − x*⟩
            = 2 ⟨ds(t)/dt, s(t) − x*⟩
            = −2 ⟨s(t) − x*, ∇C(s(t))⟩ ≤ 0

Step 4: the limit of h(t) must be 0:

If not, there is some K > 0 such that h(t) > K for all t. By assumption,

inf_{x : ||x − x*||²₂ > K} ⟨x − x*, ∇C(x)⟩ > 0,

so h′(t) is bounded above by a strictly negative constant, contradicting the convergence of h′(t) to 0.

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!

Discrete Gradient Descent

An intuitive, straightforward discretisation of

(d/dt) s(t) = −∇C(s(t))

is

(s_{n+1} − s_n) / ε_n = −∇C(s_n)   (Forward Euler)

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method, which is (A-)stable, is the implicit Euler method:

(s_{n+1} − s_n) / ε_n = −∇C(s_{n+1})

So why settle for explicit discretisation? Implicit requires the solution of a non-linear system at each step, and practically, explicit discretisation works.

Also have multistep methods, e.g.

s_{n+2} = s_{n+1} − ε_{n+2} ((3/2)∇C(s_{n+1}) − (1/2)∇C(s_n))

This is the 2nd-order Adams-Bashforth method. It has similar (in)stability properties to forward Euler, but a discretisation error of a smaller order of magnitude at each step (O(ε³) compared with O(ε²)).
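Both discretisations in code (a sketch; grad_C stands in for ∇C, and a fixed step size eps is used for simplicity):

import numpy as np

def forward_euler(grad_C, s0, eps, N):
    # s_{n+1} = s_n - eps * grad C(s_n)
    s = np.array(s0, dtype=float)
    for _ in range(N):
        s = s - eps * grad_C(s)
    return s

def adams_bashforth2(grad_C, s0, eps, N):
    # Two-step method: combines gradients at the two most recent iterates.
    s_prev = np.array(s0, dtype=float)
    s = s_prev - eps * grad_C(s_prev)   # bootstrap the first step with Euler
    for _ in range(N - 1):
        s, s_prev = s - eps * (1.5 * grad_C(s) - 0.5 * grad_C(s_prev)), s
    return s

Smaller eps tracks the gradient flow more faithfully at the cost of more iterations, matching the ε = 0.1 / 0.07 / 0.01 examples below.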

Discrete Gradient Descent: examples of discretisation (animations in the original slides) with ε = 0.1, ε = 0.07, and ε = 0.01.

Discrete Gradient Descent

Proposition: If s_{n+1} = s_n − ε_n ∇C(s_n) and C satisfies

1. C has a unique minimiser x*
2. ∀ε > 0: inf_{||x − x*||₂ > ε} ⟨x − x*, ∇C(x)⟩ > 0
3. ||∇C(x)||²₂ ≤ A + B||x − x*||²₂ for some A, B ≥ 0

then, subject to ∑_n ε_n = ∞ and ∑_n ε_n² < ∞, we have s_n → x*.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
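A toy check of the proposition (a sketch on C(s) = s²/2, so ∇C(s) = s and x* = 0, with ε_n = 1/(n+1), which satisfies both step-size conditions):

def discrete_gd(s0, N):
    s = float(s0)
    for n in range(1, N + 1):
        eps = 1.0 / (n + 1)   # sum of eps_n diverges; sum of eps_n^2 converges
        s -= eps * s          # the product telescopes: s_N = s0 / (N + 1)
    return s                  # tends to x* = 0 as N grows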

Proof outline:

1. Define the Lyapunov sequence h_n = ||s_n − x*||²₂
2. Consider the positive variations h⁺_n = max(0, h_{n+1} − h_n)
3. Show that ∑_{n=1}^∞ h⁺_n < ∞
4. Show that this implies that h_n converges
5. Show that h_n must converge to 0
6. s_n → x*

Because of the error that the discretisation introduced, h_n is not guaranteed to be decreasing; we have to adjust the proof.

Idea: h_n may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of h_n. (Illustration credit: Online Learning and Stochastic Approximation, Léon Bottou, 1998.)

Step 2: consider the positive variations h⁺_n = max(0, h_{n+1} − h_n). With some algebra, note that

h_{n+1} − h_n = ⟨s_{n+1} − x*, s_{n+1} − x*⟩ − ⟨s_n − x*, s_n − x*⟩
             = ⟨s_{n+1}, s_{n+1}⟩ − ⟨s_n, s_n⟩ − 2⟨s_{n+1} − s_n, x*⟩
             = ⟨s_n − ε_n∇C(s_n), s_n − ε_n∇C(s_n)⟩ − ⟨s_n, s_n⟩ + 2ε_n⟨∇C(s_n), x*⟩
             = −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²||∇C(s_n)||²₂

Step 3: show that ∑_{n=1}^∞ h⁺_n < ∞. Assuming ||∇C(x)||²₂ ≤ A + B||x − x*||²₂, we get

h_{n+1} − h_n ≤ −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²(A + Bh_n)

⇒ h_{n+1} − (1 + ε_n²B)h_n ≤ −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²A ≤ ε_n²A

Thus

h_{n+1} − (1 + ε_n²B)h_n ≤ ε_n²A

Writing μ_n = ∏_{i=1}^{n} 1/(1 + ε_i²B) and h′_n = μ_n h_n, we get

h′_{n+1} − h′_n ≤ ε_n² μ_n A

Assuming ∑_{n=1}^∞ ε_n² < ∞, μ_n converges away from 0, so the RHS is summable.

Step 4: show that this implies that h_n converges. If ∑_{n=1}^∞ max(0, h_{n+1} − h_n) < ∞, then ∑_{n=1}^∞ min(0, h_{n+1} − h_n) > −∞ (why? because h_n ≥ 0, the downward movements cannot exceed h_0 plus the total upward movement).

But h_{n+1} = h_0 + ∑_{k=1}^{n} (max(0, h_{k+1} − h_k) + min(0, h_{k+1} − h_k)).

So (h_n)_{n=1}^∞ converges.

Step 5: show that h_n must converge to 0. Assume h_n converges to some positive number. Previously we had

h_{n+1} − h_n = −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²||∇C(s_n)||²₂

This is summable, and if we assume further that ∑_{n=1}^∞ ε_n = ∞, then we get

⟨s_n − x*, ∇C(s_n)⟩ → 0,

contradicting h_n converging away from 0.

Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm. If C is a cost function averaged across a data set, we often have a true gradient of the form

∇C(x) = (1/N) ∑_{n=1}^{N} f_n(x)

and an approximation is formed by subsampling:

∇̂C(x) = (1/K) ∑_{k∈I_K} f_k(x),   I_K ∼ Unif(subsets of size K)

We'll treat the more general case where the gradient estimates ∇̂C(x) are unbiased and independent. The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

Measure-theory-free martingale theory

Let (X_n)_{n≥0} be a stochastic process. F_n denotes "the information describing the stochastic process up to time n", denoted mathematically by F_n = σ(X_m | m ≤ n).

Definition: A stochastic process (X_n)_{n=0}^∞ is a martingale³ if
- E[|X_n|] < ∞ for all n
- E[X_n | F_m] = X_m for n ≥ m

Martingale convergence theorem (Doob, 1953): If (X_n)_{n=1}^∞ is a martingale and sup_n E[|X_n|] < ∞, then X_n → X_∞ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If (X_n)_{n=1}^∞ is a positive stochastic process and

∑_{n=1}^∞ E[(E[X_{n+1} | F_n] − X_n) 1{E[X_{n+1} | F_n] − X_n > 0}] < ∞,

then X_n → X_∞ almost surely.

³ on a filtered probability space (Ω, F, (F_n)_{n=0}^∞, P), with F_n = σ(X_m | m ≤ n) ∀n

Stochastic Gradient Descent

Proposition: If s_{n+1} = s_n − ε_n H_n(s_n), with H_n(s_n) an unbiased estimator for ∇C(s_n), and C satisfies

1. C has a unique minimiser x*
2. ∀ε > 0: inf_{||x − x*||₂ > ε} ⟨x − x*, ∇C(x)⟩ > 0
3. E[||H_n(x)||²₂] ≤ A + B||x − x*||²₂ for some A, B ≥ 0 independent of n

then, subject to ∑_n ε_n = ∞ and ∑_n ε_n² < ∞, we have s_n → x*.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
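The same toy problem as before, now with an unbiased noisy gradient (a sketch; Gaussian noise is one convenient choice, and it satisfies condition 3 with A = σ² and B = 1):

import numpy as np

def sgd_toy(s0, N, noise_std=1.0, seed=0):
    # H_n(s) = s + xi_n with xi_n ~ N(0, noise_std^2), so E[H_n(s)] = grad C(s) = s.
    rng = np.random.default_rng(seed)
    s = float(s0)
    for n in range(1, N + 1):
        eps = 1.0 / (n + 1)   # Robbins-Monro step sizes
        s -= eps * (s + noise_std * rng.standard_normal())
    return s   # converges to x* = 0 almost surely as N grows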

Proof outline:

1. Define the Lyapunov process h_n = ||s_n − x*||²₂
2. Consider the variations h_{n+1} − h_n
3. Show that h_n converges almost surely
4. Show that h_n must converge to 0 almost surely
5. s_n → x*

Step 2: consider the positive variations h⁺_n = max(0, h_{n+1} − h_n). By exactly the same calculation as in the deterministic discrete case,

h_{n+1} − h_n = −2ε_n⟨s_n − x*, H_n(s_n)⟩ + ε_n²||H_n(s_n)||²₂

So

E[h_{n+1} − h_n | F_n] = −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n² E[||H_n(s_n)||²₂ | F_n]

Step 3: show that h_n converges almost surely. Exactly as for discrete gradient descent:
- assume E[||H_n(x)||²₂] ≤ A + B||x − x*||²₂
- introduce μ_n = ∏_{i=1}^{n} 1/(1 + ε_i²B) and h′_n = μ_n h_n

We get

E[h′_{n+1} − h′_n | F_n] ≤ ε_n² μ_n A

⇒ E[(h′_{n+1} − h′_n) 1{E[h′_{n+1} − h′_n | F_n] > 0} | F_n] ≤ ε_n² μ_n A

∑_{n=1}^∞ ε_n² < ∞ ⇒ quasimartingale convergence ⇒ (h′_n)_{n=1}^∞ converges a.s. ⇒ (h_n)_{n=1}^∞ converges a.s.

Step 4: show that h_n must converge to 0 almost surely. From previous calculations,

E[h_{n+1} − (1 + ε_n²B)h_n | F_n] ≤ −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²A

(h_n)_{n=1}^∞ converges, so the left-hand sequence is summable a.s. Assuming ∑_{n=1}^∞ ε_n² < ∞, the right term is summable a.s., so the remaining term is also summable a.s.:

∑_{n=1}^∞ ε_n⟨s_n − x*, ∇C(s_n)⟩ < ∞ almost surely

If we assume in addition that ∑_{n=1}^∞ ε_n = ∞, this forces

⟨s_n − x*, ∇C(s_n)⟩ → 0 almost surely,

and by our initial assumptions about C, (h_n)_{n=1}^∞ must converge to 0 almost surely.

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- C is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: ∑_{n=1}^∞ ε_n = ∞, ∑_{n=1}^∞ ε_n² < ∞
- the gradient estimate H satisfies E[||H_n(x)||^k] ≤ A_k + B_k||x||^k for k = 2, 3, 4
- there exists D > 0 such that inf_{||x||₂ > D} ⟨x, ∇C(x)⟩ > 0

The idea for the proof is then:

1. For a given start position, the sequence (s_n)_{n=1}^∞ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that ∇C(s_n) necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.

Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. Then how to choose learning rates to achieve
- fast convergence
- a better local optimum?

Motivation (cont)

[Figure from [Bergstra and Bengio 2012]]

idea 1: search for the best learning rate schedule
- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time on it.

Motivation (cont)

idea 2: let the learning rates adapt themselves
- pre-conditioning with |diag(H)| [Becker and LeCun 1988]
- adaGrad [Duchi et al 2010]
- adaDelta [Zeiler 2012], RMSprop [Tieleman and Hinton 2012]
- ADAM [Kingma and Ba 2014]
- Equilibrated SGD [Dauphin et al 2015] (this year's NIPS)

Now we'll talk a bit about online learning:
- a good tutorial [Shalev-Shwartz 2012]
- milestone paper [Zinkevich 2003]

and some practical comparisons.


Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

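To make the subsampled estimator concrete, here is a small sketch (the least-squares cost and all the names below are illustrative assumptions, not from the slides):

```python
import numpy as np

# Toy cost C(x) = (1/N) sum_n f_n(x) with f_n(x) = 0.5 * (a_n^T x - b_n)^2,
# so each gradient term is a_n * (a_n^T x - b_n).

rng = np.random.default_rng(0)
N, D, K = 1000, 5, 32
A, b = rng.normal(size=(N, D)), rng.normal(size=N)

def grad_full(x):
    return A.T @ (A @ x - b) / N                  # true gradient

def grad_minibatch(x):
    idx = rng.choice(N, size=K, replace=False)    # I_K ~ Unif(subsets of size K)
    return A[idx].T @ (A[idx] @ x - b[idx]) / K   # unbiased estimate

x = np.zeros(D)
print(grad_full(x))        # the estimate below fluctuates around this
print(grad_minibatch(x))
```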

Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process. $\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \,|\, m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^{\infty}$ is a martingale³ if

I $\mathbb{E}[|X_n|] < \infty$ for all $n$
I $\mathbb{E}[X_n \,|\, \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^{\infty}$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^{\infty}$ is a positive stochastic process and
$$\sum_{n=1}^{\infty} \mathbb{E}\left[ \left( \mathbb{E}[X_{n+1} \,|\, \mathcal{F}_n] - X_n \right) \mathbb{1}_{\{\mathbb{E}[X_{n+1}|\mathcal{F}_n] - X_n > 0\}} \right] < \infty,$$
then $X_n \to X_\infty$ almost surely.

³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^{\infty}, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m | m \le n)\ \forall n$
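As an aside, a tiny simulation of the martingale convergence theorem (the multiplicative random walk is an illustrative assumption, not from the slides):

```python
import numpy as np

# X_{n+1} = X_n * exp(Z_n - 1/2) with Z_n ~ N(0,1) i.i.d. is a positive
# martingale (E[exp(Z)] = exp(1/2)), and sup_n E[|X_n|] = X_0 < inf,
# so Doob's theorem gives almost-sure convergence (here, to 0).

rng = np.random.default_rng(7)
for path in range(3):
    X = 1.0
    for n in range(10000):
        X *= np.exp(rng.normal() - 0.5)
    print(X)  # each sample path has essentially converged, to ~0
```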


Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$
3. $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$,

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

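A minimal simulation of the proposition (the cost, noise model, and step schedule are illustrative assumptions):

```python
import numpy as np

# SGD update s_{n+1} = s_n - eps_n * H_n(s_n) on C(x) = 0.5*||x - x_star||^2,
# with H_n(x) = grad C(x) + Gaussian noise (unbiased, independent).

rng = np.random.default_rng(1)
x_star = np.array([2.0, -1.0])

def H(x):
    return (x - x_star) + rng.normal(size=x.shape)

s = np.zeros(2)
for n in range(1, 100001):
    eps = 1.0 / n          # sum eps_n = inf, sum eps_n^2 < inf
    s = s - eps * H(s)

print(np.linalg.norm(s - x_star))  # small: s_n -> x_star almost surely
```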

Stochastic Gradient Descent

1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
By exactly the same calculation as in the deterministic discrete case,
$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$$
So
$$\mathbb{E}[h_{n+1} - h_n \,|\, \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2\, \mathbb{E}\left[\|H_n(s_n)\|_2^2 \,\middle|\, \mathcal{F}_n\right]$$

Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

I Assume $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^*\|_2^2$
I Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

We get
$$\mathbb{E}\left[h'_{n+1} - h'_n \,\middle|\, \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A
\ \implies\ \mathbb{E}\left[(h'_{n+1} - h'_n)\, \mathbb{1}_{\{\mathbb{E}[h'_{n+1} - h'_n | \mathcal{F}_n] > 0\}} \,\middle|\, \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$$
$\sum_{n=1}^\infty \varepsilon_n^2 < \infty \implies$ quasimartingale convergence $\implies (h'_n)_{n=1}^\infty$ converges a.s. $\implies (h_n)_{n=1}^\infty$ converges a.s.

Show that $h_n$ must converge to 0 almost surely. From previous calculations,
$$\mathbb{E}\left[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \,\middle|\, \mathcal{F}_n\right] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A$$
$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable a.s. Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, the right term is summable a.s., so the remaining term is also summable a.s.:
$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty \quad \text{almost surely}$$
If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces
$$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely,}$$
and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.


Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

I $C$ is a non-negative, three-times differentiable function
I the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
I the gradient estimate $H$ satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$
I there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.


Adaptive learning rates (Yingzhen Li, University of Cambridge)

Motivation

So far we've told you why SGD is "safe" :)
but Robbins-Monro is just a sufficient condition...
then how do we choose learning rates to achieve

I fast convergence?
I a better local optimum?

Motivation (cont)

[Figure from [Bergstra and Bengio, 2012]: grid search vs random search over hyper-parameters]

idea 1: search for the best learning rate schedule
I grid search, random search, cross validation
I Bayesian optimization

But I'm lazy and I don't want to spend too much time...

Motivation (cont)

idea 2: let the learning rates adapt themselves
I pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
I adaGrad [Duchi et al., 2010]
I adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
I ADAM [Kingma and Ba, 2014]
I Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning
I a good tutorial: [Shalev-Shwartz, 2012]
I milestone paper: [Zinkevich, 2003]

and some practical comparisons.

Online learning — From batch to online learning

Batch learning often assumes:
I we've got the whole dataset $\mathcal{D}$
I the cost function $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by
I processing one / a mini-batch of datapoints each iteration
I considering gradients with data-point cost functions:
$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$$

From batch to online learning (cont)

Let's forget the cost function on the batch for a moment.

online gradient descent (OGD):
I each iteration $t$, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
I then we update the weights with $w \leftarrow w - \eta g_t$

for SGD, the received gradient is defined by
$$g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$$
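A minimal OGD loop (the quadratic per-round losses are an illustrative assumption, not from the slides):

```python
import numpy as np

# Each round reveals f_t(w) = 0.5 * (w - z_t)^2 through its gradient.
rng = np.random.default_rng(2)
eta, w = 0.1, 0.0
for t in range(1000):
    z_t = rng.normal(loc=1.0)  # data point behind this round's loss
    g_t = w - z_t              # (sub-)gradient of f_t at the current w
    w = w - eta * g_t          # OGD update

print(w)  # hovers near 1.0, the minimiser of the average loss
```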


From batch to online learning (cont)

Online learning assumes:
I each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
I then we learn $w_{t+1}$ following some rules

Performance is measured by regret:
$$R^*_T = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w) \qquad (1)$$
General definition: $R_T(u) = \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(u) \right]$

SGD ⇔ "follow the regularized leader" with L2 regularization

From batch to online learning (cont)

Follow the Leader (FTL):
$$w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^{t} f_i(w) \qquad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL. Then $\forall u \in S$:
$$R_T(u) = \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(u) \right] \le \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(w_{t+1}) \right] \qquad (3)$$

From batch to online learning (cont)

A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be
$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$
FTL is easily fooled! (A numerical check follows below.)
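A quick simulation of the game above (a sketch; the helper `ftl_play` is hypothetical, not from the slides):

```python
# FTL on S = [-1, 1] against the adversarial losses f_t(w) = c_t * w with
# c_1 = -0.5, c_t = 1 (t even), c_t = -1 (t > 1 odd). FTL always plays the
# endpoint minimising the cumulative past loss, and keeps getting burned.

def ftl_play(T):
    coeffs, total_loss, w = [], 0.0, 0.0   # arbitrary first play
    for t in range(1, T + 1):
        c = -0.5 if t == 1 else (1.0 if t % 2 == 0 else -1.0)
        total_loss += c * w                # suffer f_t(w_t)
        coeffs.append(c)
        s = sum(coeffs)                    # cumulative loss is w -> s * w
        w = -1.0 if s > 0 else 1.0         # FTL: argmin over [-1, 1]
    best_fixed = min(sum(coeffs) * u for u in (-1.0, 1.0))
    return total_loss - best_fixed         # regret against best fixed w

print(ftl_play(100), ftl_play(1000))       # grows linearly in T
```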

From batch to online learning (cont)

Follow the Regularized Leader (FTRL):
$$w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL. Then $\forall u \in S$:
$$\sum_{t=1}^{T} \left[ f_t(w_t) - f_t(u) \right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(w_{t+1}) \right]$$

From batch to online learning (cont)

Let's assume the losses $f_t$ are convex functions:
$$\forall w, u \in S,\ f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle \quad \forall g(w) \in \partial f_t(w)$$
Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle$ as a surrogate:
$$\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t \rangle - \langle u, g_t \rangle \quad \forall g_t \in \partial f_t(w_t)$$

From batch to online learning (cont)

Online Gradient Descent (OGD):
$$w_{t+1} = w_t - \eta g_t, \qquad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:
I use the linearisation as a surrogate loss (it upper bounds the LHS)
I use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
I apply the FTRL regret bound to the surrogate loss:
$$w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^{t} \langle w, g_i \rangle + \varphi(w)$$

From batch to online learning (cont)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$. Then for all $u \in S$:
$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t)$$

keep in mind for the later discussion of adaGrad:
I if we want $\varphi$ to be strongly convex w.r.t. a general (semi-)norm $\|\cdot\|$,
I then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_*$,
I and we use Hölder's inequality in the proof.

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and
$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$$
Primal-dual sub-gradient update:
$$w_{t+1} = \operatorname*{argmin}_{w}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t}\psi_t(w) \qquad (6)$$
Proximal gradient / composite mirror descent:
$$w_{t+1} = \operatorname*{argmin}_{w}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$

adaGrad [Duchi et al., 2010] (cont)

adaGrad with diagonal matrices:
$$G_t = \sum_{i=1}^{t} g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$$
$$w_{t+1} = \operatorname*{argmin}_{w \in S}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t}\, w^T H_t w \qquad (8)$$
$$w_{t+1} = \operatorname*{argmin}_{w \in S}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:
I add a proximal term or a Bregman divergence to the objective
I adapt the proximal term through time:
$$\psi_t(w) = \frac{1}{2} w^T H_t w$$

adaGrad [Duchi et al., 2010] (cont)

example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:
$$w_{t+1} = \operatorname*{argmin}_{w}\ \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T \left[ \delta I + \mathrm{diag}(G_t)^{1/2} \right] (w - w_t)$$
$$\Rightarrow\ w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}} \quad \text{(elementwise)} \qquad (10)$$
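A sketch of the per-coordinate update (10); the toy quadratic objective and the constants are illustrative assumptions:

```python
import numpy as np

# Diagonal adaGrad on f(w) = 0.5 * ||w - w_star||^2 with noisy gradients.
rng = np.random.default_rng(3)
w_star = np.array([1.0, -2.0, 0.5])
eta, delta = 0.5, 1e-8

w = np.zeros(3)
G_diag = np.zeros(3)                     # running sum of squared gradients
for t in range(2000):
    g = (w - w_star) + rng.normal(scale=0.1, size=3)
    G_diag += g * g
    w -= eta * g / (delta + np.sqrt(G_diag))  # per-coordinate step size

print(w)  # close to w_star
```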

adaGrad [Duchi et al., 2010] (cont)

WHY? Time-varying learning rate vs time-varying proximal term:
I we want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$
I define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot,\, H_t\, \cdot \rangle}$
I $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
I a little math can show that
$$\sum_{t=1}^{T} \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2$$

adaGrad [Duchi et al., 2010] (cont)

Theorem (Thm. 5 in the paper): Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$. Then for any $u \in S$:
$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$$
For $\{w_t\}$ generated using the composite mirror descent update (9):
$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$$

Improving adaGrad

possible problems of adaGrad:
I the global learning rate $\eta$ still needs hand tuning
I sensitive to initial conditions

possible improving directions:
I use a truncated sum / running average
I use (approximate) second-order information
I add momentum
I correct the bias of the running average

existing methods: RMSprop [Tieleman and Hinton, 2012], adaDelta [Zeiler, 2012], ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont)

RMSprop (uses a running average instead of the full sum):
$$G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^T, \qquad H_t = \left( \delta I + \mathrm{diag}(G_t) \right)^{1/2}$$
$$w_{t+1} = \operatorname*{argmin}_{w}\ \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$
$$\Rightarrow\ w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$$
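The same toy problem with the RMSprop step (11); the values of $\rho$ and $\eta$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
w_star = np.array([1.0, -2.0, 0.5])
eta, rho, delta = 0.01, 0.9, 1e-8

w, G_diag = np.zeros(3), np.zeros(3)
for t in range(5000):
    g = (w - w_star) + rng.normal(scale=0.1, size=3)
    G_diag = rho * G_diag + (1 - rho) * g * g   # running average, not a sum
    w -= eta * g / np.sqrt(delta + G_diag)      # divide by RMS[g]_t

print(w)  # close to w_star
```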

Improving adaGrad (cont)

Using (approximate) second-order information: a Newton-like step satisfies
$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}
\ \Rightarrow\ \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}
\ \Rightarrow\ \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

Improving adaGrad (cont)

adaDelta (also removes the hand-tuned global learning rate):
$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$
$$w_{t+1} = \operatorname*{argmin}_{w}\ \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$
$$\Rightarrow\ w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12)$$
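A sketch of the adaDelta step (12); note there is no global $\eta$ to tune ($\rho$ and the toy problem are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
w_star = np.array([1.0, -2.0, 0.5])
rho, delta = 0.95, 1e-6

w = np.zeros(3)
Eg2, Edw2 = np.zeros(3), np.zeros(3)  # running averages of g^2 and (dw)^2
for t in range(20000):
    g = (w - w_star) + rng.normal(scale=0.1, size=3)
    Eg2 = rho * Eg2 + (1 - rho) * g * g
    dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g  # RMS ratio step
    Edw2 = rho * Edw2 + (1 - rho) * dw * dw
    w += dw

print(w)  # approaches w_star without any global learning rate
```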

Improving adaGrad (cont)

ADAM (adds momentum and corrects the bias of the running averages):
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13)$$
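A sketch of the ADAM step (13) with bias-corrected moments (the hyper-parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
w_star = np.array([1.0, -2.0, 0.5])
eta, beta1, beta2, delta = 0.05, 0.9, 0.999, 1e-8

w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 5001):
    g = (w - w_star) + rng.normal(scale=0.1, size=3)
    m = beta1 * m + (1 - beta1) * g         # first-moment running average
    v = beta2 * v + (1 - beta2) * g * g     # second-moment running average
    m_hat = m / (1 - beta1 ** t)            # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + delta)

print(w)  # close to w_star
```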

Demo time

wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

idea 3: learn the (global) learning rates as well
$$w_{t+1} = w_t - \eta g_t \ \Rightarrow\ w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1})$$
I $w_t$ is a function of $\eta$
I learn $\eta$ (with gradient descent) by
$$\eta^* = \operatorname*{argmin}_{\eta} \sum_{s \le t} f_s(w(s, \eta))$$
I reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
I stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]

(A crude sketch of the idea follows below.)
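A crude one-step sketch of the idea, truncating the dependence of $w_t$ on $\eta$ to the most recent update (the truncation and the toy problem are illustrative assumptions, not the method of either cited paper):

```python
import numpy as np

# With w_t = w_{t-1} - eta * g_{t-1}, a one-step truncation gives
# d f_t(w_t)/d eta ~= -<g_t, g_{t-1}>, so we can take gradient steps
# on eta at the same time as on w.

rng = np.random.default_rng(8)
w_star = np.array([1.0, -2.0, 0.5])
eta, alpha = 0.01, 1e-4
w, g_prev = np.zeros(3), np.zeros(3)
for t in range(5000):
    g = (w - w_star) + rng.normal(scale=0.1, size=3)
    eta = max(eta + alpha * float(g @ g_prev), 0.0)  # gradient step on eta
    w -= eta * g
    g_prev = g

print(eta, w)  # eta adapts on the fly; w ends up close to w_star
```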

Summary

I Recent successes of machine learning rely on stochastic approximation and optimisation methods.
I Theoretical analysis guarantees good behaviour.
I Adaptive learning rates provide good performance in practice.
I Future work will reduce the labour spent on tuning hyper-parameters.

References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 28: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Continuous Gradient Descent

Wersquoll start with some simplifying assumptions

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

(The second condition is weaker than convexity)

Proposition If s [0infin)rarr X satisfies thedifferential equation

ds(t)

dt= minusnablaC (s(t))

then s(t)rarr x as t rarrinfin

Example ofnon-convex functionsatisfying theseconditions

13 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Continuous Gradient Descent

Proof outline:

1. Define Lyapunov function $h(t) = \|s(t) - x^*\|_2^2$
2. $h(t)$ is a decreasing function
3. $h(t)$ converges to a limit, and $h'(t)$ converges to 0
4. The limit of $h(t)$ must be 0
5. $s(t) \to x^*$

Step 2: $h(t)$ is a decreasing function:

$\frac{d}{dt}h(t) = \frac{d}{dt}\|s(t) - x^*\|_2^2 = \frac{d}{dt}\langle s(t) - x^*,\, s(t) - x^*\rangle = 2\Big\langle \frac{ds(t)}{dt},\, s(t) - x^*\Big\rangle = -2\langle s(t) - x^*,\, \nabla C(s(t))\rangle \le 0$

Step 4: the limit of $h(t)$ must be 0. If not, there is some $K > 0$ such that $h(t) > K$ for all $t$. By assumption,

$\inf_{x:\ \|x - x^*\|_2^2 > K}\ \langle x - x^*, \nabla C(x)\rangle > 0,$

which means that $\sup_t h'(t) < 0$, contradicting the convergence of $h'(t)$ to 0.

14 of 27
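As a concrete check of the argument above (our own example, not from the original slides): for the quadratic cost $C(x) = \frac{1}{2}\|x - x^*\|_2^2$ we have $\nabla C(x) = x - x^*$, so $\frac{d}{dt}h(t) = -2\langle s(t) - x^*, s(t) - x^*\rangle = -2h(t)$, giving $h(t) = h(0)e^{-2t} \to 0$: the Lyapunov function decays exponentially and $s(t) \to x^*$.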

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this.

15 of 27

Discrete Gradient Descent

An intuitive, straightforward discretisation of

$\frac{d}{dt}s(t) = -\nabla C(s(t))$

is

$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n) \qquad \text{(Forward Euler)}$

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method, which is (A-)stable, is the implicit Euler method:

$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})$

So why settle for explicit discretisation? Implicit discretisation requires the solution of a non-linear system at each step, while in practice explicit discretisation works.

16 of 27
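To make the two discretisations concrete, here is a minimal Python sketch (the function names and the quadratic test problem are our own, not from the slides); the implicit step is solved in closed form only because the example is quadratic:

```python
import numpy as np

def forward_euler_gd(grad_C, s0, eps=0.1, n_steps=100):
    """Explicit (forward Euler) discretisation of ds/dt = -grad C(s)."""
    s = np.asarray(s0, dtype=float)
    for _ in range(n_steps):
        s = s - eps * grad_C(s)          # s_{n+1} = s_n - eps * grad C(s_n)
    return s

# Implicit Euler needs s_{n+1} = s_n - eps * grad C(s_{n+1}); for the
# quadratic C(s) = 0.5 * ||s - x_star||^2 this can be solved exactly:
def implicit_euler_gd_quadratic(x_star, s0, eps=0.1, n_steps=100):
    s = np.asarray(s0, dtype=float)
    for _ in range(n_steps):
        s = (s + eps * x_star) / (1.0 + eps)   # closed-form implicit step
    return s

x_star = np.array([1.0, -2.0])
print(forward_euler_gd(lambda s: s - x_star, s0=[5.0, 5.0]))
print(implicit_euler_gd_quadratic(x_star, s0=[5.0, 5.0]))
```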

Discrete Gradient Descent

We also have multistep methods, e.g.

$s_{n+2} = s_{n+1} - \varepsilon_{n+2}\Big(\frac{3}{2}\nabla C(s_{n+1}) - \frac{1}{2}\nabla C(s_n)\Big)$

This is the 2nd-order Adams-Bashforth method. It has similar (in)stability properties to forward Euler, but a discretisation error of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).

17 of 27
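A sketch of the corresponding two-step update (again our own illustration; a real implementation would need one starter step, e.g. a single Euler step, to obtain $s_1$):

```python
def adams_bashforth2_step(s_prev, s_curr, grad_C, eps):
    """One 2nd-order Adams-Bashforth step for ds/dt = -grad C(s)."""
    return s_curr - eps * (1.5 * grad_C(s_curr) - 0.5 * grad_C(s_prev))
```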

Discrete Gradient Descent: examples of discretisation with $\varepsilon = 0.1$ [animation in original slides]

18 of 27

Discrete Gradient Descent: examples of discretisation with $\varepsilon = 0.07$ [animation in original slides]

19 of 27

Discrete Gradient Descent: examples of discretisation with $\varepsilon = 0.01$ [animation in original slides]

20 of 27

Discrete Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n\nabla C(s_n)$ and $C$ satisfies:

1. $C$ has a unique minimiser $x^*$
2. $\forall\varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon}\langle x - x^*, \nabla C(x)\rangle > 0$
3. $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.

21 of 27
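A standard schedule satisfying both step-size conditions of the proposition is $\varepsilon_n = c/n$ for some $c > 0$: $\sum_n c/n = \infty$ (harmonic series) while $\sum_n c^2/n^2 = c^2\pi^2/6 < \infty$. By contrast, a constant step $\varepsilon_n = \varepsilon$ fails the second condition, and $\varepsilon_n = c/n^2$ fails the first.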

Discrete Gradient Descent

Proof outline:

1. Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^*$

Step 1: define the Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$. Because of the error that the discretisation introduced, $h_n$ is not guaranteed to be decreasing — we have to adjust the proof. Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. (Illustration credit: Online Learning and Stochastic Approximation, Léon Bottou, 1998.)

Step 2: consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. With some algebra, note that

$h_{n+1} - h_n = \langle s_{n+1} - x^*, s_{n+1} - x^*\rangle - \langle s_n - x^*, s_n - x^*\rangle$
$\quad = \langle s_{n+1}, s_{n+1}\rangle - \langle s_n, s_n\rangle - 2\langle s_{n+1} - s_n, x^*\rangle$
$\quad = \langle s_n - \varepsilon_n\nabla C(s_n),\, s_n - \varepsilon_n\nabla C(s_n)\rangle - \langle s_n, s_n\rangle + 2\varepsilon_n\langle\nabla C(s_n), x^*\rangle$
$\quad = -2\varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\|\nabla C(s_n)\|_2^2$

Step 3: show that $\sum_{n=1}^\infty h_n^+ < \infty$. Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get

$h_{n+1} - h_n \le -2\varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2(A + Bh_n)$
$\Rightarrow\ h_{n+1} - (1 + \varepsilon_n^2 B)h_n \le -2\varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$

Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get

$h'_{n+1} - h'_n \le \varepsilon_n^2\mu_n A$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable.

Step 4: show that this implies that $h_n$ converges. If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$ then, since $h_n \ge 0$, also $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why?). But $h_{n+1} = h_0 + \sum_{k=0}^n\big(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\big)$, so $(h_n)_{n=1}^\infty$ converges.

Step 5: show that $h_n$ must converge to 0. Assume $h_n$ converges to some positive number. Previously we had

$h_{n+1} - h_n = -2\varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\|\nabla C(s_n)\|_2^2$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get $\langle s_n - x^*, \nabla C(s_n)\rangle \to 0$, contradicting $h_n$ converging away from 0.

22 of 27
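A small numerical illustration of the proof device above (entirely our own sketch, not from the slides): track $h_n$ along a run with a Robbins-Monro step schedule and check that the cumulative 'up' movements stay bounded:

```python
import numpy as np

def lyapunov_track(grad_C, x_star, s0, c=1.0, n_steps=1000):
    """Run s_{n+1} = s_n - (c/n) grad C(s_n), recording h_n = ||s_n - x*||^2."""
    s = np.asarray(s0, dtype=float)
    hs = [float(np.sum((s - x_star) ** 2))]
    for n in range(1, n_steps + 1):
        s = s - (c / n) * grad_C(s)
        hs.append(float(np.sum((s - x_star) ** 2)))
    return np.array(hs)

hs = lyapunov_track(lambda x: x - 1.0, x_star=np.array([1.0]), s0=[4.0])
up_moves = np.clip(np.diff(hs), 0.0, None).sum()   # finite sum of h_n^+
print(hs[-1], up_moves)
```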

Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$\nabla C(x) = \frac{1}{N}\sum_{n=1}^N f_n(x)$

and an approximation is formed by subsampling:

$\widehat{\nabla C}(x) = \frac{1}{K}\sum_{k\in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

23 of 27
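A minimal sketch of the subsampled estimator (our own illustration: `per_point_grads` stands in for the list of per-datapoint gradient functions $f_n$, which is an assumption about how the data is stored):

```python
import numpy as np

rng = np.random.default_rng(0)

def subsampled_gradient(per_point_grads, x, K):
    """Unbiased estimate of (1/N) * sum_n f_n(x) from a uniform size-K subset."""
    idx = rng.choice(len(per_point_grads), size=K, replace=False)
    return sum(per_point_grads[i](x) for i in idx) / K
```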

Measure-theory-free martingale theory

Let $(X_n)_{n\ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if

I $\mathbb{E}[|X_n|] < \infty$ for all $n$
I $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$\sum_{n=1}^\infty \mathbb{E}\big[(\mathbb{E}[X_{n+1}\mid\mathcal{F}_n] - X_n)\,\mathbb{1}_{\{\mathbb{E}[X_{n+1}\mid\mathcal{F}_n] - X_n > 0\}}\big] < \infty$

then $X_n \to X_\infty$ almost surely.

³on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)\ \forall n$

24 of 27
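For example, a positive supermartingale — $\mathbb{E}[X_{n+1}\mid\mathcal{F}_n] \le X_n$ — satisfies Fisk's condition trivially, since every term in the sum is zero; the quasimartingale theorem is what lets us handle processes like the Lyapunov sequence below, which are allowed small, summable upward drifts.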

Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies:

1. $C$ has a unique minimiser $x^*$
2. $\forall\varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon}\langle x - x^*, \nabla C(x)\rangle > 0$
3. $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

25 of 27
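A minimal sketch of the scheme in the proposition (our own example, not from the slides: a quadratic cost with Gaussian gradient noise, which satisfies all three conditions):

```python
import numpy as np

rng = np.random.default_rng(0)

def robbins_monro_sgd(grad_C, s0, c=1.0, sigma=1.0, n_steps=100_000):
    """s_{n+1} = s_n - eps_n * H_n(s_n) with eps_n = c/n and
    H_n an unbiased (noisy) estimate of grad C."""
    s = np.asarray(s0, dtype=float)
    for n in range(1, n_steps + 1):
        H = grad_C(s) + sigma * rng.standard_normal(s.shape)  # unbiased estimator
        s = s - (c / n) * H
    return s

x_star = np.array([1.0, 1.0])
print(robbins_monro_sgd(lambda x: x - x_star, s0=[3.0, -2.0]))  # -> approx x_star
```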

Stochastic Gradient Descent

Proof outline:

1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$

Step 2: consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. By exactly the same calculation as in the deterministic discrete case,

$h_{n+1} - h_n = -2\varepsilon_n\langle s_n - x^*, H_n(s_n)\rangle + \varepsilon_n^2\|H_n(s_n)\|_2^2$

So

$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\,\mathbb{E}\big[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\big]$

Step 3: show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

I assume $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B\|x - x^*\|_2^2$
I introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

Get

$\mathbb{E}\big[h'_{n+1} - h'_n \mid \mathcal{F}_n\big] \le \varepsilon_n^2\mu_n A$
$\Rightarrow\ \mathbb{E}\big[(h'_{n+1} - h'_n)\,\mathbb{1}_{\{\mathbb{E}[h'_{n+1} - h'_n\mid\mathcal{F}_n] > 0\}} \mid \mathcal{F}_n\big] \le \varepsilon_n^2\mu_n A$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\Rightarrow$ quasimartingale convergence $\Rightarrow$ $(h'_n)_{n=1}^\infty$ converges a.s. $\Rightarrow$ $(h_n)_{n=1}^\infty$ converges a.s.

Step 4: show that $h_n$ must converge to 0 almost surely. From previous calculations,

$\mathbb{E}\big[h_{n+1} - (1 + \varepsilon_n^2 B)h_n \mid \mathcal{F}_n\big] \le -2\varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A$

$(h_n)_{n=1}^\infty$ converges a.s., so the left side is summable a.s.; assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, the $\varepsilon_n^2 A$ term is summable a.s., so the remaining term is also summable a.s.:

$\sum_{n=1}^\infty \varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle < \infty$ almost surely.

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces $\langle s_n - x^*, \nabla C(s_n)\rangle \to 0$ almost surely, and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above — assume:

I $C$ is a non-negative, three-times differentiable function
I the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
I the gradient estimate $H$ satisfies $\mathbb{E}\big[\|H_n(x)\|^k\big] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$
I there exists $D > 0$ such that $\inf_{\|x\|_2 > D}\langle x, \nabla C(x)\rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.

27 of 27

Motivation

So far we've told you why SGD is "safe" :)

but Robbins-Monro is just a sufficient condition —

then how to choose learning rates to achieve:

I fast convergence
I a better local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule

I grid search, random search, cross validation
I Bayesian optimization

But I'm lazy and I don't want to spend too much time.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
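For reference, random search over a log-uniform grid is a few lines (our own sketch, in the spirit of [Bergstra and Bengio, 2012]; `validation_loss` is a hypothetical stand-in for training briefly and scoring on held-out data):

```python
import numpy as np

rng = np.random.default_rng(0)

def validation_loss(lr):
    # hypothetical placeholder: in practice, train with this lr and
    # return the validation loss
    return (np.log10(lr) + 2.5) ** 2

candidates = 10.0 ** rng.uniform(-5, 0, size=20)   # log-uniform learning rates
best_lr = min(candidates, key=validation_loss)
print(best_lr)
```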

Motivation (cont)

idea 2: let the learning rates adapt themselves

I pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
I adaGrad [Duchi et al., 2010]
I adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
I ADAM [Kingma and Ba, 2014]
I Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning:

I a good tutorial [Shalev-Shwartz, 2012]
I milestone paper [Zinkevich, 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes:

I we've got the whole dataset $\mathcal{D}$
I the cost function $C(w, \mathcal{D}) = \mathbb{E}_{x\sim\mathcal{D}}[c(w, x)]$

SGD / mini-batch learning: accelerate training in real time by

I processing one datapoint / a mini-batch of datapoints each iteration
I considering gradients with data-point cost functions:

$\nabla C(w, \mathcal{D}) \approx \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

I each iteration $t$, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
I then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by

$g_t = \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes:

I each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
I then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$R^*_T = \sum_{t=1}^T f_t(w_t) - \inf_{w\in S}\sum_{t=1}^T f_t(w) \qquad (1)$

General definition: $R_T(u) = \sum_{t=1}^T \big[f_t(w_t) - f_t(u)\big]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

$w_{t+1} = \arg\min_{w\in S}\sum_{i=1}^t f_i(w) \qquad (2)$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$:

$R_T(u) = \sum_{t=1}^T \big[f_t(w_t) - f_t(u)\big] \le \sum_{t=1}^T \big[f_t(w_t) - f_t(w_{t+1})\big] \qquad (3)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let $w \in S = [-1, 1]$, and the loss function at time $t$:

$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
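To spell the trap out: FTL's cumulative loss alternates sign, so its iterates flip between $w_t = +1$ and $w_t = -1$, each time incurring loss $1$ on the next round, while the fixed comparator $u = 0$ suffers loss $0$ every round — so the regret grows linearly in $T$, and no sublinear bound is possible.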

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

$w_{t+1} = \arg\min_{w\in S}\sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$:

$\sum_{t=1}^T \big[f_t(w_t) - f_t(u)\big] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \big[f_t(w_t) - f_t(w_{t+1})\big]$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Let's assume the losses $f_t$ are convex functions:

$\forall w, u \in S:\ f_t(u) - f_t(w) \ge \langle u - w,\, g(w)\rangle \quad \forall g(w) \in \partial f_t(w)$

Use the linearisation $\tilde f_t(w) = f_t(w_t) + \langle w, g(w_t)\rangle$ as a surrogate:

$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t\rangle - \langle u, g_t\rangle \qquad \forall g_t \in \partial f_t(w_t)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD):

$w_{t+1} = w_t - \eta g_t, \qquad g_t \in \partial f_t(w_t) \qquad (5)$

From FTRL to online gradient descent:

I use the linearisation as a surrogate loss (it upper bounds the LHS)
I use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
I apply the FTRL regret bound to the surrogate loss

$w_{t+1} = \arg\min_{w\in S}\sum_{i=1}^t \langle w, g_i\rangle + \varphi(w)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
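A minimal OGD sketch (our own illustration; `subgrad_oracles` stands in for whichever per-round subgradient oracle the environment provides):

```python
import numpy as np

def ogd(w0, eta, subgrad_oracles):
    """Online gradient descent, eq. (5): w_{t+1} = w_t - eta * g_t."""
    w = np.asarray(w0, dtype=float)
    iterates = [w.copy()]
    for subgrad_t in subgrad_oracles:   # one oracle per round t
        g = subgrad_t(w)                # g_t in the subdifferential of f_t at w_t
        w = w - eta * g
        iterates.append(w.copy())
    return iterates
```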

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD):

$w_{t+1} = w_t - \eta g_t, \qquad g_t \in \partial f_t(w_t) \qquad (5)$

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,

$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta\sum_{t=1}^T \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Keep the regret bound above in mind for the later discussion of adaGrad:

I we will want $\varphi$ to be strongly convex wrt a (semi-)norm $\|\cdot\|$;
I the L2 norm in the bound is then changed to the dual norm $\|\cdot\|_*$;
I and we use Hölder's inequality in the proof.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms $\psi_t$:

$\psi_t$ is a strongly-convex function, and

$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle\nabla\psi_t(w_t),\, w - w_t\rangle$

Primal-dual sub-gradient update:

$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \qquad (6)$

Proximal gradient / composite mirror descent:

$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices:

$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$

$w_{t+1} = \arg\min_{w\in S}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t}w^T H_t w \qquad (8)$

$w_{t+1} = \arg\min_{w\in S}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \qquad (9)$

To get adaGrad from online gradient descent:

I add a proximal term or a Bregman divergence to the objective
I adapt the proximal term through time: $\psi_t(w) = \frac{1}{2}w^T H_t w$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T\big[\delta I + \mathrm{diag}(G_t)^{1/2}\big](w - w_t)$

$\Rightarrow\ w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \qquad (10)$

(operations taken elementwise; a sketch implementation follows below)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
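A minimal sketch of update (10) (our own code, assuming the unconstrained, $\varphi = 0$ case of the slide):

```python
import numpy as np

def adagrad(grad, w0, eta=0.1, delta=1e-8, n_steps=1000):
    """Diagonal adaGrad in composite-mirror-descent form, eq. (10)."""
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)                 # running sum of squared gradients
    for _ in range(n_steps):
        g = grad(w)
        G += g * g
        w = w - eta * g / (delta + np.sqrt(G))   # per-coordinate step sizes
    return w
```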

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY? Time-varying learning rate vs time-varying proximal term:

I we want the contribution of the dual norms $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded in terms of $\|g_t\|_2$
I define $\|\cdot\|_{\psi_t} = \sqrt{\langle\cdot,\, H_t\,\cdot\rangle}$
I $\psi_t(w) = \frac{1}{2}w^T H_t w$ is strongly convex wrt $\|\cdot\|_{\psi_t}$
I a little math can show that

$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2\sum_{d=1}^D \|g_{1:T,d}\|_2$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm. 5 in the paper): Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8), with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2\sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$

For $\{w_t\}$ generated using the composite mirror descent update (9),

$R_T(u) \le \frac{1}{2\eta}\max_{t\le T}\|u - w_t\|_2^2\,\sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad:

I the global learning rate $\eta$ still needs hand tuning
I sensitive to initial conditions

possible improving directions:

I use a truncated sum / running average
I use (approximate) second-order information
I add momentum
I correct the bias of the running average

existing methods:

I RMSprop [Tieleman and Hinton, 2012]
I adaDelta [Zeiler, 2012]
I ADAM [Kingma and Ba, 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop:

$G_t = \rho G_{t-1} + (1-\rho)g_t g_t^T, \qquad H_t = \big(\delta I + \mathrm{diag}(G_t)\big)^{1/2}$

$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t(w - w_t)$

$\Rightarrow\ w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$

possible improving directions:

I use a truncated sum / running average
I use (approximate) second-order information
I add momentum
I correct the bias of the running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

$\Delta w \propto H^{-1}g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2}\ \Rightarrow\ \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w}\ \Rightarrow\ \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

possible improving directions:

I use a truncated sum / running average
I use (approximate) second-order information
I add momentum
I correct the bias of the running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta:

$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

$w_{t+1} = \arg\min_w\ \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t(w - w_t)$

$\Rightarrow\ w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12)$

possible improving directions:

I use a truncated sum / running average
I use (approximate) second-order information
I add momentum
I correct the bias of the running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM:

$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$

$\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}$

$w_{t+1} = w_t - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \delta} \qquad (13)$

possible improving directions (a sketch implementation follows below):

I use a truncated sum / running average
I use (approximate) second-order information
I add momentum
I correct the bias of the running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
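A minimal sketch of update (13) (our own code; default constants are common choices, not prescribed by the slide):

```python
import numpy as np

def adam(grad, w0, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8, n_steps=1000):
    """ADAM, eq. (13): momentum plus bias-corrected second-moment scaling."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)                   # first-moment running average
    v = np.zeros_like(w)                   # second-moment running average
    for t in range(1, n_steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)       # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + delta)
    return w
```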

Advanced adaptive learning rates

Demo time

wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3: learn the (global) learning rates as well

$w_{t+1} = w_t - \eta g_t\ \Rightarrow\ w_t = w(t, \eta, w_1, \{g_i\}_{i\le t-1})$

$w_t$ is a function of $\eta$; learn $\eta$ (with gradient descent) by

$\eta^* = \arg\min_\eta \sum_{s\le t} f_s(w(s, \eta))$

I reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
I stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015] (a toy sketch follows below)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
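A toy sketch of learning $\eta$ on the fly (our own one-step-memory approximation in the spirit of [Masse and Ollivier, 2015], not their actual algorithm): since $w_t = w_{t-1} - \eta g_{t-1}$, we have $\partial f_t(w_t)/\partial\eta \approx -\langle g_t, g_{t-1}\rangle$, so a gradient step on the loss with respect to $\eta$ nudges $\eta$ up when successive gradients align:

```python
import numpy as np

def sgd_with_hypergradient(grad, w0, eta0=0.01, beta=1e-4, n_steps=1000):
    """SGD that adapts its own global learning rate by a hypergradient step."""
    w = np.asarray(w0, dtype=float)
    eta, g_prev = eta0, np.zeros_like(w)
    for _ in range(n_steps):
        g = grad(w)
        eta += beta * float(np.dot(g, g_prev))  # d f_t / d eta ~ -<g_t, g_{t-1}>
        w = w - eta * g
        g_prev = g
    return w, eta
```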

Advanced adaptive learning rates

Summary

I Recent successes of machine learning rely on stochastic approximation and optimisation methods

I Theoretical analysis guarantees good behaviour

I Adaptive learning rates provide good performance in practice

I Future work will reduce the labour of tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.

Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.

Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Tieleman, T., Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.

Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.

Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 30: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function $C(w, \mathcal{D}) = \mathbb{E}_{x\sim\mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by:
- processing one datapoint / a mini-batch of datapoints each iteration
- considering gradients with per-datapoint cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$
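A quick sketch of the subsampled estimator above (the per-point gradient function `grad_c`, the dataset `X`, and the batch size are assumed inputs, not from the slides):

```python
import numpy as np

def minibatch_grad(grad_c, w, X, M, rng):
    # Unbiased estimate of (1/N) * sum_n grad c(w, x_n) from a
    # uniformly subsampled mini-batch of size M (without replacement).
    idx = rng.choice(len(X), size=M, replace=False)
    return np.mean([grad_c(w, X[i]) for i in idx], axis=0)
```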

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):
- each iteration $t$, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by

$$g_t = \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$


From batch to online learning (cont.)

Online learning assumes:
- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w\in S}\sum_{t=1}^T f_t(w) \qquad (1)$$

General definition: $R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

From batch to online learning (cont.)

Follow the Leader (FTL):

$$w_{t+1} = \arg\min_{w\in S} \sum_{i=1}^t f_i(w) \qquad (2)$$

Lemma (upper bound of regret)
Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$:

$$R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right] \qquad (3)$$

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled, as the simulation below illustrates.
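A minimal sketch of this game (the starting point $w_1 = 0$ and the tie-breaking rule are assumptions): FTL alternates between the endpoints of $S$ and pays a loss of roughly 1 per round, while the fixed comparator $u = 0$ pays nothing, so the regret grows linearly in $T$.

```python
def c(t):
    # f_t(w) = c(t) * w on S = [-1, 1]
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T = 1000
w, cum, ftl_loss = 0.0, 0.0, 0.0   # w_1 = 0 assumed; cum = sum of past c(i)
for t in range(1, T + 1):
    ftl_loss += c(t) * w
    cum += c(t)
    w = 1.0 if cum < 0 else -1.0   # FTL: argmin over [-1, 1] of cum * w

print(ftl_loss)   # ~ T - 1, while u = 0 incurs 0 loss: linear regret
```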

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \arg\min_{w\in S} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret)
Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$:

$$\sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right]$$

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S:\quad f_t(u) - f_t(w) \ge \langle u - w,\, g(w)\rangle \quad \forall g(w) \in \partial f_t(w)$$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t)\rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t\rangle - \langle u, g_t\rangle \quad \forall g_t \in \partial f_t(w_t)$$

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \qquad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (it upper-bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \arg\min_{w\in S} \sum_{i=1}^t \langle w, g_i\rangle + \varphi(w)$$

Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$:

$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta\sum_{t=1}^T \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t)$$
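To see what the bound buys us, a standard consequence (a sketch, assuming bounded subgradients $\|g_t\|_2 \le G$ and $\|u - w_1\|_2 \le R$, which are not stated on the slide):

```latex
% With \|g_t\|_2 \le G and \|u - w_1\|_2 \le R, the bound reads
%   R_T(u) \le R^2/(2\eta) + \eta T G^2 .
% Minimising the right-hand side over \eta:
\[
  \eta^* = \frac{R}{G\sqrt{2T}}
  \quad\Longrightarrow\quad
  R_T(u) \le RG\sqrt{2T} = O(\sqrt{T}),
\]
% so the average regret R_T(u)/T tends to 0 as T grows.
```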

Keep in mind for the later discussion of adaGrad: I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$; then the L2 norm in the bound will be changed to the dual norm $\|\cdot\|_*$, and we use Hölder's inequality in the proof.

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t),\, w - w_t\rangle$$

Primal-dual sub-gradient update:

$$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^\top, \qquad H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$$

$$w_{t+1} = \arg\min_{w\in S}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t}\, w^\top H_t w \qquad (8)$$

$$w_{t+1} = \arg\min_{w\in S}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^\top H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time: $\psi_t(w) = \frac{1}{2} w^\top H_t w$
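A short check connecting (7) and (9) (a filled-in step, not on the slide): for the quadratic $\psi_t$, the Bregman divergence is exactly the quadratic proximal term.

```latex
\[
  \psi_t(w) = \tfrac{1}{2} w^\top H_t w
  \;\Longrightarrow\;
  B_{\psi_t}(w, w_t)
  = \tfrac{1}{2} w^\top H_t w - \tfrac{1}{2} w_t^\top H_t w_t
    - \langle H_t w_t,\, w - w_t\rangle
  = \tfrac{1}{2}(w - w_t)^\top H_t (w - w_t).
\]
```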

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^\top\big[\delta I + \mathrm{diag}(G_t)^{1/2}\big](w - w_t)$$

$$\Rightarrow\quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \qquad (10)$$

(the square root and division are taken elementwise)
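In code, update (10) is a few lines; a minimal sketch (the step size `eta` and jitter `delta` values are illustrative assumptions):

```python
import numpy as np

def adagrad_step(w, g, G_diag, eta=0.1, delta=1e-8):
    # Composite mirror descent adaGrad with varphi = 0, eq. (10):
    # accumulate squared gradients, then scale each coordinate's step.
    G_diag = G_diag + g ** 2
    w = w - eta * g / (delta + np.sqrt(G_diag))
    return w, G_diag
```

A coordinate with a history of large gradients gets a smaller effective learning rate, which is the whole point of the time-varying diagonal proximal term.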

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:
- we want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded in terms of $\|g_t\|_2$
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle\cdot,\, H_t\,\cdot\rangle}$; then $\psi_t(w) = \frac{1}{2}w^\top H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2\sum_{d=1}^D \|g_{1:T,d}\|_2$$

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper)
Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$:

$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$$

For $w_t$ generated using the composite mirror descent update (9):

$$R_T(u) \le \frac{1}{2\eta}\max_{t\le T}\|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$$

Improving adaGrad

Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont.)

RMSprop:

$$G_t = \rho G_{t-1} + (1-\rho)\, g_t g_t^\top, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^\top H_t(w - w_t)$$

$$\Rightarrow\quad w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$$

(improving direction used here: replace the growing sum with a running average)
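A sketch of update (11) in the same style as before (the decay rate `rho` and the other constants are illustrative assumptions):

```python
import numpy as np

def rmsprop_step(w, g, G_diag, eta=1e-3, rho=0.9, delta=1e-8):
    # Exponential running average of squared gradients replaces
    # adaGrad's ever-growing sum, eq. (11).
    G_diag = rho * G_diag + (1 - rho) * g ** 2
    w = w - eta * g / np.sqrt(delta + G_diag)
    return w, G_diag
```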

Improving adaGrad (cont.)

A second-order motivation: for a diagonal Newton-style step,

$$\Delta w \propto H^{-1}g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2} \quad\Rightarrow\quad \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w} \quad\Rightarrow\quad \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

(improving direction used here: approximate second-order information)

Improving adaGrad (cont.)

adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \arg\min_w\ \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^\top H_t(w - w_t)$$

$$\Rightarrow\quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12)$$

(improving directions used here: running average plus approximate second-order information)
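A sketch of update (12), following the accumulation scheme of [Zeiler, 2012] (the decay rate and jitter values are illustrative assumptions):

```python
import numpy as np

def adadelta_step(w, g, Eg2, Edw2, rho=0.95, delta=1e-6):
    # Running averages of squared gradients and squared updates, eq. (12);
    # note there is no global learning rate eta left to tune.
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dw = -(np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta)) * g
    Edw2 = rho * Edw2 + (1 - rho) * dw ** 2
    return w + dw, Eg2, Edw2
```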

Improving adaGrad (cont.)

ADAM:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13)$$

(improving directions used here: running average, momentum, and bias correction)
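A sketch of update (13) (the default constants follow common practice and are assumptions, not from the slide):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
    # Momentum on the gradient, RMS scaling, and bias-corrected
    # running averages, eq. (13); t counts iterations from 1.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + delta)
    return w, m, v
```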

Demo time

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

idea 3: learn the (global) learning rates as well:

$$w_{t+1} = w_t - \eta g_t \quad\Rightarrow\quad w_t = w(t, \eta, w_1, \{g_i\}_{i\le t-1})$$

$w_t$ is a function of $\eta$, so learn $\eta$ (with gradient descent) by

$$\eta^* = \arg\min_\eta \sum_{s\le t} f_s(w(s, \eta))$$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Massé and Ollivier, 2015]

A one-step version of this idea is sketched below.
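A minimal sketch of the "on the fly" flavour, under a strong simplifying assumption: only the last update's dependence on $\eta$ is kept, so $w_t = w_{t-1} - \eta g_{t-1}$ and $\partial f_t(w_t)/\partial\eta = -\langle g_t, g_{t-1}\rangle$. This is an illustration, not the exact algorithm of either cited paper, and the meta-rate `alpha` is hypothetical.

```python
import numpy as np

def sgd_step_with_eta_learning(w, g, g_prev, eta, alpha=1e-4):
    # One-step truncation: d f_t(w_t) / d eta = -<g_t, g_{t-1}>, so
    # gradient descent on eta increases it when successive gradients
    # align and decreases it when they oppose each other.
    eta = eta + alpha * float(np.dot(g, g_prev))
    w = w - eta * g
    return w, eta
```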

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods
- Theoretical analysis guarantees good behaviour
- Adaptive learning rates provide good performance in practice
- Future work will reduce the labour of tuning hyper-parameters

References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281-305, 2012.

Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011.

Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Tieleman, T. and Hinton, G. Lecture 6.5 - RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.

Maclaurin, D., Duvenaud, D., and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Massé, P.-Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 31: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

d

dth(t) =

d

dts(t)minus x2

2

=d

dt〈s(t)minus x s(t)minus x〉

= 2

langds(t)

dt s(t)minus x

rang= minus〈s(t)minus xnablaC (s(t))〉 le 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 32: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

h(t) is a decreasing function

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If C is a cost function averaged across a data set, we often have a true gradient of the form

\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x)

and an approximation is formed by subsampling:

\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \quad I_K \sim \mathrm{Unif}(\text{subsets of size } K)

We'll treat the more general case where the gradient estimates \widehat{\nabla C}(x) are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

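To make the subsampling step concrete, here is a small numpy sketch (illustrative, not from the slides) checking that the minibatch estimator is unbiased for a toy cost C(x) = \frac{1}{N} \sum_n \frac{1}{2}\|x - d_n\|^2:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))        # N = 1000 data points in R^2

def full_grad(x):
    # nabla C(x) = (1/N) sum_n (x - d_n)
    return np.mean(x - data, axis=0)

def minibatch_grad(x, K=32):
    # unbiased estimate: average over a uniformly subsampled index set I_K
    idx = rng.choice(len(data), size=K, replace=False)
    return np.mean(x - data[idx], axis=0)

x = np.array([3.0, -2.0])
print(full_grad(x))
print(np.mean([minibatch_grad(x) for _ in range(5000)], axis=0))  # close to the above
```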

Measure-theory-free martingale theory

Let (X_n)_{n \ge 0} be a stochastic process.

\mathcal{F}_n denotes "the information describing the stochastic process up to time n", denoted mathematically by \mathcal{F}_n = \sigma(X_m \mid m \le n).

Definition: A stochastic process (X_n)_{n=0}^\infty is a martingale^3 if

- E[|X_n|] < \infty for all n
- E[X_n \mid \mathcal{F}_m] = X_m for n \ge m

Martingale convergence theorem (Doob, 1953): If (X_n)_{n=1}^\infty is a martingale and \sup_n E[|X_n|] < \infty, then X_n \to X_\infty almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If (X_n)_{n=1}^\infty is a positive stochastic process and

\sum_{n=1}^\infty E[(E[X_{n+1} \mid \mathcal{F}_n] - X_n) 1_{\{E[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}}] < \infty

then X_n \to X_\infty almost surely.

^3 on a filtered probability space (\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, P), with \mathcal{F}_n = \sigma(X_m \mid m \le n) for all n


Stochastic Gradient Descent

Proposition: If s_{n+1} = s_n - \varepsilon_n H_n(s_n), with H_n(s_n) an unbiased estimator for \nabla C(s_n), and C satisfies

1. C has a unique minimiser x^*
2. \forall \varepsilon > 0, \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0
3. E[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2 for some A, B \ge 0 independent of n

then subject to \sum_n \varepsilon_n = \infty and \sum_n \varepsilon_n^2 < \infty, we have s_n \to x^*.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.


Stochastic Gradient Descent

1. Define Lyapunov process h_n = \|s_n - x^*\|_2^2
2. Consider the variations h_{n+1} - h_n
3. Show that h_n converges almost surely
4. Show that h_n must converge to 0 almost surely
5. s_n \to x^*


Consider the positive variations h_n^+ = \max(0, h_{n+1} - h_n).

By exactly the same calculation as in the deterministic discrete case,

h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2

So

E[h_{n+1} - h_n \mid \mathcal{F}_n] = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 E[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n]

Show that h_n converges almost surely. Exactly the same as for discrete gradient descent:

- Assume E[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2
- Introduce \mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B} and h'_n = \mu_n h_n

Get

E[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A

\implies E[(h'_{n+1} - h'_n) 1_{\{E[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A

\sum_{n=1}^\infty \varepsilon_n^2 < \infty \implies quasimartingale convergence \implies (h'_n)_{n=1}^\infty converges a.s. \implies (h_n)_{n=1}^\infty converges a.s.

Show that h_n must converge to 0 almost surely. From previous calculations,

E[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n] \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A

(h_n)_{n=1}^\infty converges, so the left-hand side is summable a.s. Assuming \sum_{n=1}^\infty \varepsilon_n^2 < \infty, the \varepsilon_n^2 A term is summable a.s., so the inner-product term is also summable a.s.:

\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty almost surely

If we assume in addition that \sum_{n=1}^\infty \varepsilon_n = \infty, this forces

\langle s_n - x^*, \nabla C(s_n) \rangle \to 0 almost surely

and by our initial assumptions about C, (h_n)_{n=1}^\infty must converge to 0 almost surely.

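The proposition can be watched in action with a short numpy sketch (illustrative, not from the slides): take C(x) = \|x\|_2^2 / 2 with x^* = 0 and corrupt the gradient with additive noise, so that E[\|H_n(x)\|_2^2] = \|x\|_2^2 + \sigma^2 and the conditions hold with A = \sigma^2, B = 1.

```python
import numpy as np

rng = np.random.default_rng(1)
s = np.array([2.0, -1.5])

for n in range(1, 20001):
    eps = 1.0 / (n + 1)                        # Robbins-Monro step sizes
    H = s + rng.normal(scale=1.0, size=2)      # unbiased estimator of nabla C(s) = s
    s = s - eps * H
    if n % 5000 == 0:
        print(f"n = {n:5d},  h_n = {np.sum(s ** 2):.3e}")
```

Individual runs fluctuate, but h_n drifts to 0 almost surely, as the proposition guarantees.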

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: Assume

- C is a non-negative, three-times differentiable function
- The Robbins-Monro learning rate conditions hold: \sum_{n=1}^\infty \varepsilon_n = \infty, \sum_{n=1}^\infty \varepsilon_n^2 < \infty
- The gradient estimate H satisfies E[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k for k = 2, 3, 4
- There exists D > 0 such that \inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0

The idea for the proof is then:

1. For a given start position, the sequence (s_n)_{n=1}^\infty is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that \nabla C(s_n) necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.


Motivation

So far we've told you why SGD is "safe" :)

but Robbins-Monro is just a sufficient condition, so how do we choose learning rates to achieve

- fast convergence
- a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule

- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time.

Motivation (cont.)

idea 2: let the learning rates adapt themselves

- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning

- a good tutorial [Shalev-Shwartz, 2012]
- milestone paper [Zinkevich, 2003]

and some practical comparisons

Online learning

From batch to online learning

Batch learning often assumes:

- We've got the whole dataset D
- the cost function C(w, D) = E_{x \sim D}[c(w, x)]

SGD/mini-batch learning accelerates training in real time by

- processing one/a mini-batch of datapoint(s) each iteration
- considering gradients with data-point cost functions:

\nabla C(w, D) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

online gradient descent (OGD):

- each iteration t, we receive a loss f_t(w) and a (noisy) (sub-)gradient g_t \in \partial f_t(w)
- then we update the weights with w \leftarrow w - \eta g_t

for SGD the received gradient is defined by

g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)
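A minimal numpy sketch of the OGD loop (illustrative names, not from the slides), on quadratic losses f_t(w) = \frac{1}{2}(w - z_t)^2, with regret measured against the best fixed point in hindsight:

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(loc=1.0, size=500)         # stream defining f_t(w) = 0.5 (w - z_t)^2
w, eta = 0.0, 0.05
loss = 0.0

for z_t in z:
    loss += 0.5 * (w - z_t) ** 2          # suffer f_t(w_t)
    g_t = w - z_t                         # (sub-)gradient of f_t at w_t
    w = w - eta * g_t                     # OGD update

u = z.mean()                              # best fixed comparator in hindsight
regret = loss - np.sum(0.5 * (u - z) ** 2)
print(f"R_T(u) = {regret:.2f} over T = {len(z)} rounds")
```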


From batch to online learning (cont.)

Online learning assumes:

- each iteration t, we receive a loss f_t(w) (in the batch learning context, f_t(w) = c(w, x_t))
- then we learn w_{t+1} following some rules

Performance is measured by regret:

R^*_T = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w)    (1)

General definition: R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]

SGD \Leftrightarrow "follow the regularized leader" with L2 regularization

From batch to online learning (cont.)

Follow the Leader (FTL):

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w)    (2)

Lemma (upper bound of regret): Let w_1, w_2, \ldots be the sequence produced by FTL. Then \forall u \in S:

R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]    (3)

From batch to online learning (cont.)

A game that fools FTL:

Let w \in S = [-1, 1] and the loss function at time t be

f_t(w) = -0.5 w,  if t = 1
         w,       if t is even
         -w,      if t > 1 and t is odd

FTL is easily fooled!
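A brute-force sketch of this game (illustrative, not from the slides): compute the FTL iterate on a grid over S = [-1, 1] and watch the cumulative loss grow linearly, while the fixed comparator u = 0 suffers zero loss on every round.

```python
import numpy as np

S = np.linspace(-1.0, 1.0, 201)          # grid over the decision set [-1, 1]

def f(t, w):                             # the adversarial losses from the slide
    if t == 1:
        return -0.5 * w
    return w if t % 2 == 0 else -w

w_t, total, cum = 1.0, 0.0, np.zeros_like(S)
for t in range(1, 21):
    total += f(t, w_t)                   # FTL suffers f_t(w_t)
    cum += f(t, S)                       # cumulative losses over the grid
    w_t = S[np.argmin(cum)]              # FTL plays the cumulative minimiser
print(f"FTL cumulative loss after 20 rounds: {total:.1f}")
```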

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w)    (4)

Lemma (upper bound of regret): Let w_1, w_2, \ldots be the sequence produced by FTRL. Then \forall u \in S:

\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]

From batch to online learning (cont.)

Let's assume the losses f_t are convex functions:

\forall w, u \in S: f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \forall g(w) \in \partial f_t(w)

use the linearisation \tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle as a surrogate:

\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \forall g_t \in \partial f_t(w_t)

From batch to online learning (cont.)

Online Gradient Descent (OGD):

w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t)    (5)

From FTRL to online gradient descent:

- use the linearisation as surrogate loss (upper bounds the LHS)
- use the L2 regularizer \varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2
- apply the FTRL regret bound to the surrogate loss

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)
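For the unconstrained case S = R^D, the FTRL iterate with linearised losses has a closed form, which makes the equivalence with OGD explicit (a short derivation, not spelled out on the slide):

```latex
w_{t+1} = \arg\min_{w} \sum_{i=1}^{t} \langle w, g_i \rangle
          + \tfrac{1}{2\eta} \lVert w - w_1 \rVert_2^2
% setting the gradient to zero:
0 = \sum_{i=1}^{t} g_i + \tfrac{1}{\eta} (w_{t+1} - w_1)
\;\Longrightarrow\; w_{t+1} = w_1 - \eta \sum_{i=1}^{t} g_i
% so consecutive iterates differ by exactly one gradient step:
w_{t+1} - w_t = -\eta g_t .
```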

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions f_1, f_2, \ldots, f_T with regularizer \varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2. Then for all u \in S:

R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)

keep in mind for later discussion of adaGrad:

- I want \varphi to be strongly convex w.r.t. some (semi-)norm \|\cdot\|
- then the L2 norm in the bound will be changed to the dual norm \|\cdot\|_*
- and we use Hölder's inequality for the proof
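A worked consequence (standard, though not spelled out on the slide): if \|g_t\|_2 \le G for all t and \|u - w_1\|_2 \le D, the bound becomes

```latex
R_T(u) \;\le\; \frac{D^2}{2\eta} + \eta T G^2 ,
% minimised over eta by
\eta^* = \frac{D}{G\sqrt{2T}}
\;\Longrightarrow\; R_T(u) \;\le\; D G \sqrt{2T} = O(\sqrt{T}) .
```

The price of this tuning is that \eta^* depends on the horizon T and the bounds D, G, which is one motivation for the adaptive schemes below.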

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms \psi_t:

- \psi_t is a strongly-convex function, and
- B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle

Primal-dual sub-gradient update:

w_{t+1} = \arg\min_w \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w)    (6)

Proximal gradient/composite mirror descent:

w_{t+1} = \arg\min_w \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t)    (7)

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

G_t = \sum_{i=1}^t g_i g_i^T, \quad H_t = \delta I + \mathrm{diag}(G_t^{1/2})

w_{t+1} = \arg\min_{w \in S} \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w    (8)

w_{t+1} = \arg\min_{w \in S} \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t)    (9)

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:

\psi_t(w) = \frac{1}{2} w^T H_t w

adaGrad [Duchi et al., 2010] (cont.)

example: composite mirror descent on S = R^D with \varphi(w) = 0:

w_{t+1} = \arg\min_w \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T [\delta I + \mathrm{diag}(G_t)^{1/2}] (w - w_t)

\Rightarrow w_{t+1} = w_t - \frac{\eta g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}}    (10)
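Update (10) is easy to implement per coordinate; here is a minimal numpy sketch (illustrative names) on a badly scaled quadratic, where the per-coordinate step sizes are exactly what helps:

```python
import numpy as np

def adagrad(grad, w0, eta=0.5, delta=1e-8, steps=200):
    w, G = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        g = grad(w)
        G += g ** 2                            # diagonal of G_t, accumulated
        w -= eta * g / (delta + np.sqrt(G))    # update (10)
    return w

# C(w) = 50 w_1^2 + 0.5 w_2^2, so the two coordinates are scaled very differently
grad = lambda w: np.array([100.0, 1.0]) * w
print(adagrad(grad, np.array([1.0, 1.0])))
```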

adaGrad [Duchi et al., 2010] (cont.)

WHY?

time-varying learning rate vs time-varying proximal term:

I want the contribution of \|g_t\|_*^2 induced by \psi_t to be upper bounded by \|g_t\|_2:

- define \|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \cdot \rangle}
- \psi_t(w) = \frac{1}{2} w^T H_t w is strongly convex w.r.t. \|\cdot\|_{\psi_t}
- a little math can show that

\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence w_t is generated by adaGrad using the primal-dual update (8) with \delta \ge \max_t \|g_t\|_\infty. Then for any u \in S:

R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2

For w_t generated using the composite mirror descent update (9):

R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2

Improving adaGrad

possible problems of adaGrad:

- the global learning rate \eta still needs hand tuning
- sensitive to initial conditions

possible improving directions:

- use truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont.)

RMSprop:

G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}

w_{t+1} = \arg\min_w \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

\Rightarrow w_{t+1} = w_t - \frac{\eta g_t}{RMS[g]_t}, \quad RMS[g]_t = \mathrm{diag}(H_t)    (11)

(of the improving directions above, this uses the running average in place of the truncated sum)
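The adaGrad sketch with its accumulator swapped for the running average of (11) (illustrative, assuming the diagonal case):

```python
import numpy as np

def rmsprop(grad, w0, eta=0.01, rho=0.9, delta=1e-8, steps=500):
    w, G = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        g = grad(w)
        G = rho * G + (1 - rho) * g ** 2       # running average of squared gradients
        w -= eta * g / np.sqrt(delta + G)      # divide by RMS[g]_t
    return w

grad = lambda w: np.array([100.0, 1.0]) * w
print(rmsprop(grad, np.array([1.0, 1.0])))
```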

Improving adaGrad (cont.)

\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}

\Rightarrow \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}

\Rightarrow \mathrm{diag}(H) \approx \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}

(of the improving directions above, this uses approximate second-order information)

Improving adaGrad (cont.)

adaDelta:

\mathrm{diag}(H_t) = \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}

w_{t+1} = \arg\min_w \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

\Rightarrow w_{t+1} = w_t - \frac{RMS[\Delta w]_{t-1}}{RMS[g]_t} g_t    (12)
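A sketch of update (12) (illustrative names); note there is no global \eta at all, the step size comes entirely from the ratio of running RMS values:

```python
import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, steps=2000):
    w = w0.copy()
    Eg2, Edw2 = np.zeros_like(w0), np.zeros_like(w0)
    for _ in range(steps):
        g = grad(w)
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2
        dw = -(np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta)) * g   # RMS-ratio step
        Edw2 = rho * Edw2 + (1 - rho) * dw ** 2
        w += dw
    return w

grad = lambda w: np.array([100.0, 1.0]) * w
print(adadelta(grad, np.array([1.0, 1.0])))
```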

Improving adaGrad (cont.)

ADAM:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

w_{t+1} = w_t - \frac{\eta \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}    (13)

(of the improving directions above, this adds momentum and corrects the bias of the running averages)
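And a sketch of update (13); the bias correction matters early on, when m_t and v_t are still dominated by their zero initialisation:

```python
import numpy as np

def adam(grad, w0, eta=0.1, beta1=0.9, beta2=0.999, delta=1e-8, steps=500):
    w = w0.copy()
    m, v = np.zeros_like(w0), np.zeros_like(w0)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g            # first-moment running average
        v = beta2 * v + (1 - beta2) * g ** 2       # second-moment running average
        m_hat = m / (1 - beta1 ** t)               # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w

grad = lambda w: np.array([100.0, 1.0]) * w
print(adam(grad, np.array([1.0, 1.0])))
```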

Demo time!

wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

idea 3: learn the (global) learning rates as well

w_{t+1} = w_t - \eta g_t \Rightarrow w_t = w(t, \eta; w_1, \{g_i\}_{i \le t-1})

w_t is a function of \eta, so learn \eta (with gradient descent) by

\eta^* = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn \eta on the fly [Masse and Ollivier, 2015]
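A minimal on-the-fly sketch in this spirit (an assumption-laden illustration, not the exact algorithm of either cited paper): differentiating f_t(w_t) through the single most recent update w_t = w_{t-1} - \eta g_{t-1} gives \partial f_t(w_t) / \partial \eta \approx -\langle g_t, g_{t-1} \rangle, so \eta can itself be adapted by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(3)
w = np.array([2.0, -1.5])
eta, beta = 0.01, 0.001
g_prev = np.zeros_like(w)

for t in range(500):
    g = w + rng.normal(scale=0.1, size=2)              # noisy grad of C(w) = ||w||^2/2
    eta = max(eta + beta * np.dot(g, g_prev), 1e-4)    # hypergradient step on eta
    w = w - eta * g
    g_prev = g

print(f"final eta = {eta:.4f}, final ||w|| = {np.linalg.norm(w):.3e}")
```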

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods
- Theoretical analysis guarantees good behaviour
- Adaptive learning rates provide good performance in practice
- Future work will reduce the labour of tuning hyper-parameters

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.

Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.

Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.

Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.

Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 33: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption

infx h(x)gtK

〈x minus xnablaC (x)〉 gt 0

which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)

to 0

14 of 27

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 34: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Continuous Gradient Descent

1 Define Lyapunov function h(t) = s(t)minus x22

2 h(t) is a decreasing function

3 h(t) converges to a limit and hprime(t) converges to 0

4 The limit of h(t) must be 0

5 s(t)rarr x

14 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27


Discrete Gradient Descent

An intuitive, straightforward discretisation of

  (d/dt) s(t) = −∇C(s(t))

is

  (s_{n+1} − s_n) / ε_n = −∇C(s_n)   (Forward Euler)

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method which is (A-)stable is the implicit Euler method:

  (s_{n+1} − s_n) / ε_n = −∇C(s_{n+1})

So why settle for explicit discretisation? Implicit Euler requires the solution of a non-linear system at each step, and in practice explicit discretisation works.

16 of 27
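To make the two discretisations concrete, here is a minimal Python sketch (not from the slides) comparing forward and implicit Euler on the 1-D quadratic cost C(s) = ½λs², where the implicit step happens to have a closed form:

```python
import numpy as np

# Gradient flow ds/dt = -C'(s) for C(s) = 0.5 * lam * s**2, so C'(s) = lam * s.
lam, eps, n_steps = 10.0, 0.19, 50   # eps is near forward Euler's stability limit 2/lam

s_explicit = s_implicit = 1.0
for n in range(n_steps):
    # Forward Euler: s_{n+1} = s_n - eps * C'(s_n)
    s_explicit = s_explicit - eps * lam * s_explicit
    # Implicit Euler: s_{n+1} = s_n - eps * C'(s_{n+1}); for this quadratic the
    # non-linear solve reduces to the closed form s_{n+1} = s_n / (1 + eps * lam).
    s_implicit = s_implicit / (1.0 + eps * lam)

# Forward Euler's iterates alternate in sign (per-step factor 1 - eps*lam = -0.9),
# while implicit Euler decays monotonically for any eps > 0 (A-stability).
print(s_explicit, s_implicit)
```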


Discrete Gradient Descent

Also have multistep methods, e.g.

  s_{n+2} = s_{n+1} − ε_{n+2} ( (3/2)∇C(s_{n+1}) − (1/2)∇C(s_n) )

This is the 2nd-order Adams-Bashforth method.

Similar (in)stability properties to forward Euler, but the discretisation error at each step is of a smaller order of magnitude (O(ε³) compared with O(ε²)).

17 of 27

Discrete Gradient Descent

Examples of discretisation with ε = 0.1

[animation of the discretised trajectory]

18 of 27


Discrete Gradient Descent

Examples of discretisation with ε = 0.07

[animation of the discretised trajectory]

19 of 27


Discrete Gradient Descent

Examples of discretisation with ε = 0.01

[animation of the discretised trajectory]

20 of 27


Discrete Gradient Descent

Proposition. If s_{n+1} = s_n − ε_n∇C(s_n) and C satisfies

1. C has a unique minimiser x*;
2. ∀ε > 0, inf_{‖x−x*‖₂² > ε} ⟨x − x*, ∇C(x)⟩ > 0;
3. ‖∇C(x)‖₂² ≤ A + B‖x − x*‖₂² for some A, B ≥ 0;

then subject to ∑_n ε_n = ∞ and ∑_n ε_n² < ∞, we have s_n → x*.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.

21 of 27
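As a quick numerical illustration of the proposition (a sketch, not part of the slides), take C(x) = ½‖x‖₂², whose gradient ∇C(x) = x satisfies the conditions with x* = 0, A = 0, B = 1, and use Robbins-Monro step sizes ε_n = 1/(n+5):

```python
import numpy as np

def grad_C(x):
    return x   # gradient of C(x) = 0.5 * ||x||^2; unique minimiser x* = 0

s = np.array([3.0, -2.0])
for n in range(1, 10001):
    eps_n = 1.0 / (n + 5)     # sum of eps_n diverges, sum of eps_n^2 converges
    s = s - eps_n * grad_C(s)

print(np.linalg.norm(s))      # distance to x* shrinks towards 0
```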


Discrete Gradient Descent

1. Define Lyapunov sequence h_n = ‖s_n − x*‖₂²
2. Consider the positive variations h_n⁺ = max(0, h_{n+1} − h_n)
3. Show that ∑_{n=1}^∞ h_n⁺ < ∞
4. Show that this implies that h_n converges
5. Show that h_n must converge to 0
6. s_n → x*

22 of 27


Discrete Gradient Descent

Step 1: define the Lyapunov sequence h_n = ‖s_n − x*‖₂².

Because of the error that the discretisation introduces, h_n is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: h_n may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of h_n. (Illustration credit: Online Learning and Stochastic Approximation, Léon Bottou, 1998.)

22 of 27

Discrete Gradient Descent

Step 2: consider the positive variations h_n⁺ = max(0, h_{n+1} − h_n).

With some algebra, note that

h_{n+1} − h_n = ⟨s_{n+1} − x*, s_{n+1} − x*⟩ − ⟨s_n − x*, s_n − x*⟩
             = ⟨s_{n+1}, s_{n+1}⟩ − ⟨s_n, s_n⟩ − 2⟨s_{n+1} − s_n, x*⟩
             = ⟨s_n − ε_n∇C(s_n), s_n − ε_n∇C(s_n)⟩ − ⟨s_n, s_n⟩ + 2ε_n⟨∇C(s_n), x*⟩
             = −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²‖∇C(s_n)‖₂²

22 of 27

Discrete Gradient Descent

Step 3: show that ∑_{n=1}^∞ h_n⁺ < ∞.

Assuming ‖∇C(x)‖₂² ≤ A + B‖x − x*‖₂², we get

h_{n+1} − h_n ≤ −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²(A + Bh_n)
⟹ h_{n+1} − (1 + ε_n²B)h_n ≤ −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²A
                            ≤ ε_n²A

22 of 27

Discrete Gradient Descent

Step 3 (continued): we have

h_{n+1} − (1 + ε_n²B)h_n ≤ ε_n²A.

Writing μ_n = ∏_{i=1}^n 1/(1 + ε_i²B) and h′_n = μ_nh_n, we get

h′_{n+1} − h′_n ≤ ε_n²μ_nA.

Assuming ∑_{n=1}^∞ ε_n² < ∞, μ_n converges away from 0, so the RHS is summable; hence so is ∑_{n=1}^∞ h_n⁺.

22 of 27

Discrete Gradient Descent

Step 4: show that this implies that h_n converges.

If ∑_{n=1}^∞ max(0, h_{n+1} − h_n) < ∞, then ∑_{n=1}^∞ min(0, h_{n+1} − h_n) > −∞. (Why? h_n ≥ 0, so the cumulative downward movements can exceed the summable upward movements by at most h_0.)

But h_{n+1} = h_0 + ∑_{k=1}^n (max(0, h_{k+1} − h_k) + min(0, h_{k+1} − h_k)).

So (h_n)_{n=1}^∞ converges.

22 of 27

Discrete Gradient Descent

Step 5: show that h_n must converge to 0. Assume instead that h_n converges to some positive number. Previously we had

h_{n+1} − h_n = −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²‖∇C(s_n)‖₂².

This is summable, and if we assume further that ∑_{n=1}^∞ ε_n = ∞, then we get

⟨s_n − x*, ∇C(s_n)⟩ → 0,

contradicting h_n converging away from 0 (via condition 2 of the proposition).

22 of 27


Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If C is a cost function averaged across a data set, we often have a true gradient of the form

∇C(x) = (1/N) ∑_{n=1}^N f_n(x)

and an approximation formed by subsampling:

∇̂C(x) = (1/K) ∑_{k∈I_K} f_k(x),   I_K ∼ Unif(subsets of size K).

We'll treat the more general case where the gradient estimates ∇̂C(x) are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

23 of 27
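A minimal sketch of this subsampling (with an assumed least-squares cost, so f_n(x) = a_n(a_nᵀx − b_n); all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 1000, 5, 32
A, b = rng.normal(size=(N, D)), rng.normal(size=N)   # assumed data set

def full_gradient(x):        # grad C(x) = (1/N) sum_n f_n(x)
    return A.T @ (A @ x - b) / N

def subsampled_gradient(x):  # unbiased estimate from a random size-K subset I_K
    idx = rng.choice(N, size=K, replace=False)
    return A[idx].T @ (A[idx] @ x - b[idx]) / K

x = np.zeros(D)
print(full_gradient(x)[:3])
print(subsampled_gradient(x)[:3])   # noisy, but correct in expectation
```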


Measure-theory-free martingale theory

Let (X_n)_{n≥0} be a stochastic process.

F_n denotes "the information describing the stochastic process up to time n", denoted mathematically by F_n = σ(X_m | m ≤ n).

Definition. A stochastic process (X_n)_{n=0}^∞ is a martingale³ if

▶ E[|X_n|] < ∞ for all n
▶ E[X_n | F_m] = X_m for n ≥ m

Martingale convergence theorem (Doob, 1953): if (X_n)_{n=1}^∞ is a martingale and sup_n E[|X_n|] < ∞, then X_n → X_∞ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): if (X_n)_{n=1}^∞ is a positive stochastic process and

∑_{n=1}^∞ E[ (E[X_{n+1} | F_n] − X_n) 1{E[X_{n+1}|F_n] − X_n > 0} ] < ∞,

then X_n → X_∞ almost surely.

³ on a filtered probability space (Ω, F, (F_n)_{n=0}^∞, P), with F_n = σ(X_m | m ≤ n) ∀n

24 of 27


Stochastic Gradient Descent

Proposition. If s_{n+1} = s_n − ε_nH_n(s_n), with H_n(s_n) an unbiased estimator for ∇C(s_n), and C satisfies

1. C has a unique minimiser x*;
2. ∀ε > 0, inf_{‖x−x*‖₂² > ε} ⟨x − x*, ∇C(x)⟩ > 0;
3. E[‖H_n(x)‖₂²] ≤ A + B‖x − x*‖₂² for some A, B ≥ 0 independent of n;

then subject to ∑_n ε_n = ∞ and ∑_n ε_n² < ∞, we have s_n → x* almost surely.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

25 of 27


Stochastic Gradient Descent

1. Define Lyapunov process h_n = ‖s_n − x*‖₂²
2. Consider the variations h_{n+1} − h_n
3. Show that h_n converges almost surely
4. Show that h_n must converge to 0 almost surely
5. s_n → x*

26 of 27


Stochastic Gradient Descent

Step 2: consider the positive variations h_n⁺ = max(0, h_{n+1} − h_n).

By exactly the same calculation as in the deterministic discrete case,

h_{n+1} − h_n = −2ε_n⟨s_n − x*, H_n(s_n)⟩ + ε_n²‖H_n(s_n)‖₂².

So

E[h_{n+1} − h_n | F_n] = −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²E[‖H_n(s_n)‖₂² | F_n].

26 of 27

Stochastic Gradient Descent

Step 3: show that h_n converges almost surely. Exactly the same as for discrete gradient descent:

▶ assume E[‖H_n(x)‖₂²] ≤ A + B‖x − x*‖₂²
▶ introduce μ_n = ∏_{i=1}^n 1/(1 + ε_i²B) and h′_n = μ_nh_n

Get

E[h′_{n+1} − h′_n | F_n] ≤ ε_n²μ_nA
⟹ E[ (h′_{n+1} − h′_n) 1{E[h′_{n+1} − h′_n | F_n] > 0} | F_n ] ≤ ε_n²μ_nA

∑_{n=1}^∞ ε_n² < ∞ ⟹ quasimartingale convergence ⟹ (h′_n)_{n=1}^∞ converges a.s. ⟹ (h_n)_{n=1}^∞ converges a.s.

26 of 27

Stochastic Gradient Descent

Step 4: show that h_n must converge to 0 almost surely. From previous calculations,

E[ h_{n+1} − (1 + ε_n²B)h_n | F_n ] ≤ −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²A.

(h_n)_{n=1}^∞ converges a.s., so the left-hand side is summable a.s. Assuming ∑_{n=1}^∞ ε_n² < ∞, the right term is summable a.s., so the remaining term is also summable a.s.:

∑_{n=1}^∞ ε_n⟨s_n − x*, ∇C(s_n)⟩ < ∞ almost surely.

If we assume in addition that ∑_{n=1}^∞ ε_n = ∞, this forces

⟨s_n − x*, ∇C(s_n)⟩ → 0 almost surely,

and by our initial assumptions about C, (h_n)_{n=1}^∞ must converge to 0 almost surely.

26 of 27
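Putting the proposition to work on the earlier least-squares example (again a sketch under the same illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 1000, 5, 32
A, b = rng.normal(size=(N, D)), rng.normal(size=N)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]    # the unique minimiser of C

s = np.zeros(D)
for n in range(1, 20001):
    idx = rng.choice(N, size=K, replace=False)
    H_n = A[idx].T @ (A[idx] @ s - b[idx]) / K   # unbiased estimator of grad C(s)
    s = s - (1.0 / (n + 100)) * H_n              # eps_n satisfies Robbins-Monro

print(np.linalg.norm(s - x_star))                # close to 0, as the theorem predicts
```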


Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

▶ C is a non-negative, three-times differentiable function;
▶ the Robbins-Monro learning rate conditions hold: ∑_{n=1}^∞ ε_n = ∞, ∑_{n=1}^∞ ε_n² < ∞;
▶ the gradient estimate H satisfies E[‖H_n(x)‖ᵏ] ≤ A_k + B_k‖x‖ᵏ for k = 2, 3, 4;
▶ there exists D > 0 such that inf_{‖x‖₂ > D} ⟨x, ∇C(x)⟩ > 0.

The idea for the proof is then:

1. For a given start position, the sequence (s_n)_{n=1}^∞ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that ∇C(s_n) necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.

27 of 27


Motivation

So far we've told you why SGD is "safe" :)

▶ but Robbins-Monro is just a sufficient condition
▶ so how do we choose learning rates to achieve
  • fast convergence?
  • a better local optimum?

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio, 2012].

idea 1: search for the best learning rate schedule:

• grid search, random search, cross validation
• Bayesian optimization

But I'm lazy and I don't want to spend too much time...

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2: let the learning rates adapt themselves

• pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
• adaGrad [Duchi et al., 2010]
• adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
• ADAM [Kingma and Ba, 2014]
• Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

• a good tutorial: [Shalev-Shwartz, 2012]
• a milestone paper: [Zinkevich, 2003]

...and some practical comparisons.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes:

• we've got the whole dataset D
• the cost function C(w, D) = E_{x∼D}[c(w, x)]

SGD / mini-batch learning accelerates training in real time by:

• processing one / a mini-batch of data points each iteration
• considering gradients with data-point cost functions:

∇C(w, D) ≈ (1/M) ∑_{m=1}^M ∇c(w, x_m)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

• at each iteration t we receive a loss f_t(w)
• and a (noisy) (sub-)gradient g_t ∈ ∂f_t(w)
• then we update the weights with w ← w − ηg_t

For SGD, the received gradient is defined by

g_t = (1/M) ∑_{m=1}^M ∇c(w, x_m)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
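A minimal sketch of the OGD protocol (assuming, for illustration, squared losses f_t(w) = ½(x_tᵀw − y_t)² arriving one at a time):

```python
import numpy as np

rng = np.random.default_rng(2)
D, T, eta = 3, 500, 0.05
w_true = rng.normal(size=D)   # illustrative data-generating weights

w, total_loss = np.zeros(D), 0.0
for t in range(T):
    x_t = rng.normal(size=D)
    y_t = x_t @ w_true                    # we receive the loss f_t via (x_t, y_t)
    total_loss += 0.5 * (x_t @ w - y_t) ** 2
    g_t = (x_t @ w - y_t) * x_t           # g_t is the gradient of f_t at the current w
    w = w - eta * g_t                     # OGD update: w <- w - eta * g_t

# The comparator u = w_true incurs zero loss here, so total_loss is the regret.
print(total_loss / T)
```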

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes:

• at each iteration t we receive a loss f_t(w) (in the batch learning context, f_t(w) = c(w, x_t))
• then we learn w_{t+1} following some rules

Performance is measured by regret:

R*_T = ∑_{t=1}^T f_t(w_t) − inf_{w∈S} ∑_{t=1}^T f_t(w)   (1)

General definition: R_T(u) = ∑_{t=1}^T [f_t(w_t) − f_t(u)]

SGD ⇔ "follow the regularized leader" with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL):

w_{t+1} = argmin_{w∈S} ∑_{i=1}^t f_i(w)   (2)

Lemma (upper bound of regret). Let w_1, w_2, ... be the sequence produced by FTL; then ∀u ∈ S,

R_T(u) = ∑_{t=1}^T [f_t(w_t) − f_t(u)] ≤ ∑_{t=1}^T [f_t(w_t) − f_t(w_{t+1})]   (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL:

Let w ∈ S = [−1, 1], and let the loss function at time t be

f_t(w) = −0.5w   if t = 1
         w       if t is even
         −w      if t > 1 and t is odd

FTL is easily fooled!

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL):

w_{t+1} = argmin_{w∈S} ∑_{i=1}^t f_i(w) + ϕ(w)   (4)

Lemma (upper bound of regret). Let w_1, w_2, ... be the sequence produced by FTRL; then ∀u ∈ S,

∑_{t=1}^T [f_t(w_t) − f_t(u)] ≤ ϕ(u) − ϕ(w_1) + ∑_{t=1}^T [f_t(w_t) − f_t(w_{t+1})]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Let's assume the losses f_t are convex functions:

∀w, u ∈ S: f_t(u) − f_t(w) ≥ ⟨u − w, g(w)⟩, ∀g(w) ∈ ∂f_t(w)

Use the linearisation f̃_t(w) = f_t(w_t) + ⟨w, g(w_t)⟩ as a surrogate:

∑_{t=1}^T f_t(w_t) − f_t(u) ≤ ∑_{t=1}^T ⟨w_t, g_t⟩ − ⟨u, g_t⟩, ∀g_t ∈ ∂f_t(w_t)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD):

w_{t+1} = w_t − ηg_t, g_t ∈ ∂f_t(w_t)   (5)

From FTRL to online gradient descent:

• use the linearisation as a surrogate loss (it upper-bounds the LHS)
• use the L2 regularizer ϕ(w) = (1/2η)‖w − w_1‖₂²
• apply the FTRL regret bound to the surrogate loss:

w_{t+1} = argmin_{w∈S} ∑_{i=1}^t ⟨w, g_i⟩ + ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD):

w_{t+1} = w_t − ηg_t, g_t ∈ ∂f_t(w_t)   (5)

Theorem (regret bound for online gradient descent). Assume we run online gradient descent on convex loss functions f_1, f_2, ..., f_T with regularizer ϕ(w) = (1/2η)‖w − w_1‖₂²; then for all u ∈ S,

R_T(u) ≤ (1/2η)‖u − w_1‖₂² + η ∑_{t=1}^T ‖g_t‖₂²,   g_t ∈ ∂f_t(w_t)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent, restated). For convex losses f_1, ..., f_T and regularizer ϕ(w) = (1/2η)‖w − w_1‖₂², for all u ∈ S,

R_T(u) ≤ (1/2η)‖u − w_1‖₂² + η ∑_{t=1}^T ‖g_t‖₂²,   g_t ∈ ∂f_t(w_t)

Keep in mind for the later discussion of adaGrad:

▶ we want ϕ to be strongly convex w.r.t. a (semi-)norm ‖·‖;
▶ then the L2 norm above is changed to the dual norm ‖·‖_*;
▶ and we use Hölder's inequality in the proof.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψ_t: ψ_t is a strongly-convex function, and

B_{ψ_t}(w, w_t) = ψ_t(w) − ψ_t(w_t) − ⟨∇ψ_t(w_t), w − w_t⟩

Primal-dual sub-gradient update:

w_{t+1} = argmin_w η⟨w, g_t⟩ + ηϕ(w) + (1/t)ψ_t(w)   (6)

Proximal gradient / composite mirror descent update:

w_{t+1} = argmin_w η⟨w, g_t⟩ + ηϕ(w) + B_{ψ_t}(w, w_t)   (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices:

G_t = ∑_{i=1}^t g_ig_i^T,   H_t = δI + diag(G_t)^{1/2}

Primal-dual update:

w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηϕ(w) + (1/2t) w^T H_t w   (8)

Composite mirror descent update:

w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηϕ(w) + (1/2)(w − w_t)^T H_t (w − w_t)   (9)

To get adaGrad from online gradient descent:

• add a proximal term or a Bregman divergence to the objective
• adapt the proximal term through time:

ψ_t(w) = (1/2) w^T H_t w

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Example: composite mirror descent on S = R^D with ϕ(w) = 0:

w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2)(w − w_t)^T [δI + diag(G_t)^{1/2}] (w − w_t)

⟹ w_{t+1} = w_t − ηg_t / (δ + √(∑_{i=1}^t g_i²))   (10)

(operations taken element-wise)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
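A minimal sketch of the element-wise update (10); grad_fn is an illustrative placeholder, not something from the slides:

```python
import numpy as np

def adagrad(grad_fn, w0, eta=0.1, delta=1e-8, T=1000):
    w = w0.copy()
    G_diag = np.zeros_like(w)                      # running sum of squared gradients
    for t in range(T):
        g = grad_fn(w)
        G_diag += g * g
        w -= eta * g / (delta + np.sqrt(G_diag))   # per-coordinate step sizes
    return w

# usage on C(w) = 0.5 * ||w||^2, whose gradient is w
print(adagrad(lambda w: w, np.array([5.0, -3.0])))
```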

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY does this work?

Time-varying learning rate vs time-varying proximal term:

▶ we want the contribution of ‖g_t‖²_* induced by ψ_t to be upper-bounded by ‖g_t‖₂
▶ define ‖·‖_{ψ_t} = √⟨·, H_t ·⟩; then ψ_t(w) = (1/2)w^TH_tw is strongly convex w.r.t. ‖·‖_{ψ_t}
▶ a little math can show that

∑_{t=1}^T ‖g_t‖²_{ψ_t,*} ≤ 2 ∑_{d=1}^D ‖g_{1:T,d}‖₂

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm. 5 in the paper). Assume the sequence {w_t} is generated by adaGrad using the primal-dual update (8) with δ ≥ max_t ‖g_t‖_∞; then for any u ∈ S,

R_T(u) ≤ (δ/η)‖u‖₂² + (1/η)‖u‖²_∞ ∑_{d=1}^D ‖g_{1:T,d}‖₂ + η ∑_{d=1}^D ‖g_{1:T,d}‖₂.

For {w_t} generated using the composite mirror descent update (9),

R_T(u) ≤ (1/2η) max_{t≤T} ‖u − w_t‖₂² ∑_{d=1}^D ‖g_{1:T,d}‖₂ + η ∑_{d=1}^D ‖g_{1:T,d}‖₂.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

Possible problems of adaGrad:

• the global learning rate η still needs hand-tuning
• sensitive to initial conditions

Possible improving directions:

• use a truncated sum / running average
• use (approximate) second-order information
• add momentum
• correct the bias of the running average

Existing methods:

• RMSprop [Tieleman and Hinton, 2012]
• adaDelta [Zeiler, 2012]
• ADAM [Kingma and Ba, 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop:

G_t = ρG_{t−1} + (1 − ρ)g_tg_t^T,   H_t = (δI + diag(G_t))^{1/2}

w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2)(w − w_t)^T H_t (w − w_t)

⟹ w_{t+1} = w_t − ηg_t / RMS[g]_t,   RMS[g]_t = diag(H_t)   (11)

(improving direction used here: replace the full sum by a running average)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
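A minimal sketch of update (11), under the same illustrative conventions as the adaGrad sketch:

```python
import numpy as np

def rmsprop(grad_fn, w0, eta=0.01, rho=0.9, delta=1e-8, T=1000):
    w = w0.copy()
    G_diag = np.zeros_like(w)                      # running average of squared gradients
    for t in range(T):
        g = grad_fn(w)
        G_diag = rho * G_diag + (1 - rho) * g * g
        w -= eta * g / np.sqrt(delta + G_diag)     # divide element-wise by RMS[g]_t
    return w

print(rmsprop(lambda w: w, np.array([5.0, -3.0])))
```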

Advanced adaptive learning rates

Improving adaGrad (cont)

A second-order heuristic:

Δw ∝ H⁻¹g ∝ (∂f/∂w) / (∂²f/∂w²)
⟹ ∂²f/∂w² ∝ (∂f/∂w) / Δw
⟹ diag(H) ≈ RMS[g]_t / RMS[Δw]_{t−1}

(improving direction used here: (approximate) second-order information)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta:

diag(H_t) = RMS[g]_t / RMS[Δw]_{t−1}

w_{t+1} = argmin_w ⟨w, g_t⟩ + (1/2)(w − w_t)^T H_t (w − w_t)

⟹ w_{t+1} = w_t − (RMS[Δw]_{t−1} / RMS[g]_t) g_t   (12)

(improving directions used here: running average and approximate second-order information)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
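A minimal sketch of update (12); note there is no global η left to tune (grad_fn remains an illustrative placeholder):

```python
import numpy as np

def adadelta(grad_fn, w0, rho=0.95, delta=1e-6, T=1000):
    w = w0.copy()
    Eg2 = np.zeros_like(w)    # running average of g^2
    Edw2 = np.zeros_like(w)   # running average of (delta w)^2
    for t in range(T):
        g = grad_fn(w)
        Eg2 = rho * Eg2 + (1 - rho) * g * g
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g   # RMS-ratio step
        Edw2 = rho * Edw2 + (1 - rho) * dw * dw
        w += dw
    return w

print(adadelta(lambda w: w, np.array([5.0, -3.0])))
```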

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM:

m_t = β₁m_{t−1} + (1 − β₁)g_t,   v_t = β₂v_{t−1} + (1 − β₂)g_t²

m̂_t = m_t / (1 − β₁ᵗ),   v̂_t = v_t / (1 − β₂ᵗ)

w_{t+1} = w_t − η m̂_t / (√v̂_t + δ)   (13)

(improving directions used here: momentum, and bias correction of the running averages)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
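A minimal sketch of update (13), again with grad_fn as an illustrative placeholder:

```python
import numpy as np

def adam(grad_fn, w0, eta=0.1, beta1=0.9, beta2=0.999, delta=1e-8, T=1000):
    w = w0.copy()
    m = np.zeros_like(w)                 # first-moment running average
    v = np.zeros_like(w)                 # second-moment running average
    for t in range(1, T + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)     # bias correction of the running averages
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w

print(adam(lambda w: w, np.array([5.0, -3.0])))
```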

Advanced adaptive learning rates

Demo time

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3: learn the (global) learning rates as well

w_{t+1} = w_t − ηg_t ⟹ w_t = w(t, η; w_1, {g_i}_{i≤t−1})

so w_t is a function of η; learn η (with gradient descent) by

η = argmin_η ∑_{s≤t} f_s(w(s, η))

• reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
• stochastic gradient descent to learn η on the fly [Masse and Ollivier, 2015] (see the sketch after this slide)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
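A minimal sketch of the on-the-fly flavour of this idea (a simple hypergradient-style update; the details differ from [Masse and Ollivier, 2015]): since w_t = w_{t−1} − ηg_{t−1}, we have ∂f_t(w_t)/∂η ≈ −⟨g_t, g_{t−1}⟩, so a gradient step on η adds β⟨g_t, g_{t−1}⟩:

```python
import numpy as np

def grad_f(w):                  # illustrative per-step loss gradient, f(w) = 0.5 * ||w||^2
    return w

w, eta, beta = np.array([5.0, -3.0]), 0.1, 0.01
g_prev = np.zeros_like(w)
for t in range(100):
    g = grad_f(w)
    eta += beta * (g @ g_prev)  # hypergradient step on the global learning rate
    w = w - eta * g
    g_prev = g

print(eta, np.linalg.norm(w))
```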

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochastic approximation and optimisation methods.

Theoretical analysis guarantees good behaviour.

Adaptive learning rates provide good performance in practice.

Future work will reduce the labour of tuning hyper-parameters.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman, T. and Hinton, G. Lecture 6.5, rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

References

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 35: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms \psi_t: \psi_t is a strongly-convex function, with Bregman divergence

    B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle

Primal-dual sub-gradient update:

    w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w)    (6)

Proximal gradient / composite mirror descent update:

    w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t)    (7)

adaGrad with diagonal matrices:

    G_t = \sum_{i=1}^{t} g_i g_i^T, \quad H_t = \delta I + \mathrm{diag}(G_t)^{1/2}

    w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w    (8)

    w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t)    (9)

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:

    \psi_t(w) = \frac{1}{2} w^T H_t w

Example: composite mirror descent on S = R^D with \varphi(w) = 0:

    w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T [\delta I + \mathrm{diag}(G_t)^{1/2}] (w - w_t)

    \Rightarrow \quad w_{t+1} = w_t - \frac{\eta g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}}    (10)

(with the square root and division applied element-wise).
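Here is a minimal NumPy sketch of the closed-form diagonal update (10); the badly-scaled quadratic test function and all constants are made up for illustration.

```python
import numpy as np

def adagrad(grad, w0, eta=0.1, delta=1e-8, T=1000):
    """Diagonal adaGrad, update (10): per-coordinate step sizes."""
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)              # running sum of squared gradients
    for t in range(T):
        g = grad(w)
        G += g * g
        w -= eta * g / (delta + np.sqrt(G))
    return w

# Example on a badly-scaled quadratic C(w) = 0.5 * (w1^2 + 100 * w2^2).
grad = lambda w: np.array([w[0], 100.0 * w[1]])
print(adagrad(grad, w0=[1.0, 1.0]))   # both coordinates shrink toward 0
```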

WHY? Time-varying learning rate vs. time-varying proximal term:
we want the contribution of \|g_t\|_*^2 induced by \psi_t to be upper-bounded by \|g_t\|_2-type quantities. Define \|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \cdot \rangle}; then \psi_t(w) = \frac{1}{2} w^T H_t w is strongly convex w.r.t. \|\cdot\|_{\psi_t}, and a little math shows that

    \sum_{t=1}^{T} \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2

Theorem (Thm. 5 in the paper)
Assume the sequence {w_t} is generated by adaGrad using the primal-dual update (8) with \delta \ge \max_t \|g_t\|_\infty; then for any u \in S,

    R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2

For {w_t} generated using the composite mirror descent update (9),

    R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2

Improving adaGrad

Possible problems of adaGrad:
- the global learning rate \eta still needs hand tuning
- sensitive to initial conditions

Possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

RMSprop:

    G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}

    w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

    \Rightarrow \quad w_{t+1} = w_t - \frac{\eta g_t}{RMS[g]_t}, \quad RMS[g]_t = \mathrm{diag}(H_t)    (11)

(addressed direction: a running average instead of adaGrad's ever-growing sum)
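A minimal sketch of update (11) in NumPy, on the same made-up quadratic as before; the hyper-parameter values are illustrative defaults, not prescribed by the slides.

```python
import numpy as np

def rmsprop(grad, w0, eta=1e-2, rho=0.9, delta=1e-8, T=2000):
    """RMSprop: exponential running average of squared gradients."""
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)
    for t in range(T):
        g = grad(w)
        G = rho * G + (1.0 - rho) * g * g
        w -= eta * g / np.sqrt(delta + G)     # divide by RMS[g]_t element-wise
    return w

grad = lambda w: np.array([w[0], 100.0 * w[1]])
print(rmsprop(grad, w0=[1.0, 1.0]))
```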

A second-order heuristic (the motivation behind adaDelta): for a diagonal approximation of the Hessian H,

    \Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}

    \Rightarrow \quad \partial^2 f / \partial w^2 \propto \frac{\partial f / \partial w}{\Delta w}

    \Rightarrow \quad \mathrm{diag}(H) \approx \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}

adaDelta:

    \mathrm{diag}(H_t) = \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}

    w_{t+1} = \arg\min_{w} \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

    \Rightarrow \quad w_{t+1} = w_t - \frac{RMS[\Delta w]_{t-1}}{RMS[g]_t} \, g_t    (12)

(addressed directions: running averages and approximate second-order information; note there is no global learning rate at all)
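A minimal NumPy sketch of update (12), keeping one running average for g^2 and one for (Δw)^2, as in Zeiler's description; the test function and hyper-parameters are again made up.

```python
import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, T=5000):
    """adaDelta: units-corrected RMSprop, no global learning rate."""
    w = np.asarray(w0, dtype=float)
    Eg2 = np.zeros_like(w)     # running average of g^2
    Edw2 = np.zeros_like(w)    # running average of (delta w)^2
    for t in range(T):
        g = grad(w)
        Eg2 = rho * Eg2 + (1.0 - rho) * g * g
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g  # RMS ratio step
        Edw2 = rho * Edw2 + (1.0 - rho) * dw * dw
        w += dw
    return w

grad = lambda w: np.array([w[0], 100.0 * w[1]])
print(adadelta(grad, w0=[1.0, 1.0]))
```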

ADAM:

    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

    \hat m_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat v_t = \frac{v_t}{1 - \beta_2^t}

    w_{t+1} = w_t - \frac{\eta \, \hat m_t}{\sqrt{\hat v_t} + \delta}    (13)

(addressed directions: running averages, momentum, and bias correction of the running averages)
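A minimal NumPy sketch of update (13); hyper-parameter defaults follow common usage and the test function is the same made-up quadratic.

```python
import numpy as np

def adam(grad, w0, eta=1e-2, beta1=0.9, beta2=0.999, delta=1e-8, T=2000):
    """ADAM: momentum on g, running average of g^2, bias correction on both."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, T + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)    # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)    # bias-corrected second moment
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w

grad = lambda w: np.array([w[0], 100.0 * w[1]])
print(adam(grad, w0=[1.0, 1.0]))
```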

Demo time

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

Idea 3: learn the (global) learning rate as well. Since

    w_{t+1} = w_t - \eta g_t \quad \Rightarrow \quad w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1}),

w_t is a function of \eta, so we can learn \eta (with gradient descent) by

    \eta^* = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn \eta on the fly [Massé and Ollivier, 2015]

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour spent on tuning hyper-parameters.

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Massé, P.-Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.


Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this.

Discrete Gradient Descent

An intuitive, straightforward discretisation of

    \frac{d}{dt} s(t) = -\nabla C(s(t))

is

    \frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n)    (Forward Euler)

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method which is (A-)stable is the implicit Euler method:

    \frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})

So why settle for explicit discretisation? Implicit requires the solution of a non-linear system at each step, and practically, explicit discretisation works.

There are also multistep methods, e.g.

    s_{n+2} = s_{n+1} - \varepsilon_{n+2} \left( \frac{3}{2} \nabla C(s_{n+1}) - \frac{1}{2} \nabla C(s_n) \right)

This is the 2nd-order Adams-Bashforth method.

It has similar (in)stability properties to forward Euler, but a discretisation error of a smaller order of magnitude at each step (O(\varepsilon^3) compared with O(\varepsilon^2)).
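To see the two discretisations side by side, here is a small self-contained sketch (not from the slides): forward Euler and two-step Adams-Bashforth applied to the gradient flow of a toy quadratic C(s) = ||s||^2, with a made-up fixed step size.

```python
import numpy as np

def forward_euler(grad_C, s0, eps, N):
    """Explicit Euler on ds/dt = -grad C(s): s_{n+1} = s_n - eps * grad C(s_n)."""
    s = np.asarray(s0, dtype=float)
    for _ in range(N):
        s = s - eps * grad_C(s)
    return s

def adams_bashforth2(grad_C, s0, eps, N):
    """Two-step Adams-Bashforth on the same ODE, bootstrapped with one Euler step."""
    s_prev = np.asarray(s0, dtype=float)
    s = s_prev - eps * grad_C(s_prev)
    for _ in range(N - 1):
        s, s_prev = s - eps * (1.5 * grad_C(s) - 0.5 * grad_C(s_prev)), s
    return s

grad_C = lambda s: 2.0 * s           # C(s) = ||s||^2, minimum at the origin
print(forward_euler(grad_C, [1.0, -2.0], eps=0.1, N=100))
print(adams_bashforth2(grad_C, [1.0, -2.0], eps=0.1, N=100))
```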

Examples of discretisation with \varepsilon = 0.1, \varepsilon = 0.07 and \varepsilon = 0.01 [animations in the original slides].

Proposition: If s_{n+1} = s_n - \varepsilon_n \nabla C(s_n) and C satisfies

1. C has a unique minimiser x^*;
2. \forall \varepsilon > 0, \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0;
3. \|\nabla C(x)\|_2^2 \le A + B \|x - x^*\|_2^2 for some A, B \ge 0;

then subject to \sum_n \varepsilon_n = \infty and \sum_n \varepsilon_n^2 < \infty, we have s_n \to x^*.

Proof structure: the proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.

Proof outline:

1. Define the Lyapunov sequence h_n = \|s_n - x^*\|_2^2.
2. Consider the positive variations h_n^+ = \max(0, h_{n+1} - h_n).
3. Show that \sum_{n=1}^{\infty} h_n^+ < \infty.
4. Show that this implies that h_n converges.
5. Show that h_n must converge to 0.
6. Conclude s_n \to x^*.

Step 1: define the Lyapunov sequence h_n = \|s_n - x^*\|_2^2. Because of the error that the discretisation introduced, h_n is not guaranteed to be decreasing, so we have to adjust the continuous-case proof.

Idea: h_n may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of h_n. [Illustration credit: Online Learning and Stochastic Approximation, Leon Bottou (1998)]

Step 2: consider the positive variations h_n^+ = \max(0, h_{n+1} - h_n). With some algebra, note that

    h_{n+1} - h_n = \langle s_{n+1} - x^*, s_{n+1} - x^* \rangle - \langle s_n - x^*, s_n - x^* \rangle
                  = \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2 \varepsilon_n \langle \nabla C(s_n), x^* \rangle
                  = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2

Step 3: show that \sum_{n=1}^{\infty} h_n^+ < \infty. Assuming \|\nabla C(x)\|_2^2 \le A + B \|x - x^*\|_2^2, we get

    h_{n+1} - h_n \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + B h_n)

    \Rightarrow \quad h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A

Writing \mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B} and h'_n = \mu_n h_n, we get

    h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A

Assuming \sum_{n=1}^{\infty} \varepsilon_n^2 < \infty, \mu_n converges away from 0, so the RHS is summable, giving \sum_{n=1}^{\infty} h_n^+ < \infty.

Step 4: show that this implies that h_n converges. If \sum_{n=1}^{\infty} \max(0, h_{n+1} - h_n) < \infty then, since h_n \ge 0, the negative variations must balance the positive ones: \sum_{n=1}^{\infty} \min(0, h_{n+1} - h_n) > -\infty. But

    h_{n+1} = h_0 + \sum_{k=1}^{n} \left( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \right),

so (h_n)_{n=1}^{\infty} converges.

Step 5: show that h_n must converge to 0. Assume h_n converges to some positive number. Previously we had

    h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2

This is summable, and if we assume further that \sum_{n=1}^{\infty} \varepsilon_n = \infty, then we get

    \langle s_n - x^*, \nabla C(s_n) \rangle \to 0,

contradicting h_n converging away from 0. Hence h_n \to 0, i.e. s_n \to x^*.


Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm. If C is a cost function averaged across a data set, we often have a true gradient of the form

    \nabla C(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x)

and an approximation is formed by subsampling:

    \widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \quad I_K \sim \mathrm{Unif}(\text{subsets of size } K)

We'll treat the more general case where the gradient estimates \widehat{\nabla C}(x) are unbiased and independent. The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
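A minimal sketch of the subsampled estimator on a made-up one-dimensional example (the data a_n and the objective are illustrative assumptions, not from the slides): each call draws a fresh I_K, so the estimate is noisy but unbiased.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 1000, 32
a = rng.normal(size=N)        # data: C(x) = (1/2N) * sum_n (x - a_n)^2

def full_grad(x):
    return np.mean(x - a)     # grad C(x) = (1/N) * sum_n (x - a_n)

def sub_grad(x):
    idx = rng.choice(N, size=K, replace=False)  # I_K ~ Unif(subsets of size K)
    return np.mean(x - a[idx])                  # E[sub_grad(x)] = full_grad(x)

x = 3.0
print(full_grad(x), sub_grad(x))   # the second is a noisy, unbiased estimate
```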

Measure-theory-free martingale theory

Let (X_n)_{n \ge 0} be a stochastic process. F_n denotes "the information describing the stochastic process up to time n", denoted mathematically by F_n = \sigma(X_m \mid m \le n).

Definition: a stochastic process (X_n)_{n=0}^{\infty} is a martingale (on a filtered probability space (\Omega, F, (F_n)_{n=0}^{\infty}, P) with F_n = \sigma(X_m \mid m \le n) for all n) if

- E[|X_n|] < \infty for all n;
- E[X_n \mid F_m] = X_m for n \ge m.

Martingale convergence theorem (Doob, 1953): if (X_n)_{n=1}^{\infty} is a martingale and \sup_n E[|X_n|] < \infty, then X_n \to X_\infty almost surely.

Quasimartingale convergence theorem (Fisk, 1965): if (X_n)_{n=1}^{\infty} is a positive stochastic process and

    \sum_{n=1}^{\infty} E\left[ (E[X_{n+1} \mid F_n] - X_n) \, 1_{\{E[X_{n+1} \mid F_n] - X_n > 0\}} \right] < \infty,

then X_n \to X_\infty almost surely.

Proposition: If s_{n+1} = s_n - \varepsilon_n H_n(s_n), with H_n(s_n) an unbiased estimator for \nabla C(s_n), and C satisfies

1. C has a unique minimiser x^*;
2. \forall \varepsilon > 0, \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0;
3. E[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2 for some A, B \ge 0 independent of n;

then subject to \sum_n \varepsilon_n = \infty and \sum_n \varepsilon_n^2 < \infty, we have s_n \to x^*.

Proof structure: the proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

Proof outline:

1. Define the Lyapunov process h_n = \|s_n - x^*\|_2^2.
2. Consider the variations h_{n+1} - h_n.
3. Show that h_n converges almost surely.
4. Show that h_n must converge to 0 almost surely.
5. Conclude s_n \to x^*.

Step 2: by exactly the same calculation as in the deterministic discrete case,

    h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2

so

    E[h_{n+1} - h_n \mid F_n] = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 E[\|H_n(s_n)\|_2^2 \mid F_n]

Step 3: show that h_n converges almost surely. Exactly the same as for discrete gradient descent:

- assume E[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2;
- introduce \mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B} and h'_n = \mu_n h_n.

We get

    E[h'_{n+1} - h'_n \mid F_n] \le \varepsilon_n^2 \mu_n A

    \Rightarrow \quad E\left[ (h'_{n+1} - h'_n) \, 1_{\{E[h'_{n+1} - h'_n \mid F_n] > 0\}} \mid F_n \right] \le \varepsilon_n^2 \mu_n A

\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty \Rightarrow quasimartingale convergence \Rightarrow (h'_n)_{n=1}^{\infty} converges a.s. \Rightarrow (h_n)_{n=1}^{\infty} converges a.s.

Step 4: show that h_n must converge to 0 almost surely. From the previous calculations,

    E[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid F_n] \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A

(h_n)_{n=1}^{\infty} converges, and assuming \sum_{n=1}^{\infty} \varepsilon_n^2 < \infty the right-hand term is summable a.s., so the left-hand side is also summable a.s.:

    \sum_{n=1}^{\infty} \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty \quad \text{almost surely.}

If we assume in addition that \sum_{n=1}^{\infty} \varepsilon_n = \infty, this forces

    \langle s_n - x^*, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely,}

and by our initial assumptions about C, (h_n)_{n=1}^{\infty} must converge to 0 almost surely.
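Putting the pieces together, here is a minimal sketch of SGD with Robbins-Monro step sizes \varepsilon_n = 1/n (so \sum \varepsilon_n = \infty and \sum \varepsilon_n^2 < \infty) on the toy objective used earlier; the data, mini-batch size, and iteration count are illustrative assumptions.

```python
import numpy as np

# C(x) = (1/2N) * sum_n (x - a_n)^2, with unique minimiser x* = mean(a).
rng = np.random.default_rng(3)
N, K = 1000, 16
a = rng.normal(loc=2.0, size=N)

x = 0.0
for n in range(1, 20001):
    idx = rng.choice(N, size=K, replace=False)
    H_n = np.mean(x - a[idx])      # unbiased estimate of grad C(x)
    x -= (1.0 / n) * H_n           # Robbins-Monro step size eps_n = 1/n
print(x, a.mean())                 # s_n settles near x*, as the proposition predicts
```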


Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- C is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: \sum_{n=1}^{\infty} \varepsilon_n = \infty, \sum_{n=1}^{\infty} \varepsilon_n^2 < \infty;
- the gradient estimate H satisfies E[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k for k = 2, 3, 4;
- there exists D > 0 such that \inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0.

The idea for the proof is then:

1. For a given start position, the sequence (s_n)_{n=1}^{\infty} is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that \nabla C(s_n) necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.


Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the classstated in the theorem

Solving general ODEs analytically is extremely difficult at best andusually impossible

To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically

There is no one best way to do this

15 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- $C$ is a non-negative, three-times differentiable function

- Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$

- Gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$

- There exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$

Idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely

2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence

3. Prove that $\nabla C(s_n)$ necessarily converges almost surely

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.


Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. So how do we choose learning rates to achieve fast convergence and a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule
grid search, random search, cross validation, Bayesian optimization

But I'm lazy and I don't want to spend too much time!

Motivation (cont.)

idea 2: let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
adaGrad [Duchi et al., 2010]
adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
ADAM [Kingma and Ba, 2014]
Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning

a good tutorial: [Shalev-Shwartz, 2012]
milestone paper: [Zinkevich, 2003]

and some practical comparisons

Online learning

From batch to online learning

Batch learning often assumes:

we've got the whole dataset $\mathcal{D}$
the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by

processing one data point or a mini-batch of data points each iteration
considering gradients with data-point cost functions (see the sketch below):

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$
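A minimal sketch of the subsampled estimator above (our illustration, with a hypothetical squared-error cost; `minibatch_grad` and its arguments are not from the slides):

```python
import numpy as np

def minibatch_grad(w, X, y, batch_size=32, rng=None):
    """Unbiased mini-batch estimate of grad C(w, D) for the illustrative
    squared loss c(w, x_m) = 0.5 * (w @ x_m - y_m)^2."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(X), size=batch_size, replace=False)  # I_K ~ Unif(subsets)
    Xb, yb = X[idx], y[idx]
    residuals = Xb @ w - yb
    return Xb.T @ residuals / batch_size  # (1/M) * sum of per-point gradients
```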

Let's forget the cost function on the batch for a moment.

online gradient descent (OGD):

each iteration $t$ we receive a loss $f_t(w)$
and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
then we update the weights with $w \leftarrow w - \eta g_t$

for SGD the received gradient is defined by

$$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$


Online learning assumes:

each iteration $t$ we receive a loss $f_t(w)$
(in the batch learning context, $f_t(w) = c(w, x_t)$)
then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$$R_T^\star = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \qquad (1)$$

General definition: $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization

Follow the Leader (FTL):

$$w_{t+1} = \underset{w \in S}{\operatorname{argmin}} \sum_{i=1}^t f_i(w) \qquad (2)$$

Lemma (upper bound of regret)
Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$:

$$R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})] \qquad (3)$$

A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\, w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled!
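A quick simulation of this game (our sketch; the helper names and the start point $w_1 = 0$ are our choices) shows FTL paying roughly one unit of loss per round while the best fixed point in hindsight pays $O(1)$, so the regret grows linearly in $T$:

```python
import numpy as np

def coeff(t):
    # f_t(w) = a_t * w on S = [-1, 1]
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T = 1000
cum = 0.0          # cumulative linear coefficient sum_i a_i
w = 0.0            # w_1
ftl_loss = 0.0
for t in range(1, T + 1):
    ftl_loss += coeff(t) * w
    cum += coeff(t)
    # FTL: argmin over [-1, 1] of cum * w sits at an endpoint of the interval
    w = -np.sign(cum) if cum != 0 else 0.0

best_fixed = min(sum(coeff(t) * u for t in range(1, T + 1)) for u in (-1.0, 0.0, 1.0))
print(ftl_loss - best_fixed)   # regret ~ T: FTL oscillates between the endpoints
```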

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \underset{w \in S}{\operatorname{argmin}} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret)
Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$:

$$\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]$$

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S, \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle \quad \forall g(w) \in \partial f_t(w)$$

use the linearisation $\tilde f_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle \quad \forall g_t \in \partial f_t(w_t)$$

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:

use the linearisation as surrogate loss (upper bounding the LHS)
use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
apply the FTRL regret bound to the surrogate loss

$$w_{t+1} = \underset{w \in S}{\operatorname{argmin}} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$$
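A minimal sketch of update (5), with an optional projection back onto $S$ (the function names and the example loss are ours, not from the slides):

```python
import numpy as np

def ogd(subgrad, w1, eta, T, project=lambda w: w):
    """Online gradient descent: w_{t+1} = Proj_S(w_t - eta * g_t).
    `subgrad(t, w)` returns some g_t in the subdifferential of f_t at w."""
    w = np.asarray(w1, dtype=float)
    iterates = [w.copy()]
    for t in range(1, T + 1):
        g = subgrad(t, w)
        w = project(w - eta * g)
        iterates.append(w.copy())
    return iterates

# usage: OGD on f_t(w) = |w - 1| over S = [-1, 1]
iters = ogd(lambda t, w: np.sign(w - 1.0), w1=[0.0], eta=0.1, T=50,
            project=lambda w: np.clip(w, -1.0, 1.0))
```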

Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$:

$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t)$$

Keep in mind for the later discussion of adaGrad:

- we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$;
then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_\star$,
and we use Holder's inequality in the proof.

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$:

$\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$$

Primal-dual sub-gradient update:

$$w_{t+1} = \underset{w}{\operatorname{argmin}}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \underset{w}{\operatorname{argmin}}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^T, \quad H_t = \delta I + \operatorname{diag}(G_t)^{1/2}$$

$$w_{t+1} = \underset{w \in S}{\operatorname{argmin}}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w \qquad (8)$$

$$w_{t+1} = \underset{w \in S}{\operatorname{argmin}}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:
add a proximal term or a Bregman divergence to the objective
adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2} w^T H_t w$$

example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \underset{w}{\operatorname{argmin}}\ \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T \big[\delta I + \operatorname{diag}(G_t)^{1/2}\big](w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \qquad (10)$$

(the division is taken elementwise)
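Update (10) is a one-liner per step; here is a minimal sketch (function names and defaults ours):

```python
import numpy as np

def adagrad(grad, w0, eta=0.1, delta=1e-8, T=1000):
    """adaGrad diagonal update, eq. (10):
    w_{t+1} = w_t - eta * g_t / (delta + sqrt(sum_i g_i^2)), elementwise.
    `grad(w)` supplies the (sub)gradient g_t."""
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)                  # running sum of squared gradients
    for _ in range(T):
        g = grad(w)
        G += g * g
        w = w - eta * g / (delta + np.sqrt(G))
    return w

# usage: minimise f(w) = 0.5 * ||w||^2
w = adagrad(lambda w: w, w0=[3.0, -2.0], eta=0.5)
```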

WHY? Time-varying learning rate vs. time-varying proximal term:

- we want the contribution of $\|g_t\|_\star^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$

- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot,\, H_t\, \cdot \rangle}$

- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$

- a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t,\star}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2$$
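For completeness (our addition; this standard fact is used implicitly above), the dual norm induced by $\psi_t$ is

$$\|g\|_{\psi_t,\star} = \sup_{\|x\|_{\psi_t} \le 1} \langle g, x \rangle = \sqrt{g^T H_t^{-1} g},$$

so each term $\|g_t\|_{\psi_t,\star}^2$ measures the current gradient against the accumulated gradient history in $H_t$, which is what makes the sum telescope coordinatewise into the $\|g_{1:T,d}\|_2$ terms.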

Theorem (Thm. 5 in the paper)
Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$:

$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$

For $w_t$ generated using the composite mirror descent update (9):

$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$

Improving adaGrad

possible problems of adaGrad:
the global learning rate $\eta$ still needs hand tuning
sensitive to initial conditions

possible improving directions:
use a truncated sum / running average
use (approximate) second-order information
add momentum
correct the bias of the running average

existing methods:
RMSprop [Tieleman and Hinton, 2012]
adaDelta [Zeiler, 2012]
ADAM [Kingma and Ba, 2014]

RMSprop:

$$G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^T, \quad H_t = (\delta I + \operatorname{diag}(G_t))^{1/2}$$

$$w_{t+1} = \underset{w}{\operatorname{argmin}}\ \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \quad \mathrm{RMS}[g]_t = \operatorname{diag}(H_t) \qquad (11)$$

(improving direction used here: truncated sum / running average)
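A minimal sketch of update (11) (names and defaults ours; $\mathrm{RMS}[g]_t$ realised as $\sqrt{\delta + G_t}$ elementwise):

```python
import numpy as np

def rmsprop(grad, w0, eta=0.01, rho=0.9, delta=1e-8, T=1000):
    """RMSprop: divide by a running RMS of the gradient magnitude."""
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)                     # running average of g^2
    for _ in range(T):
        g = grad(w)
        G = rho * G + (1.0 - rho) * g * g
        w = w - eta * g / np.sqrt(delta + G) # RMS[g]_t = sqrt(delta + G_t)
    return w
```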

$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \implies \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \implies \operatorname{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

(improving direction used here: approximate second-order information)

adaDelta:

$$\operatorname{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \underset{w}{\operatorname{argmin}}\ \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12)$$
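A minimal sketch of update (12), keeping running averages of both $g^2$ and $(\Delta w)^2$ (names and defaults ours; placing the $\delta$ inside both RMS terms follows common practice and is an assumption here):

```python
import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, T=1000):
    """adaDelta: the ratio of running RMS values replaces the global rate."""
    w = np.asarray(w0, dtype=float)
    Eg2 = np.zeros_like(w)    # running average of g^2
    Edw2 = np.zeros_like(w)   # running average of (Delta w)^2
    for _ in range(T):
        g = grad(w)
        Eg2 = rho * Eg2 + (1.0 - rho) * g * g
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
        Edw2 = rho * Edw2 + (1.0 - rho) * dw * dw
        w = w + dw
    return w
```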

ADAM:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat v_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \delta} \qquad (13)$$

(improving directions used here: momentum, plus bias correction of the running averages)
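A minimal sketch of update (13) (names and defaults ours):

```python
import numpy as np

def adam(grad, w0, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8, T=1000):
    """ADAM: momentum plus bias-corrected running averages."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)   # first-moment running average
    v = np.zeros_like(w)   # second-moment running average
    for t in range(1, T + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)   # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + delta)
    return w
```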

Demo time

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

idea 3: learn the (global) learning rates as well

$$w_{t+1} = w_t - \eta g_t \implies w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1})$$

$w_t$ is a function of $\eta$, so learn $\eta$ (with gradient descent) by

$$\eta = \underset{\eta}{\operatorname{argmin}} \sum_{s \le t} f_s(w(s, \eta))$$

reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]

stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
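Both cited approaches differentiate the training procedure itself with respect to $\eta$. As a heavily simplified stand-in (our construction, not the method of either paper), one can estimate $\frac{d}{d\eta} \sum_{s \le t} f_s(w(s, \eta))$ by finite differences over the unrolled updates; `unrolled_loss` and `grads_and_losses` are hypothetical helpers:

```python
import numpy as np

def unrolled_loss(eta, w1, grads_and_losses, t):
    """Replay t steps of w <- w - eta * g and return sum_{s<=t} f_s(w_s).
    `grads_and_losses(s, w)` must return (g_s, f_s(w))."""
    w, total = np.asarray(w1, dtype=float), 0.0
    for s in range(1, t + 1):
        g, f = grads_and_losses(s, w)
        total += f
        w = w - eta * g
    return total

def hypergradient(eta, w1, grads_and_losses, t, h=1e-5):
    # finite-difference estimate of the derivative of the unrolled loss
    # w.r.t. eta: a cheap stand-in for reverse-mode back-tracking
    return (unrolled_loss(eta + h, w1, grads_and_losses, t)
            - unrolled_loss(eta - h, w1, grads_and_losses, t)) / (2 * h)
```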

Summary

Recent successes of machine learning rely on stochastic approximation and optimisation methods.

Theoretical analysis guarantees good behaviour.

Adaptive learning rates provide good performance in practice.

Future work will reduce the labour spent on tuning hyper-parameters.

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.

Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.

Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.

Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.

Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.


Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27


Stochastic Gradient Descent

Proposition: If \(s_{n+1} = s_n - \varepsilon_n H_n(s_n)\), with \(H_n(s_n)\) an unbiased estimator for \(\nabla C(s_n)\), and C satisfies

1. C has a unique minimiser \(x^*\)

2. \(\forall \varepsilon > 0\), \(\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x)\rangle > 0\)

3. \(\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2\) for some \(A, B \ge 0\) independent of n

then subject to \(\sum_n \varepsilon_n = \infty\) and \(\sum_n \varepsilon_n^2 < \infty\), we have \(s_n \to x^*\).

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement give us the control we need over the unbiased gradient estimators.


Stochastic Gradient Descent

1. Define Lyapunov process \(h_n = \|s_n - x^*\|_2^2\)

2. Consider the variations \(h_{n+1} - h_n\)

3. Show that \(h_n\) converges almost surely

4. Show that \(h_n\) must converge to 0 almost surely

5. \(s_n \to x^*\)


Stochastic Gradient Descent

Consider the positive variations \(h_n^+ = \max(0, h_{n+1} - h_n)\).

By exactly the same calculation as in the deterministic discrete case,

\[ h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n)\rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2. \]

So

\[ \mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\, \mathbb{E}\big[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\big]. \]

Stochastic Gradient Descent

Show that \(h_n\) converges almost surely. Exactly the same as for discrete gradient descent:

I Assume \(\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2\)

I Introduce \(\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}\) and \(h'_n = \mu_n h_n\)

Get

\[ \mathbb{E}\big[h'_{n+1} - h'_n \mid \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A \]

\[ \implies \mathbb{E}\big[(h'_{n+1} - h'_n)\,\mathbf{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \,\big|\, \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A \]

\(\sum_{n=1}^\infty \varepsilon_n^2 < \infty\) \(\implies\) quasimartingale convergence \(\implies\) \((h'_n)_{n=1}^\infty\) converges a.s. \(\implies\) \((h_n)_{n=1}^\infty\) converges a.s.

Stochastic Gradient Descent

Show that \(h_n\) must converge to 0 almost surely. From previous calculations,

\[ \mathbb{E}\big[h_{n+1} - (1 + \varepsilon_n^2 B)h_n \,\big|\, \mathcal{F}_n\big] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A. \]

\((h_n)_{n=1}^\infty\) converges, so the left-hand side is summable a.s. We assume \(\sum_{n=1}^\infty \varepsilon_n^2 < \infty\), so the right term is summable a.s., and hence the remaining term is also summable a.s.:

\[ \sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle < \infty \quad \text{almost surely.} \]

If we assume in addition that \(\sum_{n=1}^\infty \varepsilon_n = \infty\), this forces

\[ \langle s_n - x^*, \nabla C(s_n)\rangle \to 0 \quad \text{almost surely,} \]

and by our initial assumptions about C, \((h_n)_{n=1}^\infty\) must converge to 0 almost surely.
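As an empirical sanity check of this result (our illustration, not part of the talk), the following sketch tracks the Lyapunov process \(h_n = \|s_n - x^*\|_2^2\) for SGD on a strongly convex quadratic, with unbiased Gaussian gradient noise and Robbins-Monro steps \(\varepsilon_n = 1/n\):

```python
import numpy as np

# Illustrative check (not from the talk): track h_n = ||s_n - x*||^2 for
# SGD on C(x) = ||x - x_star||^2 / 2, with unbiased gradient noise.
rng = np.random.default_rng(1)
x_star = np.array([2.0, -1.0])
s = np.zeros(2)

h = []
for n in range(1, 20001):
    eps_n = 1.0 / n                          # sum eps_n = inf, sum eps_n^2 < inf
    H_n = (s - x_star) + rng.normal(size=2)  # unbiased estimate of grad C(s)
    s = s - eps_n * H_n
    h.append(np.sum((s - x_star) ** 2))

print(h[0], h[-1])   # h_n should have decayed towards 0
```

Here \(\mathbb{E}[\|H_n(x)\|_2^2] = \|x - x^*\|_2^2 + 2\), so condition 3 of the proposition holds with \(A = 2\), \(B = 1\).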


Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

I C is a non-negative, three-times differentiable function

I the Robbins-Monro learning rate conditions hold: \(\sum_{n=1}^\infty \varepsilon_n = \infty\), \(\sum_{n=1}^\infty \varepsilon_n^2 < \infty\)

I the gradient estimate H satisfies \(\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k\|x\|^k\) for \(k = 2, 3, 4\)

I there exists \(D > 0\) such that \(\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0\)

The idea for the proof is then:

1. For a given start position, the sequence \((s_n)_{n=1}^\infty\) is confined to a bounded neighbourhood of 0 almost surely.

2. Introduce the Lyapunov function \(h_n = C(s_n)\) and prove its almost-sure convergence.

3. Prove that \(\nabla C(s_n)\) necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.


Motivation

So far we've told you why SGD is "safe" :)

but Robbins-Monro is just a sufficient condition;

then how do we choose learning rates to achieve

  fast convergence?
  a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012: grid search vs. random search over hyper-parameter space]

idea 1: search for the best learning rate schedule

  grid search, random search, cross validation
  Bayesian optimization

But I'm lazy and I don't want to spend too much time!

Motivation (cont.)

idea 2: let the learning rates adapt themselves

  pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
  adaGrad [Duchi et al., 2010]
  adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
  ADAM [Kingma and Ba, 2014]
  Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning

  a good tutorial: [Shalev-Shwartz, 2012]
  milestone paper: [Zinkevich, 2003]

and some practical comparisons

Online learning

From batch to online learning

Batch learning often assumes:

  we've got the whole dataset \(\mathcal{D}\)
  the cost function \(C(w; \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]\)

SGD / mini-batch learning accelerates training in real time by

  processing one datapoint / a mini-batch of datapoints each iteration
  considering gradients of the per-datapoint cost functions:

\[ \nabla C(w; \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m) \]

Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

online gradient descent (OGD):

  each iteration t, we receive a loss \(f_t(w)\)
  and a (noisy) (sub-)gradient \(g_t \in \partial f_t(w)\)
  then we update the weights with \(w \leftarrow w - \eta g_t\)

for SGD, the received gradient is defined by

\[ g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m) \]


Online learning

From batch to online learning (cont.)

Online learning assumes:

  each iteration t, we receive a loss \(f_t(w)\)
  (in the batch learning context, \(f_t(w) = c(w, x_t)\))
  then we learn \(w_{t+1}\) following some rules

Performance is measured by regret:

\[ R^*_T = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \tag{1} \]

General definition: \(R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]\)

SGD \(\Leftrightarrow\) "follow the regularized leader" with L2 regularization

Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):

\[ w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) \tag{2} \]

Lemma (upper bound of regret): Let \(w_1, w_2, \dots\) be the sequence produced by FTL; then \(\forall u \in S\),

\[ R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]. \tag{3} \]

Online learning

From batch to online learning (cont.)

A game that fools FTL:

Let \(w \in S = [-1, 1]\) and the loss function at time t be

\[ f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ even} \\ -w & t > 1 \text{ and } t \text{ odd.} \end{cases} \]

FTL is easily fooled: its iterates alternate between the endpoints \(-1\) and \(+1\), incurring loss 1 every round, while the fixed comparator \(u = 0\) incurs zero loss, so the regret grows linearly in T.
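A few lines of Python make the failure concrete (our sketch of the game above; the tie-break at t = 1 is an assumption):

```python
# A small sketch (our illustration) of the game above: FTL picks the
# minimiser of the cumulative loss over S = [-1, 1] each round.
def loss_coeff(t):
    """f_t(w) = c_t * w; return c_t for the adversarial sequence above."""
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T = 100
cum = 0.0            # cumulative coefficient: sum_i c_i
ftl_loss = 0.0
for t in range(1, T + 1):
    # FTL plays argmin_{w in [-1,1]} cum * w (break the t = 1 tie at w = 0)
    w = 0.0 if cum == 0 else (-1.0 if cum > 0 else 1.0)
    c = loss_coeff(t)
    ftl_loss += c * w
    cum += c

print(ftl_loss)      # grows like T, while u = 0 suffers 0 loss
```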

Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

\[ w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \tag{4} \]

Lemma (upper bound of regret): Let \(w_1, w_2, \dots\) be the sequence produced by FTRL; then \(\forall u \in S\),

\[ \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]. \]

Online learning

From batch to online learning (cont.)

Let's assume the losses \(f_t\) are convex functions:

\[ \forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle \quad \forall g(w) \in \partial f_t(w) \]

Use the linearisation \(\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t)\rangle\) as a surrogate; then

\[ \sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t\rangle - \langle u, g_t\rangle, \quad \forall g_t \in \partial f_t(w_t). \]

Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):

\[ w_{t+1} = w_t - \eta g_t, \qquad g_t \in \partial f_t(w_t) \tag{5} \]

From FTRL to online gradient descent:

  use the linearisation as a surrogate loss (it upper-bounds the LHS)
  use the L2 regularizer \(\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2\)
  apply the FTRL regret bound to the surrogate loss:

\[ w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i\rangle + \varphi(w) \]

Online learning

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions \(f_1, f_2, \dots, f_T\) with regularizer \(\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2\); then for all \(u \in S\),

\[ R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t). \]

Keep in mind for the later discussion of adaGrad:

  I want \(\varphi\) to be strongly convex w.r.t. a (semi-)norm \(\|\cdot\|\);
  then the L2 norm on the gradients is replaced by the dual norm \(\|\cdot\|_*\),
  and we use Hölder's inequality in the proof.

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms \(\psi_t\):

  \(\psi_t\) is a strongly-convex function, and

\[ B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle \]

Primal-dual sub-gradient update:

\[ w_{t+1} = \arg\min_w \; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \tag{6} \]

Proximal gradient / composite mirror descent:

\[ w_{t+1} = \arg\min_w \; \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \tag{7} \]

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

\[ G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2}) \]

\[ w_{t+1} = \arg\min_{w \in S} \; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t} w^T H_t w \tag{8} \]

\[ w_{t+1} = \arg\min_{w \in S} \; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \tag{9} \]

To get adaGrad from online gradient descent:

  add a proximal term or a Bregman divergence to the objective
  adapt the proximal term through time:

\[ \psi_t(w) = \frac{1}{2} w^T H_t w \]

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

example: composite mirror descent on \(S = \mathbb{R}^D\) with \(\varphi(w) = 0\):

\[ w_{t+1} = \arg\min_w \; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T \big[\delta I + \mathrm{diag}(G_t^{1/2})\big](w - w_t) \]

\[ \Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \tag{10} \]

(the division and square root are taken elementwise)
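In code, update (10) is only a few lines. A minimal sketch (ours; the quadratic objective and the constants are illustrative assumptions):

```python
import numpy as np

# Minimal sketch of the diagonal adaGrad update (10); the quadratic
# objective and constants here are illustrative assumptions.
def grad(w):
    return 2.0 * w           # gradient of ||w||^2

eta, delta = 0.5, 1e-8
w = np.array([1.0, -3.0])
G_diag = np.zeros_like(w)    # running sum of squared gradients, diag(G_t)

for t in range(100):
    g = grad(w)
    G_diag += g ** 2
    w -= eta * g / (delta + np.sqrt(G_diag))   # elementwise, as in (10)

print(w)   # approaches the minimiser at 0
```

The per-coordinate denominator grows with the accumulated squared gradients, so frequently-updated coordinates automatically receive smaller steps.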

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? time-varying learning rate vs. time-varying proximal term:

I want the contribution of the \(\|g_t\|_*^2\) terms induced by \(\psi_t\) to be upper-bounded as follows:

  define \(\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot,\, H_t\, \cdot\rangle}\)
  \(\psi_t(w) = \frac{1}{2} w^T H_t w\) is strongly convex w.r.t. \(\|\cdot\|_{\psi_t}\)
  a little math can show that

\[ \sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2 \]

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence \(\{w_t\}\) is generated by adaGrad using the primal-dual update (8), with \(\delta \ge \max_t \|g_t\|_\infty\); then for any \(u \in S\),

\[ R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2. \]

For \(\{w_t\}\) generated using the composite mirror descent update (9),

\[ R_T(u) \le \frac{1}{2\eta} \max_{t \le T}\|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2. \]

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad:

  the global learning rate \(\eta\) still needs hand-tuning
  sensitive to initial conditions

possible improving directions:

  use a truncated sum / running average
  use (approximate) second-order information
  add momentum
  correct the bias of the running average

existing methods:

  RMSprop [Tieleman and Hinton, 2012]
  adaDelta [Zeiler, 2012]
  ADAM [Kingma and Ba, 2014]

Advanced adaptive learning rates

Improving adaGrad (cont.)

RMSprop:

\[ G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2} \]

\[ w_{t+1} = \arg\min_w \; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \]

\[ \Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \tag{11} \]

addressed direction: use a running average instead of the full sum
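A matching sketch of update (11) (ours; the same toy objective as above so the updates can be compared side by side):

```python
import numpy as np

# Minimal sketch of the RMSprop update (11); the objective and the
# hyper-parameter values are illustrative assumptions.
def grad(w):
    return 2.0 * w           # gradient of ||w||^2

eta, rho, delta = 0.05, 0.9, 1e-8
w = np.array([1.0, -3.0])
G_diag = np.zeros_like(w)    # running average of squared gradients

for t in range(200):
    g = grad(w)
    G_diag = rho * G_diag + (1 - rho) * g ** 2
    w -= eta * g / np.sqrt(delta + G_diag)   # RMS[g]_t in the denominator

print(w)
```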

Advanced adaptive learning rates

Improving adaGrad (cont.)

The second-order idea behind adaDelta: for a diagonal Newton step,

\[ \Delta w \propto H^{-1} g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2} \quad \Rightarrow \quad \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w} \quad \Rightarrow \quad \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}} \]

Advanced adaptive learning rates

Improving adaGrad (cont.)

adaDelta:

\[ \mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}} \]

\[ w_{t+1} = \arg\min_w \; \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \]

\[ \Rightarrow \quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \tag{12} \]

addressed directions: use a running average; use (approximate) second-order information
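A sketch of update (12) with the two running averages from Zeiler's paper (ours; the decay rate and \(\delta\) are illustrative):

```python
import numpy as np

# Minimal sketch of the adaDelta update (12): running averages of the
# squared gradients and the squared updates; constants are illustrative.
def grad(w):
    return 2.0 * w          # gradient of ||w||^2

rho, delta = 0.95, 1e-6
w = np.array([1.0, -3.0])
Eg2 = np.zeros_like(w)      # running average of g^2
Edw2 = np.zeros_like(w)     # running average of (Delta w)^2

for t in range(500):
    g = grad(w)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g  # RMS ratio times g
    Edw2 = rho * Edw2 + (1 - rho) * dw ** 2
    w += dw

print(w)
```

Note there is no global \(\eta\) at all: the RMS ratio sets the scale of the step.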

Advanced adaptive learning rates

Improving adaGrad (cont.)

ADAM:

\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \]

\[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \]

\[ w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \tag{13} \]

addressed directions: add momentum; correct the bias of the running averages
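And a sketch of update (13) (ours; \(\beta_1\), \(\beta_2\), \(\delta\) set to the defaults suggested by Kingma and Ba, the objective illustrative as before):

```python
import numpy as np

# Minimal sketch of the ADAM update (13) with standard constants from
# Kingma and Ba (2014); the objective is an illustrative assumption.
def grad(w):
    return 2.0 * w          # gradient of ||w||^2

eta, beta1, beta2, delta = 0.1, 0.9, 0.999, 1e-8
w = np.array([1.0, -3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)

for t in range(1, 301):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)      # bias correction of the averages
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + delta)

print(w)
```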

Advanced adaptive learning rates

Demo time!

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
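In the same spirit as the demo, a self-contained comparison on the Rosenbrock function from that page (our stand-in for the demo script, not the talk's; step sizes and iteration counts are illustrative):

```python
import numpy as np

# Our stand-in for the demo: run two of the updates above on the Rosenbrock
# function f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, whose minimum is at (1, 1).
def rosenbrock_grad(w):
    x, y = w
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                     200 * (y - x ** 2)])

def run(optimizer, steps=20000):
    w = np.array([-1.0, 1.0])
    state = {}
    for t in range(1, steps + 1):
        w = optimizer(w, rosenbrock_grad(w), t, state)
    return w

def sgd(w, g, t, state, eta=2e-4):
    return w - eta * g

def adam(w, g, t, state, eta=2e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = state.get("m", np.zeros_like(w))
    v = state.get("v", np.zeros_like(w))
    state["m"] = m = b1 * m + (1 - b1) * g
    state["v"] = v = b2 * v + (1 - b2) * g ** 2
    return w - eta * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)

for name, opt in [("sgd", sgd), ("adam", adam)]:
    print(name, run(opt))   # both should move towards (1, 1)
```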

Advanced adaptive learning rates

Learn the learning rates

idea 3: learn the (global) learning rate as well

\[ w_{t+1} = w_t - \eta g_t \quad \Rightarrow \quad w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1}), \]

i.e. \(w_t\) is a function of \(\eta\);

learn \(\eta\) (with gradient descent) by

\[ \eta^* = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta)) \]

  reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
  stochastic gradient descent to learn \(\eta\) on the fly [Massé and Ollivier, 2015]
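For intuition only, a crude hypergradient sketch (ours; a forward-mode variant, not the algorithm of either paper): accumulate \(\mathrm{d}w_t/\mathrm{d}\eta\) alongside the updates, then take one gradient step on \(\eta\).

```python
import numpy as np

# Crude illustration (ours, not Maclaurin et al.'s or Masse & Ollivier's
# algorithm): forward-mode accumulation of dw/deta for a quadratic loss,
# followed by one hypergradient step on the global learning rate eta.
def grad(w):
    return 2.0 * w                          # gradient of f(w) = ||w||^2

eta = 0.01
w = np.array([1.0, -3.0])
dw_deta = np.zeros_like(w)                  # dw_t / deta

for t in range(50):
    g = grad(w)
    dg_deta = 2.0 * dw_deta                 # chain rule through grad(w)
    dw_deta = dw_deta - g - eta * dg_deta   # differentiate w <- w - eta * g
    w = w - eta * g

dloss_deta = grad(w) @ dw_deta              # d f(w_T) / deta
eta -= 0.001 * dloss_deta                   # hypergradient step on eta
print(eta)
```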

Advanced adaptive learning rates

Summary

  Recent successes of machine learning rely on stochastic approximation and optimisation methods.
  Theoretical analysis guarantees good behaviour.
  Adaptive learning rates provide good performance in practice.
  Future work will reduce the labour of tuning hyper-parameters.

Advanced adaptive learning rates

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.

Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.

Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Tieleman, T., Hinton, G. Lecture 6.5 - RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.

Kingma, D., Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.

Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Massé, P.-Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 40: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 41: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Discrete Gradient DescentAn intuitive straightforward discretisation of

d

dts(t) = minusnablaC (s(t))

issn+1 minus sn

εn= minusnablaC (sn) (ForwardEuler)

Although it is easy and quick to implement it has theoretical(A-)instability issues

An alternative method which is (A-)stable is the implicit Euler method

sn+1 minus snεn

= minusnablaC (sn+1)

So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks

16 of 27

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Show that $h_n$ must converge to 0. Assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2\varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\|\nabla C(s_n)\|_2^2$$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$$\langle s_n - x^*, \nabla C(s_n)\rangle \to 0$$

contradicting $h_n$ converging away from 0.

Finally, since $h_n = \|s_n - x^*\|_2^2 \to 0$, we conclude $s_n \to x^*$.
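To see the scheme in action, here is a minimal sketch (not from the slides; the quadratic cost and the schedule $\varepsilon_n = 1/(n+1)$ are illustrative choices) of discrete gradient descent with Robbins-Monro step sizes:

```python
import numpy as np

def grad_C(x):
    # Toy cost C(x) = 0.5 * ||x||_2^2, so grad C(x) = x and x* = 0.
    # Conditions 1-3 hold with A = 0, B = 1.
    return x

s = np.array([5.0, -3.0])
for n in range(1, 10001):
    eps_n = 1.0 / (n + 1)        # sum eps_n = inf, sum eps_n^2 < inf
    s = s - eps_n * grad_C(s)

print(np.linalg.norm(s))         # -> 0: the iterates approach x*
```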

Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N}\sum_{n=1}^N f_n(x)$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K}\sum_{k\in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

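As a concrete illustration (a sketch, not code from the talk; `f_grads` stands for a hypothetical list of the per-datapoint gradient functions $f_n$), the two estimators can be written as:

```python
import numpy as np

def full_gradient(f_grads, x):
    # nabla C(x) = (1/N) * sum_n f_n(x)
    return np.mean([f(x) for f in f_grads], axis=0)

def subsampled_gradient(f_grads, x, K, rng):
    # Unbiased estimate: average f_k(x) over a uniformly random
    # subset I_K of size K, drawn without replacement.
    idx = rng.choice(len(f_grads), size=K, replace=False)
    return np.mean([f_grads[k](x) for k in idx], axis=0)
```

Since each index is equally likely to appear in $I_K$, the subsampled average has expectation $\nabla C(x)$, which is exactly the unbiasedness the analysis below relies on.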

Measure-theory-free martingale theory

Let $(X_n)_{n\ge 0}$ be a stochastic process. $\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if
- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\big[(\mathbb{E}[X_{n+1}\mid\mathcal{F}_n] - X_n)\,\mathbb{1}_{\mathbb{E}[X_{n+1}\mid\mathcal{F}_n] - X_n > 0}\big] < \infty$$

then $X_n \to X_\infty$ almost surely.

³on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m\le n)$ for all $n$


Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies
1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x-x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x)\rangle > 0$
3. $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

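A minimal sketch of the iteration in the proposition, on a toy quadratic cost with additive Gaussian noise in the gradient estimate (both illustrative assumptions, chosen so that conditions 1-3 are easy to verify):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_C(x):
    return x                       # C(x) = 0.5 * ||x||_2^2, so x* = 0

def H_n(x):
    # Unbiased estimator: E[H_n(x)] = grad_C(x), and
    # E[||H_n(x)||^2] = ||x - x*||^2 + dim, i.e. A = dim, B = 1.
    return grad_C(x) + rng.normal(size=x.shape)

s = np.array([5.0, -3.0])
for n in range(1, 100001):
    s = s - (1.0 / (n + 1)) * H_n(s)   # Robbins-Monro step sizes

print(np.linalg.norm(s))           # small: s_n -> x* almost surely
```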

Proof outline:
1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$


Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n\langle s_n - x^*, H_n(s_n)\rangle + \varepsilon_n^2\|H_n(s_n)\|_2^2$$

So

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\,\mathbb{E}\big[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\big]$$

Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:
- Assume $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B\|x - x^*\|_2^2$
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1+\varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

Get

$$\mathbb{E}\big[h'_{n+1} - h'_n \mid \mathcal{F}_n\big] \le \varepsilon_n^2\mu_n A$$
$$\Rightarrow\quad \mathbb{E}\big[(h'_{n+1} - h'_n)\,\mathbb{1}_{\mathbb{E}[h'_{n+1} - h'_n\mid\mathcal{F}_n]>0} \mid \mathcal{F}_n\big] \le \varepsilon_n^2\mu_n A$$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\Rightarrow$ quasimartingale convergence $\Rightarrow$ $(h'_n)_{n=1}^\infty$ converges a.s. $\Rightarrow$ $(h_n)_{n=1}^\infty$ converges a.s.

Show that $h_n$ must converge to 0 almost surely. From previous calculations,

$$\mathbb{E}\big[h_{n+1} - (1+\varepsilon_n^2 B)h_n \mid \mathcal{F}_n\big] \le -2\varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A$$

$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable a.s. Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, the right term is summable a.s., so the inner-product term is also summable a.s.:

$$\sum_{n=1}^\infty \varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle < \infty \quad\text{almost surely}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^*, \nabla C(s_n)\rangle \to 0 \quad\text{almost surely}$$

And by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely, giving $s_n \to x^*$ almost surely.


Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: Assume
- $C$ is a non-negative, three-times differentiable function
- The Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- The gradient estimate $H$ satisfies $\mathbb{E}\big[\|H_n(x)\|^k\big] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$
- There exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0$

The idea for the proof is then:
1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.


Motivation

So far we've told you why SGD is "safe" :)
- but Robbins-Monro is just a sufficient condition
- so how do we choose learning rates to achieve fast convergence and a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule
- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time!

Motivation (cont.)

idea 2: let the learning rates adapt themselves
- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning
- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and some practical comparisons

Online learning: From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x\sim\mathcal{D}}[c(w, x)]$

SGD / mini-batch learning: accelerate training in real time by
- processing one datapoint / a mini-batch of datapoints each iteration
- considering gradients with per-datapoint cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment:

online gradient descent (OGD)
- each iteration $t$, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

for SGD, the received gradient is defined by

$$g_t = \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$


From batch to online learning (cont.)

Online learning assumes:
- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w\in S}\sum_{t=1}^T f_t(w) \qquad (1)$$

General definition: $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization

From batch to online learning (cont.)

Follow the Leader (FTL):

$$w_{t+1} = \arg\min_{w\in S}\sum_{i=1}^t f_i(w) \qquad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$:

$$R_T(u) = \sum_{t=1}^T[f_t(w_t) - f_t(u)] \le \sum_{t=1}^T[f_t(w_t) - f_t(w_{t+1})] \qquad (3)$$

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$, and let the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\,w & t = 1\\ w & t \text{ is even}\\ -w & t > 1 \text{ and } t \text{ is odd}\end{cases}$$

FTL is easily fooled (see the simulation sketch below)!
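A quick simulation of this game (a sketch under the definitions above; the tie-breaking choice $w_1 = 0$ for the empty minimisation is ours) confirms that FTL's regret against the comparator $u = 0$ grows linearly in $T$:

```python
def loss_coeff(t):
    # f_t(w) = a_t * w on S = [-1, 1]
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T = 1000
cum, ftl_loss = 0.0, 0.0
for t in range(1, T + 1):
    # FTL plays a minimiser of sum_{i<t} f_i(w) = cum * w over [-1, 1]
    w = 0.0 if cum == 0 else (1.0 if cum < 0 else -1.0)
    ftl_loss += loss_coeff(t) * w
    cum += loss_coeff(t)

# u = 0 suffers zero loss, so FTL's regret is ftl_loss, roughly T - 1.
print(ftl_loss)
```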

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \arg\min_{w\in S}\sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$:

$$\sum_{t=1}^T[f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T[f_t(w_t) - f_t(w_{t+1})]$$

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S:\quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle \quad \forall g(w)\in\partial f_t(w)$$

Use the linearisation $\tilde f_t(w) = f_t(w_t) + \langle w, g(w_t)\rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T\langle w_t, g_t\rangle - \langle u, g_t\rangle \quad \forall g_t\in\partial f_t(w_t)$$

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t,\qquad g_t\in\partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:
- use the linearisation as surrogate loss (it upper bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \arg\min_{w\in S}\sum_{i=1}^t\langle w, g_i\rangle + \varphi(w)$$
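A minimal sketch of projected OGD, assuming for illustration that $S$ is a Euclidean ball so the projection step has a closed form (the interface and defaults are ours):

```python
import numpy as np

def project(w, radius=1.0):
    # Euclidean projection onto S = {w : ||w||_2 <= radius}.
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def ogd(subgrads, w1, eta):
    # subgrads: sequence of callables returning g_t in d f_t(w_t).
    w = np.array(w1, dtype=float)
    history = [w.copy()]
    for g_fn in subgrads:
        w = project(w - eta * g_fn(w))   # w <- w - eta * g_t, then project
        history.append(w.copy())
    return history
```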

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u\in S$:

$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta\sum_{t=1}^T\|g_t\|_2^2,\qquad g_t\in\partial f_t(w_t)$$

Keep in mind for the later discussion of adaGrad:
- we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$
- then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_*$
- and we use Hölder's inequality in the proof

Advanced adaptive learning rates: adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$:
- $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle\nabla\psi_t(w_t), w - w_t\rangle$$

- Primal-dual sub-gradient update:

$$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \qquad (6)$$

- Proximal gradient / composite mirror descent:

$$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^T,\qquad H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$$

$$w_{t+1} = \arg\min_{w\in S}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t}w^T H_t w \qquad (8)$$

$$w_{t+1} = \arg\min_{w\in S}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t(w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2}w^T H_t w$$

adaGrad [Duchi et al., 2010] (cont.)

example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T\big[\delta I + \mathrm{diag}(G_t)^{1/2}\big](w - w_t)$$

$$\Rightarrow\quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \qquad (10)$$

(operations elementwise)
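A sketch of the elementwise update (10) (the hyper-parameter defaults here are illustrative, not from the paper):

```python
import numpy as np

def adagrad(grad, w0, eta=0.1, delta=1e-8, n_steps=1000):
    # Diagonal adaGrad, composite mirror descent form, eq. (10):
    # w_{t+1} = w_t - eta * g_t / (delta + sqrt(sum_{i<=t} g_i^2)),
    # with all operations elementwise.
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)            # running sum of squared gradients
    for _ in range(n_steps):
        g = grad(w)
        G += g ** 2
        w -= eta * g / (delta + np.sqrt(G))
    return w
```

Coordinates that consistently see large gradients get small effective step sizes, and rarely-updated coordinates keep large ones, which is the point of the per-coordinate denominator.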

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:
- we want the contribution of the dual-norm terms $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded in terms of $\|g_t\|_2$
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle\cdot,\, H_t\,\cdot\rangle}$
- $\psi_t(w) = \frac{1}{2}w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$$\sum_{t=1}^T\|g_t\|_{\psi_t^*}^2 \le 2\sum_{d=1}^D\|g_{1:T,d}\|_2$$

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t\|g_t\|_\infty$; then for any $u\in S$:

$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2\sum_{d=1}^D\|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D\|g_{1:T,d}\|_2$$

For $w_t$ generated using the composite mirror descent update (9):

$$R_T(u) \le \frac{1}{2\eta}\max_{t\le T}\|u - w_t\|_2^2\sum_{d=1}^D\|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D\|g_{1:T,d}\|_2$$

Improving adaGrad

possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont.)

RMSprop (improving direction: a running average in place of adaGrad's growing sum):

$$G_t = \rho G_{t-1} + (1-\rho)g_t g_t^T,\qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t(w - w_t)$$

$$\Rightarrow\quad w_{t+1} = w_t - \frac{\eta\, g_t}{RMS[g]_t},\qquad RMS[g]_t = \mathrm{diag}(H_t) \qquad (11)$$
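A sketch of update (11) (defaults illustrative):

```python
import numpy as np

def rmsprop(grad, w0, eta=0.01, rho=0.9, delta=1e-8, n_steps=1000):
    # Eq. (11): exponentially-weighted running average of g_t^2
    # replaces adaGrad's growing sum, so old gradients are forgotten.
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)            # running average of g_t^2
    for _ in range(n_steps):
        g = grad(w)
        G = rho * G + (1 - rho) * g ** 2
        w -= eta * g / np.sqrt(delta + G)
    return w
```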

Improving adaGrad (cont.)

The (approximate) second-order intuition behind adaDelta:

$$\Delta w \propto H^{-1}g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2} \quad\Rightarrow\quad \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w} \quad\Rightarrow\quad \mathrm{diag}(H) \approx \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}$$

Improving adaGrad (cont.)

adaDelta (improving directions: running average plus approximate second-order information):

$$\mathrm{diag}(H_t) = \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}$$

$$w_{t+1} = \arg\min_w\ \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t(w - w_t)$$

$$\Rightarrow\quad w_{t+1} = w_t - \frac{RMS[\Delta w]_{t-1}}{RMS[g]_t}\, g_t \qquad (12)$$
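A sketch of update (12), following [Zeiler, 2012] in adding $\delta$ inside both RMS terms for numerical stability (the exact placement of $\delta$ is an assumption here):

```python
import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, n_steps=1000):
    # Eq. (12): the ratio RMS[dw]_{t-1} / RMS[g]_t sets the step size,
    # so no global learning rate eta is needed.
    w = np.array(w0, dtype=float)
    Eg2 = np.zeros_like(w)          # running average of g_t^2
    Edw2 = np.zeros_like(w)         # running average of (delta w)^2
    for _ in range(n_steps):
        g = grad(w)
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2
        dw = -(np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta)) * g
        Edw2 = rho * Edw2 + (1 - rho) * dw ** 2
        w += dw
    return w
```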

Improving adaGrad (cont.)

ADAM (improving directions: running averages, momentum, and bias correction):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t,\qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$$

$$\hat m_t = \frac{m_t}{1-\beta_1^t},\qquad \hat v_t = \frac{v_t}{1-\beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta\,\hat m_t}{\sqrt{\hat v_t} + \delta} \qquad (13)$$
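A sketch of update (13) (the defaults are the commonly quoted ADAM settings; treat them as illustrative):

```python
import numpy as np

def adam(grad, w0, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8,
         n_steps=1000):
    # Eq. (13): momentum on g_t, a running average of g_t^2, and
    # bias correction of both running averages (they start at zero).
    w = np.array(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, n_steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w
```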

Demo time

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

idea 3: learn the (global) learning rates as well

$$w_{t+1} = w_t - \eta g_t \quad\Rightarrow\quad w_t = w(t, \eta, w_1, \{g_i\}_{i\le t-1})$$

- $w_t$ is a function of $\eta$
- learn $\eta$ (with gradient descent) by

$$\eta^* = \arg\min_\eta \sum_{s\le t} f_s(w(s, \eta))$$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods
- Theoretical analysis guarantees good behaviour
- Adaptive learning rates provide good performance in practice
- Future work will reduce labour on tuning hyper-parameters

References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D., and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 42: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Discrete Gradient Descent

Also have multistep methods eg

sn+2 = sn+1 + εn+2

(3

2nablaC (sn+1)minus 1

2nablaC (sn)

)This is the 2nd-order Adams-Bashforth method

Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))

17 of 27

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 43: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Discrete Gradient DescentExamples of discretisation with ε = 01

18 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

Step 2: Consider the positive variations h_n⁺ = max(0, h_{n+1} − h_n).

With some algebra, note that

h_{n+1} − h_n = ⟨s_{n+1} − x*, s_{n+1} − x*⟩ − ⟨s_n − x*, s_n − x*⟩
             = ⟨s_{n+1}, s_{n+1}⟩ − ⟨s_n, s_n⟩ − 2⟨s_{n+1} − s_n, x*⟩
             = ⟨s_n − ε_n ∇C(s_n), s_n − ε_n ∇C(s_n)⟩ − ⟨s_n, s_n⟩ + 2ε_n ⟨∇C(s_n), x*⟩
             = −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n² ‖∇C(s_n)‖₂²

22 of 27

Discrete Gradient Descent

Step 3: Show that ∑_{n=1}^∞ h_n⁺ < ∞.

Assuming ‖∇C(x)‖₂² ≤ A + B‖x − x*‖₂², we get

h_{n+1} − h_n ≤ −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n² (A + B h_n)

⇒ h_{n+1} − (1 + ε_n² B) h_n ≤ −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n² A
                              ≤ ε_n² A

22 of 27

Discrete Gradient Descent

Step 3 (cont.): From the above,

h_{n+1} − (1 + ε_n² B) h_n ≤ ε_n² A

Writing μ_n = ∏_{i=1}^n 1/(1 + ε_i² B) and h'_n = μ_n h_n, we get

h'_{n+1} − h'_n ≤ ε_n² μ_n A

Assuming ∑_{n=1}^∞ ε_n² < ∞, μ_n converges away from 0, so the RHS is summable.

22 of 27
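A quick numerical sanity check of the last claim (an illustration, not from the slides), assuming ε_n = 1/n and B = 1:

```python
import numpy as np

# mu_n = prod_{i=1}^n 1/(1 + eps_i^2 * B); since sum eps_n^2 < infinity,
# mu_n should converge to a strictly positive limit.
B = 1.0
eps = 1.0 / np.arange(1, 100_001)
mu = np.cumprod(1.0 / (1.0 + eps**2 * B))
print(mu[-1])   # ~0.27 (the limit is pi/sinh(pi)), bounded away from 0
```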

Discrete Gradient Descent

Step 4: Show that this implies that h_n converges.

If ∑_{n=1}^∞ max(0, h_{n+1} − h_n) < ∞, then ∑_{n=1}^∞ min(0, h_{n+1} − h_n) > −∞. (Why? Because h_n ≥ 0: if the negative variations summed to −∞ while the positive ones stayed finite, h_n would have to tend to −∞.)

But h_{n+1} = h_1 + ∑_{k=1}^n (max(0, h_{k+1} − h_k) + min(0, h_{k+1} − h_k)).

So (h_n)_{n=1}^∞ converges.

22 of 27

Discrete Gradient Descent

Step 5: Show that h_n must converge to 0. Assume h_n converges to some positive number. Previously we had

h_{n+1} − h_n = −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n² ‖∇C(s_n)‖₂²

This is summable, and if we assume further that ∑_{n=1}^∞ ε_n = ∞, then we get

⟨s_n − x*, ∇C(s_n)⟩ → 0

contradicting h_n converging away from 0.

22 of 27


Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If C is a cost function averaged across a data set, we often have a true gradient of the form

∇C(x) = (1/N) ∑_{n=1}^N f_n(x)

and an approximation is formed by subsampling:

∇C(x) ≈ (1/K) ∑_{k∈I_K} f_k(x),  I_K ~ Unif(subsets of size K)

We'll treat the more general case where the gradient estimates are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

23 of 27

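As an illustration of the subsampled estimator (a sketch under assumed synthetic data, not from the slides), consider a least-squares cost where f_n(x) = (a_nᵀx − b_n) a_n:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 1000, 5, 32
A = rng.normal(size=(N, D))
b = A @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

def subsampled_grad(x):
    # I_K ~ Unif(subsets of size K); the subset average is an unbiased
    # estimate of grad C(x) = (1/N) sum_n (a_n^T x - b_n) a_n.
    idx = rng.choice(N, size=K, replace=False)
    return A[idx].T @ (A[idx] @ x - b[idx]) / K

x = np.zeros(D)
for n in range(1, 5001):
    x -= (1.0 / n) * subsampled_grad(x)
```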

Measure-theory-free martingale theory

Let (X_n)_{n≥0} be a stochastic process.

F_n denotes "the information describing the stochastic process up to time n", denoted mathematically by F_n = σ(X_m | m ≤ n).

Definition: A stochastic process (X_n)_{n=0}^∞ is a martingale³ if

I E[|X_n|] < ∞ for all n
I E[X_n | F_m] = X_m for n ≥ m

Martingale convergence theorem (Doob, 1953): If (X_n)_{n=1}^∞ is a martingale and sup_n E[|X_n|] < ∞, then X_n → X_∞ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If (X_n)_{n=1}^∞ is a positive stochastic process and

∑_{n=1}^∞ E[(E[X_{n+1} | F_n] − X_n) 1{E[X_{n+1} | F_n] − X_n > 0}] < ∞

then X_n → X_∞ almost surely.

³ on a filtered probability space (Ω, F, (F_n)_{n=0}^∞, P) with F_n = σ(X_m | m ≤ n) ∀n

24 of 27

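A tiny simulation of the convergence theorem (an illustration, not from the slides): the running product of i.i.d. Uniform(0, 2) variables is a positive martingale with E[X_n] = 1, so Doob's theorem applies; here the almost-sure limit happens to be 0.

```python
import numpy as np

# X_n = U_1 * ... * U_n with U_i ~ Uniform(0, 2): E[X_n | F_{n-1}] = X_{n-1}
# and E[|X_n|] = 1 for all n, so X_n converges almost surely (to 0 here,
# since E[log U] = log 2 - 1 < 0).
rng = np.random.default_rng(1)
X = np.cumprod(rng.uniform(0.0, 2.0, size=100_000))
print(X[-1])   # ~0 on almost every sample path
```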

Stochastic Gradient Descent

Proposition: If s_{n+1} = s_n − ε_n H_n(s_n), with H_n(s_n) an unbiased estimator for ∇C(s_n), and C satisfies

1. C has a unique minimiser x*
2. ∀ε > 0, inf_{‖x − x*‖₂² > ε} ⟨x − x*, ∇C(x)⟩ > 0
3. E[‖H_n(x)‖₂²] ≤ A + B‖x − x*‖₂² for some A, B ≥ 0 independent of n

then, subject to ∑_n ε_n = ∞ and ∑_n ε_n² < ∞, we have s_n → x* (almost surely).

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

25 of 27

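A minimal sketch of the proposition in action (assumed toy setting, not from the slides): H_n(x) = ∇C(x) + Gaussian noise is unbiased, and condition 3 holds with A equal to the total noise variance and B = 1.

```python
import numpy as np

# Assumed: C(x) = 0.5 * ||x - x_star||^2, H_n(x) = (x - x_star) + noise.
rng = np.random.default_rng(2)
x_star = np.array([3.0, -1.0])

def H(x):
    return (x - x_star) + rng.normal(scale=1.0, size=2)   # unbiased estimator

s = np.zeros(2)
for n in range(1, 100_001):
    s -= (1.0 / n) * H(s)              # Robbins-Monro step sizes

print(np.linalg.norm(s - x_star))      # small; s_n -> x_star almost surely
```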

Stochastic Gradient Descent

1. Define Lyapunov process h_n = ‖s_n − x*‖₂²
2. Consider the variations h_{n+1} − h_n
3. Show that h_n converges almost surely
4. Show that h_n must converge to 0 almost surely
5. s_n → x*

26 of 27


Stochastic Gradient Descent

Step 2: Consider the positive variations h_n⁺ = max(0, h_{n+1} − h_n).

By exactly the same calculation as in the deterministic discrete case,

h_{n+1} − h_n = −2ε_n ⟨s_n − x*, H_n(s_n)⟩ + ε_n² ‖H_n(s_n)‖₂²

So

E[h_{n+1} − h_n | F_n] = −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n² E[‖H_n(s_n)‖₂² | F_n]

26 of 27

Stochastic Gradient Descent

Step 3: Show that h_n converges almost surely. Exactly the same as for discrete gradient descent:

I Assume E[‖H_n(x)‖₂²] ≤ A + B‖x − x*‖₂²
I Introduce μ_n = ∏_{i=1}^n 1/(1 + ε_i² B) and h'_n = μ_n h_n

Get

E[h'_{n+1} − h'_n | F_n] ≤ ε_n² μ_n A

⇒ E[(h'_{n+1} − h'_n) 1{E[h'_{n+1} − h'_n | F_n] > 0} | F_n] ≤ ε_n² μ_n A

∑_{n=1}^∞ ε_n² < ∞ ⇒ quasimartingale convergence ⇒ (h'_n)_{n=1}^∞ converges a.s. ⇒ (h_n)_{n=1}^∞ converges a.s.

26 of 27

Stochastic Gradient Descent

Step 4: Show that h_n must converge to 0 almost surely. From previous calculations,

E[h_{n+1} − (1 + ε_n² B) h_n | F_n] ≤ −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n² A

(h_n)_{n=1}^∞ converges, so the left side is summable a.s. We assume ∑_{n=1}^∞ ε_n² < ∞, so the right term is summable a.s., and hence so is the inner-product term:

∑_{n=1}^∞ ε_n ⟨s_n − x*, ∇C(s_n)⟩ < ∞  almost surely

If we assume in addition that ∑_{n=1}^∞ ε_n = ∞, this forces

⟨s_n − x*, ∇C(s_n)⟩ → 0  almost surely

And by our initial assumptions about C, (h_n)_{n=1}^∞ must converge to 0 almost surely.

26 of 27


Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

I C is a non-negative, three-times differentiable function
I the Robbins-Monro learning rate conditions hold: ∑_{n=1}^∞ ε_n = ∞, ∑_{n=1}^∞ ε_n² < ∞
I the gradient estimate H satisfies E[‖H_n(x)‖^k] ≤ A_k + B_k ‖x‖^k for k = 2, 3, 4
I there exists D > 0 such that inf_{‖x‖₂ > D} ⟨x, ∇C(x)⟩ > 0

The idea for the proof is then:

1. For a given start position, the sequence (s_n)_{n=1}^∞ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that ∇C(s_n) necessarily converges to 0 almost surely.

Note that this only guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.

27 of 27


Motivation

So far we've told you why SGD is "safe" :)

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

  fast convergence
  better local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 1 / 27

Motivation (cont.)

[Figure from [Bergstra and Bengio, 2012] omitted]

idea 1: search for the best learning rate schedule

  grid search, random search, cross validation
  Bayesian optimization

But I'm lazy and I don't want to spend too much time.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 2 / 27

Motivation (cont.)

idea 2: let the learning rates adapt themselves

  pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
  adaGrad [Duchi et al., 2010]
  adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
  ADAM [Kingma and Ba, 2014]
  Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning

  a good tutorial: [Shalev-Shwartz, 2012]
  milestone paper: [Zinkevich, 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 3 / 27

Online learning

From batch to online learning

Batch learning often assumes:

  we've got the whole dataset D
  the cost function is C(w, D) = E_{x~D}[c(w, x)]

SGD / mini-batch learning accelerates training in real time by

  processing one / a mini-batch of datapoints each iteration
  considering gradients with data-point cost functions:

∇C(w, D) ≈ (1/M) ∑_{m=1}^M ∇c(w, x_m)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 4 / 27

Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

online gradient descent (OGD):

  each iteration t we receive a loss f_t(w)
  and a (noisy) (sub-)gradient g_t ∈ ∂f_t(w)
  then we update the weights with w ← w − η g_t

for SGD the received gradient is defined by

g_t = (1/M) ∑_{m=1}^M ∇c(w, x_m)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 5 / 27
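A minimal OGD sketch (assumed synthetic stream, not from the slides), with one squared loss f_t(w) = 0.5(wᵀx_t − y_t)² arriving per round:

```python
import numpy as np

rng = np.random.default_rng(3)
w_true = np.array([1.0, -0.5, 2.0])
w, eta = np.zeros(3), 0.05

for t in range(1, 2001):
    x_t = rng.normal(size=3)
    y_t = w_true @ x_t                 # the stream reveals (x_t, y_t)
    g_t = (w @ x_t - y_t) * x_t        # gradient of f_t at the current w
    w -= eta * g_t                     # w <- w - eta * g_t
```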

Online learning

From batch to online learning (cont.) [figure omitted]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 6 / 27

Online learning

From batch to online learning (cont.)

Online learning assumes:

  each iteration t we receive a loss f_t(w)
  (in the batch learning context, f_t(w) = c(w, x_t))
  then we learn w_{t+1} following some rules

Performance is measured by regret:

R*_T = ∑_{t=1}^T f_t(w_t) − inf_{w∈S} ∑_{t=1}^T f_t(w)    (1)

General definition: R_T(u) = ∑_{t=1}^T [f_t(w_t) − f_t(u)]

SGD ⇔ "follow the regularized leader" with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 7 / 27

Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):

w_{t+1} = argmin_{w∈S} ∑_{i=1}^t f_i(w)    (2)

Lemma (upper bound of regret)
Let w_1, w_2, ... be the sequence produced by FTL; then ∀u ∈ S:

R_T(u) = ∑_{t=1}^T [f_t(w_t) − f_t(u)] ≤ ∑_{t=1}^T [f_t(w_t) − f_t(w_{t+1})]    (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 8 / 27

Online learning

From batch to online learning (cont.)

A game that fools FTL: let w ∈ S = [−1, 1] and the loss function at time t be

f_t(w) = −0.5w   if t = 1
       = w       if t is even
       = −w      if t > 1 and t is odd

FTL is easily fooled!

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 9 / 27
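A short simulation of the fooling game (an illustration, not from the slides; w_1 = 0 is assumed and argmin ties are broken by grid order): FTL's leader flips between +1 and −1 every round, so it pays loss 1 at every step after the first, while the fixed point w = 0 pays 0.

```python
import numpy as np

def f(t, w):
    if t == 1:
        return -0.5 * w
    return w if t % 2 == 0 else -w

T = 20
grid = np.linspace(-1.0, 1.0, 201)     # S = [-1, 1], discretised
cum = np.zeros_like(grid)              # cumulative loss of each candidate w
w, total = 0.0, 0.0                    # FTL's play and its running loss
for t in range(1, T + 1):
    total += f(t, w)
    cum += np.array([f(t, g) for g in grid])
    w = grid[np.argmin(cum)]           # FTL: play the current leader
print(total)                           # = T - 1: linear regret
```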

Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

w_{t+1} = argmin_{w∈S} ∑_{i=1}^t f_i(w) + φ(w)    (4)

Lemma (upper bound of regret)
Let w_1, w_2, ... be the sequence produced by FTRL; then ∀u ∈ S:

∑_{t=1}^T [f_t(w_t) − f_t(u)] ≤ φ(u) − φ(w_1) + ∑_{t=1}^T [f_t(w_t) − f_t(w_{t+1})]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 10 / 27

Online learning

From batch to online learning (cont.)

Let's assume the losses f_t are convex functions:

∀w, u ∈ S: f_t(u) − f_t(w) ≥ ⟨u − w, g(w)⟩, ∀g(w) ∈ ∂f_t(w)

Use the linearisation f_t(w_t) + ⟨w, g(w_t)⟩ of f_t as a surrogate:

∑_{t=1}^T f_t(w_t) − f_t(u) ≤ ∑_{t=1}^T ⟨w_t, g_t⟩ − ⟨u, g_t⟩, ∀g_t ∈ ∂f_t(w_t)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 11 / 27

Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):

w_{t+1} = w_t − η g_t,  g_t ∈ ∂f_t(w_t)    (5)

From FTRL to online gradient descent:

  use the linearisation as surrogate loss (it upper-bounds the LHS)
  use the L2 regularizer φ(w) = (1/2η) ‖w − w_1‖₂²
  apply the FTRL regret bound to the surrogate loss

w_{t+1} = argmin_{w∈S} ∑_{i=1}^t ⟨w, g_i⟩ + φ(w)

Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f_1, f_2, ..., f_T with regularizer φ(w) = (1/2η) ‖w − w_1‖₂². Then for all u ∈ S:

R_T(u) ≤ (1/2η) ‖u − w_1‖₂² + η ∑_{t=1}^T ‖g_t‖₂²,  g_t ∈ ∂f_t(w_t)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 12 / 27

Online learning

From batch to online learning (cont.)

Keep in mind for the later discussion of adaGrad (starting from the regret bound above):

I want φ to be strongly convex w.r.t. a (semi-)norm ‖·‖;
then the L2 norm in the bound is replaced by the dual norm ‖·‖*,
and we use Hölder's inequality in the proof.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 13 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms ψ_t:

  ψ_t is a strongly-convex function, and
  B_{ψ_t}(w, w_t) = ψ_t(w) − ψ_t(w_t) − ⟨∇ψ_t(w_t), w − w_t⟩

Primal-dual sub-gradient update:

w_{t+1} = argmin_w η⟨w, g_t⟩ + ηφ(w) + (1/t) ψ_t(w)    (6)

Proximal gradient / composite mirror descent:

w_{t+1} = argmin_w η⟨w, g_t⟩ + ηφ(w) + B_{ψ_t}(w, w_t)    (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 14 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

G_t = ∑_{i=1}^t g_i g_i^T,  H_t = δI + diag(G_t)^{1/2}

w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηφ(w) + (1/2t) w^T H_t w    (8)

w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηφ(w) + (1/2) (w − w_t)^T H_t (w − w_t)    (9)

To get adaGrad from online gradient descent:

  add a proximal term or a Bregman divergence to the objective
  adapt the proximal term through time: ψ_t(w) = (1/2) w^T H_t w

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 15 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

example: composite mirror descent on S = R^D with φ(w) = 0:

w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2) (w − w_t)^T [δI + diag(G_t)^{1/2}] (w − w_t)

⇒ w_{t+1} = w_t − η g_t / (δ + √(∑_{i=1}^t g_i²))    (10)

(the square root and division are taken per coordinate)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 16 / 27
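A minimal sketch of the diagonal update (10) (assumed synthetic stream and hand-picked η, δ; not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(4)
w_true = np.array([1.0, -0.5, 2.0])
w = np.zeros(3)
G_diag = np.zeros(3)                   # running sum of squared gradients
eta, delta = 0.5, 1e-8

for t in range(1, 2001):
    x_t = rng.normal(size=3)
    g_t = (w @ x_t - w_true @ x_t) * x_t
    G_diag += g_t**2
    w -= eta * g_t / (delta + np.sqrt(G_diag))   # per-coordinate step sizes
```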

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:

I want the contribution of ‖g_t‖*² induced by ψ_t to be upper-bounded by ‖g_t‖₂:

  define ‖·‖_{ψ_t} = √⟨·, H_t ·⟩
  ψ_t(w) = (1/2) w^T H_t w is strongly convex w.r.t. ‖·‖_{ψ_t}
  a little math can show that

∑_{t=1}^T ‖g_t‖²_{ψ_t,*} ≤ 2 ∑_{d=1}^D ‖g_{1:T,d}‖₂

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 17 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper)
Assume the sequence w_t is generated by adaGrad using the primal-dual update (8) with δ ≥ max_t ‖g_t‖_∞; then for any u ∈ S:

R_T(u) ≤ (δ/η) ‖u‖₂² + (1/η) ‖u‖_∞² ∑_{d=1}^D ‖g_{1:T,d}‖₂ + η ∑_{d=1}^D ‖g_{1:T,d}‖₂

For w_t generated using the composite mirror descent update (9):

R_T(u) ≤ (1/2η) max_{t≤T} ‖u − w_t‖₂² ∑_{d=1}^D ‖g_{1:T,d}‖₂ + η ∑_{d=1}^D ‖g_{1:T,d}‖₂

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 18 / 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad:

  the global learning rate η still needs hand tuning
  sensitive to initial conditions

possible improving directions:

  use truncated sum / running average
  use (approximate) second-order information
  add momentum
  correct the bias of the running average

existing methods:

  RMSprop [Tieleman and Hinton, 2012]
  adaDelta [Zeiler, 2012]
  ADAM [Kingma and Ba, 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 19 / 27

Advanced adaptive learning rates

Improving adaGrad (cont.)

RMSprop:

G_t = ρ G_{t−1} + (1 − ρ) g_t g_t^T,  H_t = (δI + diag(G_t))^{1/2}

w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2) (w − w_t)^T H_t (w − w_t)

⇒ w_{t+1} = w_t − η g_t / RMS[g]_t,  RMS[g]_t = diag(H_t)    (11)

improving direction used here: truncated sum / running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 20 / 27
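A minimal sketch of update (11) (assumed ρ = 0.9, a common default, and a synthetic stream; not the original lecture code):

```python
import numpy as np

rng = np.random.default_rng(5)
w_true = np.array([1.0, -0.5, 2.0])
w = np.zeros(3)
G_diag = np.zeros(3)
rho, eta, delta = 0.9, 0.01, 1e-8

for t in range(1, 5001):
    x_t = rng.normal(size=3)
    g_t = (w @ x_t - w_true @ x_t) * x_t
    G_diag = rho * G_diag + (1 - rho) * g_t**2   # running average of g^2
    w -= eta * g_t / np.sqrt(delta + G_diag)     # divide by RMS[g]_t
```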

Advanced adaptive learning rates

Improving adaGrad (cont.)

A second-order heuristic: a Newton-style step gives

Δw ∝ H⁻¹ g ∝ (∂f/∂w) / (∂²f/∂w²)

⇒ ∂²f/∂w² ∝ (∂f/∂w) / Δw

⇒ diag(H) ≈ RMS[g]_t / RMS[Δw]_{t−1}

improving direction used here: (approximate) second-order information

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 21 / 27

Advanced adaptive learning rates

Improving adaGrad (cont.)

adaDelta:

diag(H_t) = RMS[g]_t / RMS[Δw]_{t−1}

w_{t+1} = argmin_w ⟨w, g_t⟩ + (1/2) (w − w_t)^T H_t (w − w_t)

⇒ w_{t+1} = w_t − (RMS[Δw]_{t−1} / RMS[g]_t) g_t    (12)

improving directions used here: running average + (approximate) second-order information

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 21 / 27
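A minimal sketch of update (12) (assumed ρ = 0.95 and δ = 1e-6 inside each RMS, as in Zeiler's paper; synthetic stream):

```python
import numpy as np

rng = np.random.default_rng(6)
w_true = np.array([1.0, -0.5, 2.0])
w = np.zeros(3)
Eg2 = np.zeros(3)      # running average of g^2
Edw2 = np.zeros(3)     # running average of (delta w)^2
rho, delta = 0.95, 1e-6

for t in range(1, 10_001):
    x_t = rng.normal(size=3)
    g_t = (w @ x_t - w_true @ x_t) * x_t
    Eg2 = rho * Eg2 + (1 - rho) * g_t**2
    dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g_t   # RMS ratio step
    Edw2 = rho * Edw2 + (1 - rho) * dw**2
    w += dw              # note: no global learning rate needed
```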

Advanced adaptive learning rates

Improving adaGrad (cont.)

ADAM:

m_t = β₁ m_{t−1} + (1 − β₁) g_t,  v_t = β₂ v_{t−1} + (1 − β₂) g_t²

m̂_t = m_t / (1 − β₁ᵗ),  v̂_t = v_t / (1 − β₂ᵗ)

w_{t+1} = w_t − η m̂_t / (√v̂_t + δ)    (13)

improving directions used here: add momentum + correct the bias of the running averages

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 22 / 27
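A minimal sketch of update (13) (assumed paper defaults β₁ = 0.9, β₂ = 0.999, δ = 1e-8; synthetic stream):

```python
import numpy as np

rng = np.random.default_rng(7)
w_true = np.array([1.0, -0.5, 2.0])
w = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
beta1, beta2, eta, delta = 0.9, 0.999, 0.05, 1e-8

for t in range(1, 2001):
    x_t = rng.normal(size=3)
    g_t = (w @ x_t - w_true @ x_t) * x_t
    m = beta1 * m + (1 - beta1) * g_t          # momentum (first moment)
    v = beta2 * v + (1 - beta2) * g_t**2       # second moment
    m_hat = m / (1 - beta1**t)                 # bias correction
    v_hat = v / (1 - beta2**t)
    w -= eta * m_hat / (np.sqrt(v_hat) + delta)
```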

Advanced adaptive learning rates

Demo time!

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 23 / 27

Advanced adaptive learning rates

Learn the learning rates

idea 3: learn the (global) learning rates as well

w_{t+1} = w_t − η g_t ⇒ w_t = w(t, η; w_1, {g_i}_{i≤t−1})

i.e. w_t is a function of η

learn η (with gradient descent) by

η* = argmin_η ∑_{s≤t} f_s(w(s, η))

  reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
  stochastic gradient descent to learn η on the fly [Masse and Ollivier, 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 24 / 27

Advanced adaptive learning rates

Summary

  Recent successes of machine learning rely on stochastic approximation and optimisation methods.
  Theoretical analysis guarantees good behaviour.
  Adaptive learning rates provide good performance in practice.
  Future work will reduce the labour spent tuning hyper-parameters.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 25 / 27

Advanced adaptive learning rates

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 26 / 27

Advanced adaptive learning rates

References

Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 27 / 27

Advanced adaptive learning rates

References

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 28 / 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 44: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Discrete Gradient DescentExamples of discretisation with ε = 007

19 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 45: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Discrete Gradient DescentExamples of discretisation with ε = 001

20 of 27

var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. (Illustration credit: Online Learning and Stochastic Approximation, Léon Bottou, 1998.)

Step 2: consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

With some algebra, note that

$h_{n+1} - h_n = \langle s_{n+1} - x^\star, s_{n+1} - x^\star \rangle - \langle s_n - x^\star, s_n - x^\star \rangle$
$\quad = \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2 \langle s_{n+1} - s_n, x^\star \rangle$
$\quad = \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2 \varepsilon_n \langle \nabla C(s_n), x^\star \rangle$
$\quad = -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2.$

Step 3: show that $\sum_{n=1}^{\infty} h_n^+ < \infty$.

Assuming $\|\nabla C(x)\|_2^2 \le A + B \|x - x^\star\|_2^2$, we get

$h_{n+1} - h_n \le -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + B h_n)$
$\implies h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A,$

where the final inequality uses condition 2 (the inner product is positive).

Continuing from

$h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le \varepsilon_n^2 A,$

write $\mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$; then

$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A.$

Assuming $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the right-hand side is summable, and hence $\sum_n h_n^+ < \infty$.

Step 4: show that this implies that $h_n$ converges.

If $\sum_{n=1}^{\infty} \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^{\infty} \min(0, h_{n+1} - h_n) > -\infty$. (Why? Since $h_{n+1} = h_1 + \sum_{k=1}^{n} \left( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \right)$ and $h_n \ge 0$, the negative variations cannot diverge.)

So $(h_n)_{n=1}^{\infty}$ converges.

Step 5: show that $h_n$ must converge to 0.

Assume $h_n$ converges to some positive number. Previously we had

$h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2.$

This is summable, and if we assume further that $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, then we get

$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0,$

contradicting $h_n$ converging away from 0. Step 6 then follows immediately: $h_n \to 0$ means $s_n \to x^\star$.


Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$\nabla C(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x)$

and an approximation formed by subsampling:

$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K).$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
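As a quick sanity check (a sketch, not from the slides; the per-datapoint quadratic losses are an illustrative assumption), one can verify numerically that the subsampled gradient is an unbiased estimator of the full gradient:

    import numpy as np

    rng = np.random.default_rng(0)
    N, K, D = 100, 10, 3
    # Per-datapoint gradients for c_n(x) = ||x - a_n||^2 / 2: f_n(x) = x - a_n.
    a = rng.normal(size=(N, D))
    x = rng.normal(size=D)
    full_grad = (x - a).mean(axis=0)          # (1/N) sum_n f_n(x)
    est, trials = np.zeros(D), 20000
    for _ in range(trials):
        idx = rng.choice(N, size=K, replace=False)   # I_K uniform over subsets
        est += (x - a[idx]).mean(axis=0) / trials
    print(full_grad)
    print(est)   # matches full_grad up to Monte Carlo error: unbiased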


Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process. $\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition. A stochastic process $(X_n)_{n=0}^{\infty}$ is a martingale³ if

- $\mathbb{E}[|X_n|] < \infty$ for all $n$,
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$.

Martingale convergence theorem (Doob, 1953). If $(X_n)_{n=1}^{\infty}$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965). If $(X_n)_{n=1}^{\infty}$ is a positive stochastic process and

$\sum_{n=1}^{\infty} \mathbb{E}\left[ (\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n) \, \mathbf{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}} \right] < \infty,$

then $X_n \to X_\infty$ almost surely.

³ On a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^{\infty}, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.


Stochastic Gradient Descent

Proposition. If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^\star$,
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$,
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B \|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$,

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
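A minimal simulation sketch of this proposition (assumptions, not from the slides: quadratic cost, Gaussian gradient noise, and $\varepsilon_n = 1/(n+1)$):

    import numpy as np

    rng = np.random.default_rng(1)
    # C(x) = ||x||^2 / 2, so nabla C(x) = x and x* = 0.
    # H_n(x) = x + noise is unbiased, with E[||H_n(x)||^2] <= A + B ||x - x*||^2.
    s = np.array([5.0, -3.0])
    for n in range(1, 100001):
        H = s + rng.normal(scale=1.0, size=2)   # unbiased noisy gradient
        s = s - (1.0 / (n + 1)) * H             # Robbins-Monro step sizes
    print(s)   # close to x* = 0; the error shrinks as n grows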


Proof outline:

1. Define the Lyapunov process $h_n = \|s_n - x^\star\|_2^2$.
2. Consider the variations $h_{n+1} - h_n$.
3. Show that $h_n$ converges almost surely.
4. Show that $h_n$ must converge to 0 almost surely.
5. Conclude $s_n \to x^\star$.


Step 2: consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

$h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^\star, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2.$

So

$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \, \mathbb{E}[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n].$

Step 3: show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

- assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B \|x - x^\star\|_2^2$;
- introduce $\mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.

We get

$\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A$
$\implies \mathbb{E}\left[ (h'_{n+1} - h'_n) \, \mathbf{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \;\middle|\; \mathcal{F}_n \right] \le \varepsilon_n^2 \mu_n A.$

Then $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h'_n)_{n=1}^{\infty}$ converges a.s. $\implies$ $(h_n)_{n=1}^{\infty}$ converges a.s.

Step 4: show that $h_n$ must converge to 0 almost surely.

From previous calculations,

$\mathbb{E}[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n] \le -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A.$

$(h_n)_{n=1}^{\infty}$ converges, so the left-hand side is summable almost surely. We assume $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, so the right term is summable almost surely, and hence so is the inner-product term:

$\sum_{n=1}^{\infty} \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle < \infty$ almost surely.

If we assume in addition that $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, this forces

$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0$ almost surely,

and by our initial assumptions about $C$, $(h_n)_{n=1}^{\infty}$ must converge to 0 almost surely.


Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^{\infty}$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.


Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. How, then, should we choose learning rates to achieve

- fast convergence,
- a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012: grid search vs random search over hyper-parameters.]

Idea 1: search for the best learning rate schedule, e.g. grid search, random search, cross validation, Bayesian optimization.

But I'm lazy and I don't want to spend too much time on this.

Motivation (cont.)

Idea 2: let the learning rates adapt themselves.

- pre-conditioning with $|\mathrm{diag}(H)|$ [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning (a good tutorial: [Shalev-Shwartz, 2012]; milestone paper: [Zinkevich, 2003]) and some practical comparisons.

Online learning

From batch to online learning

Batch learning often assumes:

- we've got the whole dataset $\mathcal{D}$;
- the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$.

SGD / mini-batch learning accelerates training in real time by

- processing one datapoint / a mini-batch of datapoints each iteration,
- considering gradients of the datapoint cost functions:

$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m).$

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

- each iteration $t$, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$;
- then we update the weights with $w \leftarrow w - \eta g_t$.

For SGD, the received gradient is defined by

$g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m).$
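A minimal sketch of the OGD loop (not from the slides; the quadratic loss stream in the usage example is an illustrative assumption):

    import numpy as np

    def online_gradient_descent(loss_grads, w0, eta):
        """Minimal OGD loop: loss_grads yields one (sub-)gradient function
        per round t; returns the iterates w_1, w_2, ..."""
        w, iterates = np.asarray(w0, dtype=float), []
        for grad in loss_grads:
            g = grad(w)               # g_t in the subdifferential of f_t
            w = w - eta * g           # w <- w - eta * g_t
            iterates.append(w.copy())
        return iterates

    # Usage sketch: stream of quadratic losses f_t(w) = ||w - x_t||^2 / 2.
    rng = np.random.default_rng(0)
    stream = [(lambda w, x=x: w - x)          # default arg binds x per round
              for x in rng.normal(size=(100, 2))]
    ws = online_gradient_descent(stream, w0=[0.0, 0.0], eta=0.1)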


Online learning assumes:

- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$);
- then we learn $w_{t+1}$ following some rules.

Performance is measured by the regret

$R_T^\star = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w). \qquad (1)$

General definition: $R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

Follow the Leader (FTL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w). \qquad (2)$

Lemma (upper bound on regret). Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,

$R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})]. \qquad (3)$

A game that fools FTL: let $w \in S = [-1, 1]$ and take the loss function at time $t$ to be

$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd.} \end{cases}$

FTL is easily fooled: its iterates jump between the two endpoints and suffer loss 1 every round, so its regret grows linearly.
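A small simulation sketch of this game (assumptions: the FTL minimiser of the cumulative linear loss over $S = [-1,1]$ is computed in closed form at an endpoint, the first play is $w_1 = 0$, and ties are broken arbitrarily):

    import numpy as np

    T = 20
    coef = [-0.5 if t == 1 else (1.0 if t % 2 == 0 else -1.0)
            for t in range(1, T + 1)]       # f_t(w) = coef[t-1] * w
    w, cum, losses = 0.0, 0.0, []           # w_1 = 0, any start in [-1, 1]
    for c in coef:
        losses.append(c * w)                # suffer f_t(w_t)
        cum += c                            # FTL minimises cum * w over [-1, 1]
        w = 1.0 if cum < 0 else -1.0        # argmin sits at an endpoint
    # Best fixed point in hindsight is also at an endpoint (linear total loss).
    best_fixed = min(sum(coef) * u for u in (-1.0, 1.0))
    print(sum(losses) - best_fixed)         # regret grows linearly in T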

Follow the Regularized Leader (FTRL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w). \qquad (4)$

Lemma (upper bound on regret). Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,

$\sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})].$

Let's assume the losses $f_t$ are convex functions:

$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w).$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g_t \rangle$ as a surrogate; then

$\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t).$

Online Gradient Descent (OGD):

$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t). \qquad (5)$

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (upper-bounding the left-hand side);
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$;
- apply the FTRL regret bound to the surrogate loss:

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} \langle w, g_i \rangle + \varphi(w).$

Theorem (regret bound for online gradient descent). Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$; then for all $u \in S$,

$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t).$

Keep this theorem in mind for the later discussion of adaGrad: if we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$, then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_\star$, and Hölder's inequality is used in the proof.

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle.$

Primal-dual sub-gradient update:

$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w). \qquad (6)$

Proximal gradient / composite mirror descent:

$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t). \qquad (7)$

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$G_t = \sum_{i=1}^{t} g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2}),$

$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w, \qquad (8)$

$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t). \qquad (9)$

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective;
- adapt the proximal term through time: $\psi_t(w) = \frac{1}{2} w^T H_t w$.

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T \left[ \delta I + \mathrm{diag}(G_t^{1/2}) \right] (w - w_t)$

$\implies w_{t+1} = w_t - \frac{\eta \, g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}} \quad \text{(elementwise)}. \qquad (10)$
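A minimal sketch of the elementwise update (10) (the noisy quadratic objective and the values of $\eta$, $\delta$ are illustrative assumptions, not from the paper):

    import numpy as np

    rng = np.random.default_rng(2)
    # Illustrative problem: minimise C(w) = ||w - target||^2 / 2.
    target = np.array([1.0, -2.0])
    w, G = np.zeros(2), np.zeros(2)    # G accumulates squared gradients
    eta, delta = 0.5, 1e-8
    for t in range(1, 501):
        g = (w - target) + rng.normal(scale=0.1, size=2)  # noisy gradient
        G += g ** 2
        w = w - eta * g / (delta + np.sqrt(G))            # update (10)
    print(w)   # close to target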

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs time-varying proximal term: we want the contribution of $\|g_t\|_\star^2$ induced by $\psi_t$ to be upper-bounded in terms of $\|g_t\|_2$.

Define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \cdot \rangle}$; then $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$, and a little math can show that

$\sum_{t=1}^{T} \|g_t\|_{\psi_t, \star}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T, d}\|_2.$

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper). Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

$R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T, d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T, d}\|_2.$

For $w_t$ generated using the composite mirror descent update (9),

$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T, d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T, d}\|_2.$

Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand tuning;
- it is sensitive to initial conditions.

Possible improving directions:

- use a truncated sum / running average;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods: RMSprop [Tieleman and Hinton, 2012], adaDelta [Zeiler, 2012], ADAM [Kingma and Ba, 2014].

Improving adaGrad (cont.)

RMSprop (uses a running average in place of adaGrad's full sum):

$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2},$

$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$

$\implies w_{t+1} = w_t - \frac{\eta \, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t). \qquad (11)$
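A minimal sketch of one RMSprop step per equation (11) (default hyper-parameter values here are common choices, not from the slides):

    import numpy as np

    def rmsprop_step(w, g, G, eta=1e-2, rho=0.9, delta=1e-8):
        """One RMSprop update; G is the running average of squared
        gradients (the diagonal of G_t)."""
        G = rho * G + (1 - rho) * g ** 2
        w = w - eta * g / np.sqrt(delta + G)   # divide by RMS[g]_t
        return w, G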

Improving adaGrad (cont.)

Approximate second-order information:

$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}$
$\implies \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}$
$\implies \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}.$

Improving adaGrad (cont.)

adaDelta (uses the second-order approximation above):

$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}},$

$w_{t+1} = \arg\min_{w} \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$

$\implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t. \qquad (12)$
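A minimal sketch of one adaDelta step per equation (12), with $\mathrm{RMS}[\cdot] = \sqrt{\mathbb{E}[\cdot^2] + \delta}$ as in Zeiler's paper (default values are common choices):

    import numpy as np

    def adadelta_step(w, g, Eg2, Edw2, rho=0.95, delta=1e-6):
        """One adaDelta update; Eg2 and Edw2 are running averages of
        squared gradients and squared parameter updates."""
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
        Edw2 = rho * Edw2 + (1 - rho) * dw ** 2
        return w + dw, Eg2, Edw2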

Improving adaGrad (cont.)

ADAM (adds momentum and corrects the bias of the running averages):

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2,$

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t},$

$w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}. \qquad (13)$
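A minimal sketch of one ADAM step per equation (13) (defaults follow the values suggested in [Kingma and Ba, 2014]):

    import numpy as np

    def adam_step(w, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, delta=1e-8):
        """One ADAM update; t is the 1-indexed iteration counter."""
        m = b1 * m + (1 - b1) * g          # first-moment running average
        v = b2 * v + (1 - b2) * g ** 2     # second-moment running average
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + delta)
        return w, m, v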

Demo time

See the Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
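A small stand-in for the demo, as a sketch (assumptions: the Rosenbrock function from that page, an inlined ADAM update per (13), and arbitrary hyper-parameters):

    import numpy as np

    def rosenbrock_grad(w):
        """Gradient of f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, minimum at (1, 1)."""
        x, y = w
        return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                         200 * (y - x ** 2)])

    w, m, v = np.array([-1.2, 1.0]), np.zeros(2), np.zeros(2)
    b1, b2, eta, delta = 0.9, 0.999, 1e-2, 1e-8
    for t in range(1, 50001):
        g = rosenbrock_grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        w = w - eta * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + delta)
    print(w, np.linalg.norm(rosenbrock_grad(w)))  # w should end up near (1, 1)

Swapping in the adaGrad, RMSprop, or adaDelta steps sketched above makes the practical differences between the methods visible on the same test function.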

Learn the learning rates

Idea 3: learn the (global) learning rates as well.

$w_{t+1} = w_t - \eta g_t \;\implies\; w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1}),$

i.e. $w_t$ is a function of $\eta$. So learn $\eta$ (with gradient descent) by

$\eta^\star = \arg\min_{\eta} \sum_{s \le t} f_s(w(s, \eta)).$

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015].
- Stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015].
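A toy sketch of the idea (assumption: the hypergradient $\mathrm{d}L/\mathrm{d}\eta$ is estimated by central finite differences through a short unrolled run, rather than by the reverse-mode method of the papers; the quadratic loss is illustrative):

    import numpy as np

    def unrolled_loss(eta, w0, grads, loss):
        """Run T steps of w <- w - eta * g_t and report the final loss."""
        w = np.asarray(w0, dtype=float)
        for g in grads:
            w = w - eta * g(w)
        return loss(w)

    # Toy problem: quadratic loss, fixed gradient stream of length T = 5.
    loss = lambda w: 0.5 * np.sum((w - 1.0) ** 2)
    grads = [lambda w: w - 1.0] * 5
    eta, w0, h = 0.05, np.zeros(2), 1e-4
    for _ in range(100):   # gradient descent on eta itself
        d = (unrolled_loss(eta + h, w0, grads, loss)
             - unrolled_loss(eta - h, w0, grads, loss)) / (2 * h)
        eta -= 0.01 * d
    print(eta, unrolled_loss(eta, w0, grads, loss))  # eta adapted to cut the loss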

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour spent on tuning hyper-parameters.

References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2012, 4(2): 107-194.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 46: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 47: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Discrete Gradient Descent

Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 nablaC (x)22 le A + Bx minus x2

2 for some AB ge 0

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation

21 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. [Illustration omitted; credit: Online Learning and Stochastic Approximation, Léon Bottou (1998).]

Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

With some algebra, note that

$$\begin{aligned}
h_{n+1} - h_n &= \langle s_{n+1} - x^\star, s_{n+1} - x^\star \rangle - \langle s_n - x^\star, s_n - x^\star \rangle \\
&= \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2\,\langle s_{n+1} - s_n, x^\star \rangle \\
&= \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2\varepsilon_n \langle \nabla C(s_n), x^\star \rangle \\
&= -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2
\end{aligned}$$

Step 3: Show that $\sum_{n=1}^\infty h_n^+ < \infty$.

Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2$, we get

$$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + B h_n)$$

$$\implies h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$$

(the inner product is non-negative by condition 2). Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get

$$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A$$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the right-hand side is summable. Hence the positive variations of $h'_n$, and therefore of $h_n$, are summable: $\sum_{n=1}^\infty h_n^+ < \infty$.

Step 4: Show that this implies that $h_n$ converges.

If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n)$ also converges. (Why? Because $h_n \ge 0$, the negative variations cannot add up to $-\infty$ while the positive ones stay bounded.)

But $h_{n+1} = h_1 + \sum_{k=1}^n \big( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \big)$.

So $(h_n)_{n=1}^\infty$ converges.

Step 5: Show that $h_n$ must converge to 0.

Assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$$\langle s_n - x^\star, \nabla C(s_n) \rangle \rightarrow 0,$$

contradicting $h_n$ converging away from 0 (condition 2 bounds the inner product away from 0 whenever $h_n$ stays away from 0).

22 of 27


Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x)$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

23 of 27
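A minimal sketch of the subsampled estimator above (ours, not from the slides); `f_grads` is a hypothetical list of the per-datapoint gradient functions $f_n$:

import numpy as np

def minibatch_gradient(f_grads, x, K, rng=None):
    # Draw I_K uniformly among subsets of size K (without replacement) and
    # average the corresponding gradients: an unbiased estimate of grad C(x).
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(f_grads), size=K, replace=False)
    return sum(f_grads[k](x) for k in idx) / K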


Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale[3] if
- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \rightarrow X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\left[ \big(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n\big)\, \mathbf{1}_{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0} \right] < \infty,$$

then $X_n \rightarrow X_\infty$ almost surely.

[3] On a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.

24 of 27
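As a quick sanity check of Doob's theorem (an illustration of ours, not from the slides): the running product of i.i.d. non-negative factors with mean 1 is a positive, $L^1$-bounded martingale, so it converges almost surely. Here it converges to 0, since $\mathbb{E}[\log Z] < 0$.

import numpy as np

rng = np.random.default_rng(0)
X = 1.0
for n in range(10_000):
    X *= rng.choice([0.5, 1.5])  # Z_n >= 0 i.i.d. with E[Z_n] = 1, so X_n is a martingale
print(X)                          # almost surely -> 0; typically underflows to 0.0 here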


Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^\star$,
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$,
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$,

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \rightarrow x^\star$ almost surely.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

25 of 27
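A toy check of the proposition (our illustration, not the authors'): SGD on $C(x) = \frac{1}{2}\|x\|_2^2$ with Gaussian gradient noise, so $H_n(x) = x + \xi_n$ is unbiased and condition 3 holds with $A = d$, $B = 1$.

import numpy as np

rng = np.random.default_rng(1)
x = np.array([5.0, -3.0])
for n in range(1, 100_000):
    g = x + rng.normal(size=2)   # unbiased estimate of grad C(x) = x
    x = x - (1.0 / n) * g        # Robbins-Monro step sizes eps_n = 1/n
print(x)                          # close to the unique minimiser x* = 0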


Stochastic Gradient Descent

Proof outline:
1. Define the Lyapunov process $h_n = \|s_n - x^\star\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \rightarrow x^\star$

Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$$

So

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2\, \mathbb{E}\big[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\big]$$

Step 3: Show that $h_n$ converges almost surely.

Exactly the same as for discrete gradient descent:
- assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^\star\|_2^2$
- introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

We get

$$\mathbb{E}\big[h'_{n+1} - h'_n \,\big|\, \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A$$

$$\implies \mathbb{E}\big[(h'_{n+1} - h'_n)\, \mathbf{1}_{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0} \,\big|\, \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A$$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h'_n)_{n=1}^\infty$ converges a.s. $\implies$ $(h_n)_{n=1}^\infty$ converges a.s.

Step 4: Show that $h_n$ must converge to 0 almost surely.

From previous calculations,

$$\mathbb{E}\big[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \,\big|\, \mathcal{F}_n\big] \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A$$

$(h_n)_{n=1}^\infty$ converges almost surely, so the left-hand side is summable almost surely. Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, the $\varepsilon_n^2 A$ term is summable almost surely, so the inner-product term is also summable almost surely:

$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle < \infty \quad \text{almost surely}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^\star, \nabla C(s_n) \rangle \rightarrow 0 \quad \text{almost surely.}$$

And by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

26 of 27


Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above is to assume:
- $C$ is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- the gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.

27 of 27


Motivation

So far we've told you why SGD is "safe" :)
- but Robbins-Monro is just a sufficient condition
- then how to choose learning rates to achieve
  - fast convergence
  - better local optimum?

Yingzhen Li (University of Cambridge), Adaptive learning rates, Nov 26, 2015 (1 / 27)

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012.]

idea 1: search for the best learning rate schedule
- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time...

2 / 27

Motivation (cont.)

idea 2: let the learning rates adapt themselves
- pre-conditioning with $|\mathrm{diag}(H)|$ [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning
- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and some practical comparisons

3 / 27

Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by:
- processing one / a mini-batch of datapoints each iteration
- considering gradients with data-point cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$

4 / 27

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):
- each iteration $t$, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by

$$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$

5 / 27


From batch to online learning (cont.)

Online learning assumes:
- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$$R^\star_T = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \qquad (1)$$

General definition: $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

7 / 27

From batch to online learning (cont.)

Follow the Leader (FTL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) \qquad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \dots$ be the sequence produced by FTL. Then $\forall u \in S$,

$$R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})] \qquad (3)$$

8 / 27

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\, w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled!

9 / 27
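A quick simulation of this game (our illustration; the starting point $w_1 = 0$ is an assumption, since the slide leaves it unspecified): FTL flips between the endpoints of $[-1, 1]$ and suffers loss 1 per round, while the fixed comparator $u = 0$ suffers none, so the regret grows linearly.

import numpy as np

T = 20
coeffs = [-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, T + 1)]
w, cum, loss = 0.0, 0.0, 0.0
for a in coeffs:                 # f_t(w) = a * w
    loss += a * w                # suffer f_t(w_t)
    cum += a                     # FTL minimises (sum_i a_i) * w over [-1, 1]
    w = 0.0 if cum == 0 else -np.sign(cum)
print(loss)                      # grows linearly in T; u = 0 would incur 0 loss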

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \dots$ be the sequence produced by FTRL. Then $\forall u \in S$,

$$\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]$$

10 / 27

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S, \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)$$

Use the linearisation $\tilde f_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$$

11 / 27

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (upper bound on the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$$

12 / 27
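A minimal sketch of the OGD update (5) (ours, not the authors'); the squared-loss stream is an arbitrary example used to generate the subgradients $g_t$:

import numpy as np

def ogd(stream, d, eta=0.1):
    # stream yields (x_t, y_t); f_t(w) = 0.5 * (x_t @ w - y_t)^2.
    w = np.zeros(d)
    for x_t, y_t in stream:
        g_t = (x_t @ w - y_t) * x_t   # gradient of f_t at w_t
        w = w - eta * g_t             # eq. (5)
    return w

rng = np.random.default_rng(2)
data = [(x, x @ np.array([1.0, -2.0])) for x in rng.normal(size=(500, 2))]
print(ogd(iter(data), d=2))           # approx. [1, -2]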

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \dots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$. Then for all $u \in S$,

$$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$$

Keep in mind for the later discussion of adaGrad:
- I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$
- then the L2 norm will be changed to the dual norm $\|\cdot\|_\star$
- and we use Hölder's inequality for the proof

13 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$$

Primal-dual sub-gradient update:

$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$

14 / 27

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w \qquad (8)$$

$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2} w^T H_t w$$

15 / 27

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T \big[\delta I + \mathrm{diag}(G_t^{1/2})\big] (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \quad \text{(elementwise)} \qquad (10)$$

16 / 27
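A minimal sketch of the elementwise update (10) (ours; `grad` stands in for the subgradient oracle):

import numpy as np

def adagrad(grad, w0, eta=0.1, delta=1e-8, T=1000):
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)                     # per-coordinate sum of g_i^2
    for t in range(T):
        g = grad(w)
        G += g * g
        w -= eta * g / (delta + np.sqrt(G))  # eq. (10)
    return w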

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs time-varying proximal term:
- I want the contribution of the $\|g_t\|_\star^2$ terms induced by $\psi_t$ to be upper bounded in terms of the $\|g_t\|_2$
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot \rangle}$
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^\star}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2$$

17 / 27

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8), with $\delta \ge \max_t \|g_t\|_\infty$. Then for any $u \in S$,

$$R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$

For $\{w_t\}$ generated using the composite mirror descent update (9),

$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$

18 / 27

Improving adaGrad

Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

19 / 27

Improving adaGrad (cont.)

RMSprop (improving direction: use a truncated sum / running average):

$$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$$

20 / 27
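A minimal sketch of the RMSprop update (11) (ours):

import numpy as np

def rmsprop(grad, w0, eta=1e-3, rho=0.9, delta=1e-8, T=1000):
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)                   # running average of g_t^2
    for t in range(T):
        g = grad(w)
        G = rho * G + (1 - rho) * g * g
        w -= eta * g / np.sqrt(delta + G)  # divide by RMS[g]_t, eq. (11)
    return w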

Improving adaGrad (cont.)

Using (approximate) second-order information:

$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \;\implies\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \;\implies\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

21 / 27

Improving adaGrad (cont.)

adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \arg\min_w \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12)$$

21 / 27
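A minimal sketch of the adaDelta update (12) (ours; the $\delta$ smoothing inside the square roots is an assumption following common practice):

import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, T=1000):
    w = np.asarray(w0, dtype=float)
    Eg2, Edw2 = np.zeros_like(w), np.zeros_like(w)
    for t in range(T):
        g = grad(w)
        Eg2 = rho * Eg2 + (1 - rho) * g * g                      # RMS[g]_t
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g   # eq. (12)
        Edw2 = rho * Edw2 + (1 - rho) * dw * dw                  # RMS[dw]_t
        w += dw
    return w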

Improving adaGrad (cont.)

ADAM (improving directions: momentum and bias-corrected running averages):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \delta} \qquad (13)$$

22 / 27
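A minimal sketch of the ADAM update (13) (ours):

import numpy as np

def adam(grad, w0, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8, T=1000):
    w = np.asarray(w0, dtype=float)
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, T + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g              # momentum on g
        v = beta2 * v + (1 - beta2) * g * g          # running average of g^2
        m_hat = m / (1 - beta1 ** t)                 # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)  # eq. (13)
    return w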

Demo time

Wikipedia page "Test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization

23 / 27

Learn the learning rates

idea 3: learn the (global) learning rates as well

$$w_{t+1} = w_t - \eta g_t \;\implies\; w_t = w(t, \eta; w_1, \{g_i\}_{i \le t-1})$$

- $w_t$ is a function of $\eta$
- learn $\eta$ (with gradient descent) by

$$\eta = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))$$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]

24 / 27
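A hedged sketch of the "learn $\eta$ on the fly" idea, reduced to its simplest form (a hypergradient step on $\eta$; a simplification of ours, not the algorithm from either cited paper). Since $w_{t+1} = w_t - \eta g_t$, the derivative of the next loss with respect to $\eta$ is approximately $-\langle g_{t+1}, g_t \rangle$, giving the update below.

import numpy as np

def ogd_learned_eta(grad, w0, eta=0.01, beta=1e-4, T=1000):
    w = np.asarray(w0, dtype=float)
    g_prev = np.zeros_like(w)
    for t in range(T):
        g = grad(w)
        eta += beta * (g @ g_prev)   # descend the loss as a function of eta
        w -= eta * g
        g_prev = g
    return w, eta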

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods
- Theoretical analysis guarantees good behaviour
- Adaptive learning rates provide good performance in practice
- Future work will reduce the labour of tuning hyper-parameters

Yingzhen Li (University of Cambridge), Adaptive learning rates, Nov 26, 2015 (25 / 27)

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 48: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 49: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

From the above,

$$h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le \varepsilon_n^2 A$$

Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get

$$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A$$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the right-hand side is summable, and hence so is $\sum_{n=1}^\infty h_n^+$.

Show that this implies that $h_n$ converges.

If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$. (Why? Because $h_n \ge 0$, so the cumulative 'down' moves cannot exceed $h_0$ plus the cumulative 'up' moves.)

But

$$h_{n+1} = h_0 + \sum_{k=0}^n \left( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \right)$$

So $(h_n)_{n=1}^\infty$ converges.

Show that $h_n$ must converge to 0.

Assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0$$

contradicting $h_n$ converging away from 0.

22 of 27

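To make the fluctuating-Lyapunov-sequence picture concrete, here is a minimal sketch (a hypothetical example, not from the slides): running $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ on $C(x) = x^2/2$ with $\varepsilon_n = 2.5/n$, the first step overshoots, so $h_n$ rises before settling; yet the summed positive variations stay finite and $h_n \to 0$.

```python
def grad_C(x):
    # gradient of C(x) = 0.5 * x^2, whose unique minimiser is x* = 0
    return x

s = 3.0                  # s_1
eps0 = 2.5               # eps_n = eps0 / n: sum eps_n = inf, sum eps_n^2 < inf
h_prev = s * s           # h_n = ||s_n - x*||_2^2 with x* = 0
up_moves = 0.0           # running sum of the positive variations h_n^+
for n in range(1, 200):
    s = s - (eps0 / n) * grad_C(s)
    h = s * s
    up_moves += max(0.0, h - h_prev)
    h_prev = h
print(h_prev, up_moves)  # h_n -> 0 while the cumulative 'up' moves stay bounded
```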

Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x)$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\{\text{subsets of size } K\})$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

23 of 27
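As a concrete illustration, here is a minimal NumPy sketch (hypothetical example and names, not from the slides) of the subsampling estimator above: averaging $f_k$ over a uniformly drawn size-$K$ subset gives an unbiased estimate of $\nabla C$.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100)
# f_n(x) = x - a_n, the gradient of the n-th term of C(x) = (1/N) sum_n 0.5*(x - a_n)^2
grad_fns = [lambda x, an=an: x - an for an in a]

def full_gradient(x):
    # true gradient: average over all N per-datapoint gradients
    return sum(f(x) for f in grad_fns) / len(grad_fns)

def minibatch_gradient(x, K):
    # unbiased estimate: average over a uniform random subset of size K
    idx = rng.choice(len(grad_fns), size=K, replace=False)
    return sum(grad_fns[i](x) for i in idx) / K

x = 0.0
print(full_gradient(x))
print(np.mean([minibatch_gradient(x, K=10) for _ in range(5000)]))  # close to the true gradient
```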

Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if

- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\left[ \left( \mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n \right) \mathbb{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}} \right] < \infty$$

then $X_n \to X_\infty$ almost surely.

³on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$

24 of 27
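A quick numerical check of the martingale property (a hypothetical example, not from the slides): for a simple random walk $X_n = \sum_{i \le n} \xi_i$ with i.i.d. zero-mean steps, $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = X_n$, which we can verify by conditioning on the value at a fixed time.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100000 sample paths of a +/-1 random walk, 20 steps each
paths = rng.choice([-1.0, 1.0], size=(100_000, 20)).cumsum(axis=1)

x10, x11 = paths[:, 9], paths[:, 10]   # X_10 and X_11 on each path
for v in (-4.0, 0.0, 4.0):
    mask = x10 == v
    # E[X_11 | X_10 = v] should be v: the next step has mean zero
    print(v, round(x11[mask].mean(), 3))
```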

Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement give us the control we need over the unbiased gradient estimators.

25 of 27
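A minimal sketch of the proposition's setting (hypothetical example, not the authors' code): noisy gradients of a quadratic with Robbins-Monro step sizes $\varepsilon_n = 1/n$; the iterates converge to $x^*$ despite the noise.

```python
import numpy as np

rng = np.random.default_rng(1)
x_star = np.array([2.0, -1.0])

def H(s):
    # unbiased estimator of grad C(s) for C(s) = 0.5 * ||s - x*||^2
    return (s - x_star) + rng.normal(scale=0.5, size=s.shape)

s = np.zeros(2)
for n in range(1, 100_001):
    eps_n = 1.0 / n              # sum eps_n = inf, sum eps_n^2 < inf
    s = s - eps_n * H(s)
print(s)                         # close to x_star
```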

Stochastic Gradient Descent

1. Define the Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. Conclude that $s_n \to x^*$

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$$

So

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \, \mathbb{E}[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n]$$

Show that $h_n$ converges almost surely.

Exactly the same as for discrete gradient descent:

- Assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

We get

$$\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A$$
$$\implies \mathbb{E}\left[ (h'_{n+1} - h'_n) \, \mathbb{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \,\middle|\, \mathcal{F}_n \right] \le \varepsilon_n^2 \mu_n A$$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty \implies$ quasimartingale convergence $\implies (h'_n)_{n=1}^\infty$ converges a.s. $\implies (h_n)_{n=1}^\infty$ converges a.s.

Show that $h_n$ must converge to 0 almost surely.

From previous calculations,

$$\mathbb{E}[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A$$

$(h_n)_{n=1}^\infty$ converges, so this sequence of variations is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the right-hand term is summable a.s., and hence the left-hand term is also summable a.s.:

$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty \quad \text{almost surely}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely}$$

And by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

26 of 27


Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above is to assume:

- $C$ is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- the gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.

27 of 27

Motivation

So far we've told you why SGD is "safe" :)

- but Robbins-Monro is just a sufficient condition
- so how do we choose learning rates to achieve
  - fast convergence?
  - a better local optimum?

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 / 27

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule
- grid search / random search / cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time...

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 / 27
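For flavour, here is a minimal random-search sketch over the learning rate (the toy objective and all names are hypothetical): sample log-uniform rates, run a short noisy-gradient trial for each, and keep the best.

```python
import numpy as np

rng = np.random.default_rng(0)

def trial(eta, steps=200):
    # short SGD run on 0.5*w^2 with noisy gradients; returns the final loss
    w = np.array([5.0])
    for _ in range(steps):
        g = w + rng.normal(scale=0.1)   # noisy gradient
        w = w - eta * g
    return 0.5 * float(w @ w)

etas = 10 ** rng.uniform(-4, 0, size=20)   # log-uniform in [1e-4, 1]
best = min(etas, key=trial)
print(best)
```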

Motivation (cont.)

idea 2: let the learning rates adapt themselves
- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:
- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and some practical comparisons.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 / 27

Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD/mini-batch learning accelerates training in real time by:
- processing one datapoint / a mini-batch of datapoints each iteration
- considering gradients of the per-datapoint cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 / 27

Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):
- each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by

$$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 / 27
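A minimal OGD sketch of the update above (hypothetical setup): each round reveals a loss $f_t(w) = \frac{1}{2}\|w - a_t\|_2^2$ and its gradient, and we take one step.

```python
import numpy as np

rng = np.random.default_rng(0)
w, eta = np.zeros(2), 0.1
targets = rng.normal(size=(100, 2))              # the stream a_t defining f_t

total_loss = 0.0
for a_t in targets:
    total_loss += 0.5 * np.sum((w - a_t) ** 2)   # f_t(w_t), revealed at round t
    g_t = w - a_t                                # (sub-)gradient of f_t at w_t
    w = w - eta * g_t                            # OGD update
print(w, total_loss)
```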


Online learning

From batch to online learning (cont.)

Online learning assumes:
- each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \qquad (1)$$

General definition: $R_T(u) = \sum_{t=1}^T \left[ f_t(w_t) - f_t(u) \right]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 / 27

Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):

$$w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t f_i(w) \qquad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,

$$R_T(u) = \sum_{t=1}^T \left[ f_t(w_t) - f_t(u) \right] \le \sum_{t=1}^T \left[ f_t(w_t) - f_t(w_{t+1}) \right] \qquad (3)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 / 27

Online learning

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and take the loss function at time $t$ to be

$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled!

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 / 27
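Simulating the fooling game above (a small sketch under the stated losses): FTL's iterates bounce between the endpoints of $S = [-1, 1]$ and pay roughly 1 per round, while the fixed point $u = 0$ pays 0, so the regret grows linearly in $T$.

```python
def f(t, w):
    # the loss at round t (t >= 1) from the slide, for w in [-1, 1]
    if t == 1:
        return -0.5 * w
    return w if t % 2 == 0 else -w

T, w, total = 20, 0.0, 0.0   # start from w_1 = 0
for t in range(1, T + 1):
    total += f(t, w)
    # FTL: cumulative loss is linear in w, so its argmin over [-1, 1]
    # sits at an endpoint determined by the slope.
    slope = sum(f(i, 1.0) for i in range(1, t + 1))
    w = -1.0 if slope > 0 else 1.0
print(total)   # ~ T - 1; regret against u = 0 is the total itself
```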

Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,

$$\sum_{t=1}^T \left[ f_t(w_t) - f_t(u) \right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[ f_t(w_t) - f_t(w_{t+1}) \right]$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 / 27

Online learning

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)$$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 / 27

Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (it upper-bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 / 27

Online learning

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$; then for all $u \in S$,

$$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$$

Keep in mind for the later discussion of adaGrad:
- I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$
- then the L2 norm will be changed to the dual norm $\|\cdot\|_*$
- and we use Hölder's inequality for the proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$:
- $\psi_t$ is a strongly-convex function, and
- $B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$

Primal-dual sub-gradient update:

$$w_{t+1} = \operatorname*{argmin}_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \operatorname*{argmin}_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$w_{t+1} = \operatorname*{argmin}_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w \qquad (8)$$

$$w_{t+1} = \operatorname*{argmin}_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2} w^T H_t w$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \operatorname*{argmin}_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T [\delta I + \mathrm{diag}(G_t)](w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta \, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \qquad (10)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 / 27
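A minimal sketch of the per-coordinate update in eq. (10) on a toy quadratic (hypothetical example, not the paper's code): each coordinate's effective step size shrinks with the accumulated squared gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([5.0, -3.0])
eta, delta = 0.5, 1e-8
G = np.zeros_like(w)                 # running sum of g_i^2, per coordinate

for t in range(500):
    g = w + rng.normal(scale=0.1, size=w.shape)   # noisy gradient of 0.5*||w||^2
    G += g * g
    w = w - eta * g / (delta + np.sqrt(G))        # adaGrad update, eq. (10)
print(w)   # near the minimiser 0
```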

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs time-varying proximal term:

- I want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \cdot \rangle}$
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

$$R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$

For $w_t$ generated using the composite mirror descent update (9),

$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 / 27

Advanced adaptive learning rates

Improving adaGrad

Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 / 27

Advanced adaptive learning rates

Improving adaGrad (cont.)

RMSprop (improving direction: use a running average):

$$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \operatorname*{argmin}_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta \, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 / 27
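A minimal sketch of the update in eq. (11) on the same toy quadratic (hypothetical example): a decaying average of squared gradients replaces adaGrad's full sum.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([5.0, -3.0])
eta, rho, delta = 0.01, 0.9, 1e-8
G = np.zeros_like(w)                 # running average of g^2, per coordinate

for t in range(2000):
    g = w + rng.normal(scale=0.1, size=w.shape)   # noisy gradient of 0.5*||w||^2
    G = rho * G + (1 - rho) * g * g
    w = w - eta * g / np.sqrt(delta + G)          # RMSprop update, eq. (11)
print(w)
```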

Advanced adaptive learning rates

Improving adaGrad (cont.)

Second-order heuristic (improving direction: approximate second-order information):

$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \implies \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \implies \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 / 27

Advanced adaptive learning rates

Improving adaGrad (cont.)

adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \operatorname*{argmin}_w \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t \qquad (12)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 / 27
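A minimal sketch of the update in eq. (12) (hypothetical example): the ratio of running RMS values plays the role of the learning rate, so no global $\eta$ is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([5.0, -3.0])
rho, delta = 0.95, 1e-6
Eg2 = np.zeros_like(w)    # running average of g^2
Edw2 = np.zeros_like(w)   # running average of (delta w)^2

for t in range(5000):
    g = w + rng.normal(scale=0.1, size=w.shape)   # noisy gradient of 0.5*||w||^2
    Eg2 = rho * Eg2 + (1 - rho) * g * g
    dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g   # eq. (12)
    Edw2 = rho * Edw2 + (1 - rho) * dw * dw
    w = w + dw
print(w)
```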

Advanced adaptive learning rates

Improving adaGrad (cont.)

ADAM (improving directions: momentum plus bias-corrected running averages):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 / 27
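A minimal sketch of the update in eq. (13) (hypothetical example): momentum on the gradient plus an RMS-style denominator, both bias-corrected.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([5.0, -3.0])
eta, beta1, beta2, delta = 0.1, 0.9, 0.999, 1e-8
m = np.zeros_like(w)
v = np.zeros_like(w)

for t in range(1, 2001):
    g = w + rng.normal(scale=0.1, size=w.shape)   # noisy gradient of 0.5*||w||^2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)                  # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + delta)   # ADAM update, eq. (13)
print(w)
```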

Advanced adaptive learning rates

Demo time

Wikipedia page "Test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 / 27

Advanced adaptive learning rates

Learn the learning rates

idea 3: learn the (global) learning rates as well

$$w_{t+1} = w_t - \eta g_t \implies w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1})$$

so $w_t$ is a function of $\eta$; learn $\eta$ (with gradient descent) by

$$\eta^* = \operatorname*{argmin}_\eta \sum_{s \le t} f_s(w(s, \eta))$$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 / 27
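A toy sketch of idea 3 (hypothetical example; the cited papers differentiate through the learning process, whereas this uses a finite-difference stand-in for the hypergradient): unroll a few SGD steps as a function of $\eta$, then descend on $\eta$ itself.

```python
import numpy as np

def unrolled_loss(eta, T=20):
    # unroll T gradient steps on f(w) = 0.5*||w||^2 (so g_t = w_t), as a function of eta
    w, total = np.array([5.0]), 0.0
    for _ in range(T):
        total += 0.5 * float(w @ w)
        w = w - eta * w
    return total

eta, lr_hyper, h = 0.1, 1e-4, 1e-4
for _ in range(100):
    # finite-difference stand-in for d(total unrolled loss)/d(eta)
    hypergrad = (unrolled_loss(eta + h) - unrolled_loss(eta - h)) / (2 * h)
    eta -= lr_hyper * hypergrad
print(eta)   # moves toward the eta that minimises the unrolled loss
```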

Advanced adaptive learning rates

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods
- Theoretical analysis guarantees good behaviour
- Adaptive learning rates provide good performance in practice
- Future work will reduce the labour spent on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 / 27

Advanced adaptive learning rates

References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 / 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 50: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Define Lyapunov sequence hn = sn minus x22

Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof

Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning

and stochastic approximation LeonBottou (1998)

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 51: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

With some algebra note that

hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x

〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nnablaC (sn)22

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

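To make the subsampling concrete, here is a minimal NumPy sketch, not from the talk: the quadratic per-datum losses and all names are illustrative assumptions. It checks empirically that the mini-batch gradient is an unbiased estimate of the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D = 1000, 32, 5
data = rng.normal(size=(N, D))          # one row per datum
x = rng.normal(size=D)                  # current parameter

def grad_fn(x, rows):
    # per-datum loss c_n(x) = 0.5 * ||x - data_n||^2, so f_n(x) = x - data_n
    return (x - rows).mean(axis=0)

full_grad = grad_fn(x, data)            # the true gradient, averaged over all N points

# average many independent mini-batch estimates: should approach the full gradient
estimates = [grad_fn(x, data[rng.choice(N, size=K, replace=False)])
             for _ in range(5000)]
mb_mean = np.mean(estimates, axis=0)
print(np.abs(mb_mean - full_grad).max())   # ~1e-3: the subsampled estimator is unbiased
```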

Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if
- $\mathbb{E}[|X_n|] < \infty$ for all $n$;
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$.

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

  $\sum_{n=1}^\infty \mathbb{E}\big[(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\, \mathbf{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}}\big] < \infty,$

then $X_n \to X_\infty$ almost surely.

³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.

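As a quick toy illustration of Doob's theorem (my own example, not from the talk): the positive process $X_{n+1} = X_n Z_n$ with i.i.d. $Z_n \ge 0$, $\mathbb{E}[Z_n] = 1$, is a martingale with $\sup_n \mathbb{E}|X_n| = X_0 < \infty$, so each path converges almost surely (here, in fact, to 0, since $\mathbb{E}[\log Z_n] < 0$).

```python
import numpy as np

rng = np.random.default_rng(1)

# X_{n+1} = X_n * Z_n with E[Z] = 1 and Z >= 0: a positive martingale,
# so sup_n E|X_n| = X_0 < infinity and Doob gives a.s. convergence.
n_steps, n_paths = 2000, 8
Z = rng.choice([0.5, 1.5], size=(n_steps, n_paths))   # E[Z] = 0.5*0.5 + 0.5*1.5 = 1
X = np.cumprod(Z, axis=0)                             # paths started at X_0 = 1
print(X[-1])   # each path has (a.s.) converged, here towards 0
```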

Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^*$;
2. $\forall \epsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \epsilon} \langle x - x^*, \nabla C(x)\rangle > 0$;
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$;

then, subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: the proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement give us the control we need over the unbiased gradient estimators.

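A minimal sketch of the proposition's setting (all specifics here, the quadratic $C$, the Gaussian gradient noise, and the $1/n$ rates, are illustrative assumptions): conditions 1-3 hold, the Robbins-Monro step sizes are used, and the iterates are observed to approach $x^*$.

```python
import numpy as np

rng = np.random.default_rng(2)

x_star = np.array([2.0, -1.0])            # unique minimiser of C(x) = 0.5 ||x - x*||^2
grad_C = lambda x: x - x_star             # true gradient; <x - x*, grad C(x)> > 0 away from x*

s = np.zeros(2)
for n in range(1, 100001):
    eps_n = 1.0 / n                       # Robbins-Monro: sum eps = inf, sum eps^2 < inf
    H_n = grad_C(s) + rng.normal(scale=0.5, size=2)   # unbiased, bounded second moment
    s = s - eps_n * H_n

print(s, np.linalg.norm(s - x_star))      # s_n ends up close to x* = (2, -1)
```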

Proof outline:
1. Define the Lyapunov process $h_n = \|s_n - x^*\|_2^2$.
2. Consider the variations $h_{n+1} - h_n$.
3. Show that $h_n$ converges almost surely.
4. Show that $h_n$ must converge to 0 almost surely.
5. Conclude $s_n \to x^*$.


Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

  $h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n)\rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2.$

So

  $\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\, \mathbb{E}\big[\|H_n(s_n)\|_2^2 \,\big|\, \mathcal{F}_n\big].$

Show that $h_n$ converges almost surely:
Exactly the same as for discrete gradient descent:
- Assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$.
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.

We get

  $\mathbb{E}\big[h'_{n+1} - h'_n \,\big|\, \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A$

  $\implies \mathbb{E}\big[(h'_{n+1} - h'_n)\, \mathbf{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \,\big|\, \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A.$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty \implies$ quasimartingale convergence $\implies (h'_n)_{n=1}^\infty$ converges a.s. $\implies (h_n)_{n=1}^\infty$ converges a.s.

Show that $h_n$ must converge to 0 almost surely:
From previous calculations,

  $\mathbb{E}\big[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \,\big|\, \mathcal{F}_n\big] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A.$

$(h_n)_{n=1}^\infty$ converges, so its variations are summable a.s.; assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, the right-hand term is summable a.s., so the inner-product term is also summable a.s.:

  $\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle < \infty$ almost surely.

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

  $\langle s_n - x^*, \nabla C(s_n)\rangle \to 0$ almost surely,

and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.


Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above is to assume:
- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0$.

The idea for the proof is then:
1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this only guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.


Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition, so how do we choose learning rates to achieve
- fast convergence?
- a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule
- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time...

Motivation (cont.)

idea 2: let the learning rates adapt themselves
- pre-conditioning with $|\mathrm{diag}(H)|$ [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning
- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and some practical comparisons.

Online learning
From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$;
- the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$.

SGD/mini-batch learning accelerates training in real time by
- processing one/a mini-batch of data points each iteration,
- considering gradients with data-point cost functions:

  $\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):
- each iteration $t$, we receive a loss $f_t(w)$
- and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$;
- then we update the weights with $w \leftarrow w - \eta g_t$.

For SGD, the received gradient is defined by

  $g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$


From batch to online learning (cont.)

Online learning assumes:
- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$);
- then we learn $w_{t+1}$ following some rules.

Performance is measured by regret:

  $R^*_T = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \qquad (1)$

General definition: $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

From batch to online learning (cont.)

Follow the Leader (FTL):

  $w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t f_i(w) \qquad (2)$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,

  $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]. \qquad (3)$

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$, and let the loss function at time $t$ be

  $f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$

FTL is easily fooled! It always jumps to the endpoint that the next loss penalises, so it suffers loss 1 every round, while the fixed choice $u = 0$ suffers none; see the sketch below.
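A tiny simulation of this adversarial game (a sketch of my own, with arbitrary initial play $w_1 = 0$): FTL on $S = [-1, 1]$ with these linear losses picks $w_{t+1} = -\mathrm{sign}(\sum_{i \le t} a_i)$ and accrues loss that grows linearly in $T$, so its regret against $u = 0$ is linear.

```python
import numpy as np

T = 100
a = np.array([-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, T + 1)])
# f_t(w) = a_t * w on S = [-1, 1]; FTL plays argmin_w (sum_{i<=t} a_i) * w

w, ftl_loss = 0.0, 0.0                 # w_1 = 0 (arbitrary initial play)
for t in range(T):
    ftl_loss += a[t] * w               # pay the current loss
    w = -np.sign(a[: t + 1].sum())     # FTL: endpoint opposite the cumulative slope
print(ftl_loss)                        # ~T; u = 0 has total loss 0, so regret grows linearly
```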

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

  $w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,

  $\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})].$

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

  $\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle, \quad \forall g(w) \in \partial f_t(w)$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t)\rangle$ as a surrogate:

  $\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t\rangle - \langle u, g_t\rangle, \quad \forall g_t \in \partial f_t(w_t)$

From batch to online learning (cont.)

Online Gradient Descent (OGD):

  $w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$

From FTRL to online gradient descent:
- use the linearisation as the surrogate loss (it upper-bounds the LHS);
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$;
- apply the FTRL regret bound to the surrogate loss:

  $w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t \langle w, g_i\rangle + \varphi(w)$

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,

  $R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t).$
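A quick numerical check of this bound (the quadratic losses and targets $c_t$ are my own toy choices): run OGD on $f_t(w) = \frac{1}{2}(w - c_t)^2$ with $\eta = 1/\sqrt{T}$ and compare the realized regret against the two terms of the bound.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 1000
c = rng.uniform(-1, 1, size=T)
eta = 1.0 / np.sqrt(T)

w, plays = 0.0, []
for t in range(T):
    plays.append(w)
    g = w - c[t]                 # gradient of f_t(w) = 0.5 (w - c_t)^2
    w = w - eta * g              # OGD step

plays = np.array(plays)
u = c.mean()                     # best fixed point in hindsight
regret = 0.5 * ((plays - c) ** 2).sum() - 0.5 * ((u - c) ** 2).sum()
grads = plays - c
bound = u ** 2 / (2 * eta) + eta * (grads ** 2).sum()   # w_1 = 0
print(regret <= bound, regret, bound)    # the bound holds; both sides are O(sqrt(T))
```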

Keep in mind for the later discussion of adaGrad:
- I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$;
- then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_*$;
- and we use Hölder's inequality in the proof.

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

  $B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle.$

Primal-dual sub-gradient update:

  $w_{t+1} = \operatorname*{argmin}_w\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \qquad (6)$

Proximal gradient/composite mirror descent:

  $w_{t+1} = \operatorname*{argmin}_w\; \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

  $G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$

  $w_{t+1} = \operatorname*{argmin}_{w \in S}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t}\, w^T H_t w \qquad (8)$

  $w_{t+1} = \operatorname*{argmin}_{w \in S}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \qquad (9)$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective;
- adapt the proximal term through time:

  $\psi_t(w) = \frac{1}{2}\, w^T H_t w$

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

  $w_{t+1} = \operatorname*{argmin}_w\; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T \big[\delta I + \mathrm{diag}(G_t^{1/2})\big](w - w_t)$

  $\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \qquad (10)$

(the square root and division are element-wise).
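Equation (10) in code, as a minimal sketch ($\eta$, $\delta$, and the badly scaled toy quadratic are assumptions of mine): adaGrad accumulates squared gradients per coordinate and scales each coordinate's step by the inverse root of that sum.

```python
import numpy as np

def adagrad(grad, w0, eta=0.5, delta=1e-8, steps=500):
    w, G = w0.copy(), np.zeros_like(w0)       # G accumulates squared gradients per coordinate
    for _ in range(steps):
        g = grad(w)
        G += g ** 2
        w -= eta * g / (delta + np.sqrt(G))   # eq. (10): per-coordinate step sizes
    return w

# toy objective with badly scaled curvature: C(w) = 0.5 * (100 w_0^2 + w_1^2)
grad = lambda w: np.array([100.0, 1.0]) * w
print(adagrad(grad, np.array([1.0, 1.0])))    # both coordinates approach 0 at the same rate
```

Note how the per-coordinate normalisation makes the two coordinates follow the same trajectory despite the 100x difference in curvature.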

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:
- I want the contribution of the $\|g_t\|_*^2$ terms induced by $\psi_t$ to be upper-bounded by $\|\cdot\|_2$ norms of the gradients;
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot,\, H_t\, \cdot\rangle}$;
- $\psi_t(w) = \frac{1}{2}\, w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$;
- a little math can show that

  $\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2.$

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

  $R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.$

For $\{w_t\}$ generated using the composite mirror descent update (9),

  $R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.$

Improving adaGrad

Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning;
- sensitive to initial conditions.

Possible improving directions:
- use a truncated sum / running average;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont.)

RMSprop (direction used: a running average instead of the full sum):

  $G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$

  $w_{t+1} = \operatorname*{argmin}_w\; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$

  $\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$

Improving adaGrad (cont.)

Using (approximate) second-order information: for a one-dimensional quadratic model,

  $\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}
  \;\implies\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}
  \;\implies\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}.$

Improving adaGrad (cont.)

adaDelta (directions used: running averages plus the approximate second-order correction above):

  $\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

  $w_{t+1} = \operatorname*{argmin}_w\; \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$

  $\implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12)$

Improving adaGrad (cont.)

ADAM (directions used: running averages, momentum, and bias correction):

  $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$

  $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

  $w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13)$
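Equation (13) in code, as a minimal sketch (the $\beta_1$, $\beta_2$, $\delta$ defaults follow the paper's suggestions; the step size and toy objective are assumptions of mine):

```python
import numpy as np

def adam(grad, w0, eta=0.1, beta1=0.9, beta2=0.999, delta=1e-8, steps=2000):
    w = w0.copy()
    m = np.zeros_like(w)   # running average of gradients (momentum)
    v = np.zeros_like(w)   # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)        # bias correction of the running averages
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)   # eq. (13)
    return w

grad = lambda w: np.array([100.0, 1.0]) * w           # same badly scaled quadratic as before
print(adam(grad, np.array([1.0, 1.0])))               # both coordinates approach 0
```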

Demo time!

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

idea 3: learn the (global) learning rates as well

  $w_{t+1} = w_t - \eta g_t \;\implies\; w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1})$

- $w_t$ is a function of $\eta$;
- learn $\eta$ (with gradient descent) by

  $\eta^* = \operatorname*{argmin}_\eta \sum_{s \le t} f_s(w(s, \eta))$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015];
- stochastic gradient descent to learn $\eta$ on the fly [Massé and Ollivier, 2015].

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce labour on tuning hyper-parameters.

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012, 13(1): 281–305.
Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011, 12: 2121–2159.
Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T., Hinton, G. Lecture 6.5 - RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D., Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2012, 4(2): 107–194.
Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Massé, P.-Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 52: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)

=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

le ε2nA

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 53: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show thatsuminfin

n=1 h+n ltinfin

Assuming nablaC (x)22 le A + Bx minus x2

2 we get

hn+1 minus (1minus ε2nB)hn le ε2

nA

Writing micron =prodn

i=11

1+ε2i B

and hprimen = micronhn we get

hprimen+1 minus hprimen le ε2nmicronA

Assumingsuminfin

n=1 ε2n ltinfin micron converges away from 0 so RHS is

summable22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that this implies that hn convergesIfsuminfin

n=1 max(0 hn+1 minus hn) ltinfin thensuminfin

n=1 min(0 hn+1 minus hn) ltinfin(why)

But hn+1 = h0 +sumn

k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))

So (hn)infinn=1 converges

22 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had

hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2

2

This is summable and if we assume further thatsuminfin

n=1 εn =infin thenwe get

〈sn minus xnablaC (sn)〉 rarr 0

contradicting hn converging away from 022 of 27

Discrete Gradient Descent

1 Define Lyapunov sequence hn = sn minus x22

2 Consider the positive variations h+n = max(0 hn+1 minus hn)

3 Show thatsuminfin

n=1 h+n ltinfin

4 Show that this implies that hn converges

5 Show that hn must converge to 0

6 sn rarr x

22 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27


Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- $C$ is a non-negative, three-times differentiable function
- The Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- The gradient estimate $H$ satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$
- There exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle point), rather than reaching the global optimum.

27 of 27


Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. How should we choose learning rates to achieve

- fast convergence?
- a better local optimum?

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 / 27

Motivation (cont.)

[Figure from [Bergstra and Bengio 2012].]

Idea 1: search for the best learning rate schedule, via grid search, random search, cross-validation, or Bayesian optimization.

But I'm lazy and I don't want to spend too much time!

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 / 27

Motivation (cont.)

Idea 2: let the learning rates adapt themselves:

- pre-conditioning with |diag(H)| [Becker and LeCun 1988]
- adaGrad [Duchi et al. 2010]
- adaDelta [Zeiler 2012], RMSprop [Tieleman and Hinton 2012]
- ADAM [Kingma and Ba 2014]
- Equilibrated SGD [Dauphin et al. 2015] (this year's NIPS)

Now we'll talk a bit about online learning (a good tutorial: [Shalev-Shwartz 2012]; milestone paper: [Zinkevich 2003]), and then give some practical comparisons.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 / 27

Online learning

From batch to online learning

Batch learning often assumes:

- we've got the whole dataset $\mathcal{D}$
- the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x\sim\mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by

- processing one datapoint / a mini-batch of datapoints each iteration
- considering gradients with data-point cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 / 27
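As a concrete sketch of the subsampled gradient estimate above (the toy squared-error dataset and the function name are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)   # toy dataset D

def minibatch_grad(w, batch_size=32):
    # Unbiased estimate of the full gradient of the mean squared-error cost,
    # formed by subsampling M = batch_size points uniformly at random.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size
```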

Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

- each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by

$$g_t = \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 / 27
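A minimal sketch of the OGD loop just described; the absolute-loss example at the end is an illustrative assumption:

```python
import numpy as np

def ogd(subgrads, w0, eta=0.1):
    # Minimal online gradient descent: at round t we receive a subgradient
    # oracle g_t and update w <- w - eta * g_t(w).
    w = np.asarray(w0, dtype=float)
    for g_t in subgrads:
        w = w - eta * g_t(w)
    return w

# e.g. 200 rounds of the (subdifferentiable) loss f_t(w) = |w - 3|:
w_final = ogd((lambda w: np.sign(w - 3.0) for _ in range(200)), w0=0.0)
```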


Online learning

From batch to online learning (cont.)

Online learning assumes:

- each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w\in S}\sum_{t=1}^T f_t(w) \tag{1}$$

General definition: $R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 / 27

Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):

$$w_{t+1} = \arg\min_{w\in S} \sum_{i=1}^t f_i(w) \tag{2}$$

Lemma (upper bound of regret). Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$:

$$R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right] \tag{3}$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 / 27

Online learning

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\,w, & t = 1 \\ w, & t \text{ is even} \\ -w, & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled: it keeps jumping between the endpoints $w = \pm 1$ and incurs loss 1 at every round after the first, while the fixed point $u = 0$ incurs zero loss, so the regret grows linearly in $T$.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 / 27
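To see the linear regret concretely, here is a small simulation sketch of this game (the helper name and its details are illustrative, not from the slides):

```python
def ftl_game(T=20):
    # The adversarial game above on S = [-1, 1]:
    # f_1(w) = -0.5 w, f_t(w) = w for even t, f_t(w) = -w for odd t > 1.
    coeffs = [-0.5 if t == 1 else (1.0 if t % 2 == 0 else -1.0)
              for t in range(1, T + 1)]
    w, cum, ftl_loss = 1.0, 0.0, 0.0      # w_1 can be anything; take 1.0
    for a in coeffs:
        ftl_loss += a * w                 # play w_t, receive f_t(w) = a * w
        cum += a                          # cumulative linear coefficient
        w = -1.0 if cum > 0 else 1.0      # FTL: argmin_{w in [-1,1]} cum * w
    print(ftl_loss)                       # grows like T, while u = 0 gets loss 0

ftl_game()
```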

Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \arg\min_{w\in S} \sum_{i=1}^t f_i(w) + \varphi(w) \tag{4}$$

Lemma (upper bound of regret). Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$:

$$\sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right]$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 / 27

Online learning

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S, \quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle \quad \forall g(w) \in \partial f_t(w)$$

Use the linearisation $\tilde f_t(w) = f_t(w_t) + \langle w, g(w_t)\rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t\rangle - \langle u, g_t\rangle \quad \forall g_t \in \partial f_t(w_t)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 / 27

Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \tag{5}$$

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (upper-bounding the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \arg\min_{w\in S} \sum_{i=1}^t \langle w, g_i\rangle + \varphi(w)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 / 27

Online learning

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent). Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$:

$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta\sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$$

Keep in mind for the later discussion of adaGrad: if we take $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$, then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_*$, and we use Hölder's inequality in the proof.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al. 2010]

Use proximal terms $\psi_t$: each $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle$$

Primal-dual sub-gradient update:

$$w_{t+1} = \arg\min_w\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \tag{6}$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \arg\min_w\; \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \tag{7}$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al. 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^T, \quad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$w_{t+1} = \arg\min_{w\in S}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t}\, w^T H_t w \tag{8}$$

$$w_{t+1} = \arg\min_{w\in S}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \tag{9}$$

To get adaGrad from online gradient descent, add a proximal term or a Bregman divergence to the objective, and adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2}\, w^T H_t w$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al. 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \arg\min_w\; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T \left[\delta I + \mathrm{diag}(G_t^{1/2})\right](w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \tag{10}$$

(with the division understood elementwise).

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 / 27
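A minimal NumPy sketch of the per-coordinate update in eq. (10); the function name and defaults are illustrative assumptions:

```python
import numpy as np

def adagrad_update(w, g, G_sum, eta=0.1, delta=1e-8):
    # One diagonal adaGrad step, eq. (10): per-coordinate step sizes from
    # the accumulated squared gradients. G_sum holds sum_i g_i**2 so far;
    # initialise it as np.zeros_like(w).
    G_sum = G_sum + g**2
    w = w - eta * g / (delta + np.sqrt(G_sum))
    return w, G_sum
```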

Advanced adaptive learning rates

adaGrad [Duchi et al. 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:

- we want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper-bounded in terms of $\|g_t\|_2$
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot,\, H_t\, \cdot\rangle}$; then $\psi_t(w) = \frac{1}{2}w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math then shows that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2\sum_{d=1}^D \|g_{1:T,d}\|_2$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 / 27

Advanced adaptive learning rates

adaGrad [Duchi et al. 2010] (cont.)

Theorem (Thm. 5 in the paper). Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$:

$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$$

For $w_t$ generated using the composite mirror descent update (9):

$$R_T(u) \le \frac{1}{2\eta}\max_{t\le T}\|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 / 27

Advanced adaptive learning rates

Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible directions for improvement:

- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:

- RMSprop [Tieleman and Hinton 2012]
- adaDelta [Zeiler 2012]
- ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 / 27

Advanced adaptive learning rates

Improving adaGrad (cont.)

RMSprop (direction taken: replace the sum by a running average of squared gradients):

$$G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^T, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \arg\min_w\; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \quad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \tag{11}$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 / 27
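A minimal sketch of the RMSprop step in eq. (11); names and defaults are illustrative:

```python
import numpy as np

def rmsprop_update(w, g, G_avg, eta=1e-3, rho=0.9, delta=1e-8):
    # One RMSprop step, eq. (11): exponential running average of squared
    # gradients, then divide the gradient elementwise by its RMS.
    G_avg = rho * G_avg + (1 - rho) * g**2
    w = w - eta * g / np.sqrt(delta + G_avg)
    return w, G_avg
```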

Advanced adaptive learning rates

Improving adaGrad (cont.)

Using (approximate) second-order information: for a Newton-like step,

$$\Delta w \propto H^{-1} g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2} \implies \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w} \implies \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 / 27

Advanced adaptive learning rates

Improving adaGrad (cont.)

adaDelta (directions taken: running averages plus the approximate second-order correction above):

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \arg\min_w\; \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \tag{12}$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 / 27
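A minimal sketch of the adaDelta step in eq. (12); placing $\delta$ inside both RMS terms follows [Zeiler 2012], and the defaults are illustrative. Note no global $\eta$ appears: the ratio of running RMS values plays the role of a per-coordinate learning rate.

```python
import numpy as np

def adadelta_update(w, g, G_avg, D_avg, rho=0.95, delta=1e-6):
    # One adaDelta step, eq. (12). G_avg tracks E[g^2], D_avg tracks E[dw^2];
    # initialise both as np.zeros_like(w).
    G_avg = rho * G_avg + (1 - rho) * g**2
    step = -np.sqrt(D_avg + delta) / np.sqrt(G_avg + delta) * g
    D_avg = rho * D_avg + (1 - rho) * step**2
    return w + step, G_avg, D_avg
```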

Advanced adaptive learning rates

Improving adaGrad (cont.)

ADAM (directions taken: running averages, momentum, and bias correction):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat v_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \delta} \tag{13}$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 / 27
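A minimal sketch of the ADAM step in eq. (13); the defaults follow common choices from [Kingma and Ba 2014] and are illustrative:

```python
import numpy as np

def adam_update(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
    # One ADAM step, eq. (13): momentum (m) and running second moment (v),
    # each bias-corrected because m_0 = v_0 = 0. t starts at 1.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - eta * m_hat / (np.sqrt(v_hat) + delta), m, v
```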

Advanced adaptive learning rates

Demo time!

Wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 / 27
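One way to script such a demo, reusing the `adam_update` sketch above on the Rosenbrock function from that page (the start point and step count are illustrative assumptions):

```python
import numpy as np

def rosenbrock_grad(w, a=1.0, b=100.0):
    # Gradient of f(x, y) = (a - x)^2 + b (y - x^2)^2, minimised at (a, a^2).
    x, y = w
    return np.array([-2*(a - x) - 4*b*x*(y - x**2), 2*b*(y - x**2)])

w, m, v = np.array([-1.5, 1.5]), np.zeros(2), np.zeros(2)
for t in range(1, 5001):
    w, m, v = adam_update(w, rosenbrock_grad(w), m, v, t)
print(w)   # should move toward the minimiser (1, 1)
```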

Advanced adaptive learning rates

Learn the learning rates

Idea 3: learn the (global) learning rates as well.

$$w_{t+1} = w_t - \eta g_t \implies w_t = w(t, \eta;\, w_1, \{g_i\}_{i\le t-1}),$$

so $w_t$ is a function of $\eta$. Learn $\eta$ (with gradient descent) by

$$\eta^* = \arg\min_\eta \sum_{s\le t} f_s(w(s, \eta))$$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al. 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 / 27
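A heavily simplified sketch of the on-the-fly flavour: differentiating $f_t(w_t)$ with respect to $\eta$ through only the last update $w_t = w_{t-1} - \eta g_{t-1}$ gives $\partial f_t/\partial\eta = -\langle g_t, g_{t-1}\rangle$. Updating $\eta$ with this one-step hypergradient is an illustrative simplification, not the exact algorithm of either cited paper:

```python
import numpy as np

def sgd_learned_eta(grad, w, eta=0.01, beta=1e-4, steps=1000):
    # Learn eta on the fly via the one-step hypergradient
    # d f_t / d eta = -<g_t, g_{t-1}> (one-step truncation; a sketch only).
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        eta = eta + beta * np.dot(g, g_prev)   # gradient descent on eta
        w = w - eta * g
        g_prev = g
    return w, eta

# e.g. on a simple quadratic: sgd_learned_eta(lambda w: w, np.ones(3))
```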

Advanced adaptive learning rates

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 / 27

Advanced adaptive learning rates

References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D., and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 / 27


Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 56: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Discrete Gradient Descent

1. Define the Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$.
4. Show that this implies that $h_n$ converges.
5. Show that $h_n$ must converge to 0.
6. Conclude $s_n \to x^*$.
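For orientation, here is a minimal sketch of the deterministic recursion this outline analyses, $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$; the quadratic cost and the step-size schedule are assumptions for the example, not part of the slides:

```python
import numpy as np

# Discrete gradient descent s_{n+1} = s_n - eps_n * grad C(s_n) on an
# assumed toy cost C(s) = 0.5 * ||s||_2^2, so x* = 0 and grad C(s) = s.
s = np.array([5.0, -3.0])
for n in range(1, 1001):
    s -= s / (n + 1)             # eps_n = 1/(n+1)
    if n % 200 == 0:
        print(n, s @ s)          # the Lyapunov sequence h_n decreases to 0
```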

Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x),$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K).$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
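To make the subsampling concrete, here is a minimal runnable sketch on a toy least-squares cost; the quadratic cost, the synthetic data, and the step-size schedule are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 1000, 5, 32                # data set size, dimension, minibatch size
A = rng.normal(size=(N, D))
x_true = rng.normal(size=D)
b = A @ x_true + 0.1 * rng.normal(size=N)

def grad_estimate(x):
    """Unbiased estimate of grad C(x): average f_k over a random size-K subset."""
    idx = rng.choice(N, size=K, replace=False)
    return A[idx].T @ (A[idx] @ x - b[idx]) / K

x = np.zeros(D)
for n in range(1, 5001):
    x -= (1.0 / n) * grad_estimate(x)   # eps_n = 1/n (Robbins-Monro schedule)
print(np.linalg.norm(x - x_true))       # distance to x_true; should be small
```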


Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale$^3$ if

- $\mathbb{E}[|X_n|] < \infty$ for all $n$;
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$.

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\big[(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\,\mathbb{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}}\big] < \infty,$$

then $X_n \to X_\infty$ almost surely.

$^3$ On a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$ with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.
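A quick numerical illustration of Doob's theorem (the product process here is an assumed toy example, not from the slides): $X_n = \prod_{i=1}^n U_i$ with $U_i$ i.i.d. and $\mathbb{E}[U_i] = 1$ is a positive martingale with $\sup_n \mathbb{E}[|X_n|] = 1$, so each path settles to a limit; here the limit is 0, since $\mathbb{E}[\log U_i] < 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
# X_n = prod_{i<=n} U_i with U_i ~ Uniform(0.5, 1.5), so E[U_i] = 1.
paths = np.cumprod(rng.uniform(0.5, 1.5, size=(5, 2000)), axis=1)
print(paths[:, -1])   # each path has (almost surely) converged, here to ~0
```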


Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^*$;

2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$;

3. $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
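For example, the standard schedule $\varepsilon_n = c/n$ (the constant $c > 0$ is an arbitrary illustrative choice) satisfies both step-size conditions:

```latex
\sum_{n=1}^{\infty} \varepsilon_n = c \sum_{n=1}^{\infty} \frac{1}{n} = \infty
\quad \text{(harmonic series)},
\qquad
\sum_{n=1}^{\infty} \varepsilon_n^2 = c^2 \sum_{n=1}^{\infty} \frac{1}{n^2}
= \frac{c^2 \pi^2}{6} < \infty.
```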


Stochastic Gradient Descent

1. Define the Lyapunov process $h_n = \|s_n - x^*\|_2^2$.
2. Consider the variations $h_{n+1} - h_n$.
3. Show that $h_n$ converges almost surely.
4. Show that $h_n$ must converge to 0 almost surely.
5. Conclude $s_n \to x^*$.


Step 2: consider the variations $h_{n+1} - h_n$, and in particular the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2.$$

So

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2\, \mathbb{E}\big[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\big].$$

Step 3: show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

- Assume $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B\|x - x^*\|_2^2$.
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.

We get

$$\mathbb{E}\big[h'_{n+1} - h'_n \mid \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A
\;\Longrightarrow\;
\mathbb{E}\big[(h'_{n+1} - h'_n)\,\mathbb{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A.$$

Since $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, quasimartingale convergence applies, so $(h'_n)_{n=1}^\infty$ converges a.s.; and since the $\mu_n$ decrease to a strictly positive limit, $(h_n)_{n=1}^\infty$ converges a.s. as well.

Step 4: show that $h_n$ must converge to 0 almost surely. From the previous calculations,

$$\mathbb{E}\big[h_{n+1} - (1 + \varepsilon_n^2 B)h_n \mid \mathcal{F}_n\big] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A.$$

$(h_n)_{n=1}^\infty$ converges a.s., and we assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the terms involving $\varepsilon_n^2$ are summable a.s.; hence the remaining (positive) terms must be summable a.s. as well:

$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty \quad \text{almost surely.}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely,}$$

and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.


Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above is to assume:

- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}\big[\|H_n(x)\|^k\big] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.

2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.

3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.


Motivation

So far we've told you why SGD is "safe" :)
- but Robbins-Monro is just a sufficient condition;
- so how do we choose learning rates to achieve
  - fast convergence,
  - a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012: grid search vs. random search over hyper-parameters]

idea 1: search for the best learning rate schedule
- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time...
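A minimal sketch of the random-search option; the log-uniform sampling range and the toy training objective are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(eta):
    """Loss after a short SGD run with learning rate eta
    (stand-in for training and validating your actual model)."""
    x = np.zeros(2)
    for _ in range(100):
        g = 2 * x + rng.normal(scale=0.1, size=2)  # noisy gradient of ||x||^2
        x -= eta * g
    return float(x @ x)

# Random search: sample learning rates log-uniformly, keep the best.
candidates = 10 ** rng.uniform(-4, 0, size=20)
best_eta = min(candidates, key=evaluate)
print(best_eta)
```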

Motivation (cont.)

idea 2: let the learning rates adapt themselves
- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning
- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and some practical comparisons.

Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$;
- the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$.

SGD / mini-batch learning accelerates training in real time by
- processing one datapoint / a mini-batch per iteration,
- considering gradients with per-datapoint cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m).$$

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):
- at each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$;
- then we update the weights with $w \leftarrow w - \eta g_t$.

For SGD the received gradient is defined by

$$g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m).$$


From batch to online learning (cont.)

Online learning assumes:
- at each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$);
- then we learn $w_{t+1}$ following some rules.

Performance is measured by regret:

$$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w). \qquad (1)$$

General definition: $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.

From batch to online learning (cont.)

Follow the Leader (FTL):

$$w_{t+1} = \underset{w \in S}{\operatorname{argmin}} \sum_{i=1}^t f_i(w). \qquad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,

$$R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]. \qquad (3)$$

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\,w, & t = 1, \\ w, & t \text{ is even}, \\ -w, & t > 1 \text{ and } t \text{ is odd.} \end{cases}$$

FTL is easily fooled: it suffers loss 1 every round from $t = 2$ onwards, while the fixed comparator $u = 0$ has zero loss, so the regret grows linearly (see the simulation below).
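A quick simulation of this game (a minimal sketch; the horizon $T$ and the choice $w_1 = 1$ are arbitrary) confirms the linear regret against $u = 0$:

```python
import numpy as np

T = 100

def loss_coeff(t):
    """f_t(w) = c_t * w on S = [-1, 1]."""
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

w, regret, cumulative = 1.0, 0.0, 0.0   # w_1 can be anything in S; take 1.0
for t in range(1, T + 1):
    c = loss_coeff(t)
    regret += c * w - 0.0               # comparator u = 0 has zero loss
    cumulative += c
    # FTL: w_{t+1} minimises cumulative * w over [-1, 1]
    w = -np.sign(cumulative) if cumulative != 0 else 0.0

print(regret)   # grows linearly with T (about T - 1.5)
```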

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \underset{w \in S}{\operatorname{argmin}} \sum_{i=1}^t f_i(w) + \varphi(w). \qquad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,

$$\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})].$$

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S, \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle \quad \forall g(w) \in \partial f_t(w),$$

and use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \qquad \forall g_t \in \partial f_t(w_t).$$

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \qquad g_t \in \partial f_t(w_t). \qquad (5)$$

From FTRL to online gradient descent:
- use the linearisation as surrogate loss (upper-bounding the LHS);
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$;
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \underset{w \in S}{\operatorname{argmin}} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w).$$
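A sketch of update (5), with a Euclidean projection back onto $S$ (here assumed to be an L2 ball) so the iterates stay feasible:

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Euclidean projection onto S = {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def ogd(loss_grads, w1, eta):
    """Online gradient descent: w_{t+1} = Proj_S(w_t - eta * g_t).

    loss_grads: iterable of subgradient oracles, one per round,
    each mapping the current w to some g_t in the subdifferential of f_t.
    """
    w = w1.copy()
    iterates = [w.copy()]
    for grad in loss_grads:
        w = project_l2_ball(w - eta * grad(w))
        iterates.append(w.copy())
    return iterates
```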

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,

$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t).$$

Keep in mind for the later discussion of adaGrad:
- I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$;
- then the L2 norm in the bound will be changed to the dual norm $\|\cdot\|_*$;
- and we use Hölder's inequality for the proof.

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t \rangle.$$

Primal-dual sub-gradient update:

$$w_{t+1} = \underset{w}{\operatorname{argmin}}\; \eta\langle w, g_t \rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w). \qquad (6)$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \underset{w}{\operatorname{argmin}}\; \eta\langle w, g_t \rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t). \qquad (7)$$

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2}),$$

$$w_{t+1} = \underset{w \in S}{\operatorname{argmin}}\; \eta\langle w, g_t \rangle + \eta\varphi(w) + \frac{1}{2t} w^T H_t w, \qquad (8)$$

$$w_{t+1} = \underset{w \in S}{\operatorname{argmin}}\; \eta\langle w, g_t \rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t). \qquad (9)$$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective;
- adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2} w^T H_t w.$$

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \underset{w}{\operatorname{argmin}}\; \eta\langle w, g_t \rangle + \frac{1}{2}(w - w_t)^T \big[\delta I + \mathrm{diag}(G_t)^{1/2}\big](w - w_t)$$

$$\Rightarrow\quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \quad \text{(elementwise).} \qquad (10)$$
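Update (10) as code, a minimal sketch with elementwise operations; the default values of $\eta$ and $\delta$ are assumptions:

```python
import numpy as np

class AdaGrad:
    """Diagonal adaGrad, composite mirror descent form (update (10))."""
    def __init__(self, dim, eta=0.1, delta=1e-8):
        self.eta, self.delta = eta, delta
        self.sum_sq = np.zeros(dim)          # running sum of squared gradients

    def step(self, w, g):
        self.sum_sq += g * g                 # diagonal of G_t
        return w - self.eta * g / (self.delta + np.sqrt(self.sum_sq))
```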

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:
- I want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper-bounded by $\|g_t\|_2$;
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot,\, H_t\, \cdot \rangle}$;
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$;
- a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2.$$

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.$$

For $\{w_t\}$ generated using the composite mirror descent update (9),

$$R_T(u) \le \frac{1}{2\eta}\max_{t \le T}\|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.$$

Improving adaGrad

Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning;
- sensitive to initial conditions.

Possible improving directions:
- use a truncated sum / running average;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont.)

RMSprop:

$$G_t = \rho G_{t-1} + (1 - \rho)g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2},$$

$$w_{t+1} = \underset{w}{\operatorname{argmin}}\; \eta\langle w, g_t \rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow\quad w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t). \qquad (11)$$

(Direction used here: replace the truncated sum with a running average.)
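Update (11) as code (a sketch; the decay $\rho = 0.9$ and the other defaults are assumptions):

```python
import numpy as np

class RMSprop:
    """RMSprop: divide by a running RMS of recent gradients (update (11))."""
    def __init__(self, dim, eta=0.001, rho=0.9, delta=1e-8):
        self.eta, self.rho, self.delta = eta, rho, delta
        self.avg_sq = np.zeros(dim)          # running average of g^2

    def step(self, w, g):
        self.avg_sq = self.rho * self.avg_sq + (1 - self.rho) * g * g
        return w - self.eta * g / np.sqrt(self.delta + self.avg_sq)
```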

Improving adaGrad (cont.)

A second-order (Newton-like) heuristic:

$$\Delta w \propto H^{-1}g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}
\;\Rightarrow\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}
\;\Rightarrow\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}.$$

Improving adaGrad (cont.)

adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}},$$

$$w_{t+1} = \underset{w}{\operatorname{argmin}}\; \langle w, g_t \rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow\quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t. \qquad (12)$$

(Directions used here: running averages plus approximate second-order information via the RMS ratio.)
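Update (12) as code; following Zeiler's paper, the stabiliser $\delta$ sits inside both RMS terms (the defaults $\rho = 0.95$, $\delta = 10^{-6}$ are assumptions):

```python
import numpy as np

class AdaDelta:
    """adaDelta: scale by RMS of past updates over RMS of gradients (update (12))."""
    def __init__(self, dim, rho=0.95, delta=1e-6):
        self.rho, self.delta = rho, delta
        self.avg_g2 = np.zeros(dim)          # running average of g^2
        self.avg_dw2 = np.zeros(dim)         # running average of (delta w)^2

    def step(self, w, g):
        self.avg_g2 = self.rho * self.avg_g2 + (1 - self.rho) * g * g
        dw = -np.sqrt(self.avg_dw2 + self.delta) / np.sqrt(self.avg_g2 + self.delta) * g
        self.avg_dw2 = self.rho * self.avg_dw2 + (1 - self.rho) * dw * dw
        return w + dw
```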

Improving adaGrad (cont.)

ADAM:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^2,$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t},$$

$$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}. \qquad (13)$$

(Directions used here: running averages, momentum, and bias correction of the running averages.)
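Update (13) as code (a sketch; the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ follow Kingma and Ba):

```python
import numpy as np

class Adam:
    """ADAM: bias-corrected momentum and second-moment estimates (update (13))."""
    def __init__(self, dim, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
        self.eta, self.beta1, self.beta2, self.delta = eta, beta1, beta2, delta
        self.m = np.zeros(dim)
        self.v = np.zeros(dim)
        self.t = 0

    def step(self, w, g):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * g
        self.v = self.beta2 * self.v + (1 - self.beta2) * g * g
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return w - self.eta * m_hat / (np.sqrt(v_hat) + self.delta)
```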

Demo time

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
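In the spirit of the demo, a small driver comparing the optimizer classes sketched above on the Beale function from that page (minimum at $(3, 0.5)$; the hand-derived gradient, start point, learning rates, and iteration count are assumptions for the example):

```python
import numpy as np

def beale_grad(w):
    """Gradient of the Beale function, whose minimum is at (3, 0.5)."""
    x, y = w
    t1 = 1.5 - x + x * y
    t2 = 2.25 - x + x * y**2
    t3 = 2.625 - x + x * y**3
    dx = 2 * t1 * (y - 1) + 2 * t2 * (y**2 - 1) + 2 * t3 * (y**3 - 1)
    dy = 2 * t1 * x + 4 * t2 * x * y + 6 * t3 * x * y**2
    return np.array([dx, dy])

for opt in (AdaGrad(2, eta=0.3), RMSprop(2, eta=0.01), AdaDelta(2), Adam(2, eta=0.1)):
    w = np.array([-1.0, -1.0])
    for _ in range(2000):
        w = opt.step(w, beale_grad(w))
    print(type(opt).__name__, w)   # compare how close each gets to (3, 0.5)
```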

Learn the learning rates

idea 3: learn the (global) learning rates as well.

$$w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1}),$$

i.e. $w_t$ is a function of $\eta$, so we can learn $\eta$ (with gradient descent) by

$$\eta^* = \underset{\eta}{\operatorname{argmin}} \sum_{s \le t} f_s(w(s, \eta)).$$

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- Stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.

References

- Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 57: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 58: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form

nablaC (x) =1

N

Nsumn=1

fn(x)

and an approximation is formed by subsampling

nablaC (x) =1

K

sumkisinIK

fk(x) (Ik sim Unif (subsets of size K))

Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it

23 of 27

Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^{\infty}$ is a martingale(3) if

I $\mathbb{E}[|X_n|] < \infty$ for all $n$
I $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^{\infty}$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^{\infty}$ is a positive stochastic process and

$\sum_{n=1}^{\infty} \mathbb{E}\left[(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\,\mathbf{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}}\right] < \infty,$

then $X_n \to X_\infty$ almost surely.

(3) on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^{\infty}, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.

Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies:

1. $C$ has a unique minimiser $x^\star$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x)\rangle > 0$
3. $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement give us the control we need over the unbiased gradient estimators.

Stochastic Gradient Descent

1. Define the Lyapunov process $h_n = \|s_n - x^\star\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^\star$

Step 2: consider the variations $h_{n+1} - h_n$.

Consider the positive variations $h_n^+ = \max(0,\, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case:

$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, H_n(s_n)\rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$

So

$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2\, \mathbb{E}\left[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\right]$

Step 3: show that $h_n$ converges almost surely.

Exactly the same as for discrete gradient descent:

I Assume $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^\star\|_2^2$
I Introduce $\mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

Get

$\mathbb{E}\left[h'_{n+1} - h'_n \mid \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$

$\Rightarrow\ \mathbb{E}\left[(h'_{n+1} - h'_n)\,\mathbf{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$

$\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$ $\Rightarrow$ quasimartingale convergence $\Rightarrow$ $(h'_n)_{n=1}^{\infty}$ converges a.s. $\Rightarrow$ $(h_n)_{n=1}^{\infty}$ converges a.s.

Step 4: show that $h_n$ must converge to 0 almost surely.

From the previous calculations:

$\mathbb{E}\left[h_{n+1} - (1 - \varepsilon_n^2 B) h_n \mid \mathcal{F}_n\right] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2 A$

$(h_n)_{n=1}^{\infty}$ converges, so the left-hand sequence is summable a.s. We assume $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, so the right-hand term is summable a.s., and hence the inner-product term is also summable a.s.:

$\sum_{n=1}^{\infty} \varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle < \infty$ almost surely.

If we assume in addition that $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, this forces

$\langle s_n - x^\star, \nabla C(s_n)\rangle \to 0$ almost surely,

and by our initial assumptions about $C$, $(h_n)_{n=1}^{\infty}$ must converge to 0 almost surely.

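Putting the proposition to work: a minimal simulation, assuming a strongly convex quadratic cost (so conditions 1-3 hold) and the Robbins-Monro schedule $\varepsilon_n = 1/n$; the target x_star and the noise scale are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x_star = np.array([2.0, -1.0])       # unique minimiser of C(x) = 0.5 ||x - x_star||^2

def H(s):
    # unbiased gradient estimate: nabla C(s) = s - x_star, plus zero-mean noise
    return (s - x_star) + rng.normal(scale=0.5, size=2)

s = np.zeros(2)
for n in range(1, 100_001):
    eps_n = 1.0 / n                  # sum eps_n = inf, sum eps_n^2 < inf
    s = s - eps_n * H(s)

print(np.linalg.norm(s - x_star))    # h_n = ||s_n - x_star||^2 -> 0 a.s.
```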

Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

I $C$ is a non-negative, three-times differentiable function
I The Robbins-Monro learning rate conditions hold: $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$
I The gradient estimate $H$ satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$
I There exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^{\infty}$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle) rather than reaching the global optimum.

Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is only a sufficient condition. How, then, should we choose learning rates to achieve

- fast convergence
- a better local optimum?

Motivation (cont.)

[Figure from [Bergstra and Bengio, 2012]: grid search vs. random search over hyper-parameters]

idea 1: search for the best learning rate schedule

- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time...

Motivation (cont.)

idea 2: let the learning rates adapt themselves

- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning

- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and some practical comparisons.

Online learning

From batch to online learning

Batch learning often assumes:

- we've got the whole dataset D
- the cost function $C(w, D) = \mathbb{E}_{x \sim D}[c(w, x)]$

SGD/mini-batch learning accelerates training in real time by:

- processing one data point / a mini-batch of data points each iteration
- considering gradients with data-point cost functions:

$\nabla C(w, D) \approx \frac{1}{M}\sum_{m=1}^{M} \nabla c(w, x_m)$

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

- each iteration t we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by

$g_t = \frac{1}{M}\sum_{m=1}^{M} \nabla c(w, x_m)$


From batch to online learning (cont.)

Online learning assumes:

- each iteration t we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$R^\star_T = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w) \qquad (1)$

General definition: $R_T(u) = \sum_{t=1}^{T} \left[f_t(w_t) - f_t(u)\right]$

SGD ⇔ "follow the regularized leader" with L2 regularization

From batch to online learning (cont.)

Follow the Leader (FTL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) \qquad (2)$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,

$R_T(u) = \sum_{t=1}^{T} \left[f_t(w_t) - f_t(u)\right] \le \sum_{t=1}^{T} \left[f_t(w_t) - f_t(w_{t+1})\right] \qquad (3)$

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and take the loss function at time t to be

$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$

FTL is easily fooled: it always chases the previous rounds' leader to an endpoint of $S$, suffers loss 1 in every round after the first, and so incurs regret growing linearly in $T$ against the fixed point $u = 0$.
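A quick sketch of this game in code (the starting point w_1 = 0 and the horizon T are illustrative; the argmin over S = [-1, 1] is an endpoint because the cumulative loss is linear in w):

```python
def coeff(t):
    # f_t(w) = c_t * w on S = [-1, 1]
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T = 100
w, cum, ftl_loss = 0.0, 0.0, 0.0      # w_1 = 0; the comparator u = 0 has total loss 0
for t in range(1, T + 1):
    c = coeff(t)
    ftl_loss += c * w                  # suffer f_t(w_t)
    cum += c
    w = -1.0 if cum > 0 else 1.0       # FTL: argmin of (cum * w) over [-1, 1]

print(ftl_loss)                        # ~ T - 1: regret vs u = 0 grows linearly
```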

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w) \qquad (4)$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,

$\sum_{t=1}^{T} \left[f_t(w_t) - f_t(u)\right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} \left[f_t(w_t) - f_t(w_{t+1})\right]$

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$\forall w, u \in S:\ f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle \quad \forall g(w) \in \partial f_t(w)$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t)\rangle$ as a surrogate:

$\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t\rangle - \langle u, g_t\rangle \quad \forall g_t \in \partial f_t(w_t)$

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (it upper bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} \langle w, g_i\rangle + \varphi(w)$

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,

$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$

Keep in mind for the later discussion of adaGrad: I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$; then the L2 norm above will be changed to the dual norm $\|\cdot\|_*$, and we use Hölder's inequality in the proof.
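To see the bound in action, a small sketch running OGD on illustrative convex quadratic losses and checking the realized regret against the bound (the losses, η, and the comparator u are assumptions for the demo, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
T, eta = 500, 0.05
z = rng.normal(size=(T, 3))                     # centres of the toy losses
f = lambda w, t: 0.5 * np.sum((w - z[t]) ** 2)  # convex f_t

w = np.zeros(3)
ws, gs = [], []
for t in range(T):
    g = w - z[t]                                # g_t = grad f_t(w_t)
    ws.append(w.copy()); gs.append(g)
    w = w - eta * g                             # OGD step, eq. (5)

u = z.mean(axis=0)                              # a fixed comparator; w_1 = 0
regret = sum(f(ws[t], t) - f(u, t) for t in range(T))
bound = np.sum(u ** 2) / (2 * eta) + eta * sum(np.sum(g ** 2) for g in gs)
print(regret <= bound, round(regret, 2), round(bound, 2))
```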

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle$

Primal-dual sub-gradient update:

$w_{t+1} = \arg\min_{w} \ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \qquad (6)$

Proximal gradient / composite mirror descent:

$w_{t+1} = \arg\min_{w} \ \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$G_t = \sum_{i=1}^{t} g_i g_i^\top, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$

$w_{t+1} = \arg\min_{w \in S} \ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t} w^\top H_t w \qquad (8)$

$w_{t+1} = \arg\min_{w \in S} \ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^\top H_t (w - w_t) \qquad (9)$

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:

$\psi_t(w) = \frac{1}{2} w^\top H_t w$

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$w_{t+1} = \arg\min_{w} \ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^\top [\delta I + \mathrm{diag}(G_t)](w - w_t)$

$\Rightarrow\ w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}} \qquad (10)$

(the square root and division are element-wise)
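Update (10) as code, a minimal diagonal-adaGrad sketch (η, δ and the toy objective are illustrative):

```python
import numpy as np

def adagrad_step(w, g, G, eta=0.1, delta=1e-8):
    # G accumulates sum_i g_i^2 coordinate-wise: the diagonal of G_t
    G = G + g ** 2
    w = w - eta * g / (delta + np.sqrt(G))   # eq. (10), element-wise
    return w, G

# usage on the toy quadratic C(w) = 0.5 ||w||^2
w, G = np.ones(4), np.zeros(4)
for _ in range(1000):
    g = w                                    # gradient of the toy objective
    w, G = adagrad_step(w, g, G)
print(w)                                     # close to the minimiser at 0
```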

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:

- I want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$ terms
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot,\, H_t\, \cdot\rangle}$
- $\psi_t(w) = \frac{1}{2} w^\top H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$\sum_{t=1}^{T} \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2$

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$

For $\{w_t\}$ generated using the composite mirror descent update (9):

$R_T(u) \le \frac{1}{2\eta}\max_{t \le T}\|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$

Improving adaGrad

Possible problems of adaGrad:

- the global learning rate η still needs hand tuning
- sensitive to initial conditions

Possible improving directions:

- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont.): RMSprop

Replace the full sum by a running average:

$G_t = \rho G_{t-1} + (1-\rho)\, g_t g_t^\top, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$

$w_{t+1} = \arg\min_{w} \ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^\top H_t (w - w_t)$

$\Rightarrow\ w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$

(Improving direction used here: truncated sum / running average.)
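Eq. (11) as a sketch (the hyper-parameter values are common illustrative defaults, not from the slides):

```python
import numpy as np

def rmsprop_step(w, g, G, eta=1e-3, rho=0.9, delta=1e-8):
    # running average of squared gradients: the diagonal of G_t
    G = rho * G + (1 - rho) * g ** 2
    w = w - eta * g / np.sqrt(delta + G)   # divide by RMS[g]_t, eq. (11)
    return w, G
```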

Improving adaGrad (cont.)

Motivation for adaDelta: approximate the diagonal Hessian from the ratio of gradient and update magnitudes:

$\Delta w \propto H^{-1} g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2} \ \Rightarrow\ \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w} \ \Rightarrow\ \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

(Improving direction used here: (approximate) second-order information.)

Improving adaGrad (cont.): adaDelta

$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

$w_{t+1} = \arg\min_{w} \ \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^\top H_t (w - w_t)$

$\Rightarrow\ w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12)$

(Improving directions used here: running average + approximate second-order information.)
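Eq. (12) as a sketch; it differs from RMSprop only in the numerator, which tracks an RMS of the parameter updates themselves (ρ and δ are illustrative defaults):

```python
import numpy as np

def adadelta_step(w, g, Eg, Edw, rho=0.95, delta=1e-6):
    # Eg, Edw: running averages of g^2 and of the squared updates (Delta w)^2
    Eg = rho * Eg + (1 - rho) * g ** 2
    step = np.sqrt(Edw + delta) / np.sqrt(Eg + delta) * g   # eq. (12)
    Edw = rho * Edw + (1 - rho) * step ** 2
    return w - step, Eg, Edw
```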

Improving adaGrad (cont.): ADAM

$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13)$

(Improving directions used here: running average + momentum + bias correction of the running average.)
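Eq. (13) as a sketch (β₁, β₂, δ set to commonly used defaults; t is 1-indexed so the bias corrections are well defined):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
    m = beta1 * m + (1 - beta1) * g          # first-moment running average
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment running average
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + delta)   # eq. (13)
    return w, m, v
```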

Demo time

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

idea 3: learn the (global) learning rates as well

$w_{t+1} = w_t - \eta g_t \ \Rightarrow\ w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1})$

so $w_t$ is a function of $\eta$; learn $\eta$ (with gradient descent) by

$\eta^\star = \arg\min_{\eta} \sum_{s \le t} f_s(w(s, \eta))$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn η on the fly [Masse and Ollivier, 2015]
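As a flavour of the on-the-fly approach (a crude hypergradient sketch under our own assumptions, not the exact method of either paper): differentiating $f_t(w_t)$ with respect to $\eta$ through the last update $w_t = w_{t-1} - \eta g_{t-1}$ gives $\partial f_t/\partial \eta = -\langle g_t, g_{t-1}\rangle$, so $\eta$ can itself be adapted by gradient descent:

```python
import numpy as np

rng = np.random.default_rng(3)
w, eta, beta = np.ones(5), 0.01, 1e-4      # beta: learning rate for eta itself
g_prev = np.zeros(5)
for t in range(1000):
    g = w + rng.normal(scale=0.1, size=5)  # noisy gradient of 0.5 ||w||^2
    eta = eta + beta * np.dot(g, g_prev)   # descend d f_t / d eta = -<g_t, g_prev>
    w = w - eta * g
    g_prev = g
print(eta, np.linalg.norm(w))
```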

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.

Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.

Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Tieleman, T., Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.

Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.

Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.


Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 61: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Measure-theory-free martingale theory

Let $(X_n)_{n \geq 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \,|\, m \leq n)$.

Definition: A stochastic process $(X_n)_{n=0}^{\infty}$ is a martingale³ if

- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \,|\, \mathcal{F}_m] = X_m$ for $n \geq m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^{\infty}$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^{\infty}$ is a positive stochastic process and

$\sum_{n=1}^{\infty} \mathbb{E}\left[ (\mathbb{E}[X_{n+1} | \mathcal{F}_n] - X_n) \, \mathbf{1}_{\{\mathbb{E}[X_{n+1}|\mathcal{F}_n] - X_n > 0\}} \right] < \infty,$

then $X_n \to X_\infty$ almost surely.

³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^{\infty}, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m | m \leq n)$ for all $n$.


Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^\star$,
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$,
3. $\mathbb{E}[\|H_n(x)\|_2^2] \leq A + B \|x - x^\star\|_2^2$ for some $A, B \geq 0$ independent of $n$,

then, subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

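To make the setting concrete, here is a minimal numerical sketch (not from the slides) of the proposition with a toy quadratic cost $C(s) = \frac{1}{2}\|s - x^\star\|_2^2$, Gaussian gradient noise, and the Robbins-Monro schedule $\varepsilon_n = 1/n$; all names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cost C(s) = ||s - x_star||^2 / 2, so grad C(s) = s - x_star;
# H_n adds zero-mean Gaussian noise, making it an unbiased estimator.
x_star = np.array([1.0, -2.0])

def H(s):
    return (s - x_star) + rng.normal(scale=0.5, size=s.shape)

s = np.zeros(2)
for n in range(1, 10001):
    eps_n = 1.0 / n          # Robbins-Monro: sum eps_n = inf, sum eps_n^2 < inf
    s = s - eps_n * H(s)

h_n = np.sum((s - x_star) ** 2)  # Lyapunov quantity ||s_n - x_star||_2^2
print(s, h_n)                    # s should be close to x_star, h_n close to 0
```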

Proof outline:

1. Define the Lyapunov process $h_n = \|s_n - x^\star\|_2^2$.
2. Consider the variations $h_{n+1} - h_n$.
3. Show that $h_n$ converges almost surely.
4. Show that $h_n$ must converge to 0 almost surely.
5. Conclude $s_n \to x^\star$.


Step 2: consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2.$

So

$\mathbb{E}[h_{n+1} - h_n | \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \, \mathbb{E}\left[\|H_n(s_n)\|_2^2 \,\big|\, \mathcal{F}_n\right].$


Step 3: show that $h_n$ converges almost surely.

Exactly the same as for discrete gradient descent:

- Assume $\mathbb{E}[\|H_n(x)\|_2^2] \leq A + B \|x - x^\star\|_2^2$.
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.

We get

$\mathbb{E}\left[h'_{n+1} - h'_n \,\big|\, \mathcal{F}_n\right] \leq \varepsilon_n^2 \mu_n A$

$\Rightarrow \quad \mathbb{E}\left[(h'_{n+1} - h'_n) \, \mathbf{1}_{\{\mathbb{E}[h'_{n+1} - h'_n | \mathcal{F}_n] > 0\}} \,\Big|\, \mathcal{F}_n\right] \leq \varepsilon_n^2 \mu_n A.$

Since $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, the quasimartingale convergence theorem applies, so $(h'_n)_{n=1}^{\infty}$ converges a.s., and hence $(h_n)_{n=1}^{\infty}$ converges a.s.


Step 4: show that $h_n$ must converge to 0 almost surely.

From the previous calculations,

$\mathbb{E}\left[h_{n+1} - (1 - \varepsilon_n^2 B) h_n \,\big|\, \mathcal{F}_n\right] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A.$

$(h_n)_{n=1}^{\infty}$ converges, so the sequence on the left is summable a.s. We assume $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, so the right term is summable a.s., and hence the left term is also summable a.s.:

$\sum_{n=1}^{\infty} \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle < \infty$ almost surely.

If we assume in addition that $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, this forces $\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0$ almost surely, and by our initial assumptions about $C$, $(h_n)_{n=1}^{\infty}$ must converge to 0 almost surely.

Step 5 is then immediate: $h_n \to 0$ means $s_n \to x^\star$.


Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \leq A_k + B_k \|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^{\infty}$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.


Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. How should we choose learning rates to achieve

- fast convergence
- a better local optimum?

Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule

- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time...

Motivation (cont.)

idea 2: let the learning rates adapt themselves

- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning

- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and some practical comparisons

Online learning

From batch to online learning

Batch learning often assumes:

- we've got the whole dataset $\mathcal{D}$
- the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by:

- processing one datapoint / a mini-batch of datapoints each iteration
- considering gradients of the per-datapoint cost functions

$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$
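As a concrete sketch (illustrative, not from the slides), the mini-batch gradient estimate for a least-squares per-datapoint cost might look like this in NumPy; `grad_c`, the dataset shape, and the batch size are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed per-datapoint cost: c(w, x) = 0.5 * (w . features - target)^2,
# where each row x of the dataset is (features..., target).
def grad_c(w, x):
    features, target = x[:-1], x[-1]
    return (features @ w - target) * features

data = rng.normal(size=(1000, 4))   # toy dataset D: 3 features + 1 target
w = np.zeros(3)

# Unbiased mini-batch estimate (M = 32) of the full-batch gradient.
batch = data[rng.choice(len(data), size=32, replace=False)]
g = np.mean([grad_c(w, x) for x in batch], axis=0)
```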

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment:

online gradient descent (OGD):

- each iteration $t$, we receive a loss $f_t(w)$
- and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by

$g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$
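A minimal OGD loop, as a sketch under the same toy assumptions (the `loss_grads` interface is invented for illustration):

```python
import numpy as np

def ogd(loss_grads, w0, eta):
    """Online gradient descent: at step t, receive g_t from the environment
    and update w <- w - eta * g_t."""
    w = np.asarray(w0, dtype=float).copy()
    iterates = [w.copy()]
    for g_fn in loss_grads:   # g_fn maps the current w_t to a (sub-)gradient g_t
        w = w - eta * g_fn(w)
        iterates.append(w.copy())
    return iterates

# Example: quadratic losses f_t(w) = 0.5 * ||w - c_t||^2 with gradients w - c_t.
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0])] * 50
ws = ogd([lambda w, c=c: w - c for c in centers], np.zeros(2), eta=0.1)
```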


From batch to online learning (cont.)

Online learning assumes:

- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$R^\star_T = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w)$    (1)

General definition: $R_T(u) = \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(u) \right]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization
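Regret is straightforward to measure empirically; a small helper (illustrative, assuming the losses are plain Python callables):

```python
def regret(losses, iterates, u):
    """R_T(u) = sum_t [f_t(w_t) - f_t(u)] for a fixed comparator u.
    `losses` and `iterates` are aligned: iterates[t] is the w_t that
    was played before loss f_t was revealed."""
    return sum(f(w) - f(u) for f, w in zip(losses, iterates))
```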

From batch to online learning (cont.)

Follow the Leader (FTL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w)$    (2)

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,

$R_T(u) = \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(u) \right] \leq \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(w_{t+1}) \right]$    (3)

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be

$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$

FTL is easily fooled!
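You can replay this adversarial game in a few lines (a sketch; the starting point $w_1 = 0$ and tie-breaking are arbitrary choices). FTL pays a loss of 1 on every round after the first, while the fixed comparator $u = 0$ pays nothing, so its regret grows linearly in $T$:

```python
def coeff(t):
    # Each loss is linear: f_t(w) = c_t * w on S = [-1, 1].
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

w, cum, ftl_loss = 0.0, 0.0, 0.0      # start FTL at w_1 = 0
for t in range(1, 101):
    ftl_loss += coeff(t) * w          # FTL plays w_t, pays f_t(w_t)
    cum += coeff(t)                   # cumulative linear coefficient
    w = 1.0 if cum < 0 else -1.0      # argmin over [-1, 1] of cum * w

print(ftl_loss)   # ~99 after 100 rounds; u = 0 would have paid 0
```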

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w)$    (4)

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,

$\sum_{t=1}^{T} \left[ f_t(w_t) - f_t(u) \right] \leq \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(w_{t+1}) \right]$

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$\forall w, u \in S: \; f_t(u) - f_t(w) \geq \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle$ as a surrogate:

$\sum_{t=1}^{T} f_t(w_t) - f_t(u) \leq \sum_{t=1}^{T} \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t)$    (5)

From FTRL to online gradient descent:

- use the linearisation as the surrogate loss (it upper-bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} \langle w, g_i \rangle + \varphi(w)$

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$; then for all $u \in S$,

$R_T(u) \leq \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$

Keep this theorem in mind for the later discussion of adaGrad:

- we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$
- then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_\star$
- and we use Hölder's inequality in the proof

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$

Primal-dual sub-gradient update:

$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \tfrac{1}{t} \psi_t(w)$    (6)

Proximal gradient / composite mirror descent:

$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t)$    (7)

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$G_t = \sum_{i=1}^{t} g_i g_i^T, \quad H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$

$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \tfrac{1}{2t} w^T H_t w$    (8)

$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \tfrac{1}{2} (w - w_t)^T H_t (w - w_t)$    (9)

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:

$\psi_t(w) = \tfrac{1}{2} w^T H_t w$

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \tfrac{1}{2} (w - w_t)^T \left[ \delta I + \mathrm{diag}(G_t)^{1/2} \right] (w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}}$ (element-wise)    (10)
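Update (10) is a one-liner per step; a minimal sketch of diagonal adaGrad (the `grad` callback and hyper-parameter values are assumptions, not prescribed by the slides):

```python
import numpy as np

def adagrad(grad, w0, eta=0.1, delta=1e-8, steps=1000):
    """Diagonal adaGrad, update (10): w <- w - eta * g / (delta + sqrt(sum g^2)),
    with the sum of squared gradients accumulated per coordinate."""
    w = np.asarray(w0, dtype=float).copy()
    sum_sq = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        sum_sq += g * g
        w -= eta * g / (delta + np.sqrt(sum_sq))
    return w

# Example on a toy quadratic with minimiser (1, -2):
w_min = adagrad(lambda w: w - np.array([1.0, -2.0]), np.zeros(2))
```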

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:

we want the contribution of $\|g_t\|_\star^2$ induced by $\psi_t$ to be upper-bounded by $\|g_t\|_2$:

- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \, \cdot \rangle}$
- $\psi_t(w) = \tfrac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$\sum_{t=1}^{T} \|g_t\|_{\psi_t, \star}^2 \leq 2 \sum_{d=1}^{D} \|g_{1:T, d}\|_2$

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \geq \max_t \|g_t\|_\infty$; then for any $u \in S$,

$R_T(u) \leq \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T, d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T, d}\|_2$

For $\{w_t\}$ generated using the composite mirror descent update (9),

$R_T(u) \leq \frac{1}{2\eta} \max_{t \leq T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T, d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T, d}\|_2$

Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:

- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont.)

RMSprop (improving direction used: replace the truncated sum by a running average):

$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$

$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \tfrac{1}{2} (w - w_t)^T H_t (w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta g_t}{\mathrm{RMS}[g]_t}, \quad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t)$    (11)
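A sketch of update (11) in the same style as the adaGrad snippet above (the hyper-parameter defaults are common choices, not prescribed by the slides):

```python
import numpy as np

def rmsprop(grad, w0, eta=1e-2, rho=0.9, delta=1e-8, steps=1000):
    """RMSprop, update (11): adaGrad's sum of squared gradients becomes
    an exponential running average."""
    w = np.asarray(w0, dtype=float).copy()
    avg_sq = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        avg_sq = rho * avg_sq + (1 - rho) * g * g
        w -= eta * g / np.sqrt(delta + avg_sq)   # RMS[g]_t = sqrt(delta + G_t)
    return w
```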

Improving adaGrad (cont.)

A second-order motivation (this leads to adaDelta): a Newton-style step satisfies

$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}$

$\Rightarrow \quad \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}$

$\Rightarrow \quad \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

Improving adaGrad (cont.)

adaDelta (improving direction used: approximate second-order information):

$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

$w_{t+1} = \arg\min_{w} \; \langle w, g_t \rangle + \tfrac{1}{2} (w - w_t)^T H_t (w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t$    (12)
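A sketch of update (12) following Zeiler's two accumulators (placing $\delta$ inside both RMS terms and the default values are assumptions in the spirit of the paper):

```python
import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, steps=1000):
    """adaDelta, update (12): the step size RMS[dw]_{t-1} / RMS[g]_t
    removes the global learning rate eta entirely."""
    w = np.asarray(w0, dtype=float).copy()
    avg_g2 = np.zeros_like(w)   # running average of g^2
    avg_d2 = np.zeros_like(w)   # running average of (delta w)^2
    for _ in range(steps):
        g = grad(w)
        avg_g2 = rho * avg_g2 + (1 - rho) * g * g
        dw = -np.sqrt(avg_d2 + delta) / np.sqrt(avg_g2 + delta) * g
        avg_d2 = rho * avg_d2 + (1 - rho) * dw * dw
        w += dw
    return w
```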

Improving adaGrad (cont.)

ADAM (improving directions used: momentum, plus correcting the bias of the running averages):

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

$w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}$    (13)
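Update (13) as a sketch (defaults from the ADAM paper; the `grad` interface is the same assumption as in the snippets above):

```python
import numpy as np

def adam(grad, w0, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8, steps=1000):
    """ADAM, update (13): momentum m_t and running average v_t of g^2,
    both corrected for their zero-initialisation bias."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)    # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w
```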

Demo time!

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

idea 3: learn the (global) learning rates as well

$w_{t+1} = w_t - \eta g_t \quad \Rightarrow \quad w_t = w(t, \eta; \, w_1, \{g_i\}_{i \leq t-1})$

i.e., $w_t$ is a function of $\eta$, so we can learn $\eta$ (with gradient descent) by

$\eta^\star = \arg\min_{\eta} \sum_{s \leq t} f_s(w(s, \eta))$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
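As a loose illustration of the on-the-fly flavour (a crude sketch, not the algorithm of either cited paper): truncating the dependence of $w_t$ on $\eta$ to a single step gives $\partial f_t(w_t) / \partial \eta \approx -\langle g_t, g_{t-1} \rangle$, which suggests a tiny gradient step on $\eta$ itself each iteration; the update rule and constants below are assumptions:

```python
import numpy as np

def sgd_learned_eta(grad, w0, eta=0.01, alpha=1e-4, steps=1000):
    """SGD that adapts its own global learning rate: since
    w_t = w_{t-1} - eta * g_{t-1}, a one-step truncation gives
    d f_t(w_t) / d eta ~ -<g_t, g_{t-1}>, so we ascend <g_t, g_{t-1}>."""
    w = np.asarray(w0, dtype=float).copy()
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        eta = max(eta + alpha * float(g @ g_prev), 0.0)  # keep eta non-negative
        w -= eta * g
        g_prev = g
    return w, eta
```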

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour spent on tuning hyper-parameters.

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based Hyperparameter Optimization through Reversible Learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 62: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Measure-theory-free martingale theory

Let (Xn)nge0 be a stochastic process

Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)

Definition A stochastic process (Xn)infinn=0 is a martingale3 if

I E [|Xn|] ltinfin for all n

I E [Xn|Fm] = Xm for n ge m

Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and

infinsumn=1

E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0

]ltinfin

then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln

24 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 63: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies

1 C has a unique minimiser x

2 forallε gt 0 infxminusx2

2gtε〈x minus xnablaC (x)〉 gt 0

3 E[Hn(x)2

2

]le A + Bx minus x2

2 for some AB ge 0 independentof n

then subject tosum

n εn =infin andsum

n ε2n ltinfin we have sn rarr x

Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators

25 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Theorem (regret bound for online gradient descent). Assume we run online gradient descent on convex loss functions f_1, f_2, ..., f_T with regularizer $\varphi(w) = \frac{1}{2\eta} \lVert w - w_1 \rVert_2^2$; then for all u ∈ S,

$$R_T(u) \le \frac{1}{2\eta} \lVert u - w_1 \rVert_2^2 + \eta \sum_{t=1}^{T} \lVert g_t \rVert_2^2, \qquad g_t \in \partial f_t(w_t)$$

Keep in mind for the later discussion of adaGrad: if we instead ask ϕ to be strongly convex w.r.t. some (semi-)norm ‖·‖, the L2 norm on the gradients is replaced by the dual norm ‖·‖_*, and Hölder's inequality is used in the proof.
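The theorem can be checked numerically. A minimal sketch on toy convex quadratic losses (the losses, dimensions, and learning rate here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, eta = 5, 500, 0.05
zs = rng.normal(size=(T, D))       # f_t(w) = 0.5 * ||w - z_t||^2

w = np.zeros(D)                     # w_1
losses, grads = [], []
for t in range(T):
    g = w - zs[t]                   # gradient of f_t at w_t
    losses.append(0.5 * np.sum((w - zs[t]) ** 2))
    grads.append(g)
    w = w - eta * g                 # OGD step (5); S = R^D, so no projection

u = zs.mean(axis=0)                 # exact best fixed point in hindsight for these losses
regret = sum(losses) - sum(0.5 * np.sum((u - z) ** 2) for z in zs)
bound = np.sum(u ** 2) / (2 * eta) + eta * sum(np.sum(g ** 2) for g in grads)
print(regret <= bound)              # True: the regret respects the theorem's bound
```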

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms ψ_t: each ψ_t is a strongly-convex function, with Bregman divergence

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$$

Primal-dual sub-gradient update:

$$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^{t} g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w \qquad (8)$$

$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective,
- adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2} w^T H_t w$$

Example: composite mirror descent on S = R^D with ϕ(w) = 0:

$$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T \big[\delta I + \mathrm{diag}(G_t)^{1/2}\big] (w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta \, g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}} \qquad (10)$$

with the division and square root taken elementwise.
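A minimal sketch of the diagonal adaGrad update (10); the quadratic test problem and hyper-parameters are illustrative assumptions:

```python
import numpy as np

def adagrad(grad, w0, eta=0.1, delta=1e-8, T=1000):
    """Diagonal adaGrad, eq. (10): w <- w - eta * g / (delta + sqrt(cumulative g^2))."""
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)            # per-coordinate running sum of squared gradients
    for _ in range(T):
        g = grad(w)
        G += g ** 2
        w -= eta * g / (delta + np.sqrt(G))
    return w

# usage: a poorly scaled quadratic, where per-coordinate step sizes help
grad = lambda w: np.array([1.0, 100.0]) * w     # grad of 0.5*(w1^2 + 100*w2^2)
print(adagrad(grad, [1.0, 1.0]))                # both coordinates shrink toward 0
```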

WHY? Time-varying learning rate vs. time-varying proximal term.

We want the contribution of $\lVert g_t \rVert_*^2$ induced by ψ_t to be upper-bounded by plain $\lVert \cdot \rVert_2$ terms. Define $\lVert \cdot \rVert_{\psi_t} = \sqrt{\langle \cdot, H_t \, \cdot \rangle}$; then $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\lVert \cdot \rVert_{\psi_t}$, and a little math shows that

$$\sum_{t=1}^{T} \lVert g_t \rVert_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \lVert g_{1:T,d} \rVert_2$$

Theorem (Thm. 5 in the paper). Assume the sequence {w_t} is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \lVert g_t \rVert_\infty$; then for any u ∈ S,

$$R_T(u) \le \frac{\delta}{\eta} \lVert u \rVert_2^2 + \frac{1}{\eta} \lVert u \rVert_\infty^2 \sum_{d=1}^{D} \lVert g_{1:T,d} \rVert_2 + \eta \sum_{d=1}^{D} \lVert g_{1:T,d} \rVert_2$$

For {w_t} generated using the composite mirror descent update (9),

$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \lVert u - w_t \rVert_2^2 \sum_{d=1}^{D} \lVert g_{1:T,d} \rVert_2 + \eta \sum_{d=1}^{D} \lVert g_{1:T,d} \rVert_2$$

Improving adaGrad

Possible problems of adaGrad:

- the global learning rate η still needs hand tuning,
- it is sensitive to initial conditions.

Possible improving directions:

- use a truncated sum / running average,
- use (approximate) second-order information,
- add momentum,
- correct the bias of the running average.

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont.)

RMSprop replaces adaGrad's cumulative sum with a running average:

$$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta \, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$$
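A minimal RMSprop sketch following eq. (11) (the test setup and defaults are illustrative assumptions; ρ = 0.9 is the usual default):

```python
import numpy as np

def rmsprop(grad, w0, eta=0.01, rho=0.9, delta=1e-8, T=1000):
    """RMSprop, eq. (11): exponential running average of g^2 instead of a full sum."""
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)                    # running average of squared gradients
    for _ in range(T):
        g = grad(w)
        G = rho * G + (1 - rho) * g ** 2
        w -= eta * g / np.sqrt(delta + G)   # RMS[g]_t = sqrt(delta + G_t), elementwise
    return w
```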

A (diagonal) second-order heuristic motivates the next method: for a Newton-like step,

$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \;\Rightarrow\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \;\Rightarrow\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \arg\min_{w} \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t \qquad (12)$$

Note that the global learning rate η has disappeared from the update.
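A minimal adaDelta sketch implementing eq. (12) with running averages for both statistics (ρ and δ defaults are illustrative assumptions):

```python
import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, T=1000):
    """adaDelta, eq. (12): unit-correcting update, no global learning rate."""
    w = np.asarray(w0, dtype=float).copy()
    Eg2 = np.zeros_like(w)      # running average of g^2
    Edw2 = np.zeros_like(w)     # running average of (delta w)^2
    for _ in range(T):
        g = grad(w)
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g   # RMS[dw]_{t-1}/RMS[g]_t * g
        Edw2 = rho * Edw2 + (1 - rho) * dw ** 2
        w += dw
    return w
```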

ADAM:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta \, \hat m_t}{\sqrt{\hat v_t} + \delta} \qquad (13)$$

Improving directions used here: running averages, momentum (through m_t), and bias correction of the running averages.
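A minimal ADAM sketch, eq. (13), with the paper's usual defaults (the test setup is an assumption):

```python
import numpy as np

def adam(grad, w0, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8, T=1000):
    """ADAM, eq. (13): bias-corrected first and second moment running averages."""
    w = np.asarray(w0, dtype=float).copy()
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, T + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)        # bias correction of the running averages
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w
```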

Demo time

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
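A minimal driver in the spirit of the demo, assuming the optimizer sketches above are in scope; it runs each of them on the Beale function (one of the test functions on that page, global minimum at (3, 0.5)). How close each method gets depends strongly on the hyper-parameters:

```python
import numpy as np

def beale_grad(w):
    # gradient of Beale: f = (1.5-x+xy)^2 + (2.25-x+xy^2)^2 + (2.625-x+xy^3)^2
    x, y = w
    f1 = 1.5 - x + x * y
    f2 = 2.25 - x + x * y ** 2
    f3 = 2.625 - x + x * y ** 3
    dx = 2 * f1 * (y - 1) + 2 * f2 * (y ** 2 - 1) + 2 * f3 * (y ** 3 - 1)
    dy = 2 * f1 * x + 4 * f2 * x * y + 6 * f3 * x * y ** 2
    return np.array([dx, dy])

w0 = np.array([1.0, 1.0])
for opt in (adagrad, rmsprop, adadelta, adam):     # the sketches defined above
    print(opt.__name__, opt(beale_grad, w0, T=5000))
```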

Learn the learning rates

Idea 3: learn the (global) learning rates as well.

$$w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w\big(t, \eta;\, w_1, \{g_i\}_{i \le t-1}\big)$$

so w_t is a function of η; learn η (with gradient descent) by

$$\eta^\star = \arg\min_{\eta} \sum_{s \le t} f_s(w(s, \eta))$$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015];
- stochastic gradient descent to learn η on the fly [Masse and Ollivier, 2015].
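A minimal sketch of the "on the fly" flavour: since $\partial w_t / \partial \eta = -g_{t-1}$ for one unrolled step, a one-step hypergradient of $f_t(w_t)$ w.r.t. η is $-\langle g_t, g_{t-1} \rangle$. This simple rule is a stand-in in the same spirit as [Masse and Ollivier, 2015], not their exact algorithm; β and the floor on η are assumptions:

```python
import numpy as np

def hypergradient_sgd(grad, w0, eta0=0.01, beta=1e-4, T=1000):
    """SGD that also descends the one-step hypergradient of eta."""
    w = np.asarray(w0, dtype=float).copy()
    eta, g_prev = eta0, np.zeros_like(w)
    for _ in range(T):
        g = grad(w)
        eta = max(eta + beta * np.dot(g, g_prev), 1e-6)  # eta <- eta - beta * (-<g_t, g_{t-1}>)
        w -= eta * g
        g_prev = g
    return w, eta

print(hypergradient_sgd(lambda w: w, np.array([5.0])))   # grad of 0.5*w^2; w -> 0, eta adapts
```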

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.

References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5, RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.


Stochastic Gradient Descent

Proposition. If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$ with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and C satisfies

1. C has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$: $\inf_{\lVert x - x^\star \rVert_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$;
3. $\mathbb{E}\big[ \lVert H_n(x) \rVert_2^2 \big] \le A + B \lVert x - x^\star \rVert_2^2$ for some A, B ≥ 0 independent of n;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: the proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement give the control we need over the unbiased gradient estimators.
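A minimal simulation of the proposition's setting (the cost $C(s) = \frac{1}{2}\lVert s \rVert_2^2$, the noise scale, and the step sequence are illustrative assumptions; $\varepsilon_n = 1/n$ satisfies the Robbins-Monro conditions):

```python
import numpy as np

rng = np.random.default_rng(1)

def H(s):
    # unbiased estimator of grad C(s) for C(s) = 0.5 * ||s||^2, so x* = 0
    return s + rng.normal(scale=0.5, size=s.shape)

s = np.array([5.0, -3.0])
for n in range(1, 100001):
    eps = 1.0 / n            # sum eps_n = inf, sum eps_n^2 < inf
    s = s - eps * H(s)
print(np.linalg.norm(s))     # close to 0: s_n -> x*
```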

Proof outline:

1. Define the Lyapunov process $h_n = \lVert s_n - x^\star \rVert_2^2$.
2. Consider the variations $h_{n+1} - h_n$.
3. Show that $h_n$ converges almost surely.
4. Show that $h_n$ must converge to 0 almost surely.
5. Conclude that $s_n \to x^\star$.

Step 2: consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^\star, H_n(s_n) \rangle + \varepsilon_n^2 \lVert H_n(s_n) \rVert_2^2$$

so

$$\mathbb{E}\big[ h_{n+1} - h_n \,\big|\, \mathcal{F}_n \big] = -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \, \mathbb{E}\big[ \lVert H_n(s_n) \rVert_2^2 \,\big|\, \mathcal{F}_n \big]$$

Step 3: show that $h_n$ converges almost surely (exactly the same as for discrete gradient descent).

- Recall $\mathbb{E}\big[ \lVert H_n(x) \rVert_2^2 \big] \le A + B \lVert x - x^\star \rVert_2^2$.
- Introduce $\mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.

We get

$$\mathbb{E}\big[ h'_{n+1} - h'_n \,\big|\, \mathcal{F}_n \big] \le \varepsilon_n^2 \mu_n A \;\Longrightarrow\; \mathbb{E}\Big[ (h'_{n+1} - h'_n) \, \mathbb{1}\big\{\mathbb{E}[h'_{n+1} - h'_n \,|\, \mathcal{F}_n] > 0\big\} \,\Big|\, \mathcal{F}_n \Big] \le \varepsilon_n^2 \mu_n A$$

Since $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, quasimartingale convergence applies, so $(h'_n)_{n=1}^{\infty}$ converges a.s., and hence $(h_n)_{n=1}^{\infty}$ converges a.s.

Step 4: show that $h_n$ must converge to 0 almost surely. From the previous calculations,

$$\mathbb{E}\big[ h_{n+1} - (1 - \varepsilon_n^2 B) h_n \,\big|\, \mathcal{F}_n \big] = -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A$$

$(h_n)_{n=1}^{\infty}$ converges a.s., so the left-hand side is summable a.s.; since $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, the term $\varepsilon_n^2 A$ is summable a.s., and therefore the remaining term is also summable a.s.:

$$\sum_{n=1}^{\infty} \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle < \infty \quad \text{almost surely.}$$

If we assume in addition that $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, this forces $\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0$ almost surely, and by our initial assumptions about C, $(h_n)_{n=1}^{\infty}$ must converge to 0 almost surely. Step 5 follows: $s_n \to x^\star$ almost surely.
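The five steps can also be watched numerically. A sketch reusing the toy setting above, tracking the Lyapunov process $h_n$ and the weighted inner-product sum from step 4 (the gentler step sequence $\varepsilon_n = 1/(n+10)$ still satisfies the Robbins-Monro conditions):

```python
import numpy as np

rng = np.random.default_rng(2)
s, x_star = np.array([5.0, -3.0]), np.zeros(2)
h, inner_sum = [], 0.0
for n in range(1, 20001):
    eps = 1.0 / (n + 10)
    g = s + rng.normal(scale=0.5, size=2)        # unbiased grad of C(s) = 0.5 * ||s||^2
    inner_sum += eps * np.dot(s - x_star, s)     # partial sum of eps_n <s_n - x*, grad C(s_n)>
    s = s - eps * g
    h.append(np.sum((s - x_star) ** 2))          # Lyapunov process h_n
print(h[0], h[-1], inner_sum)                    # h_n decays toward 0; the sum stays finite
```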


Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- C is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$;
- the gradient estimate H satisfies $\mathbb{E}\big[ \lVert H_n(x) \rVert^k \big] \le A_k + B_k \lVert x \rVert^k$ for k = 2, 3, 4;
- there exists D > 0 such that $\inf_{\lVert x \rVert_2 > D} \langle x, \nabla C(x) \rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^{\infty}$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 65: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 66: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Consider the positive variations h+n = max(0 hn+1 minus hn)

By exactly the same calculation as in the deterministic discrete case

hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2

2

So

E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2

2

∣∣Fn

]26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Improving adaGrad (cont.)

RMSprop:

G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}

w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

\Rightarrow \quad w_{t+1} = w_t - \frac{\eta \, g_t}{RMS[g]_t}, \qquad RMS[g]_t = \mathrm{diag}(H_t).   (11)

(Improving direction used here: an exponential running average in place of adaGrad's full sum.)
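A minimal NumPy sketch of the resulting update (11) (our own illustration; rho, eta and delta are assumed hyper-parameter names):

    import numpy as np

    def rmsprop(grad_fn, w0, eta=0.01, rho=0.9, delta=1e-8, n_steps=100):
        # exponential running average of squared gradients, cf. update (11)
        w = w0.astype(float).copy()
        G = np.zeros_like(w)
        for t in range(n_steps):
            g = grad_fn(w)
            G = rho * G + (1 - rho) * g ** 2
            w -= eta * g / np.sqrt(delta + G)
        return w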

Improving adaGrad (cont.)

Heuristic behind adaDelta (approximate second-order information): a Newton-like step satisfies

\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}
\quad \Rightarrow \quad \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}
\quad \Rightarrow \quad \mathrm{diag}(H) \approx \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}.

Improving adaGrad (cont.)

adaDelta:

\mathrm{diag}(H_t) = \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}

w_{t+1} = \arg\min_{w} \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

\Rightarrow \quad w_{t+1} = w_t - \frac{RMS[\Delta w]_{t-1}}{RMS[g]_t} \, g_t.   (12)

(Improving directions used here: running averages plus the second-order heuristic above; note there is no global learning rate left to tune.)
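A minimal NumPy sketch of adaDelta's update (12) (our own illustration; rho and delta are assumed names, following the usual smoothing-constant convention):

    import numpy as np

    def adadelta(grad_fn, w0, rho=0.95, delta=1e-6, n_steps=100):
        # no global learning rate: the step size is the ratio of the two
        # running RMS estimates, as in update (12)
        w = w0.astype(float).copy()
        Eg2 = np.zeros_like(w)    # running average of g^2
        Edw2 = np.zeros_like(w)   # running average of (Delta w)^2
        for t in range(n_steps):
            g = grad_fn(w)
            Eg2 = rho * Eg2 + (1 - rho) * g ** 2
            dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
            Edw2 = rho * Edw2 + (1 - rho) * dw ** 2
            w += dw
        return w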

Improving adaGrad (cont.)

ADAM:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}.   (13)

(Improving directions used here: running averages, momentum, and bias correction of the running averages.)
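A minimal NumPy sketch of ADAM's update (13) (our own illustration; the default constants follow commonly used values, which is an assumption on our part):

    import numpy as np

    def adam(grad_fn, w0, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8,
             n_steps=100):
        # bias-corrected running averages of the gradient (momentum term)
        # and of the squared gradient, as in update (13)
        w = w0.astype(float).copy()
        m = np.zeros_like(w)
        v = np.zeros_like(w)
        for t in range(1, n_steps + 1):
            g = grad_fn(w)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g ** 2
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
            w -= eta * m_hat / (np.sqrt(v_hat) + delta)
        return w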

Demo time

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

Idea 3: learn the (global) learning rate as well.

w_{t+1} = w_t - \eta g_t \quad \Rightarrow \quad w_t = w(t, \eta; w_1, \{g_i\}_{i \le t-1}),

i.e. w_t is a function of \eta, so we can learn \eta (with gradient descent) by

\eta = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta)).

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn \eta on the fly [Masse and Ollivier, 2015]

A toy sketch of the idea follows below.
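As a toy sketch (our own illustration, not the algorithm of either paper): unroll a few SGD steps on a known quadratic, treat the final loss as a function of \eta, and descend on \eta with a finite-difference derivative.

    import numpy as np

    def final_loss(eta, w0, n_steps=20):
        # unrolled SGD on C(w) = 0.5 * ||w||^2 (gradient: w);
        # the result is a deterministic function of eta
        w = w0.copy()
        for _ in range(n_steps):
            w = w - eta * w
        return 0.5 * np.sum(w ** 2)

    w0 = np.array([5.0, -3.0])
    eta, lr, h = 0.05, 1e-3, 1e-4
    for _ in range(100):
        d = (final_loss(eta + h, w0) - final_loss(eta - h, w0)) / (2 * h)
        eta -= lr * d
    print(eta)   # eta moves toward the value minimising the unrolled loss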

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour spent tuning hyper-parameters.

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.


Stochastic Gradient Descent

Proof outline:
1. Define the Lyapunov process h_n = \|s_n - x\|_2^2.
2. Consider the variations h_{n+1} - h_n.
3. Show that h_n converges almost surely.
4. Show that h_n must converge to 0 almost surely.
5. Conclude that s_n \to x.

Consider the positive variations h_n^+ = \max(0, h_{n+1} - h_n). By exactly the same calculation as in the deterministic discrete case,

h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2.

Taking conditional expectations, and using E[H_n(s_n) \mid \mathcal{F}_n] = \nabla C(s_n),

E[h_{n+1} - h_n \mid \mathcal{F}_n] = -2 \varepsilon_n \langle s_n - x, \nabla C(s_n) \rangle + \varepsilon_n^2 E[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n].

Step 3: show that h_n converges almost surely. Exactly the same as for discrete gradient descent:

- assume E[\|H_n(s)\|_2^2] \le A + B \|s - x\|_2^2;
- introduce \mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B} and h'_n = \mu_n h_n.

We get

E[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A
\quad \Rightarrow \quad E[(h'_{n+1} - h'_n) \, 1_{\{E[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A.

Since \sum_{n=1}^{\infty} \varepsilon_n^2 < \infty, the quasimartingale convergence theorem applies, so (h'_n)_{n=1}^{\infty} converges a.s., and hence (h_n)_{n=1}^{\infty} converges a.s.

Step 4: show that h_n must converge to 0 almost surely. From the previous calculations,

E[h_{n+1} - (1 - \varepsilon_n^2 B) h_n \mid \mathcal{F}_n] = -2 \varepsilon_n \langle s_n - x, \nabla C(s_n) \rangle + \varepsilon_n^2 A.

(h_n)_{n=1}^{\infty} converges, so the left-hand side is summable a.s. We assume \sum_{n=1}^{\infty} \varepsilon_n^2 < \infty, so the right term is summable a.s., and hence the inner-product term is also summable a.s.:

\sum_{n=1}^{\infty} \varepsilon_n \langle s_n - x, \nabla C(s_n) \rangle < \infty \quad \text{almost surely}.

If we assume in addition that \sum_{n=1}^{\infty} \varepsilon_n = \infty, this forces

\langle s_n - x, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely},

and by our initial assumptions about C, (h_n)_{n=1}^{\infty} must converge to 0 almost surely.
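The conclusion is easy to check empirically. A minimal simulation (our own illustration, assuming a one-dimensional quadratic cost with Gaussian gradient noise and Robbins-Monro rates \varepsilon_n = 1/n):

    import numpy as np

    rng = np.random.default_rng(0)
    s, x = 5.0, 0.0                   # start point and minimiser of C
    for n in range(1, 100001):
        eps = 1.0 / n                 # sum eps = inf, sum eps^2 < inf
        H = (s - x) + rng.normal()    # noisy gradient of C(s) = 0.5 * (s - x)^2
        s -= eps * H
    print(abs(s - x) ** 2)            # h_n is close to 0: s_n has converged to x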


Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- C is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: \sum_{n=1}^{\infty} \varepsilon_n = \infty, \sum_{n=1}^{\infty} \varepsilon_n^2 < \infty;
- the gradient estimate H satisfies E[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k for k = 2, 3, 4;
- there exists D > 0 such that \inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0.

The idea for the proof is then:
1. For a given start position, the sequence (s_n)_{n=1}^{\infty} is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that \nabla C(s_n) necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.
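This weaker guarantee is also easy to see in simulation (our own toy example, not from the slides): SGD on the non-convex cost C(x) = x^4/4 - x^2/2 ends up near one of its critical points.

    import numpy as np

    rng = np.random.default_rng(1)
    grad_C = lambda x: x ** 3 - x          # C(x) = x^4/4 - x^2/2
    for start in [2.0, -2.0, 0.1]:
        x = start
        for n in range(1, 50001):
            eps = 0.5 / n                  # Robbins-Monro rates
            x -= eps * (grad_C(x) + rng.normal())
        print(start, "->", round(x, 3))    # settles near x = +1 or x = -1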

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 68: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

Show that hn converges almost surelyExactly the same as for discrete gradient descent

I Assume E[Hn(x)2

2

]le A + Bx minus x2

2I Introduce micron =

prodni=1

11+ε2

i Band hprimen = micronhn

Get

E[hprimen+1 minus hprimen

∣∣Fn

]le ε2

nmicronA

=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0

∣∣∣Fn

]le ε2

nmicronAsuminfinn=1 ε

2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1

converges as =rArr (hn)infinn=1 converges as26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 69: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x

Show that hn must converge to 0 almost surelyFrom previous calculations

E[hn+1 minus (1minus ε2

nB)hn∣∣Fn

]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2

nA

(hn)infinn=1 converges so sequence is summable as Assumesuminfin

n=1 ε2n ltinfin so

right term is summable as so left term side is also summable as infinsumn=1

εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely

If we assume in addition thatsuminfin

n=1 εn =infin this forces

〈sn minus xnablaC (sn)〉 rarr 0almost surely

And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost

surely 26 of 27

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 70: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Stochastic Gradient Descent

1 Define Lyapunov process hn = sn minus x22

2 Consider the variations hn+1 minus hn

3 Show that hn converges almost surely

4 Show that hn must converge to 0 almost surely

5 sn rarr x

26 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27


Motivation

So far we've told you why SGD is "safe" :)

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergence
better local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1: search for the best learning rate schedule
grid search / random search / cross validation
Bayesian optimization

But I'm lazy and I don't want to spend too much time...
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
adaGrad [Duchi et al., 2010]
adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
ADAM [Kingma and Ba, 2014]
Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning

a good tutorial [Shalev-Shwartz, 2012]
milestone paper [Zinkevich, 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

We've got the whole dataset D
the cost function $C(w, D) = \mathbb{E}_{x \sim D}[c(w, x)]$

SGD / mini-batch learning: accelerate training in real time by

processing one / a mini-batch of data points each iteration
considering gradients with data-point cost functions

$\nabla C(w, D) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
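
To make the approximation concrete, a minimal mini-batch gradient sketch follows; the least-squares cost $c(w, x) = \frac{1}{2}(x^T w - y)^2$ and all names are illustrative assumptions, not part of the slides:

```python
import numpy as np

def minibatch_grad(w, X, y, M=32, rng=np.random.default_rng(0)):
    """(1/M) * sum_m grad c(w, x_m) over a random mini-batch of M data points."""
    idx = rng.choice(len(X), size=M, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / M    # average gradient of 0.5 * (x^T w - y)^2
```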

Online learning

From batch to online learning (cont)

Let's forget the cost function on the batch for a moment:

online gradient descent (OGD)

each iteration t we receive a loss $f_t(w)$
and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
then we update the weights with $w \leftarrow w - \eta g_t$

for SGD the received gradient is defined by

$g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss $f_t(w)$
(in the batch learning context, $f_t(w) = c(w, x_t)$)
then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$R_T^* = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w)$   (1)

General definition: $R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

$w_{t+1} = \operatorname{argmin}_{w \in S} \sum_{i=1}^{t} f_i(w)$   (2)

Lemma (upper bound of regret)

Let $w_1, w_2, \ldots$ be the sequence produced by FTL, then $\forall u \in S$:

$R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})]$   (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let $w \in S = [-1, 1]$ and the loss function at time t be

$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
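
To see how easily, a small simulation of this game (a sketch under the slide's setup; $w_1 = 0$ and the tie-breaking rule are assumptions) shows FTL's cumulative loss growing like $T$ while the fixed point $u = 0$ suffers zero loss, so the regret grows linearly:

```python
# FTL on f_t(w) = a_t * w over S = [-1, 1]: the argmin of a linear function
# on an interval sits at an endpoint, and the sign of the cumulative
# coefficient flips every round, so FTL pays ~1 per round from t = 2 on.
def coef(t):
    return -0.5 if t == 1 else (1.0 if t % 2 == 0 else -1.0)

T, cum, w, ftl_loss = 1000, 0.0, 0.0, 0.0    # w_1 = 0 (any start works)
for t in range(1, T + 1):
    ftl_loss += coef(t) * w                  # loss suffered this round
    cum += coef(t)
    w = -1.0 if cum > 0 else 1.0             # argmin over [-1, 1] of cum * w
print(ftl_loss)                              # approximately T - 1, vs 0 for u = 0
```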

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

$w_{t+1} = \operatorname{argmin}_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w)$   (4)

Lemma (upper bound of regret)

Let $w_1, w_2, \ldots$ be the sequence produced by FTRL, then $\forall u \in S$:

$\sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})]$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Let's assume the losses $f_t$ are convex functions:

$\forall w, u \in S: \; f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle \quad \forall g(w) \in \partial f_t(w)$

use the linearisation $f_t(w_t) + \langle w, g(w) \rangle$ as a surrogate:

$\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t \rangle - \langle u, g_t \rangle \quad \forall g_t \in \partial f_t(w_t)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27


Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$, then for all $u \in S$:

$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$

keep in mind for the later discussion of adaGrad:

I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$
then the L2 norm will be changed to the dual norm $\|\cdot\|_*$
and we use Hölder's inequality for the proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
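
A standard consequence worth noting here (an added aside, not on the slide): if $\|g_t\|_2 \le G$ for all $t$ and $D = \|u - w_1\|_2$, the bound reads $\frac{D^2}{2\eta} + \eta G^2 T$, and minimising over $\eta$ gives

$\eta = \frac{D}{G\sqrt{2T}} \;\Rightarrow\; R_T(u) \le D G \sqrt{2T}$

so a well-tuned constant step size already achieves $O(\sqrt{T})$ regret, i.e. vanishing average regret $R_T(u)/T \to 0$. The adaptive methods below aim for comparable guarantees without knowing $G$, $D$ or $T$ in advance.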

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms $\psi_t$:

$\psi_t$ is a strongly-convex function, and

$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$

Primal-dual sub-gradient update:

$w_{t+1} = \operatorname{argmin}_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w)$   (6)

Proximal gradient / composite mirror descent:

$w_{t+1} = \operatorname{argmin}_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t)$   (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

$G_t = \sum_{i=1}^{t} g_i g_i^T, \qquad H_t = \delta I + \operatorname{diag}(G_t)^{1/2}$

$w_{t+1} = \operatorname{argmin}_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w$   (8)

$w_{t+1} = \operatorname{argmin}_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$   (9)

To get adaGrad from online gradient descent:
add a proximal term or a Bregman divergence to the objective
adapt the proximal term through time

$\psi_t(w) = \frac{1}{2} w^T H_t w$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$w_{t+1} = \operatorname{argmin}_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T [\delta I + \operatorname{diag}(G_t)^{1/2}] (w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta \, g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}}$ (element-wise)   (10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
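
A minimal code sketch of update (10), assuming $S = \mathbb{R}^D$, $\varphi(w) = 0$ and a hand-tuned global $\eta$; `grad`, `w0` and the constants are illustrative stand-ins:

```python
import numpy as np

def adagrad(grad, w0, eta=0.1, delta=1e-8, steps=1000):
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)                       # running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        G += g * g                             # diag(G_t), accumulated per coordinate
        w -= eta * g / (delta + np.sqrt(G))    # update (10), element-wise
    return w
```

For instance, `adagrad(lambda w: 2 * w, [5.0, -3.0])` drives a toy quadratic towards 0, with each coordinate receiving its own effective step size.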

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$

define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \, \cdot \rangle}$

$\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$

a little math can show that

$\sum_{t=1}^{T} \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$, then for any $u \in S$:

$R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$

For $w_t$ generated using the composite mirror descent update (9):

$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate $\eta$ still needs hand tuning
sensitive to initial conditions

possible improving directions

use truncated sum / running average
use (approximate) second-order information
add momentum
correct the bias of the running average

existing methods

RMSprop [Tieleman and Hinton, 2012]
adaDelta [Zeiler, 2012]
ADAM [Kingma and Ba, 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \qquad H_t = (\delta I + \operatorname{diag}(G_t))^{1/2}$

$w_{t+1} = \operatorname{argmin}_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta \, g_t}{\mathrm{RMS}[g]_t}, \quad \mathrm{RMS}[g]_t = \operatorname{diag}(H_t)$   (11)

possible improving directions

use truncated sum / running average
use (approximate) second-order information
add momentum
correct the bias of the running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
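
Swapping adaGrad's accumulated sum for a running average gives a minimal RMSprop sketch of (11); the values of $\rho$, $\delta$, $\eta$ here are the usual hand-set constants, not prescribed by the slide:

```python
import numpy as np

def rmsprop(grad, w0, eta=1e-3, rho=0.9, delta=1e-8, steps=1000):
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)                     # running average of squared gradients
    for _ in range(steps):
        g = grad(w)
        G = rho * G + (1.0 - rho) * g * g
        w -= eta * g / np.sqrt(delta + G)    # RMS[g]_t = sqrt(delta + G_t)
    return w
```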

Advanced adaptive learning rates

Improving adaGrad (cont)

$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}$

$\Rightarrow \quad \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}$

$\Rightarrow \quad \operatorname{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

possible improving directions

use truncated sum / running average
use (approximate) second-order information
add momentum
correct the bias of the running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

$\operatorname{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

$w_{t+1} = \operatorname{argmin}_{w} \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t$   (12)

possible improving directions

use truncated sum / running average
use (approximate) second-order information
add momentum
correct the bias of the running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
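
A minimal adaDelta sketch of (12); note there is no global learning rate, only the decay $\rho$ and the constant $\delta$ (values here are illustrative):

```python
import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, steps=1000):
    w = np.asarray(w0, dtype=float).copy()
    Eg2 = np.zeros_like(w)     # running average of g^2
    Edw2 = np.zeros_like(w)    # running average of (Delta w)^2
    for _ in range(steps):
        g = grad(w)
        Eg2 = rho * Eg2 + (1.0 - rho) * g * g
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g   # RMS ratio step
        Edw2 = rho * Edw2 + (1.0 - rho) * dw * dw
        w += dw
    return w
```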

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

$w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}$   (13)

possible improving directions

use truncated sum / running average
use (approximate) second-order information
add momentum
correct the bias of the running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
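
A minimal ADAM sketch of (13), with the bias-corrected moments $\hat{m}_t$, $\hat{v}_t$; the constants are the defaults suggested by Kingma and Ba [2014]:

```python
import numpy as np

def adam(grad, w0, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8, steps=1000):
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)                      # first-moment running average
    v = np.zeros_like(w)                      # second-moment running average
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)        # bias correction of the averages
        v_hat = v / (1.0 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w
```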

Advanced adaptive learning rates

Demo time

wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

$w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1})$

$w_t$ is a function of $\eta$

learn $\eta$ (with gradient descent) by

$\eta = \operatorname{argmin}_{\eta} \sum_{s \le t} f_s(w(s, \eta))$

Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]

stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
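
A toy sketch of idea 3 (illustrative only; this is neither paper's algorithm): treat the loss after training as a function of the global $\eta$ and descend on $\eta$ itself, here with a finite-difference estimate of $\mathrm{d}\,\mathrm{loss}/\mathrm{d}\eta$ on an assumed noise-free quadratic:

```python
def loss_after_training(eta, T=20, w0=5.0):
    w = w0
    for _ in range(T):
        w -= eta * 2.0 * w     # gradient descent on f(w) = w^2
    return w ** 2              # crude stand-in for sum_{s <= t} f_s(w(s, eta))

eta, h, beta = 0.05, 1e-4, 1e-3        # initial eta, FD step, step size on eta
for _ in range(200):
    d = (loss_after_training(eta + h) - loss_after_training(eta - h)) / (2 * h)
    eta -= beta * d                    # gradient descent on eta itself
print(eta)                             # eta drifts towards faster-converging values
```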

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochastic approximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provide good performance in practice

Future work will reduce the labour spent tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

References

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based Hyperparameter Optimization through Reversible Learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 71: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 72: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume

I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold

suminfinn=1 εn =infinsuminfin

n=1 ε2n ltinfin

I Gradient estimate H satisfies E[Hn(x)k

]le Ak + Bkxk for

k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0

Idea for proof is then

1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely

2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence

3 Prove that nablaC (sn) necessarily converges almost surely

Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum

27 of 27

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 73: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Motivation

So far wersquove told you why SGD is ldquosaferdquo )

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergencebetter local optimum

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 74: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Motivation (cont)

Figure from [Bergstra and Bengio 2012]

idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization

But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27

Motivation (cont)

idea 2: let the learning rates adapt themselves

- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning

- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and some practical comparisons

Online learning

From batch to online learning

Batch learning often assumes

- we've got the whole dataset $\mathcal{D}$
- the cost function $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training by

- processing one datapoint / a mini-batch of datapoints each iteration
- approximating the batch gradient with the per-datapoint gradients:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$$
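To make the estimator concrete, here is a minimal NumPy sketch of the mini-batch gradient inside a plain SGD loop; the least-squares cost `c` and all names are illustrative stand-ins, not code from the talk:

```python
import numpy as np

def minibatch_grad(grad_c, w, batch):
    """Estimate the batch gradient by averaging per-datapoint gradients."""
    return np.mean([grad_c(w, x) for x in batch], axis=0)

# illustrative cost: c(w, (feat, y)) = 0.5 * (w . feat - y)^2
def grad_c(w, x):
    feat, y = x
    return (w @ feat - y) * feat

rng = np.random.default_rng(0)
data = [(rng.normal(size=3), rng.normal()) for _ in range(1000)]
w = np.zeros(3)
for t in range(200):                                   # plain SGD loop
    idx = rng.choice(len(data), size=32, replace=False)
    w -= 0.1 * minibatch_grad(grad_c, w, [data[i] for i in idx])
```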

From batch to online learning (cont)

Let's forget the cost function on the batch for a moment.

online gradient descent (OGD):

- each iteration t we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

for SGD the received gradient is defined by

$$g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$$
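As a minimal sketch of this protocol (the absolute-value losses are a hypothetical stream, chosen only so the subgradient is one line), OGD is just one gradient step per incoming loss:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, w = 0.1, 0.0
for t in range(1, 101):
    x_t = rng.normal()          # a new datapoint arrives ...
    g_t = np.sign(w - x_t)      # ... defining f_t(w) = |w - x_t|, g_t in df_t(w)
    w = w - eta * g_t           # OGD update
print(w)                        # drifts toward the median of the stream (here ~0)
```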

From batch to online learning (cont)

Online learning assumes:

- each iteration t we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$$R_T^* = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w) \qquad (1)$$

General definition: $R_T(u) = \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(u) \right]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization

From batch to online learning (cont)

Follow the Leader (FTL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) \qquad (2)$$

Lemma (upper bound of regret). Let $w_1, w_2, \dots$ be the sequence produced by FTL; then $\forall u \in S$:

$$R_T(u) = \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(u) \right] \le \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(w_{t+1}) \right] \qquad (3)$$

From batch to online learning (cont)

A game that fools FTL: let $w \in S = [-1, 1]$, and let the loss function at time t be

$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled: the sign of the cumulative loss flips every round, so FTL keeps jumping between the two endpoints and pays a loss of 1 per round, i.e. linear regret.
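A quick simulation of this game (a sketch; ties never occur here, so the FTL argmin is unique) shows the regret against the fixed comparator $u = 0$, which incurs zero loss, growing linearly in T:

```python
def ftl_game(T=20):
    """Play the adversarial game above; f_t(w) = a_t * w on S = [-1, 1]."""
    coeffs, w, regret = [], 0.0, 0.0       # w_1 = 0; u = 0 has zero total loss
    for t in range(1, T + 1):
        a = -0.5 if t == 1 else (1.0 if t % 2 == 0 else -1.0)
        regret += a * w                    # loss suffered at round t
        coeffs.append(a)
        w = -1.0 if sum(coeffs) > 0 else 1.0   # FTL: minimise (sum a_i) * w
    return regret

print(ftl_game(20))   # 19.0 -- regret grows like T - 1
```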

From batch to online learning (cont)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret). Let $w_1, w_2, \dots$ be the sequence produced by FTRL; then $\forall u \in S$:

$$\sum_{t=1}^{T} \left[ f_t(w_t) - f_t(u) \right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} \left[ f_t(w_t) - f_t(w_{t+1}) \right]$$

From batch to online learning (cont)

Let's assume the losses $f_t$ are convex:

$$\forall w, u \in S: \; f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle \quad \forall g(w) \in \partial f_t(w)$$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g_t \rangle$ as a surrogate; then

$$\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t \rangle - \langle u, g_t \rangle \quad \forall g_t \in \partial f_t(w_t)$$

From batch to online learning (cont)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:

- use the linearisation as surrogate loss (it upper-bounds the LHS of the regret)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} ||w - w_1||_2^2$
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} \langle w, g_i \rangle + \varphi(w)$$
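In the unconstrained case $S = \mathbb{R}^D$ this equivalence is easy to check numerically: setting the gradient of the FTRL objective to zero gives $w_{t+1} = w_1 - \eta \sum_{i \le t} g_i$, which is exactly what unrolling the OGD recursion produces. A small sketch (synthetic gradients, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
eta, D, T = 0.1, 3, 50
w1 = rng.normal(size=D)
gs = [rng.normal(size=D) for _ in range(T)]

w = w1.copy()
for g in gs:                      # unrolled OGD, eq. (5)
    w = w - eta * g

w_ftrl = w1 - eta * sum(gs)       # FTRL closed form on the linearised losses
print(np.allclose(w, w_ftrl))     # True
```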

From batch to online learning (cont)

Theorem (regret bound for online gradient descent). Assume we run online gradient descent (5) on convex loss functions $f_1, f_2, \dots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} ||w - w_1||_2^2$; then for all $u \in S$:

$$R_T(u) \le \frac{1}{2\eta} ||u - w_1||_2^2 + \eta \sum_{t=1}^{T} ||g_t||_2^2, \quad g_t \in \partial f_t(w_t)$$

Keep in mind for the later discussion of adaGrad:

- I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $|| \cdot ||$
- then the L2 norm on the gradients is replaced by the dual norm $|| \cdot ||_*$
- and we use Hölder's inequality in the proof

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: each $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$$

Primal-dual sub-gradient update:

$$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent update:

$$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$

adaGrad [Duchi et al., 2010] (cont)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^{t} g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w \qquad (8)$$

$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2} w^T H_t w$$

adaGrad [Duchi et al., 2010] (cont)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow \; w_{t+1} = w_t - \frac{\eta \, g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}} \qquad (10)$$

(the division is elementwise, per dimension)
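A minimal sketch of the diagonal update (10); names are illustrative, and `delta` plays the role of $\delta$:

```python
import numpy as np

def adagrad_step(w, g, G, eta=0.1, delta=1e-8):
    """Diagonal adaGrad, eq. (10); all operations are elementwise."""
    G = G + g * g                             # running sum of squared gradients
    return w - eta * g / (delta + np.sqrt(G)), G

# usage: minimise f(w) = 0.5 ||w||^2, so g_t = w_t
w, G = np.array([1.0, -2.0]), np.zeros(2)
for t in range(100):
    w, G = adagrad_step(w, w, G)
```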

adaGrad [Duchi et al., 2010] (cont)

WHY? Time-varying learning rate vs. time-varying proximal term:

- I want the contribution of the $||g_t||_*^2$ terms induced by $\psi_t$ to be upper-bounded in terms of $||g_t||_2$
- define $|| \cdot ||_{\psi_t} = \sqrt{\langle \cdot\,, H_t \,\cdot \rangle}$
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $|| \cdot ||_{\psi_t}$
- a little math can show that

$$\sum_{t=1}^{T} ||g_t||_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} ||g_{1:T,d}||_2$$

adaGrad [Duchi et al., 2010] (cont)

Theorem (Thm. 5 in the paper). Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t ||g_t||_\infty$; then for any $u \in S$:

$$R_T(u) \le \frac{\delta}{\eta} ||u||_2^2 + \frac{1}{\eta} ||u||_\infty^2 \sum_{d=1}^{D} ||g_{1:T,d}||_2 + \eta \sum_{d=1}^{D} ||g_{1:T,d}||_2$$

For $\{w_t\}$ generated using the composite mirror descent update (9):

$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} ||u - w_t||_2^2 \sum_{d=1}^{D} ||g_{1:T,d}||_2 + \eta \sum_{d=1}^{D} ||g_{1:T,d}||_2$$

Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:

- use a truncated sum / running average instead of the full sum
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont)

RMSprop (improving direction used: running average of squared gradients):

$$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow \; w_{t+1} = w_t - \frac{\eta \, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$$
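The same shape as the adaGrad sketch above, with the full sum swapped for an exponential running average; a sketch of eq. (11):

```python
import numpy as np

def rmsprop_step(w, g, G, eta=1e-3, rho=0.9, delta=1e-8):
    """RMSprop, eq. (11): exponential running average of squared gradients."""
    G = rho * G + (1 - rho) * g * g
    return w - eta * g / np.sqrt(delta + G), G
    # usage mirrors the adaGrad sketch: w, G = rmsprop_step(w, g, G)
```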

Improving adaGrad (cont)

Approximate second-order information from the update itself (the adaDelta heuristic, assuming a diagonal Hessian $H$):

$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \;\Rightarrow\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \;\Rightarrow\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

Improving adaGrad (cont)

adaDelta (improving directions used: running averages and approximate second-order information; note there is no global $\eta$):

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \arg\min_{w} \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow \; w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t \qquad (12)$$
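A sketch of eq. (12) following [Zeiler, 2012]: two running averages, one of squared gradients and one of squared updates; the `delta` inside both RMS terms keeps the very first step finite:

```python
import numpy as np

def adadelta_step(w, g, Eg2, Edw2, rho=0.95, delta=1e-6):
    """adaDelta, eq. (12): RMS[dw]_{t-1} / RMS[g]_t scaling, no global eta."""
    Eg2 = rho * Eg2 + (1 - rho) * g * g
    dw = -(np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta)) * g
    Edw2 = rho * Edw2 + (1 - rho) * dw * dw
    return w + dw, Eg2, Edw2
```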

Improving adaGrad (cont)

ADAM (improving directions used: running averages, momentum, and bias correction):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$\Rightarrow \; w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13)$$
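A sketch of eq. (13); `t` must start at 1 for the bias correction to be well defined:

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
    """ADAM, eq. (13): momentum + running average + bias correction."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)            # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)            # bias-corrected second moment
    return w - eta * m_hat / (np.sqrt(v_hat) + delta), m, v
```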

Demo time

See the Wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
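The original demo code is not part of these slides; below is a self-contained sketch of the kind of comparison shown, on the Rosenbrock function from that page. All step sizes are illustrative and would need tuning in practice:

```python
import numpy as np

def grad_rosenbrock(w):
    """Gradient of f(x, y) = (1 - x)^2 + 100 (y - x^2)^2; minimum at (1, 1)."""
    x, y = w
    return np.array([-2.0 * (1.0 - x) - 400.0 * x * (y - x * x),
                     200.0 * (y - x * x)])

w0 = np.array([-1.5, 2.0])
w_gd, w_ada, w_adam = w0.copy(), w0.copy(), w0.copy()
G = np.zeros(2)                      # adaGrad accumulator
m, v = np.zeros(2), np.zeros(2)      # ADAM moments

for t in range(1, 5001):
    w_gd -= 5e-4 * grad_rosenbrock(w_gd)            # plain gradient descent
    g = grad_rosenbrock(w_ada)                      # adaGrad, eq. (10)
    G += g * g
    w_ada -= 0.1 * g / (1e-8 + np.sqrt(G))
    g = grad_rosenbrock(w_adam)                     # ADAM, eq. (13)
    m = 0.9 * m + 0.1 * g
    v = 0.999 * v + 0.001 * g * g
    w_adam -= 1e-2 * (m / (1 - 0.9 ** t)) / (np.sqrt(v / (1 - 0.999 ** t)) + 1e-8)

print("GD:", w_gd, "adaGrad:", w_ada, "ADAM:", w_adam)
```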

Learn the learning rates

idea 3: learn the (global) learning rates as well

$$w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta; w_1, \{g_i\}_{i \le t-1})$$

- $w_t$ is a function of $\eta$
- learn $\eta$ (with gradient descent) by

$$\eta^* = \arg\min_{\eta} \sum_{s \le t} f_s(w(s, \eta))$$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
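A toy sketch of the "learn $\eta$ on the fly" idea. This is a generic hypergradient-style update, not the exact algorithm of either cited paper: differentiating the current loss through only the last update $w_t = w_{t-1} - \eta g_{t-1}$ gives $\partial f_t(w_t) / \partial \eta = -\langle g_t, g_{t-1} \rangle$, so we can take a gradient step on $\eta$ as well:

```python
import numpy as np

def hypergrad_sgd(grad_f, w, eta=0.01, beta=1e-4, steps=200):
    """SGD that also adapts its global learning rate eta.

    Uses d f_t(w_t) / d eta = -<g_t, g_{t-1}> (differentiating through
    the last update only), then descends on eta with meta-rate beta.
    """
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad_f(w)
        eta = eta + beta * (g @ g_prev)   # hypergradient step on eta
        w = w - eta * g                   # usual SGD step
        g_prev = g
    return w, eta

# usage: quadratic f(w) = 0.5 ||w||^2, so grad_f(w) = w
w, eta = hypergrad_sgd(lambda w: w, np.array([5.0, -3.0]))
```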

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.

References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 75: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Motivation (cont)

idea 2 let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)

now wersquoll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 76: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Online learning

From batch to online learning

Batch learning often assumes

Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]

SGDmini-batch learning accelerate training in real time by

processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions

nablaC (wD) asymp 1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 77: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Online learning

From batch to online learning (cont)

Letrsquos forget the cost function on the batch for a moment

online gradient descent (OGD)

each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt

for SGD the received gradient is defined by

gt =1

M

Msumm=1

nablac(w xm)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 78: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Online learning

From batch to online learning (cont)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Improving adaGrad (cont)

Motivating adaDelta (taking the second-order-information direction): a Newton-like step satisfies

    ∆w ∝ H⁻¹g ∝ (∂f/∂w) / (∂²f/∂w²)

    ⇒ ∂²f/∂w² ∝ (∂f/∂w) / ∆w

    ⇒ diag(H) ≈ RMS[g]t / RMS[∆w]t−1

Improving adaGrad (cont)

adaDelta:

    diag(Ht) = RMS[g]t / RMS[∆w]t−1

    wt+1 = argmin_w ⟨w, gt⟩ + (1/2)(w − wt)ᵀHt(w − wt)

    ⇒ wt+1 = wt − (RMS[∆w]t−1 / RMS[g]t) gt    (12)

Note that no global learning rate η is left to tune.
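
A minimal numpy sketch following (12) and Zeiler's accumulation scheme (the smoothing constants ρ, δ are illustrative):

    import numpy as np

    def adadelta(grad, w0, rho=0.95, delta=1e-6, num_steps=100):
        """adaDelta, eq. (12): step = -(RMS[dw]_{t-1} / RMS[g]_t) * g."""
        w = w0.astype(float).copy()
        Eg2 = np.zeros_like(w)    # EMA of g^2
        Edw2 = np.zeros_like(w)   # EMA of squared updates
        for _ in range(num_steps):
            g = grad(w)
            Eg2 = rho * Eg2 + (1.0 - rho) * g ** 2
            dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
            Edw2 = rho * Edw2 + (1.0 - rho) * dw ** 2
            w += dw
        return w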

Improving adaGrad (cont)

ADAM (adding momentum and correcting the bias of the running averages):

    mt = β₁ mt−1 + (1 − β₁) gt,    vt = β₂ vt−1 + (1 − β₂) gt²

    m̂t = mt / (1 − β₁^t),    v̂t = vt / (1 − β₂^t)

    wt+1 = wt − η m̂t / (√v̂t + δ)    (13)
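
A minimal numpy sketch of (13); the β₁, β₂, η, δ values below are the illustrative defaults from the ADAM paper, not values fixed by these slides:

    import numpy as np

    def adam(grad, w0, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8,
             num_steps=100):
        """ADAM, eq. (13): bias-corrected first and second moment estimates."""
        w = w0.astype(float).copy()
        m = np.zeros_like(w)   # EMA of g
        v = np.zeros_like(w)   # EMA of g^2
        for t in range(1, num_steps + 1):
            g = grad(w)
            m = beta1 * m + (1.0 - beta1) * g
            v = beta2 * v + (1.0 - beta2) * g ** 2
            m_hat = m / (1.0 - beta1 ** t)   # bias correction
            v_hat = v / (1.0 - beta2 ** t)
            w -= eta * m_hat / (np.sqrt(v_hat) + delta)
        return w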

Demo time

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

idea 3: learn the (global) learning rates as well

    wt+1 = wt − η gt  ⇒  wt = w(t, η, w1, {gi}_{i≤t−1})

wt is a function of η, so learn η (with gradient descent) by

    η = argmin_η Σ_{s≤t} fs(w(s, η))

- reverse-mode differentiation to back-track the learning process [Maclaurin et al 2015]
- stochastic gradient descent to learn η on the fly [Masse and Ollivier 2015]
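
As a toy illustration of treating wt as a function of η (a constructed example, not the reverse-mode method of Maclaurin et al.): replay the descent at two nearby values of η, finite-difference the summed loss, and descend on η itself.

    import numpy as np

    def replay(eta, w0, grad, t):
        """w(t, eta): replay t steps of w <- w - eta * g."""
        w = w0.copy()
        for _ in range(t):
            w = w - eta * grad(w)
        return w

    def hypergrad_fd(eta, w0, grad, loss, t, eps=1e-5):
        """Finite-difference estimate of d/d(eta) of sum_{s<=t} f_s(w(s, eta))."""
        F = lambda e: sum(loss(replay(e, w0, grad, s)) for s in range(1, t + 1))
        return (F(eta + eps) - F(eta - eps)) / (2.0 * eps)

    # toy problem: f_s(w) = 0.5 * ||w||^2 for every s
    loss = lambda w: 0.5 * float(w @ w)
    grad = lambda w: w
    eta = 0.1
    for _ in range(20):   # gradient descent on eta itself
        eta -= 0.01 * hypergrad_fd(eta, np.array([3.0, -2.0]), grad, loss, t=10)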

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods
- Theoretical analysis guarantees good behaviour
- Adaptive learning rates provide good performance in practice
- Future work will reduce the labour spent tuning hyper-parameters

References

Bergstra J., Bengio Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker S., LeCun Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi J., Hazan E., Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman T., Hinton G. Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma D., Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin Y. N., de Vries H., Chung J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.
Shalev-Shwartz S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin D., Duvenaud D., Adams R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse P. Y., Ollivier Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 79: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Online learning

From batch to online learning (cont)

Online learning assumes

each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules

Performance is measured by regret

RlowastT =Tsumt=1

ft(wt)minus infwisinS

Tsumt=1

ft(w) (1)

General definition RT (u) =sumT

t=1 [ft(wt)minus ft(u)]

SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 80: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Online learning

From batch to online learning (cont)

Follow the Leader (FTL)

wt+1 = argminwisinS

tsumi=1

fi(w) (2)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTL then forallu isin S

RT (u) =Tsumt=1

[ft(wt)minus ft(u)] leTsumt=1

[ft(wt)minus ft(wt+1)] (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 81: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Online learning

From batch to online learning (cont)

A game that fools FTL

Let w isin S = [minus1 1] and the loss function at time t

ft(w) =

minus05w t = 1

w t is even

minusw t gt 1 and t is odd

FTL is easily fooled

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 82: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL)

wt+1 = argminwisinS

tsumi=1

fi(w) + ϕ(w) (4)

Lemma (upper bound of regret)

Let w1 w2 be the sequence produced by FTRL then forallu isin S

Tsumt=1

[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1

[ft(wt)minus ft(wt+1)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 83: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Online learning

From batch to online learning (cont)

Letrsquos assume the loss ft are convex functions

forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)

use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate

Tsumt=1

ft(wt)minus ft(u) leTsumt=1

〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

From FTRL to online gradient descent

use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1

2η ||w minusw1||22apply FTRL regret bound to the surrogate loss

wt+1 = argminwisinS

tsumi=1

〈w gi〉+ ϕ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 84: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):

    w_{t+1} = w_t − η g_t,   g_t ∈ ∂f_t(w_t)   (5)

From FTRL to online gradient descent:
- use the linearisation ⟨w, g_t⟩ as a surrogate loss (by convexity it upper-bounds the left-hand side);
- use the L2 regularizer ϕ(w) = (1/2η) ||w − w_1||₂²;
- apply the FTRL regret bound to the surrogate loss:

    w_{t+1} = argmin_{w∈S} Σ_{i=1}^t ⟨w, g_i⟩ + ϕ(w)

Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f_1, f_2, ..., f_T with regularizer ϕ(w) = (1/2η) ||w − w_1||₂². Then for all u ∈ S,

    R_T(u) ≤ (1/2η) ||u − w_1||₂² + η Σ_{t=1}^T ||g_t||₂²,   g_t ∈ ∂f_t(w_t).

Keep in mind for the later discussion of adaGrad: if ϕ is instead strongly convex w.r.t. a (semi-)norm || · ||, the L2 norm of the gradients above is replaced by the dual norm || · ||∗, and Hölder's inequality is used in the proof.
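For concreteness, a minimal Python sketch of the OGD loop (an illustration added here, not code from the talk; the per-round subgradient oracles grad_fns are assumed placeholders, and the projection onto S is omitted):

    import numpy as np

    def online_gradient_descent(grad_fns, w1, eta):
        # OGD: w_{t+1} = w_t - eta * g_t, with g_t a subgradient of f_t at w_t.
        w = np.asarray(w1, dtype=float)
        iterates = [w.copy()]
        for grad in grad_fns:
            g = grad(w)              # per-round (sub)gradient oracle
            w = w - eta * g          # unconstrained step; project onto S if needed
            iterates.append(w.copy())
        return iterates

    # toy usage: quadratic losses f_t(w) = 0.5 * ||w - x_t||^2, so grad = w - x_t
    xs = [np.array([1.0, -1.0]), np.array([0.5, 0.0]), np.array([1.5, -0.5])]
    ws = online_gradient_descent([lambda w, x=x: w - x for x in xs],
                                 w1=np.zeros(2), eta=0.5)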

Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms ψ_t: ψ_t is a strongly-convex function, and

    B_{ψ_t}(w, w_t) = ψ_t(w) − ψ_t(w_t) − ⟨∇ψ_t(w_t), w − w_t⟩

Primal-dual sub-gradient update:

    w_{t+1} = argmin_w  η⟨w, g_t⟩ + ηϕ(w) + (1/t) ψ_t(w)   (6)

Proximal gradient / composite mirror descent update:

    w_{t+1} = argmin_w  η⟨w, g_t⟩ + ηϕ(w) + B_{ψ_t}(w, w_t)   (7)
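To see how the quadratic proximal term used below turns (7) into a simple quadratic problem: with ψ_t(w) = ½ wᵀ H_t w we have ∇ψ_t(w_t) = H_t w_t, so

    B_{ψ_t}(w, w_t) = ½ wᵀ H_t w − ½ w_tᵀ H_t w_t − ⟨H_t w_t, w − w_t⟩ = ½ (w − w_t)ᵀ H_t (w − w_t),

which is exactly the proximal term appearing in update (9) on the next slide.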

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

    G_t = Σ_{i=1}^t g_i g_iᵀ,   H_t = δI + diag(G_t)^{1/2}

Primal-dual sub-gradient update:

    w_{t+1} = argmin_{w∈S}  η⟨w, g_t⟩ + ηϕ(w) + (1/2t) wᵀ H_t w   (8)

Composite mirror descent update:

    w_{t+1} = argmin_{w∈S}  η⟨w, g_t⟩ + ηϕ(w) + ½ (w − w_t)ᵀ H_t (w − w_t)   (9)

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective;
- adapt the proximal term through time: ψ_t(w) = ½ wᵀ H_t w.

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on S = R^D with ϕ(w) = 0:

    w_{t+1} = argmin_w  η⟨w, g_t⟩ + ½ (w − w_t)ᵀ [δI + diag(G_t)^{1/2}] (w − w_t)

    ⇒ w_{t+1} = w_t − η g_t / (δ + √(Σ_{i=1}^t g_i²))   (10)

with the square, square root and division taken per coordinate.
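A minimal sketch of update (10), under the same illustrative conventions as the OGD sketch (grad_fns is an assumed placeholder; the default values of η and δ are chosen only for illustration):

    import numpy as np

    def adagrad(grad_fns, w1, eta=0.1, delta=1e-8):
        # Diagonal adaGrad, composite mirror descent with phi(w) = 0, i.e. eq. (10).
        w = np.asarray(w1, dtype=float)
        sum_sq = np.zeros_like(w)    # diag(G_t): running sum of squared gradients
        for grad in grad_fns:
            g = grad(w)
            sum_sq += g ** 2
            w = w - eta * g / (delta + np.sqrt(sum_sq))  # per-coordinate step sizes
        return w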

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:
- we want the contribution of ||g_t||∗² (the dual norm induced by ψ_t) to be upper-bounded in terms of ||g_t||₂;
- define || · ||_{ψ_t} = √⟨·, H_t ·⟩; then ψ_t(w) = ½ wᵀ H_t w is strongly convex w.r.t. || · ||_{ψ_t};
- a little algebra shows that

    Σ_{t=1}^T ||g_t||²_{ψ_t,∗} ≤ 2 Σ_{d=1}^D ||g_{1:T,d}||₂,

where g_{1:T,d} denotes the vector of d-th coordinates of g_1, ..., g_T.

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper)
Assume the sequence {w_t} is generated by adaGrad using the primal-dual update (8) with δ ≥ max_t ||g_t||∞. Then for any u ∈ S,

    R_T(u) ≤ (δ/η) ||u||₂² + (1/η) ||u||∞² Σ_{d=1}^D ||g_{1:T,d}||₂ + η Σ_{d=1}^D ||g_{1:T,d}||₂.

For {w_t} generated using the composite mirror descent update (9),

    R_T(u) ≤ (1/2η) max_{t≤T} ||u − w_t||₂² Σ_{d=1}^D ||g_{1:T,d}||₂ + η Σ_{d=1}^D ||g_{1:T,d}||₂.

Improving adaGrad

Possible problems of adaGrad:
- the global learning rate η still needs hand-tuning;
- sensitive to initial conditions.

Possible improving directions:
- use a truncated sum / running average instead of the full sum;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Improving adaGrad (cont.)

RMSprop (direction: replace the full sum by an exponential running average):

    G_t = ρ G_{t−1} + (1 − ρ) g_t g_tᵀ,   H_t = (δI + diag(G_t))^{1/2}

    w_{t+1} = argmin_w  η⟨w, g_t⟩ + ½ (w − w_t)ᵀ H_t (w − w_t)

    ⇒ w_{t+1} = w_t − η g_t / RMS[g]_t,   RMS[g]_t = diag(H_t)   (11)
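A sketch of the resulting per-coordinate update (same illustrative conventions as before; the defaults for η, ρ and δ are assumptions, not values from the slide):

    import numpy as np

    def rmsprop(grad_fns, w1, eta=1e-3, rho=0.9, delta=1e-8):
        # RMSprop: adaGrad's sum of squared gradients becomes a running average.
        w = np.asarray(w1, dtype=float)
        avg_sq = np.zeros_like(w)    # diag(G_t) = rho * diag(G_{t-1}) + (1-rho) * g_t^2
        for grad in grad_fns:
            g = grad(w)
            avg_sq = rho * avg_sq + (1 - rho) * g ** 2
            w = w - eta * g / np.sqrt(delta + avg_sq)  # divide by RMS[g]_t, eq. (11)
        return w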

Improving adaGrad (cont.)

Direction: use approximate second-order information. A Newton step scales the gradient by the inverse Hessian, so in one dimension

    Δw ∝ H⁻¹ g ∝ (∂f/∂w) / (∂²f/∂w²)

    ⇒ ∂²f/∂w² ∝ (∂f/∂w) / Δw

    ⇒ diag(H) ≈ RMS[g]_t / RMS[Δw]_{t−1}

Improving adaGrad (cont.)

adaDelta (directions: running average + approximate second-order information):

    diag(H_t) = RMS[g]_t / RMS[Δw]_{t−1}

    w_{t+1} = argmin_w  ⟨w, g_t⟩ + ½ (w − w_t)ᵀ H_t (w − w_t)

    ⇒ w_{t+1} = w_t − (RMS[Δw]_{t−1} / RMS[g]_t) g_t   (12)
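A sketch of update (12); ρ and δ are illustrative defaults, and note there is no global learning rate η left to tune:

    import numpy as np

    def adadelta(grad_fns, w1, rho=0.95, delta=1e-6):
        # adaDelta, eq. (12): step = RMS[dw]_{t-1} / RMS[g]_t * g_t.
        w = np.asarray(w1, dtype=float)
        avg_g2 = np.zeros_like(w)    # running average of g_t^2
        avg_d2 = np.zeros_like(w)    # running average of (Delta w)^2
        for grad in grad_fns:
            g = grad(w)
            avg_g2 = rho * avg_g2 + (1 - rho) * g ** 2
            step = -np.sqrt(avg_d2 + delta) / np.sqrt(avg_g2 + delta) * g
            avg_d2 = rho * avg_d2 + (1 - rho) * step ** 2
            w = w + step
        return w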

Improving adaGrad (cont.)

ADAM (directions: running average + momentum + bias correction):

    m_t = β₁ m_{t−1} + (1 − β₁) g_t,   v_t = β₂ v_{t−1} + (1 − β₂) g_t²

    m̂_t = m_t / (1 − β₁ᵗ),   v̂_t = v_t / (1 − β₂ᵗ)

    w_{t+1} = w_t − η m̂_t / (√(v̂_t) + δ)   (13)
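A sketch of update (13) under the same conventions; the defaults β₁ = 0.9 and β₂ = 0.999 follow the cited paper, the rest are illustrative:

    import numpy as np

    def adam(grad_fns, w1, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
        # ADAM, eq. (13): momentum on g_t, running average of g_t^2, bias correction.
        w = np.asarray(w1, dtype=float)
        m = np.zeros_like(w)
        v = np.zeros_like(w)
        for t, grad in enumerate(grad_fns, start=1):
            g = grad(w)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g ** 2
            m_hat = m / (1 - beta1 ** t)   # corrects the zero-initialisation bias
            v_hat = v / (1 - beta2 ** t)
            w = w - eta * m_hat / (np.sqrt(v_hat) + delta)
        return w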

Demo time

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Learn the learning rates

Idea 3: learn the (global) learning rate as well.

    w_{t+1} = w_t − η g_t   ⇒   w_t = w(t, η, w_1, {g_i}_{i≤t−1}),

i.e. w_t is a function of η. Learn η (with gradient descent) by

    η = argmin_η Σ_{s≤t} f_s(w(s, η))

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn η on the fly [Masse and Ollivier, 2015]
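A toy sketch in this spirit (an assumption-laden illustration, not the algorithm of either cited paper: it differentiates only through the most recent update, rather than back-tracking the whole trajectory, so dw_t/dη ≈ −g_{t−1} and df_t(w_t)/dη ≈ −⟨g_t, g_{t−1}⟩):

    import numpy as np

    def sgd_learned_eta(grad_fns, w1, eta0=0.01, beta=1e-4):
        # Adapt the global learning rate eta by gradient descent on the fly,
        # keeping only the dependence of w_t on eta through the last update.
        w = np.asarray(w1, dtype=float)
        eta = eta0
        g_prev = np.zeros_like(w)
        for grad in grad_fns:
            g = grad(w)
            eta = eta + beta * float(np.dot(g, g_prev))  # grow eta if gradients align
            w = w - eta * g
            g_prev = g
        return w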

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour spent on tuning hyper-parameters.

References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P.-Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 85: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD)

wt+1 = wt minus ηgt gt isin partft(wt) (5)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 86: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1

2η||w minusw1||22 then for all u isin S

RT (u) le 1

2η||uminusw1||22 + η

Tsumt=1

||gt ||22 gt isin partft(wt)

keep in mind for later discussion of adaGrad

I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 87: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms ψt

ψt is a strongly-convex function and

Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉

Primal-dual sub-gradient update

wt+1 = argminw

η〈w gt〉+ ηϕ(w) +1

tψt(w) (6)

Proximal gradientcomposite mirror descent

wt+1 = argminw

η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 88: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices

Gt =tsum

i=1

gtgTt Ht = δI + diag(G

12t )

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2twTHtw (8)

wt+1 = argminwisinS

η〈w gt〉+ ηϕ(w) +1

2(w minuswt)

THt(w minuswt)

(9)

To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time

ψt(w) =1

2wTHtw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 89: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

example composite mirror descent on S = RD with ϕ(w) = 0

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

T [δI + diag(Gt)](w minuswt)

rArr wt+1 = wt minusηgt

δ +radicsumt

i=1 g2i

(10)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 90: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

WHY

time-varying learning rate vs time-varying proximal term

I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2

define || middot ||ψt =radic〈middotHt middot〉

ψt(w) = 12w

THtw is strongly convex wrt || middot ||ψt

a little math can show thatTsumt=1

||gt ||2ψtlowast le 2Dsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3 learn the (global) learning rates as well

wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)

wt is a function of η

learn η (with gradient descent) by

η = argminsumslet

fs(w(s η))

Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]

stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27

Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochasticapproximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provides good performance in practice

Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27

Advanced adaptive learning rates

References

Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27

Advanced adaptive learning rates

References

Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27

Advanced adaptive learning rates

Reference

Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27

  1. fdrm0
  2. fdrm1
  3. fdrm2
  4. fdrm3
Page 91: Stochastic Approximation Theory - Yingzhen Liyingzhenli.net/home/pdf/SA.pdf · Plan I History and modern formulation of stochastic approximation theory I In-depth look at stochastic

Advanced adaptive learning rates

adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S

RT (u) le δ

η||u||22 +

1

η||u||2infin

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

For wt generated using composite mirror descent update (9)

RT (u) le 1

2ηmaxtleT||uminuswt ||22

Dsumd=1

||g1T d ||2 + ηDsum

d=1

||g1T d ||2

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27

Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad

global learning rate η still need hand tuningsensitive to initial conditions

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

existing methods

RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27

Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop

Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12

wt+1 = argminw

η〈w gt〉+1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusηgt

RMS[g]t RMS[g]t = diag(Ht) (11)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27

Advanced adaptive learning rates

Improving adaGrad (cont)

∆w prop Hminus1g prop partf partw

part2f partw2

rArr part2f partw2 prop partf partw

∆w

rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta

diag(Ht) =RMS[g]t

RMS[∆w]tminus1

wt+1 = argminw〈w gt〉+

1

2(w minuswt)

THt(w minuswt)

rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t

gt (12)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27

Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM

mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t

mt =mt

1minus βt1

vt =vt

1minus βt2

wt+1 = wt minusηmt

vt + δ (13)

possible improving directions

use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27

Advanced adaptive learning rates

Demo time

wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27

Advanced adaptive learning rates

Learn the learning rates

idea 3: learn the (global) learning rate as well

\[ w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1}), \]

so w_t is a function of \(\eta\); learn \(\eta\) (with gradient descent) by

\[ \eta^* = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta)). \]

A toy sketch of this objective follows the list below.

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn \(\eta\) on the fly [Masse and Ollivier, 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 / 27
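The sketch below treats the end loss of a short SGD run as a function of \(\eta\) and adjusts \(\eta\) by a finite-difference gradient estimate. This only illustrates the objective above; it is not the method of either cited paper (Maclaurin et al. differentiate through the run exactly with reverse-mode AD), and the toy objective, step sizes, and loop lengths are all assumptions.

```python
import numpy as np

def loss(w):                       # toy objective standing in for f_s
    return 0.5 * np.sum(w ** 2)

def grad(w):
    return w

def final_loss(eta, w1, n_steps=20):
    """Run SGD from w1 with learning rate eta and return the end loss.
    This realises w(t, eta, w1, {g_i}) as a function of eta."""
    w = w1.copy()
    for _ in range(n_steps):
        w -= eta * grad(w)
    return loss(w)

# outer loop: finite-difference gradient of the end loss w.r.t. eta
eta, meta_lr, eps = 0.05, 0.01, 1e-4
w1 = np.array([1.0, -2.0])
for _ in range(50):
    d = (final_loss(eta + eps, w1) - final_loss(eta - eps, w1)) / (2 * eps)
    eta -= meta_lr * d             # gradient descent on eta itself
print(eta)
```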

Advanced adaptive learning rates

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work aims to reduce the labour of tuning hyper-parameters.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 / 27

Advanced adaptive learning rates

References

- Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 / 27

Advanced adaptive learning rates

References

- Tieleman, T., Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 / 27

Advanced adaptive learning rates

References

- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 / 27
