Stochastic Approximation Theory
Yingzhen Li and Mark Rowland
November 26, 2015
Plan

- History and modern formulation of stochastic approximation theory
- In-depth look at stochastic gradient descent (SGD)
- Introduction to key ideas in stochastic approximation theory, such as Lyapunov functions, quasimartingales, and numerical solutions to differential equations
Bibliography

- Stochastic Approximation and Recursive Algorithms and Applications, Kushner & Yin (2003)
- Online Learning and Stochastic Approximation, Léon Bottou (1998)
- A Stochastic Approximation Method, Robbins & Monro (1951)
- Stochastic Estimation of the Maximum of a Regression Function, Kiefer & Wolfowitz (1952)
- Introduction to Stochastic Search and Optimization, Spall (2003)
A brief history of numerical analysis

Numerical analysis is a well-established discipline.

c. 1800-1600 BC: Babylonians attempted to calculate $\sqrt{2}$, or in modern terms, find the roots of $x^2 - 2 = 0$.¹

(Image: Yale Babylonian Collection.)

¹ http://www.math.ubc.ca/~cass/Euclid/ybc/ybc.html
A brief history of numerical analysis

C17-C19: Huge number of numerical techniques developed as tools for the natural sciences:

- Root-finding methods (e.g. Newton-Raphson)
- Numerical integration (e.g. Gaussian quadrature)
- ODE solution (e.g. Euler method)
- Interpolation (e.g. Lagrange polynomials)

Focus is on situations where function evaluation is deterministic.
A brief history of numerical analysis

1951: Robbins and Monro publish "A Stochastic Approximation Method", describing how to find the root of an increasing function $f: \mathbb{R} \to \mathbb{R}$ when only noisy estimates of the function's value at a given point are available.
A brief history of numerical analysis

They provide an iterative scheme for root estimation in this context and prove convergence of the resulting estimate in $L^2$ and in probability.

Note also that if we treat $f$ as the gradient of some function $F$, the Robbins-Monro algorithm can be viewed as a minimisation procedure.

The Robbins-Monro paper establishes stochastic approximation as an area of numerical analysis in its own right.
A brief history of numerical analysis

1952: Motivated by the Robbins-Monro paper the year before, Kiefer and Wolfowitz publish "Stochastic Estimation of the Maximum of a Regression Function".

Their algorithm is phrased as a maximisation procedure, in contrast to the Robbins-Monro paper, and uses central-difference approximations of the gradient to update the optimum estimator.
A brief history of numerical analysis

Present day: Stochastic approximation widely researched and used in practice.

- Original applications in root-finding and optimisation, often in the guise of stochastic gradient descent:
  - Neural networks
  - K-Means
  - ...
- Related ideas are used in:
  - Stochastic variational inference (Hoffman et al. (2013))
  - Pseudo-marginal Metropolis-Hastings (Beaumont (2003), Andrieu & Roberts (2009))
  - Stochastic Gradient Langevin Dynamics (Welling & Teh (2011))

Textbooks:

- Stochastic Approximation and Recursive Algorithms and Applications, Kushner & Yin (2003)
- Introduction to Stochastic Search and Optimization, Spall (2003)
Modern-day formulation

Stochastic approximation is now a mature area of numerical analysis, and the general problem it seeks to solve has the following form:

Minimise $f(w) = \mathbb{E}[F(w, \xi)]$

(over $w$ in some domain $\mathcal{W}$, and some random variable $\xi$).

This is an immensely flexible framework:

- $F(w, \xi) = f(w) + \xi$ models experimental/measurement error.
- $F(w, \xi) = (w^\top \phi(\xi_X) - \xi_Y)^2$ corresponds to (least squares) linear regression.
- If $f(w) = \frac{1}{N} \sum_{i=1}^N g(w, x_i)$ and $F(w, \xi) = \frac{1}{K} \sum_{x \in \xi} g(w, x)$, with $\xi$ a randomly-selected subset (of size $K$) of a large data set, this corresponds to "stochastic" machine learning algorithms (see the sketch below).
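To make the last bullet concrete, here is a minimal numpy sketch (the per-datum cost `g`, the data, and all constants are illustrative assumptions, not from the slides): the minibatch average $F(w, \xi)$ is an unbiased estimate of the full objective $f(w)$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # large data set (hypothetical)
w = rng.normal(size=5)

def g(w, x):
    # per-datum cost g(w, x); a quadratic chosen purely for illustration
    return 0.5 * (x @ w) ** 2

f_full = np.mean([g(w, x) for x in X])   # f(w) = (1/N) sum_i g(w, x_i)

K = 32
idx = rng.choice(len(X), size=K, replace=False)    # xi: random subset of size K
F_minibatch = np.mean([g(w, x) for x in X[idx]])   # F(w, xi); E[F(w, xi)] = f(w)

print(f_full, F_minibatch)
```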
Stochastic gradient descent

We'll focus on proof techniques for stochastic gradient descent (SGD).

We'll derive conditions for SGD to converge almost surely. We broadly follow the structure of Léon Bottou's paper².

Plan:

- Continuous gradient descent
- Discrete gradient descent
- Stochastic gradient descent

² Online Learning and Stochastic Approximations, Léon Bottou (1998)
Continuous Gradient Descent

Let $C: \mathbb{R}^k \to \mathbb{R}$ be differentiable. How do solutions $s: \mathbb{R} \to \mathbb{R}^k$ to the following ODE behave?

$$\frac{d}{dt} s(t) = -\nabla C(s(t))$$

Example: $C(x) = 5(x_1 + x_2)^2 + (x_1 - x_2)^2$.

We can analytically solve the gradient descent equation for this example:

$$\nabla C(x) = (12 x_1 + 8 x_2, \; 8 x_1 + 12 x_2)$$

So $(s_1', s_2') = -(12 s_1 + 8 s_2, \; 8 s_1 + 12 s_2)$, and solving with $(s_1(0), s_2(0)) = (1.5, 0.5)$ gives

$$\begin{pmatrix} s_1(t) \\ s_2(t) \end{pmatrix} = \begin{pmatrix} e^{-20t} + 0.5\, e^{-4t} \\ e^{-20t} - 0.5\, e^{-4t} \end{pmatrix}$$

(Video: exact solution of gradient descent ODE.)
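As a sanity check on the example above, a small numpy sketch (step size and horizon are arbitrary illustrative choices) can integrate the ODE numerically and compare against the analytic solution:

```python
import numpy as np

grad_C = lambda s: np.array([12 * s[0] + 8 * s[1], 8 * s[0] + 12 * s[1]])
exact = lambda t: np.array([np.exp(-20 * t) + 0.5 * np.exp(-4 * t),
                            np.exp(-20 * t) - 0.5 * np.exp(-4 * t)])

s, dt = np.array([1.5, 0.5]), 1e-4
for n in range(int(1.0 / dt)):        # integrate ds/dt = -grad C(s) up to t = 1
    s = s - dt * grad_C(s)

print(s, exact(1.0))                  # the two should agree closely
```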
Continuous Gradient Descent

We'll start with some simplifying assumptions:

1. $C$ has a unique minimiser $x^\star$.
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$.

(The second condition is weaker than convexity.)

Proposition: If $s: [0, \infty) \to \mathbb{R}^k$ satisfies the differential equation

$$\frac{ds(t)}{dt} = -\nabla C(s(t))$$

then $s(t) \to x^\star$ as $t \to \infty$.

(Figure: example of a non-convex function satisfying these conditions.)
Continuous Gradient Descent

Proof sketch:

1. Define Lyapunov function $h(t) = \|s(t) - x^\star\|_2^2$.
2. $h(t)$ is a decreasing function.
3. $h(t)$ converges to a limit, and $h'(t)$ converges to 0.
4. The limit of $h(t)$ must be 0.
5. $s(t) \to x^\star$.

Step 2: $h(t)$ is a decreasing function:

$$\frac{d}{dt} h(t) = \frac{d}{dt} \|s(t) - x^\star\|_2^2 = \frac{d}{dt} \langle s(t) - x^\star, s(t) - x^\star \rangle = 2 \left\langle \frac{ds(t)}{dt},\, s(t) - x^\star \right\rangle = -2 \langle s(t) - x^\star, \nabla C(s(t)) \rangle \le 0$$

Step 4: The limit of $h(t)$ must be 0. If not, there is some $K > 0$ such that $h(t) > K$ for all $t$. By assumption,

$$\inf_{\|x - x^\star\|_2^2 > K} \langle x - x^\star, \nabla C(x) \rangle > 0$$

which means that $h'(t)$ is bounded above by a negative constant, contradicting the convergence of $h'(t)$ to 0.
Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the proposition.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!
Discrete Gradient Descent

An intuitive, straightforward discretisation of

$$\frac{d}{dt} s(t) = -\nabla C(s(t))$$

is

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n) \qquad \text{(forward/explicit Euler)}$$

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method which is (A-)stable is the implicit Euler method:

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})$$

So why settle for explicit discretisation? Implicit requires the solution of a non-linear system at each step, and in practice explicit discretisation works.
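For the quadratic example from earlier, both discretisations have closed-form steps, which makes the stability contrast visible in a few lines. A sketch, with the step size chosen deliberately large to trigger explicit Euler's instability:

```python
import numpy as np

A = np.array([[12.0, 8.0], [8.0, 12.0]])   # grad C(s) = A s for the earlier example
I = np.eye(2)
eps = 0.11                                  # eps * 20 > 2: explicit Euler diverges

s_exp = s_imp = np.array([1.5, 0.5])
for n in range(100):
    s_exp = s_exp - eps * (A @ s_exp)              # explicit: s_{n+1} = (I - eps A) s_n
    s_imp = np.linalg.solve(I + eps * A, s_imp)    # implicit: (I + eps A) s_{n+1} = s_n

print(np.linalg.norm(s_exp), np.linalg.norm(s_imp))  # explicit blows up, implicit -> 0
```

For a general non-quadratic $C$, the implicit step would need an iterative non-linear solve each iteration, which is exactly why explicit discretisation is preferred in practice.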
Discrete Gradient Descent

Also have multistep methods, e.g.

$$s_{n+2} = s_{n+1} - \varepsilon_{n+2} \left( \frac{3}{2} \nabla C(s_{n+1}) - \frac{1}{2} \nabla C(s_n) \right)$$

This is the 2nd-order Adams-Bashforth method.

Similar (in)stability properties to forward Euler, but discretisation error of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).
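A sketch of the two-step scheme on the same quadratic (it needs one starter step, taken here with forward Euler; step size and iteration count are illustrative):

```python
import numpy as np

grad_C = lambda s: np.array([12 * s[0] + 8 * s[1], 8 * s[0] + 12 * s[1]])
eps = 0.01

s_prev = np.array([1.5, 0.5])
s_curr = s_prev - eps * grad_C(s_prev)   # one forward-Euler step to start
for n in range(1000):
    # Adams-Bashforth 2: s_{n+2} = s_{n+1} - eps (3/2 grad C(s_{n+1}) - 1/2 grad C(s_n))
    s_next = s_curr - eps * (1.5 * grad_C(s_curr) - 0.5 * grad_C(s_prev))
    s_prev, s_curr = s_curr, s_next

print(s_curr)   # close to the minimiser (0, 0)
```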
Discrete Gradient Descent

(Videos: examples of discretisation with $\varepsilon = 0.1$, $\varepsilon = 0.07$, and $\varepsilon = 0.01$.)
Discrete Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$, and $C$ satisfies:

1. $C$ has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$;
3. $\|\nabla C(x)\|_2^2 \le A + B \|x - x^\star\|_2^2$ for some $A, B \ge 0$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
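The step-size conditions are satisfied by, e.g., $\varepsilon_n = c/n$. A minimal sketch on the earlier quadratic (the constant $c = 1$ is an arbitrary illustrative choice):

```python
import numpy as np

grad_C = lambda s: np.array([12 * s[0] + 8 * s[1], 8 * s[0] + 12 * s[1]])

s = np.array([1.5, 0.5])
for n in range(1, 10001):
    eps_n = 1.0 / n            # sum eps_n diverges, sum eps_n^2 converges
    s = s - eps_n * grad_C(s)

print(s)                       # approaches the unique minimiser x* = (0, 0)
```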
Discrete Gradient Descent

Proof sketch:

1. Define Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$.
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$.
4. Show that this implies that $h_n$ converges.
5. Show that $h_n$ must converge to 0.
6. $s_n \to x^\star$.

Step 1: Define Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$. Because of the error that the discretisation introduced, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. (Illustration credit: Online Learning and Stochastic Approximations, Léon Bottou (1998).)

Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. With some algebra, note that

$$\begin{aligned}
h_{n+1} - h_n &= \langle s_{n+1} - x^\star, s_{n+1} - x^\star \rangle - \langle s_n - x^\star, s_n - x^\star \rangle \\
&= \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2 \langle s_{n+1} - s_n, x^\star \rangle \\
&= \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2 \varepsilon_n \langle \nabla C(s_n), x^\star \rangle \\
&= -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2
\end{aligned}$$

Step 3: Show that $\sum_{n=1}^\infty h_n^+ < \infty$. Assuming $\|\nabla C(x)\|_2^2 \le A + B \|x - x^\star\|_2^2$, we get

$$h_{n+1} - h_n \le -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + B h_n)$$

$$\implies h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$$

Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$, we get

$$h_{n+1}' - h_n' \le \varepsilon_n^2 \mu_n A$$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable.

Step 4: Show that this implies that $h_n$ converges. If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why?). But

$$h_{n+1} = h_0 + \sum_{k=0}^{n} \big( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \big)$$

so $(h_n)_{n=1}^\infty$ converges.

Step 5: Show that $h_n$ must converge to 0. Assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0$$

contradicting $h_n$ converging away from 0.
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm!

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x)$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
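A minimal sketch of this subsampled-gradient scheme on a least-squares problem (the data generation, minibatch size, and step-size constant are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 1000, 5, 32
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(D)
for n in range(1, 2001):
    idx = rng.choice(N, size=K, replace=False)           # I_K: random subset of size K
    grad_est = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / K  # unbiased estimate of grad C(w)
    w = w - (0.5 / n) * grad_est                         # Robbins-Monro step sizes

print(np.linalg.norm(w - w_true))                        # small: w is close to w_true
```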
Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if:

- $\mathbb{E}[|X_n|] < \infty$ for all $n$;
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$.

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\Big[ \big(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n\big) \, \mathbb{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}} \Big] < \infty$$

then $X_n \to X_\infty$ almost surely.

³ On a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.
Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies:

1. $C$ has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$;
3. $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B \|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
Stochastic Gradient Descent

Proof sketch:

1. Define Lyapunov process $h_n = \|s_n - x^\star\|_2^2$.
2. Consider the variations $h_{n+1} - h_n$.
3. Show that $h_n$ converges almost surely.
4. Show that $h_n$ must converge to 0 almost surely.
5. $s_n \to x^\star$.

Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^\star, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$$

So

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \, \mathbb{E}\big[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\big]$$

Step 3: Show that $h_n$ converges almost surely. Exactly as for discrete gradient descent:

- Assume $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B \|x - x^\star\|_2^2$.
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$.

We get

$$\mathbb{E}\big[h_{n+1}' - h_n' \mid \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A$$

$$\implies \mathbb{E}\Big[ (h_{n+1}' - h_n') \, \mathbb{1}_{\{\mathbb{E}[h_{n+1}' - h_n' \mid \mathcal{F}_n] > 0\}} \,\Big|\, \mathcal{F}_n \Big] \le \varepsilon_n^2 \mu_n A$$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so quasimartingale convergence gives that $(h_n')_{n=1}^\infty$ converges almost surely, and hence $(h_n)_{n=1}^\infty$ converges almost surely.

Step 4: Show that $h_n$ must converge to 0 almost surely. From previous calculations,

$$\mathbb{E}\big[ h_{n+1} - (1 + \varepsilon_n^2 B) h_n \,\big|\, \mathcal{F}_n \big] \le -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A$$

$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable almost surely. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the $\varepsilon_n^2 A$ term is summable almost surely, and hence so is the remaining term:

$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle < \infty \quad \text{almost surely}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely}$$

and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.
Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}\big[\|H_n(x)\|^k\big] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
Motivation

So far we've told you why SGD is "safe" :)

...but Robbins-Monro is just a sufficient condition. So how do we choose learning rates to achieve:

- fast convergence?
- a better local optimum?
Motivation (cont.)

(Figure from [Bergstra and Bengio, 2012].)

Idea 1: search for the best learning rate schedule:

- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy, and I don't want to spend too much time!
Motivation (cont.)

Idea 2: let the learning rates adapt themselves:

- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

...and some practical comparisons.
Online learning

From batch to online learning

Batch learning often assumes:

- we've got the whole dataset $\mathcal{D}$;
- the cost function $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$.

SGD/mini-batch learning accelerates training in real time by:

- processing one / a mini-batch of datapoints each iteration;
- considering gradients with data-point cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$
From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

- each iteration $t$, we receive a loss $f_t(w)$,
- and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$,
- then we update the weights with $w \leftarrow w - \eta g_t$.

For SGD, the received gradient is defined by

$$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$
From batch to online learning (cont.)

Online learning assumes:

- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$);
- then we learn $w_{t+1}$ following some rules.

Performance is measured by regret:

$$R_T^\star = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \tag{1}$$

General definition: $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.
From batch to online learning (cont.)

Follow the Leader (FTL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) \tag{2}$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL. Then $\forall u \in S$:

$$R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})] \tag{3}$$
From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$, and the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\, w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled!
From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \tag{4}$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL. Then $\forall u \in S$:

$$\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]$$
From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S, \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)$$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g_t \rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$$
From batch to online learning (cont.)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \tag{5}$$

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (upper bounds the LHS);
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$;
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$$
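A minimal OGD sketch on a stream of squared-error losses $f_t(w) = (\langle w, x_t \rangle - y_t)^2$ (the stream and the fixed $\eta$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
w, eta = np.zeros(3), 0.05
w_star = np.array([1.0, -2.0, 0.5])

for t in range(500):
    x_t = rng.normal(size=3)
    y_t = x_t @ w_star
    g_t = 2 * (w @ x_t - y_t) * x_t   # gradient of f_t(w) = (<w, x_t> - y_t)^2
    w = w - eta * g_t                 # OGD update (5): w <- w - eta g_t

print(w)                              # close to w_star
```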
From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$, with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$. Then for all $u \in S$:

$$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$$

Keep in mind for the later discussion of adaGrad: I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$; then the L2 norm will be changed to the dual norm $\|\cdot\|_*$, and we use Hölder's inequality for the proof.
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$$

Primal-dual subgradient update:

$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \tag{6}$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \tag{7}$$
adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^\top, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^\top H_t w \tag{8}$$

$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^\top H_t (w - w_t) \tag{9}$$

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective;
- adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2} w^\top H_t w$$
adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^\top \big[\delta I + \mathrm{diag}(G_t^{1/2})\big] (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta \, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \tag{10}$$

(all operations element-wise)
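Update (10) in code, on a hypothetical quadratic loss (the values of $\eta$ and $\delta$ are illustrative):

```python
import numpy as np

grad = lambda w: np.array([12 * w[0] + 8 * w[1], 8 * w[0] + 12 * w[1]])

w = np.array([1.5, 0.5])
G, eta, delta = np.zeros(2), 0.5, 1e-8
for t in range(500):
    g = grad(w)
    G += g ** 2                              # accumulate squared gradients: sum_i g_i^2
    w = w - eta * g / (delta + np.sqrt(G))   # update (10), element-wise

print(w)                                     # heads toward the minimiser (0, 0)
```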
adaGrad [Duchi et al., 2010] (cont.)

WHY does this work? Time-varying learning rate vs. time-varying proximal term:

I want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded in terms of $\|g_t\|_2$. Define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \, \cdot \rangle}$; then $\psi_t(w) = \frac{1}{2} w^\top H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$, and a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2$$
adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$. Then for any $u \in S$:

$$R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$

For $\{w_t\}$ generated using the composite mirror descent update (9):

$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$
Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand tuning;
- it is sensitive to initial conditions.

Possible improvement directions:

- use a truncated sum / running average;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]
Improving adaGrad (cont.)

RMSprop:

$$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^\top, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^\top H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\eta \, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \tag{11}$$

(This instantiates the first improvement direction above: a running average in place of adaGrad's full sum.)
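Update (11) as a sketch on the same hypothetical quadratic ($\rho$, $\eta$, $\delta$ are typical illustrative values, not from the slides):

```python
import numpy as np

grad = lambda w: np.array([12 * w[0] + 8 * w[1], 8 * w[0] + 12 * w[1]])

w = np.array([1.5, 0.5])
G, rho, eta, delta = np.zeros(2), 0.9, 0.01, 1e-8
for t in range(2000):
    g = grad(w)
    G = rho * G + (1 - rho) * g ** 2       # running average of squared gradients
    w = w - eta * g / np.sqrt(delta + G)   # RMS[g]_t = sqrt(delta + G), element-wise

print(w)   # settles near the minimiser (0, 0), oscillating at the scale of eta
```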
Improving adaGrad (cont.)

Motivation for adaDelta: a second-order (Newton-style) step satisfies

$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \implies \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \implies \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

(This targets the second improvement direction above: approximate second-order information.)
Improving adaGrad (cont.)

adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \arg\min_w \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^\top H_t (w - w_t)$$

$$\implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t \tag{12}$$

(Note the absence of a global learning rate $\eta$.)
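Update (12) as a sketch (following Zeiler's accumulate-then-update scheme; constants are illustrative):

```python
import numpy as np

grad = lambda w: np.array([12 * w[0] + 8 * w[1], 8 * w[0] + 12 * w[1]])

w = np.array([1.5, 0.5])
Eg2, Edw2, rho, delta = np.zeros(2), np.zeros(2), 0.95, 1e-6
for t in range(2000):
    g = grad(w)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2                     # running average of g^2
    dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g   # update (12): no global eta
    Edw2 = rho * Edw2 + (1 - rho) * dw ** 2                  # running average of (delta w)^2
    w = w + dw

print(w)   # slowly approaches the minimiser (0, 0)
```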
Improving adaGrad (cont.)

ADAM:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \tag{13}$$

(This combines the running average, momentum, and bias-correction directions above.)
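Update (13) as a sketch (the hyper-parameter values are the common defaults, chosen here for illustration):

```python
import numpy as np

grad = lambda w: np.array([12 * w[0] + 8 * w[1], 8 * w[0] + 12 * w[1]])

w = np.array([1.5, 0.5])
m, v = np.zeros(2), np.zeros(2)
beta1, beta2, eta, delta = 0.9, 0.999, 0.1, 1e-8
for t in range(1, 2001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum) running average
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment running average
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + delta)   # update (13)

print(w)                                     # near the minimiser (0, 0)
```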
Demo time

Wikipedia page "Test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

Idea 3: learn the (global) learning rates as well:

$$w_{t+1} = w_t - \eta g_t \implies w_t = w(t, \eta; \{g_i\}_{i \le t-1})$$

$w_t$ is a function of $\eta$, so learn $\eta$ (with gradient descent) by

$$\eta = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))$$

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- Stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce labour on tuning hyper-parameters.
References

- Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T., Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Plan
I History and modern formulation of stochastic approximation theory
I In-depth look at stochastic gradient descent (SGD)
I Introduction to key ideas in stochastic approximation theory suchas Lyapunov functions quasimartingales and also numericalsolutions to differential equations
1 of 27
Bibliography
Stochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Lin (2003)
Online Learning and Stochastic Approximation Leon Bottou (1998)
A Stochastic Approximation Algorithm Robbins amp Monro (1951)
Stochastic Estimation of the Maximisation of a Regression FunctionKiefer amp Wolfowitz (1952)
Introduction to Stochastic Search and Optimization Spall (2003)
2 of 27
A brief history of numerical analysis
Numerical Analysis is a well-established discipline
c 1800 - 1600 BC
Babyloniansattempted tocalculate
radic2 or in
modern terms findthe roots ofx2 minus 2 = 0
1
1httpwwwmathubcca˜cassEuclidybcybchtml
Yale Babylonian Collection
3 of 27
A brief history of numerical analysis
C17-C19Huge number of numerical techniques developed as tools for thenatural sciences
I Root-finding methods (eg Newton-Raphson)
I Numerical integration (eg Gaussian Quadrature)
I ODE solution (eg Euler method)
I Interpolation (eg Lagrange polynomials)
Focus is on situations where function evaluation is deterministic
4 of 27
A brief history of numerical analysis
1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available
5 of 27
A brief history of numerical analysis
1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available
5 of 27
A brief history of numerical analysis
1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available
5 of 27
A brief history of numerical analysis
1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available
5 of 27
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo
Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator
7 of 27
A brief history of numerical analysis
Present dayStochastic approximation widely researched and used in practice
I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent
I Neural networksI K-MeansI
I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu
amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))
TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)
8 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Exact solution of gradient descent ODE
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent

Proof outline:
1. Define Lyapunov function \(h(t) = \|s(t) - x^\star\|_2^2\)
2. \(h(t)\) is a decreasing function
3. \(h(t)\) converges to a limit, and \(h'(t)\) converges to 0
4. The limit of \(h(t)\) must be 0
5. \(s(t) \to x^\star\)

h(t) is a decreasing function:
\[
\begin{aligned}
\frac{d}{dt}h(t) &= \frac{d}{dt}\|s(t) - x^\star\|_2^2
= \frac{d}{dt}\langle s(t) - x^\star,\, s(t) - x^\star\rangle \\
&= 2\left\langle \frac{ds(t)}{dt},\, s(t) - x^\star \right\rangle
= -2\,\langle s(t) - x^\star,\, \nabla C(s(t))\rangle \;\le\; 0.
\end{aligned}
\]

The limit of h(t) must be 0: if not, there is some \(K > 0\) such that \(h(t) > K\) for all \(t\). By assumption,
\[
\inf_{x\,:\,\|x - x^\star\|_2^2 > K} \langle x - x^\star, \nabla C(x)\rangle > 0,
\]
which means that \(\sup_t h'(t) < 0\), contradicting the convergence of \(h'(t)\) to 0.
Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible. To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!
Discrete Gradient Descent

An intuitive, straightforward discretisation of
\[
\frac{d}{dt}s(t) = -\nabla C(s(t))
\]
is
\[
\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n). \qquad \text{(Forward Euler)}
\]
Although it is easy and quick to implement, it has theoretical (A-)instability issues. An alternative method which is (A-)stable is the implicit Euler method:
\[
\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1}).
\]
So why settle for explicit discretisation? Implicit requires the solution of a non-linear system at each step, and practically, explicit discretisation works.
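A minimal sketch of the two schemes on the running quadratic example (our code, not from the slides; ε = 0.11 is chosen to sit just beyond forward Euler's stability threshold ε < 2/20 for this C):

```python
import numpy as np

A = np.array([[12., 8.], [8., 12.]])  # Hessian of C(x) = 5(x1+x2)^2 + (x1-x2)^2
eps = 0.11                            # just above forward Euler's stability limit 2/20
s_explicit = np.array([1.5, 0.5])
s_implicit = np.array([1.5, 0.5])

for _ in range(100):
    # forward (explicit) Euler: one gradient evaluation per step
    s_explicit = s_explicit - eps * (A @ s_explicit)
    # backward (implicit) Euler: solve (I + eps*A) s_{n+1} = s_n each step;
    # a linear solve here because C is quadratic, a non-linear solve in general
    s_implicit = np.linalg.solve(np.eye(2) + eps * A, s_implicit)

print(s_explicit)  # blows up: |1 - 0.11*20| > 1 along the (1,1) eigendirection
print(s_implicit)  # converges to the minimiser regardless of eps
```

This is the practical content of (A-)stability: the implicit scheme tolerates any positive step size, at the price of a solve per iteration.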
Discrete Gradient Descent

We also have multistep methods, e.g.
\[
s_{n+2} = s_{n+1} - \varepsilon_{n+2}\left(\frac{3}{2}\nabla C(s_{n+1}) - \frac{1}{2}\nabla C(s_n)\right).
\]
This is the 2nd-order Adams-Bashforth method. It has similar (in)stability properties to forward Euler, but a discretisation error of a smaller order of magnitude at each step (\(O(\varepsilon^3)\) compared with \(O(\varepsilon^2)\)).
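A sketch of the corresponding iteration (our code; ε = 0.04 keeps both gradient eigenvalues inside AB2's stability interval, and the first step is bootstrapped with forward Euler since AB2 needs two starting points):

```python
import numpy as np

A = np.array([[12., 8.], [8., 12.]])
grad_C = lambda s: A @ s
eps = 0.04

s_prev = np.array([1.5, 0.5])
s_curr = s_prev - eps * grad_C(s_prev)   # bootstrap: one forward-Euler step
for _ in range(300):
    # 2nd-order Adams-Bashforth step for ds/dt = -grad_C(s)
    s_next = s_curr - eps * (1.5 * grad_C(s_curr) - 0.5 * grad_C(s_prev))
    s_prev, s_curr = s_curr, s_next

print(s_curr)  # ~ (0, 0), the minimiser
```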
Discrete Gradient Descent

Examples of discretisation with ε = 0.1, ε = 0.07 and ε = 0.01 [embedded animations]
Discrete Gradient Descent

Proposition: If \(s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)\) and \(C\) satisfies:

1. \(C\) has a unique minimiser \(x^\star\)
2. \(\forall \varepsilon > 0:\ \inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x)\rangle > 0\)
3. \(\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2\) for some \(A, B \ge 0\),

then subject to \(\sum_n \varepsilon_n = \infty\) and \(\sum_n \varepsilon_n^2 < \infty\), we have \(s_n \to x^\star\).

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
Discrete Gradient Descent

Proof outline:
1. Define Lyapunov sequence \(h_n = \|s_n - x^\star\|_2^2\)
2. Consider the positive variations \(h_n^+ = \max(0, h_{n+1} - h_n)\)
3. Show that \(\sum_{n=1}^\infty h_n^+ < \infty\)
4. Show that this implies that \(h_n\) converges
5. Show that \(h_n\) must converge to 0
6. \(s_n \to x^\star\)

Define Lyapunov sequence \(h_n = \|s_n - x^\star\|_2^2\). Because of the error that the discretisation introduced, \(h_n\) is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: \(h_n\) may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of \(h_n\). (Illustration credit: Online Learning and Stochastic Approximations, Leon Bottou (1998).)

Consider the positive variations \(h_n^+ = \max(0, h_{n+1} - h_n)\). With some algebra, note that
\[
\begin{aligned}
h_{n+1} - h_n &= \langle s_{n+1} - x^\star, s_{n+1} - x^\star\rangle - \langle s_n - x^\star, s_n - x^\star\rangle \\
&= \langle s_{n+1}, s_{n+1}\rangle - \langle s_n, s_n\rangle - 2\,\langle s_{n+1} - s_n, x^\star\rangle \\
&= \langle s_n - \varepsilon_n\nabla C(s_n),\, s_n - \varepsilon_n\nabla C(s_n)\rangle - \langle s_n, s_n\rangle + 2\varepsilon_n\langle\nabla C(s_n), x^\star\rangle \\
&= -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2\,\|\nabla C(s_n)\|_2^2.
\end{aligned}
\]
Discrete Gradient Descent

Show that \(\sum_{n=1}^\infty h_n^+ < \infty\): assuming \(\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2\), we get
\[
h_{n+1} - h_n \le -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2(A + Bh_n)
\]
\[
\Rightarrow\quad h_{n+1} - (1 + \varepsilon_n^2 B)h_n \le -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A.
\]
Writing \(\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}\) and \(h_n' = \mu_n h_n\), we get
\[
h_{n+1}' - h_n' \le \varepsilon_n^2 \mu_n A.
\]
Assuming \(\sum_{n=1}^\infty \varepsilon_n^2 < \infty\), \(\mu_n\) converges away from 0, so the RHS is summable.
Discrete Gradient Descent

Show that this implies that \(h_n\) converges: if \(\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty\), then \(\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty\) (why? because \(h_n \ge 0\)). But
\[
h_{n+1} = h_0 + \sum_{k=0}^{n}\bigl(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\bigr),
\]
so \((h_n)_{n=1}^\infty\) converges.
Discrete Gradient Descent

Show that \(h_n\) must converge to 0: assume \(h_n\) converges to some positive number. Previously we had
\[
h_{n+1} - h_n = -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2\|\nabla C(s_n)\|_2^2.
\]
This is summable, and if we assume further that \(\sum_{n=1}^\infty \varepsilon_n = \infty\), then we get
\[
\langle s_n - x^\star, \nabla C(s_n)\rangle \to 0,
\]
contradicting \(h_n\) converging away from 0.
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm. If \(C\) is a cost function averaged across a data set, we often have a true gradient of the form
\[
\nabla C(x) = \frac{1}{N}\sum_{n=1}^N f_n(x),
\]
and an approximation is formed by subsampling:
\[
\widehat{\nabla C}(x) = \frac{1}{K}\sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K).
\]
We'll treat the more general case where the gradient estimates \(\widehat{\nabla C}(x)\) are unbiased and independent. The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
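To make the subsampling concrete, here is a least-squares sketch (our own toy example; the data, batch size and the ε_n = 1/n schedule are illustrative choices) in which the mini-batch gradient is an unbiased estimate of the full gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # toy design matrix
y = X @ np.arange(1., 6.) + 0.1 * rng.normal(size=1000)

def full_grad(w):
    # gradient of C(w) = (1/2N) * sum_n (x_n . w - y_n)^2
    return X.T @ (X @ w - y) / len(y)

def minibatch_grad(w, K=32):
    # unbiased estimate: average f_k over a uniformly subsampled index set I_K
    idx = rng.choice(len(y), size=K, replace=False)
    return X[idx].T @ (X[idx] @ w - y[idx]) / K

w = np.zeros(5)
for n in range(1, 5001):
    # Robbins-Monro steps: sum eps_n diverges, sum eps_n^2 converges
    w = w - (1.0 / n) * minibatch_grad(w)

print(np.linalg.norm(full_grad(w)))  # small: w is near the minimiser
```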
Measure-theory-free martingale theory

Let \((X_n)_{n \ge 0}\) be a stochastic process. \(\mathcal{F}_n\) denotes "the information describing the stochastic process up to time n", denoted mathematically by \(\mathcal{F}_n = \sigma(X_m \mid m \le n)\).

Definition: A stochastic process \((X_n)_{n=0}^\infty\) is a martingale³ if
- \(\mathbb{E}[|X_n|] < \infty\) for all \(n\);
- \(\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m\) for \(n \ge m\).

Martingale convergence theorem (Doob, 1953): If \((X_n)_{n=1}^\infty\) is a martingale and \(\sup_n \mathbb{E}[|X_n|] < \infty\), then \(X_n \to X_\infty\) almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If \((X_n)_{n=1}^\infty\) is a positive stochastic process and
\[
\sum_{n=1}^\infty \mathbb{E}\bigl[(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\,\mathbf{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}}\bigr] < \infty,
\]
then \(X_n \to X_\infty\) almost surely.

³On a filtered probability space \((\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})\), with \(\mathcal{F}_n = \sigma(X_m \mid m \le n)\) for all \(n\).
Stochastic Gradient Descent

Proposition: If \(s_{n+1} = s_n - \varepsilon_n H_n(s_n)\), with \(H_n(s_n)\) an unbiased estimator for \(\nabla C(s_n)\), and \(C\) satisfies:

1. \(C\) has a unique minimiser \(x^\star\)
2. \(\forall \varepsilon > 0:\ \inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x)\rangle > 0\)
3. \(\mathbb{E}\bigl[\|H_n(x)\|_2^2\bigr] \le A + B\|x - x^\star\|_2^2\) for some \(A, B \ge 0\) independent of \(n\),

then subject to \(\sum_n \varepsilon_n = \infty\) and \(\sum_n \varepsilon_n^2 < \infty\), we have \(s_n \to x^\star\).

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
Stochastic Gradient Descent

Proof outline:
1. Define Lyapunov process \(h_n = \|s_n - x^\star\|_2^2\)
2. Consider the variations \(h_{n+1} - h_n\)
3. Show that \(h_n\) converges almost surely
4. Show that \(h_n\) must converge to 0 almost surely
5. \(s_n \to x^\star\)

Consider the variations \(h_{n+1} - h_n\): by exactly the same calculation as in the deterministic discrete case,
\[
h_{n+1} - h_n = -2\varepsilon_n\langle s_n - x^\star, H_n(s_n)\rangle + \varepsilon_n^2\|H_n(s_n)\|_2^2.
\]
So
\[
\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2\,\mathbb{E}\bigl[\|H_n(s_n)\|_2^2 \,\big|\, \mathcal{F}_n\bigr].
\]
Stochastic Gradient Descent

Show that \(h_n\) converges almost surely: exactly the same as for discrete gradient descent.
- Assume \(\mathbb{E}\bigl[\|H_n(x)\|_2^2\bigr] \le A + B\|x - x^\star\|_2^2\).
- Introduce \(\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}\) and \(h_n' = \mu_n h_n\).

We get
\[
\mathbb{E}\bigl[h_{n+1}' - h_n' \,\big|\, \mathcal{F}_n\bigr] \le \varepsilon_n^2 \mu_n A
\;\Rightarrow\;
\mathbb{E}\bigl[(h_{n+1}' - h_n')\,\mathbf{1}_{\{\mathbb{E}[h_{n+1}' - h_n' \mid \mathcal{F}_n] > 0\}} \,\big|\, \mathcal{F}_n\bigr] \le \varepsilon_n^2 \mu_n A.
\]
\(\sum_{n=1}^\infty \varepsilon_n^2 < \infty\) ⟹ quasimartingale convergence ⟹ \((h_n')_{n=1}^\infty\) converges a.s. ⟹ \((h_n)_{n=1}^\infty\) converges a.s.
Stochastic Gradient Descent

Show that \(h_n\) must converge to 0 almost surely: from previous calculations,
\[
\mathbb{E}\bigl[h_{n+1} - (1 + \varepsilon_n^2 B)h_n \,\big|\, \mathcal{F}_n\bigr] \le -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2 A.
\]
\((h_n)_{n=1}^\infty\) converges, so the left-hand side is summable a.s. We assume \(\sum_{n=1}^\infty \varepsilon_n^2 < \infty\), so the right term is summable a.s., and hence the left term is also summable a.s.:
\[
\sum_{n=1}^\infty \varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle < \infty \quad \text{almost surely.}
\]
If we assume in addition that \(\sum_{n=1}^\infty \varepsilon_n = \infty\), this forces
\[
\langle s_n - x^\star, \nabla C(s_n)\rangle \to 0 \quad \text{almost surely,}
\]
and by our initial assumptions about \(C\), \((h_n)_{n=1}^\infty\) must converge to 0 almost surely.
Relaxing the pseudo-convexity conditions

Often the conditions on \(C\) in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- \(C\) is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: \(\sum_{n=1}^\infty \varepsilon_n = \infty\), \(\sum_{n=1}^\infty \varepsilon_n^2 < \infty\);
- the gradient estimate \(H\) satisfies \(\mathbb{E}\bigl[\|H_n(x)\|^k\bigr] \le A_k + B_k\|x\|^k\) for \(k = 2, 3, 4\);
- there exists \(D > 0\) such that \(\inf_{\|x\|_2^2 > D} \langle x, \nabla C(x)\rangle > 0\).

The idea for the proof is then:

1. For a given start position, the sequence \((s_n)_{n=1}^\infty\) is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function \(h_n = C(s_n)\) and prove its almost-sure convergence.
3. Prove that \(\nabla C(s_n)\) necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.
Motivation

So far we've told you why SGD is "safe" :)

...but Robbins-Monro is just a sufficient condition. So how should we choose learning rates to achieve fast convergence and a better local optimum?
Motivation (cont.)

Figure from [Bergstra and Bengio, 2012].

Idea 1: search for the best learning rate schedule: grid search, random search, cross validation, Bayesian optimization.

But I'm lazy, and I don't want to spend too much time on this!
Motivation (cont.)

Idea 2: let the learning rates adapt themselves.

- Pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning (a good tutorial: [Shalev-Shwartz, 2012]; milestone paper: [Zinkevich, 2003]), and then some practical comparisons.
Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset \(\mathcal{D}\);
- the cost function is \(C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]\).

SGD/mini-batch learning accelerates training in real time by processing one datapoint/a mini-batch of datapoints each iteration, considering gradients of the data-point cost functions:
\[
\nabla C(w, \mathcal{D}) \approx \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m).
\]
Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD): each iteration \(t\) we receive a loss \(f_t(w)\) and a (noisy) (sub-)gradient \(g_t \in \partial f_t(w)\); then we update the weights with \(w \leftarrow w - \eta g_t\).

For SGD, the received gradient is defined by
\[
g_t = \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m).
\]
Online learning

From batch to online learning (cont.)

Online learning assumes: each iteration \(t\) we receive a loss \(f_t(w)\) (in the batch learning context, \(f_t(w) = c(w, x_t)\)); then we learn \(w_{t+1}\) following some rules.

Performance is measured by regret:
\[
R_T^\star = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S}\sum_{t=1}^T f_t(w). \tag{1}
\]
General definition: \(R_T(u) = \sum_{t=1}^T \bigl[f_t(w_t) - f_t(u)\bigr]\).

SGD ⇔ "follow the regularized leader" with L2 regularization.
Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):
\[
w_{t+1} = \underset{w \in S}{\operatorname{argmin}} \sum_{i=1}^t f_i(w). \tag{2}
\]

Lemma (upper bound of regret): Let \(w_1, w_2, \ldots\) be the sequence produced by FTL; then \(\forall u \in S\),
\[
R_T(u) = \sum_{t=1}^T\bigl[f_t(w_t) - f_t(u)\bigr] \le \sum_{t=1}^T\bigl[f_t(w_t) - f_t(w_{t+1})\bigr]. \tag{3}
\]
Online learning

From batch to online learning (cont.)

A game that fools FTL: let \(w \in S = [-1, 1]\), and let the loss function at time \(t\) be
\[
f_t(w) =
\begin{cases}
-0.5\,w & t = 1 \\
w & t \text{ is even} \\
-w & t > 1 \text{ and } t \text{ is odd.}
\end{cases}
\]
FTL is easily fooled! (See the simulation sketch below.)
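A short simulation of this game (our sketch; w₁ = 0 and T = 100 are arbitrary choices) shows FTL's regret growing linearly, since the fixed comparator u = 0 suffers zero loss:

```python
import numpy as np

def coef(t):
    # f_t(w) = coef(t) * w on S = [-1, 1]
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T, w, ftl_loss, cum = 100, 0.0, 0.0, 0.0
for t in range(1, T + 1):
    ftl_loss += coef(t) * w   # loss FTL suffers this round
    cum += coef(t)            # coefficient of the cumulative loss so far
    w = -np.sign(cum)         # FTL: argmin of cum * w over [-1, 1]

print(ftl_loss)  # ~ T - 1, while the fixed point u = 0 has total loss 0
```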
Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):
\[
w_{t+1} = \underset{w \in S}{\operatorname{argmin}} \sum_{i=1}^t f_i(w) + \varphi(w). \tag{4}
\]

Lemma (upper bound of regret): Let \(w_1, w_2, \ldots\) be the sequence produced by FTRL; then \(\forall u \in S\),
\[
\sum_{t=1}^T\bigl[f_t(w_t) - f_t(u)\bigr] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T\bigl[f_t(w_t) - f_t(w_{t+1})\bigr].
\]
Online learning

From batch to online learning (cont.)

Let's assume the losses \(f_t\) are convex functions:
\[
\forall w, u \in S:\quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle, \quad \forall g(w) \in \partial f_t(w).
\]
Use the linearisation \(\tilde{f}_t(w) = f_t(w_t) + \langle w, g_t\rangle\) as a surrogate:
\[
\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t\rangle - \langle u, g_t\rangle, \quad \forall g_t \in \partial f_t(w_t).
\]
Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):
\[
w_{t+1} = w_t - \eta g_t, \qquad g_t \in \partial f_t(w_t). \tag{5}
\]
From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (it upper-bounds the LHS);
- use the L2 regularizer \(\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2\);
- apply the FTRL regret bound to the surrogate loss:
\[
w_{t+1} = \underset{w \in S}{\operatorname{argmin}} \sum_{i=1}^t \langle w, g_i\rangle + \varphi(w).
\]
Online learning

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions \(f_1, f_2, \ldots, f_T\) with regularizer \(\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2\); then for all \(u \in S\),
\[
R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta\sum_{t=1}^T \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t).
\]
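A quick consequence (our arithmetic, not on the slides): if \(\|g_t\|_2 \le G\) for all \(t\) and \(\|u - w_1\|_2 \le D\), balancing the two terms of the bound gives the usual \(O(\sqrt{T})\) regret:

```latex
\[
  R_T(u) \;\le\; \frac{D^2}{2\eta} + \eta T G^2,
  \qquad
  \eta^\star = \frac{D}{G\sqrt{2T}}
  \;\Rightarrow\;
  R_T(u) \;\le\; DG\sqrt{2T} \;=\; O(\sqrt{T}).
\]
```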
Online learning

From batch to online learning (cont.)

Keep in mind for the later discussion of adaGrad: I want \(\varphi\) to be strongly convex w.r.t. a (semi-)norm \(\|\cdot\|\); then the L2 norm in the bound will be changed to the dual norm \(\|\cdot\|_\star\), and we use Hölder's inequality for the proof.
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms \(\psi_t\): \(\psi_t\) is a strongly-convex function, and
\[
B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t),\, w - w_t\rangle.
\]
Primal-dual sub-gradient update:
\[
w_{t+1} = \underset{w}{\operatorname{argmin}}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w). \tag{6}
\]
Proximal gradient/composite mirror descent:
\[
w_{t+1} = \underset{w}{\operatorname{argmin}}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t). \tag{7}
\]
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:
\[
G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \operatorname{diag}(G_t)^{1/2},
\]
\[
w_{t+1} = \underset{w \in S}{\operatorname{argmin}}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t}\,w^T H_t w, \tag{8}
\]
\[
w_{t+1} = \underset{w \in S}{\operatorname{argmin}}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t). \tag{9}
\]
To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective;
- adapt the proximal term through time: \(\psi_t(w) = \frac{1}{2}w^T H_t w\).
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on \(S = \mathbb{R}^D\) with \(\varphi(w) = 0\):
\[
w_{t+1} = \underset{w}{\operatorname{argmin}}\; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T\bigl[\delta I + \operatorname{diag}(G_t)^{1/2}\bigr](w - w_t)
\]
\[
\Rightarrow\quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \quad \text{(elementwise).} \tag{10}
\]
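A per-coordinate sketch of update (10) (our code; the function name and defaults are illustrative):

```python
import numpy as np

def adagrad_step(w, g, G_diag, eta=0.1, delta=1e-8):
    """One adaGrad step, eq. (10): the running sum of squared gradients
    G_diag = diag(G_t) gives each coordinate its own step size."""
    G_diag = G_diag + g ** 2
    w = w - eta * g / (delta + np.sqrt(G_diag))
    return w, G_diag

# usage: w, G = adagrad_step(w, grad(w), G), with G initialised to np.zeros_like(w)
```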
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term: I want the contribution of \(\|g_t\|_\star^2\) induced by \(\psi_t\) to be upper bounded by \(\|g_t\|_2\). Define \(\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot,\, H_t\, \cdot\rangle}\); then \(\psi_t(w) = \frac{1}{2}w^T H_t w\) is strongly convex w.r.t. \(\|\cdot\|_{\psi_t}\), and a little math can show that
\[
\sum_{t=1}^T \|g_t\|_{\psi_t^\star}^2 \le 2\sum_{d=1}^D \|g_{1:T,d}\|_2.
\]
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence \(w_t\) is generated by adaGrad using the primal-dual update (8) with \(\delta \ge \max_t \|g_t\|_\infty\); then for any \(u \in S\),
\[
R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2.
\]
For \(w_t\) generated using the composite mirror descent update (9),
\[
R_T(u) \le \frac{1}{2\eta}\max_{t \le T}\|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2.
\]
Advanced adaptive learning rates

Improving adaGrad

Possible problems of adaGrad:
- the global learning rate η still needs hand tuning;
- it is sensitive to initial conditions.

Possible improving directions:
- use a truncated sum / running average;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods: RMSprop [Tieleman and Hinton, 2012], adaDelta [Zeiler, 2012], ADAM [Kingma and Ba, 2014].
Advanced adaptive learning rates

Improving adaGrad (cont.)

RMSprop:
\[
G_t = \rho G_{t-1} + (1 - \rho)g_t g_t^T, \qquad H_t = (\delta I + \operatorname{diag}(G_t))^{1/2},
\]
\[
w_{t+1} = \underset{w}{\operatorname{argmin}}\; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t(w - w_t)
\;\Rightarrow\;
w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \quad \mathrm{RMS}[g]_t = \operatorname{diag}(H_t). \tag{11}
\]
(Improving direction used: a running average in place of adaGrad's ever-growing sum.)
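The same step with the exponential running average, per eq. (11) (our sketch; ρ and the other defaults are illustrative):

```python
import numpy as np

def rmsprop_step(w, g, G_diag, eta=1e-3, rho=0.9, delta=1e-8):
    """One RMSprop step, eq. (11): an exponential moving average of g^2
    replaces adaGrad's ever-growing sum, so old gradients are forgotten."""
    G_diag = rho * G_diag + (1 - rho) * g ** 2
    w = w - eta * g / np.sqrt(delta + G_diag)
    return w, G_diag
```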
Advanced adaptive learning rates

Improving adaGrad (cont.)

\[
\Delta w \propto H^{-1}g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2}
\;\Rightarrow\;
\frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w}
\;\Rightarrow\;
\operatorname{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}.
\]
(Improving direction used: approximate second-order information.)
Advanced adaptive learning rates

Improving adaGrad (cont.)

adaDelta:
\[
\operatorname{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}},
\]
\[
w_{t+1} = \underset{w}{\operatorname{argmin}}\; \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t(w - w_t)
\;\Rightarrow\;
w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t. \tag{12}
\]
(Improving directions used: running averages and approximate second-order information; note that the global learning rate η has disappeared from the update.)
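Eq. (12) as code (our sketch following [Zeiler, 2012], with RMS[x] = sqrt(E[x²] + δ); ρ and δ defaults are illustrative):

```python
import numpy as np

def adadelta_step(w, g, Eg2, Edw2, rho=0.95, delta=1e-6):
    """One adaDelta step, eq. (12): the RMS of past *updates* supplies
    the scale, so no global learning rate is needed."""
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2           # running average of g^2
    dw = -(np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta)) * g
    Edw2 = rho * Edw2 + (1 - rho) * dw ** 2        # running average of dw^2
    return w + dw, Eg2, Edw2
```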
Advanced adaptive learning rates

Improving adaGrad (cont.)

ADAM:
\[
m_t = \beta_1 m_{t-1} + (1 - \beta_1)g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^2,
\]
\[
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t},
\]
\[
w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}. \tag{13}
\]
(Improving directions used: running averages, momentum, and bias correction of the running averages.)
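Eq. (13) as code (our sketch; the defaults are the ones suggested in [Kingma and Ba, 2014]):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
    """One ADAM step, eq. (13): momentum (m) plus RMS scaling (v),
    both bias-corrected for their zero initialisation."""
    t = t + 1
    m = beta1 * m + (1 - beta1) * g          # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + delta)
    return w, m, v, t
```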
Advanced adaptive learning rates

Demo time!

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Advanced adaptive learning rates

Learn the learning rates

Idea 3: learn the (global) learning rates as well.
\[
w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1}),
\]
i.e. \(w_t\) is a function of η. Learn η (with gradient descent) by
\[
\eta^\star = \underset{\eta}{\operatorname{argmin}} \sum_{s \le t} f_s(w(s, \eta)).
\]
- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015].
- Stochastic gradient descent to learn η on the fly [Masse and Ollivier, 2015].
Advanced adaptive learning rates

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.
Advanced adaptive learning rates

References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281-305, 2012.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011.
- Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
- Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Bibliography
Stochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Lin (2003)
Online Learning and Stochastic Approximation Leon Bottou (1998)
A Stochastic Approximation Algorithm Robbins amp Monro (1951)
Stochastic Estimation of the Maximisation of a Regression FunctionKiefer amp Wolfowitz (1952)
Introduction to Stochastic Search and Optimization Spall (2003)
2 of 27
A brief history of numerical analysis
Numerical Analysis is a well-established discipline
c 1800 - 1600 BC
Babyloniansattempted tocalculate
radic2 or in
modern terms findthe roots ofx2 minus 2 = 0
1
1httpwwwmathubcca˜cassEuclidybcybchtml
Yale Babylonian Collection
3 of 27
A brief history of numerical analysis
C17-C19Huge number of numerical techniques developed as tools for thenatural sciences
I Root-finding methods (eg Newton-Raphson)
I Numerical integration (eg Gaussian Quadrature)
I ODE solution (eg Euler method)
I Interpolation (eg Lagrange polynomials)
Focus is on situations where function evaluation is deterministic
4 of 27
A brief history of numerical analysis
1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available
5 of 27
A brief history of numerical analysis
1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available
5 of 27
A brief history of numerical analysis
1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available
5 of 27
A brief history of numerical analysis
1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available
5 of 27
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo
Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator
7 of 27
A brief history of numerical analysis
Present dayStochastic approximation widely researched and used in practice
I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent
I Neural networksI K-MeansI
I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu
amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))
TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)
8 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Exact solution of gradient descent ODE
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
d
dth(t) =
d
dts(t)minus x2
2
=d
dt〈s(t)minus x s(t)minus x〉
= 2
langds(t)
dt s(t)minus x
rang= minus〈s(t)minus xnablaC (s(t))〉 le 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption
infx h(x)gtK
〈x minus xnablaC (x)〉 gt 0
which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)
to 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent

Proposition: If s_{n+1} = s_n - \varepsilon_n H_n(s_n), with H_n(s_n) an unbiased estimator for \nabla C(s_n), and C satisfies

1. C has a unique minimiser x^*,
2. \forall \varepsilon > 0: \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0,
3. E[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2 for some A, B \ge 0 independent of n,

then, subject to \sum_n \varepsilon_n = \infty and \sum_n \varepsilon_n^2 < \infty, we have s_n \to x^*.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement give the control we need over the unbiased gradient estimators.
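A hedged numerical illustration of the proposition (our toy example, not from the slides): a strongly convex C with unique minimiser x^*, unbiased noisy gradients, and \varepsilon_n = 1/n, so \sum_n \varepsilon_n = \infty and \sum_n \varepsilon_n^2 < \infty:

import numpy as np

rng = np.random.default_rng(1)
x_star = np.array([2.0, -1.0])

def H(s):
    # unbiased estimator of nabla C(s) = s - x_star, for C(s) = 0.5*||s - x_star||^2
    return (s - x_star) + rng.normal(scale=1.0, size=2)

s = np.zeros(2)
for n in range(1, 100001):
    s = s - (1.0 / n) * H(s)       # s_{n+1} = s_n - eps_n * H_n(s_n), eps_n = 1/n
print(np.linalg.norm(s - x_star))  # small: s_n has drifted to x*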
Stochastic Gradient Descent

Proof outline (the steps are fleshed out below):

1. Define the Lyapunov process h_n = \|s_n - x^*\|_2^2.
2. Consider the variations h_{n+1} - h_n.
3. Show that h_n converges almost surely.
4. Show that h_n must converge to 0 almost surely.
5. Conclude s_n \to x^*.
Step 2: consider the positive variations h_n^+ = \max(0, h_{n+1} - h_n). By exactly the same calculation as in the deterministic discrete case,

h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2.

So

E[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 E[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n].
Step 3: show that h_n converges almost surely. Exactly the same as for discrete gradient descent:

- assume E[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2;
- introduce \mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B} and h'_n = \mu_n h_n;
- get E[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A, hence

E\big[(h'_{n+1} - h'_n)\,\mathbf{1}_{\{E[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A.

\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty \Rightarrow quasimartingale convergence \Rightarrow (h'_n)_{n=1}^{\infty} converges a.s. \Rightarrow (h_n)_{n=1}^{\infty} converges a.s.
Step 4: show that h_n must converge to 0 almost surely. From the previous calculations,

E[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A.

(h_n)_{n=1}^{\infty} converges, and since \sum_{n=1}^{\infty} \varepsilon_n^2 < \infty the right-hand term is summable a.s., so the left-hand side is also summable a.s.; rearranging,

\sum_{n=1}^{\infty} \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty almost surely.

If we assume in addition that \sum_{n=1}^{\infty} \varepsilon_n = \infty, this forces

\langle s_n - x^*, \nabla C(s_n) \rangle \to 0 almost surely,

and by our initial assumptions about C, (h_n)_{n=1}^{\infty} must converge to 0 almost surely.
Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- C is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: \sum_{n=1}^{\infty} \varepsilon_n = \infty, \sum_{n=1}^{\infty} \varepsilon_n^2 < \infty;
- the gradient estimate H satisfies E[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k for k = 2, 3, 4;
- there exists D > 0 such that \inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0.

The idea for the proof is then:

1. For a given start position, the sequence (s_n)_{n=1}^{\infty} is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that \nabla C(s_n) necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
Adaptive learning rates (Yingzhen Li, University of Cambridge)

Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. How, then, do we choose learning rates to achieve

- fast convergence,
- a better local optimum?
Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

Idea 1: search for the best learning rate schedule: grid search, random search, cross validation, Bayesian optimization.

But I'm lazy and I don't want to spend too much time!
Motivation (cont.)

Idea 2: let the learning rates adapt themselves:

- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning (a good tutorial: [Shalev-Shwartz, 2012]; milestone paper: [Zinkevich, 2003]), and then some practical comparisons.
Online learning

From batch to online learning

Batch learning often assumes:

- we've got the whole dataset D;
- the cost function is C(w, D) = E_{x \sim D}[c(w, x)].

SGD / mini-batch learning accelerates training in real time by processing one datapoint, or a mini-batch, each iteration, using gradients of the per-datapoint cost functions:

\nabla C(w, D) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m).
From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

- at each iteration t we receive a loss f_t(w) and a (noisy) (sub-)gradient g_t \in \partial f_t(w);
- then we update the weights with w \leftarrow w - \eta g_t.

For SGD the received gradient is defined by

g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m).
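A minimal OGD loop matching the protocol above; the quadratic losses f_t and all constants are illustrative stand-ins of ours, not from the slides:

import numpy as np

rng = np.random.default_rng(2)
eta, T, D = 0.1, 500, 3
targets = rng.normal(size=(T, D))   # defines f_t(w) = 0.5 * ||w - targets[t]||^2

w = np.zeros(D)
total_loss = 0.0
for t in range(T):
    total_loss += 0.5 * np.sum((w - targets[t]) ** 2)  # suffer f_t(w_t)
    g_t = w - targets[t]                               # (sub-)gradient of f_t at w_t
    w = w - eta * g_t                                  # w <- w - eta * g_t
print(total_loss / T)  # average per-round loss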
From batch to online learning (cont.)

Online learning assumes:

- at each iteration t we receive a loss f_t(w) (in the batch learning context, f_t(w) = c(w, x_t));
- then we learn w_{t+1} following some rules.

Performance is measured by regret:

R^*_T = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w).    (1)

General definition: R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)].

SGD \Leftrightarrow "follow the regularized leader" with L2 regularization.
From batch to online learning (cont.)

Follow the Leader (FTL):

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w).    (2)

Lemma (upper bound on regret): Let w_1, w_2, \ldots be the sequence produced by FTL. Then \forall u \in S,

R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})].    (3)
From batch to online learning (cont.)

A game that fools FTL: let w \in S = [-1, 1], and let the loss function at time t be

f_t(w) = -0.5w if t = 1;  w if t is even;  -w if t > 1 and t is odd.

FTL is easily fooled! (A short simulation follows.)
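Replaying the game in code (a sketch of ours): FTL's cumulative loss grows linearly in T, while a fixed comparator does no worse than about -0.5, so the regret is Omega(T):

def f(t, w):
    # the adversarial losses from the slide, t = 1, 2, 3, ...
    if t == 1:
        return -0.5 * w
    return w if t % 2 == 0 else -w

endpoints = [-1.0, 1.0]   # each cumulative loss is linear on S = [-1, 1]
w_t, ftl_loss = 0.0, 0.0  # start anywhere in S
for t in range(1, 101):
    ftl_loss += f(t, w_t)
    # FTL: minimise the cumulative past loss (attained at an endpoint)
    w_t = min(endpoints, key=lambda w: sum(f(i, w) for i in range(1, t + 1)))
print(ftl_loss)  # ~ 99: FTL alternates endpoints and pays ~1 per round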
From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w).    (4)

Lemma (upper bound on regret): Let w_1, w_2, \ldots be the sequence produced by FTRL. Then \forall u \in S,

\sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})].
From batch to online learning (cont.)

Let's assume the losses f_t are convex:

\forall w, u \in S: f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w).

Use the linearisation \tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle as a surrogate:

\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t).
From batch to online learning (cont.)

Online Gradient Descent (OGD):

w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t).    (5)

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (it upper bounds the LHS);
- use the L2 regularizer \varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2;
- apply the FTRL regret bound to the surrogate loss:

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} \langle w, g_i \rangle + \varphi(w).

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions f_1, f_2, \ldots, f_T with regularizer \varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2. Then for all u \in S,

R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t).
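A quick numerical check of the regret bound (our toy losses and comparator, chosen for illustration):

import numpy as np

rng = np.random.default_rng(3)
eta, T, D = 0.05, 400, 2
targets = rng.normal(size=(T, D))  # f_t(w) = 0.5 * ||w - targets[t]||^2

w = np.zeros(D)
w1 = w.copy()
u = targets.mean(axis=0)           # a natural comparator for these losses
regret, grad_sq = 0.0, 0.0
for t in range(T):
    g = w - targets[t]
    regret += 0.5 * np.sum((w - targets[t]) ** 2) - 0.5 * np.sum((u - targets[t]) ** 2)
    grad_sq += np.sum(g ** 2)
    w = w - eta * g
bound = np.sum((u - w1) ** 2) / (2 * eta) + eta * grad_sq
print(regret <= bound)             # True on this run, as the theorem guarantees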
From batch to online learning (cont.)

Keep the regret bound above in mind for the later discussion of adaGrad:

- I want \varphi to be strongly convex w.r.t. a (semi-)norm \|\cdot\|;
- then the L2 norm in the bound is changed to the dual norm \|\cdot\|_*;
- and we use Hölder's inequality in the proof.
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms \psi_t: \psi_t is a strongly-convex function, and

B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle.

Primal-dual sub-gradient update:

w_{t+1} = \arg\min_{w} \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w).    (6)

Proximal gradient / composite mirror descent:

w_{t+1} = \arg\min_{w} \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t).    (7)
adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

G_t = \sum_{i=1}^{t} g_i g_i^\top, \quad H_t = \delta I + \mathrm{diag}(G_t^{1/2}),

w_{t+1} = \arg\min_{w \in S} \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^\top H_t w,    (8)

w_{t+1} = \arg\min_{w \in S} \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^\top H_t (w - w_t).    (9)

To get adaGrad from online gradient descent: add a proximal term or a Bregman divergence to the objective, and adapt the proximal term through time:

\psi_t(w) = \frac{1}{2} w^\top H_t w.
adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on S = R^D with \varphi(w) = 0:

w_{t+1} = \arg\min_{w} \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^\top [\delta I + \mathrm{diag}(G_t^{1/2})] (w - w_t)

\Rightarrow w_{t+1} = w_t - \frac{\eta g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}},    (10)

with the square root and division taken elementwise.
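A per-step sketch of update (10) (our code; eta and delta are illustrative constants):

import numpy as np

def adagrad_step(w, g, G_diag, eta=0.5, delta=1e-8):
    # accumulate per-coordinate squared gradients: sum_{i<=t} g_i^2
    G_diag = G_diag + g ** 2
    # update (10): scale eta elementwise by delta + sqrt(accumulated squares)
    return w - eta * g / (delta + np.sqrt(G_diag)), G_diag

# toy usage on C(w) = 0.5 * ||w||^2, whose gradient at w is w itself
w, G = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, G = adagrad_step(w, w, G)
print(w)  # close to the minimiser at the origin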
adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:

- I want the contribution of \|g_t\|_*^2 induced by \psi_t to be upper bounded by \|g_t\|_2;
- define \|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \cdot \rangle};
- \psi_t(w) = \frac{1}{2} w^\top H_t w is strongly convex w.r.t. \|\cdot\|_{\psi_t};
- a little math can show that

\sum_{t=1}^{T} \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2.
adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence {w_t} is generated by adaGrad using the primal-dual update (8) with \delta \ge \max_t \|g_t\|_\infty. Then for any u \in S,

R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2.

For w_t generated using the composite mirror descent update (9),

R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2.
Improving adaGrad

Possible problems of adaGrad:

- the global learning rate \eta still needs hand tuning;
- it is sensitive to initial conditions.

Possible improving directions:

- use a truncated sum / running average;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods: RMSprop [Tieleman and Hinton, 2012], adaDelta [Zeiler, 2012], ADAM [Kingma and Ba, 2014].
Improving adaGrad (cont.)

RMSprop (direction: replace the ever-growing sum with a running average):

G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^\top, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2},

w_{t+1} = \arg\min_{w} \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^\top H_t (w - w_t)

\Rightarrow w_{t+1} = w_t - \frac{\eta g_t}{\mathrm{RMS}[g]_t}, \quad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t).    (11)
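A corresponding sketch of update (11) (ours; rho, eta, delta are illustrative):

import numpy as np

def rmsprop_step(w, g, G_diag, eta=1e-2, rho=0.9, delta=1e-8):
    # exponential moving average of squared gradients replaces adaGrad's sum
    G_diag = rho * G_diag + (1 - rho) * g ** 2
    rms_g = np.sqrt(delta + G_diag)        # RMS[g]_t, the diagonal of H_t
    return w - eta * g / rms_g, G_diag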
Improving adaGrad (cont.)

A second-order heuristic (direction: use approximate second-order information). For a Newton-like step,

\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}
\Rightarrow \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}
\Rightarrow \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}.
Improving adaGrad (cont.)

adaDelta (direction: approximate second-order information, with no global learning rate):

\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}},

w_{t+1} = \arg\min_{w} \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^\top H_t (w - w_t)

\Rightarrow w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} g_t.    (12)
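A sketch of update (12) (ours; rho and delta are illustrative). Note there is no global learning rate: the RMS of previous parameter updates plays that role:

import numpy as np

def adadelta_step(w, g, Eg2, Edw2, rho=0.95, delta=1e-6):
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2                      # running average of g^2
    dw = -(np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta)) * g  # RMS[dw]_{t-1} / RMS[g]_t * g
    Edw2 = rho * Edw2 + (1 - rho) * dw ** 2                   # running average of dw^2
    return w + dw, Eg2, Edw2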
Improving adaGrad (cont.)

ADAM (directions: add momentum, correct the bias of the running averages):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2,

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t},

w_{t+1} = w_t - \frac{\eta \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}.    (13)
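And a sketch of update (13) (ours; the constants are commonly used illustrative defaults):

import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
    m = beta1 * m + (1 - beta1) * g            # momentum-style first moment
    v = beta2 * v + (1 - beta2) * g ** 2       # running average of g^2
    m_hat = m / (1 - beta1 ** t)               # bias correction (t = 1, 2, ...)
    v_hat = v / (1 - beta2 ** t)
    return w - eta * m_hat / (np.sqrt(v_hat) + delta), m, v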
Demo time!

See the Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

Idea 3: learn the (global) learning rates as well. Unrolling the updates,

w_{t+1} = w_t - \eta g_t \Rightarrow w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1}),

so w_t is a function of \eta. Learn \eta (with gradient descent) by

\eta^* = \arg\min_{\eta} \sum_{s \le t} f_s(w(s, \eta)).

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015].
- Stochastic gradient descent to learn \eta on the fly [Masse and Ollivier, 2015].
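As a very loose stand-in (ours) for the cited approaches, one can already see the idea with a finite-difference gradient on \eta: treat the training loss after a fixed number of steps as a function of \eta and descend on it. Everything below (the toy objective, constants, and the train_loss helper) is hypothetical illustration, not the papers' method:

import numpy as np

def train_loss(eta, steps=100):
    # plain GD on C(s) = 0.5 * ||s - x_star||^2, returning the final loss
    x_star, s = np.array([2.0, -1.0]), np.zeros(2)
    for _ in range(steps):
        s = s - eta * (s - x_star)
    return 0.5 * np.sum((s - x_star) ** 2)

eta, meta_lr, h = 0.01, 1e-3, 1e-4
for _ in range(50):  # meta-descent on eta itself
    d_eta = (train_loss(eta + h) - train_loss(eta - h)) / (2 * h)
    eta = eta - meta_lr * d_eta
print(eta)  # eta grows toward a step size that trains faster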
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.
References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(1):281-305, 2012.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.
- Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
- Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
- Maclaurin, D., Duvenaud, D., and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P.-Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper)

Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$$

For $\{w_t\}$ generated using the composite mirror descent update (9),

$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$$
Improving adaGrad

Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]
Improving adaGrad (cont.)

RMSprop:

$$G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^\top, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \arg\min_{w} \;\eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^\top H_t (w - w_t)$$

$$\Rightarrow\quad w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$$

(improving direction used here: a running average in place of adaGrad's full sum)
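The same one-line step with adaGrad's cumulative sum swapped for the running average of (11); a sketch in which the conventional default rho = 0.9 is my assumption (the slide leaves rho unspecified):

import numpy as np

def rmsprop_step(w, g, avg_g2, eta=0.01, rho=0.9, delta=1e-8):
    """One RMSprop step, update (11): divide by a running RMS of the gradient."""
    avg_g2 = rho * avg_g2 + (1.0 - rho) * g ** 2   # diagonal of G_t
    w = w - eta * g / np.sqrt(delta + avg_g2)      # RMS[g]_t = sqrt(delta + avg of g^2)
    return w, avg_g2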
Improving adaGrad (cont.)

A second-order motivation (a Newton-style step with a diagonal Hessian approximation):

$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}
\;\;\Rightarrow\;\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}
\;\;\Rightarrow\;\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$
Improving adaGrad (cont.)

adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \arg\min_{w} \;\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^\top H_t (w - w_t)$$

$$\Rightarrow\quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12)$$

(improving direction used here: approximate second-order information, with no global learning rate $\eta$ left to tune)
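A sketch of update (12), taking RMS[x] = sqrt(running average of x^2 + delta) and default rho, delta values in the spirit of [Zeiler, 2012]; these defaults are my assumption, not stated on the slide:

import numpy as np

def adadelta_step(w, g, avg_g2, avg_dw2, rho=0.95, delta=1e-6):
    """One adaDelta step, update (12): RMS of past updates over RMS of gradients."""
    avg_g2 = rho * avg_g2 + (1.0 - rho) * g ** 2
    dw = -np.sqrt(avg_dw2 + delta) / np.sqrt(avg_g2 + delta) * g
    avg_dw2 = rho * avg_dw2 + (1.0 - rho) * dw ** 2
    return w + dw, avg_g2, avg_dw2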
Improving adaGrad (cont.)

ADAM:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \delta} \qquad (13)$$

(improving directions used here: momentum, plus bias correction of the running averages)
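A sketch of update (13); the default values match the conventions of [Kingma and Ba, 2014] but are my assumption here:

import numpy as np

def adam_step(w, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    """One ADAM step, update (13), with bias-corrected moment estimates."""
    m = beta1 * m + (1.0 - beta1) * g              # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * g ** 2         # second moment
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction, t >= 1
    v_hat = v / (1.0 - beta2 ** t)
    return w - eta * m_hat / (np.sqrt(v_hat) + delta), m, v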
Demo time

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
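In the same spirit as the demo, a tiny usage example running one of the sketches above (adagrad_step, defined earlier) on the simplest of those test functions, the sphere function f(w) = ||w||^2; starting point and step count are arbitrary:

import numpy as np

grad = lambda w: 2.0 * w                  # gradient of the sphere function
w, sum_g2 = np.array([1.5, -2.0]), np.zeros(2)
for t in range(500):
    w, sum_g2 = adagrad_step(w, grad(w), sum_g2)
print(w)                                  # -> close to the minimiser (0, 0)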
Learn the learning rates

Idea 3: learn the (global) learning rates as well.

$$w_{t+1} = w_t - \eta g_t \;\;\Rightarrow\;\; w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1})$$

$w_t$ is a function of $\eta$, so learn $\eta$ (with gradient descent) by

$$\eta^* = \arg\min_{\eta} \sum_{s \le t} f_s(w(s, \eta))$$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
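To show what the objective means, a brute-force sketch of it (this is direct search over a few candidate values, not the reverse-mode method of [Maclaurin et al., 2015] nor the online method of [Masse and Ollivier, 2015]; the candidate grid is hypothetical):

import numpy as np

def unrolled_loss(eta, losses, grads, w1):
    """Total loss sum_{s<=t} f_s(w(s, eta)) along the unrolled SGD trajectory."""
    w, total = w1.copy(), 0.0
    for loss, grad in zip(losses, grads):
        total += loss(w)
        w = w - eta * grad(w)
    return total

def best_eta(losses, grads, w1, candidates=(1e-3, 1e-2, 1e-1, 1.0)):
    """eta* = argmin_eta sum_{s<=t} f_s(w(s, eta)), by direct search."""
    return min(candidates, key=lambda eta: unrolled_loss(eta, losses, grads, w1))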
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.
References

- Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T., Hinton, G. Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D., Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
A brief history of numerical analysis
1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available
5 of 27
A brief history of numerical analysis
1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available
5 of 27
A brief history of numerical analysis
1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available
5 of 27
A brief history of numerical analysis
1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available
5 of 27
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo
Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator
7 of 27
A brief history of numerical analysis
Present dayStochastic approximation widely researched and used in practice
I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent
I Neural networksI K-MeansI
I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu
amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))
TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)
8 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Exact solution of gradient descent ODE
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false
Continuous Gradient Descent

We'll start with some simplifying assumptions:

1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$

(The second condition is weaker than convexity.)

Proposition: If $s : [0, \infty) \to X$ satisfies the differential equation

$$\frac{ds(t)}{dt} = -\nabla C(s(t))$$

then $s(t) \to x^*$ as $t \to \infty$.

[Figure: example of a non-convex function satisfying these conditions]

13 of 27
Continuous Gradient Descent

Proof sketch:

1. Define Lyapunov function $h(t) = \|s(t) - x^*\|_2^2$
2. $h(t)$ is a decreasing function
3. $h(t)$ converges to a limit, and $h'(t)$ converges to 0
4. The limit of $h(t)$ must be 0
5. $s(t) \to x^*$

$h(t)$ is a decreasing function:

$$\frac{d}{dt}h(t) = \frac{d}{dt}\|s(t) - x^*\|_2^2 = \frac{d}{dt}\langle s(t) - x^*, s(t) - x^* \rangle = 2\left\langle \frac{ds(t)}{dt},\ s(t) - x^* \right\rangle = -2\,\langle s(t) - x^*, \nabla C(s(t)) \rangle \le 0$$

The limit of $h(t)$ must be 0: if not, there is some $K > 0$ such that $h(t) > K$ for all $t$. By assumption,

$$\inf_{x : \|x - x^*\|_2^2 > K} \langle x - x^*, \nabla C(x) \rangle > 0,$$

which means that $\sup_t h'(t) < 0$, contradicting the convergence of $h'(t)$ to 0.

14 of 27
Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!

15 of 27
Discrete Gradient Descent

An intuitive, straightforward discretisation of

$$\frac{d}{dt}s(t) = -\nabla C(s(t))$$

is

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n) \qquad \text{(forward Euler)}$$

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method, which is (A-)stable, is the implicit Euler method:

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})$$

So why settle for explicit discretisation? Implicit discretisation requires the solution of a non-linear system at each step, and practically, explicit discretisation works!

16 of 27
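To make the stability difference concrete, here is a small Python sketch (not from the slides; the step size 0.11 is an illustrative choice just above the explicit method's stability limit of 0.1 for this quadratic, where the implicit update reduces to a linear solve):

```python
import numpy as np

# For the quadratic example, grad C(x) = H x with this matrix
H = np.array([[12.0, 8.0], [8.0, 12.0]])

def explicit_euler_step(s, eps):
    # s_{n+1} = s_n - eps * grad C(s_n)
    return s - eps * H @ s

def implicit_euler_step(s, eps):
    # Solve s_{n+1} = s_n - eps * grad C(s_{n+1}),
    # i.e. (I + eps H) s_{n+1} = s_n; for a quadratic cost this is linear.
    return np.linalg.solve(np.eye(2) + eps * H, s)

s_exp = s_imp = np.array([1.5, 0.5])
eps = 0.11  # eigenvalues of H are 20 and 4; explicit Euler needs eps < 2/20
for _ in range(50):
    s_exp = explicit_euler_step(s_exp, eps)
    s_imp = implicit_euler_step(s_imp, eps)
print(s_exp)  # explicit iterates oscillate and blow up
print(s_imp)  # implicit iterates still contract towards the minimiser (0, 0)
```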
Discrete Gradient Descent

We also have multistep methods, e.g.

$$s_{n+2} = s_{n+1} - \varepsilon_{n+2}\left(\frac{3}{2}\nabla C(s_{n+1}) - \frac{1}{2}\nabla C(s_n)\right)$$

This is the 2nd-order Adams-Bashforth method.

It has similar (in)stability properties to forward Euler, but a discretisation error of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).

17 of 27
Discrete Gradient Descent

[Animation: examples of discretisation with ε = 0.1]

18 of 27

Discrete Gradient Descent

[Animation: examples of discretisation with ε = 0.07]

19 of 27

Discrete Gradient Descent

[Animation: examples of discretisation with ε = 0.01]

20 of 27
Discrete Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies:

1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$
3. $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.

21 of 27
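As an illustrative sketch (not from the slides), the classic schedule $\varepsilon_n = 1/n$ satisfies both step-size conditions, and the resulting iteration drives the earlier quadratic example to its minimiser:

```python
import numpy as np

def grad_C(x):  # gradient of the earlier quadratic example
    return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

s = np.array([1.5, 0.5])
for n in range(1, 10001):
    eps_n = 1.0 / n   # satisfies sum eps_n = inf and sum eps_n^2 < inf
    s = s - eps_n * grad_C(s)
print(s)  # converges to the unique minimiser x* = (0, 0)
```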
Discrete Gradient Descent

Proof sketch:

1. Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^*$

Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$. Because of the error that the discretisation introduces, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. [Illustration credit: Online Learning and Stochastic Approximations, Leon Bottou (1998)]

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. With some algebra, note that

$$\begin{aligned}
h_{n+1} - h_n &= \langle s_{n+1} - x^*, s_{n+1} - x^* \rangle - \langle s_n - x^*, s_n - x^* \rangle \\
&= \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2\,\langle s_{n+1} - s_n, x^* \rangle \\
&= \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2\varepsilon_n \langle \nabla C(s_n), x^* \rangle \\
&= -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2
\end{aligned}$$

Show that $\sum_{n=1}^\infty h_n^+ < \infty$: assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get

$$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2(A + Bh_n)$$
$$\implies h_{n+1} - (1 + \varepsilon_n^2 B)h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$$

Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$, we get

$$h_{n+1}' - h_n' \le \varepsilon_n^2 \mu_n A$$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable.

Show that this implies that $h_n$ converges: if $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why? $h_n \ge 0$, so the 'down' movements cannot sum to $-\infty$ while the 'up' movements are bounded). But

$$h_{n+1} = h_0 + \sum_{k=1}^n \big( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \big)$$

so $(h_n)_{n=1}^\infty$ converges.

Show that $h_n$ must converge to 0: assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0,$$

contradicting $h_n$ converging away from 0.

22 of 27
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N}\sum_{n=1}^N f_n(x)$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K}\sum_{k \in I_K} f_k(x) \qquad (I_K \sim \text{Unif}(\text{subsets of size } K))$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

23 of 27
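A minimal Python sketch of this subsampled estimator (the least-squares problem, constants and schedule are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: nabla C(w) is the average of N per-datum gradients
N, K = 1000, 32
X = rng.normal(size=(N, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=N)

w = np.zeros(2)
for t in range(1, 5001):
    idx = rng.choice(N, size=K, replace=False)  # I_K ~ Unif(subsets of size K)
    Xb, yb = X[idx], y[idx]
    g = 2 * Xb.T @ (Xb @ w - yb) / K            # unbiased estimate of nabla C(w)
    w = w - (1.0 / t) * g                       # Robbins-Monro step sizes
print(w)  # close to the true weights (1, -2)
```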
Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \,|\, m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if

I $\mathbb{E}[|X_n|] < \infty$ for all $n$
I $\mathbb{E}[X_n \,|\, \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\big[ (\mathbb{E}[X_{n+1} \,|\, \mathcal{F}_n] - X_n)\,\mathbf{1}_{\mathbb{E}[X_{n+1}|\mathcal{F}_n] - X_n > 0} \big] < \infty$$

then $X_n \to X_\infty$ almost surely.

³on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \,|\, m \le n)$ $\forall n$

24 of 27
Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies:

1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

25 of 27
Stochastic Gradient Descent

Proof sketch:

1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$$

So

$$\mathbb{E}[h_{n+1} - h_n \,|\, \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2\, \mathbb{E}\big[\|H_n(s_n)\|_2^2 \,\big|\, \mathcal{F}_n\big]$$

Show that $h_n$ converges almost surely: exactly the same as for discrete gradient descent.

I Assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$
I Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$

We get

$$\mathbb{E}[h_{n+1}' - h_n' \,|\, \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A
\implies \mathbb{E}\big[(h_{n+1}' - h_n')\,\mathbf{1}_{\mathbb{E}[h_{n+1}' - h_n' | \mathcal{F}_n] > 0} \,\big|\, \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A$$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h_n')_{n=1}^\infty$ converges a.s. $\implies$ $(h_n)_{n=1}^\infty$ converges a.s.

Show that $h_n$ must converge to 0 almost surely: from previous calculations,

$$\mathbb{E}\big[h_{n+1} - (1 + \varepsilon_n^2 B)h_n \,\big|\, \mathcal{F}_n\big] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A$$

$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the right term is summable a.s., and hence so is the remaining term:

$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty \quad \text{almost surely}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely.}$$

And by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

26 of 27
Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above is to assume:

I $C$ is a non-negative, three-times differentiable function
I The Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
I The gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$
I There exists $D > 0$ such that $\inf_{\|x\|_2^2 > D} \langle x, \nabla C(x) \rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.

27 of 27
Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. So how do we choose learning rates to achieve fast convergence and a better local optimum?

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 1 / 27
Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule:

I grid search, random search, cross validation
I Bayesian optimization

But I'm lazy, and I don't want to spend too much time!

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 2 / 27
Motivation (cont.)

idea 2: let the learning rates adapt themselves:

I pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
I adaGrad [Duchi et al., 2010]
I adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
I ADAM [Kingma and Ba, 2014]
I Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

I a good tutorial [Shalev-Shwartz, 2012]
I a milestone paper [Zinkevich, 2003]

and some practical comparisons.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 3 / 27
Online learning

From batch to online learning

Batch learning often assumes:

I we've got the whole dataset $\mathcal{D}$
I the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by:

I processing one datapoint / a mini-batch of datapoints each iteration
I considering gradients of the per-datapoint cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 4 / 27
Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

I each iteration $t$, we receive a loss $f_t(w)$
I and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
I then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by

$$g_t = \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 5 / 27
Online learning

From batch to online learning (cont.)

[Figure]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 6 / 27
Online learning

From batch to online learning (cont.)

Online learning assumes:

I each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
I then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \qquad (1)$$

General definition: $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 7 / 27
Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) \qquad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,

$$R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})] \qquad (3)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 8 / 27
Online learning

From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$, and let the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled!

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 9 / 27
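A quick simulation of this adversarial game (an illustrative sketch, not from the slides; the horizon T = 20 and the tie-breaking choice at t = 1 are assumptions):

```python
import numpy as np

# Loss at time t (the adversarial sequence from the slide)
def f(t, w):
    if t == 1:
        return -0.5 * w
    return w if t % 2 == 0 else -w

# FTL on S = [-1, 1]: play the minimiser of the cumulative past loss.
# All losses are linear in w, so tracking their summed slope suffices.
T, cum, slope = 20, 0.0, 0.0
for t in range(1, T + 1):
    # argmin over [-1, 1] of slope * w is the endpoint opposite the slope's sign
    w_t = -np.sign(slope) if slope != 0 else -1.0
    cum += f(t, w_t)
    slope += -0.5 if t == 1 else (1.0 if t % 2 == 0 else -1.0)
print(cum)  # FTL's loss grows like T, while the fixed choice w = 0 incurs 0
```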
Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,

$$\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 10 / 27
Online learning

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)$$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g_t \rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 11 / 27
Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \qquad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:

I use the linearisation as a surrogate loss (it upper bounds the LHS)
I use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
I apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 12 / 27
Online learning

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,

$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta\sum_{t=1}^T \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 12 / 27
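A minimal sketch of the OGD protocol under these assumptions (the loss stream, projection and constants are illustrative, not from the slides):

```python
import numpy as np

def ogd(sub_grad, w1, eta, T, project=lambda w: w):
    """Online gradient descent: w_{t+1} = Proj_S(w_t - eta * g_t)."""
    w = np.asarray(w1, dtype=float)
    for t in range(1, T + 1):
        g = sub_grad(t, w)          # receive a (sub-)gradient of f_t at w_t
        w = project(w - eta * g)    # gradient step, then project back onto S
    return w

# Example: losses f_t(w) = |w - z_t| for a stream z_t, with S = [-1, 1]
rng = np.random.default_rng(0)
z = rng.uniform(-0.5, 0.5, size=1000)
w_final = ogd(sub_grad=lambda t, w: np.sign(w - z[t - 1]),
              w1=[0.9], eta=0.02, T=1000,
              project=lambda w: np.clip(w, -1.0, 1.0))
print(w_final)  # drifts towards the median of the z_t stream, near 0
```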
Online learning

From batch to online learning (cont.)

Keep in mind for the later discussion of adaGrad: if we instead want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$, then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_*$, and we use Hölder's inequality in the proof.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 13 / 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t \rangle$$

Primal-dual sub-gradient update:

$$w_{t+1} = \arg\min_w \ \eta\langle w, g_t \rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \arg\min_w \ \eta\langle w, g_t \rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 14 / 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \text{diag}(G_t^{1/2})$$

$$w_{t+1} = \arg\min_{w \in S} \ \eta\langle w, g_t \rangle + \eta\varphi(w) + \frac{1}{2t}w^T H_t w \qquad (8)$$

$$w_{t+1} = \arg\min_{w \in S} \ \eta\langle w, g_t \rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:

I add a proximal term or a Bregman divergence to the objective
I adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2}w^T H_t w$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 15 / 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \arg\min_w \ \eta\langle w, g_t \rangle + \frac{1}{2}(w - w_t)^T \big[\delta I + \text{diag}(G_t^{1/2})\big](w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \qquad (10)$$

(with the division and square root taken elementwise)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 16 / 27
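A minimal sketch of the diagonal update (10) in Python (the helper name and the toy loss are hypothetical; this is not the authors' implementation):

```python
import numpy as np

def adagrad_update(w, g, G_diag, eta=0.1, delta=1e-8):
    """One adaGrad step, following update (10).

    G_diag accumulates the per-coordinate sum of squared gradients."""
    G_diag = G_diag + g * g
    w_new = w - eta * g / (delta + np.sqrt(G_diag))
    return w_new, G_diag

# Usage on a toy quadratic f(w) = 0.5 * ||w||^2 (gradient g = w)
w, G = np.array([1.0, -2.0]), np.zeros(2)
for t in range(500):
    w, G = adagrad_update(w, w, G)
print(w)  # coordinates shrink towards 0, each with its own effective rate
```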
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:

I we want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$
I define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot \rangle}$
I $\psi_t(w) = \frac{1}{2}w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
I a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2\sum_{d=1}^D \|g_{1:T,d}\|_2$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 17 / 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$$

For $w_t$ generated using the composite mirror descent update (9),

$$R_T(u) \le \frac{1}{2\eta}\max_{t \le T}\|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 18 / 27
Advanced adaptive learning rates

Improving adaGrad

Possible problems of adaGrad:

I the global learning rate $\eta$ still needs hand tuning
I it is sensitive to initial conditions

Possible improving directions:

I use a truncated sum / running average
I use (approximate) second-order information
I add momentum
I correct the bias of the running average

Existing methods:

I RMSprop [Tieleman and Hinton, 2012]
I adaDelta [Zeiler, 2012]
I ADAM [Kingma and Ba, 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 19 / 27
Advanced adaptive learning rates

Improving adaGrad (cont.)

RMSprop (a running average in place of adaGrad's growing sum):

$$G_t = \rho G_{t-1} + (1 - \rho)g_t g_t^T, \qquad H_t = (\delta I + \text{diag}(G_t))^{1/2}$$

$$w_{t+1} = \arg\min_w \ \eta\langle w, g_t \rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\text{RMS}[g]_t}, \qquad \text{RMS}[g]_t = \text{diag}(H_t) \qquad (11)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 20 / 27
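A corresponding sketch of update (11) (the helper name is hypothetical, and the defaults are common illustrative choices):

```python
import numpy as np

def rmsprop_update(w, g, G_diag, eta=0.01, rho=0.9, delta=1e-8):
    """One RMSprop step, per update (11): divide by a running RMS of gradients."""
    G_diag = rho * G_diag + (1 - rho) * g * g  # exponential moving average of g^2
    w_new = w - eta * g / np.sqrt(delta + G_diag)
    return w_new, G_diag
```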
Advanced adaptive learning rates

Improving adaGrad (cont.)

A second-order heuristic (bringing in approximate second-order information): for a diagonal Newton-like step,

$$\Delta w \propto H^{-1}g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}
\quad\Rightarrow\quad \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}
\quad\Rightarrow\quad \text{diag}(H) \approx \frac{\text{RMS}[g]_t}{\text{RMS}[\Delta w]_{t-1}}$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 21 / 27
Advanced adaptive learning rates

Improving adaGrad (cont.)

adaDelta:

$$\text{diag}(H_t) = \frac{\text{RMS}[g]_t}{\text{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \arg\min_w \ \langle w, g_t \rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\text{RMS}[\Delta w]_{t-1}}{\text{RMS}[g]_t}\, g_t \qquad (12)$$

(Note there is no global learning rate $\eta$ left to tune.)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 21 / 27
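A sketch of update (12), keeping running averages of both $g^2$ and $\Delta w^2$ (the helper name is hypothetical; the constants follow common defaults, which is an assumption):

```python
import numpy as np

def adadelta_update(w, g, Eg2, Edw2, rho=0.95, delta=1e-6):
    """One adaDelta step, per update (12): a ratio of running RMS values
    replaces the global learning rate."""
    Eg2 = rho * Eg2 + (1 - rho) * g * g                # running average of g^2
    step = np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
    Edw2 = rho * Edw2 + (1 - rho) * step * step        # running average of dw^2
    return w - step, Eg2, Edw2
```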
Advanced adaptive learning rates

Improving adaGrad (cont.)

ADAM (momentum plus bias-corrected running averages):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 22 / 27
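A sketch of update (13) (the helper name is hypothetical; the defaults follow the ADAM paper's suggested values):

```python
import numpy as np

def adam_update(w, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    """One ADAM step, per update (13), with bias-corrected moment estimates."""
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g      # second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    return w - eta * m_hat / (np.sqrt(v_hat) + delta), m, v
```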
Advanced adaptive learning rates

Demo time!

Wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 23 / 27
Advanced adaptive learning rates

Learn the learning rates

idea 3: learn the (global) learning rates as well:

$$w_{t+1} = w_t - \eta g_t \quad\Rightarrow\quad w_t = w(t, \eta;\ w_1, \{g_i\}_{i \le t-1})$$

$w_t$ is a function of $\eta$, so learn $\eta$ (with gradient descent) by

$$\eta = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))$$

I Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
I stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 24 / 27
Advanced adaptive learning rates

Summary

I Recent successes of machine learning rely on stochastic approximation and optimisation methods
I Theoretical analysis guarantees good behaviour
I Adaptive learning rates provide good performance in practice
I Future work will reduce labour on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015 25 / 27
Advanced adaptive learning rates

References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.

Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.

Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.

Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26, 2015
A brief history of numerical analysis
1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available
5 of 27
A brief history of numerical analysis
1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available
5 of 27
A brief history of numerical analysis
1951 Robbins and Monro publish ldquoA Stochastic ApproximationAlgorithmrdquo describing how to find the root of an increasing functionf Rrarr R when only noisy estimates of the functionrsquos value at a givenpoint are available
5 of 27
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo
Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator
7 of 27
A brief history of numerical analysis
Present dayStochastic approximation widely researched and used in practice
I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent
I Neural networksI K-MeansI
I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu
amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))
TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)
8 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Exact solution of gradient descent ODE
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
d
dth(t) =
d
dts(t)minus x2
2
=d
dt〈s(t)minus x s(t)minus x〉
= 2
langds(t)
dt s(t)minus x
rang= minus〈s(t)minus xnablaC (s(t))〉 le 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption
infx h(x)gtK
〈x minus xnablaC (x)〉 gt 0
which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)
to 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.
$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time n", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if
- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and
$$ \sum_{n=1}^\infty \mathbb{E}\Big[\big(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n\big)\,\mathbf{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}}\Big] < \infty, $$
then $X_n \to X_\infty$ almost surely.

³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.
Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and C satisfies
1. C has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x)\rangle > 0$
3. $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of n
then, subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement give the control we need over the unbiased gradient estimators.
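A runnable sketch of the proposition's recursion (my own, in Python), with $\varepsilon_n = 1/n$ so that the Robbins-Monro conditions hold:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 1000, 5, 32
X = rng.normal(size=(N, D))
x_star = rng.normal(size=D)
y = X @ x_star  # noiseless targets: the cost's unique minimiser is x_star

def H(s):
    # Unbiased gradient estimate of the least-squares cost from a subsample
    idx = rng.choice(N, size=K, replace=False)
    return X[idx].T @ (X[idx] @ s - y[idx]) / K

s = np.zeros(D)
for n in range(1, 20001):
    eps_n = 1.0 / n               # sum eps_n = inf, sum eps_n^2 < inf
    s = s - eps_n * H(s)

print(np.linalg.norm(s - x_star))  # s_n approaches x*, i.e. h_n shrinks toward 0
```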
Stochastic Gradient Descent

1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$
Stochastic Gradient Descent

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$:
By exactly the same calculation as in the deterministic discrete case,
$$ h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n)\rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2. $$
So
$$ \mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\, \mathbb{E}\big[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\big]. $$
Stochastic Gradient Descent

Show that $h_n$ converges almost surely:
Exactly the same as for discrete gradient descent.
- Assume $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B\|x - x^*\|_2^2$
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

Get
$$ \mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A $$
$$ \implies\; \mathbb{E}\big[(h'_{n+1} - h'_n)\,\mathbf{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A. $$
$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h'_n)_{n=1}^\infty$ converges a.s. $\implies$ $(h_n)_{n=1}^\infty$ converges a.s.
Stochastic Gradient Descent

Show that $h_n$ must converge to 0 almost surely:
From previous calculations,
$$ \mathbb{E}\big[h_{n+1} - (1 + \varepsilon_n^2 B)\, h_n \mid \mathcal{F}_n\big] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A. $$
$(h_n)_{n=1}^\infty$ converges, so the sequence on the left is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the $\varepsilon_n^2 A$ term is summable a.s., and hence the inner-product term is also summable a.s.:
$$ \sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle < \infty \quad \text{almost surely.} $$
If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces
$$ \langle s_n - x^*, \nabla C(s_n)\rangle \to 0 \quad \text{almost surely.} $$
And by our initial assumptions about C, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.
Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume
- C is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- the gradient estimate H satisfies $\mathbb{E}\big[\|H_n(x)\|^k\big] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0$

The idea for the proof is then:
1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
Motivation

So far we've told you why SGD is "safe" :)
but Robbins-Monro is just a sufficient condition.
So how do we choose learning rates to achieve
- fast convergence?
- a better local optimum?
Motivation (cont.)

[Figure from [Bergstra and Bengio, 2012]]

idea 1: search for the best learning rate schedule
- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time...
Motivation (cont.)

idea 2: let the learning rates adapt themselves
- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning
- a good tutorial: [Shalev-Shwartz, 2012]
- a milestone paper: [Zinkevich, 2003]

and some practical comparisons
Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function is $C(w; \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by
- processing one datapoint / a mini-batch of datapoints each iteration
- considering gradients with data-point cost functions:
$$ \nabla C(w; \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m) $$
Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):
- each iteration t, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by
$$ g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m). $$
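A bare-bones OGD loop in Python (my own toy setting: streaming squared-error losses, with comparator $u = w^*$):

```python
import numpy as np

rng = np.random.default_rng(2)
D, eta, T = 5, 0.05, 5000
w, w_star = np.zeros(D), rng.normal(size=D)
regret = 0.0

for t in range(1, T + 1):
    x = rng.normal(size=D)                     # round-t example arrives
    g = (x @ w - x @ w_star) * x               # g_t, a gradient of f_t at w_t
    regret += 0.5 * (x @ w - x @ w_star) ** 2  # f_t(w_t) - f_t(w_star); f_t(w_star) = 0
    w = w - eta * g                            # OGD update

print(regret / T)  # average regret per round stays small
```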
Online learning

From batch to online learning (cont.)

Online learning assumes:
- each iteration t, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:
$$ R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \qquad (1) $$
General definition: $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.
Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):
$$ w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t f_i(w) \qquad (2) $$

Lemma (upper bound of regret):
Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,
$$ R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]. \qquad (3) $$
Online learning

From batch to online learning (cont.)

A game that fools FTL:
Let $w \in S = [-1, 1]$ and the loss function at time t be
$$ f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd.} \end{cases} $$
FTL is easily fooled!
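A quick numerical check of this game (my own simulation in Python): FTL's cumulative loss grows linearly in T, while any fixed comparator pays at most 0.5 in absolute value.

```python
import numpy as np

def slope(t):
    # per-round linear loss f_t(w) = slope(t) * w on S = [-1, 1]
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T, w, cum_slope, ftl_loss = 100, 0.0, 0.0, 0.0
for t in range(1, T + 1):
    ftl_loss += slope(t) * w          # FTL pays f_t(w_t)
    cum_slope += slope(t)
    w = -np.sign(cum_slope)           # argmin over [-1,1] of the cumulative loss

print(ftl_loss)  # 99.0: linear in T, while any fixed u pays cum_slope * u in [-0.5, 0.5]
```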
Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):
$$ w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4) $$

Lemma (upper bound of regret):
Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,
$$ \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]. $$
Online learning

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:
$$ \forall w, u \in S,\quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle \quad \forall g(w) \in \partial f_t(w). $$
Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g_t\rangle$ as a surrogate:
$$ \sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t\rangle - \langle u, g_t\rangle, \qquad \forall g_t \in \partial f_t(w_t). $$
Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):
$$ w_{t+1} = w_t - \eta g_t, \qquad g_t \in \partial f_t(w_t) \qquad (5) $$

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (it upper bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:
$$ w_{t+1} = \operatorname*{argmin}_{w \in S} \sum_{i=1}^t \langle w, g_i\rangle + \varphi(w) $$
Online learning

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent):
Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,
$$ R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t). $$
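A short worked consequence (my own arithmetic, not from the slides): with $\|u - w_1\|_2 \le D$ and $\|g_t\|_2 \le G$, tuning $\eta$ to balance the two terms gives the classic $O(\sqrt{T})$ regret:
$$ R_T(u) \le \frac{D^2}{2\eta} + \eta T G^2, \qquad \eta^* = \frac{D}{G\sqrt{2T}} \;\Rightarrow\; R_T(u) \le D G \sqrt{2T} = O(\sqrt{T}). $$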
Online learning

From batch to online learning (cont.)

Keep in mind for the later discussion of adaGrad:
- I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$
- then the L2 norm in the bound is changed to the dual norm $\|\cdot\|_*$
- and we use Hölder's inequality in the proof
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$:
$\psi_t$ is a strongly-convex function, and
$$ B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle. $$

Primal-dual sub-gradient update:
$$ w_{t+1} = \operatorname*{argmin}_w\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \qquad (6) $$

Proximal gradient / composite mirror descent:
$$ w_{t+1} = \operatorname*{argmin}_w\; \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \qquad (7) $$
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:
$$ G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2}) $$
$$ w_{t+1} = \operatorname*{argmin}_{w \in S}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t}\, w^T H_t w \qquad (8) $$
$$ w_{t+1} = \operatorname*{argmin}_{w \in S}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \qquad (9) $$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:
$$ \psi_t(w) = \frac{1}{2}\, w^T H_t w $$
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:
$$ w_{t+1} = \operatorname*{argmin}_w\; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T \big[\delta I + \mathrm{diag}(G_t)^{1/2}\big](w - w_t) $$
$$ \Rightarrow\; w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \quad \text{(elementwise)} \qquad (10) $$
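A sketch of update (10) in Python, reusing the streaming least-squares losses from the earlier snippets (the toy problem and constants are my choices):

```python
import numpy as np

rng = np.random.default_rng(3)
D, eta, delta = 5, 0.5, 1e-8
w, w_star = np.zeros(D), rng.normal(size=D)
sum_sq = np.zeros(D)  # running sum of squared gradients, diag(G_t)

for t in range(5000):
    x = rng.normal(size=D)
    g = (x @ (w - w_star)) * x                # gradient of the round-t loss
    sum_sq += g ** 2
    w -= eta * g / (delta + np.sqrt(sum_sq))  # adaGrad update, eq. (10)

print(np.linalg.norm(w - w_star))
```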
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:
- I want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot\rangle}$
- $\psi_t(w) = \frac{1}{2}\, w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that
$$ \sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2 $$
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper):
Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,
$$ R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2. $$
For $w_t$ generated using the composite mirror descent update (9),
$$ R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2. $$
Advanced adaptive learning rates

Improving adaGrad

Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]
Advanced adaptive learning rates

Improving adaGrad (cont.)

RMSprop:
$$ G_t = \rho G_{t-1} + (1-\rho)\, g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2} $$
$$ w_{t+1} = \operatorname*{argmin}_w\; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t) $$
$$ \Rightarrow\; w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11) $$

Improving direction used here: a truncated sum / running average of squared gradients.
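A minimal RMSprop step in Python mirroring eq. (11); the value of $\rho$ and the toy loss are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(4)
D, eta, rho, delta = 5, 0.01, 0.9, 1e-8
w, w_star = np.zeros(D), rng.normal(size=D)
G = np.zeros(D)  # exponential moving average of squared gradients

for t in range(5000):
    x = rng.normal(size=D)
    g = (x @ (w - w_star)) * x
    G = rho * G + (1 - rho) * g ** 2     # running average, replaces adaGrad's full sum
    w -= eta * g / np.sqrt(delta + G)    # divide by RMS[g]_t, eq. (11)

print(np.linalg.norm(w - w_star))
```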
Advanced adaptive learning rates

Improving adaGrad (cont.)

$$ \Delta w \propto H^{-1} g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2} \;\Rightarrow\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w} \;\Rightarrow\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}} $$

Improving direction used here: (approximate) second-order information.
Advanced adaptive learning rates

Improving adaGrad (cont.)

adaDelta:
$$ \mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}} $$
$$ w_{t+1} = \operatorname*{argmin}_w\; \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t) $$
$$ \Rightarrow\; w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12) $$
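A sketch of eq. (12) in Python; note that adaDelta needs no global $\eta$. The decay $\rho$ and the $\varepsilon$ inside the RMS terms are the usual choices from [Zeiler, 2012], stated here from memory:

```python
import numpy as np

rng = np.random.default_rng(5)
D, rho, eps = 5, 0.95, 1e-6
w, w_star = np.zeros(D), rng.normal(size=D)
Eg2, Edw2 = np.zeros(D), np.zeros(D)  # running averages of g^2 and (delta w)^2

for t in range(20000):
    x = rng.normal(size=D)
    g = (x @ (w - w_star)) * x
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dw = -np.sqrt(Edw2 + eps) / np.sqrt(Eg2 + eps) * g   # eq. (12), elementwise
    Edw2 = rho * Edw2 + (1 - rho) * dw ** 2
    w += dw

print(np.linalg.norm(w - w_star))
```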
Advanced adaptive learning rates

Improving adaGrad (cont.)

ADAM:
$$ m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 $$
$$ \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} $$
$$ w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13) $$

Improving directions used here: momentum and bias-corrected running averages.
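Finally, eq. (13) as code; $\beta_1 = 0.9$, $\beta_2 = 0.999$ are the paper's defaults, quoted from memory, and the toy loss is mine:

```python
import numpy as np

rng = np.random.default_rng(6)
D, eta, b1, b2, delta = 5, 0.01, 0.9, 0.999, 1e-8
w, w_star = np.zeros(D), rng.normal(size=D)
m, v = np.zeros(D), np.zeros(D)

for t in range(1, 5001):
    x = rng.normal(size=D)
    g = (x @ (w - w_star)) * x
    m = b1 * m + (1 - b1) * g              # first-moment running average (momentum)
    v = b2 * v + (1 - b2) * g ** 2         # second-moment running average
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + delta)  # ADAM update, eq. (13)

print(np.linalg.norm(w - w_star))
```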
Advanced adaptive learning rates

Demo time

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Advanced adaptive learning rates

Learn the learning rates

idea 3: learn the (global) learning rates as well
$$ w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1}) $$
- $w_t$ is a function of $\eta$
- learn $\eta$ (with gradient descent) by
$$ \eta^* = \operatorname*{argmin}_\eta \sum_{s \le t} f_s(w(s, \eta)) $$
- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
Advanced adaptive learning rates

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods
- Theoretical analysis guarantees good behaviour
- Adaptive learning rates provide good performance in practice
- Future work will reduce the labour of tuning hyper-parameters
Advanced adaptive learning rates

References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281-305, 2012.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011.
Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
Maclaurin, D., Duvenaud, D., and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
A brief history of numerical analysis

1952: Motivated by the Robbins-Monro paper the year before, Kiefer and Wolfowitz publish "Stochastic Estimation of the Maximisation of a Regression Function".

Their algorithm is phrased as a maximisation procedure, in contrast to the Robbins-Monro paper, and uses central-difference approximations of the gradient to update the optimum estimator.
A brief history of numerical analysis

Present day: stochastic approximation is widely researched and used in practice.

- Original applications in root-finding and optimisation, often in the guise of stochastic gradient descent:
  - neural networks
  - K-Means
  - ...
- Related ideas are used in:
  - stochastic variational inference (Hoffman et al. (2013))
  - pseudo-marginal Metropolis-Hastings (Beaumont (2003), Andrieu & Roberts (2009))
  - Stochastic Gradient Langevin Dynamics (Welling & Teh (2011))

Textbooks:
Stochastic Approximation and Recursive Algorithms and Applications, Kushner & Yin (2003)
Introduction to Stochastic Search and Optimization, Spall (2003)
Modern-day formulation

Stochastic approximation is now a mature area of numerical analysis, and the general problem it seeks to solve has the following form:

Minimise f(w) = E[F(w, ξ)]

(over w in some domain W, and some random variable ξ).

This is an immensely flexible framework:

- F(w, ξ) = f(w) + ξ models experimental / measurement error
- F(w, ξ) = (w^T φ(ξ_X) − ξ_Y)² corresponds to (least squares) linear regression
- if f(w) = (1/N) Σ_{i=1}^N g(w, x_i) and F(w, ξ) = (1/K) Σ_{x∈ξ} g(w, x), with ξ a randomly-selected subset (of size K) of a large data set, this corresponds to "stochastic" machine learning algorithms
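As a minimal sketch of that last point (the synthetic least-squares data and NumPy are our own illustrative choices, not from the slides), the subsampled quantity really is an unbiased estimate of the full gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 1000, 5, 32                      # data set size, dimension, subset size
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

def full_gradient(w):
    # gradient of f(w) = (1/N) sum_i (w^T x_i - y_i)^2
    return 2 * X.T @ (X @ w - y) / N

def subsampled_gradient(w):
    # F(w, xi) evaluated on a random size-K subset xi of the data
    idx = rng.choice(N, size=K, replace=False)
    return 2 * X[idx].T @ (X[idx] @ w - y[idx]) / K

w = rng.normal(size=D)
avg = np.mean([subsampled_gradient(w) for _ in range(20000)], axis=0)
print(np.max(np.abs(avg - full_gradient(w))))   # small: the estimator is unbiased
```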
Stochastic gradient descent

We'll focus on proof techniques for stochastic gradient descent (SGD).

We'll derive conditions for SGD to converge almost surely. We broadly follow the structure of Leon Bottou's paper (Online Learning and Stochastic Approximations, 1998).

Plan:
- Continuous gradient descent
- Discrete gradient descent
- Stochastic gradient descent
Continuous Gradient Descent

Let C: R^k → R be differentiable. How do solutions s: R → R^k to the following ODE behave?

d/dt s(t) = −∇C(s(t))

Example: C(x) = 5(x1 + x2)² + (x1 − x2)²

We can analytically solve the gradient descent equation for this example:

∇C(x) = (12x1 + 8x2, 8x1 + 12x2)

So (s1′, s2′) = −(12s1 + 8s2, 8s1 + 12s2), and solving with (s1(0), s2(0)) = (1.5, 0.5) gives

(s1(t), s2(t)) = (e^(−20t) + 0.5e^(−4t), e^(−20t) − 0.5e^(−4t))
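A quick numerical check of this example (a sketch; the crude small-step integrator is our own choice):

```python
import numpy as np

def grad_C(x):
    # gradient of C(x) = 5(x1 + x2)^2 + (x1 - x2)^2
    return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

def analytic(t):
    return np.array([np.exp(-20 * t) + 0.5 * np.exp(-4 * t),
                     np.exp(-20 * t) - 0.5 * np.exp(-4 * t)])

s, dt = np.array([1.5, 0.5]), 1e-4      # integrate ds/dt = -grad C(s) up to t = 1
for _ in range(int(1.0 / dt)):
    s = s - dt * grad_C(s)
print(s, analytic(1.0))                  # agree closely; both decay towards (0, 0)
```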
[Animation: exact solution of the gradient descent ODE]
Continuous Gradient Descent

We'll start with some simplifying assumptions:

1. C has a unique minimiser x*
2. ∀ε > 0: inf_{‖x − x*‖₂² > ε} ⟨x − x*, ∇C(x)⟩ > 0

(The second condition is weaker than convexity.)

Proposition: If s: [0, ∞) → X satisfies the differential equation

ds(t)/dt = −∇C(s(t))

then s(t) → x* as t → ∞.

[Figure: example of a non-convex function satisfying these conditions]
Continuous Gradient Descent

Proof sketch:
1. Define the Lyapunov function h(t) = ‖s(t) − x*‖₂²
2. h(t) is a decreasing function
3. h(t) converges to a limit, and h′(t) converges to 0
4. The limit of h(t) must be 0
5. s(t) → x*

h(t) is a decreasing function:

d/dt h(t) = d/dt ‖s(t) − x*‖₂²
         = d/dt ⟨s(t) − x*, s(t) − x*⟩
         = 2⟨ds(t)/dt, s(t) − x*⟩
         = −2⟨s(t) − x*, ∇C(s(t))⟩ ≤ 0

The limit of h(t) must be 0: if not, there is some K > 0 such that h(t) > K for all t. By assumption,

inf_{x: ‖x − x*‖₂² > K} ⟨x − x*, ∇C(x)⟩ > 0,

which means that sup_t h′(t) < 0, contradicting the convergence of h′(t) to 0.
Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!
Discrete Gradient Descent

An intuitive, straightforward discretisation of

d/dt s(t) = −∇C(s(t))

is

(s_{n+1} − s_n)/ε_n = −∇C(s_n)    (forward Euler)

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method which is (A-)stable is the implicit Euler method:

(s_{n+1} − s_n)/ε_n = −∇C(s_{n+1})

So why settle for explicit discretisation? Implicit requires the solution of a non-linear system at each step, and practically, explicit discretisation works!
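A sketch of the stability difference on the quadratic example above, where ∇C(x) = Ax, so the implicit update reduces to a linear solve (the step size ε = 0.11 is our own choice, deliberately past the explicit method's stability limit 2/20 = 0.1 for this problem):

```python
import numpy as np

A = np.array([[12.0, 8.0], [8.0, 12.0]])    # grad C(x) = A x; eigenvalues 4 and 20
eps = 0.11                                   # beyond forward Euler's stability limit

s_fwd = np.array([1.5, 0.5])
s_imp = np.array([1.5, 0.5])
for _ in range(100):
    s_fwd = s_fwd - eps * (A @ s_fwd)                    # explicit: diverges here
    s_imp = np.linalg.solve(np.eye(2) + eps * A, s_imp)  # implicit: (I + eps*A) s_{n+1} = s_n
print(np.linalg.norm(s_fwd), np.linalg.norm(s_imp))      # huge vs. essentially zero
```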
Discrete Gradient Descent

We also have multistep methods, e.g.

s_{n+2} = s_{n+1} − ε_{n+2} ((3/2)∇C(s_{n+1}) − (1/2)∇C(s_n))

This is the 2nd-order Adams-Bashforth method.

Similar (in)stability properties to forward Euler, but discretisation error of a smaller order of magnitude at each step (O(ε³) compared with O(ε²)).
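A sketch of the two-step scheme on the same example (bootstrapped with a single Euler step; ε = 0.04 is our own choice, inside the stability region here):

```python
import numpy as np

def grad_C(x):
    return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

eps = 0.04
s_prev = np.array([1.5, 0.5])
s_curr = s_prev - eps * grad_C(s_prev)        # one Euler step to start the recursion
for _ in range(200):
    s_next = s_curr - eps * (1.5 * grad_C(s_curr) - 0.5 * grad_C(s_prev))
    s_prev, s_curr = s_curr, s_next
print(s_curr)                                  # -> close to the minimiser (0, 0)
```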
Discrete Gradient Descent

[Animations: examples of discretisation with ε = 0.1, ε = 0.07, and ε = 0.01]
Discrete Gradient Descent

Proposition: If s_{n+1} = s_n − ε_n∇C(s_n) and C satisfies

1. C has a unique minimiser x*
2. ∀ε > 0: inf_{‖x − x*‖₂² > ε} ⟨x − x*, ∇C(x)⟩ > 0
3. ‖∇C(x)‖₂² ≤ A + B‖x − x*‖₂² for some A, B ≥ 0

then, subject to Σ_n ε_n = ∞ and Σ_n ε_n² < ∞, we have s_n → x*.

Proof structure: the proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
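A sketch with the classic schedule ε_n = c/n, which satisfies both step-size conditions (the constant c = 0.1 is our own choice):

```python
import numpy as np

def grad_C(x):
    return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

s = np.array([1.5, 0.5])
for n in range(1, 20001):
    eps_n = 0.1 / n               # sum eps_n = inf, sum eps_n^2 < inf
    s = s - eps_n * grad_C(s)
print(np.linalg.norm(s))          # -> 0, slowly; that is the price of the schedule
```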
Discrete Gradient Descent

Proof sketch:
1. Define the Lyapunov sequence h_n = ‖s_n − x*‖₂²
2. Consider the positive variations h_n⁺ = max(0, h_{n+1} − h_n)
3. Show that Σ_{n=1}^∞ h_n⁺ < ∞
4. Show that this implies that h_n converges
5. Show that h_n must converge to 0
6. s_n → x*

Define the Lyapunov sequence h_n = ‖s_n − x*‖₂². Because of the error that the discretisation introduces, h_n is not guaranteed to be decreasing - we have to adjust the proof.

Idea: h_n may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of h_n. [Illustration credit: Online Learning and Stochastic Approximations, Leon Bottou (1998)]

Consider the positive variations h_n⁺ = max(0, h_{n+1} − h_n). With some algebra, note that

h_{n+1} − h_n = ⟨s_{n+1} − x*, s_{n+1} − x*⟩ − ⟨s_n − x*, s_n − x*⟩
             = ⟨s_{n+1}, s_{n+1}⟩ − ⟨s_n, s_n⟩ − 2⟨s_{n+1} − s_n, x*⟩
             = ⟨s_n − ε_n∇C(s_n), s_n − ε_n∇C(s_n)⟩ − ⟨s_n, s_n⟩ + 2ε_n⟨∇C(s_n), x*⟩
             = −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²‖∇C(s_n)‖₂²

Show that Σ_{n=1}^∞ h_n⁺ < ∞: assuming ‖∇C(x)‖₂² ≤ A + B‖x − x*‖₂², we get

h_{n+1} − h_n ≤ −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²(A + Bh_n)
⟹ h_{n+1} − (1 + ε_n²B)h_n ≤ −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²A ≤ ε_n²A

Writing μ_n = Π_{i=1}^n 1/(1 + ε_i²B) and h′_n = μ_n h_n, we get

h′_{n+1} − h′_n ≤ ε_n²μ_n A.

Assuming Σ_{n=1}^∞ ε_n² < ∞, μ_n converges away from 0, so the right-hand side is summable, and hence so is Σ_n h_n⁺.

Show that this implies that h_n converges: if Σ_{n=1}^∞ max(0, h_{n+1} − h_n) < ∞, then Σ_{n=1}^∞ min(0, h_{n+1} − h_n) > −∞ (why? because h_n ≥ 0, so the total downward movement can exceed the total upward movement by at most h_1). But

h_{n+1} = h_1 + Σ_{k=1}^n (max(0, h_{k+1} − h_k) + min(0, h_{k+1} − h_k)),

so (h_n)_{n=1}^∞ converges.

Show that h_n must converge to 0: assume h_n converges to some positive number. Previously we had

h_{n+1} − h_n = −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²‖∇C(s_n)‖₂²

The left-hand side is summable (since h_n converges), and the ε_n² term is summable, so Σ_n ε_n⟨s_n − x*, ∇C(s_n)⟩ < ∞. If we assume further that Σ_{n=1}^∞ ε_n = ∞, then we get

⟨s_n − x*, ∇C(s_n)⟩ → 0 (at least along a subsequence),

contradicting h_n converging away from 0.
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm!

If C is a cost function averaged across a data set, we often have a true gradient of the form

∇C(x) = (1/N) Σ_{n=1}^N f_n(x)

and an approximation formed by subsampling:

∇̂C(x) = (1/K) Σ_{k∈I_K} f_k(x),  I_K ∼ Unif(subsets of size K)

We'll treat the more general case where the gradient estimates ∇̂C(x) are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
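As a sketch, SGD with this subsampled estimator and Robbins-Monro steps on a toy least-squares problem (the synthetic data and constants are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 1000, 5, 16
X = rng.normal(size=(N, D))
w_star = rng.normal(size=D)
y = X @ w_star

def H(w):
    # unbiased estimate of grad C(w), with C(w) = (1/N) sum_i (w^T x_i - y_i)^2
    idx = rng.choice(N, size=K, replace=False)
    return 2 * X[idx].T @ (X[idx] @ w - y[idx]) / K

w = np.zeros(D)
for n in range(1, 50001):
    w = w - (0.5 / n) * H(w)          # Robbins-Monro schedule
print(np.linalg.norm(w - w_star))     # -> 0: s_n converges to x*
```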
Measure-theory-free martingale theory

Let (X_n)_{n≥0} be a stochastic process.

F_n denotes "the information describing the stochastic process up to time n", denoted mathematically by F_n = σ(X_m | m ≤ n).

Definition: a stochastic process (X_n)_{n=0}^∞ is a martingale (on a filtered probability space (Ω, F, (F_n)_{n=0}^∞, P), with F_n = σ(X_m | m ≤ n) for all n) if

- E[|X_n|] < ∞ for all n
- E[X_n | F_m] = X_m for n ≥ m

Martingale convergence theorem (Doob, 1953): if (X_n)_{n=1}^∞ is a martingale and sup_n E[|X_n|] < ∞, then X_n → X_∞ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): if (X_n)_{n=1}^∞ is a positive stochastic process and

Σ_{n=1}^∞ E[(E[X_{n+1} | F_n] − X_n) 1_{E[X_{n+1}|F_n] − X_n > 0}] < ∞,

then X_n → X_∞ almost surely.
Stochastic Gradient Descent

Proposition: If s_{n+1} = s_n − ε_n H_n(s_n), with H_n(s_n) an unbiased estimator for ∇C(s_n), and C satisfies

1. C has a unique minimiser x*
2. ∀ε > 0: inf_{‖x − x*‖₂² > ε} ⟨x − x*, ∇C(x)⟩ > 0
3. E[‖H_n(x)‖₂²] ≤ A + B‖x − x*‖₂² for some A, B ≥ 0 independent of n

then, subject to Σ_n ε_n = ∞ and Σ_n ε_n² < ∞, we have s_n → x*.

Proof structure: the proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
Stochastic Gradient Descent

Proof sketch:
1. Define the Lyapunov process h_n = ‖s_n − x*‖₂²
2. Consider the variations h_{n+1} − h_n
3. Show that h_n converges almost surely
4. Show that h_n must converge to 0 almost surely
5. s_n → x*

Consider the positive variations h_n⁺ = max(0, h_{n+1} − h_n). By exactly the same calculation as in the deterministic discrete case,

h_{n+1} − h_n = −2ε_n⟨s_n − x*, H_n(s_n)⟩ + ε_n²‖H_n(s_n)‖₂²

So

E[h_{n+1} − h_n | F_n] = −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n² E[‖H_n(s_n)‖₂² | F_n]

Show that h_n converges almost surely. Exactly the same as for discrete gradient descent:
- assume E[‖H_n(x)‖₂²] ≤ A + B‖x − x*‖₂²
- introduce μ_n = Π_{i=1}^n 1/(1 + ε_i²B) and h′_n = μ_n h_n
- get E[h′_{n+1} − h′_n | F_n] ≤ ε_n²μ_n A
⟹ E[(h′_{n+1} − h′_n) 1_{E[h′_{n+1} − h′_n | F_n] > 0} | F_n] ≤ ε_n²μ_n A

Σ_{n=1}^∞ ε_n² < ∞ ⟹ quasimartingale convergence ⟹ (h′_n)_{n=1}^∞ converges a.s. ⟹ (h_n)_{n=1}^∞ converges a.s.

Show that h_n must converge to 0 almost surely. From the previous calculations,

E[h_{n+1} − (1 + ε_n²B)h_n | F_n] ≤ −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n²A

(h_n)_{n=1}^∞ converges a.s., so the left-hand side is summable a.s. We assume Σ_{n=1}^∞ ε_n² < ∞, so the ε_n²A term is summable a.s., and hence so is the inner-product term:

Σ_{n=1}^∞ ε_n⟨s_n − x*, ∇C(s_n)⟩ < ∞ almost surely.

If we assume in addition that Σ_{n=1}^∞ ε_n = ∞, this forces

⟨s_n − x*, ∇C(s_n)⟩ → 0 almost surely.

And by our initial assumptions about C, (h_n)_{n=1}^∞ must converge to 0 almost surely.
Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- C is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: Σ_{n=1}^∞ ε_n = ∞, Σ_{n=1}^∞ ε_n² < ∞
- the gradient estimate H satisfies E[‖H_n(x)‖^k] ≤ A_k + B_k‖x‖^k for k = 2, 3, 4
- there exists D > 0 such that inf_{‖x‖₂² > D} ⟨x, ∇C(x)⟩ > 0

The idea for the proof is then:

1. For a given start position, the sequence (s_n)_{n=1}^∞ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that ∇C(s_n) necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
Motivation

So far we've told you why SGD is "safe" :)

...but Robbins-Monro is just a sufficient condition. So how should we choose learning rates to achieve
- fast convergence?
- a better local optimum?
Motivation (cont)

[Figure from [Bergstra and Bengio, 2012]]

Idea 1: search for the best learning rate schedule: grid search, random search, cross validation, Bayesian optimization...

But I'm lazy and I don't want to spend too much time!
Motivation (cont)

Idea 2: let the learning rates adapt themselves:
- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:
- a good tutorial [Shalev-Shwartz, 2012]
- milestone paper [Zinkevich, 2003]

...and some practical comparisons.
Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset D
- the cost function C(w, D) = E_{x∼D}[c(w, x)]

SGD / mini-batch learning accelerates training in real time by:
- processing one / a mini-batch of data points each iteration
- approximating the gradient with the data-point cost functions:

∇C(w, D) ≈ (1/M) Σ_{m=1}^M ∇c(w, x_m)
From batch to online learning (cont)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):
- each iteration t, we receive a loss f_t(w)
- and a (noisy) (sub-)gradient g_t ∈ ∂f_t(w)
- then we update the weights with w ← w − ηg_t

For SGD, the received gradient is defined by

g_t = (1/M) Σ_{m=1}^M ∇c(w, x_m)
From batch to online learning (cont)

Online learning assumes:
- each iteration t, we receive a loss f_t(w) (in the batch learning context, f_t(w) = c(w, x_t))
- then we learn w_{t+1} following some rules

Performance is measured by the regret:

R*_T = Σ_{t=1}^T f_t(w_t) − inf_{w∈S} Σ_{t=1}^T f_t(w)    (1)

General definition: R_T(u) = Σ_{t=1}^T [f_t(w_t) − f_t(u)]

SGD ⇔ "follow the regularized leader" with L2 regularization
From batch to online learning (cont)

Follow the Leader (FTL):

w_{t+1} = argmin_{w∈S} Σ_{i=1}^t f_i(w)    (2)

Lemma (upper bound of regret): let w_1, w_2, ... be the sequence produced by FTL; then ∀u ∈ S:

R_T(u) = Σ_{t=1}^T [f_t(w_t) − f_t(u)] ≤ Σ_{t=1}^T [f_t(w_t) − f_t(w_{t+1})]    (3)
From batch to online learning (cont)

A game that fools FTL: let w ∈ S = [−1, 1], and define the loss function at time t as

f_t(w) = −0.5w if t = 1,  w if t is even,  −w if t > 1 and t is odd.

FTL is easily fooled!
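A sketch of this game in code, showing that FTL's cumulative loss grows like T while the best fixed point in hindsight pays almost nothing:

```python
import numpy as np

def coeff(t):
    # f_t(w) = a_t * w on S = [-1, 1]
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T, w, cum, ftl_loss = 100, 0.0, 0.0, 0.0    # start from w_1 = 0
for t in range(1, T + 1):
    a = coeff(t)
    ftl_loss += a * w
    cum += a
    w = -np.sign(cum)                        # FTL: minimise cum * w over [-1, 1]
best_fixed = -abs(cum)                       # best fixed u in hindsight pays cum * u
print(ftl_loss, best_fixed)                  # ~99 versus -0.5: regret linear in T
```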
From batch to online learning (cont)

Follow the Regularized Leader (FTRL):

w_{t+1} = argmin_{w∈S} Σ_{i=1}^t f_i(w) + φ(w)    (4)

Lemma (upper bound of regret): let w_1, w_2, ... be the sequence produced by FTRL; then ∀u ∈ S:

Σ_{t=1}^T [f_t(w_t) − f_t(u)] ≤ φ(u) − φ(w_1) + Σ_{t=1}^T [f_t(w_t) − f_t(w_{t+1})]
From batch to online learning (cont)

Let's assume the losses f_t are convex functions:

∀w, u ∈ S: f_t(u) − f_t(w) ≥ ⟨u − w, g(w)⟩, ∀g(w) ∈ ∂f_t(w)

Use the linearisation f̃_t(w) = f_t(w_t) + ⟨w, g_t⟩ as a surrogate; then

Σ_{t=1}^T f_t(w_t) − f_t(u) ≤ Σ_{t=1}^T ⟨w_t, g_t⟩ − ⟨u, g_t⟩, ∀g_t ∈ ∂f_t(w_t)
From batch to online learning (cont)

Online Gradient Descent (OGD):

w_{t+1} = w_t − ηg_t, g_t ∈ ∂f_t(w_t)    (5)

From FTRL to online gradient descent:
- use the linearisation as surrogate loss (it upper bounds the LHS)
- use the L2 regularizer φ(w) = (1/2η)‖w − w_1‖₂²
- apply the FTRL regret bound to the surrogate loss:

w_{t+1} = argmin_{w∈S} Σ_{i=1}^t ⟨w, g_i⟩ + φ(w)
From batch to online learning (cont)

Theorem (regret bound for online gradient descent): assume we run online gradient descent on convex loss functions f_1, f_2, ..., f_T with regularizer φ(w) = (1/2η)‖w − w_1‖₂²; then for all u ∈ S:

R_T(u) ≤ (1/2η)‖u − w_1‖₂² + η Σ_{t=1}^T ‖g_t‖₂², g_t ∈ ∂f_t(w_t)

Keep in mind for the later discussion of adaGrad: if we want φ to be strongly convex w.r.t. a (semi-)norm ‖·‖, then the L2 norm on the gradients above is replaced by the dual norm ‖·‖*, and Hölder's inequality is used in the proof.
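A minimal sketch of projected OGD (the quadratic losses on a random stream and the fixed η are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, eta = 1000, 3, 0.05
w, zs, ogd_loss = np.zeros(D), [], 0.0

for t in range(T):
    z = rng.uniform(-1, 1, size=D)     # this round's loss: f_t(w) = ||w - z||^2
    ogd_loss += np.sum((w - z) ** 2)
    g = 2 * (w - z)                    # (sub)gradient of f_t at w_t
    w = np.clip(w - eta * g, -1, 1)    # gradient step + projection onto S = [-1, 1]^D
    zs.append(z)

u = np.clip(np.mean(zs, axis=0), -1, 1)            # best fixed point in hindsight
best = sum(np.sum((u - z) ** 2) for z in zs)
print(ogd_loss - best)                 # regret: grows sublinearly in T
```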
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms ψ_t: ψ_t is a strongly-convex function, and

B_{ψ_t}(w, w_t) = ψ_t(w) − ψ_t(w_t) − ⟨∇ψ_t(w_t), w − w_t⟩

Primal-dual sub-gradient update:

w_{t+1} = argmin_w η⟨w, g_t⟩ + ηφ(w) + (1/t)ψ_t(w)    (6)

Proximal gradient / composite mirror descent:

w_{t+1} = argmin_w η⟨w, g_t⟩ + ηφ(w) + B_{ψ_t}(w, w_t)    (7)
adaGrad [Duchi et al., 2010] (cont)

adaGrad with diagonal matrices:

G_t = Σ_{i=1}^t g_i g_i^T, H_t = δI + diag(G_t)^{1/2}

w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηφ(w) + (1/2t) w^T H_t w    (8)

w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηφ(w) + (1/2)(w − w_t)^T H_t (w − w_t)    (9)

To get adaGrad from online gradient descent:
- add a proximal term, or a Bregman divergence, to the objective
- adapt the proximal term through time:

ψ_t(w) = (1/2) w^T H_t w
adaGrad [Duchi et al., 2010] (cont)

Example: composite mirror descent on S = R^D with φ(w) = 0:

w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2)(w − w_t)^T [δI + diag(G_t)^{1/2}](w − w_t)

⟹ w_{t+1} = w_t − ηg_t / (δ + √(Σ_{i=1}^t g_i²))    (10)

(with the square root and division taken elementwise)
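Equation (10) in code, as a sketch (the toy objective and constants are our own choices):

```python
import numpy as np

def adagrad_step(w, g, sum_sq, eta=0.5, delta=1e-8):
    sum_sq = sum_sq + g ** 2                                  # accumulate squared gradients
    return w - eta * g / (delta + np.sqrt(sum_sq)), sum_sq    # eq (10), elementwise

w, sum_sq = np.array([1.5, 0.5]), np.zeros(2)
for t in range(1000):
    w, sum_sq = adagrad_step(w, 2 * w, sum_sq)   # gradient of C(w) = ||w||^2
print(w)                                          # -> near (0, 0)
```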
adaGrad [Duchi et al., 2010] (cont)

WHY? Time-varying learning rate vs time-varying proximal term:

- we want the contribution of ‖g_t‖*² induced by ψ_t to be upper bounded by ‖g_t‖₂
- define ‖·‖_{ψ_t} = √⟨·, H_t ·⟩
- ψ_t(w) = (1/2)w^T H_t w is strongly convex w.r.t. ‖·‖_{ψ_t}
- a little math can show that

Σ_{t=1}^T ‖g_t‖²_{ψ_t*} ≤ 2 Σ_{d=1}^D ‖g_{1:T,d}‖₂
adaGrad [Duchi et al., 2010] (cont)

Theorem (Thm. 5 in the paper): assume the sequence {w_t} is generated by adaGrad using the primal-dual update (8), with δ ≥ max_t ‖g_t‖_∞; then for any u ∈ S:

R_T(u) ≤ (δ/η)‖u‖₂² + (1/η)‖u‖_∞² Σ_{d=1}^D ‖g_{1:T,d}‖₂ + η Σ_{d=1}^D ‖g_{1:T,d}‖₂

For {w_t} generated using the composite mirror descent update (9):

R_T(u) ≤ (1/2η) max_{t≤T} ‖u − w_t‖₂² Σ_{d=1}^D ‖g_{1:T,d}‖₂ + η Σ_{d=1}^D ‖g_{1:T,d}‖₂
Improving adaGrad

Possible problems of adaGrad:
- the global learning rate η still needs hand tuning
- sensitive to initial conditions

Possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]
Improving adaGrad (cont)

RMSprop (using a running average of squared gradients):

G_t = ρG_{t−1} + (1 − ρ)g_t g_t^T, H_t = (δI + diag(G_t))^{1/2}

w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2)(w − w_t)^T H_t (w − w_t)

⟹ w_{t+1} = w_t − ηg_t / RMS[g]_t, RMS[g]_t = diag(H_t)    (11)
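A sketch of the diagonal update (11), with δ inside the square root as in H_t above (the toy objective and constants are our own choices):

```python
import numpy as np

def rmsprop_step(w, g, G, eta=0.05, rho=0.9, delta=1e-8):
    G = rho * G + (1 - rho) * g ** 2             # exponential running average of g^2
    return w - eta * g / np.sqrt(delta + G), G   # divide by RMS[g]_t, eq (11)

w, G = np.array([1.5, 0.5]), np.zeros(2)
for t in range(1000):
    w, G = rmsprop_step(w, 2 * w, G)             # gradient of C(w) = ||w||^2
print(w)                                          # hovers within ~eta of the minimiser
```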
Improving adaGrad (cont)

Approximating second-order information from the updates themselves:

Δw ∝ H⁻¹g ∝ (∂f/∂w) / (∂²f/∂w²)

⟹ ∂²f/∂w² ∝ (∂f/∂w) / Δw

⟹ diag(H) ≈ RMS[g]_t / RMS[Δw]_{t−1}
Improving adaGrad (cont)

adaDelta:

diag(H_t) = RMS[g]_t / RMS[Δw]_{t−1}

w_{t+1} = argmin_w ⟨w, g_t⟩ + (1/2)(w − w_t)^T H_t (w − w_t)

⟹ w_{t+1} = w_t − (RMS[Δw]_{t−1} / RMS[g]_t) g_t    (12)
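A sketch of update (12), following the usual presentation of [Zeiler, 2012]; the decay ρ and the small constant δ inside both RMS terms are assumed defaults:

```python
import numpy as np

def adadelta_step(w, g, G, D, rho=0.95, delta=1e-6):
    G = rho * G + (1 - rho) * g ** 2                    # running average of g^2
    dw = -np.sqrt(D + delta) / np.sqrt(G + delta) * g   # RMS[dw]_{t-1} / RMS[g]_t * g_t
    D = rho * D + (1 - rho) * dw ** 2                   # running average of dw^2
    return w + dw, G, D

w, G, D = np.array([1.5, 0.5]), np.zeros(2), np.zeros(2)
for t in range(3000):
    w, G, D = adadelta_step(w, 2 * w, G, D)
print(w)   # note: no global learning rate eta anywhere in the update
```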
Improving adaGrad (cont)

ADAM (adding momentum and bias correction):

m_t = β₁m_{t−1} + (1 − β₁)g_t, v_t = β₂v_{t−1} + (1 − β₂)g_t²

m̂_t = m_t / (1 − β₁^t), v̂_t = v_t / (1 − β₂^t)

w_{t+1} = w_t − η m̂_t / (√v̂_t + δ)    (13)
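Update (13) as a sketch, with the usual default β values (the toy objective and η are our own choices):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=0.1, b1=0.9, b2=0.999, delta=1e-8):
    m = b1 * m + (1 - b1) * g               # momentum-style first moment
    v = b2 * v + (1 - b2) * g ** 2          # second moment: running average of g^2
    m_hat = m / (1 - b1 ** t)               # bias corrections for the zero init
    v_hat = v / (1 - b2 ** t)
    return w - eta * m_hat / (np.sqrt(v_hat) + delta), m, v

w, m, v = np.array([1.5, 0.5]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):                    # t starts at 1 so the corrections are defined
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(w)
```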
Demo time!

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

Idea 3: learn the (global) learning rates as well.

w_{t+1} = w_t − ηg_t ⟹ w_t = w(t, η; w_1, {g_i}_{i≤t−1})

so w_t is a function of η; learn η (with gradient descent) by

η* = argmin_η Σ_{s≤t} f_s(w(s, η))

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn η on the fly [Masse and Ollivier, 2015]
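A toy sketch of the idea (not the algorithm of either paper): treat the final loss of a short unrolled run as a function of η and descend on it, here with a finite-difference hypergradient on C(w) = ‖w‖²:

```python
import numpy as np

def final_loss(eta, steps=5):
    w = np.array([1.5, 0.5])
    for _ in range(steps):
        w = w - eta * 2 * w        # unrolled GD on C(w) = ||w||^2: w_t = w(t, eta)
    return np.sum(w ** 2)

eta, lr, h = 0.05, 0.01, 1e-5
for _ in range(100):
    hypergrad = (final_loss(eta + h) - final_loss(eta - h)) / (2 * h)
    eta = eta - lr * hypergrad      # gradient descent on the learning rate itself
print(eta, final_loss(eta))         # eta grows towards faster steps; final loss shrinks
```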
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce labour on tuning hyper-parameters.
References

- Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo
Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator
7 of 27
A brief history of numerical analysis
Present dayStochastic approximation widely researched and used in practice
I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent
I Neural networksI K-MeansI
I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu
amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))
TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)
8 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Exact solution of gradient descent ODE
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent

Proof sketch:

1. Define Lyapunov function $h(t) = \|s(t) - x^*\|_2^2$
2. $h(t)$ is a decreasing function
3. $h(t)$ converges to a limit, and $h'(t)$ converges to $0$
4. The limit of $h(t)$ must be $0$
5. $s(t) \to x^*$

$h(t)$ is a decreasing function:

$\frac{d}{dt}h(t) = \frac{d}{dt}\|s(t) - x^*\|_2^2 = \frac{d}{dt}\langle s(t) - x^*, s(t) - x^* \rangle = 2\left\langle \frac{ds(t)}{dt},\, s(t) - x^* \right\rangle = -2\langle s(t) - x^*, \nabla C(s(t)) \rangle \le 0$
The limit of $h(t)$ must be $0$:

If not, there is some $K > 0$ such that $h(t) > K$ for all $t$. By assumption,

$\inf_{x \,:\, \|x - x^*\|_2^2 > K} \langle x - x^*, \nabla C(x) \rangle > 0,$

which means that $h'(t)$ is bounded away from $0$ ($\sup_t h'(t) < 0$), contradicting the convergence of $h'(t)$ to $0$.
Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!
Discrete Gradient Descent

An intuitive, straightforward discretisation of

$\frac{d}{dt}s(t) = -\nabla C(s(t))$

is

$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n) \quad \text{(forward Euler)}$

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method which is (A-)stable is the implicit Euler method:

$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})$

So why settle for explicit discretisation? Implicit discretisation requires the solution of a non-linear system at each step, and practically, explicit discretisation works.
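As a concrete illustration, here is a small NumPy sketch (ours, not from the slides) applying both discretisations to the running example $C(x) = 5(x_1+x_2)^2 + (x_1-x_2)^2$. Because $\nabla C(x) = Ax$ is linear, the implicit step reduces to a linear solve rather than a general non-linear system:

    import numpy as np

    # Gradient of C(x) = 5(x1+x2)^2 + (x1-x2)^2 is linear: grad C(x) = A x
    A = np.array([[12.0, 8.0], [8.0, 12.0]])

    def explicit_euler_step(s, eps):
        # forward Euler: s_{n+1} = s_n - eps * grad C(s_n)
        return s - eps * (A @ s)

    def implicit_euler_step(s, eps):
        # backward Euler: s_{n+1} = s_n - eps * grad C(s_{n+1});
        # since grad C is linear, this is the linear solve (I + eps*A) s_{n+1} = s_n
        return np.linalg.solve(np.eye(2) + eps * A, s)

    eps, n_steps = 0.01, 100
    s_exp = s_imp = np.array([1.5, 0.5])
    for _ in range(n_steps):
        s_exp = explicit_euler_step(s_exp, eps)
        s_imp = implicit_euler_step(s_imp, eps)

    t = eps * n_steps
    exact = np.array([np.exp(-20 * t) + 0.5 * np.exp(-4 * t),
                      np.exp(-20 * t) - 0.5 * np.exp(-4 * t)])
    print(s_exp, s_imp, exact)  # all three should roughly agree at this step size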
Discrete Gradient Descent

We also have multistep methods, e.g.

$s_{n+2} = s_{n+1} - \varepsilon_{n+2}\left(\frac{3}{2}\nabla C(s_{n+1}) - \frac{1}{2}\nabla C(s_n)\right)$

This is the 2nd-order Adams-Bashforth method.

It has similar (in)stability properties to forward Euler, but a discretisation error of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).
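A sketch of the two-step scheme on the same example; since a two-step method needs two initial values and the slides don't specify a start-up, we bootstrap with one forward Euler step (our choice):

    import numpy as np

    A = np.array([[12.0, 8.0], [8.0, 12.0]])  # grad C(x) = A x for the running example

    def grad_C(s):
        return A @ s

    def adams_bashforth2(s0, eps, n_steps):
        # bootstrap the two-step scheme with one forward Euler step
        s_prev, s_curr = s0, s0 - eps * grad_C(s0)
        for _ in range(n_steps - 1):
            s_prev, s_curr = s_curr, s_curr - eps * (1.5 * grad_C(s_curr) - 0.5 * grad_C(s_prev))
        return s_curr

    print(adams_bashforth2(np.array([1.5, 0.5]), 0.01, 100))  # tends towards the minimiser (0, 0)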
Discrete Gradient Descent

[Animations: examples of discretisation with $\varepsilon = 0.1$, $\varepsilon = 0.07$, and $\varepsilon = 0.01$]
Discrete Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies:

1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$: $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$
3. $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
Discrete Gradient Descent

Proof sketch:

1. Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to $0$
6. $s_n \to x^*$
Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$:

Because of the error introduced by the discretisation, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. (Illustration credit: Online Learning and Stochastic Approximations, Léon Bottou (1998).)
Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$:

With some algebra, note that

$h_{n+1} - h_n = \langle s_{n+1} - x^*, s_{n+1} - x^* \rangle - \langle s_n - x^*, s_n - x^* \rangle$
$= \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2\langle s_{n+1} - s_n, x^* \rangle$
$= \langle s_n - \varepsilon_n \nabla C(s_n),\, s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2\varepsilon_n \langle \nabla C(s_n), x^* \rangle$
$= -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$
Show that $\sum_{n=1}^\infty h_n^+ < \infty$:

Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get

$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + Bh_n)$
$\implies h_{n+1} - (1 + \varepsilon_n^2 B)h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$

Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$, we get

$h_{n+1}' - h_n' \le \varepsilon_n^2 \mu_n A$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from $0$, so the RHS is summable.
Show that this implies that $h_n$ converges:

If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why? because $h_n \ge 0$).

But $h_{n+1} = h_0 + \sum_{k=0}^n \left(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\right)$.

So $(h_n)_{n=1}^\infty$ converges.
Show that $h_n$ must converge to $0$:

Assume $h_n$ converges to some positive number. Previously we had

$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$

The left-hand side is summable, and the $\varepsilon_n^2 \|\nabla C(s_n)\|_2^2$ term is summable (since $h_n$ is bounded and $\sum \varepsilon_n^2 < \infty$), so $\sum_n \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty$. If we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0$

contradicting $h_n$ converging away from $0$ (assumption 2 bounds the inner product away from $0$ while $h_n$ stays above a positive level).
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x)$

and an approximation is formed by subsampling:

$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \quad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
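As a minimal sketch of the subsampled estimator, consider a least-squares cost on hypothetical data X, y (both invented here for illustration); the subsample average is an unbiased estimate of the full gradient:

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, K = 1000, 5, 64
    X = rng.normal(size=(N, D))                       # hypothetical design matrix
    y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

    def full_gradient(w):
        # grad C(w) for C(w) = (1/2N) sum_n (x_n.w - y_n)^2: the average of the f_n
        return X.T @ (X @ w - y) / N

    def subsampled_gradient(w):
        # unbiased estimate: average f_k over a uniformly drawn size-K subset I_K
        idx = rng.choice(N, size=K, replace=False)
        return X[idx].T @ (X[idx] @ w - y[idx]) / K

    w = np.zeros(D)
    print(full_gradient(w))
    print(subsampled_gradient(w))  # close to the full gradient, up to sampling noise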
Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if

- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$\sum_{n=1}^\infty \mathbb{E}\left[\left(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n\right)\mathbf{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}}\right] < \infty$

then $X_n \to X_\infty$ almost surely.

³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$
Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies:

1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$: $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
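A minimal sketch of the proposition in action, using the earlier quadratic with artificial Gaussian noise added to the gradient (our choice of unbiased estimator) and Robbins-Monro step sizes $\varepsilon_n = 1/(20+n)$:

    import numpy as np

    rng = np.random.default_rng(1)
    A = np.array([[12.0, 8.0], [8.0, 12.0]])  # grad C(x) = A x, minimiser x* = 0

    def H(s):
        # unbiased gradient estimator: H_n(s) = grad C(s) + zero-mean noise
        return A @ s + rng.normal(scale=0.5, size=2)

    s = np.array([1.5, 0.5])
    for n in range(1, 100001):
        eps_n = 1.0 / (20.0 + n)   # Robbins-Monro: sum eps_n = inf, sum eps_n^2 < inf
        s = s - eps_n * H(s)
    print(s)  # drifts towards the minimiser x* = (0, 0)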
Stochastic Gradient Descent

Proof sketch:

1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to $0$ almost surely
5. $s_n \to x^*$
Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$:

By exactly the same calculation as in the deterministic discrete case,

$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$

So

$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \,\mathbb{E}\left[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\right]$
Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

- Assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$

We get

$\mathbb{E}[h_{n+1}' - h_n' \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A$
$\implies \mathbb{E}\left[(h_{n+1}' - h_n')\mathbf{1}_{\{\mathbb{E}[h_{n+1}' - h_n' \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty \implies$ quasimartingale convergence $\implies (h_n')_{n=1}^\infty$ converges a.s. $\implies (h_n)_{n=1}^\infty$ converges a.s.
Show that $h_n$ must converge to $0$ almost surely. From previous calculations,

$\mathbb{E}\left[h_{n+1} - (1 + \varepsilon_n^2 B)h_n \mid \mathcal{F}_n\right] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A$

$(h_n)_{n=1}^\infty$ converges, so this sequence is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the right term is summable a.s., and hence the left side is also summable a.s.:

$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty$ almost surely

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0$ almost surely

And by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to $0$ almost surely.
Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- $C$ is a non-negative, three-times differentiable function
- The Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- The gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$
- There exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of $0$ almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to $0$ almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
Motivation

So far we've told you why SGD is "safe" :)

...but Robbins-Monro is just a sufficient condition. So how do we choose learning rates to achieve fast convergence and a better local optimum?
Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

Idea 1: search for the best learning rate schedule: grid search, random search, cross validation, Bayesian optimization.

But I'm lazy and I don't want to spend too much time!
Motivation (cont.)

Idea 2: let the learning rates adapt themselves:

- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

...and some practical comparisons.
Online learning

From batch to online learning

Batch learning often assumes:

- we've got the whole dataset $D$
- the cost function $C(w; D) = \mathbb{E}_{x \sim D}[c(w, x)]$

SGD/mini-batch learning accelerates training in real time by processing one datapoint or a mini-batch of datapoints each iteration, considering gradients of the per-datapoint cost functions:

$\nabla C(w; D) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$
From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

- each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by

$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$
From batch to online learning (cont.)

Online learning assumes:

- each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \quad (1)$

General definition: $R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.
From batch to online learning (cont.)

Follow the Leader (FTL):

$w_{t+1} = \mathrm{argmin}_{w \in S} \sum_{i=1}^t f_i(w) \quad (2)$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,

$R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right] \quad (3)$
From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be

$f_t(w) = \begin{cases} -0.5w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$

FTL is easily fooled!
From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$w_{t+1} = \mathrm{argmin}_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \quad (4)$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,

$\sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right]$
From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g_t \rangle$ as a surrogate:

$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$
From batch to online learning (cont.)

Online Gradient Descent (OGD):

$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \quad (5)$

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (it upper bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$w_{t+1} = \mathrm{argmin}_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$
Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,

$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$
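A minimal OGD sketch, assuming a stream of squared-error losses $f_t(w) = \frac{1}{2}(\langle w, x_t \rangle - y_t)^2$ on invented data, and taking $S$ to be the unit Euclidean ball (so each step is followed by a projection back onto $S$):

    import numpy as np

    rng = np.random.default_rng(2)
    w_true = np.array([0.5, -0.3])            # hypothetical target weights

    def project(w, radius=1.0):
        # Euclidean projection back onto S = {w : ||w||_2 <= radius}
        norm = np.linalg.norm(w)
        return w if norm <= radius else w * (radius / norm)

    w, eta, T = np.zeros(2), 0.05, 1000
    cum_loss = 0.0
    for t in range(T):
        x_t = rng.normal(size=2)
        y_t = x_t @ w_true + 0.1 * rng.normal()
        g_t = (w @ x_t - y_t) * x_t           # gradient of f_t(w) = 0.5 (w.x_t - y_t)^2
        cum_loss += 0.5 * (w @ x_t - y_t) ** 2
        w = project(w - eta * g_t)            # OGD step (5), then project onto S
    print(w, cum_loss / T)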
Keep in mind for the later discussion of adaGrad:

- I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$
- then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_*$
- and we use Hölder's inequality in the proof
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$

Primal-dual sub-gradient update:

$w_{t+1} = \mathrm{argmin}_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t}\psi_t(w) \quad (6)$

Proximal gradient/composite mirror descent:

$w_{t+1} = \mathrm{argmin}_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \quad (7)$
adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$G_t = \sum_{i=1}^t g_i g_i^T, \quad H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$

$w_{t+1} = \mathrm{argmin}_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w \quad (8)$

$w_{t+1} = \mathrm{argmin}_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \quad (9)$

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time: $\psi_t(w) = \frac{1}{2} w^T H_t w$
adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$w_{t+1} = \mathrm{argmin}_w \; \eta \langle w, g_t \rangle + \frac{1}{2}(w - w_t)^T \left[\delta I + \mathrm{diag}(G_t)^{1/2}\right](w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta \, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \quad (10)$

(the division is applied per coordinate)
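A sketch of update (10) on the earlier quadratic (the objective and step size are our choices for illustration):

    import numpy as np

    A = np.array([[12.0, 8.0], [8.0, 12.0]])  # grad C(w) = A w, minimiser at 0
    w, G = np.array([1.5, 0.5]), np.zeros(2)
    eta, delta = 0.5, 1e-8
    for t in range(500):
        g = A @ w
        G += g ** 2                               # accumulate per-coordinate squared gradients
        w = w - eta * g / (delta + np.sqrt(G))    # update (10), elementwise
    print(w)  # approaches the minimiser as the effective step sizes shrink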
adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:

- I want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded in terms of $\|g_t\|_2$
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \, \cdot \rangle}$
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2$
adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$

For $w_t$ generated using the composite mirror descent update (9),

$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$
Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:

- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]
Improving adaGrad (cont.)

RMSprop (a running average of squared gradients):

$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$

$w_{t+1} = \mathrm{argmin}_w \; \eta \langle w, g_t \rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta \, g_t}{\mathrm{RMS}[g]_t}, \quad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \quad (11)$
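The same quadratic sketch with the RMSprop update (11); note the accumulator is an exponential running average rather than adaGrad's growing sum:

    import numpy as np

    A = np.array([[12.0, 8.0], [8.0, 12.0]])
    w, G = np.array([1.5, 0.5]), np.zeros(2)
    eta, rho, delta = 0.01, 0.9, 1e-8
    for t in range(2000):
        g = A @ w
        G = rho * G + (1.0 - rho) * g ** 2        # running average of squared gradients
        w = w - eta * g / np.sqrt(delta + G)      # divide by RMS[g]_t, per coordinate
    print(w)  # hovers near the optimum at a scale set by the fixed eta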
Improving adaGrad (cont.)

Approximate second-order information (the heuristic behind adaDelta):

$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \quad \Rightarrow \quad \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \quad \Rightarrow \quad \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$
Improving adaGrad (cont.)

adaDelta:

$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

$w_{t+1} = \mathrm{argmin}_w \; \langle w, g_t \rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t \quad (12)$
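A sketch of update (12); unlike adaGrad and RMSprop, no global learning rate $\eta$ appears:

    import numpy as np

    A = np.array([[12.0, 8.0], [8.0, 12.0]])
    w = np.array([1.5, 0.5])
    Eg2, Edw2 = np.zeros(2), np.zeros(2)
    rho, delta = 0.95, 1e-6
    for t in range(2000):
        g = A @ w
        Eg2 = rho * Eg2 + (1.0 - rho) * g ** 2                    # RMS[g]_t accumulator
        step = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g  # update (12)
        Edw2 = rho * Edw2 + (1.0 - rho) * step ** 2               # RMS[dw]_t accumulator
        w = w + step
    print(w)  # heads towards the minimiser; the step scale adapts without any eta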
Improving adaGrad (cont.)

ADAM (momentum plus bias-corrected running averages):

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

$w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \quad (13)$
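A sketch of update (13); the $\beta$ values match the paper's defaults, while the step size is our choice for this tiny problem:

    import numpy as np

    A = np.array([[12.0, 8.0], [8.0, 12.0]])
    w = np.array([1.5, 0.5])
    m, v = np.zeros(2), np.zeros(2)
    eta, beta1, beta2, delta = 0.1, 0.9, 0.999, 1e-8
    for t in range(1, 2001):
        g = A @ w
        m = beta1 * m + (1.0 - beta1) * g         # first moment (momentum)
        v = beta2 * v + (1.0 - beta2) * g ** 2    # second moment
        m_hat = m / (1.0 - beta1 ** t)            # bias corrections
        v_hat = v / (1.0 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + delta)  # update (13)
    print(w)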
Demo time

See the Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

Idea 3: learn the (global) learning rates as well.

$w_{t+1} = w_t - \eta g_t \quad \Rightarrow \quad w_t = w(t, \eta; \{w_1, g_i, i \le t-1\})$

so $w_t$ is a function of $\eta$; learn $\eta$ (with gradient descent) by

$\eta^* = \mathrm{argmin}_\eta \sum_{s \le t} f_s(w(s, \eta))$

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- Stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.
References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo
Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator
7 of 27
A brief history of numerical analysis
Present dayStochastic approximation widely researched and used in practice
I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent
I Neural networksI K-MeansI
I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu
amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))
TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)
8 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Exact solution of gradient descent ODE
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
d
dth(t) =
d
dts(t)minus x2
2
=d
dt〈s(t)minus x s(t)minus x〉
= 2
langds(t)
dt s(t)minus x
rang= minus〈s(t)minus xnablaC (s(t))〉 le 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption
infx h(x)gtK
〈x minus xnablaC (x)〉 gt 0
which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)
to 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent

Proposition: If s_{n+1} = s_n - \varepsilon_n H_n(s_n), with H_n(s_n) an unbiased estimator for \nabla C(s_n), and C satisfies

1. C has a unique minimiser x^*
2. \forall \varepsilon > 0, \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0
3. E[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2 for some A, B \ge 0 independent of n,

then subject to \sum_n \varepsilon_n = \infty and \sum_n \varepsilon_n^2 < \infty, we have s_n \to x^* almost surely.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
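The proposition is easy to see in action. A minimal sketch (the quadratic cost, the Gaussian noise and \varepsilon_n = 1/n are our assumptions; they satisfy the three conditions and the Robbins-Monro step-size conditions):

```python
import numpy as np

rng = np.random.default_rng(1)

# SGD on C(x) = ||x||^2 / 2, whose unique minimiser is x* = 0.
# H_n(x) = x + noise is unbiased for grad C(x) = x, with bounded second moment.
x = np.array([1.5, -0.5])
for n in range(1, 100_001):
    g = x + rng.normal(scale=1.0, size=2)   # unbiased gradient estimate
    x = x - (1.0 / n) * g                   # eps_n = 1/n: sum = inf, sum of squares < inf

print(x)  # close to x* = 0
```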
Stochastic Gradient Descent

1. Define Lyapunov process h_n = \|s_n - x^*\|_2^2
2. Consider the variations h_{n+1} - h_n
3. Show that h_n converges almost surely
4. Show that h_n must converge to 0 almost surely
5. s_n \to x^*
Step 2: Consider the positive variations h_n^+ = \max(0, h_{n+1} - h_n).

By exactly the same calculation as in the deterministic discrete case,

h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2

So

E[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 E[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n]
Step 3: Show that h_n converges almost surely. Exactly the same as for discrete gradient descent:

I Assume E[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2
I Introduce \mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B} and h'_n = \mu_n h_n

Get

E[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A

\implies E\left[ (h'_{n+1} - h'_n) \, \mathbf{1}_{\{E[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n \right] \le \varepsilon_n^2 \mu_n A

\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty \implies quasimartingale convergence \implies (h'_n)_{n=1}^{\infty} converges a.s. \implies (h_n)_{n=1}^{\infty} converges a.s.
Step 4: Show that h_n must converge to 0 almost surely. From previous calculations,

E[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A

(h_n)_{n=1}^{\infty} converges and we assume \sum_{n=1}^{\infty} \varepsilon_n^2 < \infty, so the left-hand side is summable almost surely; the \varepsilon_n^2 A term is summable, so the inner-product term is also summable almost surely:

\sum_{n=1}^{\infty} \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty almost surely.

If we assume in addition that \sum_{n=1}^{\infty} \varepsilon_n = \infty, this forces

\langle s_n - x^*, \nabla C(s_n) \rangle \to 0 almost surely,

and by our initial assumptions about C, (h_n)_{n=1}^{\infty} must converge to 0 almost surely.
Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

I C is a non-negative, three-times differentiable function
I the Robbins-Monro learning rate conditions hold: \sum_{n=1}^{\infty} \varepsilon_n = \infty, \sum_{n=1}^{\infty} \varepsilon_n^2 < \infty
I the gradient estimate H satisfies E[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k for k = 2, 3, 4
I there exists D > 0 such that \inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0

The idea for the proof is then:

1. For a given start position, the sequence (s_n)_{n=1}^{\infty} is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that \nabla C(s_n) necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum; a sketch of this behaviour follows below.
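To make the weaker guarantee concrete, a small sketch under our own assumptions (a 1-D double-well cost with minima at ±1 and a further critical point at 0; the noise scale and step sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Non-convex C(x) = x^4/4 - x^2/2: critical points at -1, 0, +1.
def grad_C(x):
    return x**3 - x

x = 0.3
for n in range(1, 100_001):
    g = grad_C(x) + rng.normal(scale=0.5)   # unbiased noisy gradient
    x -= 0.1 / n**0.7 * g                   # eps_n = 0.1 n^{-0.7}: Robbins-Monro holds

print(x, grad_C(x))  # settles near a critical point; gradient near 0
```

Which critical point is reached depends on the noise realisation; nothing in the argument privileges the global minimum.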
Motivation

So far we've told you why SGD is "safe" :)

but Robbins-Monro is just a sufficient condition,

so how do we choose learning rates to achieve

  fast convergence / a better local optimum?
Motivation (cont.)

Figure from [Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule
  grid search, random search, cross validation, Bayesian optimization

But I'm lazy and I don't want to spend too much time...
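A minimal sketch of the random-search idea (everything here, from the log-uniform range to the toy quadratic standing in for a validation run, is our illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(3)

def val_loss(lr, steps=100):
    # Stand-in for "train with this learning rate, report validation loss":
    # a few gradient steps on C(w) = ||w||^2 / 2.
    w = np.array([1.5, -0.5])
    for _ in range(steps):
        w -= lr * w
    return float(w @ w)

candidates = 10 ** rng.uniform(-4, 0, size=20)   # log-uniform in [1e-4, 1]
best_lr = min(candidates, key=val_loss)
print(best_lr)
```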
Motivation (cont.)

idea 2: let the learning rates adapt themselves
  pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
  adaGrad [Duchi et al., 2010]
  adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
  ADAM [Kingma and Ba, 2014]
  Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning
  a good tutorial: [Shalev-Shwartz, 2012]
  milestone paper: [Zinkevich, 2003]

and some practical comparisons
Online learning

From batch to online learning

Batch learning often assumes:
  we've got the whole dataset D
  the cost function C(w, D) = E_{x \sim D}[c(w, x)]

SGD / mini-batch learning: accelerate training in real time by
  processing one / a mini-batch of datapoints each iteration
  considering gradients with data-point cost functions:

\nabla C(w, D) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)
From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment:

online gradient descent (OGD):
  each iteration t we receive a loss f_t(w)
  and a (noisy) (sub-)gradient g_t \in \partial f_t(w)
  then we update the weights with w \leftarrow w - \eta g_t

for SGD the received gradient is defined by

g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)
From batch to online learning (cont.)

Online learning assumes:
  each iteration t we receive a loss f_t(w)
  (in the batch learning context, f_t(w) = c(w, x_t))
  then we learn w_{t+1} following some rules

Performance is measured by regret:

R_T^* = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w)   (1)

General definition: R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)]

SGD \Leftrightarrow "follow the regularized leader" with L2 regularization
From batch to online learning (cont.)

Follow the Leader (FTL):

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w)   (2)

Lemma (upper bound of regret): Let w_1, w_2, \ldots be the sequence produced by FTL; then \forall u \in S,

R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})]   (3)
From batch to online learning (cont.)

A game that fools FTL: let w \in S = [-1, 1] and the loss function at time t be

f_t(w) = -0.5 w  if t = 1
f_t(w) = w       if t is even
f_t(w) = -w      if t > 1 and t is odd

FTL is easily fooled! (A simulation follows below.)
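A short simulation of this game (our code; the tie-break at t = 1 is arbitrary, as it is for FTL itself): FTL's cumulative loss grows linearly in T, while the fixed choice u = 0 incurs zero loss.

```python
def f(t, w):
    # The alternating loss from the slide.
    if t == 1:
        return -0.5 * w
    return w if t % 2 == 0 else -w

T = 100
coef = 0.0        # cumulative loss so far is the linear function coef * w
ftl_loss = 0.0
for t in range(1, T + 1):
    w_t = -1.0 if coef > 0 else 1.0   # FTL: argmin of coef * w over [-1, 1]
    ftl_loss += f(t, w_t)
    coef += -0.5 if t == 1 else (1.0 if t % 2 == 0 else -1.0)

print(ftl_loss)   # roughly T - 1.5: linear regret against u = 0
```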
From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w)   (4)

Lemma (upper bound of regret): Let w_1, w_2, \ldots be the sequence produced by FTRL; then \forall u \in S,

\sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})]
From batch to online learning (cont.)

Let's assume the losses f_t are convex functions:

\forall w, u \in S: f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)

Use the linearisation \tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle as a surrogate; then

\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)
From batch to online learning (cont.)

Online Gradient Descent (OGD):

w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t)   (5)

From FTRL to online gradient descent:
  use the linearisation as a surrogate loss (upper bounding the LHS)
  use the L2 regularizer \varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2
  apply the FTRL regret bound to the surrogate loss:

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} \langle w, g_i \rangle + \varphi(w)
From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions f_1, f_2, \ldots, f_T with regularizer \varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2; then for all u \in S,

R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)
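A minimal OGD sketch (the absolute-error losses, the unit-ball domain S and all constants are our assumptions; the projection step keeps iterates in S):

```python
import numpy as np

rng = np.random.default_rng(4)

d, T, eta = 5, 2000, 0.05
w = np.zeros(d)
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)            # comparator inside S = unit ball

for t in range(T):
    x = rng.normal(size=d)
    y = float(w_star @ x)
    g = np.sign(float(w @ x) - y) * x       # subgradient of f_t(w) = |<w,x_t> - y_t|
    w = w - eta * g                         # OGD step (5)
    norm = np.linalg.norm(w)
    if norm > 1.0:
        w /= norm                           # project back onto S

print(np.linalg.norm(w - w_star))           # small: low regret against w*
```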
From batch to online learning (cont.)

Keep in mind for the later discussion of adaGrad: if we instead want \varphi to be strongly convex w.r.t. some (semi-)norm \| \cdot \|, then the L2 norm on the gradients in the bound is replaced by the dual norm \| \cdot \|_*, and Hölder's inequality is used in the proof.
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms \psi_t: \psi_t is a strongly-convex function, and

B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle

Primal-dual sub-gradient update:

w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w)   (6)

Proximal gradient / composite mirror descent:

w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t)   (7)
adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

G_t = \sum_{i=1}^{t} g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})

w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w   (8)

w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t)   (9)

To get adaGrad from online gradient descent:
  add a proximal term or a Bregman divergence to the objective
  adapt the proximal term through time:

\psi_t(w) = \frac{1}{2} w^T H_t w
adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on S = R^D with \varphi(w) = 0:

w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T [\delta I + \mathrm{diag}(G_t^{1/2})] (w - w_t)

\Rightarrow w_{t+1} = w_t - \frac{\eta g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}}   (10)

(division and square root taken elementwise)
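A sketch of update (10) in code (the quadratic test objective and hyper-parameter values are our choices):

```python
import numpy as np

def adagrad(grad_fn, w0, eta=0.5, delta=1e-8, T=500):
    # Diagonal adaGrad, update (10): per-coordinate steps eta / (delta + sqrt(G)).
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)                    # running sum of squared gradients
    for _ in range(T):
        g = grad_fn(w)
        G += g * g
        w -= eta * g / (delta + np.sqrt(G))
    return w

print(adagrad(lambda w: w, [1.5, -0.5]))    # minimises ||w||^2 / 2: near 0
```

Coordinates that keep seeing large gradients get their effective step size shrunk fastest; rarely-updated coordinates keep larger steps.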
adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:

I want the contribution of \|g_t\|_*^2 induced by \psi_t to be upper bounded by \|g_t\|_2:
  define \| \cdot \|_{\psi_t} = \sqrt{\langle \cdot, H_t \, \cdot \rangle}
  \psi_t(w) = \frac{1}{2} w^T H_t w is strongly convex w.r.t. \| \cdot \|_{\psi_t}

A little math can show that

\sum_{t=1}^{T} \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2
adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence w_t is generated by adaGrad using the primal-dual update (8) with \delta \ge \max_t \|g_t\|_\infty; then for any u \in S,

R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2

For w_t generated using the composite mirror descent update (9),

R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2
Improving adaGrad

possible problems of adaGrad:
  the global learning rate \eta still needs hand tuning
  sensitive to initial conditions

possible improving directions:
  use a truncated sum / running average
  use (approximate) second-order information
  add momentum
  correct the bias of the running average

existing methods:
  RMSprop [Tieleman and Hinton, 2012]
  adaDelta [Zeiler, 2012]
  ADAM [Kingma and Ba, 2014]
Improving adaGrad (cont.)

RMSprop (use a running average instead of the full truncated sum):

G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}

w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

\Rightarrow w_{t+1} = w_t - \frac{\eta g_t}{RMS[g]_t}, \qquad RMS[g]_t = \mathrm{diag}(H_t)   (11)
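A sketch of update (11) (the hyper-parameter values are common defaults, not from the slides):

```python
import numpy as np

def rmsprop(grad_fn, w0, eta=0.01, rho=0.9, delta=1e-8, T=2000):
    # RMSprop: divide by the root of an exponential running average of g^2.
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)
    for _ in range(T):
        g = grad_fn(w)
        G = rho * G + (1 - rho) * g * g
        w -= eta * g / np.sqrt(delta + G)
    return w

print(rmsprop(lambda w: w, [1.5, -0.5]))    # near 0
```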
Improving adaGrad (cont.)

A second-order motivation (a diagonal, curvature-like rescaling):

\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}

\Rightarrow \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}

\Rightarrow \mathrm{diag}(H) \approx \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}
Improving adaGrad (cont.)

adaDelta (approximate second-order information via the RMS ratio):

\mathrm{diag}(H_t) = \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}

w_{t+1} = \arg\min_{w} \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

\Rightarrow w_{t+1} = w_t - \frac{RMS[\Delta w]_{t-1}}{RMS[g]_t} g_t   (12)
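A sketch of update (12) following Zeiler's two accumulators (the \rho, \delta values are the usual defaults; the test objective is ours):

```python
import numpy as np

def adadelta(grad_fn, w0, rho=0.95, delta=1e-6, T=5000):
    # adaDelta: step = - RMS[dw]_{t-1} / RMS[g]_t * g, with no global learning rate.
    w = np.asarray(w0, dtype=float).copy()
    Eg2 = np.zeros_like(w)                  # running average of g^2
    Edw2 = np.zeros_like(w)                 # running average of (delta w)^2
    for _ in range(T):
        g = grad_fn(w)
        Eg2 = rho * Eg2 + (1 - rho) * g * g
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
        Edw2 = rho * Edw2 + (1 - rho) * dw * dw
        w += dw
    return w

print(adadelta(lambda w: w, [1.5, -0.5]))   # near 0
```

Note there is no \eta at all: the scale of the update comes from the ratio of the two running averages.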
Improving adaGrad (cont.)

ADAM (add momentum, and correct the bias of the running averages):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}   (13)
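A sketch of update (13) with the paper's default hyper-parameters (the test objective is our choice):

```python
import numpy as np

def adam(grad_fn, w0, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8, T=5000):
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)                    # first-moment running average
    v = np.zeros_like(w)                    # second-moment running average
    for t in range(1, T + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)        # bias correction: m, v start at 0
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w

print(adam(lambda w: w, [1.5, -0.5]))       # near 0
```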
Demo time

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

idea 3: learn the (global) learning rates as well

w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta; w_1, \{g_i\}_{i \le t-1})

so w_t is a function of \eta, and we can learn \eta (with gradient descent) by

\eta^* = \arg\min_{\eta} \sum_{s \le t} f_s(w(s, \eta))

  reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
  stochastic gradient descent to learn \eta on the fly [Masse and Ollivier, 2015]
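A minimal sketch in the spirit of learning \eta on the fly (the one-step "hypergradient" approximation, the quadratic loss and the meta step size are all our assumptions, not the papers' exact algorithms):

```python
import numpy as np

rng = np.random.default_rng(5)

w = np.array([1.5, -0.5])
eta, meta_lr = 0.01, 1e-4
g_prev = np.zeros_like(w)

for t in range(2000):
    g = w + rng.normal(scale=0.1, size=2)       # noisy gradient of ||w||^2 / 2
    # Since w_t = w_{t-1} - eta * g_{t-1}, a one-step chain rule gives
    # d f_t(w_t) / d eta ~= <g_t, -g_{t-1}>: so descend on eta as well.
    eta -= meta_lr * float(g @ -g_prev)
    w -= eta * g
    g_prev = g

print(eta, w)   # eta adapts while w heads towards the minimiser
```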
Summary

Recent successes of machine learning rely on stochastic approximation and optimisation methods.

Theoretical analysis guarantees good behaviour.

Adaptive learning rates provide good performance in practice.

Future work will reduce the labour spent tuning hyper-parameters.
References

Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
A brief history of numerical analysis
They provide an iterative scheme for root estimation in this context andprove convergence of the resulting estimate in L2 and in probability
Note also that if we treat f as the gradient of some function F theRobbins-Monro algorithm can be viewed a minimisation procedure
The Robbins-Monro paper establishes stochastic approximation as anarea of numerical analysis in its own right
6 of 27
A brief history of numerical analysis
1952Motivated by the Robbins-Monro paper the year before Kiefer andWolfowitz publish ldquoStochastic Estimation of the Maximisation of aRegression Functionrdquo
Their algorithm is phrased as a maximisation procedure in contrast tothe Robbins-Monro paper and uses central-difference approximationsof the gradient to update the optimum estimator
7 of 27
A brief history of numerical analysis
Present dayStochastic approximation widely researched and used in practice
I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent
I Neural networksI K-MeansI
I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu
amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))
TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)
8 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Exact solution of gradient descent ODE
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
d
dth(t) =
d
dts(t)minus x2
2
=d
dt〈s(t)minus x s(t)minus x〉
= 2
langds(t)
dt s(t)minus x
rang= minus〈s(t)minus xnablaC (s(t))〉 le 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption
infx h(x)gtK
〈x minus xnablaC (x)〉 gt 0
which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)
to 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochastic approximation and optimisation methods.

Theoretical analysis guarantees good behaviour.

Adaptive learning rates provide good performance in practice.

Future work will reduce the labour spent tuning hyper-parameters.
Advanced adaptive learning rates
References
- Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Advanced adaptive learning rates
References
- Tieleman, T., Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.
Advanced adaptive learning rates
References

- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based Hyperparameter Optimization through Reversible Learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
A brief history of numerical analysis
1952: Motivated by the Robbins-Monro paper the year before, Kiefer and Wolfowitz publish "Stochastic Estimation of the Maximisation of a Regression Function".

Their algorithm is phrased as a maximisation procedure, in contrast to the Robbins-Monro paper, and uses central-difference approximations of the gradient to update the optimum estimator.
A brief history of numerical analysis
Present day: Stochastic approximation widely researched and used in practice.

- Original applications in root-finding and optimisation, often in the guise of stochastic gradient descent:
  - Neural networks
  - K-Means
  - ...
- Related ideas are used in:
  - Stochastic variational inference (Hoffman et al. (2013))
  - Pseudo-marginal Metropolis-Hastings (Beaumont (2003), Andrieu & Roberts (2009))
  - Stochastic Gradient Langevin Dynamics (Welling & Teh (2011))

Textbooks:
Stochastic Approximation and Recursive Algorithms and Applications, Kushner & Yin (2003)
Introduction to Stochastic Search and Optimization, Spall (2003)
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysis, and the general problem it seeks to solve has the following form:

Minimise $f(w) = \mathbb{E}[F(w, \xi)]$

(over $w$ in some domain $\mathcal{W}$, and some random variable ξ).

This is an immensely flexible framework:

- $F(w, \xi) = f(w) + \xi$ models experimental / measurement error
- $F(w, \xi) = (w^\top \phi(\xi_X) - \xi_Y)^2$ corresponds to (least squares) linear regression
- if $f(w) = \frac{1}{N}\sum_{i=1}^N g(w, x_i)$ and $F(w, \xi) = \frac{1}{K}\sum_{x \in \xi} g(w, x)$, with ξ a randomly-selected subset (of size K) of a large data set, this corresponds to "stochastic" machine learning algorithms
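A small sketch combining the last two bullets, assuming a synthetic least-squares problem: the subsampled loss $F(w, \xi)$ is a noisy, unbiased estimate of the full objective $f(w)$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                   # data set of N = 1000 points
y = X @ np.ones(5) + 0.1 * rng.normal(size=1000)

def f(w):
    """Full objective: average loss over the whole data set."""
    return np.mean((X @ w - y) ** 2)

def F(w, xi):
    """F(w, xi): the same loss on a random size-K subset xi of the data."""
    return np.mean((X[xi] @ w - y[xi]) ** 2)

w = np.zeros(5)
xi = rng.choice(len(X), size=32, replace=False)  # xi ~ Unif(subsets of size K)
print(f(w), F(w, xi))   # F(w, xi) is a noisy, unbiased estimate of f(w)
```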
Stochastic gradient descent
We'll focus on proof techniques for stochastic gradient descent (SGD).

We'll derive conditions for SGD to converge almost surely. We broadly follow the structure of Leon Bottou's paper².

Plan:

- Continuous gradient descent
- Discrete gradient descent
- Stochastic gradient descent

² Online Learning and Stochastic Approximations - Leon Bottou (1998)
Continuous Gradient Descent

Let $C: \mathbb{R}^k \to \mathbb{R}$ be differentiable. How do solutions $s: \mathbb{R} \to \mathbb{R}^k$ to the following ODE behave?

$$\frac{d}{dt} s(t) = -\nabla C(s(t))$$

Example: $C(x) = 5(x_1 + x_2)^2 + (x_1 - x_2)^2$.

We can analytically solve the gradient descent equation for this example:

$$\nabla C(x) = (12x_1 + 8x_2,\; 8x_1 + 12x_2)$$

So $(s_1', s_2') = -(12s_1 + 8s_2,\; 8s_1 + 12s_2)$, and solving with $(s_1(0), s_2(0)) = (1.5, 0.5)$ gives

$$\begin{pmatrix} s_1(t) \\ s_2(t) \end{pmatrix} = \begin{pmatrix} e^{-20t} + 0.5\, e^{-4t} \\ e^{-20t} - 0.5\, e^{-4t} \end{pmatrix}$$
[Animation: exact solution of the gradient descent ODE]
Continuous Gradient Descent
We'll start with some simplifying assumptions:

1. C has a unique minimiser $x^*$
2. $\forall \varepsilon > 0, \quad \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$

(The second condition is weaker than convexity.)

Proposition: If $s: [0, \infty) \to X$ satisfies the differential equation

$$\frac{ds(t)}{dt} = -\nabla C(s(t))$$

then $s(t) \to x^*$ as $t \to \infty$.

[Figure: example of a non-convex function satisfying these conditions]
Continuous Gradient Descent
1. Define Lyapunov function $h(t) = \|s(t) - x^*\|_2^2$
2. $h(t)$ is a decreasing function
3. $h(t)$ converges to a limit, and $h'(t)$ converges to 0
4. The limit of $h(t)$ must be 0
5. $s(t) \to x^*$
$h(t)$ is a decreasing function:

$$\frac{d}{dt} h(t) = \frac{d}{dt} \|s(t) - x^*\|_2^2 = \frac{d}{dt} \langle s(t) - x^*, s(t) - x^* \rangle = 2 \left\langle \frac{ds(t)}{dt},\; s(t) - x^* \right\rangle = -2\, \langle s(t) - x^*, \nabla C(s(t)) \rangle \le 0$$
The limit of $h(t)$ must be 0: if not, there is some $K > 0$ such that $h(t) > K$ for all $t$. By assumption,

$$\inf_{\|x - x^*\|_2^2 > K} \langle x - x^*, \nabla C(x) \rangle > 0,$$

which means that $h'(t)$ is bounded away from 0 from below, contradicting the convergence of $h'(t)$ to 0.
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this.
Discrete Gradient Descent

An intuitive, straightforward discretisation of

$$\frac{d}{dt} s(t) = -\nabla C(s(t))$$

is

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n) \qquad \text{(Forward Euler)}$$

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method, which is (A-)stable, is the implicit Euler method:

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})$$

So why settle for explicit discretisation? Implicit requires the solution of a non-linear system at each step, and practically, explicit discretisation works.
Discrete Gradient Descent
There are also multistep methods, e.g.

$$s_{n+2} = s_{n+1} - \varepsilon_{n+2} \left( \frac{3}{2} \nabla C(s_{n+1}) - \frac{1}{2} \nabla C(s_n) \right)$$

This is the 2nd-order Adams-Bashforth method.

Similar (in)stability properties to forward Euler, but the discretisation error is of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).
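As a sanity check, a minimal sketch (reusing the quadratic example C from earlier) running forward Euler and 2nd-order Adams-Bashforth with a fixed step ε = 0.01; both iterates should approach the minimiser (0, 0).

```python
import numpy as np

def grad_C(x):
    """Gradient of C(x) = 5(x1 + x2)^2 + (x1 - x2)^2."""
    return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

eps = 0.01
s = np.array([1.5, 0.5])
for _ in range(1000):                   # forward Euler = gradient descent
    s = s - eps * grad_C(s)

s_prev = np.array([1.5, 0.5])
s_curr = s_prev - eps * grad_C(s_prev)  # bootstrap AB2 with one Euler step
for _ in range(999):                    # 2nd-order Adams-Bashforth
    s_next = s_curr - eps * (1.5 * grad_C(s_curr) - 0.5 * grad_C(s_prev))
    s_prev, s_curr = s_curr, s_next

print(s, s_curr)   # both should be very close to the minimiser (0, 0)
```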
Discrete Gradient Descent

[Animations: examples of discretisation with ε = 0.1, ε = 0.07, and ε = 0.01]
Discrete Gradient Descent
Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and C satisfies

1. C has a unique minimiser $x^*$
2. $\forall \varepsilon > 0, \quad \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$
3. $\|\nabla C(x)\|_2^2 \le A + B \|x - x^*\|_2^2$ for some $A, B \ge 0$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
Discrete Gradient Descent
1. Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^*$
Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.

Because of the error that the discretisation introduced, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. [Illustration credit: Online learning and stochastic approximation, Leon Bottou (1998)]
Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. With some algebra, note that

$$h_{n+1} - h_n = \langle s_{n+1} - x^*, s_{n+1} - x^* \rangle - \langle s_n - x^*, s_n - x^* \rangle$$
$$= \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2\, \langle s_{n+1} - s_n, x^* \rangle$$
$$= \langle s_n - \varepsilon_n \nabla C(s_n),\; s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2 \varepsilon_n \langle \nabla C(s_n), x^* \rangle$$
$$= -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$$
Show that $\sum_{n=1}^\infty h_n^+ < \infty$: assuming $\|\nabla C(x)\|_2^2 \le A + B \|x - x^*\|_2^2$, we get

$$h_{n+1} - h_n \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + B h_n)$$
$$\Rightarrow\quad h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$$
Assuming $\|\nabla C(x)\|_2^2 \le A + B \|x - x^*\|_2^2$, we had

$$h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le \varepsilon_n^2 A.$$

Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$, we get

$$h_{n+1}' - h_n' \le \varepsilon_n^2 \mu_n A.$$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable.
Show that this implies that $h_n$ converges: if $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why?).

But $h_{n+1} = h_0 + \sum_{k=1}^n \left( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \right)$, so $(h_n)_{n=1}^\infty$ converges.
Show that $h_n$ must converge to 0: assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2.$$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0,$$

contradicting $h_n$ converging away from 0.
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If C is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x)$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
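To connect the upcoming proposition to practice, a small sketch on the same quadratic example: single noisy gradient evaluations with Robbins-Monro steps $\varepsilon_n = c/n$ (so $\sum \varepsilon_n = \infty$ and $\sum \varepsilon_n^2 < \infty$); the noise scale and the constant c are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_C(x):
    return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

def H(x):
    """Unbiased gradient estimate: the true gradient plus zero-mean noise."""
    return grad_C(x) + rng.normal(scale=1.0, size=2)

s = np.array([1.5, 0.5])
for n in range(1, 10001):
    eps_n = 0.05 / n        # Robbins-Monro: sum eps_n = inf, sum eps_n^2 < inf
    s = s - eps_n * H(s)

print(s)   # drifts towards the minimiser (0, 0)
```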
Measure-theory-free martingale theory
Let $(X_n)_{n \ge 0}$ be a stochastic process. $\mathcal{F}_n$ denotes "the information describing the stochastic process up to time n", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if

- $\mathbb{E}[|X_n|] < \infty$ for all n
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\left[ \left( \mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n \right) \mathbb{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}} \right] < \infty,$$

then $X_n \to X_\infty$ almost surely.

³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all n.
Stochastic Gradient Descent
Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and C satisfies

1. C has a unique minimiser $x^*$
2. $\forall \varepsilon > 0, \quad \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of n

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$ almost surely.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
Stochastic Gradient Descent
1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$ almost surely
Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2.$$

So

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2\, \mathbb{E}\left[ \|H_n(s_n)\|_2^2 \mid \mathcal{F}_n \right].$$
Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

- assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2$
- introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$

We get

$$\mathbb{E}[h_{n+1}' - h_n' \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A$$
$$\Rightarrow\quad \mathbb{E}\left[ (h_{n+1}' - h_n')\, \mathbb{1}_{\{\mathbb{E}[h_{n+1}' - h_n' \mid \mathcal{F}_n] > 0\}} \;\middle|\; \mathcal{F}_n \right] \le \varepsilon_n^2 \mu_n A.$$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\Rightarrow$ quasimartingale convergence $\Rightarrow$ $(h_n')_{n=1}^\infty$ converges a.s. $\Rightarrow$ $(h_n)_{n=1}^\infty$ converges a.s.
Show that $h_n$ must converge to 0 almost surely. From previous calculations,

$$\mathbb{E}\left[ h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n \right] \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A.$$

$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the right term is summable a.s., and hence the left term is also summable a.s.:

$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty \quad \text{almost surely.}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely,}$$

and by our initial assumptions about C, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- C is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- the gradient estimate H satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
Motivation
So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. Then how do we choose learning rates to achieve

- fast convergence
- a better local optimum?
Motivation (cont)
[Figure from Bergstra and Bengio, 2012]

Idea 1: search for the best learning rate schedule: grid search, random search, cross-validation, Bayesian optimization.

But I'm lazy and I don't want to spend too much time.
Motivation (cont)
Idea 2: let the learning rates adapt themselves.

- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning (a good tutorial: [Shalev-Shwartz, 2012]; milestone paper: [Zinkevich, 2003]), and some practical comparisons.
Online learning
From batch to online learning
Batch learning often assumes:

- we've got the whole dataset D
- the cost function is $C(w, D) = \mathbb{E}_{x \sim D}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by

- processing one datapoint / a mini-batch of datapoints each iteration
- considering gradients of the per-datapoint cost functions:

$$\nabla C(w, D) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$
Online learning
From batch to online learning (cont)
Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD): each iteration t,

- we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by

$$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$
Online learning
From batch to online learning (cont)
Online learning assumes: each iteration t, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$), then we learn $w_{t+1}$ following some rules.

Performance is measured by regret:

$$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \qquad (1)$$

General definition: $R_T(u) = \sum_{t=1}^T \left[ f_t(w_t) - f_t(u) \right]$.

SGD ⇔ "follow the regularized leader" with L2 regularization.
Online learning
From batch to online learning (cont)
Follow the Leader (FTL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) \qquad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL. Then for all $u \in S$,

$$R_T(u) = \sum_{t=1}^T \left[ f_t(w_t) - f_t(u) \right] \le \sum_{t=1}^T \left[ f_t(w_t) - f_t(w_{t+1}) \right] \qquad (3)$$
Online learning
From batch to online learning (cont)
A game that fools FTL: let $w \in S = [-1, 1]$, and the loss function at time t be

$$f_t(w) = \begin{cases} -0.5\, w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled: the minimiser of the cumulative loss alternates between the endpoints ±1, so FTL pays loss 1 every round while the fixed point u = 0 pays nothing, giving linear regret.
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL. Then for all $u \in S$,

$$\sum_{t=1}^T \left[ f_t(w_t) - f_t(u) \right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[ f_t(w_t) - f_t(w_{t+1}) \right]$$
Online learning
From batch to online learning (cont)
Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S, \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)$$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w) \rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$$
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \qquad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (upper-bounding the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$$
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \qquad g_t \in \partial f_t(w_t) \qquad (5)$$

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$, with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$. Then for all $u \in S$,

$$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t)$$
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t),\, w - w_t \rangle$$

Primal-dual subgradient update:

$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent update:

$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$$

$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w \qquad (8)$$

$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2} w^T H_t w$$
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T \left[ \delta I + \mathrm{diag}(G_t)^{1/2} \right] (w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \qquad (10)$$
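A minimal sketch of the diagonal update (10); `grad` and `w0` are placeholders.

```python
import numpy as np

def adagrad(grad, w0, eta=0.1, delta=1e-8, n_steps=1000):
    """Sketch of eq. (10): per-coordinate steps eta / (delta + sqrt(G_t)),
    where G_t accumulates the squared gradients seen so far."""
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)            # running sum of squared gradients
    for _ in range(n_steps):
        g = grad(w)
        G += g**2
        w -= eta * g / (delta + np.sqrt(G))
    return w
```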
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY does this work? Time-varying learning rate vs. time-varying proximal term: we want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper-bounded by $\|g_t\|_2$ terms.

Define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot \rangle}$; then $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$, and a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2$$
A brief history of numerical analysis
Present dayStochastic approximation widely researched and used in practice
I Original applications in root-finding and optimisation often in theguise of stochastic gradient descent
I Neural networksI K-MeansI
I Related ideas are used in I Stochastic variational inference (Hoffman et al (2013))I Psuedo-marginal Metropolis-Hastings (Beaumont (2003) Andrieu
amp Roberts (2009))I Stochastic Gradient Langevin Dynamics (Welling amp Teh (2011))
TextbooksStochastic Approximation and Recursive Algorithms and ApplicationsKushner amp Yin (2003)Introduction to Stochastic Search and Optimization Spall (2003)
8 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Modern-day formulation
Stochastic approximation is now a mature area of numerical analysisand the general problem it seeks to solve has the following form
Minimise f (w) = E [F (w ξ)]
(over w in some domain W and some random variable ξ)
This is an immensely flexible framework
I F (w ξ) = f (w) + ξ models experimentalmeasurement error
I F (x ξ) = (wgtφ(ξX)minus ξY)2 corresponds to (least squares) linearregression
I If f (w) = 1N
sumNi=1 g(w xi ) and F (w ξ) = 1
Ks
sumxisinξ g(w x) with
ξ a randomly-selected subset (of size K ) of large data setcorresponds to ldquostochasticrdquo machine learning algorithms
9 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Exact solution of gradient descent ODE
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
d
dth(t) =
d
dts(t)minus x2
2
=d
dt〈s(t)minus x s(t)minus x〉
= 2
langds(t)
dt s(t)minus x
rang= minus〈s(t)minus xnablaC (s(t))〉 le 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption
infx h(x)gtK
〈x minus xnablaC (x)〉 gt 0
which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)
to 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. (Illustration credit: Online Learning and Stochastic Approximations, Léon Bottou, 1998.)
Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

With some algebra, note that

\[
\begin{aligned}
h_{n+1} - h_n &= \langle s_{n+1} - x^\star, s_{n+1} - x^\star\rangle - \langle s_n - x^\star, s_n - x^\star\rangle \\
&= \langle s_{n+1}, s_{n+1}\rangle - \langle s_n, s_n\rangle - 2\langle s_{n+1} - s_n, x^\star\rangle \\
&= \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n)\rangle - \langle s_n, s_n\rangle + 2\varepsilon_n\langle \nabla C(s_n), x^\star\rangle \\
&= -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2\|\nabla C(s_n)\|_2^2
\end{aligned}
\]
Step 3: Show that $\sum_{n=1}^\infty h_n^+ < \infty$.

Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2$, we get

\[ h_{n+1} - h_n \le -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2(A + Bh_n) \]
\[ \implies h_{n+1} - (1 + \varepsilon_n^2 B)h_n \le -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A \]
Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, the bound $h_{n+1} - (1 + \varepsilon_n^2 B)h_n \le \varepsilon_n^2 A$ gives

\[ h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A \]

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the right-hand side is summable; hence so are the positive variations $h_n^+$.
Step 4: Show that this implies that $h_n$ converges.

If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why? $h_n \ge 0$, so the total 'down' movement is bounded by $h_1$ plus the total 'up' movement).

But $h_{n+1} = h_1 + \sum_{k=1}^n \big(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\big)$, and each of the two sums converges.

So $(h_n)_{n=1}^\infty$ converges.
Step 5: Show that $h_n$ must converge to 0.

Assume instead that $h_n$ converges to some positive number. Previously we had

\[ h_{n+1} - h_n = -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2\|\nabla C(s_n)\|_2^2 \]

This is summable, and the $\varepsilon_n^2$ term is summable on its own, so $\sum_n \varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle < \infty$. If we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

\[ \langle s_n - x^\star, \nabla C(s_n)\rangle \to 0 \]

contradicting (via condition 2) $h_n$ converging away from 0.
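As a quick numerical sanity check of the proposition (the test function below is an assumption chosen to satisfy the conditions, not taken from the slides):

import numpy as np

# C(x) = sqrt(1 + ||x||^2) has unique minimiser x* = 0 and a gradient
# x / sqrt(1 + ||x||^2) satisfying conditions 2 and 3; the step sizes
# eps_n = 1/n satisfy the two Robbins-Monro summability conditions.
def grad_C(x):
    return x / np.sqrt(1.0 + x @ x)

s = np.array([3.0, -2.0])
for n in range(1, 100001):
    s = s - (1.0 / n) * grad_C(s)
print(np.linalg.norm(s) ** 2)   # the Lyapunov quantity h_n, slowly heading to 0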
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

\[ \nabla C(x) = \frac{1}{N}\sum_{n=1}^N f_n(x) \]

and an approximation is formed by subsampling:

\[ \widehat{\nabla C}(x) = \frac{1}{K}\sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K) \]

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
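A small sketch of why subsampling is admissible here: with toy stand-ins for the per-datapoint gradients $f_n$, the size-$K$ subset average matches the full average in expectation.

import numpy as np

# Toy check of unbiasedness of the subsampled gradient estimator.
rng = np.random.default_rng(0)
N, D, K = 1000, 5, 32
F = rng.normal(size=(N, D))               # F[n] plays the role of f_n(x)

full = F.mean(axis=0)                     # the true gradient at this x
draws = [F[rng.choice(N, size=K, replace=False)].mean(axis=0)
         for _ in range(20000)]
print(np.abs(np.mean(draws, axis=0) - full).max())   # small: unbiased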
Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process. $\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if

- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

\[ \sum_{n=1}^\infty \mathbb{E}\left[(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\,\mathbf{1}_{\mathbb{E}[X_{n+1}\mid\mathcal{F}_n] - X_n > 0}\right] < \infty \]

then $X_n \to X_\infty$ almost surely.

³ On a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.
Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x)\rangle > 0$;
3. $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
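A minimal SGD sketch matching the proposition's conditions (the quadratic objective and Gaussian noise model are assumptions of this sketch):

import numpy as np

# H_n(s) = grad C(s) + Gaussian noise is unbiased, and
# E||H_n(x)||^2 <= A + B||x - x*||^2 holds since the noise variance is fixed.
rng = np.random.default_rng(1)
Q = np.array([[12.0, 8.0], [8.0, 12.0]])        # grad C(x) = Qx, x* = 0

s = np.array([1.5, 0.5])
for n in range(1, 100001):
    eps_n = 0.5 / n                              # Robbins-Monro schedule
    H_n = Q @ s + rng.normal(scale=2.0, size=2)  # unbiased gradient estimate
    s = s - eps_n * H_n
print(np.linalg.norm(s))                         # close to 0, almost surely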
Stochastic Gradient Descent

1. Define Lyapunov process $h_n = \|s_n - x^\star\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^\star$
Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

\[ h_{n+1} - h_n = -2\varepsilon_n\langle s_n - x^\star, H_n(s_n)\rangle + \varepsilon_n^2\|H_n(s_n)\|_2^2 \]

So

\[ \mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2\,\mathbb{E}\left[\|H_n(s_n)\|_2^2 \,\middle|\, \mathcal{F}_n\right] \]
Step 3: Show that $h_n$ converges almost surely.

Exactly the same as for discrete gradient descent:

- Assume $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^\star\|_2^2$.
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.

We get

\[ \mathbb{E}\left[h'_{n+1} - h'_n \,\middle|\, \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A \]
\[ \implies \mathbb{E}\left[(h'_{n+1} - h'_n)\,\mathbf{1}_{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0} \,\middle|\, \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A \]

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h'_n)_{n=1}^\infty$ converges a.s. $\implies$ $(h_n)_{n=1}^\infty$ converges a.s.
Step 4: Show that $h_n$ must converge to 0 almost surely.

From previous calculations,

\[ \mathbb{E}\left[h_{n+1} - (1 + \varepsilon_n^2 B)h_n \,\middle|\, \mathcal{F}_n\right] \le -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2 A \]

$(h_n)_{n=1}^\infty$ converges, so the left-hand sequence is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the right term is summable a.s., and hence the inner-product term is also summable a.s.:

\[ \sum_{n=1}^\infty \varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle < \infty \quad \text{almost surely} \]

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

\[ \langle s_n - x^\star, \nabla C(s_n)\rangle \to 0 \quad \text{almost surely} \]

and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.
Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.
Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. So how do we choose learning rates to achieve

- fast convergence?
- a better local optimum?
Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

Idea 1: search for the best learning rate schedule: grid search, random search, cross validation, Bayesian optimization.

But I'm lazy and I don't want to spend too much time!
Motivation (cont.)

Idea 2: let the learning rates adapt themselves.

- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and some practical comparisons.
Online learning
From batch to online learning

Batch learning often assumes:

- we've got the whole dataset $\mathcal{D}$
- the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD/mini-batch learning accelerates training in real time by processing one datapoint / a mini-batch of datapoints each iteration, considering gradients with data-point cost functions:

\[ \nabla C(w, \mathcal{D}) \approx \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m) \]
From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD): at each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$, then we update the weights with $w \leftarrow w - \eta g_t$.

For SGD, the received gradient is defined by

\[ g_t = \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m) \]
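A generic OGD loop looks like the following sketch (names are illustrative, not from any library):

def ogd(grad_fns, w0, eta):
    # The learner commits to w_t, only then sees f_t through a (sub-)gradient.
    w, iterates = float(w0), []
    for g_fn in grad_fns:          # g_fn maps w to some g_t in df_t(w)
        iterates.append(w)
        w = w - eta * g_fn(w)      # w <- w - eta * g_t
    return iterates

# e.g. quadratic losses f_t(w) = 0.5 * (w - c_t)^2 arriving one at a time
cs = [1.0, 2.0, 0.5, 1.5]
print(ogd([lambda w, c=c: w - c for c in cs], w0=0.0, eta=0.3))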
From batch to online learning (cont.)

Online learning assumes: at each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$), then we learn $w_{t+1}$ following some rules.

Performance is measured by regret:

\[ R^\star_T = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S}\sum_{t=1}^T f_t(w) \qquad (1) \]

General definition: $R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.
From batch to online learning (cont.)

Follow the Leader (FTL):

\[ w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) \qquad (2) \]

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,

\[ R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right] \qquad (3) \]
From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be

\[ f_t(w) = \begin{cases} -0.5w, & t = 1 \\ w, & t \text{ is even} \\ -w, & t > 1 \text{ and } t \text{ is odd} \end{cases} \]

FTL is easily fooled!
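Simulating the game confirms the linear regret (a sketch; T = 100 is arbitrary):

# Losses are f_t(w) = a_t * w on S = [-1, 1]; FTL plays the minimiser of the
# cumulative past loss, which is always an endpoint of the interval.
T = 100
a = [-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, T + 1)]

ftl_loss, cum = 0.0, 0.0
for a_t in a:
    w = -1.0 if cum > 0 else 1.0          # argmin of w * (sum of past a_i)
    ftl_loss += a_t * w
    cum += a_t
best_fixed = min(sum(a) * w for w in (-1.0, 1.0))   # best single w in hindsight
print(ftl_loss - best_fixed)    # regret grows linearly in T (about T here)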
From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

\[ w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4) \]

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,

\[ \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right] \]
From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

\[ \forall w, u \in S, \quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle \quad \forall g(w) \in \partial f_t(w) \]

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t)\rangle$ as a surrogate:

\[ \sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t\rangle - \langle u, g_t\rangle, \quad \forall g_t \in \partial f_t(w_t) \]
From batch to online learning (cont.)

Online Gradient Descent (OGD):

\[ w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5) \]

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (it upper-bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss

\[ w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i\rangle + \varphi(w) \]

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,

\[ R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta\sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t) \]
Keep in mind for the later discussion of adaGrad: I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$; then the L2 norm above will be changed to the dual norm $\|\cdot\|_\star$, and we use Hölder's inequality for the proof.
Advanced adaptive learning rates
adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

\[ B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle \]

Primal-dual sub-gradient update:

\[ w_{t+1} = \arg\min_w \; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \qquad (6) \]

Proximal gradient / composite mirror descent:

\[ w_{t+1} = \arg\min_w \; \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \qquad (7) \]
adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

\[ G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2}) \]

\[ w_{t+1} = \arg\min_{w \in S} \; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t} w^T H_t w \qquad (8) \]

\[ w_{t+1} = \arg\min_{w \in S} \; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \qquad (9) \]

To get adaGrad from online gradient descent: add a proximal term or a Bregman divergence to the objective, and adapt the proximal term through time:

\[ \psi_t(w) = \frac{1}{2} w^T H_t w \]
adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

\[ w_{t+1} = \arg\min_w \; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T \left[\delta I + \mathrm{diag}(G_t)^{1/2}\right](w - w_t) \]

\[ \implies w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \quad \text{(elementwise)} \qquad (10) \]
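Update (10) is easy to state in code; the following sketch runs it on an assumed toy quadratic (the step size and iteration count are arbitrary choices of this sketch):

import numpy as np

def adagrad(grad, w0, eta=0.5, delta=1e-8, n_steps=500):
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)               # running sum of squared gradients
    for _ in range(n_steps):
        g = grad(w)
        G += g * g
        w = w - eta * g / (delta + np.sqrt(G))   # per-coordinate step sizes
    return w

Q = np.array([[12.0, 8.0], [8.0, 12.0]])
print(adagrad(lambda w: Q @ w, [1.5, 0.5]))      # close to the minimiser (0, 0)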
adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs time-varying proximal term.

I want the contribution of $\|g_t\|_\star^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$:

- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot\rangle}$
- $\psi_t(w) = \frac{1}{2}w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

\[ \sum_{t=1}^T \|g_t\|_{\psi_t^\star}^2 \le 2\sum_{d=1}^D \|g_{1:T,d}\|_2 \]
adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

\[ R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2 \]

For $w_t$ generated using the composite mirror descent update (9),

\[ R_T(u) \le \frac{1}{2\eta}\max_{t \le T}\|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2 \]
Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:

- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]
Improving adaGrad (cont.)

RMSprop:

\[ G_t = \rho G_{t-1} + (1 - \rho)g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2} \]

\[ w_{t+1} = \arg\min_w \; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \]

\[ \implies w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11) \]
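A sketch of update (11) on the same kind of toy quadratic (the hyper-parameter values are common defaults, assumed rather than taken from the slides):

import numpy as np

def rmsprop(grad, w0, eta=0.01, rho=0.9, delta=1e-8, n_steps=3000):
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)                  # running average of squared gradients
    for _ in range(n_steps):
        g = grad(w)
        G = rho * G + (1.0 - rho) * g * g
        w = w - eta * g / np.sqrt(G + delta)   # divide by RMS[g]_t
    return w

Q = np.array([[12.0, 8.0], [8.0, 12.0]])
print(rmsprop(lambda w: Q @ w, [1.5, 0.5]))    # hovers near the minimiser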
Improving adaGrad (cont.)

\[ \Delta w \propto H^{-1} g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2} \;\implies\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w} \;\implies\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}} \]
Improving adaGrad (cont.)

adaDelta:

\[ \mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}} \]

\[ w_{t+1} = \arg\min_w \; \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \]

\[ \implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12) \]
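A sketch of update (12); note that no global learning rate η appears (the ρ and δ values are assumptions):

import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, n_steps=5000):
    w = np.asarray(w0, dtype=float)
    Eg2 = np.zeros_like(w)      # running average of g_t^2
    Edw2 = np.zeros_like(w)     # running average of (delta w_t)^2
    for _ in range(n_steps):
        g = grad(w)
        Eg2 = rho * Eg2 + (1.0 - rho) * g * g
        dw = -(np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta)) * g
        Edw2 = rho * Edw2 + (1.0 - rho) * dw * dw
        w = w + dw
    return w

Q = np.array([[12.0, 8.0], [8.0, 12.0]])
print(adadelta(lambda w: Q @ w, [1.5, 0.5]))   # approaches (0, 0) with no eta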
Improving adaGrad (cont.)

ADAM:

\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1)g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^2 \]

\[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \]

\[ w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13) \]
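A sketch of update (13), with the bias correction made explicit (hyper-parameter values are the usual defaults, assumed here):

import numpy as np

def adam(grad, w0, eta=0.1, beta1=0.9, beta2=0.999, delta=1e-8, n_steps=2000):
    w = np.asarray(w0, dtype=float)
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, n_steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g          # first moment (momentum)
        v = beta2 * v + (1.0 - beta2) * g * g      # second moment
        m_hat = m / (1.0 - beta1 ** t)             # bias-corrected averages
        v_hat = v / (1.0 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + delta)
    return w

Q = np.array([[12.0, 8.0], [8.0, 12.0]])
print(adam(lambda w: Q @ w, [1.5, 0.5]))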
Demo time

Wikipedia page "Test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

Idea 3: learn the (global) learning rate as well.

\[ w_{t+1} = w_t - \eta g_t \;\implies\; w_t = w(t, \eta, \{w_1, (g_i)_{i \le t-1}\}) \]

$w_t$ is a function of $\eta$, so learn $\eta$ (with gradient descent) by

\[ \eta^\star = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta)) \]

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
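A rough sketch of the on-the-fly flavour of this idea, under a strong simplifying assumption (truncating the dependence of $w_t$ on $\eta$ to a single step, which is not the actual algorithm of either paper): then $\partial f_t(w_t)/\partial\eta \approx -\langle g_t, g_{t-1}\rangle$, and $\eta$ can itself be updated by gradient descent.

import numpy as np

# One-step truncation: w_t = w_{t-1} - eta * g_{t-1}, so the hypergradient
# of f_t(w_t) w.r.t. eta is approximately -<g_t, g_{t-1}>.
Q = np.array([[12.0, 8.0], [8.0, 12.0]])
w = np.array([1.5, 0.5])
eta, beta = 1e-3, 1e-5
g_prev = np.zeros(2)
for _ in range(2000):
    g = Q @ w
    eta = max(eta + beta * (g @ g_prev), 0.0)   # hypergradient step on eta
    w = w - eta * g
    g_prev = g
print(eta, np.linalg.norm(w))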
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour spent on tuning hyper-parameters.
References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281-305, 2012.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
- Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
From batch to online learning (cont.)

Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f_1, f_2, ..., f_T with regularizer \varphi(w) = \frac{1}{2\eta} ||w - w_1||_2^2; then for all u \in S,

R_T(u) \le \frac{1}{2\eta} ||u - w_1||_2^2 + \eta \sum_{t=1}^{T} ||g_t||_2^2, \quad g_t \in \partial f_t(w_t)

Keep in mind for the later discussion of adaGrad: if we want \varphi to be strongly convex w.r.t. some (semi-)norm || \cdot ||, then the L2 norm on the gradients is replaced by the dual norm || \cdot ||_*, and Hölder's inequality is used in the proof.
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms \psi_t: \psi_t is a strongly-convex function, and

B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle

Primal-dual sub-gradient update:

w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w)    (6)

Proximal gradient / composite mirror descent:

w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t)    (7)
adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

G_t = \sum_{i=1}^{t} g_i g_i^T, \quad H_t = \delta I + \mathrm{diag}(G_t^{1/2})

w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w    (8)

w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t)    (9)

To get adaGrad from online gradient descent: add a proximal term or a Bregman divergence to the objective, and adapt the proximal term through time:

\psi_t(w) = \frac{1}{2} w^T H_t w
adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on S = R^D with \varphi(w) = 0:

w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T [\delta I + \mathrm{diag}(G_t^{1/2})] (w - w_t)

\Rightarrow \quad w_{t+1} = w_t - \frac{\eta g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}}    (10)
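A minimal sketch (ours) of the diagonal adaGrad update (10); the toy loss and hyper-parameter values are illustrative choices:

```python
import numpy as np

def adagrad(grad_fn, w0, eta=0.1, delta=1e-8, num_steps=500):
    """Diagonal adaGrad, update (10): per-coordinate steps shrink with the
    root of the accumulated squared gradients."""
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)                    # running sum of squared gradients
    for t in range(num_steps):
        g = grad_fn(w, t)
        G += g ** 2
        w -= eta * g / (delta + np.sqrt(G))
    return w

# Toy usage: a poorly scaled quadratic, f(w) = w_0^2 + 100 w_1^2.
quad_grad = lambda w, t: np.array([2.0 * w[0], 200.0 * w[1]])
print(adagrad(quad_grad, w0=[1.0, 1.0]))   # ends near the minimiser at the origin
```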
adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term: we want the contribution of ||g_t||_*^2 induced by \psi_t to be upper-bounded using \ell_2 norms of the gradients.

Define || \cdot ||_{\psi_t} = \sqrt{\langle \cdot, H_t \, \cdot \rangle}. Then \psi_t(w) = \frac{1}{2} w^T H_t w is strongly convex w.r.t. || \cdot ||_{\psi_t}, and a little math can show that

\sum_{t=1}^{T} ||g_t||_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} ||g_{1:T,d}||_2
adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper)
Assume the sequence {w_t} is generated by adaGrad using the primal-dual update (8), with \delta \ge \max_t ||g_t||_\infty; then for any u \in S,

R_T(u) \le \frac{\delta}{\eta} ||u||_2^2 + \frac{1}{\eta} ||u||_\infty^2 \sum_{d=1}^{D} ||g_{1:T,d}||_2 + \eta \sum_{d=1}^{D} ||g_{1:T,d}||_2

For {w_t} generated using the composite mirror descent update (9),

R_T(u) \le \frac{1}{2\eta} \max_{t \le T} ||u - w_t||_2^2 \sum_{d=1}^{D} ||g_{1:T,d}||_2 + \eta \sum_{d=1}^{D} ||g_{1:T,d}||_2
Improving adaGrad

Possible problems of adaGrad:
- the global learning rate \eta still needs hand tuning;
- it is sensitive to initial conditions.

Possible improving directions:
- use a truncated sum or running average instead of the full sum;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods: RMSprop [Tieleman and Hinton, 2012], adaDelta [Zeiler, 2012], ADAM [Kingma and Ba, 2014].
Improving adaGrad (cont.)

RMSprop:

G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}

w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

\Rightarrow \quad w_{t+1} = w_t - \frac{\eta g_t}{RMS[g]_t}, \quad RMS[g]_t = \mathrm{diag}(H_t)    (11)

(Improving direction used here: a running average instead of the full sum.)
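A matching sketch (ours) of the RMSprop update (11), with illustrative default hyper-parameters:

```python
import numpy as np

def rmsprop(grad_fn, w0, eta=0.01, rho=0.9, delta=1e-8, num_steps=500):
    """RMSprop, update (11): adaGrad's accumulated sum is replaced by an
    exponential running average of squared gradients."""
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)                     # running average of g^2
    for t in range(num_steps):
        g = grad_fn(w, t)
        G = rho * G + (1.0 - rho) * g ** 2
        w -= eta * g / np.sqrt(delta + G)    # divide by RMS[g]_t
    return w
```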
Improving adaGrad (cont.)

A second-order heuristic: Newton's method gives

\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}
\quad \Rightarrow \quad \partial^2 f / \partial w^2 \propto \frac{\partial f / \partial w}{\Delta w}
\quad \Rightarrow \quad \mathrm{diag}(H) \approx \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}
Improving adaGrad (cont.)

adaDelta:

\mathrm{diag}(H_t) = \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}

w_{t+1} = \arg\min_{w} \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

\Rightarrow \quad w_{t+1} = w_t - \frac{RMS[\Delta w]_{t-1}}{RMS[g]_t} g_t    (12)

(Improving directions used here: a running average plus approximate second-order information.)
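A matching sketch (ours) of the adaDelta update (12); note that there is no global learning rate \eta:

```python
import numpy as np

def adadelta(grad_fn, w0, rho=0.95, delta=1e-6, num_steps=500):
    """adaDelta, update (12): the step size is the ratio of the running RMS of
    past updates to the running RMS of gradients, so no global eta is needed."""
    w = np.asarray(w0, dtype=float)
    Eg2 = np.zeros_like(w)      # running average of g^2
    Edw2 = np.zeros_like(w)     # running average of (Delta w)^2
    for t in range(num_steps):
        g = grad_fn(w, t)
        Eg2 = rho * Eg2 + (1.0 - rho) * g ** 2
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
        Edw2 = rho * Edw2 + (1.0 - rho) * dw ** 2
        w += dw
    return w
```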
Improving adaGrad (cont.)

ADAM:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

w_{t+1} = w_t - \frac{\eta \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}    (13)

(Improving directions used here: running averages, momentum, and bias correction.)
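And a matching sketch (ours) of the ADAM update (13), with the bias-corrected first and second moments:

```python
import numpy as np

def adam(grad_fn, w0, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8, num_steps=500):
    """ADAM, update (13): momentum on the gradient (m_t) and a running average
    of squared gradients (v_t), both corrected for their zero initialisation."""
    w = np.asarray(w0, dtype=float)
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, num_steps + 1):
        g = grad_fn(w, t)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)       # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w
```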
Demo time

See the Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

Idea 3: learn the (global) learning rates as well.

w_{t+1} = w_t - \eta g_t \quad \Rightarrow \quad w_t = w(t, \eta; w_1, \{g_i\}_{i \le t-1})

So w_t is a function of \eta, and we can learn \eta (with gradient descent) by

\eta^* = \arg\min_{\eta} \sum_{s \le t} f_s(w(s, \eta))

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- Stochastic gradient descent to learn \eta on the fly [Masse and Ollivier, 2015]
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.
References

- Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T., Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based Hyperparameter Optimization through Reversible Learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Modern-day formulation

Stochastic approximation is now a mature area of numerical analysis, and the general problem it seeks to solve has the following form:

Minimise f(w) = E[F(w, \xi)]

(over w in some domain W, and some random variable \xi).

This is an immensely flexible framework:
- F(w, \xi) = f(w) + \xi models experimental/measurement error.
- F(w, \xi) = (w^T \phi(\xi_X) - \xi_Y)^2 corresponds to (least squares) linear regression.
- If f(w) = \frac{1}{N} \sum_{i=1}^{N} g(w, x_i) and F(w, \xi) = \frac{1}{K} \sum_{x \in \xi} g(w, x), with \xi a randomly-selected subset (of size K) of a large data set, this corresponds to "stochastic" machine learning algorithms.
Stochastic gradient descent

We'll focus on proof techniques for stochastic gradient descent (SGD). We'll derive conditions for SGD to converge almost surely, broadly following the structure of Leon Bottou's paper.²

Plan:
- Continuous gradient descent
- Discrete gradient descent
- Stochastic gradient descent

²Online Learning and Stochastic Approximations - Leon Bottou (1998)
Continuous Gradient Descent

Let C: R^k \to R be differentiable. How do solutions s: R \to R^k to the following ODE behave?

\frac{d}{dt} s(t) = -\nabla C(s(t))

Example: C(x) = 5(x_1 + x_2)^2 + (x_1 - x_2)^2. We can analytically solve the gradient descent equation for this example:

\nabla C(x) = (12 x_1 + 8 x_2, \; 8 x_1 + 12 x_2)

So (s_1', s_2') = -(12 s_1 + 8 s_2, \; 8 s_1 + 12 s_2), and solving with (s_1(0), s_2(0)) = (1.5, 0.5) gives

\begin{pmatrix} s_1(t) \\ s_2(t) \end{pmatrix} = \begin{pmatrix} e^{-20t} + 0.5\, e^{-4t} \\ e^{-20t} - 0.5\, e^{-4t} \end{pmatrix}
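As a quick numerical sanity check (our own sketch, not part of the slides; it assumes NumPy and SciPy are available), we can integrate the gradient-flow ODE for this example and compare against the closed form above:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Gradient flow ds/dt = -grad C(s) for C(x) = 5(x1+x2)^2 + (x1-x2)^2.
grad_C = lambda x: np.array([12*x[0] + 8*x[1], 8*x[0] + 12*x[1]])
sol = solve_ivp(lambda t, x: -grad_C(x), t_span=(0.0, 2.0),
                y0=[1.5, 0.5], dense_output=True, rtol=1e-8, atol=1e-10)

t = 1.0
analytic = np.array([np.exp(-20*t) + 0.5*np.exp(-4*t),
                     np.exp(-20*t) - 0.5*np.exp(-4*t)])
print(sol.sol(t), analytic)   # the two agree to solver tolerance
```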
[Animation: exact solution of the gradient descent ODE]
Continuous Gradient Descent

We'll start with some simplifying assumptions:

1. C has a unique minimiser x^*.
2. \forall \varepsilon > 0: \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0.

(The second condition is weaker than convexity.)

Proposition: If s: [0, \infty) \to X satisfies the differential equation

\frac{ds(t)}{dt} = -\nabla C(s(t))

then s(t) \to x^* as t \to \infty.

[Figure: example of a non-convex function satisfying these conditions]
Continuous Gradient Descent

Proof sketch:
1. Define the Lyapunov function h(t) = ||s(t) - x^*||_2^2.
2. h(t) is a decreasing function.
3. h(t) converges to a limit, and h'(t) converges to 0.
4. The limit of h(t) must be 0.
5. s(t) \to x^*.

h(t) is a decreasing function:

\frac{d}{dt} h(t) = \frac{d}{dt} \|s(t) - x^*\|_2^2
                 = \frac{d}{dt} \langle s(t) - x^*, s(t) - x^* \rangle
                 = 2 \left\langle \frac{ds(t)}{dt}, s(t) - x^* \right\rangle
                 = -2 \langle s(t) - x^*, \nabla C(s(t)) \rangle \le 0

The limit of h(t) must be 0: if not, there is some K > 0 such that h(t) > K for all t. By assumption,

\inf_{x : \|x - x^*\|_2^2 > K} \langle x - x^*, \nabla C(x) \rangle > 0,

so h'(t) \le -c < 0 for all t and some c > 0, contradicting the convergence of h'(t) to 0.
Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible. To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically. There is no one best way to do this!
Discrete Gradient Descent

An intuitive, straightforward discretisation of

\frac{d}{dt} s(t) = -\nabla C(s(t))

is

\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n)    (forward Euler)

Although it is easy and quick to implement, it has theoretical (A-)instability issues. An alternative method, which is (A-)stable, is the implicit Euler method:

\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})

So why settle for explicit discretisation? Implicit requires the solution of a non-linear system at each step, and practically, explicit discretisation works!
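For concreteness, a minimal sketch (ours) of the forward Euler scheme on the running quadratic example; note that forward Euler applied to the gradient-flow ODE is exactly the gradient descent update s_{n+1} = s_n - \varepsilon_n \nabla C(s_n):

```python
import numpy as np

def euler_gradient_descent(grad_C, s0, eps, num_steps):
    """Forward Euler on ds/dt = -grad C(s): s_{n+1} = s_n - eps_n * grad C(s_n)."""
    s = np.asarray(s0, dtype=float)
    for n in range(num_steps):
        s = s - eps(n) * grad_C(s)
    return s

# Running example: C(x) = 5(x1+x2)^2 + (x1-x2)^2, minimiser at the origin.
grad_C = lambda x: np.array([12*x[0] + 8*x[1], 8*x[0] + 12*x[1]])
print(euler_gradient_descent(grad_C, s0=[1.5, 0.5], eps=lambda n: 0.01, num_steps=200))
```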
Discrete Gradient Descent

We also have multistep methods, e.g.

s_{n+2} = s_{n+1} - \varepsilon_{n+2} \left( \frac{3}{2} \nabla C(s_{n+1}) - \frac{1}{2} \nabla C(s_n) \right)

This is the 2nd-order Adams-Bashforth method. It has similar (in)stability properties to forward Euler, but the discretisation error at each step is of a smaller order of magnitude (O(\varepsilon^3) compared with O(\varepsilon^2)).
Discrete Gradient Descent

[Animations: examples of the discretisation with \varepsilon = 0.1, \varepsilon = 0.07 and \varepsilon = 0.01]
Discrete Gradient Descent

Proposition: If s_{n+1} = s_n - \varepsilon_n \nabla C(s_n) and C satisfies

1. C has a unique minimiser x^*;
2. \forall \varepsilon > 0: \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0;
3. ||\nabla C(x)||_2^2 \le A + B ||x - x^*||_2^2 for some A, B \ge 0;

then subject to \sum_n \varepsilon_n = \infty and \sum_n \varepsilon_n^2 < \infty, we have s_n \to x^*.

Proof structure: the proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
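The step-size conditions are satisfied by, e.g., \varepsilon_n = c/n. A small sketch (ours; the constant c = 0.1 and the step count are arbitrary illustrative choices) running the proposition's recursion on the earlier quadratic example:

```python
import numpy as np

# eps_n = c / n satisfies sum eps_n = inf and sum eps_n^2 < inf.
grad_C = lambda x: np.array([12*x[0] + 8*x[1], 8*x[0] + 12*x[1]])
s = np.array([1.5, 0.5])
for n in range(1, 10001):
    s = s - (0.1 / n) * grad_C(s)
print(np.linalg.norm(s))   # slowly shrinks towards 0 as the number of steps grows
```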
Discrete Gradient Descent

Proof sketch:
1. Define the Lyapunov sequence h_n = ||s_n - x^*||_2^2.
2. Consider the positive variations h_n^+ = max(0, h_{n+1} - h_n).
3. Show that \sum_{n=1}^{\infty} h_n^+ < \infty.
4. Show that this implies that h_n converges.
5. Show that h_n must converge to 0.
6. s_n \to x^*.

Define the Lyapunov sequence h_n = ||s_n - x^*||_2^2. Because of the error that the discretisation introduces, h_n is not guaranteed to be decreasing, so we have to adjust the proof. Idea: h_n may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of h_n. [Illustration credit: Online learning and stochastic approximation, Leon Bottou (1998)]

Consider the positive variations h_n^+ = max(0, h_{n+1} - h_n). With some algebra, note that

h_{n+1} - h_n = \langle s_{n+1} - x^*, s_{n+1} - x^* \rangle - \langle s_n - x^*, s_n - x^* \rangle
             = \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2 \langle s_{n+1} - s_n, x^* \rangle
             = \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2 \varepsilon_n \langle \nabla C(s_n), x^* \rangle
             = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 ||\nabla C(s_n)||_2^2

Show that \sum_{n=1}^{\infty} h_n^+ < \infty: assuming ||\nabla C(x)||_2^2 \le A + B ||x - x^*||_2^2, we get

h_{n+1} - h_n \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + B h_n)
\Rightarrow \quad h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A

Writing \mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B} and h_n' = \mu_n h_n, we get

h_{n+1}' - h_n' \le \varepsilon_n^2 \mu_n A

Assuming \sum_{n=1}^{\infty} \varepsilon_n^2 < \infty, \mu_n converges away from 0, so the RHS is summable.

Show that this implies that h_n converges: if \sum_{n=1}^{\infty} \max(0, h_{n+1} - h_n) < \infty, then \sum_{n=1}^{\infty} \min(0, h_{n+1} - h_n) > -\infty (why? because h_n \ge 0, so the negative variations cannot exceed the positive ones by more than h_0). But

h_{n+1} = h_0 + \sum_{k=1}^{n} \big( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \big),

so (h_n)_{n=1}^{\infty} converges.

Show that h_n must converge to 0: assume h_n converges to some positive number. Previously we had

h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 ||\nabla C(s_n)||_2^2

This is summable, and if we assume further that \sum_{n=1}^{\infty} \varepsilon_n = \infty, then we get

\langle s_n - x^*, \nabla C(s_n) \rangle \to 0,

contradicting h_n converging away from 0.
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm. If C is a cost function averaged across a data set, we often have a true gradient of the form

\nabla C(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x)

and an approximation formed by subsampling:

\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \quad I_K \sim \mathrm{Unif}(\text{subsets of size } K)

We'll treat the more general case where the gradient estimates \widehat{\nabla C}(x) are unbiased and independent. The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
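A minimal sketch (ours) of the subsampled gradient estimator, with a Monte Carlo check of its unbiasedness on a toy least-squares loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def subsampled_gradient(per_point_grads, x, K):
    """Unbiased estimate of (1/N) sum_n f_n(x), using a random size-K subset I_K."""
    N = len(per_point_grads)
    idx = rng.choice(N, size=K, replace=False)
    return sum(per_point_grads[k](x) for k in idx) / K

# Toy check of unbiasedness: f_n(x) = 2 (x - y_n), the gradient of (x - y_n)^2.
ys = rng.normal(size=100)
grads = [lambda x, y=y: 2.0 * (x - y) for y in ys]
full = sum(g(0.0) for g in grads) / len(grads)
est = np.mean([subsampled_gradient(grads, 0.0, K=10) for _ in range(2000)])
print(full, est)   # the Monte Carlo average approaches the full-batch gradient
```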
Measure-theory-free martingale theory

Let (X_n)_{n \ge 0} be a stochastic process. F_n denotes "the information describing the stochastic process up to time n", denoted mathematically by F_n = \sigma(X_m | m \le n).

Definition: A stochastic process (X_n)_{n=0}^{\infty} is a martingale³ if
- E[|X_n|] < \infty for all n;
- E[X_n | F_m] = X_m for n \ge m.

Martingale convergence theorem (Doob, 1953): If (X_n)_{n=1}^{\infty} is a martingale and \sup_n E[|X_n|] < \infty, then X_n \to X_\infty almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If (X_n)_{n=1}^{\infty} is a positive stochastic process and

\sum_{n=1}^{\infty} E\left[ (E[X_{n+1} | F_n] - X_n) \, \mathbb{1}_{\{E[X_{n+1} | F_n] - X_n > 0\}} \right] < \infty,

then X_n \to X_\infty almost surely.

³On a filtered probability space (\Omega, F, (F_n)_{n=0}^{\infty}, P), with F_n = \sigma(X_m | m \le n) for all n.
Stochastic Gradient Descent

Proposition: If s_{n+1} = s_n - \varepsilon_n H_n(s_n), with H_n(s_n) an unbiased estimator for \nabla C(s_n), and C satisfies

1. C has a unique minimiser x^*;
2. \forall \varepsilon > 0: \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0;
3. E[||H_n(x)||_2^2] \le A + B ||x - x^*||_2^2 for some A, B \ge 0 independent of n;

then subject to \sum_n \varepsilon_n = \infty and \sum_n \varepsilon_n^2 < \infty, we have s_n \to x^* almost surely.

Proof structure: the proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
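A small sketch (ours) of the proposition's recursion with an unbiased noisy gradient oracle on the earlier quadratic example; the Gaussian noise model and constants are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
grad_C = lambda x: np.array([12*x[0] + 8*x[1], 8*x[0] + 12*x[1]])

# H_n(s) = grad C(s) + zero-mean Gaussian noise: an unbiased estimator whose
# second moment satisfies condition 3 for suitable A, B.
s = np.array([1.5, 0.5])
for n in range(1, 100001):
    H = grad_C(s) + rng.normal(scale=1.0, size=2)
    s = s - (0.1 / n) * H
print(np.linalg.norm(s))   # tends to 0 almost surely as the number of steps grows
```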
Stochastic Gradient Descent

Proof sketch:
1. Define the Lyapunov process h_n = ||s_n - x^*||_2^2.
2. Consider the variations h_{n+1} - h_n.
3. Show that h_n converges almost surely.
4. Show that h_n must converge to 0 almost surely.
5. s_n \to x^*.

Consider the positive variations h_n^+ = max(0, h_{n+1} - h_n). By exactly the same calculation as in the deterministic discrete case,

h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 ||H_n(s_n)||_2^2

So

E[h_{n+1} - h_n | F_n] = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 E[||H_n(s_n)||_2^2 | F_n]

Show that h_n converges almost surely. Exactly as for discrete gradient descent:
- assume E[||H_n(x)||_2^2] \le A + B ||x - x^*||_2^2;
- introduce \mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B} and h_n' = \mu_n h_n.

We get

E[h_{n+1}' - h_n' | F_n] \le \varepsilon_n^2 \mu_n A
\Rightarrow \quad E\left[ (h_{n+1}' - h_n') \, \mathbb{1}_{\{E[h_{n+1}' - h_n' | F_n] > 0\}} \right] \le \varepsilon_n^2 \mu_n A

\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty, so quasimartingale convergence gives that (h_n')_{n=1}^{\infty} converges almost surely, and hence (h_n)_{n=1}^{\infty} converges almost surely.

Show that h_n must converge to 0 almost surely. From the previous calculations,

E[h_{n+1} - (1 + \varepsilon_n^2 B) h_n | F_n] \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A

(h_n)_{n=1}^{\infty} converges, so the left-hand side is summable almost surely. We assume \sum_{n=1}^{\infty} \varepsilon_n^2 < \infty, so the right term is summable almost surely, and hence the remaining term is too:

\sum_{n=1}^{\infty} \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty \quad \text{almost surely.}

If we assume in addition that \sum_{n=1}^{\infty} \varepsilon_n = \infty, this forces

\langle s_n - x^*, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely,}

and by our initial assumptions about C, (h_n)_{n=1}^{\infty} must converge to 0 almost surely.
Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume
- C is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: \sum_{n=1}^{\infty} \varepsilon_n = \infty, \sum_{n=1}^{\infty} \varepsilon_n^2 < \infty;
- the gradient estimate H satisfies E[||H_n(x)||^k] \le A_k + B_k ||x||^k for k = 2, 3, 4;
- there exists D > 0 such that \inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0.

The idea for the proof is then:
1. For a given start position, the sequence (s_n)_{n=1}^{\infty} is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that \nabla C(s_n) necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Exact solution of gradient descent ODE
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
d
dth(t) =
d
dts(t)minus x2
2
=d
dt〈s(t)minus x s(t)minus x〉
= 2
langds(t)
dt s(t)minus x
rang= minus〈s(t)minus xnablaC (s(t))〉 le 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption
infx h(x)gtK
〈x minus xnablaC (x)〉 gt 0
which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)
to 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent

There are also multistep methods, e.g.

sₙ₊₂ = sₙ₊₁ − εₙ₊₂ ( (3/2)∇C(sₙ₊₁) − (1/2)∇C(sₙ) )

This is the 2nd-order Adams-Bashforth method.

Similar (in)stability properties to forward Euler, but discretisation error of a smaller order of magnitude at each step (O(ε³) compared with O(ε²)).

17 of 27
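A sketch of the two-step scheme under the same assumptions as above (the single Euler bootstrap step and the constant step size are my own choices):

    import numpy as np

    # Minimal sketch (assumptions mine): 2nd-order Adams-Bashforth for
    # ds/dt = -grad C(s), bootstrapped with one forward Euler step.
    A = np.array([[12.0, 8.0], [8.0, 12.0]])
    grad_C = lambda s: A @ s

    eps = 0.01
    s_prev = np.array([1.5, 0.5])
    s_curr = s_prev - eps * grad_C(s_prev)      # Euler step to get a second point
    for _ in range(1000):
        s_next = s_curr - eps * (1.5 * grad_C(s_curr) - 0.5 * grad_C(s_prev))
        s_prev, s_curr = s_curr, s_next
    print(s_curr)  # approaches (0, 0), the unique minimiser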
Discrete Gradient Descent
Examples of discretisation with ε = 0.1 (animated demo in the original slides)
18 of 27
Discrete Gradient Descent
Examples of discretisation with ε = 0.07 (animated demo in the original slides)
19 of 27
Discrete Gradient Descent
Examples of discretisation with ε = 0.01 (animated demo in the original slides)
20 of 27
Discrete Gradient Descent

Proposition: If sₙ₊₁ = sₙ − εₙ∇C(sₙ) and C satisfies

1. C has a unique minimiser x⋆
2. ∀ε > 0, inf_{‖x−x⋆‖₂² > ε} ⟨x − x⋆, ∇C(x)⟩ > 0
3. ‖∇C(x)‖₂² ≤ A + B‖x − x⋆‖₂² for some A, B ≥ 0

then, subject to ∑ₙ εₙ = ∞ and ∑ₙ εₙ² < ∞, we have sₙ → x⋆.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.

21 of 27
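A small numerical contrast of the two step-size conditions (my own construction, on the same assumed quadratic): εₙ = 1/n satisfies both ∑εₙ = ∞ and ∑εₙ² < ∞ and drives the iterates to x⋆, while a summable schedule εₙ ∝ 2⁻ⁿ supplies too little total movement and stalls short of the minimiser:

    import numpy as np

    # Minimal sketch (assumptions mine): Robbins-Monro rates eps_n = 1/n
    # versus summable rates eps_n = 0.01 * 2^-n on grad C(x) = A x, x* = 0.
    A = np.array([[12.0, 8.0], [8.0, 12.0]])
    grad_C = lambda s: A @ s

    s_rm = np.array([1.5, 0.5])
    s_short = np.array([1.5, 0.5])
    for n in range(1, 1001):
        s_rm = s_rm - (1.0 / n) * grad_C(s_rm)                 # sum eps_n = inf: converges
        s_short = s_short - (0.01 * 2.0 ** -n) * grad_C(s_short)  # sum eps_n < inf: stalls
    print(np.linalg.norm(s_rm), np.linalg.norm(s_short))  # ~0 vs. bounded away from 0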
Discrete Gradient Descent

1. Define Lyapunov sequence hₙ = ‖sₙ − x⋆‖₂²
2. Consider the positive variations hₙ⁺ = max(0, hₙ₊₁ − hₙ)
3. Show that ∑_{n=1}^∞ hₙ⁺ < ∞
4. Show that this implies that hₙ converges
5. Show that hₙ must converge to 0
6. sₙ → x⋆
Define Lyapunov sequence hₙ = ‖sₙ − x⋆‖₂².
Because of the error that the discretisation introduced, hₙ is not guaranteed to be decreasing — we have to adjust the proof.

Idea: hₙ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of hₙ. (Illustration credit: Online Learning and Stochastic Approximations, Léon Bottou, 1998.)
Consider the positive variations hₙ⁺ = max(0, hₙ₊₁ − hₙ).
With some algebra, note that

hₙ₊₁ − hₙ = ⟨sₙ₊₁ − x⋆, sₙ₊₁ − x⋆⟩ − ⟨sₙ − x⋆, sₙ − x⋆⟩
          = ⟨sₙ₊₁, sₙ₊₁⟩ − ⟨sₙ, sₙ⟩ − 2⟨sₙ₊₁ − sₙ, x⋆⟩
          = ⟨sₙ − εₙ∇C(sₙ), sₙ − εₙ∇C(sₙ)⟩ − ⟨sₙ, sₙ⟩ + 2εₙ⟨∇C(sₙ), x⋆⟩
          = −2εₙ⟨sₙ − x⋆, ∇C(sₙ)⟩ + εₙ²‖∇C(sₙ)‖₂²
Show that ∑_{n=1}^∞ hₙ⁺ < ∞:
Assuming ‖∇C(x)‖₂² ≤ A + B‖x − x⋆‖₂², we get

hₙ₊₁ − hₙ ≤ −2εₙ⟨sₙ − x⋆, ∇C(sₙ)⟩ + εₙ²(A + Bhₙ)
⟹ hₙ₊₁ − (1 + εₙ²B)hₙ ≤ −2εₙ⟨sₙ − x⋆, ∇C(sₙ)⟩ + εₙ²A ≤ εₙ²A
Writing μₙ = ∏_{i=1}^n 1/(1 + εᵢ²B) and h′ₙ = μₙhₙ, the bound hₙ₊₁ − (1 + εₙ²B)hₙ ≤ εₙ²A gives

h′ₙ₊₁ − h′ₙ ≤ εₙ²μₙA

Assuming ∑_{n=1}^∞ εₙ² < ∞, μₙ converges away from 0, so the RHS is summable.
Show that this implies that hₙ converges:
If ∑_{n=1}^∞ max(0, hₙ₊₁ − hₙ) < ∞ then, since hₙ ≥ 0, we also have ∑_{n=1}^∞ min(0, hₙ₊₁ − hₙ) > −∞ (why?).
But hₙ₊₁ = h₁ + ∑_{k=1}^n (max(0, hₖ₊₁ − hₖ) + min(0, hₖ₊₁ − hₖ)).
So (hₙ)_{n=1}^∞ converges.
Show that hₙ must converge to 0:
Assume hₙ converges to some positive number. Previously we had

hₙ₊₁ − hₙ = −2εₙ⟨sₙ − x⋆, ∇C(sₙ)⟩ + εₙ²‖∇C(sₙ)‖₂²

This is summable, and if we assume further that ∑_{n=1}^∞ εₙ = ∞, then we get

⟨sₙ − x⋆, ∇C(sₙ)⟩ → 0,

contradicting hₙ converging away from 0.
22 of 27
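As a sanity check on the proof's central quantity, here is a small sketch (assumptions mine) that runs the discretised scheme with εₙ = 1/n on the earlier quadratic and confirms numerically that hₙ is not monotone — it can jump up early on — yet its positive variations have a finite sum:

    import numpy as np

    # Minimal numerical check (assumptions mine): the Lyapunov sequence
    # h_n = ||s_n - x*||^2 can move up, but sum of h+_n stays bounded.
    A = np.array([[12.0, 8.0], [8.0, 12.0]])   # grad C(x) = A x, minimiser x* = 0
    grad_C = lambda s: A @ s

    s = np.array([1.5, 0.5])
    h_prev, up_sum = np.dot(s, s), 0.0
    for n in range(1, 10_001):
        s = s - (1.0 / n) * grad_C(s)
        h = np.dot(s, s)
        up_sum += max(0.0, h - h_prev)   # positive variation h+_n
        h_prev = h
    print(h_prev, up_sum)  # h_n -> 0 while the positive variations remain summable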
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If C is a cost function averaged across a data set, we often have a true gradient of the form

∇C(x) = (1/N) ∑_{n=1}^N fₙ(x)

and an approximation is formed by subsampling:

∇̂C(x) = (1/K) ∑_{k∈I_K} f_k(x),    I_K ∼ Unif(subsets of size K)

We'll treat the more general case where the gradient estimates ∇̂C(x) are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

23 of 27
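A minimal sketch of the subsampled estimator (the function names and the toy least-squares cost are my own choices); averaging over a uniformly drawn index set of size K gives an unbiased estimate of the full-data gradient:

    import numpy as np

    # Minimal sketch (names mine): unbiased minibatch estimate of
    # grad C(x) = (1/N) sum_n f_n(x), via a uniform index set I_K.
    def minibatch_gradient(x, f_grads, K, rng):
        # f_grads: list of the N per-datapoint gradient functions f_n
        idx = rng.choice(len(f_grads), size=K, replace=False)  # I_K ~ Unif
        return np.mean([f_grads[k](x) for k in idx], axis=0)

    # Usage on per-datapoint gradients of a toy least-squares cost.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(100, 2))
    f_grads = [lambda x, d=d: 2 * (x - d) for d in data]   # f_n(x) = 2(x - d_n)
    g_hat = minibatch_gradient(np.zeros(2), f_grads, K=10, rng=rng)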
Measure-theory-free martingale theory

Let (Xₙ)_{n≥0} be a stochastic process.

Fₙ denotes "the information describing the stochastic process up to time n", denoted mathematically by Fₙ = σ(Xₘ | m ≤ n).

Definition: A stochastic process (Xₙ)_{n=0}^∞ is a martingale³ if

• E[|Xₙ|] < ∞ for all n
• E[Xₙ | Fₘ] = Xₘ for n ≥ m

Martingale convergence theorem (Doob, 1953): If (Xₙ)_{n=1}^∞ is a martingale and sup E[|Xₙ|] < ∞, then Xₙ → X∞ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If (Xₙ)_{n=1}^∞ is a positive stochastic process and

∑_{n=1}^∞ E[(E[Xₙ₊₁ | Fₙ] − Xₙ) 1{E[Xₙ₊₁|Fₙ] − Xₙ > 0}] < ∞,

then Xₙ → X∞ almost surely.

³ on a filtered probability space (Ω, F, (Fₙ)_{n=0}^∞, P) with Fₙ = σ(Xₘ | m ≤ n) ∀n

24 of 27
Stochastic Gradient Descent

Proposition: If sₙ₊₁ = sₙ − εₙHₙ(sₙ), with Hₙ(sₙ) an unbiased estimator for ∇C(sₙ), and C satisfies

1. C has a unique minimiser x⋆
2. ∀ε > 0, inf_{‖x−x⋆‖₂² > ε} ⟨x − x⋆, ∇C(x)⟩ > 0
3. E[‖Hₙ(x)‖₂²] ≤ A + B‖x − x⋆‖₂² for some A, B ≥ 0 independent of n

then, subject to ∑ₙ εₙ = ∞ and ∑ₙ εₙ² < ∞, we have sₙ → x⋆.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

25 of 27
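A minimal sketch of this setting (the Gaussian noise model is my own choice): the noisy gradient Hₙ is unbiased, its second moment satisfies condition 3 because the noise variance is bounded, and with εₙ = 1/n the iterates settle at the minimiser:

    import numpy as np

    # Minimal sketch (assumptions mine): SGD on the quadratic with an
    # unbiased noisy gradient and Robbins-Monro steps eps_n = 1/n.
    A = np.array([[12.0, 8.0], [8.0, 12.0]])
    rng = np.random.default_rng(0)

    def H(s):
        # Unbiased estimator: E[H(s)] = A s = grad C(s); bounded noise
        # variance gives E||H(s)||^2 <= A' + B'||s - x*||^2.
        return A @ s + rng.normal(scale=1.0, size=2)

    s = np.array([1.5, 0.5])
    for n in range(1, 100_001):
        s = s - (1.0 / n) * H(s)
    print(np.linalg.norm(s))  # small: s_n approaches x* = 0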
Stochastic Gradient Descent

1. Define Lyapunov process hₙ = ‖sₙ − x⋆‖₂²
2. Consider the variations hₙ₊₁ − hₙ
3. Show that hₙ converges almost surely
4. Show that hₙ must converge to 0 almost surely
5. sₙ → x⋆
Consider the positive variations hₙ⁺ = max(0, hₙ₊₁ − hₙ).
By exactly the same calculation as in the deterministic discrete case,

hₙ₊₁ − hₙ = −2εₙ⟨sₙ − x⋆, Hₙ(sₙ)⟩ + εₙ²‖Hₙ(sₙ)‖₂²

So

E[hₙ₊₁ − hₙ | Fₙ] = −2εₙ⟨sₙ − x⋆, ∇C(sₙ)⟩ + εₙ² E[‖Hₙ(sₙ)‖₂² | Fₙ]
Show that hₙ converges almost surely:
Exactly the same as for discrete gradient descent:

• Assume E[‖Hₙ(x)‖₂²] ≤ A + B‖x − x⋆‖₂²
• Introduce μₙ = ∏_{i=1}^n 1/(1 + εᵢ²B) and h′ₙ = μₙhₙ

Get

E[h′ₙ₊₁ − h′ₙ | Fₙ] ≤ εₙ²μₙA
⟹ E[(h′ₙ₊₁ − h′ₙ) 1{E[h′ₙ₊₁ − h′ₙ | Fₙ] > 0} | Fₙ] ≤ εₙ²μₙA

∑_{n=1}^∞ εₙ² < ∞ ⟹ quasimartingale convergence ⟹ (h′ₙ)_{n=1}^∞ converges a.s. ⟹ (hₙ)_{n=1}^∞ converges a.s.
Show that hₙ must converge to 0 almost surely:
From previous calculations,

E[hₙ₊₁ − (1 + εₙ²B)hₙ | Fₙ] ≤ −2εₙ⟨sₙ − x⋆, ∇C(sₙ)⟩ + εₙ²A

(hₙ)_{n=1}^∞ converges, so the left side is summable a.s. We assume ∑_{n=1}^∞ εₙ² < ∞, so the εₙ²A term is summable a.s., and hence

∑_{n=1}^∞ εₙ⟨sₙ − x⋆, ∇C(sₙ)⟩ < ∞   almost surely

If we assume in addition that ∑_{n=1}^∞ εₙ = ∞, this forces

⟨sₙ − x⋆, ∇C(sₙ)⟩ → 0   almost surely

And by our initial assumptions about C, (hₙ)_{n=1}^∞ must converge to 0 almost surely.
26 of 27
Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

• C is a non-negative, three-times differentiable function
• The Robbins-Monro learning rate conditions hold: ∑_{n=1}^∞ εₙ = ∞, ∑_{n=1}^∞ εₙ² < ∞
• The gradient estimate H satisfies E[‖Hₙ(x)‖ᵏ] ≤ Aₖ + Bₖ‖x‖ᵏ for k = 2, 3, 4
• There exists D > 0 such that inf_{‖x‖₂ > D} ⟨x, ∇C(x)⟩ > 0

The idea for the proof is then:

1. For a given start position, the sequence (sₙ)_{n=1}^∞ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function hₙ = C(sₙ) and prove its almost-sure convergence.
3. Prove that ∇C(sₙ) necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.

27 of 27
Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. How, then, to choose learning rates to achieve

• fast convergence
• a better local optimum?

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 / 27
Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

Idea 1: search for the best learning rate schedule:
grid search, random search, cross validation, Bayesian optimization.

But I'm lazy, and I don't want to spend too much time!

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 / 27
Motivation (cont.)

Idea 2: let the learning rates adapt themselves:

• pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
• adaGrad [Duchi et al., 2010]
• adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
• ADAM [Kingma and Ba, 2014]
• Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

• a good tutorial: [Shalev-Shwartz, 2012]
• milestone paper: [Zinkevich, 2003]

and some practical comparisons.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 / 27
Online learning

From batch to online learning

Batch learning often assumes:

• we've got the whole dataset D
• the cost function is C(w, D) = E_{x∼D}[c(w, x)]

SGD / mini-batch learning accelerates training in real time by

• processing one datapoint / a mini-batch of datapoints each iteration
• considering gradients with per-datapoint cost functions:

∇C(w, D) ≈ (1/M) ∑_{m=1}^M ∇c(w, xₘ)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 / 27
Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

• each iteration t, we receive a loss fₜ(w) and a (noisy) (sub-)gradient gₜ ∈ ∂fₜ(w)
• then we update the weights with w ← w − η gₜ

For SGD the received gradient is defined by

gₜ = (1/M) ∑_{m=1}^M ∇c(w, xₘ)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 / 27
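As a concrete illustration, here is a minimal sketch of the OGD loop (the data stream and the absolute-error loss fₜ(w) = |⟨w, xₜ⟩ − yₜ| are my own choices, not from the slides):

    import numpy as np

    # Minimal sketch (assumptions mine): online gradient descent on a
    # stream of convex losses f_t(w) = |<w, x_t> - y_t|.
    rng = np.random.default_rng(0)
    w, eta = np.zeros(3), 0.1
    total_loss = 0.0
    for t in range(1000):
        x_t, y_t = rng.normal(size=3), rng.normal()  # loss revealed at time t
        resid = np.dot(w, x_t) - y_t
        total_loss += abs(resid)                     # f_t(w_t)
        g_t = np.sign(resid) * x_t                   # a subgradient in df_t(w_t)
        w = w - eta * g_t                            # OGD update
    print(total_loss / 1000)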
Online learning

From batch to online learning (cont.)

[Illustrative figure in the original slides]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 / 27
Online learning

From batch to online learning (cont.)

Online learning assumes:

• each iteration t, we receive a loss fₜ(w) (in the batch learning context, fₜ(w) = c(w, xₜ))
• then we learn wₜ₊₁ following some rules

Performance is measured by regret:

R*_T = ∑_{t=1}^T fₜ(wₜ) − inf_{w∈S} ∑_{t=1}^T fₜ(w)    (1)

General definition: R_T(u) = ∑_{t=1}^T [fₜ(wₜ) − fₜ(u)]

SGD ⇔ "follow the regularized leader" with L2 regularization.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 / 27
Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):

wₜ₊₁ = argmin_{w∈S} ∑_{i=1}^t fᵢ(w)    (2)

Lemma (upper bound of regret):
Let w₁, w₂, ... be the sequence produced by FTL. Then ∀u ∈ S,

R_T(u) = ∑_{t=1}^T [fₜ(wₜ) − fₜ(u)] ≤ ∑_{t=1}^T [fₜ(wₜ) − fₜ(wₜ₊₁)]    (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 / 27
Online learning

From batch to online learning (cont.)

A game that fools FTL:
Let w ∈ S = [−1, 1] and the loss function at time t be

fₜ(w) = −0.5w   if t = 1
        w       if t is even
        −w      if t > 1 and t is odd

FTL is easily fooled!

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 / 27
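A small simulation of this game (implementation details, including the tie-break at t = 1, are my own): the cumulative loss is linear in w, so FTL always jumps to an endpoint of [−1, 1] and then pays roughly 1 per round, i.e. linear regret, while the fixed point u = 0 pays nothing:

    import numpy as np

    # Minimal sketch (assumptions mine): the adversarial game above.
    def f(t, w):
        if t == 1:
            return -0.5 * w
        return w if t % 2 == 0 else -w

    T = 100
    cum_slope, ftl_loss = 0.0, 0.0
    for t in range(1, T + 1):
        # Cumulative loss so far is (cum_slope * w); FTL picks an endpoint.
        w_t = -1.0 if cum_slope > 0 else 1.0
        ftl_loss += f(t, w_t)
        cum_slope += {1: -0.5}.get(t, 1.0 if t % 2 == 0 else -1.0)
    print(ftl_loss)  # grows linearly with T: regret is O(T), not sublinear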
Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

wₜ₊₁ = argmin_{w∈S} ∑_{i=1}^t fᵢ(w) + φ(w)    (4)

Lemma (upper bound of regret):
Let w₁, w₂, ... be the sequence produced by FTRL. Then ∀u ∈ S,

∑_{t=1}^T [fₜ(wₜ) − fₜ(u)] ≤ φ(u) − φ(w₁) + ∑_{t=1}^T [fₜ(wₜ) − fₜ(wₜ₊₁)]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 / 27
Online learning

From batch to online learning (cont.)

Let's assume the losses fₜ are convex functions:

∀w, u ∈ S: fₜ(u) − fₜ(w) ≥ ⟨u − w, g(w)⟩, ∀g(w) ∈ ∂fₜ(w)

Use the linearisation f̃ₜ(w) = fₜ(wₜ) + ⟨w, g(wₜ)⟩ as a surrogate:

∑_{t=1}^T fₜ(wₜ) − fₜ(u) ≤ ∑_{t=1}^T ⟨wₜ, gₜ⟩ − ⟨u, gₜ⟩, ∀gₜ ∈ ∂fₜ(wₜ)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 / 27
Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):

wₜ₊₁ = wₜ − η gₜ,  gₜ ∈ ∂fₜ(wₜ)    (5)

From FTRL to online gradient descent:

• use the linearisation as surrogate loss (upper bound the LHS)
• use the L2 regularizer φ(w) = (1/2η)‖w − w₁‖₂²
• apply the FTRL regret bound to the surrogate loss

wₜ₊₁ = argmin_{w∈S} ∑_{i=1}^t ⟨w, gᵢ⟩ + φ(w)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 / 27
Online learning

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent):
Assume we run online gradient descent on convex loss functions f₁, f₂, ..., f_T with regularizer φ(w) = (1/2η)‖w − w₁‖₂². Then for all u ∈ S,

R_T(u) ≤ (1/2η)‖u − w₁‖₂² + η ∑_{t=1}^T ‖gₜ‖₂²,  gₜ ∈ ∂fₜ(wₜ)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 / 27
Online learning

From batch to online learning (cont.)

Keep this bound in mind for the later discussion of adaGrad:

• we want φ to be strongly convex w.r.t. a (semi-)norm ‖·‖
• then the L2 norm in the bound is replaced by the dual norm ‖·‖∗
• and we use Hölder's inequality in the proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 / 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms ψₜ:

• ψₜ is a strongly-convex function, and
• B_{ψₜ}(w, wₜ) = ψₜ(w) − ψₜ(wₜ) − ⟨∇ψₜ(wₜ), w − wₜ⟩

Primal-dual sub-gradient update:

wₜ₊₁ = argmin_w η⟨w, gₜ⟩ + ηφ(w) + (1/t)ψₜ(w)    (6)

Proximal gradient / composite mirror descent:

wₜ₊₁ = argmin_w η⟨w, gₜ⟩ + ηφ(w) + B_{ψₜ}(w, wₜ)    (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 / 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

Gₜ = ∑_{i=1}^t gᵢgᵢᵀ,  Hₜ = δI + diag(Gₜ^{1/2})

wₜ₊₁ = argmin_{w∈S} η⟨w, gₜ⟩ + ηφ(w) + (1/2t) wᵀHₜw    (8)

wₜ₊₁ = argmin_{w∈S} η⟨w, gₜ⟩ + ηφ(w) + (1/2)(w − wₜ)ᵀHₜ(w − wₜ)    (9)

To get adaGrad from online gradient descent:

• add a proximal term or a Bregman divergence to the objective
• adapt the proximal term through time: ψₜ(w) = (1/2) wᵀHₜw

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 / 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on S = R^D with φ(w) = 0:

wₜ₊₁ = argmin_w η⟨w, gₜ⟩ + (1/2)(w − wₜ)ᵀ[δI + diag(Gₜ)^{1/2}](w − wₜ)

⟹ wₜ₊₁ = wₜ − η gₜ / (δ + √(∑_{i=1}^t gᵢ²))    (10)

(elementwise in each coordinate)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 / 27
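A minimal sketch of the diagonal update (10) (function names and hyper-parameter values are my own): each coordinate's step size shrinks with that coordinate's accumulated squared gradients:

    import numpy as np

    # Minimal sketch (names mine) of the diagonal adaGrad update (10).
    def adagrad_update(w, g, G_sum, eta=0.1, delta=1e-8):
        G_sum = G_sum + g * g                        # running sum of squared gradients
        w = w - eta * g / (delta + np.sqrt(G_sum))   # per-coordinate step sizes
        return w, G_sum

    # Usage on a toy quadratic loss f(w) = 0.5 ||w||^2 (gradient g = w):
    w, G_sum = np.array([1.5, 0.5]), np.zeros(2)
    for _ in range(500):
        w, G_sum = adagrad_update(w, w, G_sum)
    print(w)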
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:

• we want the contribution of ‖gₜ‖∗² induced by ψₜ to be upper bounded by ‖gₜ‖₂
• define ‖·‖_{ψₜ} = √⟨·, Hₜ ·⟩
• ψₜ(w) = (1/2) wᵀHₜw is strongly convex w.r.t. ‖·‖_{ψₜ}
• a little math can show that

∑_{t=1}^T ‖gₜ‖²_{ψₜ∗} ≤ 2 ∑_{d=1}^D ‖g_{1:T,d}‖₂

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 / 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper):
Assume the sequence wₜ is generated by adaGrad using the primal-dual update (8) with δ ≥ maxₜ ‖gₜ‖∞. Then for any u ∈ S,

R_T(u) ≤ (δ/η)‖u‖₂² + (1/η)‖u‖∞² ∑_{d=1}^D ‖g_{1:T,d}‖₂ + η ∑_{d=1}^D ‖g_{1:T,d}‖₂

For wₜ generated using the composite mirror descent update (9),

R_T(u) ≤ (1/2η) max_{t≤T} ‖u − wₜ‖₂² ∑_{d=1}^D ‖g_{1:T,d}‖₂ + η ∑_{d=1}^D ‖g_{1:T,d}‖₂

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 / 27
Advanced adaptive learning rates

Improving adaGrad

Possible problems of adaGrad:

• the global learning rate η still needs hand tuning
• sensitive to initial conditions

Possible improving directions:

• use a truncated sum / running average
• use (approximate) second-order information
• add momentum
• correct the bias of the running average

Existing methods:

• RMSprop [Tieleman and Hinton, 2012]
• adaDelta [Zeiler, 2012]
• ADAM [Kingma and Ba, 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 / 27
Advanced adaptive learning rates

Improving adaGrad (cont.)

RMSprop:

Gₜ = ρGₜ₋₁ + (1 − ρ)gₜgₜᵀ,  Hₜ = (δI + diag(Gₜ))^{1/2}

wₜ₊₁ = argmin_w η⟨w, gₜ⟩ + (1/2)(w − wₜ)ᵀHₜ(w − wₜ)

⟹ wₜ₊₁ = wₜ − η gₜ / RMS[g]ₜ,  RMS[g]ₜ = diag(Hₜ)    (11)

Improving direction used here: a truncated sum / running average.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 / 27
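A minimal sketch of update (11) (function names and default hyper-parameters are my own); usage mirrors the adaGrad sketch above, with the accumulator replaced by an exponential moving average:

    import numpy as np

    # Minimal sketch (names mine) of the RMSprop update (11): divide the
    # gradient by a running root-mean-square of its recent magnitudes.
    def rmsprop_update(w, g, G, eta=0.01, rho=0.9, delta=1e-8):
        G = rho * G + (1 - rho) * g * g          # moving average of g^2
        w = w - eta * g / np.sqrt(delta + G)     # RMS[g]_t = sqrt(delta + G_t)
        return w, G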
Advanced adaptive learning rates

Improving adaGrad (cont.)

A second-order heuristic: for Newton's method,

Δw ∝ H⁻¹g ∝ (∂f/∂w) / (∂²f/∂w²)

⟹ ∂²f/∂w² ∝ (∂f/∂w) / Δw

⟹ diag(H) ≈ RMS[g]ₜ / RMS[Δw]ₜ₋₁

Improving direction used here: (approximate) second-order information.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 / 27
Advanced adaptive learning rates

Improving adaGrad (cont.)

adaDelta:

diag(Hₜ) = RMS[g]ₜ / RMS[Δw]ₜ₋₁

wₜ₊₁ = argmin_w ⟨w, gₜ⟩ + (1/2)(w − wₜ)ᵀHₜ(w − wₜ)

⟹ wₜ₊₁ = wₜ − (RMS[Δw]ₜ₋₁ / RMS[g]ₜ) gₜ    (12)

Improving directions used here: a running average plus approximate second-order information.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 / 27
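A minimal sketch of update (12) (function names and default hyper-parameters are my own): the ratio of the two running RMS estimates yields a per-coordinate, unit-consistent step with no global η:

    import numpy as np

    # Minimal sketch (names mine) of the adaDelta update (12).
    def adadelta_update(w, g, Eg2, Edw2, rho=0.95, delta=1e-6):
        Eg2 = rho * Eg2 + (1 - rho) * g * g                 # running average of g^2
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
        Edw2 = rho * Edw2 + (1 - rho) * dw * dw             # running average of dw^2
        return w + dw, Eg2, Edw2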
Advanced adaptive learning rates

Improving adaGrad (cont.)

ADAM:

mₜ = β₁mₜ₋₁ + (1 − β₁)gₜ,  vₜ = β₂vₜ₋₁ + (1 − β₂)gₜ²

m̂ₜ = mₜ / (1 − β₁ᵗ),  v̂ₜ = vₜ / (1 − β₂ᵗ)

wₜ₊₁ = wₜ − η m̂ₜ / (√v̂ₜ + δ)    (13)

Improving directions used here: momentum plus bias correction of the running averages.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 / 27
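A minimal sketch of update (13) (function names and default hyper-parameters are my own, matching commonly used values): biased first- and second-moment estimates, corrected by 1 − βᵗ, then a scaled step:

    import numpy as np

    # Minimal sketch (names mine) of the ADAM update (13).
    def adam_update(w, g, m, v, t, eta=0.001, b1=0.9, b2=0.999, delta=1e-8):
        m = b1 * m + (1 - b1) * g            # first moment (momentum)
        v = b2 * v + (1 - b2) * g * g        # second moment
        m_hat = m / (1 - b1 ** t)            # bias correction (t starts at 1)
        v_hat = v / (1 - b2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + delta)
        return w, m, v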
Advanced adaptive learning rates

Demo time!

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 / 27
Advanced adaptive learning rates

Learn the learning rates

Idea 3: learn the (global) learning rates as well:

wₜ₊₁ = wₜ − η gₜ ⟹ wₜ = w(t, η; w₁, {gᵢ}_{i≤t−1})

i.e. wₜ is a function of η, so learn η (with gradient descent) by

η = argmin_η ∑_{s≤t} f_s(w(s, η))

• reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
• stochastic gradient descent to learn η on the fly [Massé and Ollivier, 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 / 27
Advanced adaptive learning rates

Summary

• Recent successes of machine learning rely on stochastic approximation and optimisation methods
• Theoretical analysis guarantees good behaviour
• Adaptive learning rates provide good performance in practice
• Future work will reduce the labour of tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 / 27
Advanced adaptive learning rates

References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281-305, 2012.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011.
Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T. and Hinton, G. Lecture 6.5 - RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J. et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Massé, P.-Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 / 27
Stochastic gradient descent
Wersquoll focus on proof techniques for stochastic gradient descent (SGD)
Wersquoll derive conditions for SGD to convergence almost surely Webroadly follow the structure of Leon Bottoursquos paper2
Plan
I Continuous gradient descent
I Discrete gradient descent
I Stochastic gradient descent
2Online Learning and Stochastic Approximations - Leon Bottou (1998)10 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Exact solution of gradient descent ODE
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
d
dth(t) =
d
dts(t)minus x2
2
=d
dt〈s(t)minus x s(t)minus x〉
= 2
langds(t)
dt s(t)minus x
rang= minus〈s(t)minus xnablaC (s(t))〉 le 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption
infx h(x)gtK
〈x minus xnablaC (x)〉 gt 0
which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)
to 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
A game that fools FTL:

Let $w \in S = [-1, 1]$ and the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled: the cumulative loss is minimised at an endpoint whose sign flips every round, so FTL's iterates oscillate between $-1$ and $+1$ and pay a loss of about $1$ per round, while the fixed point $u = 0$ pays nothing, giving regret linear in $T$. A quick simulation (below) confirms this.
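The game can be checked in a few lines; this is an illustrative simulation sketch (the loop structure and helper names are ours, not the slides'):

```python
# Simulate the FTL-fooling game above. FTL picks the minimiser of the
# cumulative loss over S = [-1, 1]; since each f_t is linear, the
# minimiser is an endpoint (or 0 when the cumulative slope is 0).

def loss_slope(t):
    # f_t(w) = slope * w
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T = 100
cum_slope = 0.0
ftl_loss = 0.0
for t in range(1, T + 1):
    # FTL plays before seeing f_t: minimise sum_{i<t} f_i(w) over [-1, 1]
    w = -1.0 if cum_slope > 0 else (1.0 if cum_slope < 0 else 0.0)
    ftl_loss += loss_slope(t) * w
    cum_slope += loss_slope(t)

# The fixed comparator u = 0 has zero loss, so regret = ftl_loss.
print(f"FTL regret over T={T} rounds: {ftl_loss:.1f}")  # grows linearly in T
```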
Follow the Regularized Leader (FTRL):

$$\mathbf{w}_{t+1} = \operatorname{argmin}_{\mathbf{w} \in S} \sum_{i=1}^{t} f_i(\mathbf{w}) + \varphi(\mathbf{w}) \quad (4)$$

Lemma (upper bound of regret)
Let $\mathbf{w}_1, \mathbf{w}_2, \ldots$ be the sequence produced by FTRL, then $\forall \mathbf{u} \in S$:

$$\sum_{t=1}^{T} [f_t(\mathbf{w}_t) - f_t(\mathbf{u})] \le \varphi(\mathbf{u}) - \varphi(\mathbf{w}_1) + \sum_{t=1}^{T} [f_t(\mathbf{w}_t) - f_t(\mathbf{w}_{t+1})]$$
Let's assume the losses $f_t$ are convex functions:

$$\forall \mathbf{w}, \mathbf{u} \in S, \quad f_t(\mathbf{u}) - f_t(\mathbf{w}) \ge \langle \mathbf{u} - \mathbf{w}, \mathbf{g}(\mathbf{w}) \rangle, \quad \forall \mathbf{g}(\mathbf{w}) \in \partial f_t(\mathbf{w})$$

Use the linearisation $\tilde{f}_t(\mathbf{w}) = f_t(\mathbf{w}_t) + \langle \mathbf{w}, \mathbf{g}(\mathbf{w}_t) \rangle$ as a surrogate:

$$\sum_{t=1}^{T} f_t(\mathbf{w}_t) - f_t(\mathbf{u}) \le \sum_{t=1}^{T} \langle \mathbf{w}_t, \mathbf{g}_t \rangle - \langle \mathbf{u}, \mathbf{g}_t \rangle, \quad \forall \mathbf{g}_t \in \partial f_t(\mathbf{w}_t)$$
Online Gradient Descent (OGD):

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \mathbf{g}_t, \quad \mathbf{g}_t \in \partial f_t(\mathbf{w}_t) \quad (5)$$

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (upper bound on the LHS)
- use the L2 regularizer $\varphi(\mathbf{w}) = \frac{1}{2\eta} \|\mathbf{w} - \mathbf{w}_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss

$$\mathbf{w}_{t+1} = \operatorname{argmin}_{\mathbf{w} \in S} \sum_{i=1}^{t} \langle \mathbf{w}, \mathbf{g}_i \rangle + \varphi(\mathbf{w})$$
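To make the recursion concrete, here is a minimal OGD sketch that also tracks regret against a fixed comparator; the quadratic losses, the comparator choice, and all constants are illustrative assumptions, not from the slides:

```python
import numpy as np

# Minimal OGD sketch (illustrative): losses f_t(w) = 0.5 * ||w - x_t||^2
# arrive one at a time; the gradient is g_t = w - x_t and the update is
# w <- w - eta * g_t, exactly as in (5).

rng = np.random.default_rng(0)
eta, T, D = 0.1, 500, 3
w = np.zeros(D)
xs = rng.normal(loc=1.0, scale=0.5, size=(T, D))

regret = 0.0
u = xs.mean(axis=0)  # a strong fixed comparator in hindsight
for t in range(T):
    f_w = 0.5 * np.sum((w - xs[t]) ** 2)   # loss we pay at round t
    f_u = 0.5 * np.sum((u - xs[t]) ** 2)   # comparator's loss
    regret += f_w - f_u
    g = w - xs[t]                          # (sub-)gradient of f_t at w
    w -= eta * g                           # OGD update

print(f"final w: {w}, regret R_T(u) = {regret:.2f}")  # sublinear in T
```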
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(\mathbf{w}) = \frac{1}{2\eta} \|\mathbf{w} - \mathbf{w}_1\|_2^2$, then for all $\mathbf{u} \in S$:

$$R_T(\mathbf{u}) \le \frac{1}{2\eta} \|\mathbf{u} - \mathbf{w}_1\|_2^2 + \eta \sum_{t=1}^{T} \|\mathbf{g}_t\|_2^2, \quad \mathbf{g}_t \in \partial f_t(\mathbf{w}_t)$$
Keep in mind for the later discussion of adaGrad: if I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$, then the L2 norm above will be changed to the dual norm $\|\cdot\|_*$, and we use Hölder's inequality in the proof.
Advanced adaptive learning rates
adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(\mathbf{w}, \mathbf{w}_t) = \psi_t(\mathbf{w}) - \psi_t(\mathbf{w}_t) - \langle \nabla \psi_t(\mathbf{w}_t), \mathbf{w} - \mathbf{w}_t \rangle$$

Primal-dual sub-gradient update:

$$\mathbf{w}_{t+1} = \operatorname{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + \frac{1}{t} \psi_t(\mathbf{w}) \quad (6)$$

Proximal gradient / composite mirror descent:

$$\mathbf{w}_{t+1} = \operatorname{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + B_{\psi_t}(\mathbf{w}, \mathbf{w}_t) \quad (7)$$
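For the quadratic proximal terms used below, the Bregman divergence reduces to a plain quadratic form, which is the step connecting (7) to the diagonal update (9). With $\psi_t(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T H_t \mathbf{w}$ and symmetric $H_t$, we have $\nabla \psi_t(\mathbf{w}_t) = H_t \mathbf{w}_t$, so

$$B_{\psi_t}(\mathbf{w}, \mathbf{w}_t) = \tfrac{1}{2} \mathbf{w}^T H_t \mathbf{w} - \tfrac{1}{2} \mathbf{w}_t^T H_t \mathbf{w}_t - \langle H_t \mathbf{w}_t, \mathbf{w} - \mathbf{w}_t \rangle = \tfrac{1}{2} (\mathbf{w} - \mathbf{w}_t)^T H_t (\mathbf{w} - \mathbf{w}_t)$$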
adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^{t} \mathbf{g}_i \mathbf{g}_i^T, \quad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$\mathbf{w}_{t+1} = \operatorname{argmin}_{\mathbf{w} \in S} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + \frac{1}{2t} \mathbf{w}^T H_t \mathbf{w} \quad (8)$$

$$\mathbf{w}_{t+1} = \operatorname{argmin}_{\mathbf{w} \in S} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^T H_t (\mathbf{w} - \mathbf{w}_t) \quad (9)$$

To get adaGrad from online gradient descent:

- add a proximal term, or a Bregman divergence, to the objective
- adapt the proximal term through time:

$$\psi_t(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T H_t \mathbf{w}$$
Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(\mathbf{w}) = 0$:

$$\mathbf{w}_{t+1} = \operatorname{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^T [\delta I + \mathrm{diag}(G_t^{1/2})] (\mathbf{w} - \mathbf{w}_t)$$

$$\Rightarrow \quad \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta \mathbf{g}_t}{\delta + \sqrt{\sum_{i=1}^{t} \mathbf{g}_i^2}} \quad \text{(element-wise)} \quad (10)$$
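A minimal sketch of update (10) on a toy quadratic; the target point and constants are illustrative assumptions:

```python
import numpy as np

# Diagonal adaGrad implementing update (10) on an assumed objective
# 0.5 * ||w - w*||^2 with w* = [1, -2].

def grad(w):
    return w - np.array([1.0, -2.0])

eta, delta, T = 0.5, 1e-8, 200
w = np.zeros(2)
G_diag = np.zeros(2)                          # running sum of squared gradients

for t in range(T):
    g = grad(w)
    G_diag += g ** 2                          # accumulate diag(G_t)
    w -= eta * g / (delta + np.sqrt(G_diag))  # per-coordinate step sizes

print(w)  # approaches [1, -2]
```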
WHY? Time-varying learning rate vs. time-varying proximal term:

I want the contribution of $\|\mathbf{g}_t\|_*^2$ induced by $\psi_t$ to be upper bounded by the $\|\mathbf{g}_t\|_2$'s.

Define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \cdot \rangle}$; then $\psi_t(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T H_t \mathbf{w}$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$, and a little math can show that

$$\sum_{t=1}^{T} \|\mathbf{g}_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2$$
Theorem (Thm. 5 in the paper)
Assume the sequence $\{\mathbf{w}_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|\mathbf{g}_t\|_\infty$, then for any $\mathbf{u} \in S$:

$$R_T(\mathbf{u}) \le \frac{\delta}{\eta} \|\mathbf{u}\|_2^2 + \frac{1}{\eta} \|\mathbf{u}\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$$

For $\{\mathbf{w}_t\}$ generated using the composite mirror descent update (9):

$$R_T(\mathbf{u}) \le \frac{1}{2\eta} \max_{t \le T} \|\mathbf{u} - \mathbf{w}_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$$
Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:

- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]
RMSprop:

$$G_t = \rho G_{t-1} + (1 - \rho) \mathbf{g}_t \mathbf{g}_t^T, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$\mathbf{w}_{t+1} = \operatorname{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^T H_t (\mathbf{w} - \mathbf{w}_t)$$

$$\Rightarrow \quad \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta \mathbf{g}_t}{\mathrm{RMS}[\mathbf{g}]_t}, \quad \mathrm{RMS}[\mathbf{g}]_t = \mathrm{diag}(H_t) \quad (11)$$
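A minimal sketch of update (11); the values of $\rho$, $\eta$, $\delta$ and the toy objective are assumptions, not from the slides:

```python
import numpy as np

# RMSprop implementing update (11) on an assumed quadratic objective.

def grad(w):
    return w - np.array([1.0, -2.0])

eta, rho, delta, T = 0.05, 0.9, 1e-8, 500
w = np.zeros(2)
G_diag = np.zeros(2)                        # running average of g^2

for t in range(T):
    g = grad(w)
    G_diag = rho * G_diag + (1 - rho) * g ** 2
    w -= eta * g / np.sqrt(delta + G_diag)  # divide by RMS[g]_t

print(w)
```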
Improving direction used here: replace the truncated sum with a running average.
A second-order heuristic for the diagonal pre-conditioner:

$$\Delta \mathbf{w} \propto H^{-1} \mathbf{g} \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \;\Rightarrow\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta \mathbf{w}} \;\Rightarrow\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[\mathbf{g}]_t}{\mathrm{RMS}[\Delta \mathbf{w}]_{t-1}}$$

Improving direction used here: (approximate) second-order information.
adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[\mathbf{g}]_t}{\mathrm{RMS}[\Delta \mathbf{w}]_{t-1}}$$

$$\mathbf{w}_{t+1} = \operatorname{argmin}_{\mathbf{w}} \; \langle \mathbf{w}, \mathbf{g}_t \rangle + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^T H_t (\mathbf{w} - \mathbf{w}_t)$$

$$\Rightarrow \quad \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\mathrm{RMS}[\Delta \mathbf{w}]_{t-1}}{\mathrm{RMS}[\mathbf{g}]_t} \mathbf{g}_t \quad (12)$$
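A minimal sketch of update (12); $\rho$, $\delta$ and the toy objective are illustrative assumptions. Note there is no global learning rate $\eta$ left in the update:

```python
import numpy as np

# adaDelta implementing update (12): the step is the ratio of the RMS
# of past updates to the RMS of past gradients.

def grad(w):
    return w - np.array([1.0, -2.0])

rho, delta, T = 0.95, 1e-6, 500
w = np.zeros(2)
Eg2 = np.zeros(2)    # running average of g^2
Edw2 = np.zeros(2)   # running average of (delta w)^2

for t in range(T):
    g = grad(w)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g  # RMS ratio step
    Edw2 = rho * Edw2 + (1 - rho) * dw ** 2
    w += dw

print(w)
```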
Improving directions used here: a running average plus (approximate) second-order information, leaving no global learning rate to tune.
ADAM:

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \mathbf{g}_t, \quad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2$$

$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}$$

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta \hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \delta} \quad (13)$$
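A minimal sketch of update (13); the hyper-parameter values follow common defaults and are assumptions here:

```python
import numpy as np

# ADAM implementing update (13) on an assumed quadratic objective.

def grad(w):
    return w - np.array([1.0, -2.0])

eta, beta1, beta2, delta, T = 0.1, 0.9, 0.999, 1e-8, 500
w = np.zeros(2)
m = np.zeros(2)  # first-moment (momentum) estimate
v = np.zeros(2)  # second-moment estimate

for t in range(1, T + 1):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)      # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + delta)

print(w)
```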
Improving directions used here: momentum, plus bias correction of the running averages.
Demo time

See the wikipedia page "Test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

Idea 3: learn the (global) learning rates as well:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \mathbf{g}_t \;\Rightarrow\; \mathbf{w}_t = \mathbf{w}(t, \eta; \mathbf{w}_1, \{\mathbf{g}_i\}_{i \le t-1})$$

so $\mathbf{w}_t$ is a function of $\eta$, and we can learn $\eta$ (with gradient descent) by

$$\eta^* = \operatorname{argmin}_{\eta} \sum_{s \le t} f_s(\mathbf{w}(s, \eta))$$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]; a minimal sketch follows this list
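A sketch of the on-the-fly flavour, under the simplifying one-step truncation assumption that we only differentiate through the most recent update $\mathbf{w}_t = \mathbf{w}_{t-1} - \eta \mathbf{g}_{t-1}$, which gives the hypergradient $\partial f_t(\mathbf{w}_t)/\partial \eta = -\langle \mathbf{g}_t, \mathbf{g}_{t-1} \rangle$; the meta-learning rate `alpha` and the toy losses are illustrative assumptions, not the method of the cited papers:

```python
import numpy as np

# On-the-fly learning-rate adaptation (illustrative sketch): descend
# the loss w.r.t. eta using the one-step hypergradient -<g_t, g_{t-1}>.

def grad(w):
    return w - np.array([1.0, -2.0])

eta, alpha, T = 0.01, 0.001, 300   # initial eta, meta-learning rate
w = np.zeros(2)
g_prev = np.zeros(2)

for t in range(T):
    g = grad(w)
    eta -= alpha * (-np.dot(g, g_prev))  # gradient step on eta itself
    w -= eta * g                         # usual SGD step with current eta
    g_prev = g

print(f"learned eta = {eta:.4f}, w = {w}")
```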
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.
References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.

Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.

Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.

Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)
11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Exact solution of gradient descent ODE
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
d
dth(t) =
d
dts(t)minus x2
2
=d
dt〈s(t)minus x s(t)minus x〉
= 2
langds(t)
dt s(t)minus x
rang= minus〈s(t)minus xnablaC (s(t))〉 le 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption
infx h(x)gtK
〈x minus xnablaC (x)〉 gt 0
which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)
to 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies:

1. $C$ has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$;
3. $\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$ (e.g. $\varepsilon_n = 1/n$), we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.

21 of 27
Discrete Gradient Descent

Proof outline:

1. Define the Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$.
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$.
4. Show that this implies that $h_n$ converges.
5. Show that $h_n$ must converge to 0.
6. $s_n \to x^\star$.

Step 1: Define the Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$. Because of the error that the discretisation introduces, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. (Illustration credit: Online Learning and Stochastic Approximation, Leon Bottou, 1998.)

Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. With some algebra, note that

$$\begin{aligned}
h_{n+1} - h_n &= \langle s_{n+1} - x^\star, s_{n+1} - x^\star \rangle - \langle s_n - x^\star, s_n - x^\star \rangle \\
&= \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2\langle s_{n+1} - s_n, x^\star \rangle \\
&= \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2\varepsilon_n \langle \nabla C(s_n), x^\star \rangle \\
&= -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2
\end{aligned}$$

Step 3: Show that $\sum_{n=1}^\infty h_n^+ < \infty$. Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2$, we get

$$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + B h_n)$$
$$\implies h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$$

Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get

$$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A$$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable.

Step 4: Show that this implies that $h_n$ converges. If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why?).

But $h_{n+1} = h_1 + \sum_{k=1}^n \big(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\big)$.

So $(h_n)_{n=1}^\infty$ converges.

Step 5: Show that $h_n$ must converge to 0. Assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0$$

contradicting $h_n$ converging away from 0.

22 of 27
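As a quick numerical sanity check (my addition, not part of the slides), one can run the discrete scheme with a Robbins-Monro schedule and watch the Lyapunov sequence shrink; the schedule ε_n = 1/(20 + n) is an illustrative choice, with the offset keeping the early steps stable for this particular quadratic:

```python
import numpy as np

def grad_C(x):
    # Same hypothetical quadratic as before; unique minimiser x* = (0, 0).
    return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

s = np.array([1.5, 0.5])
for n in range(1, 10001):
    eps_n = 1.0 / (20 + n)    # sum eps_n = infinity, sum eps_n^2 < infinity
    s = s - eps_n * grad_C(s)
h_n = np.sum(s**2)            # Lyapunov sequence h_n = ||s_n - x*||^2
print(h_n)                    # close to 0, consistent with s_n -> x*
```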
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x)$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

23 of 27
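A minimal sketch of the subsampled estimator (my illustration; it assumes the per-datapoint gradient functions f_n are available as Python callables, and all names are hypothetical):

```python
import numpy as np

def minibatch_gradient(x, per_point_grads, K, rng):
    # Draw I_K uniformly among subsets of size K (sampling without
    # replacement) and average; this estimator is unbiased for the
    # full-batch gradient (1/N) * sum_n f_n(x).
    idx = rng.choice(len(per_point_grads), size=K, replace=False)
    return np.mean([per_point_grads[k](x) for k in idx], axis=0)

# Usage sketch:
# rng = np.random.default_rng(0)
# g_hat = minibatch_gradient(x, per_point_grads, K=32, rng=rng)
```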
Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if

- $\mathbb{E}[|X_n|] < \infty$ for all $n$;
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$.

(For example, a simple random walk with mean-zero increments is a martingale.)

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\big[(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\, \mathbf{1}_{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0}\big] < \infty$$

then $X_n \to X_\infty$ almost surely.

³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$ with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.

24 of 27
Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies:

1. $C$ has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$;
3. $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

25 of 27
Stochastic Gradient Descent

Proof outline:

1. Define the Lyapunov process $h_n = \|s_n - x^\star\|_2^2$.
2. Consider the variations $h_{n+1} - h_n$.
3. Show that $h_n$ converges almost surely.
4. Show that $h_n$ must converge to 0 almost surely.
5. $s_n \to x^\star$.

Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$$

So

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2\, \mathbb{E}\big[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\big]$$

Step 3: Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

- Assume $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B\|x - x^\star\|_2^2$.
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.
- Get $\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A$

$$\implies \mathbb{E}\big[(h'_{n+1} - h'_n)\, \mathbf{1}_{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0} \mid \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A$$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h'_n)_{n=1}^\infty$ converges a.s. $\implies$ $(h_n)_{n=1}^\infty$ converges a.s.

Step 4: Show that $h_n$ must converge to 0 almost surely. From previous calculations,

$$\mathbb{E}\big[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n\big] \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A$$

$(h_n)_{n=1}^\infty$ converges, so the left-hand sequence is summable a.s.; assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, the $\varepsilon_n^2 A$ term is summable a.s., so the remaining term is also summable a.s.:

$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle < \infty \quad \text{almost surely}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely}$$

and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

26 of 27
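Again as a numerical illustration of my own (not from the slides), here is SGD on the same hypothetical quadratic with unbiased Gaussian-noise gradient estimates and Robbins-Monro steps:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_C(x):
    return np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])

s = np.array([1.5, 0.5])
for n in range(1, 100001):
    H_n = grad_C(s) + rng.normal(scale=1.0, size=2)  # unbiased estimator
    s = s - (1.0 / (20 + n)) * H_n
print(np.sum(s**2))  # Lyapunov process h_n ends up close to 0
```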
Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}\big[\|H_n(x)\|^k\big] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.

27 of 27
Motivation

So far we've told you why SGD is "safe" :)

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve fast convergence / a better local optimum?
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)

[Figure from [Bergstra and Bengio, 2012]]

idea 1: search for the best learning rate schedule:
grid search, random search, cross validation, Bayesian optimization

But I'm lazy and I don't want to spend too much time...
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)

idea 2: let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
adaGrad [Duchi et al., 2010]
adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
ADAM [Kingma and Ba, 2014]
Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning

a good tutorial [Shalev-Shwartz, 2012]
milestone paper [Zinkevich, 2003]

and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning

From batch to online learning

Batch learning often assumes:

we've got the whole dataset $\mathcal{D}$
the cost function $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD/mini-batch learning: accelerate training in real time by

processing one/a mini-batch of datapoint(s) each iteration
considering gradients with data-point cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning

From batch to online learning (cont)

Let's forget the cost function on the batch for a moment.

online gradient descent (OGD):

each iteration $t$ we receive a loss $f_t(w)$
and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
then we update the weights with $w \leftarrow w - \eta g_t$

for SGD, the received gradient is defined by

$$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning

From batch to online learning (cont)

Online learning assumes:

each iteration $t$ we receive a loss $f_t(w)$
(in the batch learning context, $f_t(w) = c(w, x_t)$)
then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$$R^\star_T = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \qquad (1)$$

General definition: $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning

From batch to online learning (cont)

Follow the Leader (FTL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) \qquad (2)$$

Lemma (upper bound of regret):

Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$:

$$R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})] \qquad (3)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning

From batch to online learning (cont)

A game that fools FTL:

Let $w \in S = [-1, 1]$, and the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled!
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
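To see the linear regret concretely, here is a small simulation of this game (my own sketch; the start point w₁ = 0 is an assumption, and since each cumulative loss is linear, FTL always picks an endpoint of [−1, 1]):

```python
def a(t):
    # Loss coefficients: f_t(w) = a(t) * w on S = [-1, 1].
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T = 100
w, cum_a, ftl_loss = 0.0, 0.0, 0.0   # assume FTL starts at w_1 = 0
for t in range(1, T + 1):
    ftl_loss += a(t) * w             # suffer f_t(w_t)
    cum_a += a(t)
    w = -1.0 if cum_a > 0 else 1.0   # FTL: argmin over [-1, 1] of cum_a * w
print(ftl_loss)  # roughly T - 1, while u = 0 suffers 0: regret grows linearly
```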
Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret):

Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$:

$$\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning

From batch to online learning (cont)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)$$

use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:

use the linearisation as a surrogate loss (it upper bounds the LHS)
use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent):

Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$, with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$; then for all $u \in S$:

$$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$$

Keep in mind for the later discussion of adaGrad:

I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$;
then the L2 norm will be changed to $\|\cdot\|_\star$,
and we use Holder's inequality for the proof.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
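A minimal OGD sketch under assumed conditions (linear losses $f_t(w) = \langle a_t, w \rangle$ on the box $S = [-1,1]^d$ with Euclidean projection; the constrained variant with projection is my choice, and all names are illustrative):

```python
import numpy as np

def ogd(a_seq, eta):
    # Online gradient descent: w <- Proj_S(w - eta * g_t); here the losses
    # are linear so g_t = a_t, and S = [-1, 1]^d so projection = clipping.
    w = np.zeros_like(a_seq[0], dtype=float)   # assume w_1 = 0
    total_loss = 0.0
    for a_t in a_seq:
        total_loss += float(a_t @ w)           # suffer f_t(w_t)
        w = np.clip(w - eta * a_t, -1.0, 1.0)  # gradient step + projection
    return total_loss
```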
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$:

$\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$$

Primal-dual sub-gradient update:

$$w_{t+1} = \arg\min_w\ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \arg\min_w\ \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$w_{t+1} = \arg\min_{w \in S}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w \qquad (8)$$

$$w_{t+1} = \arg\min_{w \in S}\ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:

add a proximal term or a Bregman divergence to the objective
adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2} w^T H_t w$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \arg\min_w\ \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T \big[\delta I + \mathrm{diag}(G_t^{1/2})\big] (w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \qquad (10)$$

(all operations coordinate-wise)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
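A sketch of the per-coordinate update (10) (my illustration; the state carried between calls is the running sum of squared gradients):

```python
import numpy as np

def adagrad_step(w, g, sum_sq_g, eta=0.1, delta=1e-8):
    # Diagonal adaGrad, eq. (10): per-coordinate step sizes
    # eta / (delta + sqrt(sum of squared past gradients)).
    sum_sq_g = sum_sq_g + g**2
    w = w - eta * g / (delta + np.sqrt(sum_sq_g))
    return w, sum_sq_g
```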
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont)

WHY?

time-varying learning rate vs. time-varying proximal term

I want the contribution of $\|g_t\|_\star^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$:

define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot \rangle}$
$\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^\star}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont)

Theorem (Thm. 5 in the paper):

Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$:

$$R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$

For $w_t$ generated using the composite mirror descent update (9):

$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad:

the global learning rate $\eta$ still needs hand tuning
sensitive to initial conditions

possible improving directions:

use a truncated sum / running average
use (approximate) second-order information
add momentum
correct the bias of the running average

existing methods:

RMSprop [Tieleman and Hinton, 2012]
adaDelta [Zeiler, 2012]
ADAM [Kingma and Ba, 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop:

$$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \arg\min_w\ \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$$

(direction used here: a running average instead of the full sum)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates

Improving adaGrad (cont)

$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}$$

$$\Rightarrow \quad \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \quad \Rightarrow \quad \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

(direction used here: (approximate) second-order information)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \arg\min_w\ \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow \quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13)$$

(directions used here: momentum, and correcting the bias of the running average)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
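Illustrative sketches of updates (11)-(13) (my own code; the hyper-parameter defaults are common choices, not values from the slides):

```python
import numpy as np

def rmsprop_step(w, g, G, eta=1e-3, rho=0.9, delta=1e-8):
    # Eq. (11): running average of squared gradients, per-coordinate scaling.
    G = rho * G + (1 - rho) * g**2
    return w - eta * g / np.sqrt(delta + G), G

def adadelta_step(w, g, G, D, rho=0.95, delta=1e-8):
    # Eq. (12): ratio of RMS of past updates to RMS of past gradients.
    G = rho * G + (1 - rho) * g**2
    dw = -np.sqrt(delta + D) / np.sqrt(delta + G) * g
    D = rho * D + (1 - rho) * dw**2
    return w + dw, G, D

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
    # Eq. (13): momentum and running average, both with bias correction.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return w - eta * m_hat / (np.sqrt(v_hat) + delta), m, v
```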
Advanced adaptive learning rates

Demo time!

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates

Learn the learning rates

idea 3: learn the (global) learning rates as well

$$w_{t+1} = w_t - \eta g_t \quad \Rightarrow \quad w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1})$$

$w_t$ is a function of $\eta$

learn $\eta$ (with gradient descent) by

$$\eta^\star = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))$$

reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
stochastic gradient descent to learn $\eta$ on the fly [Massé and Ollivier, 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
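One simple instantiation of this idea (my sketch, not the exact algorithms of the cited papers): approximate dw_t/dη by the accumulated negative gradients, ignoring the dependence of g_t on w_t (a crude simplifying assumption), and take gradient steps on η with a meta learning rate:

```python
import numpy as np

def sgd_with_learned_eta(grad_fns, w0, eta0=0.05, meta_rate=1e-4):
    # grad_fns: sequence of callables w -> g_t (assumed available).
    w = np.array(w0, dtype=float)
    eta = eta0
    dw_deta = np.zeros_like(w)        # crude estimate of dw_t / d_eta
    for grad in grad_fns:
        g = grad(w)
        # d f_t(w_t)/d_eta ~ <g_t, dw_t/d_eta>; descend on eta:
        eta -= meta_rate * float(g @ dw_deta)
        dw_deta -= g                  # since w_{t+1} = w_t - eta * g_t
        w = w - eta * g
    return w, eta
```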
Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochastic approximation and optimisation methods.
Theoretical analysis guarantees good behaviour.
Adaptive learning rates provide good performance in practice.
Future work will reduce the labour of tuning hyper-parameters.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates

References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1): 281-305, 2012.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12: 2121-2159, 2011.
Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2): 107-194, 2011.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Massé, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
G_t = ∑_{i=1}^t g_i g_i^T, H_t = δI + diag(G_t)^{1/2}

w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηϕ(w) + (1/2t) w^T H_t w    (8)

w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηϕ(w) + (1/2)(w − w_t)^T H_t (w − w_t)    (9)

To get adaGrad from online gradient descent:
add a proximal term or a Bregman divergence to the objective, and adapt the proximal term through time:

ψ_t(w) = (1/2) w^T H_t w
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Example: composite mirror descent on S = R^D with ϕ(w) = 0:

w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2)(w − w_t)^T [δI + diag(G_t)](w − w_t)

⇒ w_{t+1} = w_t − η g_t / (δ + √(∑_{i=1}^t g_i^2))    (10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
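Update (10) as a minimal NumPy sketch (diagonal adaGrad under the slide's setting ϕ(w) = 0, S = R^D; the gradient oracle grad_fn is a placeholder assumption):

```python
import numpy as np

def adagrad(grad_fn, w0, eta=0.1, delta=1e-8, T=1000):
    """Diagonal adaGrad, eq. (10): w <- w - eta * g / (delta + sqrt(sum of g^2))."""
    w, G = w0.copy(), np.zeros_like(w0)     # G accumulates squared gradients, diag(G_t)
    for t in range(T):
        g = grad_fn(w, t)
        G += g * g
        w -= eta * g / (delta + np.sqrt(G))
    return w
```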
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs. time-varying proximal term:
I want the contribution of ||g_t||_*^2 induced by ψ_t to be upper-bounded by ||g_t||_2.
Define ||·||_{ψ_t} = √⟨·, H_t ·⟩; then ψ_t(w) = (1/2) w^T H_t w is strongly convex w.r.t. ||·||_{ψ_t}, and a little math can show that

∑_{t=1}^T ||g_t||_{ψ_t,*}^2 ≤ 2 ∑_{d=1}^D ||g_{1:T,d}||_2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume the sequence w_t is generated by adaGrad using the primal-dual update (8) with δ ≥ max_t ||g_t||_∞; then for any u ∈ S,

R_T(u) ≤ (δ/η)||u||_2^2 + (1/η)||u||_∞^2 ∑_{d=1}^D ||g_{1:T,d}||_2 + η ∑_{d=1}^D ||g_{1:T,d}||_2

For w_t generated using the composite mirror descent update (9),

R_T(u) ≤ (1/2η) max_{t≤T} ||u − w_t||_2^2 ∑_{d=1}^D ||g_{1:T,d}||_2 + η ∑_{d=1}^D ||g_{1:T,d}||_2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
Possible problems of adaGrad:
• the global learning rate η still needs hand-tuning
• sensitive to initial conditions

Possible improving directions:
• use a truncated sum / running average
• use (approximate) second-order information
• add momentum
• correct the bias of the running average

Existing methods:
• RMSprop [Tieleman and Hinton, 2012]
• adaDelta [Zeiler, 2012]
• ADAM [Kingma and Ba, 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
G_t = ρG_{t−1} + (1 − ρ) g_t g_t^T, H_t = (δI + diag(G_t))^{1/2}

w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2)(w − w_t)^T H_t (w − w_t)

⇒ w_{t+1} = w_t − η g_t / RMS[g]_t, RMS[g]_t = diag(H_t)    (11)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
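The same loop with an exponential running average in place of the full sum gives update (11); a sketch (ρ = 0.9 and δ = 1e-8 are common defaults, not values from the slides):

```python
import numpy as np

def rmsprop(grad_fn, w0, eta=1e-3, rho=0.9, delta=1e-8, T=1000):
    """RMSprop, eq. (11): divide the gradient by a running RMS of its magnitude."""
    w, G = w0.copy(), np.zeros_like(w0)     # G is the running average of g^2
    for t in range(T):
        g = grad_fn(w, t)
        G = rho * G + (1 - rho) * g * g
        w -= eta * g / np.sqrt(delta + G)   # RMS[g]_t
    return w
```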
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w ∝ H⁻¹g ∝ (∂f/∂w) / (∂²f/∂w²)

⇒ ∂²f/∂w² ∝ (∂f/∂w) / ∆w

⇒ diag(H) ≈ RMS[g]_t / RMS[∆w]_{t−1}
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(H_t) = RMS[g]_t / RMS[∆w]_{t−1}

w_{t+1} = argmin_w ⟨w, g_t⟩ + (1/2)(w − w_t)^T H_t (w − w_t)

⇒ w_{t+1} = w_t − (RMS[∆w]_{t−1} / RMS[g]_t) g_t    (12)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
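A sketch of update (12); note there is no global η, since the RMS of past updates supplies the scale (the ρ and δ defaults are common choices, not from the slides):

```python
import numpy as np

def adadelta(grad_fn, w0, rho=0.95, delta=1e-6, T=1000):
    """adaDelta, eq. (12): step size RMS[dw]_{t-1} / RMS[g]_t, applied to the gradient."""
    w = w0.copy()
    Eg, Edw = np.zeros_like(w), np.zeros_like(w)   # running averages of g^2 and dw^2
    for t in range(T):
        g = grad_fn(w, t)
        Eg = rho * Eg + (1 - rho) * g * g
        dw = -(np.sqrt(Edw + delta) / np.sqrt(Eg + delta)) * g
        Edw = rho * Edw + (1 - rho) * dw * dw
        w += dw
    return w
```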
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
m_t = β_1 m_{t−1} + (1 − β_1) g_t, v_t = β_2 v_{t−1} + (1 − β_2) g_t^2

m̂_t = m_t / (1 − β_1^t), v̂_t = v_t / (1 − β_2^t)

w_{t+1} = w_t − η m̂_t / (√v̂_t + δ)    (13)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
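Update (13) as a sketch, combining momentum with RMS scaling and correcting both biases (the default βs follow the ADAM paper; the gradient oracle is again a placeholder):

```python
import numpy as np

def adam(grad_fn, w0, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8, T=1000):
    """ADAM, eq. (13): bias-corrected first and second moment estimates."""
    w = w0.copy()
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, T + 1):
        g = grad_fn(w, t)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)    # bias correction of the running averages
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w
```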
Advanced adaptive learning rates
Demo time
Wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
Idea 3: learn the (global) learning rates as well.

w_{t+1} = w_t − η g_t ⇒ w_t = w(t, η, w_1, {g_i}_{i≤t−1})

w_t is a function of η, so learn η (with gradient descent) by

η* = argmin_η ∑_{s≤t} f_s(w(s, η))

• Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
• Stochastic gradient descent to learn η on the fly [Massé and Ollivier, 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
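As a toy illustration of the idea (not the reverse-mode method of Maclaurin et al.), one can treat the training loss after a fixed number of steps as a function of η and descend it with finite differences; the quadratic objective and all constants here are our own hypothetical choices:

```python
import numpy as np

def loss_after_training(eta, steps=50):
    """Run plain gradient descent on C(w) = ||w||^2 and report the final loss."""
    w = np.ones(5)
    for _ in range(steps):
        w -= eta * 2 * w              # gradient of ||w||^2 is 2w
    return float(np.sum(w ** 2))

eta, h, meta_lr = 0.01, 1e-4, 1e-4    # crude hyper-descent on eta
for _ in range(100):
    hypergrad = (loss_after_training(eta + h) - loss_after_training(eta - h)) / (2 * h)
    eta = float(np.clip(eta - meta_lr * hypergrad, 1e-4, 0.49))
print("tuned eta:", eta)
```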
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provide good performance in practice
Future work will reduce the labour of tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman, T., Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
References

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Massé, P.-Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Continuous Gradient Descent
Let C: R^k → R be differentiable. How do solutions s: R → R^k to the following ODE behave?

d/dt s(t) = −∇C(s(t))

Example: C(x) = 5(x_1 + x_2)^2 + (x_1 − x_2)^2

We can analytically solve the gradient descent equation for this example:

∇C(x) = (12x_1 + 8x_2, 8x_1 + 12x_2)

So (s′_1, s′_2) = −(12s_1 + 8s_2, 8s_1 + 12s_2), and solving with (s_1(0), s_2(0)) = (1.5, 0.5) gives

(s_1(t), s_2(t)) = (e^{−20t} + 0.5e^{−4t}, e^{−20t} − 0.5e^{−4t})

11 of 27
[Animation: exact solution of the gradient descent ODE]
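A quick numerical check of the analytic solution above (a minimal sketch using SciPy's ODE integrator, with tolerances left at their defaults):

```python
import numpy as np
from scipy.integrate import solve_ivp

grad_C = lambda x: np.array([12 * x[0] + 8 * x[1], 8 * x[0] + 12 * x[1]])
sol = solve_ivp(lambda t, x: -grad_C(x), (0.0, 2.0), [1.5, 0.5], dense_output=True)

t = np.linspace(0.0, 2.0, 5)
exact = np.vstack([np.exp(-20 * t) + 0.5 * np.exp(-4 * t),
                   np.exp(-20 * t) - 0.5 * np.exp(-4 * t)])
print(np.max(np.abs(sol.sol(t) - exact)))   # small: numerical and analytic solutions agree
```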
Continuous Gradient Descent

We'll start with some simplifying assumptions:
1. C has a unique minimiser x*
2. ∀ε > 0, inf_{||x−x*||_2^2 > ε} ⟨x − x*, ∇C(x)⟩ > 0
(The second condition is weaker than convexity.)

Proposition: If s: [0, ∞) → X satisfies the differential equation

ds(t)/dt = −∇C(s(t))

then s(t) → x* as t → ∞.

[Figure: example of a non-convex function satisfying these conditions]
13 of 27
Continuous Gradient Descent
1. Define Lyapunov function h(t) = ||s(t) − x*||_2^2
2. h(t) is a decreasing function
3. h(t) converges to a limit, and h′(t) converges to 0
4. The limit of h(t) must be 0
5. s(t) → x*
14 of 27
Continuous Gradient Descent

Step 2: h(t) is a decreasing function:

d/dt h(t) = d/dt ||s(t) − x*||_2^2 = d/dt ⟨s(t) − x*, s(t) − x*⟩
          = 2⟨ds(t)/dt, s(t) − x*⟩
          = −2⟨s(t) − x*, ∇C(s(t))⟩ ≤ 0
14 of 27
Continuous Gradient Descent

Step 4: The limit of h(t) must be 0. If not, there is some K > 0 such that h(t) > K for all t. By assumption,

inf_{x: ||x−x*||_2^2 > K} ⟨x − x*, ∇C(x)⟩ > 0,

which means that sup_t h′(t) < 0, contradicting the convergence of h′(t) to 0.
14 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this.
15 of 27
Discrete Gradient Descent
An intuitive, straightforward discretisation of

d/dt s(t) = −∇C(s(t))

is

(s_{n+1} − s_n)/ε_n = −∇C(s_n)    (Forward Euler)

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method, which is (A-)stable, is the implicit Euler method:

(s_{n+1} − s_n)/ε_n = −∇C(s_{n+1})

So why settle for explicit discretisation? Implicit requires the solution of a non-linear system at each step, and practically, explicit discretisation works.
16 of 27
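A sketch contrasting the two schemes on the quadratic example above; because ∇C is linear there, the implicit step reduces to a linear solve (in general it needs a non-linear solver, e.g. Newton's method):

```python
import numpy as np

A = np.array([[12.0, 8.0], [8.0, 12.0]])    # grad C(x) = A x for the earlier example
eps = 0.11                                  # eps * 20 > 2: forward Euler is unstable here
s_fwd = s_imp = np.array([1.5, 0.5])
for n in range(50):
    s_fwd = s_fwd - eps * (A @ s_fwd)                      # forward (explicit) Euler
    s_imp = np.linalg.solve(np.eye(2) + eps * A, s_imp)    # implicit Euler
print(np.linalg.norm(s_fwd), np.linalg.norm(s_imp))        # blows up vs. decays to 0
```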
Discrete Gradient Descent

We also have multistep methods, e.g.

s_{n+2} = s_{n+1} − ε_{n+2} ( (3/2)∇C(s_{n+1}) − (1/2)∇C(s_n) )

This is the 2nd-order Adams-Bashforth method.

Similar (in)stability properties to forward Euler, but a discretisation error of a smaller order of magnitude at each step (O(ε^3) compared with O(ε^2)).
17 of 27
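The same quadratic example stepped with 2nd-order Adams-Bashforth, bootstrapping the first step with forward Euler (the step size is an illustrative choice):

```python
import numpy as np

A = np.array([[12.0, 8.0], [8.0, 12.0]])
grad_C = lambda x: A @ x
eps = 0.01

s = [np.array([1.5, 0.5])]
s.append(s[0] - eps * grad_C(s[0]))    # bootstrap: one forward Euler step
for n in range(500):
    s.append(s[-1] - eps * (1.5 * grad_C(s[-1]) - 0.5 * grad_C(s[-2])))   # AB2 step
print(s[-1])   # close to the minimiser (0, 0)
```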
Discrete Gradient Descent
[Animation: examples of discretisation with ε = 0.1]

18 of 27
Discrete Gradient Descent
[Animation: examples of discretisation with ε = 0.07]

19 of 27
Discrete Gradient Descent
[Animation: examples of discretisation with ε = 0.01]

20 of 27
Discrete Gradient Descent
Proposition: If s_{n+1} = s_n − ε_n∇C(s_n) and C satisfies
1. C has a unique minimiser x*
2. ∀ε > 0, inf_{||x−x*||_2^2 > ε} ⟨x − x*, ∇C(x)⟩ > 0
3. ||∇C(x)||_2^2 ≤ A + B||x − x*||_2^2 for some A, B ≥ 0

then subject to ∑_n ε_n = ∞ and ∑_n ε_n^2 < ∞, we have s_n → x*.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
21 of 27
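A sketch of the proposition's scheme on the running quadratic, with step sizes ε_n = 1/(n + 10), which satisfy ∑ε_n = ∞ and ∑ε_n^2 < ∞ (the shift by 10 just keeps the early steps tame; the example function is ours):

```python
import numpy as np

A = np.array([[12.0, 8.0], [8.0, 12.0]])   # grad C(x) = A x, minimiser x* = 0
s = np.array([1.5, 0.5])
for n in range(1, 10001):
    s = s - (1.0 / (n + 10)) * (A @ s)     # Robbins-Monro step sizes
print(np.linalg.norm(s))                   # -> 0: s_n converges to x*
```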
Discrete Gradient Descent
1. Define Lyapunov sequence h_n = ||s_n − x*||_2^2
2. Consider the positive variations h_n^+ = max(0, h_{n+1} − h_n)
3. Show that ∑_{n=1}^∞ h_n^+ < ∞
4. Show that this implies that h_n converges
5. Show that h_n must converge to 0
6. s_n → x*
22 of 27
Discrete Gradient Descent

Step 1: Define Lyapunov sequence h_n = ||s_n − x*||_2^2.
Because of the error that the discretisation introduced, h_n is not guaranteed to be decreasing, so we have to adjust the proof.

22 of 27
Discrete Gradient Descent

Idea: h_n may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of h_n.
[Illustration credit: Online Learning and Stochastic Approximation, Léon Bottou (1998)]
22 of 27
Discrete Gradient Descent

Step 2: Consider the positive variations h_n^+ = max(0, h_{n+1} − h_n). With some algebra, note that

h_{n+1} − h_n = ⟨s_{n+1} − x*, s_{n+1} − x*⟩ − ⟨s_n − x*, s_n − x*⟩
             = ⟨s_{n+1}, s_{n+1}⟩ − ⟨s_n, s_n⟩ − 2⟨s_{n+1} − s_n, x*⟩
             = ⟨s_n − ε_n∇C(s_n), s_n − ε_n∇C(s_n)⟩ − ⟨s_n, s_n⟩ + 2ε_n⟨∇C(s_n), x*⟩
             = −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n^2||∇C(s_n)||_2^2

22 of 27
Discrete Gradient Descent

Step 3: Show that ∑_{n=1}^∞ h_n^+ < ∞. Assuming ||∇C(x)||_2^2 ≤ A + B||x − x*||_2^2, we get

h_{n+1} − h_n ≤ −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n^2(A + Bh_n)

⇒ h_{n+1} − (1 + ε_n^2 B)h_n ≤ −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n^2 A ≤ ε_n^2 A

22 of 27
Discrete Gradient Descent

From h_{n+1} − (1 + ε_n^2 B)h_n ≤ ε_n^2 A, writing μ_n = ∏_{i=1}^n 1/(1 + ε_i^2 B) and h′_n = μ_n h_n, we get

h′_{n+1} − h′_n ≤ ε_n^2 μ_n A

Assuming ∑_{n=1}^∞ ε_n^2 < ∞, μ_n converges away from 0, so the RHS is summable.

22 of 27
Discrete Gradient Descent

Step 4: Show that this implies that h_n converges.
If ∑_{n=1}^∞ max(0, h_{n+1} − h_n) < ∞, then ∑_{n=1}^∞ min(0, h_{n+1} − h_n) is also finite (why?).

But h_{n+1} = h_1 + ∑_{k=1}^n [max(0, h_{k+1} − h_k) + min(0, h_{k+1} − h_k)].

So (h_n)_{n=1}^∞ converges.
22 of 27
Discrete Gradient Descent

Step 5: Show that h_n must converge to 0. Assume h_n converges to some positive number. Previously we had

h_{n+1} − h_n = −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n^2||∇C(s_n)||_2^2

This is summable, and if we assume further that ∑_{n=1}^∞ ε_n = ∞, then we get

⟨s_n − x*, ∇C(s_n)⟩ → 0,

contradicting h_n converging away from 0.

22 of 27
Stochastic Gradient Descent
We're now ready to introduce the stochastic approximation of the gradient into our algorithm.
If C is a cost function averaged across a data set, we often have a true gradient of the form

∇C(x) = (1/N) ∑_{n=1}^N f_n(x)

and an approximation is formed by subsampling:

∇̂C(x) = (1/K) ∑_{k∈I_K} f_k(x),    I_K ∼ Unif(subsets of size K)

We'll treat the more general case where the gradient estimates ∇̂C(x) are unbiased and independent.
The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
23 of 27
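A sketch of the subsampled estimator (the per-datapoint quadratic losses are a placeholder assumption; f_n here is the gradient contribution of datapoint n, matching the slide's usage):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))                       # N = 1000 datapoints

def full_grad(x):                                       # grad of (1/N) sum ||x - d_n||^2
    return np.mean(2 * (x - data), axis=0)

def minibatch_grad(x, K=32):                            # I_K ~ Unif(subsets of size K)
    idx = rng.choice(len(data), size=K, replace=False)
    return np.mean(2 * (x - data[idx]), axis=0)         # unbiased estimate of full_grad

x = np.zeros(5)
print(full_grad(x))
print(np.mean([minibatch_grad(x) for _ in range(2000)], axis=0))   # ~ full_grad(x)
```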
Measure-theory-free martingale theory

Let (X_n)_{n≥0} be a stochastic process.
F_n denotes "the information describing the stochastic process up to time n", denoted mathematically by F_n = σ(X_m | m ≤ n).

Definition: A stochastic process (X_n)_{n=0}^∞ is a martingale³ if
• E[|X_n|] < ∞ for all n
• E[X_n | F_m] = X_m for n ≥ m

Martingale convergence theorem (Doob, 1953): If (X_n)_{n=1}^∞ is a martingale and sup_n E[|X_n|] < ∞, then X_n → X_∞ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If (X_n)_{n=1}^∞ is a positive stochastic process and

∑_{n=1}^∞ E[(E[X_{n+1} | F_n] − X_n) 1{E[X_{n+1} | F_n] − X_n > 0}] < ∞,

then X_n → X_∞ almost surely.

³ on a filtered probability space (Ω, F, (F_n)_{n=0}^∞, P), with F_n = σ(X_m | m ≤ n) ∀n
24 of 27
Stochastic Gradient Descent
Proposition: If s_{n+1} = s_n − ε_n H_n(s_n), with H_n(s_n) an unbiased estimator for ∇C(s_n), and C satisfies
1. C has a unique minimiser x*
2. ∀ε > 0, inf_{||x−x*||_2^2 > ε} ⟨x − x*, ∇C(x)⟩ > 0
3. E[||H_n(x)||_2^2] ≤ A + B||x − x*||_2^2 for some A, B ≥ 0 independent of n

then subject to ∑_n ε_n = ∞ and ∑_n ε_n^2 < ∞, we have s_n → x*.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
25 of 27
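The running quadratic again, now with noisy but unbiased gradient estimates (the Gaussian noise model is our assumption), illustrating the proposition:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[12.0, 8.0], [8.0, 12.0]])        # grad C(x) = A x, x* = 0
s = np.array([1.5, 0.5])
for n in range(1, 50001):
    H = A @ s + rng.normal(size=2)              # unbiased estimator of grad C(s)
    s = s - (1.0 / (n + 10)) * H                # Robbins-Monro step sizes
print(s)                                        # close to x* = (0, 0)
```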
Stochastic Gradient Descent
1. Define Lyapunov process h_n = ||s_n − x*||_2^2
2. Consider the variations h_{n+1} − h_n
3. Show that h_n converges almost surely
4. Show that h_n must converge to 0 almost surely
5. s_n → x*
26 of 27
Stochastic Gradient Descent

Step 2: Consider the positive variations h_n^+ = max(0, h_{n+1} − h_n).
By exactly the same calculation as in the deterministic discrete case,

h_{n+1} − h_n = −2ε_n⟨s_n − x*, H_n(s_n)⟩ + ε_n^2||H_n(s_n)||_2^2

So

E[h_{n+1} − h_n | F_n] = −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n^2 E[||H_n(s_n)||_2^2 | F_n]

26 of 27
Stochastic Gradient Descent

Step 3: Show that h_n converges almost surely. Exactly the same as for discrete gradient descent:
• Assume E[||H_n(x)||_2^2] ≤ A + B||x − x*||_2^2
• Introduce μ_n = ∏_{i=1}^n 1/(1 + ε_i^2 B) and h′_n = μ_n h_n

We get

E[h′_{n+1} − h′_n | F_n] ≤ ε_n^2 μ_n A

⇒ E[(h′_{n+1} − h′_n) 1{E[h′_{n+1} − h′_n | F_n] > 0} | F_n] ≤ ε_n^2 μ_n A

∑_{n=1}^∞ ε_n^2 < ∞ ⇒ quasimartingale convergence ⇒ (h′_n)_{n=1}^∞ converges a.s. ⇒ (h_n)_{n=1}^∞ converges a.s.

26 of 27
Stochastic Gradient Descent

Step 4: Show that h_n must converge to 0 almost surely. From previous calculations,

E[h_{n+1} − (1 + ε_n^2 B)h_n | F_n] ≤ −2ε_n⟨s_n − x*, ∇C(s_n)⟩ + ε_n^2 A

(h_n)_{n=1}^∞ converges, so the left-hand side is summable a.s. Assuming ∑_{n=1}^∞ ε_n^2 < ∞, the right term is summable a.s., so the left term is also summable a.s.:

∑_{n=1}^∞ ε_n⟨s_n − x*, ∇C(s_n)⟩ < ∞    almost surely

If we assume in addition that ∑_{n=1}^∞ ε_n = ∞, this forces

⟨s_n − x*, ∇C(s_n)⟩ → 0    almost surely

And by our initial assumptions about C, (h_n)_{n=1}^∞ must converge to 0 almost surely.

26 of 27
Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume
• C is a non-negative, three-times differentiable function
• The Robbins-Monro learning rate conditions hold: ∑_{n=1}^∞ ε_n = ∞, ∑_{n=1}^∞ ε_n^2 < ∞
• The gradient estimate H satisfies E[||H_n(x)||^k] ≤ A_k + B_k||x||^k for k = 2, 3, 4
• There exists D > 0 such that inf_{||x||_2 > D} ⟨x, ∇C(x)⟩ > 0

The idea for the proof is then:
1. For a given start position, the sequence (s_n)_{n=1}^∞ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that ∇C(s_n) necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Continuous Gradient DescentLet C Rk minusrarr R be differentiable How do solutions s Rrarr Rk tothe following ODE behave
d
dts(t) = minusnablaC (s(t))
Example C (x) = 5(x1 + x2)2 + (x1 minus x2)2
We can analytically solve the gradient descent equation for thisexample
nablaC (x) = (12x1 + 8x2 8x1 + 12x2)
So (s prime1 sprime2) = (12s1 + 8s2 8s1 + 12s2) and solving with
(s1(0) s2(0)) = (15 05) gives(s1(t)s2(t)
)=
(eminus20t + 15eminus4t
eminus20t minus 05eminus4t
)11 of 27
Exact solution of gradient descent ODE
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton0)ocgs[i]state=false
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
d
dth(t) =
d
dts(t)minus x2
2
=d
dt〈s(t)minus x s(t)minus x〉
= 2
langds(t)
dt s(t)minus x
rang= minus〈s(t)minus xnablaC (s(t))〉 le 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption
infx h(x)gtK
〈x minus xnablaC (x)〉 gt 0
which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)
to 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory

Let (X_n)_{n \ge 0} be a stochastic process. F_n denotes "the information describing the stochastic process up to time n", denoted mathematically by F_n = \sigma(X_m \mid m \le n).

Definition: A stochastic process (X_n)_{n=0}^\infty is a martingale[3] if
- E[|X_n|] < \infty for all n
- E[X_n \mid F_m] = X_m for n \ge m

Martingale convergence theorem (Doob, 1953): If (X_n)_{n=1}^\infty is a martingale and \sup_n E[|X_n|] < \infty, then X_n \to X_\infty almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If (X_n)_{n=1}^\infty is a positive stochastic process and

\sum_{n=1}^\infty E\big[ (E[X_{n+1} \mid F_n] - X_n) \, \mathbf{1}_{E[X_{n+1} \mid F_n] - X_n > 0} \big] < \infty

then X_n \to X_\infty almost surely.

[3] on a filtered probability space (\Omega, F, (F_n)_{n=0}^\infty, P) with F_n = \sigma(X_m \mid m \le n) for all n
Stochastic Gradient Descent

Proposition: If s_{n+1} = s_n - \varepsilon_n H_n(s_n), with H_n(s_n) an unbiased estimator for \nabla C(s_n), and C satisfies
1. C has a unique minimiser x^*
2. \forall \varepsilon > 0, \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0
3. E[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2 for some A, B \ge 0 independent of n
then subject to \sum_n \varepsilon_n = \infty and \sum_n \varepsilon_n^2 < \infty, we have s_n \to x^*.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
Stochastic Gradient Descent

1. Define Lyapunov process h_n = \|s_n - x^*\|_2^2
2. Consider the variations h_{n+1} - h_n
3. Show that h_n converges almost surely
4. Show that h_n must converge to 0 almost surely
5. s_n \to x^*
Consider the positive variations h_n^+ = \max(0, h_{n+1} - h_n).
By exactly the same calculation as in the deterministic discrete case,

h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2

So

E[h_{n+1} - h_n \mid F_n] = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 E[\|H_n(s_n)\|_2^2 \mid F_n]
Show that h_n converges almost surely. Exactly the same as for discrete gradient descent:
- Assume E[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2
- Introduce \mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B} and h'_n = \mu_n h_n

Get

E[h'_{n+1} - h'_n \mid F_n] \le \varepsilon_n^2 \mu_n A
\implies E\big[ (h'_{n+1} - h'_n) \, \mathbf{1}_{E[h'_{n+1} - h'_n \mid F_n] > 0} \mid F_n \big] \le \varepsilon_n^2 \mu_n A

\sum_{n=1}^\infty \varepsilon_n^2 < \infty \implies quasimartingale convergence \implies (h'_n)_{n=1}^\infty converges a.s. \implies (h_n)_{n=1}^\infty converges a.s.
Show that h_n must converge to 0 almost surely. From previous calculations,

E[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid F_n] \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A

(h_n)_{n=1}^\infty converges, so the left side is summable a.s. Assuming \sum_{n=1}^\infty \varepsilon_n^2 < \infty, the \varepsilon_n^2 A term is summable a.s., so the remaining term is also summable a.s.:

\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty almost surely

If we assume in addition that \sum_{n=1}^\infty \varepsilon_n = \infty, this forces \langle s_n - x^*, \nabla C(s_n) \rangle \to 0 almost surely. And by our initial assumptions about C, (h_n)_{n=1}^\infty must converge to 0 almost surely.
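To see the proposition in action, here is a small simulation (my own toy setup, not from the slides): SGD on C(x) = x^2/2 with unbiased noisy gradients and the Robbins-Monro rates \varepsilon_n = 1/n, so that \sum \varepsilon_n = \infty and \sum \varepsilon_n^2 < \infty.

```python
import numpy as np

rng = np.random.default_rng(1)

def H(s):
    # unbiased estimator of nabla C(s) = s, corrupted by additive noise
    return s + rng.normal()

s = 10.0
for n in range(1, 100_001):
    eps = 1.0 / n            # Robbins-Monro: sum eps diverges, sum eps^2 converges
    s -= eps * H(s)

print(s)  # close to the unique minimiser x* = 0
```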
Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above is to assume:
- C is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: \sum_{n=1}^\infty \varepsilon_n = \infty and \sum_{n=1}^\infty \varepsilon_n^2 < \infty
- the gradient estimate H satisfies E[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k for k = 2, 3, 4
- there exists D > 0 such that \inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0

The idea for the proof is then:
1. For a given start position, the sequence (s_n)_{n=1}^\infty is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that \nabla C(s_n) necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
Motivation
So far we've told you why SGD is "safe" :)
...but Robbins-Monro is just a sufficient condition. So how should we choose learning rates to achieve fast convergence and a better local optimum?
Motivation (cont)
[Figure from [Bergstra and Bengio, 2012]]

idea 1: search for the best learning rate schedule: grid search, random search, cross validation, Bayesian optimization.
But I'm lazy and I don't want to spend too much time!
Motivation (cont)
idea 2: let the learning rates adapt themselves
- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning
- a good tutorial [Shalev-Shwartz, 2012]
- milestone paper [Zinkevich, 2003]

and some practical comparisons
Online learning
From batch to online learning
Batch learning often assumes:
- we've got the whole dataset D
- the cost function C(w; D) = E_{x \sim D}[c(w, x)]

SGD / mini-batch learning: accelerate training in real time by processing one datapoint / a mini-batch of datapoints each iteration, considering gradients with data-point cost functions:

\nabla C(w; D) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)
Online learning
From batch to online learning (cont)
Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):
- each iteration t, we receive a loss f_t(w) and a (noisy) (sub-)gradient g_t \in \partial f_t(w)
- then we update the weights with w \leftarrow w - \eta g_t

For SGD, the received gradient is defined by

g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)
Online learning
From batch to online learning (cont)
Online learning assumes:
- each iteration t, we receive a loss f_t(w) (in the batch learning context, f_t(w) = c(w, x_t))
- then we learn w_{t+1} following some rules

Performance is measured by regret:

R^*_T = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w)    (1)

General definition: R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]

SGD \Leftrightarrow "follow the regularized leader" with L2 regularization
Online learning
From batch to online learning (cont)
Follow the Leader (FTL):

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w)    (2)

Lemma (upper bound of regret): Let w_1, w_2, \ldots be the sequence produced by FTL; then \forall u \in S,

R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]    (3)
Online learning
From batch to online learning (cont)
A game that fools FTL: let w \in S = [-1, 1] and the loss function at time t be

f_t(w) = -0.5 w   if t = 1
f_t(w) = w        if t is even
f_t(w) = -w       if t > 1 and t is odd

FTL is easily fooled! (A simulation follows below.)
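To make "easily fooled" concrete, here is a quick simulation of this game (my own illustration): FTL's cumulative-loss minimiser flips between the endpoints of S = [-1, 1], so it pays roughly 1 per round while the fixed point u = 0 pays 0, giving regret linear in T.

```python
def loss_coef(t):
    # f_t(w) = c_t * w, with the adversarial coefficients from the slide
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T = 100
w, cum, ftl_loss = 0.0, 0.0, 0.0   # w_1 = 0 is an arbitrary first play
for t in range(1, T + 1):
    c = loss_coef(t)
    ftl_loss += c * w              # suffer f_t(w_t)
    cum += c                       # update the cumulative loss coefficient
    w = -1.0 if cum > 0 else 1.0   # FTL: argmin over [-1, 1] of cum * w

# best fixed action in hindsight (losses are linear, so check endpoints and 0)
best_fixed = min(sum(loss_coef(t) * u for t in range(1, T + 1)) for u in (-1.0, 0.0, 1.0))
print(ftl_loss - best_fixed)       # approximately T: regret grows linearly
```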
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL):

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w)    (4)

Lemma (upper bound of regret): Let w_1, w_2, \ldots be the sequence produced by FTRL; then \forall u \in S,

\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]
Online learning
From batch to online learning (cont)
Let's assume the losses f_t are convex functions:

\forall w, u \in S: f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)

Use the linearisation \tilde f_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle as a surrogate:

\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD):

w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t)    (5)

From FTRL to online gradient descent:
- use the linearisation as surrogate loss (it upper-bounds the LHS)
- use the L2 regularizer \varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2
- apply the FTRL regret bound to the surrogate loss:

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions f_1, f_2, \ldots, f_T with regularizer \varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2. Then for all u \in S,

R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)
Keep in mind for the later discussion of adaGrad:
- I want \varphi to be strongly convex w.r.t. a (semi-)norm \|\cdot\|
- then the L2 norm in the bound is replaced by the dual norm \|\cdot\|_*
- and we use Hölder's inequality in the proof
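A minimal numerical check of this bound (my own example, with hypothetical quadratic losses f_t(w) = (w - a_t)^2 / 2 on S = R): the measured regret stays below the stated bound, and both are O(\sqrt{T}) for \eta \propto 1/\sqrt{T}.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
a = rng.normal(size=T)              # f_t(w) = 0.5 * (w - a[t])^2, convex in w
eta = 1.0 / np.sqrt(T)

w, ws, gs = 0.0, [], []
for t in range(T):
    g = w - a[t]                    # gradient of f_t at w_t
    ws.append(w); gs.append(g)
    w -= eta * g                    # OGD update (5)

u = a.mean()                        # best fixed point in hindsight
regret = sum(0.5 * (wt - at) ** 2 - 0.5 * (u - at) ** 2 for wt, at in zip(ws, a))
bound = u ** 2 / (2 * eta) + eta * sum(g ** 2 for g in gs)   # theorem, with w_1 = 0
print(regret, bound)                # regret <= bound
```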
Advanced adaptive learning rates
adaGrad [Duchi et al., 2010]

Use proximal terms \psi_t: \psi_t is a strongly-convex function, and

B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle

Primal-dual sub-gradient update:

w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w)    (6)

Proximal gradient / composite mirror descent:

w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t)    (7)
adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

G_t = \sum_{i=1}^t g_i g_i^T, \quad H_t = \delta I + \mathrm{diag}(G_t^{1/2})

w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w    (8)

w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t)    (9)

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time: \psi_t(w) = \frac{1}{2} w^T H_t w
adaGrad [Duchi et al., 2010] (cont.)

example: composite mirror descent on S = R^D with \varphi(w) = 0:

w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T [\delta I + \mathrm{diag}(G_t)^{1/2}] (w - w_t)

\implies w_{t+1} = w_t - \frac{\eta g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}}    (10)
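Equation (10) is easy to state in code; here is a minimal sketch of the diagonal adaGrad update, assuming \varphi = 0 and S = R^D as in the example (the function name and hyper-parameter defaults are mine).

```python
import numpy as np

def adagrad(grad, w0, eta=0.1, delta=1e-8, steps=500):
    # diagonal adaGrad, update (10): w <- w - eta * g / (delta + sqrt(sum_i g_i^2)),
    # applied elementwise with a per-coordinate sum of squared gradients
    w = np.asarray(w0, dtype=float)
    sq_sum = np.zeros_like(w)        # running diag(G_t)
    for _ in range(steps):
        g = grad(w)
        sq_sum += g ** 2
        w -= eta * g / (delta + np.sqrt(sq_sum))
    return w

# usage sketch: a poorly scaled quadratic, where per-coordinate step sizes help
print(adagrad(lambda w: np.array([1.0, 100.0]) * w, w0=[1.0, 1.0]))
```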
adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs time-varying proximal term: I want the contribution of \|g_t\|_*^2 induced by \psi_t to be upper bounded by \|g_t\|_2 terms.

Define \|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \cdot \rangle}. Then \psi_t(w) = \frac{1}{2} w^T H_t w is strongly convex w.r.t. \|\cdot\|_{\psi_t}, and a little math can show that

\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2
adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence {w_t} is generated by adaGrad using the primal-dual update (8) with \delta \ge \max_t \|g_t\|_\infty; then for any u \in S,

R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2

For {w_t} generated using the composite mirror descent update (9),

R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2
Improving adaGrad

possible problems of adaGrad:
- the global learning rate \eta still needs hand tuning
- sensitive to initial conditions

possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

existing methods: RMSprop [Tieleman and Hinton, 2012], adaDelta [Zeiler, 2012], ADAM [Kingma and Ba, 2014]
Improving adaGrad (cont.)

RMSprop (replaces adaGrad's growing sum with a running average):

G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}

w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

\implies w_{t+1} = w_t - \frac{\eta g_t}{\mathrm{RMS}[g]_t}, \quad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t)    (11)
Improving adaGrad (cont.)

Using (approximate) second-order information: a Newton-style step satisfies

\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}
\implies \partial^2 f / \partial w^2 \propto \frac{\partial f / \partial w}{\Delta w}
\implies \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}
Improving adaGrad (cont.)

adaDelta (approximate second-order information, no global learning rate):

\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}

w_{t+1} = \arg\min_w \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

\implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} g_t    (12)
Improving adaGrad (cont.)

ADAM (running averages with momentum and bias correction):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

\hat m_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat v_t = \frac{v_t}{1 - \beta_2^t}

w_{t+1} = w_t - \frac{\eta \hat m_t}{\sqrt{\hat v_t} + \delta}    (13)
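For reference, the three update rules (11)-(13) written as single-step functions; this is a sketch with hypothetical default hyper-parameters, applying everything elementwise to NumPy arrays.

```python
import numpy as np

def rmsprop_step(w, g, state, eta=1e-3, rho=0.9, delta=1e-8):
    # eq. (11): divide by a running RMS of the gradient
    state["G"] = rho * state.get("G", 0.0) + (1 - rho) * g ** 2
    return w - eta * g / (np.sqrt(state["G"]) + delta)

def adadelta_step(w, g, state, rho=0.95, delta=1e-6):
    # eq. (12): ratio of RMS[dw] to RMS[g]; note there is no global learning rate
    state["G"] = rho * state.get("G", 0.0) + (1 - rho) * g ** 2
    dw = -np.sqrt(state.get("D", 0.0) + delta) / np.sqrt(state["G"] + delta) * g
    state["D"] = rho * state.get("D", 0.0) + (1 - rho) * dw ** 2
    return w + dw

def adam_step(w, g, state, eta=1e-3, b1=0.9, b2=0.999, delta=1e-8):
    # eq. (13): bias-corrected running averages of the first and second moments
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g ** 2
    m_hat = state["m"] / (1 - b1 ** t)
    v_hat = state["v"] / (1 - b2 ** t)
    return w - eta * m_hat / (np.sqrt(v_hat) + delta)

# usage sketch: one step of each rule on the same gradient
w, g = np.array([1.0, -2.0]), np.array([0.1, -0.4])
print(rmsprop_step(w, g, {}), adadelta_step(w, g, {}), adam_step(w, g, {}))
```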
Demo time!
See the wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

idea 3: learn the (global) learning rates as well

w_{t+1} = w_t - \eta g_t \implies w_t = w(t, \eta; w_1, \{g_i\}_{i \le t-1})

so w_t is a function of \eta, and we can learn \eta (with gradient descent) by

\eta^* = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn \eta on the fly [Masse and Ollivier, 2015]
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce labour on tuning hyper-parameters.
References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Exact solution of gradient descent ODE [animation]
Continuous Gradient Descent

We'll start with some simplifying assumptions:
1. C has a unique minimiser x^*
2. \forall \varepsilon > 0, \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0
(The second condition is weaker than convexity.)

Proposition: If s : [0, \infty) \to X satisfies the differential equation

\frac{ds(t)}{dt} = -\nabla C(s(t))

then s(t) \to x^* as t \to \infty.

[Figure: example of a non-convex function satisfying these conditions]
Continuous Gradient Descent

1. Define Lyapunov function h(t) = \|s(t) - x^*\|_2^2
2. h(t) is a decreasing function
3. h(t) converges to a limit, and h'(t) converges to 0
4. The limit of h(t) must be 0
5. s(t) \to x^*
h(t) is a decreasing function:

\frac{d}{dt} h(t) = \frac{d}{dt} \|s(t) - x^*\|_2^2
                 = \frac{d}{dt} \langle s(t) - x^*, s(t) - x^* \rangle
                 = 2 \left\langle \frac{ds(t)}{dt}, s(t) - x^* \right\rangle
                 = -2 \langle s(t) - x^*, \nabla C(s(t)) \rangle \le 0
The limit of h(t) must be 0. If not, there is some K > 0 such that h(t) > K for all t. By assumption,

\inf_{x : \|x - x^*\|_2^2 > K} \langle x - x^*, \nabla C(x) \rangle > 0

which means that \sup_t h'(t) < 0, contradicting the convergence of h'(t) to 0.
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible. To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!
Discrete Gradient Descent

An intuitive, straightforward discretisation of

\frac{d}{dt} s(t) = -\nabla C(s(t))

is

\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n)    (Forward Euler)

Although it is easy and quick to implement, it has theoretical (A-)instability issues. An alternative method, which is (A-)stable, is the implicit Euler method:

\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})

So why settle for explicit discretisation? Implicit requires the solution of a non-linear system at each step, and practically, explicit discretisation works!
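To contrast the two schemes concretely, here is a sketch on the toy objective C(x) = x^2/2 (so \nabla C(x) = x), where the implicit step happens to have a closed form; for a general C it needs a nonlinear solve, mimicked here by fixed-point iteration. This is my own illustration, not code from the talk.

```python
def grad_C(x):
    return x  # C(x) = x^2 / 2

eps = 0.1
x_fwd = x_imp = 5.0
for _ in range(100):
    # forward (explicit) Euler: s_{n+1} = s_n - eps * grad C(s_n)
    x_fwd = x_fwd - eps * grad_C(x_fwd)
    # implicit Euler: solve s_{n+1} = s_n - eps * grad C(s_{n+1});
    # here the solution is s_n / (1 + eps), but we mimic the general case
    # with a fixed-point iteration (for large eps one would need Newton)
    y = x_imp
    for _ in range(20):
        y = x_imp - eps * grad_C(y)
    x_imp = y

print(x_fwd, x_imp)  # both tend to the minimiser 0 for this small eps
```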
Discrete Gradient Descent

We also have multistep methods, e.g.

s_{n+2} = s_{n+1} - \varepsilon_{n+2} \left( \frac{3}{2} \nabla C(s_{n+1}) - \frac{1}{2} \nabla C(s_n) \right)

This is the 2nd-order Adams-Bashforth method. It has similar (in)stability properties to forward Euler, but a discretisation error of a smaller order of magnitude at each step (O(\varepsilon^3) compared with O(\varepsilon^2)).
Discrete Gradient Descent
Examples of discretisation with \varepsilon = 0.1, \varepsilon = 0.07, and \varepsilon = 0.01 [animations]
Discrete Gradient Descent

Proposition: If s_{n+1} = s_n - \varepsilon_n \nabla C(s_n) and C satisfies
1. C has a unique minimiser x^*
2. \forall \varepsilon > 0, \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0
3. \|\nabla C(x)\|_2^2 \le A + B \|x - x^*\|_2^2 for some A, B \ge 0
then subject to \sum_n \varepsilon_n = \infty and \sum_n \varepsilon_n^2 < \infty, we have s_n \to x^*.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
Continuous Gradient Descent

We'll start with some simplifying assumptions:

1. C has a unique minimiser x^*.
2. \forall \varepsilon > 0, \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0.

(The second condition is weaker than convexity.)

Proposition: If s : [0, \infty) \to X satisfies the differential equation

\frac{ds(t)}{dt} = -\nabla C(s(t)),

then s(t) \to x^* as t \to \infty.

[Figure: example of a non-convex function satisfying these conditions.]
Continuous Gradient Descent

Proof sketch:

1. Define the Lyapunov function h(t) = \|s(t) - x^*\|_2^2.
2. h(t) is a decreasing function.
3. h(t) converges to a limit, and h'(t) converges to 0.
4. The limit of h(t) must be 0.
5. s(t) \to x^*.

h(t) is a decreasing function:

\frac{d}{dt} h(t) = \frac{d}{dt} \|s(t) - x^*\|_2^2 = \frac{d}{dt} \langle s(t) - x^*, s(t) - x^* \rangle = 2 \left\langle \frac{ds(t)}{dt}, s(t) - x^* \right\rangle = -2 \langle s(t) - x^*, \nabla C(s(t)) \rangle \le 0.

The limit of h(t) must be 0: if not, there is some K > 0 such that h(t) > K for all t. By assumption,

\inf_{x : h(x) > K} \langle x - x^*, \nabla C(x) \rangle > 0,

which means that \sup_t h'(t) < 0, contradicting the convergence of h'(t) to 0.
Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the proposition.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!
Discrete Gradient Descent

An intuitive, straightforward discretisation of

\frac{d}{dt} s(t) = -\nabla C(s(t))

is

\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n). \quad \text{(Forward Euler)}

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method which is (A-)stable is the implicit Euler method:

\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1}).

So why settle for explicit discretisation? Implicit discretisation requires the solution of a non-linear system at each step, and, practically, explicit discretisation works!
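To make the contrast concrete, here is a minimal sketch (not from the slides) comparing the two discretisations on the toy objective C(x) = x^2/2, so that \nabla C(x) = x; the implicit step is solved by fixed-point iteration, which converges here because the step size times the gradient's Lipschitz constant is below 1.

```python
def grad_C(x):
    # Toy objective C(x) = x^2 / 2, so grad C(x) = x (an illustrative assumption).
    return x

def explicit_euler_step(s, eps):
    # Forward Euler: s_{n+1} = s_n - eps * grad_C(s_n)
    return s - eps * grad_C(s)

def implicit_euler_step(s, eps, iters=50):
    # Backward Euler: solve s_next = s - eps * grad_C(s_next) by fixed-point iteration.
    s_next = s
    for _ in range(iters):
        s_next = s - eps * grad_C(s_next)
    return s_next

s_exp = s_imp = 1.0
for _ in range(20):
    s_exp = explicit_euler_step(s_exp, eps=0.1)
    s_imp = implicit_euler_step(s_imp, eps=0.1)
print(s_exp, s_imp)  # both approach the minimiser x* = 0
```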
Discrete Gradient Descent

We also have multistep methods, e.g.

s_{n+2} = s_{n+1} - \varepsilon_{n+2} \left( \frac{3}{2} \nabla C(s_{n+1}) - \frac{1}{2} \nabla C(s_n) \right).

This is the 2nd-order Adams-Bashforth method.

It has similar (in)stability properties to forward Euler, but a discretisation error of a smaller order of magnitude at each step (O(\varepsilon^3) compared with O(\varepsilon^2)).
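A minimal sketch of the two-step scheme, again on the toy gradient from above; since AB2 needs two previous iterates, the first step is bootstrapped with one forward Euler step.

```python
def ab2_descent(grad, s0, eps, steps):
    # 2nd-order Adams-Bashforth applied to ds/dt = -grad(s).
    s_prev, s = s0, s0 - eps * grad(s0)  # bootstrap the first step with forward Euler
    for _ in range(steps - 1):
        s_prev, s = s, s - eps * (1.5 * grad(s) - 0.5 * grad(s_prev))
    return s

print(ab2_descent(lambda x: x, s0=1.0, eps=0.05, steps=100))  # approaches x* = 0
```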
Discrete Gradient Descent

[Animated examples of discretisation with \varepsilon = 0.1, \varepsilon = 0.07 and \varepsilon = 0.01 omitted.]
Discrete Gradient Descent

Proposition: If s_{n+1} = s_n - \varepsilon_n \nabla C(s_n) and C satisfies

1. C has a unique minimiser x^*;
2. \forall \varepsilon > 0, \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0;
3. \|\nabla C(x)\|_2^2 \le A + B \|x - x^*\|_2^2 for some A, B \ge 0;

then subject to \sum_n \varepsilon_n = \infty and \sum_n \varepsilon_n^2 < \infty, we have s_n \to x^*.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
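A minimal sketch of the iteration with the step sizes \varepsilon_n = c/n, which satisfy both conditions (\sum_n c/n = \infty, \sum_n c^2/n^2 < \infty); the toy gradient is an illustrative assumption.

```python
def gd_decreasing_steps(grad, s0, c=0.5, steps=1000):
    # s_{n+1} = s_n - eps_n * grad(s_n), with Robbins-Monro steps eps_n = c / n:
    # the steps sum to infinity while their squares are summable.
    s = s0
    for n in range(1, steps + 1):
        s = s - (c / n) * grad(s)
    return s

print(gd_decreasing_steps(lambda x: x, s0=5.0))  # grad of C(x)=x^2/2; approaches 0
```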
Discrete Gradient Descent

Proof sketch:

1. Define the Lyapunov sequence h_n = \|s_n - x^*\|_2^2.
2. Consider the positive variations h_n^+ = \max(0, h_{n+1} - h_n).
3. Show that \sum_{n=1}^\infty h_n^+ < \infty.
4. Show that this implies that h_n converges.
5. Show that h_n must converge to 0.
6. s_n \to x^*.

Define the Lyapunov sequence h_n = \|s_n - x^*\|_2^2. Because of the error that the discretisation introduced, h_n is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: h_n may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of h_n. (Illustration credit: Online Learning and Stochastic Approximation, Leon Bottou (1998).)

Consider the positive variations h_n^+ = \max(0, h_{n+1} - h_n). With some algebra, note that

h_{n+1} - h_n = \langle s_{n+1} - x^*, s_{n+1} - x^* \rangle - \langle s_n - x^*, s_n - x^* \rangle
= \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2 \langle s_{n+1} - s_n, x^* \rangle
= \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2 \varepsilon_n \langle \nabla C(s_n), x^* \rangle
= -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2.

Show that \sum_{n=1}^\infty h_n^+ < \infty. Assuming \|\nabla C(x)\|_2^2 \le A + B \|x - x^*\|_2^2, we get

h_{n+1} - h_n \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + B h_n)
\implies h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A.

Writing \mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B} and h'_n = \mu_n h_n, we get

h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A.

Assuming \sum_{n=1}^\infty \varepsilon_n^2 < \infty, \mu_n converges away from 0, so the right-hand side is summable.

Show that this implies that h_n converges. If \sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty, then \sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty (otherwise h_n \to -\infty, which is impossible since h_n \ge 0). But

h_{n+1} = h_1 + \sum_{k=1}^n \left( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \right),

so (h_n)_{n=1}^\infty converges.

Show that h_n must converge to 0. Assume h_n converges to some positive number. Previously we had

h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2.

This is summable, and if we assume further that \sum_{n=1}^\infty \varepsilon_n = \infty, then we get

\langle s_n - x^*, \nabla C(s_n) \rangle \to 0,

contradicting h_n converging away from 0.
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If C is a cost function averaged across a data set, we often have a true gradient of the form

\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x),

and an approximation is formed by subsampling:

\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \quad I_K \sim \mathrm{Unif}(\text{subsets of size } K).

We'll treat the more general case where the gradient estimates \widehat{\nabla C}(x) are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
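A small sketch of the subsampled estimator (names and the toy per-datapoint gradients are illustrative assumptions); each index is equally likely to appear, so the minibatch average is an unbiased estimate of the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
targets = rng.normal(size=100)
fs = [lambda x, a=a: x - a for a in targets]  # grads of the summands (x - a)^2 / 2

def full_gradient(x):
    # True gradient: average of the N per-datapoint gradients f_n.
    return np.mean([f(x) for f in fs])

def minibatch_gradient(x, K):
    # Unbiased estimate: average over a uniform random index set of size K.
    idx = rng.choice(len(fs), size=K, replace=False)
    return np.mean([fs[i](x) for i in idx])

print(full_gradient(0.0), minibatch_gradient(0.0, K=10))
```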
Measure-theory-free martingale theory

Let (X_n)_{n \ge 0} be a stochastic process. \mathcal{F}_n denotes "the information describing the stochastic process up to time n", denoted mathematically by \mathcal{F}_n = \sigma(X_m \mid m \le n).

Definition: A stochastic process (X_n)_{n=0}^\infty is a martingale[3] if

- \mathbb{E}[|X_n|] < \infty for all n;
- \mathbb{E}[X_n \mid \mathcal{F}_m] = X_m for n \ge m.

Martingale convergence theorem (Doob, 1953): If (X_n)_{n=1}^\infty is a martingale and \sup_n \mathbb{E}[|X_n|] < \infty, then X_n \to X_\infty almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If (X_n)_{n=1}^\infty is a positive stochastic process and

\sum_{n=1}^\infty \mathbb{E}\left[ (\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n) \mathbb{1}_{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0} \right] < \infty,

then X_n \to X_\infty almost surely.

[3] On a filtered probability space (\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P}), with \mathcal{F}_n = \sigma(X_m \mid m \le n) for all n.
Stochastic Gradient Descent

Proposition: If s_{n+1} = s_n - \varepsilon_n H_n(s_n), with H_n(s_n) an unbiased estimator for \nabla C(s_n), and C satisfies

1. C has a unique minimiser x^*;
2. \forall \varepsilon > 0, \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0;
3. \mathbb{E}[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2 for some A, B \ge 0 independent of n;

then subject to \sum_n \varepsilon_n = \infty and \sum_n \varepsilon_n^2 < \infty, we have s_n \to x^*.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
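A minimal sketch of the stochastic iteration under the proposition's assumptions; the noisy gradient oracle below (toy cost C(x) = x^2/2 plus Gaussian noise) is an illustrative assumption, unbiased by construction.

```python
import numpy as np

def sgd(H, s0, c=0.5, steps=5000, seed=0):
    # s_{n+1} = s_n - eps_n * H_n(s_n), with Robbins-Monro steps eps_n = c / n.
    rng = np.random.default_rng(seed)
    s = s0
    for n in range(1, steps + 1):
        s = s - (c / n) * H(s, rng)
    return s

# Unbiased noisy gradient of C(x) = x^2 / 2: E[H(x)] = x = grad C(x).
print(sgd(lambda x, rng: x + rng.normal(), s0=5.0))  # drifts towards x* = 0
```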
Stochastic Gradient Descent

Proof sketch:

1. Define the Lyapunov process h_n = \|s_n - x^*\|_2^2.
2. Consider the variations h_{n+1} - h_n.
3. Show that h_n converges almost surely.
4. Show that h_n must converge to 0 almost surely.
5. s_n \to x^*.

Consider the positive variations h_n^+ = \max(0, h_{n+1} - h_n). By exactly the same calculation as in the deterministic discrete case,

h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2.

So

\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \, \mathbb{E}[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n].

Show that h_n converges almost surely. Exactly the same as for discrete gradient descent:

- Assume \mathbb{E}[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2.
- Introduce \mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B} and h'_n = \mu_n h_n.

We get

\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A
\implies \mathbb{E}\left[ (h'_{n+1} - h'_n) \mathbb{1}_{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0} \mid \mathcal{F}_n \right] \le \varepsilon_n^2 \mu_n A.

\sum_{n=1}^\infty \varepsilon_n^2 < \infty \implies quasimartingale convergence \implies (h'_n)_{n=1}^\infty converges a.s. \implies (h_n)_{n=1}^\infty converges a.s.

Show that h_n must converge to 0 almost surely. From previous calculations,

\mathbb{E}[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n] \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A.

(h_n)_{n=1}^\infty converges, so the left-hand side is summable a.s. We assume \sum_{n=1}^\infty \varepsilon_n^2 < \infty, so the right term is summable a.s., and hence the remaining term is also summable a.s.:

\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty \quad \text{almost surely.}

If we assume in addition that \sum_{n=1}^\infty \varepsilon_n = \infty, this forces

\langle s_n - x^*, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely,}

and by our initial assumptions about C, (h_n)_{n=1}^\infty must converge to 0 almost surely.
Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- C is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: \sum_{n=1}^\infty \varepsilon_n = \infty, \sum_{n=1}^\infty \varepsilon_n^2 < \infty;
- the gradient estimate H satisfies \mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k for k = 2, 3, 4;
- there exists D > 0 such that \inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0.

The idea for the proof is then:

1. For a given start position, the sequence (s_n)_{n=1}^\infty is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that \nabla C(s_n) necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.
Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. How, then, to choose learning rates to achieve

- fast convergence;
- a better local optimum?
Motivation (cont.)

[Figure from [Bergstra and Bengio, 2012].]

Idea 1: search for the best learning rate schedule:

- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy, and I don't want to spend too much time!
Motivation (cont.)

Idea 2: let the learning rates adapt themselves:

- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and give some practical comparisons.
Online learning
From batch to online learning

Batch learning often assumes:

- we've got the whole dataset D;
- the cost function is C(w, D) = \mathbb{E}_{x \sim D}[c(w, x)].

SGD/mini-batch learning accelerates training in real time by processing one datapoint/a mini-batch per iteration, using gradients of the per-datapoint cost functions:

\nabla C(w, D) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m).
From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD): each iteration t we receive a loss f_t(w) and a (noisy) (sub-)gradient g_t \in \partial f_t(w); then we update the weights with w \leftarrow w - \eta g_t.

For SGD, the received gradient is defined by

g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m).
From batch to online learning (cont.)

Online learning assumes: each iteration t we receive a loss f_t(w) (in the batch learning context, f_t(w) = c(w, x_t)); then we learn w_{t+1} following some rules.

Performance is measured by regret:

R^*_T = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w). \quad (1)

General definition: R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)].

SGD \Leftrightarrow "follow the regularized leader" with L2 regularization.
From batch to online learning (cont.)

Follow the Leader (FTL):

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w). \quad (2)

Lemma (upper bound of regret): Let w_1, w_2, \ldots be the sequence produced by FTL; then \forall u \in S,

R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]. \quad (3)
From batch to online learning (cont.)

A game that fools FTL: let w \in S = [-1, 1], and let the loss function at time t be

f_t(w) = \begin{cases} -0.5 w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd.} \end{cases}

FTL is easily fooled!
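A quick sketch of this game (the simulation details are my own illustrative assumptions): each f_t is linear, so the FTL iterate over [-1, 1] is just a sign flip of the cumulative coefficient, and FTL's total loss grows linearly in T while the best fixed point w = 0 incurs zero loss.

```python
def coeff(t):
    # f_t(w) = a_t * w on S = [-1, 1]
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T = 20
cum = 0.0        # cumulative coefficient sum_{i <= t} a_i
ftl_loss = 0.0
w = 0.0          # arbitrary first play w_1
for t in range(1, T + 1):
    ftl_loss += coeff(t) * w
    cum += coeff(t)
    w = 1.0 if cum < 0 else -1.0   # argmin over [-1, 1] of (cum * w)
print(ftl_loss)  # roughly T - 1: about one unit of loss per round
```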
From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w). \quad (4)

Lemma (upper bound of regret): Let w_1, w_2, \ldots be the sequence produced by FTRL; then \forall u \in S,

\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})].
From batch to online learning (cont.)

Let's assume the losses f_t are convex functions:

\forall w, u \in S: \ f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle \quad \forall g(w) \in \partial f_t(w).

Use the linearisation \tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle as a surrogate:

\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t).
From batch to online learning (cont.)

Online Gradient Descent (OGD):

w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t). \quad (5)

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (it upper bounds the LHS);
- use the L2 regularizer \varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2;
- apply the FTRL regret bound to the surrogate loss:

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w).

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions f_1, f_2, \ldots, f_T with regularizer \varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2; then for all u \in S,

R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t).
From batch to online learning (cont.)

Keep in mind for the later discussion of adaGrad: if we want \varphi to be strongly convex w.r.t. a general (semi-)norm \|\cdot\|, then the L2 norm in the bound above is replaced by the dual norm \|\cdot\|_*, and Hölder's inequality is used in the proof.
Advanced adaptive learning rates
adaGrad [Duchi et al., 2010]

Use proximal terms \psi_t: \psi_t is a strongly-convex function, and

B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle.

Primal-dual sub-gradient update:

w_{t+1} = \arg\min_w \ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w). \quad (6)

Proximal gradient / composite mirror descent:

w_{t+1} = \arg\min_w \ \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t). \quad (7)
adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

G_t = \sum_{i=1}^t g_i g_i^T, \quad H_t = \delta I + \mathrm{diag}(G_t^{1/2}),

w_{t+1} = \arg\min_{w \in S} \ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w, \quad (8)

w_{t+1} = \arg\min_{w \in S} \ \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t). \quad (9)

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective;
- adapt the proximal term through time:

\psi_t(w) = \frac{1}{2} w^T H_t w.
adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on S = \mathbb{R}^D with \varphi(w) = 0:

w_{t+1} = \arg\min_w \ \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T [\delta I + \mathrm{diag}(G_t^{1/2})] (w - w_t)

\implies w_{t+1} = w_t - \frac{\eta g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}}, \quad (10)

with the division and square root taken elementwise.
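A minimal NumPy sketch of update (10); the function names, step size and toy gradient are illustrative assumptions.

```python
import numpy as np

def adagrad_update(w, g, G, eta=0.1, delta=1e-8):
    # Accumulate squared gradients, then scale each coordinate's step (eq. (10)).
    G = G + g**2                        # running sum of g_i^2, elementwise
    w = w - eta * g / (delta + np.sqrt(G))
    return w, G

w, G = np.zeros(2), np.zeros(2)
for _ in range(200):
    g = w - np.array([1.0, 3.0])        # grad of ||w - (1, 3)||^2 / 2 (toy assumption)
    w, G = adagrad_update(w, g, G)
print(w)  # approaches (1, 3)
```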
adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs time-varying proximal term: we want the contribution of \|g_t\|_*^2 induced by \psi_t to be upper bounded in terms of \|g_t\|_2.

Define \|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \cdot \rangle}; then \psi_t(w) = \frac{1}{2} w^T H_t w is strongly convex w.r.t. \|\cdot\|_{\psi_t}, and a little algebra shows that

\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2.
adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence w_t is generated by adaGrad using the primal-dual update (8) with \delta \ge \max_t \|g_t\|_\infty; then for any u \in S,

R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.

For w_t generated using the composite mirror descent update (9),

R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.
Improving adaGrad

Possible problems of adaGrad:

- the global learning rate \eta still needs hand tuning;
- it is sensitive to initial conditions.

Possible improving directions:

- use truncated sum / running average;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]
Improving adaGrad (cont.)

RMSprop (direction: replace the truncated sum with a running average):

G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2},

w_{t+1} = \arg\min_w \ \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

\implies w_{t+1} = w_t - \frac{\eta g_t}{\mathrm{RMS}[g]_t}, \quad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t). \quad (11)
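A one-step sketch of (11), elementwise; \rho and \delta values are illustrative defaults.

```python
import numpy as np

def rmsprop_update(w, g, G, eta=0.01, rho=0.9, delta=1e-8):
    # Exponential running average of squared gradients, then RMS-scaled step (eq. (11)).
    G = rho * G + (1 - rho) * g**2
    w = w - eta * g / np.sqrt(delta + G)
    return w, G
```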
Improving adaGrad (cont.)

A second-order heuristic (direction: approximate second-order information): a Newton step satisfies

\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}
\implies \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}
\implies \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}.
Improving adaGrad (cont.)

adaDelta:

\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}},

w_{t+1} = \arg\min_w \ \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

\implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t. \quad (12)
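A minimal sketch of (12), maintaining running averages of both the squared gradients and the squared updates; variable names and the \rho, \delta values are illustrative assumptions.

```python
import numpy as np

def adadelta_update(w, g, Eg2, Edw2, rho=0.95, delta=1e-6):
    # Running averages of squared gradients (Eg2) and squared updates (Edw2), eq. (12).
    Eg2 = rho * Eg2 + (1 - rho) * g**2
    dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g  # RMS[dw]_{t-1} / RMS[g]_t
    Edw2 = rho * Edw2 + (1 - rho) * dw**2
    return w + dw, Eg2, Edw2
```

Note there is no global learning rate \eta here: the RMS[\Delta w] numerator supplies the step scale, which is the point of the method.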
Improving adaGrad (cont.)

ADAM (directions: running average, momentum, and bias correction of the running average):

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2,

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t},

w_{t+1} = w_t - \frac{\eta \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}. \quad (13)
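A one-step sketch of (13) with the default constants from [Kingma and Ba, 2014]; t is the 1-based iteration counter so the bias correction is well defined.

```python
import numpy as np

def adam_update(w, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, delta=1e-8):
    # Biased first/second moment estimates, then bias correction (eq. (13)).
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)   # t >= 1
    v_hat = v / (1 - b2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + delta)
    return w, m, v
```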
Demo time!

See the Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

Idea 3: learn the (global) learning rates as well.

w_{t+1} = w_t - \eta g_t \implies w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1}),

i.e. w_t is a function of \eta. Learn \eta (with gradient descent) by

\eta^* = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta)).

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015].
- Stochastic gradient descent to learn \eta on the fly [Masse and Ollivier, 2015].
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.
References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
d
dth(t) =
d
dts(t)minus x2
2
=d
dt〈s(t)minus x s(t)minus x〉
= 2
langds(t)
dt s(t)minus x
rang= minus〈s(t)minus xnablaC (s(t))〉 le 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption
infx h(x)gtK
〈x minus xnablaC (x)〉 gt 0
which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)
to 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)

[Figure from Bergstra and Bengio 2012]

idea 1: search for the best learning rate schedule
(grid search, random search, cross validation, Bayesian optimization)

But I'm lazy and I don't want to spend too much time!

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)

idea 2: let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]
adaGrad [Duchi et al 2010]
adaDelta [Zeiler 2012], RMSprop [Tieleman and Hinton 2012]
ADAM [Kingma and Ba 2014]
Equilibrated SGD [Dauphin et al 2015] (this year's NIPS)

now we'll talk a bit about online learning

a good tutorial: [Shalev-Shwartz 2012]; milestone paper: [Zinkevich 2003]

and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning

Batch learning often assumes:

we've got the whole dataset $\mathcal{D}$
the cost function is $C(w; \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by

processing one / a mini-batch of datapoints each iteration
considering gradients of the per-datapoint cost functions:

$\nabla C(w; \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)

Let's forget the cost function on the batch for a moment:

online gradient descent (OGD)

at each iteration $t$ we receive a loss $f_t(w)$
and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
then we update the weights with $w \leftarrow w - \eta g_t$

for SGD the received gradient is defined by

$g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
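As a concrete sketch (an assumed streaming least-squares setup, not from the slides), one OGD pass where each $g_t$ is the gradient of the current example's squared loss:

import numpy as np

# Minimal OGD sketch: streaming squared losses f_t(w) = 0.5 * (<w, x_t> - y_t)^2,
# with (sub-)gradient g_t = (<w, x_t> - y_t) * x_t and a fixed rate eta.
rng = np.random.default_rng(1)
w_true = np.array([0.5, -1.0])
w = np.zeros(2)
eta = 0.05

for t in range(2000):
    x_t = rng.normal(size=2)
    y_t = x_t @ w_true + 0.1 * rng.normal()
    g_t = (w @ x_t - y_t) * x_t      # gradient of f_t at the current w
    w = w - eta * g_t                # OGD update: w <- w - eta * g_t

print(w)                             # close to w_true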
Online learning
From batch to online learning (cont) [figure-only slide]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)

Online learning assumes:

at each iteration $t$ we receive a loss $f_t(w)$
(in the batch learning context, $f_t(w) = c(w, x_t)$)
then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$R_T^* = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w)$  (1)

General definition: $R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)

Follow the Leader (FTL):

$w_{t+1} = \operatorname{argmin}_{w \in S} \sum_{i=1}^{t} f_i(w)$  (2)

Lemma (upper bound of regret)

Let $w_1, w_2, \dots$ be the sequence produced by FTL; then $\forall u \in S$:

$R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})]$  (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)

A game that fools FTL:

Let $w \in S = [-1, 1]$ and define the loss function at time $t$ as

$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$

FTL is easily fooled!
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
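A quick simulation (our own sketch, directly encoding the losses above) confirms that FTL's regret grows linearly in $T$ on this game:

import numpy as np

# FTL on S = [-1, 1] with linear losses f_t(w) = a_t * w:
# a_1 = -0.5, a_t = +1 for even t, a_t = -1 for odd t > 1.
T = 100
a = np.array([-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, T + 1)])

w = 0.0            # w_1 (any starting point works)
ftl_loss = 0.0
cum_a = 0.0        # coefficient of the cumulative loss sum_{i<=t} a_i * w
for t in range(T):
    ftl_loss += a[t] * w
    cum_a += a[t]
    w = -1.0 if cum_a > 0 else 1.0   # FTL: argmin over [-1, 1] of cum_a * w

# Losses are linear, so the best fixed point in hindsight is at an endpoint (or 0).
best_fixed = min(np.sum(a) * u for u in (-1.0, 0.0, 1.0))
print(ftl_loss - best_fixed)         # about T - 1: linear regret, FTL is fooled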
Online learning
From batch to online learning (cont)

Follow the Regularized Leader (FTRL):

$w_{t+1} = \operatorname{argmin}_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w)$  (4)

Lemma (upper bound of regret)

Let $w_1, w_2, \dots$ be the sequence produced by FTRL; then $\forall u \in S$:

$\sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})]$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)

Let's assume the losses $f_t$ are convex functions:

$\forall w, u \in S: \; f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle \quad \forall g(w) \in \partial f_t(w)$

use the linearisation $f_t(w) \approx f_t(w_t) + \langle w, g(w_t)\rangle$ as a surrogate:

$\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t\rangle - \langle u, g_t\rangle \quad \forall g_t \in \partial f_t(w_t)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)

Online Gradient Descent (OGD):

$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t)$  (5)

From FTRL to online gradient descent:

use the linearisation as surrogate loss (it upper-bounds the LHS)
use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
apply the FTRL regret bound to the surrogate loss

$w_{t+1} = \operatorname{argmin}_{w \in S} \sum_{i=1}^{t} \langle w, g_i\rangle + \varphi(w)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
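To check that this argmin really reproduces update (5), assume $S = \mathbb{R}^D$ so the minimisation is unconstrained, and set the gradient of the objective to zero (a step the slides leave implicit):

$\frac{\partial}{\partial w}\Big[\sum_{i=1}^{t} \langle w, g_i\rangle + \frac{1}{2\eta}\|w - w_1\|_2^2\Big] = \sum_{i=1}^{t} g_i + \frac{1}{\eta}(w_{t+1} - w_1) = 0$

$\Rightarrow \quad w_{t+1} = w_1 - \eta \sum_{i=1}^{t} g_i = w_t - \eta g_t$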
Online learning
From batch to online learning (cont)

Theorem (regret bound for online gradient descent)

Assume we run online gradient descent on convex loss functions $f_1, f_2, \dots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$:

$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)

Keep in mind for the later discussion of adaGrad:

I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$;
then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_*$,
and we use Hölder's inequality in the proof.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]

Use proximal terms $\psi_t$:

$\psi_t$ is a strongly-convex function, and

$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t),\, w - w_t\rangle$

Primal-dual sub-gradient update:

$w_{t+1} = \operatorname{argmin}_w \; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w)$  (6)

Proximal gradient / composite mirror descent:

$w_{t+1} = \operatorname{argmin}_w \; \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t)$  (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices:

$G_t = \sum_{i=1}^{t} g_i g_i^T, \quad H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$

$w_{t+1} = \operatorname{argmin}_{w \in S} \; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t} w^T H_t w$  (8)

$w_{t+1} = \operatorname{argmin}_{w \in S} \; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$  (9)

To get adaGrad from online gradient descent:
add a proximal term or a Bregman divergence to the objective,
and adapt the proximal term through time:

$\psi_t(w) = \frac{1}{2} w^T H_t w$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$w_{t+1} = \operatorname{argmin}_w \; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T \big[\delta I + \mathrm{diag}(G_t)^{1/2}\big](w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}}$  (10)

(the division and square root are taken per coordinate)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
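A minimal sketch of update (10); the toy quadratic, $\eta$, and $\delta$ below are illustrative choices of ours:

import numpy as np

# Diagonal adaGrad, update (10): per-coordinate steps shrink like 1 / sqrt(sum g_i^2).
def adagrad_step(w, g, sum_sq, eta=0.5, delta=1e-8):
    sum_sq += g ** 2                               # accumulate squared gradients
    w -= eta * g / (delta + np.sqrt(sum_sq))       # per-coordinate scaled step
    return w, sum_sq

rng = np.random.default_rng(2)
w_star = np.array([3.0, -1.0])
w, sum_sq = np.zeros(2), np.zeros(2)
for t in range(500):
    g = (w - w_star) + 0.1 * rng.normal(size=2)    # noisy gradient of 0.5 * ||w - w_star||^2
    w, sum_sq = adagrad_step(w, g, sum_sq)
print(w)                                           # approaches w_star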
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)

WHY? Time-varying learning rate vs time-varying proximal term:

I want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper-bounded by $\|g_t\|^2$

define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot,\, H_t\, \cdot\rangle}$;
$\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$

a little math can show that

$\sum_{t=1}^{T} \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)

Theorem (Thm 5 in the paper)

Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$:

$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$

For $w_t$ generated using the composite mirror descent update (9):

$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad

Possible problems of adaGrad:
global learning rate $\eta$ still needs hand tuning
sensitive to initial conditions

Possible improving directions:
use a truncated sum / running average
use (approximate) second-order information
add momentum
correct the bias of the running average

Existing methods:
RMSprop [Tieleman and Hinton 2012]
adaDelta [Zeiler 2012]
ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)

RMSprop:

$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$

$w_{t+1} = \operatorname{argmin}_w \; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \quad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t)$  (11)

(direction used here: replace adaGrad's growing sum with a running average)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
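A corresponding sketch of update (11), a drop-in replacement for the adaGrad step above ($\rho$ and $\delta$ are assumed defaults):

import numpy as np

# RMSprop, update (11): a running average of g^2 replaces adaGrad's growing sum.
def rmsprop_step(w, g, avg_sq, eta=0.01, rho=0.9, delta=1e-8):
    avg_sq = rho * avg_sq + (1.0 - rho) * g ** 2   # running average of squared gradients
    w -= eta * g / np.sqrt(delta + avg_sq)         # RMS[g]_t = sqrt(delta + avg_sq)
    return w, avg_sq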
Advanced adaptive learning rates
Improving adaGrad (cont)

Matching a diagonal Newton step ($\Delta w \propto H^{-1} g$, with $H$ the Hessian):

$\Delta w \propto H^{-1} g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2} \;\Rightarrow\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w} \;\Rightarrow\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

(direction used here: approximate second-order information)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)

adaDelta:

$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

$w_{t+1} = \operatorname{argmin}_w \; \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t$  (12)

(no global learning rate $\eta$: the running average of past updates takes its place)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
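A sketch of update (12) in the same style ($\rho$ and $\delta$ follow commonly used adaDelta defaults):

import numpy as np

# adaDelta, update (12): the RMS of past updates replaces the global rate eta.
def adadelta_step(w, g, avg_g2, avg_dw2, rho=0.95, delta=1e-6):
    avg_g2 = rho * avg_g2 + (1.0 - rho) * g ** 2             # running average of g^2
    dw = -np.sqrt(avg_dw2 + delta) / np.sqrt(avg_g2 + delta) * g
    avg_dw2 = rho * avg_dw2 + (1.0 - rho) * dw ** 2          # running average of dw^2
    return w + dw, avg_g2, avg_dw2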
Advanced adaptive learning rates
Improving adaGrad (cont)

ADAM:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}$  (13)

(directions used here: momentum plus bias correction of both running averages)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
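A sketch of update (13); the constants are the defaults reported by Kingma and Ba, and the usage mirrors the toy loops above:

import numpy as np

# ADAM, update (13): momentum (m) plus bias-corrected running averages.
def adam_step(w, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    m = beta1 * m + (1.0 - beta1) * g              # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * g ** 2         # second moment
    m_hat = m / (1.0 - beta1 ** t)                 # bias corrections; t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w, m, v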
Advanced adaptive learning rates
Demo time

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates

idea 3: learn the (global) learning rate as well

$w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1})$

$w_t$ is a function of $\eta$

learn $\eta$ (with gradient descent) by

$\eta = \operatorname{argmin}_\eta \sum_{s \le t} f_s(w(s, \eta))$

I Reverse-mode differentiation to back-track the learning process [Maclaurin et al 2015]
I Stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier 2015]

(a minimal sketch of this idea follows)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
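A minimal illustration of idea 3 under stated assumptions: we unroll SGD on a toy quadratic so the accumulated loss is an explicit function of $\eta$, then descend on $\eta$ itself with a finite-difference derivative. The unrolling length, problem, and rates are hypothetical choices of ours, not the method of either cited paper:

import numpy as np

# Treat the unrolled training loss as a function of the global rate eta.
def unrolled_loss(eta, steps=20):
    A = np.diag([10.0, 1.0])                 # toy quadratic C(w) = 0.5 * w^T A w
    w, total = np.array([1.0, 1.0]), 0.0
    for _ in range(steps):
        total += 0.5 * w @ A @ w
        w = w - eta * (A @ w)                # w(s, eta): SGD unrolled in eta
    return total

eta, lr_eta, d = 0.01, 1e-6, 1e-4
for _ in range(300):
    # finite-difference stand-in for the reverse-mode derivative d(loss)/d(eta)
    grad_eta = (unrolled_loss(eta + d) - unrolled_loss(eta - d)) / (2 * d)
    eta -= lr_eta * grad_eta                 # gradient descent on eta itself
print(eta)                                   # eta climbs from 0.01 toward a faster stable rate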
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochastic approximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provide good performance in practice

Future work will reduce the labour of tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
References

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based Hyperparameter Optimization through Reversible Learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Continuous Gradient Descent
Wersquoll start with some simplifying assumptions
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
(The second condition is weaker than convexity)
Proposition If s [0infin)rarr X satisfies thedifferential equation
ds(t)
dt= minusnablaC (s(t))
then s(t)rarr x as t rarrinfin
Example ofnon-convex functionsatisfying theseconditions
13 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
d
dth(t) =
d
dts(t)minus x2
2
=d
dt〈s(t)minus x s(t)minus x〉
= 2
langds(t)
dt s(t)minus x
rang= minus〈s(t)minus xnablaC (s(t))〉 le 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption
infx h(x)gtK
〈x minus xnablaC (x)〉 gt 0
which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)
to 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Step 4: Show that this implies that $h_n$ converges.

If $\sum_{n=1}^{\infty} \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^{\infty} \min(0, h_{n+1} - h_n) > -\infty$. (Why? $h_n \ge 0$, so if the downward movements had infinite total magnitude while the upward ones were summable, $h_n$ would eventually become negative.)

But $h_{n+1} = h_1 + \sum_{k=1}^{n} \big( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \big)$.

So $(h_n)_{n=1}^{\infty}$ converges.
Step 5: Show that $h_n$ must converge to 0.

Assume $h_n$ converges to some positive number. Previously we had

$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$

This is summable, and if we assume further that $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, then we get

$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0$,

contradicting $h_n$ converging away from 0.
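To make the recursion concrete, here is a minimal numerical sketch of the proposition (an illustrative toy, not from the slides: we pick a quadratic cost with a known minimiser x_star, and the classic Robbins-Monro step size eps_n = 1/n):

    import numpy as np

    def discrete_gradient_descent(grad_C, s0, n_steps=10000):
        """Iterate s_{n+1} = s_n - eps_n * grad_C(s_n) with eps_n = 1/n."""
        s = np.asarray(s0, dtype=float)
        for n in range(1, n_steps + 1):
            eps_n = 1.0 / n  # sum eps_n = infinity, sum eps_n^2 < infinity
            s = s - eps_n * grad_C(s)
        return s

    # Toy cost C(x) = ||x - x_star||^2 / 2, so grad C(x) = x - x_star.
    x_star = np.array([1.0, -2.0])
    print(discrete_gradient_descent(lambda x: x - x_star, s0=np.zeros(2)))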
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$\nabla C(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x)$

and an approximation formed by subsampling:

$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
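As a concrete sketch of the subsampled estimator (an illustration with assumed names: grad_fns stands in for the per-datum gradient functions $f_k$):

    import numpy as np

    def minibatch_gradient(grad_fns, x, K, rng):
        """Unbiased estimate of (1/N) sum_n f_n(x): average K uniformly sampled terms."""
        idx = rng.choice(len(grad_fns), size=K, replace=False)  # I_K ~ Unif(subsets of size K)
        return sum(grad_fns[k](x) for k in idx) / K

    # Example: least-squares terms with gradients f_n(x) = (a_n . x - b_n) a_n.
    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(100, 3)), rng.normal(size=100)
    grad_fns = [lambda x, a=a_n, y=b_n: (a @ x - y) * a for a_n, b_n in zip(A, b)]
    print(minibatch_gradient(grad_fns, x=np.zeros(3), K=10, rng=rng))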
Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^{\infty}$ is a martingale³ if
- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^{\infty}$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^{\infty}$ is a positive stochastic process and

$\sum_{n=1}^{\infty} \mathbb{E}\big[ (\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n) \, \mathbf{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}} \big] < \infty$

then $X_n \to X_\infty$ almost surely.

³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^{\infty}, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.
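As a quick numerical illustration of Doob's theorem (a sketch assuming numpy; the product process below is a positive martingale, and its paths do converge almost surely, here to 0):

    import numpy as np

    # X_n = Z_1 * ... * Z_n with E[Z_i] = 1, so E[X_n | F_m] = X_m: a positive
    # martingale with sup_n E[|X_n|] = 1. By Doob, X_n converges almost surely.
    rng = np.random.default_rng(0)
    Z = rng.choice([0.5, 1.5], size=5000)  # i.i.d., mean 1
    X = np.cumprod(Z)
    print(X[-1])  # typically vanishingly small: the a.s. limit here is 0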
Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^\star$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$

then, subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
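The same toy example as before gives a minimal sketch of this proposition (illustrative assumptions: Gaussian noise added to each gradient evaluation makes H_n unbiased, and eps_n = 1/n satisfies both step-size conditions):

    import numpy as np

    def sgd(grad_C, s0, n_steps=50000, noise=1.0, seed=0):
        """Iterate s_{n+1} = s_n - eps_n * H_n(s_n) with unbiased H_n, eps_n = 1/n."""
        rng = np.random.default_rng(seed)
        s = np.asarray(s0, dtype=float)
        for n in range(1, n_steps + 1):
            H_n = grad_C(s) + noise * rng.normal(size=s.shape)  # unbiased estimate
            s = s - (1.0 / n) * H_n
        return s

    x_star = np.array([1.0, -2.0])
    print(sgd(lambda x: x - x_star, s0=np.zeros(2)))  # close to [1, -2]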
Stochastic Gradient Descent

Proof steps:
1. Define the Lyapunov process $h_n = \|s_n - x^\star\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^\star$
Step 2: Consider the variations $h_{n+1} - h_n$ and their positive parts $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$

So

$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \, \mathbb{E}[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n]$
Step 3: Show that $h_n$ converges almost surely.

Exactly the same as for discrete gradient descent:
- Assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^\star\|_2^2$
- Introduce $\mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

We get

$\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A$
$\implies \mathbb{E}\big[ (h'_{n+1} - h'_n) \, \mathbf{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \,\big|\, \mathcal{F}_n \big] \le \varepsilon_n^2 \mu_n A$

$\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h'_n)_{n=1}^{\infty}$ converges a.s. $\implies$ $(h_n)_{n=1}^{\infty}$ converges a.s.
Step 4: Show that $h_n$ must converge to 0 almost surely.

From the previous calculations,

$\mathbb{E}[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n] \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A$

$(h_n)_{n=1}^{\infty}$ converges a.s., so the left-hand side is summable a.s. We assume $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, so the $\varepsilon_n^2 A$ term is summable a.s., and hence so is the inner-product term:

$\sum_{n=1}^{\infty} \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle < \infty$ almost surely.

If we assume in addition that $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, this forces

$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0$ almost surely,

and by our initial assumptions about $C$, $(h_n)_{n=1}^{\infty}$ must converge to 0 almost surely.
Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- $C$ is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$
- the gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^{\infty}$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.
Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition. How, then, should we choose learning rates to achieve fast convergence and a better local optimum?
Motivation (cont)

[Figure from Bergstra and Bengio, 2012]

Idea 1: search for the best learning rate schedule: grid search, random search, cross validation, Bayesian optimization.

But I'm lazy and I don't want to spend too much time!
Motivation (cont)

Idea 2: let the learning rates adapt themselves:
- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:
- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and some practical comparisons.
Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function $C(w; \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by:
- processing one datapoint or a mini-batch of datapoints each iteration
- considering gradients of the data-point cost functions:

$\nabla C(w; \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$
From batch to online learning (cont)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):
- each iteration $t$, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by

$g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$
From batch to online learning (cont)

Online learning assumes:
- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$R_T^\star = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w) \qquad (1)$

General definition: $R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.
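To make the protocol and the regret bookkeeping concrete, here is a minimal sketch (illustrative assumptions of ours: scalar linear losses $f_t(w) = a_t w$ on $S = [-1, 1]$, a horizon-tuned step size, and projection back onto $S$):

    import numpy as np

    def ogd_regret(coeffs):
        """OGD on linear losses f_t(w) = a_t * w over S = [-1, 1]; returns regret R_T."""
        T = len(coeffs)
        eta = 1.0 / np.sqrt(T)                    # step size tuned to the horizon
        w, loss = 0.0, 0.0
        for a in coeffs:
            loss += a * w                         # suffer f_t(w_t)
            w = min(1.0, max(-1.0, w - eta * a))  # gradient step, project onto S
        total = sum(coeffs)
        best_fixed = min(-total, total)           # best fixed u in S for linear losses
        return loss - best_fixed                  # grows like sqrt(T), i.e. sublinearly

    print(ogd_regret(list(np.random.default_rng(0).normal(size=10000))))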
From batch to online learning (cont)

Follow the Leader (FTL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) \qquad (2)$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL. Then $\forall u \in S$,

$R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})] \qquad (3)$
From batch to online learning (cont)

A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be

$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$

FTL is easily fooled!
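A quick simulation of this game (a sketch; we break argmin ties towards $-1$, and start from $w_1 = 0$), showing FTL's cumulative loss growing linearly while the fixed point $u = 0$ suffers zero loss:

    def coeff(t):  # f_t(w) = coeff(t) * w
        if t == 1:
            return -0.5
        return 1.0 if t % 2 == 0 else -1.0

    w, cum_loss, cum_coeff = 0.0, 0.0, 0.0
    for t in range(1, 101):
        a = coeff(t)
        cum_loss += a * w                       # loss suffered at round t
        cum_coeff += a
        # FTL: w_{t+1} = argmin over [-1, 1] of cum_coeff * w
        w = -1.0 if cum_coeff >= 0 else 1.0
    print(cum_loss)  # ~ T - 1: linear regret, since u = 0 gets total loss 0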
From batch to online learning (cont)

Follow the Regularized Leader (FTRL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w) \qquad (4)$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL. Then $\forall u \in S$,

$\sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})]$
From batch to online learning (cont)

Let's assume the losses $f_t$ are convex functions:

$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g_t \rangle$ as a surrogate:

$\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$
From batch to online learning (cont)

Online Gradient Descent (OGD):

$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (it upper bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} \langle w, g_i \rangle + \varphi(w)$
From batch to online learning (cont)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$. Then for all $u \in S$,

$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$
Keep in mind for the later discussion of adaGrad:
- I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$;
- then the L2 norm in the bound is changed to the dual norm $\|\cdot\|_\star$, and we use Hölder's inequality in the proof.
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$

Primal-dual sub-gradient update:

$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \qquad (6)$

Proximal gradient / composite mirror descent:

$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$
adaGrad [Duchi et al., 2010] (cont)

adaGrad with diagonal matrices:

$G_t = \sum_{i=1}^{t} g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$

$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w \qquad (8)$

$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \qquad (9)$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:

$\psi_t(w) = \frac{1}{2} w^T H_t w$
adaGrad [Duchi et al., 2010] (cont)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T [\delta I + \mathrm{diag}(G_t)^{1/2}] (w - w_t)$

$\implies w_{t+1} = w_t - \frac{\eta \, g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}} \qquad (10)$

(elementwise in each coordinate)
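A minimal sketch of update (10) per coordinate (assuming numpy; eta and delta mirror the slide's $\eta$ and $\delta$, and the squared-gradient accumulator is carried between calls):

    import numpy as np

    def adagrad_step(w, g, G_diag, eta=0.1, delta=1e-8):
        """One adaGrad update: w <- w - eta * g / (delta + sqrt(sum of squared grads))."""
        G_diag = G_diag + g * g  # accumulate squared gradients, per coordinate
        return w - eta * g / (delta + np.sqrt(G_diag)), G_diag

    w, G = np.zeros(3), np.zeros(3)
    for g in np.random.default_rng(0).normal(size=(100, 3)):  # stand-in gradients
        w, G = adagrad_step(w, g, G)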
adaGrad [Duchi et al., 2010] (cont)

WHY? Time-varying learning rate vs. time-varying proximal term:
- I want the contribution of $\|g_t\|_\star^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$ terms
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \cdot \rangle}$
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$\sum_{t=1}^{T} \|g_t\|_{\psi_t^\star}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2$
adaGrad [Duchi et al., 2010] (cont)

Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8), with $\delta \ge \max_t \|g_t\|_\infty$. Then for any $u \in S$,

$R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$

For $w_t$ generated using the composite mirror descent update (9),

$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$
Improving adaGrad

Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]
Improving adaGrad (cont)

RMSprop:

$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$

$w_{t+1} = \arg\min_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$

$\implies w_{t+1} = w_t - \frac{\eta \, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$

(Improving direction used here: a truncated sum / running average. A code sketch of this update, alongside adaDelta and ADAM, follows the ADAM slide below.)
Improving adaGrad (cont)

$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}$
$\implies \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}$
$\implies \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

(Improving direction used here: approximate second-order information.)
Improving adaGrad (cont)

adaDelta:

$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

$w_{t+1} = \arg\min_{w} \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$

$\implies w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t} \, g_t \qquad (12)$
Improving adaGrad (cont)

ADAM:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

$w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13)$

(Improving directions used here: momentum, plus bias correction of the running averages.)
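A side-by-side sketch of the three per-coordinate update rules (assuming numpy; rho, b1, b2, eta, delta mirror $\rho$, $\beta_1$, $\beta_2$, $\eta$, $\delta$ above, and the accumulators G, D, m, v are state carried between calls):

    import numpy as np

    def rmsprop_step(w, g, G, eta=1e-3, rho=0.9, delta=1e-8):
        G = rho * G + (1 - rho) * g**2                       # running average of g^2
        return w - eta * g / np.sqrt(delta + G), G

    def adadelta_step(w, g, G, D, rho=0.95, delta=1e-6):
        G = rho * G + (1 - rho) * g**2
        dw = -np.sqrt(D + delta) / np.sqrt(G + delta) * g    # RMS[dw]/RMS[g] * g
        D = rho * D + (1 - rho) * dw**2                      # running average of dw^2
        return w + dw, G, D

    def adam_step(w, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, delta=1e-8):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)      # bias correction
        return w - eta * m_hat / (np.sqrt(v_hat) + delta), m, v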
Demo time

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

Idea 3: learn the (global) learning rates as well.

$w_{t+1} = w_t - \eta g_t \implies w_t = w(t, \eta; w_1, \{g_i\}_{i \le t-1})$

so $w_t$ is a function of $\eta$; learn $\eta$ (with gradient descent) by

$\eta = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))$

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- Stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour spent tuning hyper-parameters.
References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Continuous Gradient Descent

1 Define Lyapunov function $h(t) = \|s(t) - x^*\|_2^2$
2 $h(t)$ is a decreasing function
3 $h(t)$ converges to a limit, and $h'(t)$ converges to 0
4 The limit of $h(t)$ must be 0
5 $s(t) \to x^*$

14 of 27
Continuous Gradient Descent

1 Define Lyapunov function $h(t) = \|s(t) - x^*\|_2^2$
2 $h(t)$ is a decreasing function
3 $h(t)$ converges to a limit, and $h'(t)$ converges to 0
4 The limit of $h(t)$ must be 0
5 $s(t) \to x^*$

$h(t)$ is a decreasing function:

$\frac{d}{dt}h(t) = \frac{d}{dt}\|s(t) - x^*\|_2^2 = \frac{d}{dt}\langle s(t) - x^*, s(t) - x^*\rangle = 2\left\langle \frac{ds(t)}{dt},\, s(t) - x^*\right\rangle = -2\langle s(t) - x^*, \nabla C(s(t))\rangle \le 0$

14 of 27
Continuous Gradient Descent

1 Define Lyapunov function $h(t) = \|s(t) - x^*\|_2^2$
2 $h(t)$ is a decreasing function
3 $h(t)$ converges to a limit, and $h'(t)$ converges to 0
4 The limit of $h(t)$ must be 0
5 $s(t) \to x^*$

The limit of $h(t)$ must be 0. If not, there is some $K > 0$ such that $h(t) > K$ for all $t$. By assumption,

$\inf_{x :\, \|x - x^*\|_2^2 > K} \langle x - x^*, \nabla C(x)\rangle > 0,$

which means that $h'(t)$ is bounded away from 0 from below, contradicting the convergence of $h'(t)$ to 0.

14 of 27
Continuous Gradient Descent

So gradient descent finds the global minimum for functions in the class stated in the theorem.

Solving general ODEs analytically is extremely difficult at best, and usually impossible.

To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.

There is no one best way to do this!

15 of 27
Discrete Gradient Descent

An intuitive, straightforward discretisation of

$\frac{d}{dt}s(t) = -\nabla C(s(t))$

is

$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n)$ (Forward Euler)

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method, which is (A-)stable, is the implicit Euler method:

$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})$

So why settle for explicit discretisation? Implicit requires the solution of a non-linear system at each step, and practically, explicit discretisation works!

16 of 27
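As a concrete illustration of the two schemes, here is a minimal sketch in Python. The quadratic cost $C(x) = \frac{1}{2}x^T A x$ and the particular matrix $A$ are assumptions made purely for illustration; for this linear gradient the implicit step reduces to a linear solve, whereas in general it would need a non-linear solver.

```python
import numpy as np

# Gradient flow ds/dt = -grad C(s), discretised with forward Euler:
# s_{n+1} = s_n - eps * grad C(s_n)  -- this is exactly gradient descent.
A = np.array([[3.0, 0.0], [0.0, 1.0]])

def grad_C(x):
    return A @ x  # gradient of 0.5 * x^T A x

def forward_euler(x0, eps=0.1, n_steps=100):
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eps * grad_C(x)   # explicit update
    return x

def implicit_euler(x0, eps=0.1, n_steps=100):
    # The implicit step x_{n+1} = x_n - eps * A x_{n+1} rearranges to
    # (I + eps * A) x_{n+1} = x_n, a linear solve per step.
    x = np.array(x0, dtype=float)
    M = np.eye(2) + eps * A
    for _ in range(n_steps):
        x = np.linalg.solve(M, x)
    return x

print(forward_euler([1.0, 1.0]))   # -> close to the minimiser at 0
print(implicit_euler([1.0, 1.0]))
```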
Discrete Gradient Descent

Also have multistep methods, e.g.

$s_{n+2} = s_{n+1} - \varepsilon_{n+2}\left(\frac{3}{2}\nabla C(s_{n+1}) - \frac{1}{2}\nabla C(s_n)\right)$

This is the 2nd-order Adams-Bashforth method.

Similar (in)stability properties to forward Euler, but discretisation error of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).

17 of 27
Discrete Gradient Descent

Examples of discretisation with $\varepsilon = 0.1$

18 of 27
Discrete Gradient Descent

Examples of discretisation with $\varepsilon = 0.07$

19 of 27
Discrete Gradient Descent

Examples of discretisation with $\varepsilon = 0.01$

20 of 27
Discrete Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies:

1 $C$ has a unique minimiser $x^*$
2 $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x)\rangle > 0$
3 $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.

21 of 27
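A minimal sketch of the proposition's setting: gradient descent with step sizes $\varepsilon_n = c/n$, which satisfy the two conditions on the learning rates. The cost $C(x) = \frac{1}{2}\|x\|^2$ (minimiser $x^* = 0$) and the constant $c$ are assumptions for illustration; this $C$ satisfies conditions 1-3 above.

```python
import numpy as np

def grad_C(x):
    return x  # gradient of 0.5 * ||x||^2, minimiser x* = 0

s = np.array([5.0, -3.0])
c = 0.5
for n in range(1, 10001):
    eps_n = c / n          # sum eps_n = infinity, sum eps_n^2 < infinity
    s = s - eps_n * grad_C(s)

print(s)  # -> approaches x* = 0 (slowly; the guarantee is asymptotic)
```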
Discrete Gradient Descent

1 Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2 Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3 Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4 Show that this implies that $h_n$ converges
5 Show that $h_n$ must converge to 0
6 $s_n \to x^*$

22 of 27
Discrete Gradient Descent

1 Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2 Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3 Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4 Show that this implies that $h_n$ converges
5 Show that $h_n$ must converge to 0
6 $s_n \to x^*$

Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.

Because of the error that the discretisation introduced, $h_n$ is not guaranteed to be decreasing - we have to adjust the proof.

22 of 27
Discrete Gradient Descent

1 Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2 Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3 Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4 Show that this implies that $h_n$ converges
5 Show that $h_n$ must converge to 0
6 $s_n \to x^*$

Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.

Because of the error that the discretisation introduced, $h_n$ is not guaranteed to be decreasing - we have to adjust the proof.

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$.

Illustration credit: Online Learning and Stochastic Approximation, Leon Bottou (1998)

22 of 27
Discrete Gradient Descent

1 Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2 Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3 Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4 Show that this implies that $h_n$ converges
5 Show that $h_n$ must converge to 0
6 $s_n \to x^*$

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

With some algebra, note that

$h_{n+1} - h_n = \langle s_{n+1} - x^*, s_{n+1} - x^*\rangle - \langle s_n - x^*, s_n - x^*\rangle$
$= \langle s_{n+1}, s_{n+1}\rangle - \langle s_n, s_n\rangle - 2\langle s_{n+1} - s_n, x^*\rangle$
$= \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n)\rangle - \langle s_n, s_n\rangle + 2\varepsilon_n \langle \nabla C(s_n), x^*\rangle$
$= -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$

22 of 27
Discrete Gradient Descent

1 Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2 Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3 Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4 Show that this implies that $h_n$ converges
5 Show that $h_n$ must converge to 0
6 $s_n \to x^*$

Show that $\sum_{n=1}^\infty h_n^+ < \infty$.

Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get

$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 (A + B h_n)$

$\Rightarrow h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$

22 of 27
Discrete Gradient Descent

1 Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2 Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3 Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4 Show that this implies that $h_n$ converges
5 Show that $h_n$ must converge to 0
6 $s_n \to x^*$

Show that $\sum_{n=1}^\infty h_n^+ < \infty$.

Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get

$h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le \varepsilon_n^2 A$

Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get

$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable.

22 of 27
Discrete Gradient Descent

1 Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2 Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3 Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4 Show that this implies that $h_n$ converges
5 Show that $h_n$ must converge to 0
6 $s_n \to x^*$

Show that this implies that $h_n$ converges.

If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why?).

But $h_{n+1} = h_0 + \sum_{k=1}^n \left(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\right)$.

So $(h_n)_{n=1}^\infty$ converges.

22 of 27
Discrete Gradient Descent

1 Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2 Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3 Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4 Show that this implies that $h_n$ converges
5 Show that $h_n$ must converge to 0
6 $s_n \to x^*$

Show that $h_n$ must converge to 0. Assume $h_n$ converges to some positive number. Previously we had

$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$\langle s_n - x^*, \nabla C(s_n)\rangle \to 0,$

contradicting $h_n$ converging away from 0.

22 of 27
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm!

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$\nabla C(x) = \frac{1}{N}\sum_{n=1}^N f_n(x)$

and an approximation is formed by subsampling:

$\widehat{\nabla C}(x) = \frac{1}{K}\sum_{k \in I_K} f_k(x) \quad (I_K \sim \mathrm{Unif}(\text{subsets of size } K))$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

23 of 27
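A minimal sketch of the subsampled estimator and its unbiasedness, assuming for illustration that the per-datapoint gradient terms $f_n$ are fixed rows of an array (i.e., evaluated at one fixed $x$):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 1000, 5, 32
f = rng.normal(size=(N, D))        # f[n] plays the role of f_n(x)

true_grad = f.mean(axis=0)         # (1/N) sum_n f_n(x)

def subsampled_grad():
    idx = rng.choice(N, size=K, replace=False)  # I_K ~ Unif(subsets of size K)
    return f[idx].mean(axis=0)                  # (1/K) sum_{k in I_K} f_k(x)

# Unbiasedness: averaging many draws recovers the true gradient.
estimate = np.mean([subsampled_grad() for _ in range(20000)], axis=0)
print(np.allclose(estimate, true_grad, atol=1e-2))  # -> True
```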
Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if

• $\mathbb{E}[|X_n|] < \infty$ for all $n$
• $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$\sum_{n=1}^\infty \mathbb{E}\left[(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\,\mathbf{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}}\right] < \infty,$

then $X_n \to X_\infty$ almost surely.

³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$

24 of 27
Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies:

1 $C$ has a unique minimiser $x^*$
2 $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x)\rangle > 0$
3 $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

25 of 27
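A minimal sketch of this proposition in action. The quadratic cost and the Gaussian gradient noise are assumptions for illustration; the noise is zero-mean, so $H_n$ is an unbiased estimator satisfying condition 3.

```python
import numpy as np

rng = np.random.default_rng(1)

def H(s):
    # Unbiased estimator of grad C(s) = s (for C(x) = 0.5 * ||x||^2):
    # added zero-mean noise keeps E[H(s)] = grad C(s).
    return s + rng.normal(scale=0.5, size=s.shape)

s = np.array([4.0, -2.0])
for n in range(1, 50001):
    eps_n = 1.0 / n          # sum eps_n = infinity, sum eps_n^2 < infinity
    s = s - eps_n * H(s)

print(s)  # -> close to x* = 0, as the proposition guarantees almost surely
```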
Stochastic Gradient Descent

1 Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2 Consider the variations $h_{n+1} - h_n$
3 Show that $h_n$ converges almost surely
4 Show that $h_n$ must converge to 0 almost surely
5 $s_n \to x^*$

26 of 27
Stochastic Gradient Descent

1 Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2 Consider the variations $h_{n+1} - h_n$
3 Show that $h_n$ converges almost surely
4 Show that $h_n$ must converge to 0 almost surely
5 $s_n \to x^*$

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n)\rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$

So

$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\, \mathbb{E}\left[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\right]$

26 of 27
Stochastic Gradient Descent

1 Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2 Consider the variations $h_{n+1} - h_n$
3 Show that $h_n$ converges almost surely
4 Show that $h_n$ must converge to 0 almost surely
5 $s_n \to x^*$

Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

• Assume $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^*\|_2^2$
• Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

Get

$\mathbb{E}\left[h'_{n+1} - h'_n \mid \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$

$\Rightarrow \mathbb{E}\left[(h'_{n+1} - h'_n)\,\mathbf{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\Rightarrow$ quasimartingale convergence $\Rightarrow$ $(h'_n)_{n=1}^\infty$ converges a.s. $\Rightarrow$ $(h_n)_{n=1}^\infty$ converges a.s.

26 of 27
Stochastic Gradient Descent

1 Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2 Consider the variations $h_{n+1} - h_n$
3 Show that $h_n$ converges almost surely
4 Show that $h_n$ must converge to 0 almost surely
5 $s_n \to x^*$

Show that $h_n$ must converge to 0 almost surely. From previous calculations,

$\mathbb{E}\left[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n\right] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A$

$(h_n)_{n=1}^\infty$ converges a.s., so the left-hand side is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the right term is summable a.s., and hence the first term on the right-hand side is also summable a.s.:

$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle < \infty$ almost surely.

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$\langle s_n - x^*, \nabla C(s_n)\rangle \to 0$ almost surely.

And by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

26 of 27
Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: Assume

• $C$ is a non-negative, three-times differentiable function
• Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
• The gradient estimate $H$ satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$
• There exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0$

The idea for the proof is then:

1 For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2 Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3 Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.

27 of 27
Motivation

So far we've told you why SGD is "safe" :)

but Robbins-Monro is just a sufficient condition...

then how to choose learning rates to achieve
- fast convergence
- a better local optimum?

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)

Figure from [Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule:
- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time!

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)

idea 2: let the learning rates adapt themselves:
- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS!)

now we'll talk a bit about online learning:
- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by:
- processing one datapoint / a mini-batch of datapoints each iteration
- approximating the gradient with the corresponding data point cost functions:

$\nabla C(w, \mathcal{D}) \approx \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning

From batch to online learning (cont)

Let's forget the cost function on the batch for a moment:

online gradient descent (OGD):
- each iteration $t$, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

for SGD, the received gradient is defined by

$g_t = \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
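A minimal sketch of the OGD loop above, assuming for illustration a stream of squared losses $f_t(w) = \frac{1}{2}(x_t^T w - y_t)^2$ generated from a hypothetical target weight vector `w_true`:

```python
import numpy as np

rng = np.random.default_rng(2)
D, eta = 3, 0.05
w_true = np.array([1.0, -2.0, 0.5])
w = np.zeros(D)

for t in range(5000):
    x_t = rng.normal(size=D)
    y_t = x_t @ w_true
    g_t = (x_t @ w - y_t) * x_t   # (sub-)gradient of f_t at the current w
    w = w - eta * g_t             # OGD update: w <- w - eta * g_t

print(w)  # -> close to w_true
```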
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning

From batch to online learning (cont)

Online learning assumes:
- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w)$ (1)

General definition: $R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning

From batch to online learning (cont)

Follow the Leader (FTL):

$w_{t+1} = \mathrm{argmin}_{w \in S} \sum_{i=1}^t f_i(w)$ (2)

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL. Then $\forall u \in S$:

$R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right]$ (3)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning

From batch to online learning (cont)

A game that fools FTL:

Let $w \in S = [-1, 1]$, and the loss function at time $t$ be

$f_t(w) = \begin{cases} -0.5w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$

FTL is easily fooled!

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
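A short simulation of this game. The losses are linear, $f_t(w) = a_t w$, so FTL always plays an endpoint of $[-1, 1]$; the first play is arbitrary (taken as 0 here, an assumption), and the cumulative coefficient is never exactly 0 after round 1, so no tie-breaking rule is needed:

```python
import numpy as np

def a(t):                        # loss coefficients from the slide
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T = 100
coef_sum, ftl_loss = 0.0, 0.0
w = 0.0                          # arbitrary first play (an assumption)
for t in range(1, T + 1):
    ftl_loss += a(t) * w         # loss suffered this round
    coef_sum += a(t)             # cumulative loss so far is coef_sum * w
    w = -np.sign(coef_sum)       # FTL plays the minimising endpoint of [-1, 1]

print(ftl_loss)                  # roughly T - 1: regret grows linearly
print("loss of the fixed comparator u = 0:", 0.0)
```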
Online learning

From batch to online learning (cont)

Follow the Regularized Leader (FTRL):

$w_{t+1} = \mathrm{argmin}_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w)$ (4)

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL. Then $\forall u \in S$:

$\sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right]$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning

From batch to online learning (cont)

Let's assume the losses $f_t$ are convex functions:

$\forall w, u \in S:\ f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle, \quad \forall g(w) \in \partial f_t(w)$

use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t)\rangle$ as a surrogate:

$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t\rangle - \langle u, g_t\rangle, \quad \forall g_t \in \partial f_t(w_t)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD):

$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t)$ (5)

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (upper bound the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$w_{t+1} = \mathrm{argmin}_{w \in S} \sum_{i=1}^t \langle w, g_i\rangle + \varphi(w)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning

From batch to online learning (cont)

Online Gradient Descent (OGD):

$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t)$ (5)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$, with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$. Then for all $u \in S$:

$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta\sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning

From batch to online learning (cont)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$, with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$. Then for all $u \in S$:

$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta\sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$

Keep in mind for the later discussion of adaGrad:
- I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$
- then the L2 norm will be changed to the dual norm $\|\cdot\|_*$
- and we use Hölder's inequality for the proof

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$:

$\psi_t$ is a strongly-convex function, and

$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle$

Primal-dual sub-gradient update:

$w_{t+1} = \mathrm{argmin}_w\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w)$ (6)

Proximal gradient / composite mirror descent:

$w_{t+1} = \mathrm{argmin}_w\ \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t)$ (7)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont)

adaGrad with diagonal matrices:

$G_t = \sum_{i=1}^t g_i g_i^T, \quad H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$

$w_{t+1} = \mathrm{argmin}_{w \in S}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t}w^T H_t w$ (8)

$w_{t+1} = \mathrm{argmin}_{w \in S}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$ (9)

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:

$\psi_t(w) = \frac{1}{2}w^T H_t w$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$w_{t+1} = \mathrm{argmin}_w\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T \left[\delta I + \mathrm{diag}(G_t)^{1/2}\right](w - w_t)$

$\Rightarrow w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}}$ (10)

(with the square root and division applied elementwise)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
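A minimal sketch of the diagonal adaGrad update in eq. (10), reusing the same hypothetical least-squares stream as in the earlier OGD sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
D, eta, delta = 3, 0.5, 1e-8
w_true = np.array([1.0, -2.0, 0.5])
w = np.zeros(D)
G_diag = np.zeros(D)            # running sum of squared gradients

for t in range(5000):
    x_t = rng.normal(size=D)
    g_t = (x_t @ w - x_t @ w_true) * x_t           # current loss gradient
    G_diag += g_t ** 2
    w = w - eta * g_t / (delta + np.sqrt(G_diag))  # elementwise, eq. (10)

print(w)  # -> close to w_true
```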
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont)

WHY?

time-varying learning rate vs. time-varying proximal term

I want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$:
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot\rangle}$
- $\psi_t(w) = \frac{1}{2}w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2\sum_{d=1}^D \|g_{1:T,d}\|_2$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont)

Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8), with $\delta \ge \max_t \|g_t\|_\infty$. Then for any $u \in S$:

$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$

For $w_t$ generated using the composite mirror descent update (9):

$R_T(u) \le \frac{1}{2\eta}\max_{t \le T}\|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta\sum_{d=1}^D \|g_{1:T,d}\|_2$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad:
- the global learning rate η still needs hand tuning
- sensitive to initial conditions

possible improving directions:
- use truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of running average

existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates

Improving adaGrad (cont)

RMSprop:

$G_t = \rho G_{t-1} + (1 - \rho)g_t g_t^T, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$

$w_{t+1} = \mathrm{argmin}_w\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$

$\Rightarrow w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \quad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t)$ (11)

possible improving directions:
- use truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates

Improving adaGrad (cont)

$\Delta w \propto H^{-1} g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2}$

$\Rightarrow \partial^2 f/\partial w^2 \propto \frac{\partial f/\partial w}{\Delta w}$

$\Rightarrow \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

possible improving directions:
- use truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates

Improving adaGrad (cont)

adaDelta:

$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$

$w_{t+1} = \mathrm{argmin}_w\ \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$

$\Rightarrow w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t$ (12)

possible improving directions:
- use truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates

Improving adaGrad (cont)

ADAM:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1)g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^2$

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}$ (13)

possible improving directions:
- use truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of running average

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
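A minimal sketch of the three update rules in eqs. (11)-(13), in their common elementwise form; the hyperparameter values are assumptions (typical defaults), not values prescribed by these slides:

```python
import numpy as np

def rmsprop_step(w, g, G, eta=1e-2, rho=0.9, delta=1e-8):
    G = rho * G + (1 - rho) * g**2          # running average of g^2
    return w - eta * g / np.sqrt(delta + G), G

def adadelta_step(w, g, G, W, rho=0.95, delta=1e-6):
    G = rho * G + (1 - rho) * g**2          # RMS[g]_t accumulator
    dw = -np.sqrt(W + delta) / np.sqrt(G + delta) * g
    W = rho * W + (1 - rho) * dw**2         # RMS[dw]_{t-1} accumulator
    return w + dw, G, W

def adam_step(w, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, delta=1e-8):
    m = b1 * m + (1 - b1) * g               # first-moment running average
    v = b2 * v + (1 - b2) * g**2            # second-moment running average
    m_hat = m / (1 - b1**t)                 # bias correction, eq. (13)
    v_hat = v / (1 - b2**t)
    return w - eta * m_hat / (np.sqrt(v_hat) + delta), m, v

# usage sketch: one RMSprop step on a dummy gradient
w, G = np.zeros(3), np.zeros(3)
w, G = rmsprop_step(w, np.array([0.1, -0.2, 0.3]), G)
print(w)
```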
Advanced adaptive learning rates

Demo time!

Wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates

Learn the learning rates

idea 3: learn the (global) learning rates as well:

$w_{t+1} = w_t - \eta g_t \ \Rightarrow\ w_t = w(t, \eta;\ w_1, \{g_i\}_{i \le t-1})$

$w_t$ is a function of $\eta$, so learn $\eta$ (with gradient descent) by

$\eta = \mathrm{argmin}_\eta \sum_{s \le t} f_s(w(s, \eta))$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn η on the fly [Masse and Ollivier, 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates

Summary

Recent successes of machine learning rely on stochastic approximation and optimisation methods.

Theoretical analysis guarantees good behaviour.

Adaptive learning rates provide good performance in practice.

Future work will reduce the labour of tuning hyper-parameters.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates

References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.

Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.

Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates

References

Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.

Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates

References

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.

Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
d
dth(t) =
d
dts(t)minus x2
2
=d
dt〈s(t)minus x s(t)minus x〉
= 2
langds(t)
dt s(t)minus x
rang= minus〈s(t)minus xnablaC (s(t))〉 le 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
h(t) is a decreasing function
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption
infx h(x)gtK
〈x minus xnablaC (x)〉 gt 0
which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)
to 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w} \in S} \sum_{i=1}^{t} f_i(\mathbf{w}) + \varphi(\mathbf{w}) \qquad (4)$$

Lemma (upper bound of regret)
Let $\mathbf{w}_1, \mathbf{w}_2, \ldots$ be the sequence produced by FTRL; then $\forall \mathbf{u} \in S$,
$$\sum_{t=1}^{T} [f_t(\mathbf{w}_t) - f_t(\mathbf{u})] \le \varphi(\mathbf{u}) - \varphi(\mathbf{w}_1) + \sum_{t=1}^{T} [f_t(\mathbf{w}_t) - f_t(\mathbf{w}_{t+1})]$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Let's assume the losses $f_t$ are convex functions:
$$\forall \mathbf{w}, \mathbf{u} \in S:\; f_t(\mathbf{u}) - f_t(\mathbf{w}) \ge \langle \mathbf{u} - \mathbf{w}, \mathbf{g}(\mathbf{w}) \rangle, \quad \forall \mathbf{g}(\mathbf{w}) \in \partial f_t(\mathbf{w})$$
Use the linearisation $\tilde{f}_t(\mathbf{w}) = f_t(\mathbf{w}_t) + \langle \mathbf{w}, \mathbf{g}(\mathbf{w}_t) \rangle$ as a surrogate:
$$\sum_{t=1}^{T} f_t(\mathbf{w}_t) - f_t(\mathbf{u}) \le \sum_{t=1}^{T} \langle \mathbf{w}_t, \mathbf{g}_t \rangle - \langle \mathbf{u}, \mathbf{g}_t \rangle, \quad \forall \mathbf{g}_t \in \partial f_t(\mathbf{w}_t)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \mathbf{g}_t, \quad \mathbf{g}_t \in \partial f_t(\mathbf{w}_t) \qquad (5)$$

From FTRL to online gradient descent:
use the linearisation as a surrogate loss (upper-bounding the LHS);
use the L2 regularizer $\varphi(\mathbf{w}) = \frac{1}{2\eta}\|\mathbf{w} - \mathbf{w}_1\|_2^2$;
apply the FTRL regret bound to the surrogate loss:
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w} \in S} \sum_{i=1}^{t} \langle \mathbf{w}, \mathbf{g}_i \rangle + \varphi(\mathbf{w})$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
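A minimal implementation sketch of update (5) (my own code, not from the slides; the `project` argument is an assumption for constrained $S$, with the identity default corresponding to $S = \mathbb{R}^D$):

```python
import numpy as np

def ogd(grad_fns, w0, eta, project=lambda w: w):
    """Online gradient descent: w_{t+1} = Pi_S(w_t - eta * g_t).

    grad_fns: a sequence of callables, one per round, each returning a
              (sub-)gradient g_t of the revealed loss f_t at the current w.
    """
    w = np.asarray(w0, dtype=float)
    iterates = [w.copy()]
    for grad_f in grad_fns:
        w = project(w - eta * grad_f(w))
        iterates.append(w.copy())
    return iterates
```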
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \mathbf{g}_t, \quad \mathbf{g}_t \in \partial f_t(\mathbf{w}_t) \qquad (5)$$

Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(\mathbf{w}) = \frac{1}{2\eta}\|\mathbf{w} - \mathbf{w}_1\|_2^2$; then for all $\mathbf{u} \in S$,
$$R_T(\mathbf{u}) \le \frac{1}{2\eta}\|\mathbf{u} - \mathbf{w}_1\|_2^2 + \eta \sum_{t=1}^{T} \|\mathbf{g}_t\|_2^2, \quad \mathbf{g}_t \in \partial f_t(\mathbf{w}_t)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(\mathbf{w}) = \frac{1}{2\eta}\|\mathbf{w} - \mathbf{w}_1\|_2^2$; then for all $\mathbf{u} \in S$,
$$R_T(\mathbf{u}) \le \frac{1}{2\eta}\|\mathbf{u} - \mathbf{w}_1\|_2^2 + \eta \sum_{t=1}^{T} \|\mathbf{g}_t\|_2^2, \quad \mathbf{g}_t \in \partial f_t(\mathbf{w}_t)$$

Keep in mind for the later discussion of adaGrad:
I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$;
then the L2 norm will be changed to the dual norm $\|\cdot\|_*$,
and we use Hölder's inequality in the proof.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms $\psi_t$:
$\psi_t$ is a strongly-convex function, and
$$B_{\psi_t}(\mathbf{w}, \mathbf{w}_t) = \psi_t(\mathbf{w}) - \psi_t(\mathbf{w}_t) - \langle \nabla\psi_t(\mathbf{w}_t), \mathbf{w} - \mathbf{w}_t \rangle$$

Primal-dual sub-gradient update:
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}}\; \eta\langle \mathbf{w}, \mathbf{g}_t \rangle + \eta\varphi(\mathbf{w}) + \frac{1}{t}\psi_t(\mathbf{w}) \qquad (6)$$

Proximal gradient / composite mirror descent:
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}}\; \eta\langle \mathbf{w}, \mathbf{g}_t \rangle + \eta\varphi(\mathbf{w}) + B_{\psi_t}(\mathbf{w}, \mathbf{w}_t) \qquad (7)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices:
$$G_t = \sum_{i=1}^{t} \mathbf{g}_i \mathbf{g}_i^T, \quad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w} \in S}\; \eta\langle \mathbf{w}, \mathbf{g}_t \rangle + \eta\varphi(\mathbf{w}) + \frac{1}{2t}\mathbf{w}^T H_t \mathbf{w} \qquad (8)$$
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w} \in S}\; \eta\langle \mathbf{w}, \mathbf{g}_t \rangle + \eta\varphi(\mathbf{w}) + \frac{1}{2}(\mathbf{w} - \mathbf{w}_t)^T H_t (\mathbf{w} - \mathbf{w}_t) \qquad (9)$$

To get adaGrad from online gradient descent:
add a proximal term or a Bregman divergence to the objective;
adapt the proximal term through time:
$$\psi_t(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T H_t \mathbf{w}$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(\mathbf{w}) = 0$:
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}}\; \eta\langle \mathbf{w}, \mathbf{g}_t \rangle + \frac{1}{2}(\mathbf{w} - \mathbf{w}_t)^T [\delta I + \mathrm{diag}(G_t^{1/2})](\mathbf{w} - \mathbf{w}_t)$$
$$\Rightarrow\quad \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta\, \mathbf{g}_t}{\delta + \sqrt{\sum_{i=1}^{t} \mathbf{g}_i^2}} \qquad \text{(elementwise)} \qquad (10)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
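In code, update (10) is one line per coordinate. A minimal sketch (mine, not the paper's reference implementation; `eta` and `delta` stand for the slide's $\eta$ and $\delta$):

```python
import numpy as np

def adagrad_step(w, g, G, eta=0.1, delta=1e-8):
    """One diagonal adaGrad step, eq. (10)."""
    G = G + g * g                            # accumulate diag(G_t) = sum_i g_i^2
    w = w - eta * g / (delta + np.sqrt(G))   # per-coordinate step sizes
    return w, G

# usage inside a training loop, with G = np.zeros_like(w) initially:
#   w, G = adagrad_step(w, g, G)
```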
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY?
Time-varying learning rate vs time-varying proximal term:
I want the contribution of $\|\mathbf{g}_t\|_*^2$ induced by $\psi_t$ to be upper bounded by $\|\mathbf{g}_t\|_2$:
define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot \rangle}$;
$\psi_t(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T H_t \mathbf{w}$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$;
a little math can show that
$$\sum_{t=1}^{T} \|\mathbf{g}_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|\mathbf{g}_{1:T,d}\|_2$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume the sequence $\{\mathbf{w}_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|\mathbf{g}_t\|_\infty$; then for any $\mathbf{u} \in S$,
$$R_T(\mathbf{u}) \le \frac{\delta}{\eta}\|\mathbf{u}\|_2^2 + \frac{1}{\eta}\|\mathbf{u}\|_\infty^2 \sum_{d=1}^{D} \|\mathbf{g}_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|\mathbf{g}_{1:T,d}\|_2$$

For $\{\mathbf{w}_t\}$ generated using the composite mirror descent update (9),
$$R_T(\mathbf{u}) \le \frac{1}{2\eta}\max_{t \le T}\|\mathbf{u} - \mathbf{w}_t\|_2^2 \sum_{d=1}^{D} \|\mathbf{g}_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|\mathbf{g}_{1:T,d}\|_2$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning;
- sensitive to initial conditions.

Possible improving directions:
- use a truncated sum / running average;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods:
RMSprop [Tieleman and Hinton, 2012], adaDelta [Zeiler, 2012], ADAM [Kingma and Ba, 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
$$G_t = \rho G_{t-1} + (1-\rho)\,\mathbf{g}_t\mathbf{g}_t^T, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}}\; \eta\langle \mathbf{w}, \mathbf{g}_t \rangle + \frac{1}{2}(\mathbf{w} - \mathbf{w}_t)^T H_t(\mathbf{w} - \mathbf{w}_t)$$
$$\Rightarrow\quad \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta\, \mathbf{g}_t}{RMS[\mathbf{g}]_t}, \quad RMS[\mathbf{g}]_t = \mathrm{diag}(H_t) \qquad (11)$$

Improving direction used here: replace adaGrad's truncated sum with a running average.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
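A minimal sketch of update (11) (my own code; the hyper-parameter defaults are common choices, not taken from the slide):

```python
import numpy as np

def rmsprop_step(w, g, G, eta=1e-3, rho=0.9, delta=1e-8):
    """One RMSprop step, eq. (11): adaGrad's sum becomes a running average."""
    G = rho * G + (1 - rho) * g * g          # exponential moving average of g^2
    w = w - eta * g / np.sqrt(delta + G)     # divide by RMS[g]_t
    return w, G
```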
Advanced adaptive learning rates
Improving adaGrad (cont)
$$\Delta\mathbf{w} \propto H^{-1}\mathbf{g} \propto \frac{\partial f/\partial \mathbf{w}}{\partial^2 f/\partial \mathbf{w}^2} \;\Rightarrow\; \frac{\partial^2 f}{\partial \mathbf{w}^2} \propto \frac{\partial f/\partial \mathbf{w}}{\Delta\mathbf{w}} \;\Rightarrow\; \mathrm{diag}(H) \approx \frac{RMS[\mathbf{g}]_t}{RMS[\Delta\mathbf{w}]_{t-1}}$$

Improving direction used here: (approximate) second-order information.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
$$\mathrm{diag}(H_t) = \frac{RMS[\mathbf{g}]_t}{RMS[\Delta\mathbf{w}]_{t-1}}$$
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}}\; \langle \mathbf{w}, \mathbf{g}_t \rangle + \frac{1}{2}(\mathbf{w} - \mathbf{w}_t)^T H_t(\mathbf{w} - \mathbf{w}_t)$$
$$\Rightarrow\quad \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{RMS[\Delta\mathbf{w}]_{t-1}}{RMS[\mathbf{g}]_t}\, \mathbf{g}_t \qquad (12)$$

Improving directions used here: running average plus (approximate) second-order information.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
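A minimal sketch of update (12) (mine; following Zeiler's convention $RMS[x]_t = \sqrt{E[x^2]_t + \delta}$, with running averages `G` for gradients and `D` for updates):

```python
import numpy as np

def adadelta_step(w, g, G, D, rho=0.95, delta=1e-6):
    """One adaDelta step, eq. (12): no global learning rate at all."""
    G = rho * G + (1 - rho) * g * g                     # running average of g^2
    dw = -np.sqrt(D + delta) / np.sqrt(G + delta) * g   # RMS[dw]_{t-1}/RMS[g]_t * g
    D = rho * D + (1 - rho) * dw * dw                   # running average of dw^2
    return w + dw, G, D
```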
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
$$\mathbf{m}_t = \beta_1\mathbf{m}_{t-1} + (1-\beta_1)\,\mathbf{g}_t, \quad \mathbf{v}_t = \beta_2\mathbf{v}_{t-1} + (1-\beta_2)\,\mathbf{g}_t^2$$
$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t}$$
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta\, \hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \delta} \qquad (13)$$

Improving directions used here: running average, momentum, and bias correction of the running averages.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
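A minimal sketch of update (13) (mine; the defaults are the values suggested in the ADAM paper):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
    """One ADAM step, eq. (13): bias-corrected momentum and RMS scaling."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)          # corrects the zero-initialisation bias
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + delta)
    return w, m, v
```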
Advanced adaptive learning rates
Demo time
wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3: learn the (global) learning rates as well!
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\mathbf{g}_t \;\Rightarrow\; \mathbf{w}_t = \mathbf{w}(t;\, \eta, \mathbf{w}_1, \{\mathbf{g}_i\}_{i \le t-1})$$
$\mathbf{w}_t$ is a function of $\eta$, so we can learn $\eta$ (with gradient descent) by
$$\eta = \operatorname*{argmin}_{\eta} \sum_{s \le t} f_s(\mathbf{w}(s; \eta))$$

Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015];
stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015].
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochastic approximation and optimisation methods.
Theoretical analysis guarantees good behaviour.
Adaptive learning rates provide good performance in practice.
Future work will reduce the labour of tuning hyper-parameters.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
References

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Continuous Gradient Descent
1. Define Lyapunov function $h(t) = \|s(t) - x^*\|_2^2$
2. $h(t)$ is a decreasing function
3. $h(t)$ converges to a limit, and $h'(t)$ converges to 0
4. The limit of $h(t)$ must be 0
5. $s(t) \to x^*$

$h(t)$ is a decreasing function: $h'(t) = 2\langle s'(t),\, s(t) - x^* \rangle = -2\langle s(t) - x^*, \nabla C(s(t)) \rangle \le 0$.
14 of 27
Continuous Gradient Descent
The limit of $h(t)$ must be 0. If not, there is some $K > 0$ such that $h(t) > K$ for all $t$. By assumption,
$$\inf_{x:\, \|x - x^*\|_2^2 > K} \langle x - x^*, \nabla C(x) \rangle > 0,$$
which means that $\inf_t h'(t) < 0$, contradicting the convergence of $h'(t)$ to 0.
14 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the class stated in the theorem.
Solving general ODEs analytically is extremely difficult at best, and usually impossible.
To implement this strategy algorithmically to solve a minimisation problem, we'll need a method of solving the ODE numerically.
There is no one best way to do this!
15 of 27
Discrete Gradient Descent
An intuitive, straightforward discretisation of
$$\frac{d}{dt}s(t) = -\nabla C(s(t))$$
is
$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n) \qquad \text{(Forward Euler)}$$
Although it is easy and quick to implement, it has theoretical (A-)instability issues.
An alternative method which is (A-)stable is the implicit Euler method:
$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})$$
So why settle for explicit discretisation? Implicit requires the solution of a non-linear system at each step, and practically, explicit discretisation works!
16 of 27
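A toy comparison of the two discretisations (my own sketch, using the cost $C(s) = s^2/2$ so that the implicit step has a closed form; for general $C$ the implicit update needs a non-linear solve):

```python
# Gradient flow ds/dt = -C'(s) for C(s) = s^2 / 2, i.e. C'(s) = s.
eps, s_fwd, s_imp = 0.1, 2.0, 2.0
for n in range(50):
    s_fwd = s_fwd - eps * s_fwd    # forward Euler: s_{n+1} = s_n - eps * C'(s_n)
    s_imp = s_imp / (1 + eps)      # implicit Euler: s_{n+1} = s_n - eps * C'(s_{n+1})

# Both tend to the minimiser 0 here, but forward Euler oscillates for eps > 1
# and diverges for eps > 2, while implicit Euler is stable for every eps > 0.
```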
Discrete Gradient Descent
Also have multistep methods, e.g.
$$s_{n+2} = s_{n+1} - \varepsilon_{n+2}\left( \frac{3}{2}\nabla C(s_{n+1}) - \frac{1}{2}\nabla C(s_n) \right)$$
This is the 2nd-order Adams-Bashforth method.
Similar (in)stability properties to forward Euler, but discretisation error of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).
17 of 27
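A sketch of the two-step scheme (my own code, fixed step size; the first step is bootstrapped with forward Euler, since Adams-Bashforth needs two past gradients):

```python
import numpy as np

def adams_bashforth2(grad_C, s0, eps, n_steps):
    """Integrate ds/dt = -grad_C(s) with 2nd-order Adams-Bashforth."""
    s_prev = np.asarray(s0, dtype=float)
    s = s_prev - eps * grad_C(s_prev)          # forward-Euler bootstrap
    for _ in range(n_steps - 1):
        s, s_prev = s - eps * (1.5 * grad_C(s) - 0.5 * grad_C(s_prev)), s
    return s
```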
Discrete Gradient Descent
Examples of discretisation with ε = 0.1 (animation in the original slides)
18 of 27
Discrete Gradient Descent
Examples of discretisation with ε = 0.07 (animation in the original slides)
19 of 27
Discrete Gradient Descent
Examples of discretisation with ε = 0.01 (animation in the original slides)
20 of 27
Discrete Gradient Descent
Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies
1. $C$ has a unique minimiser $x^*$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$;
3. $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$;
then, subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: the proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
21 of 27
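For instance (a sketch of mine, not from the slides), $\varepsilon_n = 1/n$ satisfies both step-size conditions, and on the toy cost $C(x) = \|x\|_2^2/2$ (so $\nabla C(x) = x$, minimiser $x^* = 0$, and conditions 1-3 hold) the iteration visibly converges:

```python
import numpy as np

s = np.array([3.0, -2.0])
for n in range(2, 100001):       # eps_n = 1/n: sum diverges, sum of squares converges
    s = s - (1.0 / n) * s        # s_{n+1} = s_n - eps_n * grad C(s_n), grad C(x) = x

print(np.linalg.norm(s))         # ~ 3.6e-5: s_n -> x* = 0, slowly (like 1/n)
```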
Discrete Gradient Descent
1. Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^{\infty} h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^*$
22 of 27
Discrete Gradient Descent
Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.
Because of the error that the discretisation introduced, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.
22 of 27
Discrete Gradient Descent
Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$.
Illustration credit: Online Learning and Stochastic Approximation, Leon Bottou (1998).
22 of 27
Discrete Gradient Descent
Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. With some algebra, note that
$$\begin{aligned}
h_{n+1} - h_n &= \langle s_{n+1} - x^*,\, s_{n+1} - x^* \rangle - \langle s_n - x^*,\, s_n - x^* \rangle \\
&= \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2\,\langle s_{n+1} - s_n,\, x^* \rangle \\
&= \langle s_n - \varepsilon_n \nabla C(s_n),\, s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2\varepsilon_n \langle \nabla C(s_n), x^* \rangle \\
&= -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2
\end{aligned}$$
22 of 27
Discrete Gradient Descent
Show that $\sum_{n=1}^{\infty} h_n^+ < \infty$. Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get
$$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + Bh_n)$$
$$\Rightarrow\quad h_{n+1} - (1 + \varepsilon_n^2 B)\,h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$$
22 of 27
Discrete Gradient Descent
Show that $\sum_{n=1}^{\infty} h_n^+ < \infty$ (continued). From
$$h_{n+1} - (1 + \varepsilon_n^2 B)\,h_n \le \varepsilon_n^2 A,$$
writing $\mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get
$$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A$$
Assuming $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable.
22 of 27
Discrete Gradient Descent
Show that this implies that $h_n$ converges.
If $\sum_{n=1}^{\infty} \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^{\infty} \min(0, h_{n+1} - h_n)$ also converges (why? because $h_n \ge 0$, the negative variations can cancel at most $h_0$ plus the sum of the positive variations).
But $h_{n+1} = h_0 + \sum_{k=1}^{n} \left( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \right)$.
So $(h_n)_{n=1}^{\infty}$ converges.
22 of 27
Discrete Gradient Descent
Show that $h_n$ must converge to 0. Assume $h_n$ converges to some positive number. Previously we had
$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$$
This is summable, and if we assume further that $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, then we get
$$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0,$$
contradicting $h_n$ converging away from 0.
22 of 27
Stochastic Gradient Descent
We're now ready to introduce the stochastic approximation of the gradient into our algorithm.
If $C$ is a cost function averaged across a data set, we often have a true gradient of the form
$$\nabla C(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x)$$
and an approximation formed by subsampling:
$$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$$
We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.
The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
23 of 27
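Putting the pieces together, a small end-to-end sketch (mine; the least-squares cost and schedule are invented for illustration, with the subsampled gradient unbiased for $\nabla C$ and $\varepsilon_n = 0.1\, n^{-3/4}$ satisfying the Robbins-Monro conditions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5])        # C(x) = ||Xx - y||^2 / N, minimised here

def H(x, K=10):
    """Unbiased subsampled gradient estimate of nabla C at x."""
    idx = rng.choice(len(X), size=K, replace=False)
    return 2 * X[idx].T @ (X[idx] @ x - y[idx]) / K

s = np.zeros(3)
for n in range(1, 20001):
    s = s - (0.1 / n ** 0.75) * H(s)      # sum eps_n = inf, sum eps_n^2 < inf

print(s)                                  # approaches the minimiser [1, -2, 0.5]
```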
Measure-theory-free martingale theory
Let $(X_n)_{n \ge 0}$ be a stochastic process.
$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^{\infty}$ is a martingale³ if
- $\mathbb{E}[|X_n|] < \infty$ for all $n$;
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$.

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^{\infty}$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.
Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^{\infty}$ is a positive stochastic process and
$$\sum_{n=1}^{\infty} \mathbb{E}\left[ \left( \mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n \right) \mathbb{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}} \right] < \infty,$$
then $X_n \to X_\infty$ almost surely.
³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^{\infty}, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.
24 of 27
Stochastic Gradient Descent
Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies
1. $C$ has a unique minimiser $x^*$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$;
3. $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$;
then, subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$ almost surely.

Proof structure: the proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
25 of 27
Stochastic Gradient Descent
1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$
26 of 27
Stochastic Gradient Descent
Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
By exactly the same calculation as in the deterministic discrete case,
$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$$
So
$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2\, \mathbb{E}\left[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\right]$$
26 of 27
Stochastic Gradient Descent
Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:
- assume $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^*\|_2^2$;
- introduce $\mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.
We get
$$\mathbb{E}\left[h'_{n+1} - h'_n \mid \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$$
$$\Rightarrow\quad \mathbb{E}\left[(h'_{n+1} - h'_n)\, \mathbb{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \right] \le \varepsilon_n^2 \mu_n A$$
$\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$ $\Rightarrow$ quasimartingale convergence $\Rightarrow$ $(h'_n)_{n=1}^{\infty}$ converges a.s. $\Rightarrow$ $(h_n)_{n=1}^{\infty}$ converges a.s.
26 of 27
Stochastic Gradient Descent
Show that $h_n$ must converge to 0 almost surely. From previous calculations,
$$\mathbb{E}\left[h_{n+1} - (1 + \varepsilon_n^2 B)\,h_n \mid \mathcal{F}_n\right] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A$$
$(h_n)_{n=1}^{\infty}$ converges, so the left-hand side is summable a.s.; we assume $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, so the right term is summable a.s., and hence
$$\sum_{n=1}^{\infty} \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty \quad \text{almost surely.}$$
If we assume in addition that $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, this forces
$$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely,}$$
and by our initial assumptions about $C$, $(h_n)_{n=1}^{\infty}$ must converge to 0 almost surely.
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume
- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$.

The idea for the proof is then:
1. For a given start position, the sequence $(s_n)_{n=1}^{\infty}$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle) rather than reaching the global optimum.
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
The limit of h(t) must be 0If not there is some K gt 0 such that h(t) gt K for all t Byassumption
infx h(x)gtK
〈x minus xnablaC (x)〉 gt 0
which means that inft hprime(t) lt 0 contradicting the convergence of hprime(t)
to 0
14 of 27
Continuous Gradient Descent
1 Define Lyapunov function h(t) = s(t)minus x22
2 h(t) is a decreasing function
3 h(t) converges to a limit and hprime(t) converges to 0
4 The limit of h(t) must be 0
5 s(t)rarr x
14 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies
1. $C$ has a unique minimiser $x^\star$,
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$,
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$,

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement give the control we need over the unbiased gradient estimators. (A numerical sketch of the proposition follows below.)
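As a quick numerical sketch of the proposition (my own illustration; the cost, noise model, and names are assumptions): minimise $C(x) = \frac{1}{2}\|x\|_2^2$, whose unique minimiser is $x^\star = 0$, using unbiased noisy gradients and the Robbins-Monro step sizes $\varepsilon_n = 1/n$.

```python
import numpy as np

def robbins_monro_sgd(grad_estimate, x0, n_steps, seed=0):
    """Iterate s_{n+1} = s_n - eps_n * H_n(s_n) with eps_n = 1/n,
    so that sum(eps_n) = inf and sum(eps_n^2) < inf."""
    rng = np.random.default_rng(seed)
    s = np.asarray(x0, dtype=float)
    for n in range(1, n_steps + 1):
        s = s - (1.0 / n) * grad_estimate(s, rng)
    return s

# H_n(x) = x + noise is an unbiased estimate of grad C(x) = x, and
# E[||H_n(x)||^2] = ||x||^2 + const, matching condition 3.
noisy_grad = lambda x, rng: x + rng.normal(scale=1.0, size=x.shape)

print(robbins_monro_sgd(noisy_grad, x0=[5.0, -3.0], n_steps=100_000))
```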
Proof outline:
1. Define the Lyapunov process $h_n = \|s_n - x^\star\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^\star$
Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,
$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2.$$
So
$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2\, \mathbb{E}\left[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\right].$$
Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:
- Assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^\star\|_2^2$
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h_n' = \mu_n h_n$

Get
$$\mathbb{E}\left[h'_{n+1} - h'_n \mid \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$$
$$\implies \mathbb{E}\left[(h'_{n+1} - h'_n)\,\mathbb{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \,\middle|\, \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A.$$
$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h'_n)_{n=1}^\infty$ converges a.s. $\implies$ $(h_n)_{n=1}^\infty$ converges a.s.
Show that $h_n$ must converge to 0 almost surely. From the previous calculations,
$$\mathbb{E}\left[h_{n+1} - (1 + \varepsilon_n^2 B)\, h_n \mid \mathcal{F}_n\right] \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A.$$
$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the $\varepsilon_n^2 A$ term is summable a.s., and hence the inner-product term is also summable a.s.:
$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle < \infty \quad \text{almost surely.}$$
If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces
$$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely,}$$
and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.
Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume
- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$.

The idea for the proof is then:
1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle point), rather than reaching the global optimum.
Motivation

So far we've told you why SGD is "safe" :)
- but Robbins-Monro is just a sufficient condition
- so how do we choose learning rates to achieve fast convergence and a better local optimum?
Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule
- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time.
Motivation (cont.)

idea 2: let the learning rates adapt themselves
- pre-conditioning with $|\mathrm{diag}(H)|$ [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning
- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

and some practical comparisons
Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function is $C(w; \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD/mini-batch learning accelerates training in real time by
- processing one / a mini-batch of datapoints each iteration
- considering gradients of the per-datapoint cost functions:
$$\nabla C(w; \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$
(a quick numerical check of this approximation follows below)
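A small sanity check that the mini-batch gradient is an unbiased estimator of the full-batch gradient (a sketch with a hypothetical dataset and a per-datapoint cost of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))   # hypothetical dataset D
w = rng.normal(size=3)

# per-datapoint cost c(w, x) = 0.5 * (w.x)^2, with gradient (w.x) * x
grad_c = lambda w, x: (w @ x) * x

full_grad = np.mean([grad_c(w, x) for x in X], axis=0)

# average many mini-batch estimates of size M = 32
M, trials = 32, 2000
estimates = [
    np.mean([grad_c(w, x) for x in X[rng.choice(len(X), M, replace=False)]],
            axis=0)
    for _ in range(trials)
]

print(full_grad)
print(np.mean(estimates, axis=0))  # close to full_grad: unbiased
```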
From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

online gradient descent (OGD):
- each iteration $t$, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

for SGD, the received gradient is defined by
$$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$
From batch to online learning (cont.)

Online learning assumes:
- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:
$$R_T^\star = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \qquad (1)$$
General definition: $R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization
Follow the Leader (FTL):
$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) \qquad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,
$$R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right]. \qquad (3)$$
A game that fools FTL: let $w \in S = [-1, 1]$, and let the loss function at time $t$ be
$$f_t(w) = \begin{cases} -0.5\,w, & t = 1 \\ w, & t \text{ even} \\ -w, & t > 1 \text{ and } t \text{ odd.} \end{cases}$$
FTL is easily fooled: it keeps jumping to the endpoint that the next loss punishes. (A simulation below makes this concrete.)
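A minimal simulation of this game (my own sketch): since every cumulative loss is linear in $w$, FTL always plays an endpoint of $[-1, 1]$, and the adversarial signs make it pay loss 1 at every round after the first, while the fixed comparator $u = 0$ pays nothing.

```python
def coeff(t):
    """Loss at time t is f_t(w) = coeff(t) * w on S = [-1, 1]."""
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T = 100
cum_coeff, ftl_loss = 0.0, 0.0
for t in range(1, T + 1):
    # FTL plays the minimiser of the cumulative loss cum_coeff * w on [-1, 1]
    w_t = 0.0 if cum_coeff == 0 else (-1.0 if cum_coeff > 0 else 1.0)
    ftl_loss += coeff(t) * w_t
    cum_coeff += coeff(t)

# The fixed comparator u = 0 has total loss 0, so FTL's regret equals
# ftl_loss, which grows linearly in T (here, T - 1 = 99).
print(ftl_loss)
```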
Follow the Regularized Leader (FTRL):
$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,
$$\sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right].$$
Let's assume the losses $f_t$ are convex:
$$\forall w, u \in S, \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w).$$
Use the linearisation $\tilde f_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle$ as a surrogate:
$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t).$$
Online Gradient Descent (OGD):
$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (it upper-bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:
$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$$
Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$; then for all $u \in S$,
$$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t).$$

Keep in mind for the later discussion of adaGrad:
- we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$;
- then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_\star$;
- and we use Hölder's inequality in the proof.
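For concreteness, a minimal projected-OGD sketch (my own illustration; the feasible set, step size, and gradient sequence are assumptions): update (5) followed by Euclidean projection back onto $S$.

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Euclidean projection onto S = {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def ogd(gradients, eta, w1):
    """Projected OGD: w_{t+1} = Proj_S(w_t - eta * g_t)."""
    w = np.asarray(w1, dtype=float)
    for g in gradients:
        w = project_l2_ball(w - eta * g)
    return w

rng = np.random.default_rng(0)
gs = rng.normal(size=(1000, 5))   # a stand-in (sub-)gradient sequence
print(ogd(gs, eta=0.05, w1=np.zeros(5)))
```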
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and
$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle.$$

Primal-dual sub-gradient update:
$$w_{t+1} = \arg\min_w\; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent:
$$w_{t+1} = \arg\min_w\; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$
adaGrad with diagonal matrices:
$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$
$$w_{t+1} = \arg\min_{w \in S}\; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w \qquad (8)$$
$$w_{t+1} = \arg\min_{w \in S}\; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:
$$\psi_t(w) = \frac{1}{2} w^T H_t w$$
example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:
$$w_{t+1} = \arg\min_w\; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T \left[\delta I + \mathrm{diag}(G_t^{1/2})\right] (w - w_t)$$
$$\Rightarrow\quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \quad \text{(element-wise)} \qquad (10)$$
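Update (10) is just SGD with a per-coordinate rescaling; a minimal sketch (my own; `grad_fn`, the test cost, and the hyper-parameter values are assumptions):

```python
import numpy as np

def adagrad(grad_fn, w0, eta=0.1, delta=1e-8, n_steps=1000):
    """Composite mirror descent adaGrad on R^D with phi(w) = 0,
    i.e. update (10): w <- w - eta * g / (delta + sqrt(sum of g^2))."""
    w = np.asarray(w0, dtype=float)
    sum_sq = np.zeros_like(w)   # running sum of squared gradients
    for _ in range(n_steps):
        g = grad_fn(w)
        sum_sq += g * g
        w -= eta * g / (delta + np.sqrt(sum_sq))
    return w

# Example: noisy gradient of C(w) = 0.5 * ||w||^2
rng = np.random.default_rng(0)
print(adagrad(lambda w: w + rng.normal(size=w.shape), w0=np.ones(3)))
```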
WHY?

time-varying learning rate vs time-varying proximal term:
- we want the contribution of the $\|g_t\|_\star^2$ terms induced by $\psi_t$ to be upper-bounded in terms of $\|g_{1:T,d}\|_2$
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot \rangle}$
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math then shows that
$$\sum_{t=1}^T \|g_t\|_{\psi_t^\star}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2$$
Theorem (Thm. 5 in the paper): Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,
$$R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.$$
For $\{w_t\}$ generated using the composite mirror descent update (9),
$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2.$$
Improving adaGrad

possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]
RMSprop (improving direction: replace the sum with a running average):
$$G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$
$$w_{t+1} = \arg\min_w\; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$
$$\Rightarrow\quad w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$$
adaDelta's motivation (improving direction: approximate second-order information): a diagonal-Newton step satisfies
$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \;\Rightarrow\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \;\Rightarrow\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}.$$
adaDelta:
$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$
$$w_{t+1} = \arg\min_w\; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$
$$\Rightarrow\quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12)$$
ADAM (improving directions: add momentum; correct the bias of the running averages):
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}$$
$$w_{t+1} = w_t - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \delta} \qquad (13)$$
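The updates (11)-(13) differ only in which running statistics they track; a side-by-side sketch of a single step of each (my own illustration, all operations element-wise; the state dictionaries and default hyper-parameters are assumptions):

```python
import numpy as np

def rmsprop_step(w, g, state, eta=1e-3, rho=0.9, delta=1e-8):
    # Update (11): divide by a running RMS of the gradient
    state["G"] = rho * state["G"] + (1 - rho) * g * g
    return w - eta * g / np.sqrt(state["G"] + delta)

def adadelta_step(w, g, state, rho=0.95, delta=1e-6):
    # Update (12): units fixed by the RMS of previous parameter updates
    state["G"] = rho * state["G"] + (1 - rho) * g * g
    dw = -np.sqrt(state["D"] + delta) / np.sqrt(state["G"] + delta) * g
    state["D"] = rho * state["D"] + (1 - rho) * dw * dw
    return w + dw

def adam_step(w, g, state, eta=1e-3, b1=0.9, b2=0.999, delta=1e-8):
    # Update (13): bias-corrected first and second moment estimates
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g
    state["v"] = b2 * state["v"] + (1 - b2) * g * g
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - eta * m_hat / (np.sqrt(v_hat) + delta)

# usage sketch
w, g = np.zeros(3), np.ones(3)
print(rmsprop_step(w, g, {"G": np.zeros(3)}))
print(adadelta_step(w, g, {"G": np.zeros(3), "D": np.zeros(3)}))
print(adam_step(w, g, {"m": np.zeros(3), "v": np.zeros(3), "t": 0}))
```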
Demo time

Wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

idea 3: learn the (global) learning rates as well
$$w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1})$$
- $w_t$ is a function of $\eta$
- learn $\eta$ (with gradient descent) by
$$\eta^\star = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))$$
- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
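A crude on-the-fly variant of idea 3 (a sketch under strong assumptions, and not the algorithm of either cited paper): differentiating the current loss through only the last update $w_t = w_{t-1} - \eta g_{t-1}$ gives $\partial f_t(w_t) / \partial \eta = -\langle g_t, g_{t-1} \rangle$, so $\eta$ can take its own gradient step.

```python
import numpy as np

def sgd_with_learned_eta(grad_fn, w0, eta0=0.01, beta=1e-4, n_steps=1000):
    """Adapt the global learning rate by gradient descent on eta:
    d f_t(w_t) / d eta = -<g_t, g_{t-1}> through w_t = w_{t-1} - eta * g_{t-1},
    so the descent step on eta is eta += beta * <g_t, g_{t-1}>."""
    w = np.asarray(w0, dtype=float)
    eta, g_prev = eta0, np.zeros_like(w)
    for _ in range(n_steps):
        g = grad_fn(w)
        eta += beta * float(g @ g_prev)   # hypergradient step on eta
        w -= eta * g
        g_prev = g
    return w, eta

rng = np.random.default_rng(0)
print(sgd_with_learned_eta(lambda w: w + 0.1 * rng.normal(size=w.shape),
                           w0=np.ones(4)))
```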
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods
- Theoretical analysis guarantees good behaviour
- Adaptive learning rates provide good performance in practice
- Future work will reduce the labour spent on tuning hyper-parameters
References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.

Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.

Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.

Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.

Maclaurin, D., Duvenaud, D., and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Masse, P.-Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent

1. Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$

2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$

3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$

4. Show that this implies that $h_n$ converges

5. Show that $h_n$ must converge to 0

6. $s_n \to x^*$

Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.

Because of the error that the discretisation introduced, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. Illustration credit: Online Learning and Stochastic Approximation, Léon Bottou (1998).

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. With some algebra, note that

$$h_{n+1} - h_n = \langle s_{n+1} - x^*, s_{n+1} - x^* \rangle - \langle s_n - x^*, s_n - x^* \rangle$$
$$= \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2\langle s_{n+1} - s_n, x^* \rangle$$
$$= \langle s_n - \epsilon_n \nabla C(s_n), s_n - \epsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2\epsilon_n \langle \nabla C(s_n), x^* \rangle$$
$$= -2\epsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \epsilon_n^2 \|\nabla C(s_n)\|_2^2$$

Show that $\sum_{n=1}^\infty h_n^+ < \infty$. Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get

$$h_{n+1} - h_n \le -2\epsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \epsilon_n^2 (A + B h_n)$$
$$\implies h_{n+1} - (1 + \epsilon_n^2 B) h_n \le -2\epsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \epsilon_n^2 A \le \epsilon_n^2 A$$

Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \epsilon_i^2 B}$ and $h_n' = \mu_n h_n$, we get

$$h_{n+1}' - h_n' \le \epsilon_n^2 \mu_n A$$

Assuming $\sum_{n=1}^\infty \epsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable.

Show that this implies that $h_n$ converges. If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why?).

But $h_{n+1} = h_0 + \sum_{k=0}^{n} \big(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\big)$.

So $(h_n)_{n=1}^\infty$ converges.

Show that $h_n$ must converge to 0. Assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2\epsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \epsilon_n^2 \|\nabla C(s_n)\|_2^2$$

This is summable, and if we assume further that $\sum_{n=1}^\infty \epsilon_n = \infty$, then we get

$$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0,$$

contradicting $h_n$ converging away from 0.

22 of 27
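As an empirical companion to the proof (an illustration under an assumed quadratic cost, not part of the slides), we can track $h_n$ and its cumulative positive variations: an aggressive early step makes $h_n$ overshoot once, yet the total 'up' movement stays bounded and $h_n \to 0$:

```python
import numpy as np

# Assumed cost C(x) = 0.5*(x - x_star)^2; eps_n = 2.5/n still satisfies
# sum eps_n = inf and sum eps_n^2 < inf, but the first step overshoots.
x_star = 3.0
grad_C = lambda x: x - x_star

s = -5.0
h_prev = (s - x_star) ** 2
up_moves = 0.0
for n in range(1, 5001):
    s -= (2.5 / n) * grad_C(s)
    h = (s - x_star) ** 2
    up_moves += max(0.0, h - h_prev)   # cumulative positive variations
    h_prev = h

print(h_prev, up_moves)  # h_n -> 0; finite total 'up' movement
```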
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x)$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

23 of 27
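A quick check of the subsampling approximation (a sketch with an assumed toy setup; `f` plays the role of the per-datapoint gradient $f_n$): averaging minibatch estimates over many draws recovers the full-batch gradient, illustrating unbiasedness.

```python
import numpy as np

rng = np.random.default_rng(0)

N, K = 1000, 32
data = rng.normal(size=N)
f = lambda x, d: x - d                 # gradient of 0.5*(x - d)^2

x = 1.5
full_grad = np.mean(f(x, data))        # true gradient (average over all data)

# Subsampled estimates: their average approaches the true gradient.
estimates = [np.mean(f(x, rng.choice(data, size=K, replace=False)))
             for _ in range(2000)]
print(full_grad, np.mean(estimates))   # close to each other
```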
Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if

I $\mathbb{E}[|X_n|] < \infty$ for all $n$

I $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\Big[ \big(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n\big)\, \mathbf{1}_{\{\mathbb{E}[X_{n+1}|\mathcal{F}_n] - X_n > 0\}} \Big] < \infty$$

then $X_n \to X_\infty$ almost surely.

³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ $\forall n$

24 of 27
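A standard concrete example of the martingale convergence theorem (not from the slides) is the Pólya urn: the fraction of red balls is a bounded martingale, so each sample path settles to a (random) limit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Polya urn: draw a ball uniformly, put it back with another of the same
# colour. The red fraction is a martingale in [0, 1], hence converges a.s.
def polya_path(steps=10000):
    red, total = 1, 2
    for _ in range(steps):
        if rng.random() < red / total:
            red += 1
        total += 1
    return red / total

print([round(polya_path(), 3) for _ in range(5)])  # each path settles down
```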
Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \epsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^*$;

2. $\forall \epsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \epsilon} \langle x - x^*, \nabla C(x) \rangle > 0$;

3. $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$;

then subject to $\sum_n \epsilon_n = \infty$ and $\sum_n \epsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

25 of 27
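A minimal sketch of the proposition in action (assumed quadratic cost and Gaussian gradient noise, so that $\mathbb{E}[\|H_n(x)\|_2^2] = \|x - x^*\|_2^2 + 2$, i.e. condition 3 holds with $A = 2$, $B = 1$):

```python
import numpy as np

rng = np.random.default_rng(2)

x_star = np.array([1.0, -2.0])
s = np.array([10.0, 10.0])

for n in range(1, 50001):
    # Unbiased noisy gradient of C(x) = 0.5*||x - x_star||^2
    H_n = (s - x_star) + rng.normal(scale=1.0, size=2)
    s = s - (1.0 / n) * H_n            # Robbins-Monro step sizes

print(np.linalg.norm(s - x_star))      # small: s_n -> x*
```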
Stochastic Gradient Descent

1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$

2. Consider the variations $h_{n+1} - h_n$

3. Show that $h_n$ converges almost surely

4. Show that $h_n$ must converge to 0 almost surely

5. $s_n \to x^*$

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\epsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \epsilon_n^2 \|H_n(s_n)\|_2^2$$

So

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\epsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \epsilon_n^2\, \mathbb{E}\big[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\big]$$

Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

I Assume $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B\|x - x^*\|_2^2$

I Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \epsilon_i^2 B}$ and $h_n' = \mu_n h_n$

We get

$$\mathbb{E}[h_{n+1}' - h_n' \mid \mathcal{F}_n] \le \epsilon_n^2 \mu_n A \implies \mathbb{E}\big[(h_{n+1}' - h_n')\, \mathbf{1}_{\{\mathbb{E}[h_{n+1}' - h_n' \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n\big] \le \epsilon_n^2 \mu_n A$$

$\sum_{n=1}^\infty \epsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h_n')_{n=1}^\infty$ converges a.s. $\implies$ $(h_n)_{n=1}^\infty$ converges a.s.

Show that $h_n$ must converge to 0 almost surely. From previous calculations,

$$\mathbb{E}\big[h_{n+1} - (1 + \epsilon_n^2 B) h_n \mid \mathcal{F}_n\big] \le -2\epsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \epsilon_n^2 A$$

$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable a.s. Assuming $\sum_{n=1}^\infty \epsilon_n^2 < \infty$, the $\epsilon_n^2 A$ term is summable a.s., so the remaining term is also summable a.s.:

$$\sum_{n=1}^\infty \epsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty \quad \text{almost surely}$$

If we assume in addition that $\sum_{n=1}^\infty \epsilon_n = \infty$, this forces

$$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely}$$

And by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

26 of 27
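Empirically, the stochastic Lyapunov process behaves exactly as the quasimartingale argument suggests (same assumed toy setup as before): $h_n$ keeps making small 'up' moves because of the noise, but their total stays finite and $h_n \to 0$.

```python
import numpy as np

rng = np.random.default_rng(3)

x_star, s = 0.0, 8.0
h_prev, up_moves = (s - x_star) ** 2, 0.0
for n in range(1, 100001):
    H_n = (s - x_star) + rng.normal()   # unbiased gradient estimate
    s -= (1.0 / n) * H_n
    h = (s - x_star) ** 2
    up_moves += max(0.0, h - h_prev)    # positive variations accumulate
    h_prev = h

print(h_prev, up_moves)  # h_n near 0; total 'up' movement finite
```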
Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

I $C$ is a non-negative, three-times differentiable function

I the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \epsilon_n = \infty$, $\sum_{n=1}^\infty \epsilon_n^2 < \infty$

I the gradient estimate $H$ satisfies $\mathbb{E}\big[\|H_n(x)\|^k\big] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$

I there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.

2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.

3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.

27 of 27
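As a sketch of this relaxed setting (an assumed non-convex double-well cost $C(x) = (x^2 - 1)^2$, not from the slides), SGD with Robbins-Monro rates settles at a critical point, here near one of the two minima at $\pm 1$:

```python
import numpy as np

rng = np.random.default_rng(4)

grad_C = lambda x: 4.0 * x * (x * x - 1.0)   # gradient of (x^2 - 1)^2

for s0 in (-2.0, -0.1, 0.1, 2.0):
    s = s0
    for n in range(1, 100001):
        # Unbiased noisy gradient; eps_n = 0.05/n satisfies Robbins-Monro.
        s -= (0.05 / n) * (grad_C(s) + rng.normal(scale=0.1))
    print(s0, "->", round(s, 3))             # settles near +1 or -1
```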
Motivation
So far we've told you why SGD is "safe" :)

but Robbins-Monro is just a sufficient condition

then how to choose learning rates to achieve

fast convergence / a better local optimum?

1 of 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1: search for the best learning rate schedule

grid search, random search, cross validation, Bayesian optimization

But I'm lazy and I don't want to spend too much time

2 of 27
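For completeness, idea 1 in code: a minimal random-search sketch over log-uniform learning rates (all details here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Run a short, noisy SGD trial for each candidate learning rate and
# keep the one with the smallest final loss.
def sgd_trial(eta, steps=500):
    s = 5.0
    for _ in range(steps):
        s -= eta * ((s - 1.0) + rng.normal(scale=0.5))  # noisy grad of 0.5*(s-1)^2
    return (s - 1.0) ** 2                               # final loss

etas = 10 ** rng.uniform(-4, 0, size=20)   # log-uniform candidates
best = min(etas, key=sgd_trial)
print("best learning rate:", best)
```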
Motivation (cont)
idea 2: let the learning rates adapt themselves

pre-conditioning with |diag(H)| [Becker and LeCun 1988]

adaGrad [Duchi et al 2010]

adaDelta [Zeiler 2012], RMSprop [Tieleman and Hinton 2012]

ADAM [Kingma and Ba 2014]

Equilibrated SGD [Dauphin et al 2015] (this year's NIPS)

now we'll talk a bit about online learning

a good tutorial [Shalev-Shwartz 2012], a milestone paper [Zinkevich 2003]

and some practical comparisons

3 of 27
Online learning
From batch to online learning
Batch learning often assumes:

we've got the whole dataset $\mathcal{D}$

the cost function is $C(\mathbf{w}; \mathcal{D}) = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}[c(\mathbf{w}, \mathbf{x})]$

SGD/mini-batch learning accelerates training in real time by

processing one datapoint / a mini-batch of datapoints each iteration

considering gradients with data point cost functions:

$$\nabla C(\mathbf{w}; \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(\mathbf{w}, \mathbf{x}_m)$$

4 of 27
Online learning
From batch to online learning (cont)
Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):

each iteration $t$ we receive a loss $f_t(\mathbf{w})$ and a (noisy) (sub-)gradient $\mathbf{g}_t \in \partial f_t(\mathbf{w})$, then we update the weights with $\mathbf{w} \leftarrow \mathbf{w} - \eta \mathbf{g}_t$

for SGD the received gradient is defined by

$$\mathbf{g}_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(\mathbf{w}, \mathbf{x}_m)$$

5 of 27
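A minimal OGD loop (a sketch with an assumed streaming quadratic loss):

```python
import numpy as np

rng = np.random.default_rng(6)

# At each round t we see a loss f_t(w) = 0.5*(w - x_t)^2 and its
# gradient, then take a step with a fixed learning rate eta.
eta, w = 0.05, 0.0
stream = rng.normal(loc=2.0, scale=1.0, size=5000)  # incoming datapoints

for x_t in stream:
    g_t = w - x_t          # gradient of the round-t loss at current w
    w -= eta * g_t

print(w)  # hovers around the stream mean (~2.0)
```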
Online learning
From batch to online learning (cont)
Online learning assumes:

each iteration $t$ we receive a loss $f_t(\mathbf{w})$ (in the batch learning context, $f_t(\mathbf{w}) = c(\mathbf{w}, \mathbf{x}_t)$), then we learn $\mathbf{w}_{t+1}$ following some rules

Performance is measured by the regret

$$R_T^* = \sum_{t=1}^{T} f_t(\mathbf{w}_t) - \inf_{\mathbf{w} \in S} \sum_{t=1}^{T} f_t(\mathbf{w}) \qquad (1)$$

General definition: $R_T(\mathbf{u}) = \sum_{t=1}^{T} [f_t(\mathbf{w}_t) - f_t(\mathbf{u})]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization

7 of 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL):

$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w} \in S} \sum_{i=1}^{t} f_i(\mathbf{w}) \qquad (2)$$

Lemma (upper bound of regret): Let $\mathbf{w}_1, \mathbf{w}_2, \ldots$ be the sequence produced by FTL; then $\forall \mathbf{u} \in S$:

$$R_T(\mathbf{u}) = \sum_{t=1}^{T} [f_t(\mathbf{w}_t) - f_t(\mathbf{u})] \le \sum_{t=1}^{T} [f_t(\mathbf{w}_t) - f_t(\mathbf{w}_{t+1})] \qquad (3)$$

8 of 27
Online learning
From batch to online learning (cont)
A game that fools FTL:

Let $w \in S = [-1, 1]$ and the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled!

9 of 27
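Simulating the game (a sketch of the slide's example) shows why: FTL's cumulative loss grows like $T$, while any fixed $w \in [-1, 1]$ incurs at most a constant, so the regret is linear.

```python
# FTL picks the minimiser of the cumulative loss over S = [-1, 1];
# the alternating losses push it to the worst endpoint every round.
def loss_coeff(t):                 # f_t(w) = coeff * w
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T, cum, ftl_loss = 100, 0.0, 0.0
w = 0.0                            # arbitrary first play
for t in range(1, T + 1):
    ftl_loss += loss_coeff(t) * w
    cum += loss_coeff(t)
    w = -1.0 if cum > 0 else 1.0   # argmin of cum * w over [-1, 1]

print(ftl_loss)                    # grows like T: linear regret
```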
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL):

$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w} \in S} \sum_{i=1}^{t} f_i(\mathbf{w}) + \varphi(\mathbf{w}) \qquad (4)$$

Lemma (upper bound of regret): Let $\mathbf{w}_1, \mathbf{w}_2, \ldots$ be the sequence produced by FTRL; then $\forall \mathbf{u} \in S$:

$$\sum_{t=1}^{T} [f_t(\mathbf{w}_t) - f_t(\mathbf{u})] \le \varphi(\mathbf{u}) - \varphi(\mathbf{w}_1) + \sum_{t=1}^{T} [f_t(\mathbf{w}_t) - f_t(\mathbf{w}_{t+1})]$$

10 of 27
Online learning
From batch to online learning (cont)
Let's assume the losses $f_t$ are convex functions:

$$\forall \mathbf{w}, \mathbf{u} \in S: \quad f_t(\mathbf{u}) - f_t(\mathbf{w}) \ge \langle \mathbf{u} - \mathbf{w}, \mathbf{g}(\mathbf{w}) \rangle \quad \forall \mathbf{g}(\mathbf{w}) \in \partial f_t(\mathbf{w})$$

Use the linearisation $\tilde{f}_t(\mathbf{w}) = f_t(\mathbf{w}_t) + \langle \mathbf{w}, \mathbf{g}(\mathbf{w}) \rangle$ as a surrogate:

$$\sum_{t=1}^{T} f_t(\mathbf{w}_t) - f_t(\mathbf{u}) \le \sum_{t=1}^{T} \langle \mathbf{w}_t, \mathbf{g}_t \rangle - \langle \mathbf{u}, \mathbf{g}_t \rangle \quad \forall \mathbf{g}_t \in \partial f_t(\mathbf{w}_t)$$

11 of 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD):

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \mathbf{g}_t, \quad \mathbf{g}_t \in \partial f_t(\mathbf{w}_t) \qquad (5)$$

From FTRL to online gradient descent:

use the linearisation as a surrogate loss (it upper bounds the LHS)

use the L2 regularizer $\varphi(\mathbf{w}) = \frac{1}{2\eta}\|\mathbf{w} - \mathbf{w}_1\|_2^2$

apply the FTRL regret bound to the surrogate loss:

$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w} \in S} \sum_{i=1}^{t} \langle \mathbf{w}, \mathbf{g}_i \rangle + \varphi(\mathbf{w})$$

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(\mathbf{w}) = \frac{1}{2\eta}\|\mathbf{w} - \mathbf{w}_1\|_2^2$; then for all $\mathbf{u} \in S$:

$$R_T(\mathbf{u}) \le \frac{1}{2\eta}\|\mathbf{u} - \mathbf{w}_1\|_2^2 + \eta \sum_{t=1}^{T} \|\mathbf{g}_t\|_2^2, \quad \mathbf{g}_t \in \partial f_t(\mathbf{w}_t)$$

12 of 27
Online learning
From batch to online learning (cont)
Recall the regret bound above, and keep it in mind for the later discussion of adaGrad:

I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$; then the L2 norm of the gradients in the bound is changed to the dual norm $\|\cdot\|_*$, and we use Hölder's inequality in the proof.

13 of 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms $\psi_t$:

$\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(\mathbf{w}, \mathbf{w}_t) = \psi_t(\mathbf{w}) - \psi_t(\mathbf{w}_t) - \langle \nabla \psi_t(\mathbf{w}_t), \mathbf{w} - \mathbf{w}_t \rangle$$

Primal-dual sub-gradient update:

$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + \frac{1}{t} \psi_t(\mathbf{w}) \qquad (6)$$

Proximal gradient / composite mirror descent:

$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + B_{\psi_t}(\mathbf{w}, \mathbf{w}_t) \qquad (7)$$

14 of 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^{t} \mathbf{g}_i \mathbf{g}_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w} \in S} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + \frac{1}{2t} \mathbf{w}^T H_t \mathbf{w} \qquad (8)$$

$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w} \in S} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^T H_t (\mathbf{w} - \mathbf{w}_t) \qquad (9)$$

To get adaGrad from online gradient descent:

add a proximal term or a Bregman divergence to the objective

adapt the proximal term through time:

$$\psi_t(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T H_t \mathbf{w}$$

15 of 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(\mathbf{w}) = 0$:

$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^T \big[\delta I + \mathrm{diag}(G_t)^{1/2}\big] (\mathbf{w} - \mathbf{w}_t)$$

$$\Rightarrow \quad \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta\, \mathbf{g}_t}{\delta + \sqrt{\sum_{i=1}^{t} \mathbf{g}_i^2}} \qquad (10)$$

(the division is understood coordinate-wise)

16 of 27
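A sketch of the diagonal update (10) on an assumed noisy quadratic objective (all operations coordinate-wise):

```python
import numpy as np

rng = np.random.default_rng(7)

eta, delta = 0.5, 1e-8
w, G = np.zeros(2), np.zeros(2)        # G accumulates squared gradients
x_star = np.array([1.0, -3.0])

for t in range(1, 2001):
    g = (w - x_star) + rng.normal(scale=0.1, size=2)  # noisy gradient
    G += g * g
    w -= eta * g / (delta + np.sqrt(G))               # update (10)

print(w)  # approaches x_star
```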
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY?

time-varying learning rate vs. time-varying proximal term:

I want the contribution of $\|\mathbf{g}_t\|_*^2$ induced by $\psi_t$ to be upper bounded by $\|\mathbf{g}_t\|_2$:

define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot \rangle}$

$\psi_t(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T H_t \mathbf{w}$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$

a little math can show that

$$\sum_{t=1}^{T} \|\mathbf{g}_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|\mathbf{g}_{1:T,d}\|_2$$

17 of 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm. 5 in the paper): Assume the sequence $\mathbf{w}_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|\mathbf{g}_t\|_\infty$; then for any $\mathbf{u} \in S$:

$$R_T(\mathbf{u}) \le \frac{\delta}{\eta}\|\mathbf{u}\|_2^2 + \frac{1}{\eta}\|\mathbf{u}\|_\infty^2 \sum_{d=1}^{D} \|\mathbf{g}_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|\mathbf{g}_{1:T,d}\|_2$$

For $\mathbf{w}_t$ generated using the composite mirror descent update (9):

$$R_T(\mathbf{u}) \le \frac{1}{2\eta} \max_{t \le T} \|\mathbf{u} - \mathbf{w}_t\|_2^2 \sum_{d=1}^{D} \|\mathbf{g}_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|\mathbf{g}_{1:T,d}\|_2$$

18 of 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad:

the global learning rate $\eta$ still needs hand tuning

it is sensitive to initial conditions

possible improving directions:

use a truncated sum / running average

use (approximate) second-order information

add momentum

correct the bias of the running average

existing methods:

RMSprop [Tieleman and Hinton 2012], adaDelta [Zeiler 2012], ADAM [Kingma and Ba 2014]

19 of 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop:

$$G_t = \rho G_{t-1} + (1 - \rho)\, \mathbf{g}_t \mathbf{g}_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^T H_t (\mathbf{w} - \mathbf{w}_t)$$

$$\Rightarrow \quad \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta\, \mathbf{g}_t}{\mathrm{RMS}[\mathbf{g}]_t}, \qquad \mathrm{RMS}[\mathbf{g}]_t = \mathrm{diag}(H_t) \qquad (11)$$

possible improving directions: use a truncated sum / running average · use (approximate) second-order information · add momentum · correct the bias of the running average

20 of 27
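The same toy sketch with the RMSprop update (11): the running average keeps the effective step size from decaying the way adaGrad's ever-growing sum does (assumed noisy quadratic objective as before):

```python
import numpy as np

rng = np.random.default_rng(8)

eta, rho, delta = 0.01, 0.9, 1e-8
w, G = np.zeros(2), np.zeros(2)
x_star = np.array([1.0, -3.0])

for t in range(2000):
    g = (w - x_star) + rng.normal(scale=0.1, size=2)
    G = rho * G + (1 - rho) * g * g          # running average of g^2
    w -= eta * g / np.sqrt(delta + G)        # divide by RMS[g]_t

print(w)  # approaches x_star
```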
Advanced adaptive learning rates
Improving adaGrad (cont)
$$\Delta \mathbf{w} \propto H^{-1}\mathbf{g} \propto \frac{\partial f / \partial \mathbf{w}}{\partial^2 f / \partial \mathbf{w}^2} \quad \Rightarrow \quad \frac{\partial^2 f}{\partial \mathbf{w}^2} \propto \frac{\partial f / \partial \mathbf{w}}{\Delta \mathbf{w}} \quad \Rightarrow \quad \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[\mathbf{g}]_t}{\mathrm{RMS}[\Delta \mathbf{w}]_{t-1}}$$

possible improving directions: use a truncated sum / running average · use (approximate) second-order information · add momentum · correct the bias of the running average

21 of 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[\mathbf{g}]_t}{\mathrm{RMS}[\Delta \mathbf{w}]_{t-1}}$$

$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}} \; \langle \mathbf{w}, \mathbf{g}_t \rangle + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^T H_t (\mathbf{w} - \mathbf{w}_t)$$

$$\Rightarrow \quad \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\mathrm{RMS}[\Delta \mathbf{w}]_{t-1}}{\mathrm{RMS}[\mathbf{g}]_t}\, \mathbf{g}_t \qquad (12)$$

possible improving directions: use a truncated sum / running average · use (approximate) second-order information · add momentum · correct the bias of the running average

21 of 27
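And the adaDelta update (12), which removes the global learning rate entirely (same assumed toy objective; note the characteristic slow start while RMS[Δw] warms up):

```python
import numpy as np

rng = np.random.default_rng(9)

rho, delta = 0.95, 1e-6
w = np.zeros(2)
Eg2, Edw2 = np.zeros(2), np.zeros(2)   # running averages of g^2 and dw^2
x_star = np.array([1.0, -3.0])

for t in range(5000):
    g = (w - x_star) + rng.normal(scale=0.1, size=2)
    Eg2 = rho * Eg2 + (1 - rho) * g * g
    dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g  # update (12)
    Edw2 = rho * Edw2 + (1 - rho) * dw * dw
    w += dw

print(w)  # approaches x_star
```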
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM:

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1)\, \mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2)\, \mathbf{g}_t^2$$

$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}$$

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta\, \hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \delta} \qquad (13)$$

possible improving directions: use a truncated sum / running average · use (approximate) second-order information · add momentum · correct the bias of the running average

22 of 27
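And the ADAM update (13) with bias-corrected moments (same assumed toy objective):

```python
import numpy as np

rng = np.random.default_rng(10)

eta, beta1, beta2, delta = 0.01, 0.9, 0.999, 1e-8
w = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
x_star = np.array([1.0, -3.0])

for t in range(1, 3001):
    g = (w - x_star) + rng.normal(scale=0.1, size=2)
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g * g      # second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + delta)

print(w)  # approaches x_star
```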
Advanced adaptive learning rates
Demo time
wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization

23 of 27
Advanced adaptive learning rates
Learn the learning rates
idea 3: learn the (global) learning rates as well

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \mathbf{g}_t \quad \Rightarrow \quad \mathbf{w}_t = \mathbf{w}(t, \eta, \mathbf{w}_1, \{\mathbf{g}_i\}_{i \le t-1})$$

$\mathbf{w}_t$ is a function of $\eta$, so learn $\eta$ (with gradient descent) by

$$\eta = \operatorname*{argmin}_{\eta} \sum_{s \le t} f_s(\mathbf{w}(s, \eta))$$

reverse-mode differentiation to back-track the learning process [Maclaurin et al 2015]

stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier 2015] — see the sketch below

24 of 27
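A rough sketch of the 'on the fly' flavour (inspired by, but not taken from, [Masse and Ollivier 2015]; every constant here is an assumption): propagate $u_t = \partial \mathbf{w}_t / \partial \eta$ forward through the SGD recursion and take hypergradient steps on $\eta$.

```python
import numpy as np

rng = np.random.default_rng(11)

# w_{t+1} = w_t - eta * g_t, with f_t(w) = 0.5*(w - x_t)^2, so
# u_{t+1} = u_t - g_t - eta * u_t and df_t/d(eta) = g_t * u_t.
eta, alpha = 0.01, 1e-4
w, u = 0.0, 0.0                            # u tracks dw/d(eta)
for t in range(5000):
    x_t = rng.normal(loc=2.0)
    g = w - x_t                            # gradient of the round-t loss
    eta = float(np.clip(eta - alpha * g * u, 1e-5, 0.5))  # hypergradient step
    u = u - g - eta * u                    # derivative of the update rule
    w = w - eta * g

print(eta, w)  # adapted step size; w tracks the stream mean (~2.0)
```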
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochastic approximation and optimisation methods.

Theoretical analysis guarantees good behaviour.

Adaptive learning rates provide good performance in practice.

Future work will reduce the labour spent on tuning hyper-parameters.

25 of 27
Advanced adaptive learning rates
References
Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.

Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.

Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

26 of 27
Advanced adaptive learning rates
References
Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

27 of 27
Advanced adaptive learning rates
References

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.

Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based Hyperparameter Optimization through Reversible Learning. arXiv preprint arXiv:1502.03492, 2015.

Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

28 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient Descent
An intuitive, straightforward discretisation of
$\frac{d}{dt} s(t) = -\nabla C(s(t))$
is
$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n) \qquad \text{(Forward Euler)}$
Although it is easy and quick to implement, it has theoretical (A-)instability issues.
An alternative method, which is (A-)stable, is the implicit Euler method:
$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})$
So why settle for explicit discretisation? Implicit discretisation requires the solution of a non-linear system at each step, and in practice explicit discretisation works.
16 of 27
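To make the stability contrast concrete, here is a minimal sketch (ours, not from the slides) on the quadratic $C(s) = \frac{1}{2}\lambda s^2$, for which the implicit step has a closed form; the values of `lam` and `eps` are purely illustrative.

```python
# Explicit vs implicit Euler for ds/dt = -grad C(s) with C(s) = 0.5 * lam * s**2,
# so grad C(s) = lam * s. For this C the implicit equation
# s_next = s - eps * lam * s_next has the closed form s_next = s / (1 + eps * lam).
lam, eps = 10.0, 0.25          # eps * lam = 2.5 > 2: forward Euler is unstable here
s_explicit, s_implicit = 1.0, 1.0

for n in range(20):
    s_explicit = s_explicit - eps * lam * s_explicit   # forward Euler: factor (1 - eps*lam)
    s_implicit = s_implicit / (1.0 + eps * lam)        # backward Euler: always contracts

print(s_explicit, s_implicit)  # explicit iterate has blown up; implicit is near 0
```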
Discrete Gradient Descent
Also have multistep methods, e.g.
$s_{n+2} = s_{n+1} - \varepsilon_{n+2}\left(\tfrac{3}{2}\nabla C(s_{n+1}) - \tfrac{1}{2}\nabla C(s_n)\right)$
This is the 2nd-order Adams-Bashforth method (written here in the descent direction, with the right-hand side $-\nabla C$).
Similar (in)stability properties to forward Euler, but discretisation error of a smaller order of magnitude at each step ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).
17 of 27
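As a concrete illustration (ours), the two-step iteration above on a toy quadratic, bootstrapping the first step with forward Euler; `grad_C` and the step size are assumptions for the example.

```python
import numpy as np

def grad_C(s):
    # toy quadratic cost C(s) = 0.5 * ||s||^2, so grad C(s) = s
    return s

eps = 0.05
s = [np.array([1.0, -1.0])]
s.append(s[0] - eps * grad_C(s[0]))  # bootstrap: one forward Euler step

# 2nd-order Adams-Bashforth steps for ds/dt = -grad C(s)
for n in range(100):
    s.append(s[-1] - eps * (1.5 * grad_C(s[-1]) - 0.5 * grad_C(s[-2])))

print(s[-1])  # close to the minimiser at 0
```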
Discrete Gradient Descent
Examples of discretisation with ε = 0.1
18 of 27
Discrete Gradient Descent
Examples of discretisation with ε = 0.07
19 of 27
Discrete Gradient Descent
Examples of discretisation with ε = 0.01
20 of 27
Discrete Gradient Descent
Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies
1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x)\rangle > 0$
3. $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$
then, subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.
Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
21 of 27
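A minimal sketch (ours) of the proposition's setting: deterministic gradient descent with the Robbins-Monro step sizes $\varepsilon_n = 1/n$ on a toy cost satisfying conditions 1-3; `x_star` and the schedule are illustrative choices.

```python
import numpy as np

# Gradient descent with eps_n = 1/n (sum eps_n = inf, sum eps_n^2 < inf) on
# C(s) = 0.5 * ||s - x_star||^2, whose unique minimiser is x_star.
x_star = np.array([2.0, -3.0])
s = np.zeros(2)

for n in range(1, 10001):
    eps_n = 1.0 / n
    grad = s - x_star            # grad C(s)
    s = s - eps_n * grad

print(np.linalg.norm(s - x_star))  # -> 0 as n grows
```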
Discrete Gradient Descent
1. Define the Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^*$
22 of 27
Discrete Gradient Descent
Step 1: Define the Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.
Because of the error that the discretisation introduced, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.
22 of 27
Discrete Gradient Descent
Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$.
(Illustration credit: Online Learning and Stochastic Approximation, Léon Bottou, 1998.)
22 of 27
Discrete Gradient Descent
Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
With some algebra, note that
$h_{n+1} - h_n = \langle s_{n+1} - x^*, s_{n+1} - x^*\rangle - \langle s_n - x^*, s_n - x^*\rangle$
$= \langle s_{n+1}, s_{n+1}\rangle - \langle s_n, s_n\rangle - 2\,\langle s_{n+1} - s_n, x^*\rangle$
$= \langle s_n - \varepsilon_n \nabla C(s_n),\, s_n - \varepsilon_n \nabla C(s_n)\rangle - \langle s_n, s_n\rangle + 2\varepsilon_n \langle \nabla C(s_n), x^*\rangle$
$= -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$
22 of 27
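A quick numerical sanity check of this expansion (our illustration; `grad` stands for $\nabla C(s_n)$ and the random values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
s_n, x_star, grad = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
eps = 0.1

s_next = s_n - eps * grad
lhs = np.sum((s_next - x_star)**2) - np.sum((s_n - x_star)**2)
rhs = -2 * eps * np.dot(s_n - x_star, grad) + eps**2 * np.sum(grad**2)
assert np.isclose(lhs, rhs)  # the algebraic identity above holds exactly
```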
Discrete Gradient Descent
Step 3: Show that $\sum_{n=1}^\infty h_n^+ < \infty$.
Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get
$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 (A + B h_n)$
$\Rightarrow\ h_{n+1} - (1 + \varepsilon_n^2 B)\, h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$
22 of 27
Discrete Gradient Descent
From the previous step, $h_{n+1} - (1 + \varepsilon_n^2 B)\, h_n \le \varepsilon_n^2 A$.
Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get
$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A$
Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable, and hence $\sum_{n=1}^\infty h_n^+ < \infty$.
22 of 27
Discrete Gradient Descent
Step 4: Show that this implies that $h_n$ converges.
If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$ then, since $h_n \ge 0$, also $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why?).
But $h_{n+1} = h_1 + \sum_{k=1}^n \big(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\big)$.
So $(h_n)_{n=1}^\infty$ converges.
22 of 27
Discrete Gradient Descent
Step 5: Show that $h_n$ must converge to 0.
Assume $h_n$ converges to some positive number. Previously we had
$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$
This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get
$\langle s_n - x^*, \nabla C(s_n)\rangle \to 0$,
contradicting $h_n$ converging away from 0.
22 of 27
Stochastic Gradient Descent
We're now ready to introduce the stochastic approximation of the gradient into our algorithm.
If $C$ is a cost function averaged across a data set, we often have a true gradient of the form
$\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x)$
and an approximation is formed by subsampling:
$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$
We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.
The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
23 of 27
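A small numpy sketch (ours) of this subsampled estimator, checking unbiasedness on a toy least-squares cost; `f_n`, `data`, and the sizes are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D = 1000, 32, 5
data = rng.normal(size=(N, D))

def f_n(x, n):
    # per-datapoint gradient term of the toy cost 0.5 * ||x - data[n]||^2
    return x - data[n]

def full_grad(x):
    return np.mean([f_n(x, n) for n in range(N)], axis=0)

def stochastic_grad(x):
    idx = rng.choice(N, size=K, replace=False)   # I_K ~ Unif(subsets of size K)
    return np.mean([f_n(x, n) for n in idx], axis=0)

x = np.zeros(D)
# Averaging many independent estimates recovers the full gradient (unbiasedness).
est = np.mean([stochastic_grad(x) for _ in range(2000)], axis=0)
print(np.linalg.norm(est - full_grad(x)))  # small
```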
Measure-theory-free martingale theory
Let $(X_n)_{n \ge 0}$ be a stochastic process.
$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.
Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if
- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$
Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.
Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and
$\sum_{n=1}^\infty \mathbb{E}\big[(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\,\mathbf{1}_{\{\mathbb{E}[X_{n+1}|\mathcal{F}_n] - X_n > 0\}}\big] < \infty$
then $X_n \to X_\infty$ almost surely.
³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.
24 of 27
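These convergence theorems are easy to see in simulation. A minimal sketch (our example, not from the slides): the fraction of red balls in a Pólya urn is a bounded martingale, so Doob's theorem says each sample path settles down to a (random) limit.

```python
import numpy as np

# Polya urn: draw a ball uniformly, return it plus one more of the same colour.
# The red fraction X_n satisfies E[X_{n+1} | F_n] = X_n and 0 <= X_n <= 1.
rng = np.random.default_rng(0)
red, black = 1, 1
path = []
for n in range(10000):
    if rng.random() < red / (red + black):
        red += 1
    else:
        black += 1
    path.append(red / (red + black))

print(path[-1], path[-2])  # consecutive values nearly equal: the path has converged
```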
Stochastic Gradient Descent
Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies
1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x)\rangle > 0$
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$
then, subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.
Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
25 of 27
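A minimal simulation (ours) of the proposition's setting: a toy quadratic cost with Gaussian gradient noise and $\varepsilon_n = 1/n$; the noise scale and `x_star` are illustrative.

```python
import numpy as np

# SGD with unbiased gradient estimates H_n(s) = grad C(s) + noise and
# Robbins-Monro steps eps_n = 1/n, on C(s) = 0.5 * ||s - x_star||^2.
rng = np.random.default_rng(0)
x_star = np.array([1.0, -2.0])
s = np.zeros(2)

for n in range(1, 100001):
    H_n = (s - x_star) + rng.normal(scale=1.0, size=2)  # unbiased estimate of grad C(s)
    s = s - (1.0 / n) * H_n

print(np.linalg.norm(s - x_star))  # small: s_n -> x_star almost surely
```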
Stochastic Gradient Descent
1. Define the Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$
26 of 27
Stochastic Gradient Descent
Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
By exactly the same calculation as in the deterministic discrete case,
$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n)\rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$
So
$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\, \mathbb{E}\big[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\big]$
26 of 27
Stochastic Gradient Descent
Step 3: Show that $h_n$ converges almost surely.
Exactly the same as for discrete gradient descent:
- Assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$
Get
$\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A$
$\Rightarrow\ \mathbb{E}\big[(h'_{n+1} - h'_n)\,\mathbf{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}}\big] \le \varepsilon_n^2 \mu_n A$
$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\Rightarrow$ quasimartingale convergence $\Rightarrow$ $(h'_n)_{n=1}^\infty$ converges a.s. $\Rightarrow$ $(h_n)_{n=1}^\infty$ converges a.s.
26 of 27
Stochastic Gradient Descent
Step 4: Show that $h_n$ must converge to 0 almost surely.
From the previous calculations,
$\mathbb{E}\big[h_{n+1} - (1 + \varepsilon_n^2 B)\, h_n \mid \mathcal{F}_n\big] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A$
$(h_n)_{n=1}^\infty$ converges, and we assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the right-hand term is summable a.s.; hence the left-hand side is also summable a.s., giving
$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle < \infty$ almost surely.
If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces
$\langle s_n - x^*, \nabla C(s_n)\rangle \to 0$ almost surely,
and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume
- $C$ is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- the gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0$
The idea for the proof is then:
1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.
Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
27 of 27
Motivation
So far we've told you why SGD is "safe" :)
but Robbins-Monro is just a sufficient condition.
So how do we choose learning rates to achieve
- fast convergence?
- a better local optimum?
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont.)
Figure from [Bergstra and Bengio, 2012]
idea 1: search for the best learning rate schedule:
grid search, random search, cross validation, Bayesian optimization
But I'm lazy and I don't want to spend too much time!
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont.)
idea 2: let the learning rates adapt themselves
- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)
now we'll talk a bit about online learning:
- a good tutorial [Shalev-Shwartz, 2012]
- milestone paper [Zinkevich, 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function $C(w; \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$
SGD/mini-batch learning accelerates training in real time by
- processing one datapoint or a mini-batch of datapoints each iteration
- considering gradients with per-datapoint cost functions:
$\nabla C(w; \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont.)
Let's forget the cost function on the batch for a moment.
Online gradient descent (OGD):
- each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$
For SGD the received gradient is defined by
$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
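A minimal sketch (ours) of this online loop, with quadratic losses standing in for whatever the environment sends at each round; `eta` and `targets` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, w = 0.1, np.zeros(3)
targets = rng.normal(size=(100, 3))

for t in range(100):
    # round t's loss is f_t(w) = 0.5 * ||w - targets[t]||^2, so g_t = w - targets[t]
    g_t = w - targets[t]
    w = w - eta * g_t            # OGD update: w <- w - eta * g_t

print(w)  # a running compromise between the received losses
```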
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont.)
Online learning assumes:
- each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules
Performance is measured by regret:
$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \qquad (1)$
General definition: $R_T(u) = \sum_{t=1}^T \big[f_t(w_t) - f_t(u)\big]$
SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont.)
Follow the Leader (FTL):
$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) \qquad (2)$
Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,
$R_T(u) = \sum_{t=1}^T \big[f_t(w_t) - f_t(u)\big] \le \sum_{t=1}^T \big[f_t(w_t) - f_t(w_{t+1})\big] \qquad (3)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont.)
A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be
$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$
FTL is easily fooled!
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
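A small simulation (ours) makes the failure concrete: the alternating losses push FTL's cumulative-loss minimiser to the wrong endpoint every round, so FTL pays about 1 per round while the fixed comparator $u = 0$ pays 0 (linear regret).

```python
def loss_coeff(t):
    # f_t(w) = c_t * w with the coefficients from the game above
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T, cum, ftl_loss = 100, 0.0, 0.0
for t in range(1, T + 1):
    w_t = -1.0 if cum > 0 else 1.0      # FTL: argmin of cum * w over [-1, 1]
    ftl_loss += loss_coeff(t) * w_t
    cum += loss_coeff(t)

print(ftl_loss)  # ~T - 1.5: grows linearly, whereas playing w = 0 gives 0 loss
```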
Online learning
From batch to online learning (cont.)
Follow the Regularized Leader (FTRL):
$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$
Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,
$\sum_{t=1}^T \big[f_t(w_t) - f_t(u)\big] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \big[f_t(w_t) - f_t(w_{t+1})\big]$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont.)
Let's assume the losses $f_t$ are convex functions:
$\forall w, u \in S,\quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle \quad \forall g(w) \in \partial f_t(w)$
Use the linearisation $\tilde f_t(w) = f_t(w_t) + \langle w, g_t\rangle$, $g_t \in \partial f_t(w_t)$, as a surrogate:
$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t\rangle - \langle u, g_t\rangle \quad \forall g_t \in \partial f_t(w_t)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont.)
Online Gradient Descent (OGD):
$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$
From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (it upper bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:
$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i\rangle + \varphi(w)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont.)
Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,
$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t)$
Keep in mind for the later discussion of adaGrad:
if we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$, then the L2 norm above is replaced by the dual norm $\|\cdot\|_*$, and Hölder's inequality is used in the proof.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al., 2010]
Use proximal terms $\psi_t$:
$\psi_t$ is a strongly-convex function, and
$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle$
Primal-dual sub-gradient update:
$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \qquad (6)$
Proximal gradient/composite mirror descent:
$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al., 2010] (cont.)
adaGrad with diagonal matrices:
$G_t = \sum_{i=1}^t g_i g_i^T, \quad H_t = \delta I + \mathrm{diag}(G_t)^{1/2}$
$w_{t+1} = \arg\min_{w \in S}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t}\, w^T H_t w \qquad (8)$
$w_{t+1} = \arg\min_{w \in S}\ \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \qquad (9)$
To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:
$\psi_t(w) = \frac{1}{2}\, w^T H_t w$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al., 2010] (cont.)
Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:
$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T \big[\delta I + \mathrm{diag}(G_t)^{1/2}\big](w - w_t)$
$\Rightarrow\ w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \qquad (10)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
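A minimal numpy sketch (ours) of the per-coordinate update (10) on a toy noisy quadratic; `eta`, `delta`, and the noise scale are illustrative values, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, delta = 0.5, 1e-8
w, G = np.zeros(5), np.zeros(5)
w_star = rng.normal(size=5)

for t in range(500):
    g = w - w_star + 0.1 * rng.normal(size=5)   # noisy gradient of 0.5*||w - w_star||^2
    G += g * g                                  # diagonal of the running sum G_t
    w -= eta * g / (delta + np.sqrt(G))         # per-coordinate step, eq. (10)

print(np.linalg.norm(w - w_star))  # small
```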
Advanced adaptive learning rates
adaGrad [Duchi et al., 2010] (cont.)
WHY?
Time-varying learning rate vs time-varying proximal term:
- we want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot,\, H_t\, \cdot\rangle}$
- $\psi_t(w) = \frac{1}{2}\, w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that
$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al., 2010] (cont.)
Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,
$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$
For $w_t$ generated using the composite mirror descent update (9),
$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
Possible problems of adaGrad:
- the global learning rate η still needs hand tuning
- sensitive to initial conditions
Possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average
Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont.)
RMSprop:
$G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^T, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$
$w_{t+1} = \arg\min_w\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$
$\Rightarrow\ w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \quad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$
Improving direction used here: a running average in place of adaGrad's growing sum.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont.)
$\Delta w \propto H^{-1} g \propto \dfrac{\partial f/\partial w}{\partial^2 f/\partial w^2}
\ \Rightarrow\ \dfrac{\partial^2 f}{\partial w^2} \propto \dfrac{\partial f/\partial w}{\Delta w}
\ \Rightarrow\ \mathrm{diag}(H) \approx \dfrac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$
Improving direction used here: (approximate) second-order information.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont.)
adaDelta:
$\mathrm{diag}(H_t) = \dfrac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$
$w_{t+1} = \arg\min_w\ \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$
$\Rightarrow\ w_{t+1} = w_t - \dfrac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont.)
ADAM:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$
$\hat m_t = \dfrac{m_t}{1 - \beta_1^t}, \quad \hat v_t = \dfrac{v_t}{1 - \beta_2^t}$
$w_{t+1} = w_t - \dfrac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \delta} \qquad (13)$
Improving directions used here: momentum and bias-correction of the running averages.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
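A side-by-side sketch (ours) of the update rules (11)-(13) on the same toy noisy quadratic; the hyper-parameter values are common defaults chosen for illustration, not prescriptions from the papers.

```python
import numpy as np

rng = np.random.default_rng(0)
D, eta, delta, rho, b1, b2 = 5, 0.05, 1e-8, 0.9, 0.9, 0.999
w_star = rng.normal(size=D)
w_rms, w_ada, w_adam = np.zeros(D), np.zeros(D), np.zeros(D)
G = np.zeros(D)                       # RMSprop running average of g^2
Eg2, Edw2 = np.zeros(D), np.zeros(D)  # adaDelta running averages of g^2 and dw^2
m, v = np.zeros(D), np.zeros(D)       # ADAM first and second moments

def grad(w):
    # noisy gradient of the toy cost 0.5 * ||w - w_star||^2
    return w - w_star + 0.1 * rng.normal(size=D)

for t in range(1, 2001):
    # RMSprop, eq. (11)
    g = grad(w_rms)
    G = rho * G + (1 - rho) * g * g
    w_rms = w_rms - eta * g / np.sqrt(delta + G)

    # adaDelta, eq. (12): step scaled by the RMS of previous steps
    g = grad(w_ada)
    Eg2 = rho * Eg2 + (1 - rho) * g * g
    dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
    Edw2 = rho * Edw2 + (1 - rho) * dw * dw
    w_ada = w_ada + dw

    # ADAM, eq. (13): bias-corrected running averages
    g = grad(w_adam)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    w_adam = w_adam - eta * m_hat / (np.sqrt(v_hat) + delta)

print([np.linalg.norm(x - w_star) for x in (w_rms, w_ada, w_adam)])
```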
Advanced adaptive learning rates
Demo time!
Wikipedia page "Test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3: learn the (global) learning rates as well
$w_{t+1} = w_t - \eta g_t \ \Rightarrow\ w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1})$
$w_t$ is a function of $\eta$; learn $\eta$ (with gradient descent) by
$\eta^* = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))$
- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Continuous Gradient Descent
So gradient descent finds the global minimum for functions in the classstated in the theorem
Solving general ODEs analytically is extremely difficult at best andusually impossible
To implement this strategy algorithmically to solve a minimisationproblem wersquoll need a method of solving the ODE numerically
There is no one best way to do this
15 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Demo time

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

idea 3: learn the (global) learning rates as well

$$w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1})$$

$w_t$ is a function of η, so we can learn η (with gradient descent) by

$$\eta^* = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))$$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al 2015]
- stochastic gradient descent to learn η on the fly [Masse and Ollivier 2015]

(a toy sketch of this idea follows below)
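The cited papers differentiate through the whole learning run; as a toy stand-in, this sketch estimates dL/dη by central finite differences on the unrolled trajectory. Every name and constant here is made up for illustration, and the loss is taken to be the same function f at every step.

```python
import numpy as np

def trajectory_loss(f, grad_f, w1, eta, T):
    """L(eta) = sum_{s <= T} f(w(s, eta)), unrolling SGD from w1."""
    w, total = np.asarray(w1, dtype=float).copy(), 0.0
    for _ in range(T):
        total += f(w)
        w = w - eta * grad_f(w)
    return total

def learn_eta(f, grad_f, w1, eta0=0.1, T=50, outer_steps=20, h=1e-4, lr=1e-3):
    """Gradient descent on eta itself, with dL/deta from finite differences."""
    eta = eta0
    for _ in range(outer_steps):
        dL = (trajectory_loss(f, grad_f, w1, eta + h, T)
              - trajectory_loss(f, grad_f, w1, eta - h, T)) / (2 * h)
        eta -= lr * dL
    return eta
```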
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.
References

Bergstra J, Bengio Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker S, LeCun Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler M D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman T, Hinton G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma D, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin Y N, de Vries H, Chung J, et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.
Shalev-Shwartz S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin D, Duvenaud D, Adams R P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse P Y, Ollivier Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Discrete Gradient Descent

An intuitive, straightforward discretisation of

$$\frac{d}{dt} s(t) = -\nabla C(s(t))$$

is

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n) \qquad \text{(Forward Euler)}$$

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method which is (A-)stable is the implicit Euler method:

$$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})$$

So why settle for explicit discretisation? The implicit method requires the solution of a non-linear system at each step, and in practice explicit discretisation works.
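Both discretisations in a small NumPy sketch; solving the implicit step by naive fixed-point iteration is just one simple choice for the non-linear solve, assumed here for illustration.

```python
import numpy as np

def forward_euler(grad_C, s0, eps=0.1, n_steps=100):
    """Explicit (forward) Euler on ds/dt = -grad C(s): exactly gradient descent."""
    s = np.asarray(s0, dtype=float).copy()
    for _ in range(n_steps):
        s = s - eps * grad_C(s)
    return s

def implicit_euler(grad_C, s0, eps=0.1, n_steps=100, inner_iters=50):
    """Implicit (backward) Euler: each step solves s' = s - eps * grad_C(s'),
    here by fixed-point iteration (converges for small enough eps)."""
    s = np.asarray(s0, dtype=float).copy()
    for _ in range(n_steps):
        s_next = s.copy()
        for _ in range(inner_iters):
            s_next = s - eps * grad_C(s_next)
        s = s_next
    return s
```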
Discrete Gradient Descent

Also have multistep methods, e.g.

$$s_{n+2} = s_{n+1} - \varepsilon_{n+2}\left( \tfrac{3}{2}\nabla C(s_{n+1}) - \tfrac{1}{2}\nabla C(s_n) \right)$$

This is the 2nd-order Adams-Bashforth method.

Similar (in)stability properties to forward Euler, but discretisation error of a smaller order of magnitude at each step (O(ε³) compared with O(ε²)).
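A sketch of the two-step scheme; bootstrapping the first step with one forward Euler step is an assumption of this example, not something the slide specifies.

```python
import numpy as np

def adams_bashforth2(grad_C, s0, eps=0.1, n_steps=100):
    """2nd-order Adams-Bashforth on ds/dt = -grad C(s), constant step size."""
    s_prev = np.asarray(s0, dtype=float).copy()
    s = s_prev - eps * grad_C(s_prev)          # bootstrap with forward Euler
    for _ in range(n_steps - 1):
        s, s_prev = s - eps * (1.5 * grad_C(s) - 0.5 * grad_C(s_prev)), s
    return s
```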
Discrete Gradient Descent

Examples of discretisation with ε = 0.1, ε = 0.07, and ε = 0.01. [embedded animations, not reproduced here]
Discrete Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and C satisfies

1. C has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x)\rangle > 0$
3. $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
Discrete Gradient Descent

1. Define the Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^*$
Step 1: Define the Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.

Because of the error that the discretisation introduces, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.
Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. [Illustration credit: Online Learning and Stochastic Approximation, Léon Bottou (1998)]
Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. With some algebra, note that

$$\begin{aligned}
h_{n+1} - h_n &= \langle s_{n+1} - x^*, s_{n+1} - x^*\rangle - \langle s_n - x^*, s_n - x^*\rangle \\
&= \langle s_{n+1}, s_{n+1}\rangle - \langle s_n, s_n\rangle - 2\,\langle s_{n+1} - s_n, x^*\rangle \\
&= \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n)\rangle - \langle s_n, s_n\rangle + 2\varepsilon_n \langle \nabla C(s_n), x^*\rangle \\
&= -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2
\end{aligned}$$
Step 3: Show that $\sum_{n=1}^\infty h_n^+ < \infty$. Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get

$$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 (A + B h_n)$$

$$\Rightarrow\; h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$$
Step 3 (cont): From $h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le \varepsilon_n^2 A$, writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get

$$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A$$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable, and hence so are the positive variations $h_n^+$.
Step 4: Show that this implies that $h_n$ converges.

If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why? because $h_n \ge 0$, the cumulative 'down' movements cannot exceed $h_0$ plus the cumulative 'up' movements).

But $h_{n+1} = h_0 + \sum_{k=0}^{n} \big( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \big)$.

So $(h_n)_{n=1}^\infty$ converges.
Step 5: Show that $h_n$ must converge to 0. Assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$$\langle s_n - x^*, \nabla C(s_n)\rangle \to 0$$

contradicting (by condition 2) $h_n$ converging away from 0.
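A quick numerical check of the proposition on a strongly convex quadratic, using the Robbins-Monro schedule $\varepsilon_n = 1/n$ (so $\sum \varepsilon_n = \infty$ and $\sum \varepsilon_n^2 < \infty$); the test function is an assumed toy example, not one from the slides.

```python
import numpy as np

def decaying_gd(grad_C, s0, n_steps=100000):
    """Deterministic gradient descent with eps_n = 1/n."""
    s = np.asarray(s0, dtype=float).copy()
    for n in range(1, n_steps + 1):
        s = s - (1.0 / n) * grad_C(s)
    return s

# C(x) = 0.25 * ||x||^2 has unique minimiser x* = 0 and satisfies conditions 1-3
print(decaying_gd(lambda x: 0.5 * x, np.array([5.0, -3.0])))  # slowly approaches [0, 0]
```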
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If C is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x)$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
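A sketch of the subsampled estimator and the resulting SGD loop; `grad_fns` is an assumed list of the per-datapoint gradient terms $f_n$ from the display above.

```python
import numpy as np

def subsampled_grad(grad_fns, x, K, rng):
    """Unbiased estimate of grad C(x) = (1/N) sum_n f_n(x), via a
    uniformly random size-K subset I_K."""
    idx = rng.choice(len(grad_fns), size=K, replace=False)
    return sum(grad_fns[k](x) for k in idx) / K

def sgd(grad_fns, x0, n_steps=10000, K=10, seed=0):
    """Robbins-Monro SGD with eps_n = 1/n."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for n in range(1, n_steps + 1):
        x = x - (1.0 / n) * subsampled_grad(grad_fns, x, K, rng)
    return x
```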
Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process. $F_n$ denotes "the information describing the stochastic process up to time n", denoted mathematically by $F_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if
- $E[|X_n|] < \infty$ for all n
- $E[X_n \mid F_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n E[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty E\big[ (E[X_{n+1} \mid F_n] - X_n)\, \mathbf{1}_{E[X_{n+1} \mid F_n] - X_n > 0} \big] < \infty$$

then $X_n \to X_\infty$ almost surely.

³ on a filtered probability space $(\Omega, \mathcal{F}, (F_n)_{n=0}^\infty, P)$ with $F_n = \sigma(X_m \mid m \le n)\;\forall n$
Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and C satisfies

1. C has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x)\rangle > 0$
3. $E[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of n

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
Stochastic Gradient Descent

1. Define the Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$
Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^*, H_n(s_n)\rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$$

So

$$E[h_{n+1} - h_n \mid F_n] = -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 E\big[\|H_n(s_n)\|_2^2 \,\big|\, F_n\big]$$
Step 3: Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:

- assume $E[\|H_n(x)\|_2^2] \le A + B\|x - x^*\|_2^2$
- introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

Get

$$E\big[h'_{n+1} - h'_n \,\big|\, F_n\big] \le \varepsilon_n^2 \mu_n A \;\Rightarrow\; E\big[(h'_{n+1} - h'_n)\, \mathbf{1}_{E[h'_{n+1} - h'_n \mid F_n] > 0} \,\big|\, F_n\big] \le \varepsilon_n^2 \mu_n A$$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ ⇒ quasimartingale convergence ⇒ $(h'_n)_{n=1}^\infty$ converges a.s. ⇒ $(h_n)_{n=1}^\infty$ converges a.s.
Step 4: Show that $h_n$ must converge to 0 almost surely. From previous calculations,

$$E\big[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \,\big|\, F_n\big] \le -2\varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A$$

$(h_n)_{n=1}^\infty$ converges, so the left side is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the right term is summable a.s., and hence the remaining term is also summable a.s.:

$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n)\rangle < \infty \quad \text{almost surely}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces $\langle s_n - x^*, \nabla C(s_n)\rangle \to 0$ almost surely, and by our initial assumptions about C, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.
Relaxing the pseudo-convexity conditions

Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- C is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- the gradient estimate H satisfies $E[\|H_n(x)\|^k] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.
Motivation
So far we've told you why SGD is "safe" :) but Robbins-Monro is just a sufficient condition. So how do we choose learning rates to achieve

- fast convergence
- a better local optimum?
Motivation (cont)

[Figure from [Bergstra and Bengio 2012], not reproduced here]

idea 1: search for the best learning rate schedule
- grid search, random search, cross-validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time on this.
Motivation (cont)

idea 2: let the learning rates adapt themselves
- pre-conditioning with |diag(H)| [Becker and LeCun 1988]
- adaGrad [Duchi et al 2010]
- adaDelta [Zeiler 2012], RMSprop [Tieleman and Hinton 2012]
- ADAM [Kingma and Ba 2014]
- Equilibrated SGD [Dauphin et al 2015] (this year's NIPS)

now we'll talk a bit about online learning
- a good tutorial: [Shalev-Shwartz 2012]
- milestone paper: [Zinkevich 2003]

and some practical comparisons
Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset D
- the cost function $C(w, D) = E_{x \sim D}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by processing one datapoint / a mini-batch of datapoints each iteration, considering gradients with data-point cost functions:

$$\nabla C(w, D) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$$
From batch to online learning (cont)

Let's forget the cost function on the batch for a moment.

online gradient descent (OGD):
- each iteration t we receive a loss $f_t(w)$
- and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

for SGD the received gradient is defined by

$$g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$$
From batch to online learning (cont)

Online learning assumes:
- each iteration t we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$$R_T^* = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w) \qquad (1)$$

General definition: $R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)]$

SGD ⇔ "follow the regularized leader" with L2 regularization
From batch to online learning (cont)

Follow the Leader (FTL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) \qquad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \dots$ be the sequence produced by FTL; then $\forall u \in S$,

$$R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})] \qquad (3)$$
From batch to online learning (cont)

A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time t be

$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled! (see the simulation sketch below)
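A tiny simulation of this game. The reconstruction below assumes the losses are linear, $f_t(w) = a_t w$, so FTL plays an endpoint of $[-1, 1]$ each round; taking $w_1 = 0$ as the arbitrary first play is also an assumption.

```python
def ftl_game(T=10):
    """FTL on S = [-1, 1] with a_1 = -0.5, then a_t = +1 (t even), -1 (t odd > 1).
    FTL's cumulative loss grows like T, while the fixed point u = 0 suffers 0."""
    a = [-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, T + 1)]
    w, ftl_loss = 0.0, 0.0              # arbitrary first play w_1 = 0
    cum = 0.0                           # cumulative coefficient sum_i a_i
    for a_t in a:
        ftl_loss += a_t * w             # suffer f_t(w_t) = a_t * w_t
        cum += a_t
        w = -1.0 if cum > 0 else 1.0    # FTL: argmin over [-1, 1] of cum * w
    return ftl_loss

print(ftl_game(10))                     # -> 9.0: fooled on every round after the first
```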
From batch to online learning (cont)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \dots$ be the sequence produced by FTRL; then $\forall u \in S$,

$$\sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})]$$
From batch to online learning (cont)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle, \quad \forall g(w) \in \partial f_t(w)$$

use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t)\rangle$ as a surrogate:

$$\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t\rangle - \langle u, g_t\rangle, \quad \forall g_t \in \partial f_t(w_t)$$
From batch to online learning (cont)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:
- use the linearisation as surrogate loss (it upper-bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} \langle w, g_i\rangle + \varphi(w)$$
From batch to online learning (cont)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \dots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,

$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$$
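A sketch of OGD as used in the bound; the projection back onto S after each step is an assumption added so the iterates stay feasible (update (5) on the slide leaves it implicit), with $S = [-1, 1]^d$ as the default example.

```python
import numpy as np

def ogd(subgrad_fns, w1, eta, project=lambda w: np.clip(w, -1.0, 1.0)):
    """Online gradient descent (5): one subgradient step per incoming loss.
    subgrad_fns is the stream of maps w -> g_t, e.g. from f_t(w) = a_t * w."""
    w = np.asarray(w1, dtype=float).copy()
    iterates = [w.copy()]
    for g_t in subgrad_fns:
        w = project(w - eta * g_t(w))
        iterates.append(w.copy())
    return iterates
```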
Keep in mind for the later discussion of adaGrad: we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$; then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_*$, and Hölder's inequality is used in the proof.
Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle$$

Primal-dual sub-gradient update:

$$w_{t+1} = \arg\min_w\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \arg\min_w\; \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$
adaGrad [Duchi et al 2010] (cont)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^{t} g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$w_{t+1} = \arg\min_{w \in S}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t} w^T H_t w \qquad (8)$$

$$w_{t+1} = \arg\min_{w \in S}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \tfrac{1}{2}(w - w_t)^T H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time: $\psi_t(w) = \tfrac{1}{2} w^T H_t w$
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient DescentAn intuitive straightforward discretisation of
d
dts(t) = minusnablaC (s(t))
issn+1 minus sn
εn= minusnablaC (sn) (ForwardEuler)
Although it is easy and quick to implement it has theoretical(A-)instability issues
An alternative method which is (A-)stable is the implicit Euler method
sn+1 minus snεn
= minusnablaC (sn+1)
So why settle for explicit discretisation Implicit requires solution ofnon-linear system at each step and practically explicit discretisationworks
16 of 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm. 5 in the paper)

Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

$R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$

For $\{w_t\}$ generated using the composite mirror descent update (9),

$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad:
I the global learning rate η still needs hand-tuning
I sensitive to initial conditions

possible improving directions:
I use a truncated sum / running average
I use (approximate) second-order information
I add momentum
I correct the bias of the running average

existing methods:
I RMSprop [Tieleman and Hinton, 2012]
I adaDelta [Zeiler, 2012]
I ADAM [Kingma and Ba, 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop:

$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^\top, \quad H_t = (\delta I + \operatorname{diag}(G_t))^{1/2}$

$w_{t+1} = \operatorname{argmin}_{w} \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^\top H_t (w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta g_t}{\text{RMS}[g]_t}, \quad \text{RMS}[g]_t = \operatorname{diag}(H_t) \quad (11)$

possible improving directions:
I use a truncated sum / running average
I use (approximate) second-order information
I add momentum
I correct the bias of the running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
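A minimal NumPy sketch of the RMSprop update (11); the hyper-parameter values are assumptions (ρ ≈ 0.9 is the usual default), and only the diagonal of $G_t$ is stored.

import numpy as np

def rmsprop(grad, w0, eta=1e-2, rho=0.9, delta=1e-8, steps=1000):
    # RMSprop (eq. 11): replace adaGrad's growing sum with an exponential
    # running average of squared gradients, G_t = ρ G_{t−1} + (1−ρ) g_t²
    w = w0.astype(float).copy()
    G = np.zeros_like(w)                      # diag of G_t
    for _ in range(steps):
        g = grad(w)
        G = rho * G + (1.0 - rho) * g * g
        w -= eta * g / np.sqrt(delta + G)     # divide by RMS[g]_t
    return w

grad = lambda w: np.array([1.0, 100.0]) * w   # same badly scaled quadratic
print(rmsprop(grad, np.array([1.0, 1.0])))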
Advanced adaptive learning rates
Improving adaGrad (cont)
$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}$

$\Rightarrow \quad \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}$

$\Rightarrow \quad \operatorname{diag}(H) \approx \frac{\text{RMS}[g]_t}{\text{RMS}[\Delta w]_{t-1}}$

possible improving directions:
I use a truncated sum / running average
I use (approximate) second-order information
I add momentum
I correct the bias of the running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta:

$\operatorname{diag}(H_t) = \frac{\text{RMS}[g]_t}{\text{RMS}[\Delta w]_{t-1}}$

$w_{t+1} = \operatorname{argmin}_{w} \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^\top H_t (w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\text{RMS}[\Delta w]_{t-1}}{\text{RMS}[g]_t} \, g_t \quad (12)$

possible improving directions:
I use a truncated sum / running average
I use (approximate) second-order information
I add momentum
I correct the bias of the running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
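A minimal NumPy sketch of the adaDelta update (12); note there is no global η, and the values of ρ and δ are assumptions following the defaults suggested in [Zeiler 2012].

import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, steps=1000):
    # adaDelta (eq. 12): the step is rescaled by RMS[Δw]_{t−1} / RMS[g]_t,
    # both maintained as exponential running averages
    w = w0.astype(float).copy()
    Eg2 = np.zeros_like(w)     # running average of g²
    Edw2 = np.zeros_like(w)    # running average of Δw²
    for _ in range(steps):
        g = grad(w)
        Eg2 = rho * Eg2 + (1.0 - rho) * g * g
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
        Edw2 = rho * Edw2 + (1.0 - rho) * dw * dw
        w += dw
    return w

grad = lambda w: np.array([1.0, 100.0]) * w
print(adadelta(grad, np.array([1.0, 1.0])))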
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

$w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \quad (13)$

possible improving directions:
I use a truncated sum / running average
I use (approximate) second-order information
I add momentum
I correct the bias of the running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
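A minimal NumPy sketch of the ADAM update (13); β₁, β₂ and δ follow the defaults suggested in [Kingma and Ba 2014], while the test function and η are assumptions for illustration.

import numpy as np

def adam(grad, w0, eta=1e-2, beta1=0.9, beta2=0.999, delta=1e-8, steps=2000):
    # ADAM (eq. 13): bias-corrected running averages of g and g²
    w = w0.astype(float).copy()
    m = np.zeros_like(w)                        # first moment
    v = np.zeros_like(w)                        # second moment
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)          # correct the bias of the averages
        v_hat = v / (1.0 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w

grad = lambda w: np.array([1.0, 100.0]) * w
print(adam(grad, np.array([1.0, 1.0])))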
Advanced adaptive learning rates
Demo time
Wikipedia page "Test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3: learn the (global) learning rates as well

$w_{t+1} = w_t - \eta g_t \quad \Rightarrow \quad w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1})$

$w_t$ is a function of $\eta$, so learn $\eta$ (with gradient descent) by

$\eta^* = \operatorname{argmin}_{\eta} \sum_{s \le t} f_s(w(s, \eta))$

I reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
I stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
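A toy sketch of the on-the-fly idea: since $w_{t+1} = w_t - \eta g_t$, we have $\partial w_{t+1} / \partial \eta = -g_t$, so the derivative of the next loss w.r.t. η is approximately $-\langle g_{t+1}, g_t \rangle$, and we can descend η on that signal. This is a simplified illustration in the spirit of the second bullet, not the exact algorithm of either paper; the meta-rate β and the one-step unrolling are assumptions.

import numpy as np

def sgd_learn_eta(grad, w0, eta0=0.01, beta=1e-4, steps=2000):
    # learn the global learning rate η on the fly (illustrative sketch only):
    # η ← η + β <g_t, g_{t−1}> is a gradient step on f_t(w_t(η))
    w = w0.astype(float).copy()
    eta = eta0
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        eta = max(eta + beta * float(np.dot(g, g_prev)), 0.0)
        w -= eta * g
        g_prev = g
    return w, eta

grad = lambda w: w - 3.0                       # C(w) = 0.5 ||w − 3||²
print(sgd_learn_eta(grad, np.array([0.0])))    # w → 3, η has adapted upwards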
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochastic approximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provide good performance in practice

Future work will reduce the labour of tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
References

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Discrete Gradient Descent

An intuitive, straightforward discretisation of

$\frac{d}{dt} s(t) = -\nabla C(s(t))$

is

$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_n) \quad \text{(Forward Euler)}$

Although it is easy and quick to implement, it has theoretical (A-)instability issues.

An alternative method which is (A-)stable is the implicit Euler method:

$\frac{s_{n+1} - s_n}{\varepsilon_n} = -\nabla C(s_{n+1})$

So why settle for explicit discretisation? Implicit discretisation requires the solution of a non-linear system at each step, and in practice explicit discretisation works.
16 of 27
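To see the stability gap numerically, here is a sketch on a stiff quadratic (an assumption for illustration); for this diagonal quadratic the implicit equation can be solved in closed form, which sidesteps the non-linear solve mentioned above.

import numpy as np

def forward_euler(grad_C, s0, eps, n_steps):
    # explicit update: s_{n+1} = s_n - eps * grad C(s_n)
    s = np.array(s0, dtype=float)
    for _ in range(n_steps):
        s = s - eps * grad_C(s)
    return s

def implicit_euler_quadratic(lam, s0, eps, n_steps):
    # implicit update s_{n+1} = s_n - eps * grad C(s_{n+1}); for the diagonal
    # quadratic C(s) = 0.5 * sum(lam * s²) it solves in closed form to
    # s_{n+1} = s_n / (1 + eps * lam), stable for every eps > 0
    s = np.array(s0, dtype=float)
    for _ in range(n_steps):
        s = s / (1.0 + eps * lam)
    return s

lam = np.array([1.0, 100.0])          # stiff quadratic: C(s) = 0.5 s^T diag(lam) s
grad_C = lambda s: lam * s
print(forward_euler(grad_C, [1.0, 1.0], 0.05, 100))          # 2nd coordinate blows up (0.05 > 2/100)
print(implicit_euler_quadratic(lam, [1.0, 1.0], 0.05, 100))  # decays towards 0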
Discrete Gradient Descent
Also have multistep methods, e.g.

$s_{n+2} = s_{n+1} - \varepsilon_{n+2} \left( \frac{3}{2} \nabla C(s_{n+1}) - \frac{1}{2} \nabla C(s_n) \right)$

This is the 2nd-order Adams-Bashforth method.

Similar (in)stability properties to forward Euler, but the discretisation error at each step is of a smaller order of magnitude ($O(\varepsilon^3)$ compared with $O(\varepsilon^2)$).
17 of 27
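A sketch of the 2nd-order Adams-Bashforth iteration applied to gradient descent; bootstrapping the first step with one forward Euler step, and the test function, are assumptions for illustration.

import numpy as np

def adams_bashforth2(grad_C, s0, eps, n_steps):
    # s_{n+2} = s_{n+1} - eps * (1.5 * grad C(s_{n+1}) - 0.5 * grad C(s_n));
    # the very first step is bootstrapped with one forward Euler step
    s_prev = np.array(s0, dtype=float)
    s = s_prev - eps * grad_C(s_prev)
    for _ in range(n_steps - 1):
        s, s_prev = s - eps * (1.5 * grad_C(s) - 0.5 * grad_C(s_prev)), s
    return s

grad_C = lambda s: s - 2.0                                    # C(s) = 0.5 (s − 2)²
print(adams_bashforth2(grad_C, np.array([0.0]), 0.01, 1000))  # -> [2.]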
Discrete Gradient Descent: examples of discretisation with ε = 0.1
18 of 27
Discrete Gradient Descent: examples of discretisation with ε = 0.07
19 of 27
Discrete Gradient Descent: examples of discretisation with ε = 0.01
20 of 27
Discrete Gradient Descent
Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies

1. $C$ has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$;
3. $\|\nabla C(x)\|_2^2 \le A + B \|x - x^\star\|_2^2$ for some $A, B \ge 0$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
21 of 27
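A quick numerical illustration of the step-size conditions (the one-dimensional quadratic and the schedules are assumptions): $\varepsilon_n = 0.5/n$ satisfies both conditions and converges, while $\varepsilon_n = 0.5/n^2$ violates $\sum_n \varepsilon_n = \infty$ and stalls before reaching the minimiser.

import numpy as np

def gd_schedule(grad_C, s0, eps_of_n, n_steps):
    # s_{n+1} = s_n - eps_n * grad C(s_n), with a step-size schedule eps_of_n
    s = np.array(s0, dtype=float)
    for n in range(1, n_steps + 1):
        s = s - eps_of_n(n) * grad_C(s)
    return s

grad_C = lambda s: s - 3.0                                       # C(s) = 0.5 (s − 3)², x* = 3
print(gd_schedule(grad_C, [0.0], lambda n: 0.5 / n, 100000))     # -> ~3: Robbins-Monro satisfied
print(gd_schedule(grad_C, [0.0], lambda n: 0.5 / n**2, 100000))  # stalls short of 3: Σ eps < ∞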
Discrete Gradient Descent

1. Define Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^{\infty} h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^\star$

Step 1: Define the Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$.
Because of the error that the discretisation introduced, $h_n$ is not guaranteed to be decreasing, so we have to adjust the continuous-case proof.
Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. [Illustration credit: Online Learning and Stochastic Approximation, Léon Bottou (1998)]

Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. With some algebra, note that

$h_{n+1} - h_n = \langle s_{n+1} - x^\star, s_{n+1} - x^\star \rangle - \langle s_n - x^\star, s_n - x^\star \rangle$
$= \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2 \langle s_{n+1} - s_n, x^\star \rangle$
$= \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2 \varepsilon_n \langle \nabla C(s_n), x^\star \rangle$
$= -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$

Step 3: Show that $\sum_{n=1}^{\infty} h_n^+ < \infty$. Assuming $\|\nabla C(x)\|_2^2 \le A + B \|x - x^\star\|_2^2$, we get

$h_{n+1} - h_n \le -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + B h_n)$
$\Rightarrow \quad h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$

Writing $\mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get

$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A$

Assuming $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable.

Step 4: Show that this implies that $h_n$ converges.
If $\sum_{n=1}^{\infty} \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^{\infty} \min(0, h_{n+1} - h_n) > -\infty$ (why?).
But $h_{n+1} = h_0 + \sum_{k=1}^{n} \left( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \right)$.
So $(h_n)_{n=1}^{\infty}$ converges.

Step 5: Show that $h_n$ must converge to 0. Assume $h_n$ converges to some positive number. Previously we had

$h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$

This is summable, and if we assume further that $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, then we get

$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0,$

contradicting $h_n$ converging away from 0.

22 of 27
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$\nabla C(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x)$

and an approximation is formed by subsampling:

$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \quad I_K \sim \text{Unif}(\text{subsets of size } K)$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
23 of 27
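The subsampled estimator is unbiased, which is the property the proof below actually uses. A small NumPy sketch (the data set and the quadratic $f_n$ are assumptions for illustration): averaging many minibatch gradients recovers the full gradient.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))            # data points x_n
N, K = X.shape[0], 32

def full_gradient(w):
    # ∇C(w) = (1/N) Σ_n (w − x_n), for C(w) = (1/2N) Σ_n ||w − x_n||²
    return w - X.mean(axis=0)

def minibatch_gradient(w):
    # unbiased estimate: subsample I_K uniformly, without replacement
    idx = rng.choice(N, size=K, replace=False)
    return w - X[idx].mean(axis=0)

w = np.ones(5)
est = np.mean([minibatch_gradient(w) for _ in range(5000)], axis=0)
print(np.allclose(est, full_gradient(w), atol=1e-2))   # True: the estimator is unbiased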
Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^{\infty}$ is a martingale³ if
I $\mathbb{E}[|X_n|] < \infty$ for all $n$
I $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^{\infty}$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^{\infty}$ is a positive stochastic process and

$\sum_{n=1}^{\infty} \mathbb{E}\left[ \left( \mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n \right) \mathbb{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}} \right] < \infty$

then $X_n \to X_\infty$ almost surely.

³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^{\infty}, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$
24 of 27
Stochastic Gradient Descent
Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$;
3. $\mathbb{E}\left[ \|H_n(x)\|_2^2 \right] \le A + B \|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
25 of 27
Stochastic Gradient Descent

1. Define Lyapunov process $h_n = \|s_n - x^\star\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^\star$

Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
By exactly the same calculation as in the deterministic discrete case,

$h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^\star, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$

So

$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \, \mathbb{E}\left[ \|H_n(s_n)\|_2^2 \mid \mathcal{F}_n \right]$

Step 3: Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:
I assume $\mathbb{E}\left[ \|H_n(x)\|_2^2 \right] \le A + B \|x - x^\star\|_2^2$
I introduce $\mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

Get

$\mathbb{E}\left[ h'_{n+1} - h'_n \mid \mathcal{F}_n \right] \le \varepsilon_n^2 \mu_n A$
$\Rightarrow \quad \mathbb{E}\left[ (h'_{n+1} - h'_n) \, \mathbb{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n \right] \le \varepsilon_n^2 \mu_n A$

$\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$ $\Rightarrow$ quasimartingale convergence $\Rightarrow$ $(h'_n)_{n=1}^{\infty}$ converges a.s. $\Rightarrow$ $(h_n)_{n=1}^{\infty}$ converges a.s.

Step 4: Show that $h_n$ must converge to 0 almost surely. From previous calculations,

$\mathbb{E}\left[ h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n \right] \le -2 \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A$

$(h_n)_{n=1}^{\infty}$ converges, so this sequence is summable a.s. We assume $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, so the right term is summable a.s., and hence the left term is also summable a.s.:

$\sum_{n=1}^{\infty} \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle < \infty$ almost surely.

If we assume in addition that $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, this forces

$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0$ almost surely.

And by our initial assumptions about $C$, $(h_n)_{n=1}^{\infty}$ must converge to 0 almost surely.

26 of 27
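To see the proposition in action, the sketch below runs SGD with unbiased noisy gradients and Robbins-Monro steps $\varepsilon_n = 1/n$, and tracks the Lyapunov process $h_n = \|s_n - x^\star\|_2^2$; the cost function and noise level are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(2)
x_star = np.array([1.0, -2.0])
grad_C = lambda s: s - x_star                  # C(s) = 0.5 ||s − x*||²

def sgd(s0, n_steps, noise=1.0):
    s = np.array(s0, dtype=float)
    h = []                                     # Lyapunov process h_n
    for n in range(1, n_steps + 1):
        H_n = grad_C(s) + noise * rng.normal(size=s.shape)  # unbiased estimate
        s = s - (1.0 / n) * H_n                # ε_n = 1/n: Σ ε = ∞, Σ ε² < ∞
        h.append(float(np.sum((s - x_star) ** 2)))
    return s, h

s, h = sgd([5.0, 5.0], 20000)
print(s, h[-1])   # s ends close to x*; h_n fluctuates but converges towards 0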
Relaxing the pseudo-convexity conditions
Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

I $C$ is a non-negative, three-times differentiable function
I the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$
I the gradient estimate $H$ satisfies $\mathbb{E}\left[ \|H_n(x)\|^k \right] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$
I there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^{\infty}$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
27 of 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Discrete Gradient Descent
Also have multistep methods eg
sn+2 = sn+1 + εn+2
(3
2nablaC (sn+1)minus 1
2nablaC (sn)
)This is the 2nd-order Adams-Bashforth method
Similar (in)stability properties to forward Euler but discretisation errorof a smaller order of magnitude at each step (O(ε3) compared withO(ε2))
17 of 27
Discrete Gradient DescentExamples of discretisation with ε = 01
18 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton1)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 007
19 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton2)ocgs[i]state=false
Discrete Gradient DescentExamples of discretisation with ε = 001
20 of 27
var ocgs=hostgetOCGs(hostpageNum)for(var i=0iltocgslengthi++)if(ocgs[i]name==MediaPlayButton3)ocgs[i]state=false
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
Proposition If sn+1 = sn minus εnnablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 nablaC (x)22 le A + Bx minus x2
2 for some AB ge 0
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the continuouscase but the extra conditions in the statement deal with the errorsintroduced by discretisation
21 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \arg\min_{w}\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T\big[\delta I + \mathrm{diag}(G_t)^{1/2}\big](w - w_t)$$

$$\Rightarrow\quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}} \quad \text{(element-wise)} \tag{10}$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
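Update (10) in code: a minimal diagonal-adaGrad sketch (assuming numpy and a grad oracle for the per-step gradient; the names are illustrative):

```python
import numpy as np

def adagrad(grad, w0, eta=0.1, delta=1e-8, steps=1000):
    # Diagonal adaGrad in the composite-mirror-descent form of eq. (10).
    w = w0.astype(float)
    G = np.zeros_like(w)                      # diag(G_t): running sum of g^2
    for _ in range(steps):
        g = grad(w)
        G += g * g
        w -= eta * g / (delta + np.sqrt(G))   # per-coordinate step sizes
    return w

# usage: minimise f(w) = 0.5 * ||w||^2, whose gradient is w
w_star = adagrad(lambda w: w, np.array([5.0, -3.0]))
```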
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY? Time-varying learning rate vs. time-varying proximal term:
- we want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper-bounded by $\|g_t\|_2$
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle\,\cdot\,, H_t\,\cdot\,\rangle}$
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$$\sum_{t=1}^{T}\|g_t\|_{\psi_t^*}^2 \le 2\sum_{d=1}^{D}\|g_{1:T,d}\|_2$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
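The "little math" reduces, per coordinate $d$ with $a_t = g_{t,d}$, to the standard scalar bound (a sketch of the usual inductive argument, stated here for completeness):

$$\sum_{t=1}^{T}\frac{a_t^2}{\sqrt{\sum_{i=1}^{t} a_i^2}} \;\le\; 2\sqrt{\sum_{t=1}^{T} a_t^2},$$

which follows by induction on $T$ from $\sqrt{x - a^2} \le \sqrt{x} - \frac{a^2}{2\sqrt{x}}$. Summing over coordinates gives $\sum_t \|g_t\|_{\psi_t^*}^2 \le 2\sum_d \|g_{1:T,d}\|_2$, since the $\delta$ term in $H_t$ only decreases the left-hand side.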
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm. 5 in the paper): Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2\sum_{d=1}^{D}\|g_{1:T,d}\|_2 + \eta\sum_{d=1}^{D}\|g_{1:T,d}\|_2.$$

For $w_t$ generated using the composite mirror descent update (9),

$$R_T(u) \le \frac{1}{2\eta}\max_{t\le T}\|u - w_t\|_2^2\sum_{d=1}^{D}\|g_{1:T,d}\|_2 + \eta\sum_{d=1}^{D}\|g_{1:T,d}\|_2.$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:
- use truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop:

$$G_t = \rho G_{t-1} + (1-\rho)\, g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \arg\min_{w}\ \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow\quad w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \tag{11}$$

(of the improving directions above, this swaps adaGrad's full sum for a running average)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
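The same loop with the sum replaced by a running average gives this RMSprop sketch (assuming numpy and a grad oracle; the hyper-parameter values are illustrative):

```python
import numpy as np

def rmsprop(grad, w0, eta=0.01, rho=0.9, delta=1e-8, steps=1000):
    # RMSprop, eq. (11): exponential running average of squared gradients.
    w = w0.astype(float)
    G = np.zeros_like(w)                      # EMA of g^2, i.e. diag(G_t)
    for _ in range(steps):
        g = grad(w)
        G = rho * G + (1 - rho) * g * g
        w -= eta * g / np.sqrt(delta + G)     # divide by RMS[g]_t = diag(H_t)
    return w
```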
Advanced adaptive learning rates
Improving adaGrad (cont)
$$\Delta w \propto H^{-1}g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2}
\ \Rightarrow\ \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w}
\ \Rightarrow\ \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

(a units argument: curvature is estimated from the ratio of running averages of gradients and updates, giving approximate second-order information)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \arg\min_{w}\ \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow\quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \tag{12}$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
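A sketch of the full adaDelta recursion (12) (assuming numpy and a grad oracle; note there is no global $\eta$):

```python
import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, steps=1000):
    # adaDelta, eq. (12): the step size is RMS[dw]_{t-1} / RMS[g]_t.
    w = w0.astype(float)
    Eg2 = np.zeros_like(w)     # running average of g^2
    Edw2 = np.zeros_like(w)    # running average of (Delta w)^2
    for _ in range(steps):
        g = grad(w)
        Eg2 = rho * Eg2 + (1 - rho) * g * g
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
        Edw2 = rho * Edw2 + (1 - rho) * dw * dw
        w += dw
    return w
```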
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

$$\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \delta} \tag{13}$$

(adds momentum via $m_t$ and corrects the bias of the running averages)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
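And a sketch of (13) (assuming numpy and a grad oracle; the defaults follow the values commonly used with ADAM):

```python
import numpy as np

def adam(grad, w0, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8, steps=1000):
    # ADAM, eq. (13): momentum m_t and scale v_t, both bias-corrected.
    w = w0.astype(float)
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w
```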
Advanced adaptive learning rates
Demo time
Wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3: learn the (global) learning rates as well!

$$w_{t+1} = w_t - \eta g_t \ \Rightarrow\ w_t = w(t, \eta, w_1, \{g_i\}_{i\le t-1})$$

$w_t$ is a function of $\eta$, so learn $\eta$ (with gradient descent) by

$$\eta^\star = \arg\min_{\eta}\ \sum_{s\le t} f_s(w(s, \eta))$$

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochastic approximation and optimisation methods.
Theoretical analysis guarantees good behaviour.
Adaptive learning rates provide good performance in practice.
Future work will reduce the labour of tuning hyper-parameters.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
References
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P.-Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Discrete Gradient Descent
[Animation: examples of discretisation with ε = 0.1]
18 of 27
Discrete Gradient Descent
[Animation: examples of discretisation with ε = 0.07]
19 of 27
Discrete Gradient Descent
[Animation: examples of discretisation with ε = 0.01]
20 of 27
Discrete Gradient Descent
Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies
1. $C$ has a unique minimiser $x^\star$,
2. $\forall \varepsilon > 0$: $\inf_{\|x - x^\star\|_2^2 > \varepsilon}\ \langle x - x^\star, \nabla C(x)\rangle > 0$,
3. $\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$,
then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.
Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
21 of 27
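A minimal numerical check of the proposition (a sketch; the quadratic cost and the schedule $\varepsilon_n = 0.5/n$ are example choices satisfying conditions 1-3 and the step-size conditions):

```python
import numpy as np

# C(x) = 0.5 * ||x||^2 has unique minimiser x* = 0, and
# ||grad C(x)||^2 = ||x||^2, so condition 3 holds with A = 0, B = 1.
grad_C = lambda x: x

s = np.array([10.0, -4.0])
for n in range(1, 10001):
    eps = 0.5 / n              # sum(eps_n) diverges, sum(eps_n^2) converges
    s = s - eps * grad_C(s)
print(np.linalg.norm(s))       # -> approaches 0
```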
Discrete Gradient Descent
Proof roadmap:
1. Define the Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$.
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
3. Show that $\sum_{n=1}^{\infty} h_n^+ < \infty$.
4. Show that this implies that $h_n$ converges.
5. Show that $h_n$ must converge to 0.
6. $s_n \to x^\star$.
Step 1: Define the Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$.
Because of the error that the discretisation introduces, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.
Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. [Illustration credit: Online Learning and Stochastic Approximation, Leon Bottou (1998)]
Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
With some algebra, note that

$$\begin{aligned}
h_{n+1} - h_n &= \langle s_{n+1} - x^\star, s_{n+1} - x^\star\rangle - \langle s_n - x^\star, s_n - x^\star\rangle \\
&= \langle s_{n+1}, s_{n+1}\rangle - \langle s_n, s_n\rangle - 2\,\langle s_{n+1} - s_n,\ x^\star\rangle \\
&= \langle s_n - \varepsilon_n\nabla C(s_n),\ s_n - \varepsilon_n\nabla C(s_n)\rangle - \langle s_n, s_n\rangle + 2\varepsilon_n\langle \nabla C(s_n), x^\star\rangle \\
&= -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2\|\nabla C(s_n)\|_2^2
\end{aligned}$$
Step 3: Show that $\sum_{n=1}^{\infty} h_n^+ < \infty$.
Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2$, we get

$$h_{n+1} - h_n \le -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2(A + Bh_n)$$

$$\Rightarrow\quad h_{n+1} - (1 + \varepsilon_n^2 B)h_n \le -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$$
From the bound above, $h_{n+1} - (1 + \varepsilon_n^2 B)h_n \le \varepsilon_n^2 A$. Writing $\mu_n = \prod_{i=1}^{n}\frac{1}{1+\varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get

$$h'_{n+1} - h'_n \le \varepsilon_n^2\,\mu_n A.$$

Assuming $\sum_{n=1}^{\infty}\varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable, and hence $\sum_{n=1}^{\infty} h_n^+ < \infty$.
Step 4: Show that this implies that $h_n$ converges.
If $\sum_{n=1}^{\infty}\max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^{\infty}\min(0, h_{n+1} - h_n) > -\infty$ (why? $h_n \ge 0$, so the 'down' movements can exceed the 'up' movements by at most $h_0$).
But $h_{n+1} = h_0 + \sum_{k=1}^{n}\big(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\big)$, so $(h_n)_{n=1}^{\infty}$ converges.
Step 5: Show that $h_n$ must converge to 0. Assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2\|\nabla C(s_n)\|_2^2.$$

This is summable, and if we assume further that $\sum_{n=1}^{\infty}\varepsilon_n = \infty$, then we get

$$\langle s_n - x^\star, \nabla C(s_n)\rangle \to 0,$$

contradicting (by condition 2) $h_n$ converging away from 0.
Step 6: $h_n \to 0$ is exactly $\|s_n - x^\star\|_2^2 \to 0$, i.e. $s_n \to x^\star$. □
22 of 27
Stochastic Gradient Descent
We're now ready to introduce the stochastic approximation of the gradient into our algorithm.
If $C$ is a cost function averaged across a data set, the true gradient often has the form

$$\nabla C(x) = \frac{1}{N}\sum_{n=1}^{N} f_n(x)$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K}\sum_{k\in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.
The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
23 of 27
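A sketch of this subsampled estimator (assuming numpy; minibatch_grad and the quadratic per-example pieces are illustrative names, not from the talk):

```python
import numpy as np

def minibatch_grad(per_example_grads, x, K, rng):
    # Unbiased estimate of grad C(x) = (1/N) sum_n f_n(x): average a
    # uniformly drawn size-K subset I_K of the N per-example gradients.
    N = len(per_example_grads)
    idx = rng.choice(N, size=K, replace=False)
    return np.mean([per_example_grads[k](x) for k in idx], axis=0)

# usage: f_n(x) = x - c_n, so C is a quadratic centred at mean(c)
rng = np.random.default_rng(0)
c = rng.normal(size=(100, 3))
grads = [lambda x, cn=cn: x - cn for cn in c]
g_hat = minibatch_grad(grads, np.zeros(3), K=10, rng=rng)
```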
Measure-theory-free martingale theory
Let $(X_n)_{n\ge 0}$ be a stochastic process.
$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.
Definition: A stochastic process $(X_n)_{n=0}^{\infty}$ is a martingale³ if
- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$
Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^{\infty}$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.
Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^{\infty}$ is a positive stochastic process and

$$\sum_{n=1}^{\infty}\mathbb{E}\Big[\big(\mathbb{E}[X_{n+1}\mid \mathcal{F}_n] - X_n\big)\,\mathbb{1}_{\{\mathbb{E}[X_{n+1}\mid\mathcal{F}_n] - X_n > 0\}}\Big] < \infty,$$

then $X_n \to X_\infty$ almost surely.
³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^{\infty}, \mathbb{P})$ with $\mathcal{F}_n = \sigma(X_m \mid m \le n)\ \forall n$
24 of 27
Stochastic Gradient Descent
Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies
1. $C$ has a unique minimiser $x^\star$,
2. $\forall \varepsilon > 0$: $\inf_{\|x - x^\star\|_2^2 > \varepsilon}\ \langle x - x^\star, \nabla C(x)\rangle > 0$,
3. $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$,
then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.
Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
25 of 27
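A quick simulation of this proposition (a sketch; the quadratic cost, Gaussian gradient noise, and $\varepsilon_n = 0.5/n$ are example choices satisfying conditions 1-3):

```python
import numpy as np

rng = np.random.default_rng(1)
grad_C = lambda x: x                                  # C(x) = 0.5 * ||x||^2, x* = 0
H = lambda x: grad_C(x) + rng.normal(size=x.shape)    # unbiased; E||H(x)||^2 = ||x||^2 + dim

s = np.array([10.0, -4.0])
for n in range(1, 100001):
    s = s - (0.5 / n) * H(s)     # Robbins-Monro: sum eps = inf, sum eps^2 < inf
print(np.linalg.norm(s))         # -> near 0, almost surely as n -> inf
```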
Stochastic Gradient Descent
Proof roadmap:
1. Define the Lyapunov process $h_n = \|s_n - x^\star\|_2^2$.
2. Consider the variations $h_{n+1} - h_n$.
3. Show that $h_n$ converges almost surely.
4. Show that $h_n$ must converge to 0 almost surely.
5. $s_n \to x^\star$.
Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n\langle s_n - x^\star, H_n(s_n)\rangle + \varepsilon_n^2\|H_n(s_n)\|_2^2.$$

So

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2\,\mathbb{E}\big[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\big].$$
Step 3: Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:
- assume $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B\|x - x^\star\|_2^2$;
- introduce $\mu_n = \prod_{i=1}^{n}\frac{1}{1+\varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.
We get

$$\mathbb{E}\big[h'_{n+1} - h'_n \mid \mathcal{F}_n\big] \le \varepsilon_n^2\,\mu_n A
\ \Rightarrow\ \mathbb{E}\Big[(h'_{n+1} - h'_n)\,\mathbb{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}}\Big] \le \varepsilon_n^2\,\mu_n A.$$

$\sum_{n=1}^{\infty}\varepsilon_n^2 < \infty$ $\Rightarrow$ quasimartingale convergence $\Rightarrow$ $(h'_n)_{n=1}^{\infty}$ converges a.s. $\Rightarrow$ $(h_n)_{n=1}^{\infty}$ converges a.s.
Step 4: Show that $h_n$ must converge to 0 almost surely. From previous calculations,

$$\mathbb{E}\big[h_{n+1} - (1 + \varepsilon_n^2 B)h_n \mid \mathcal{F}_n\big] \le -2\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2 A.$$

$(h_n)_{n=1}^{\infty}$ converges a.s., and assuming $\sum_{n=1}^{\infty}\varepsilon_n^2 < \infty$ the right term is summable a.s., so the left term is also summable a.s.; hence

$$\sum_{n=1}^{\infty}\varepsilon_n\langle s_n - x^\star, \nabla C(s_n)\rangle < \infty \quad\text{almost surely}.$$

If we assume in addition that $\sum_{n=1}^{\infty}\varepsilon_n = \infty$, this forces

$$\langle s_n - x^\star, \nabla C(s_n)\rangle \to 0 \quad\text{almost surely},$$

and by our initial assumptions about $C$, $(h_n)_{n=1}^{\infty}$ must converge to 0 almost surely.
Step 5: $h_n \to 0$ almost surely means $s_n \to x^\star$ almost surely. □
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume
- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^{\infty}\varepsilon_n = \infty$, $\sum_{n=1}^{\infty}\varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}\big[\|H_n(x)\|^k\big] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D}\ \langle x, \nabla C(x)\rangle > 0$.
The idea for the proof is then:
1. For a given start position, the sequence $(s_n)_{n=1}^{\infty}$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.
Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle) rather than reaching the global optimum.
27 of 27
Motivation
So far we've told you why SGD is "safe" :)
But Robbins-Monro is just a sufficient condition; then how to choose learning rates to achieve
- fast convergence
- a better local optimum?
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
[Figure from Bergstra and Bengio, 2012]
idea 1: search for the best learning rate schedule
- grid search, random search, cross validation
- Bayesian optimization
But I'm lazy and I don't want to spend too much time...
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2: let the learning rates adapt themselves
- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012], ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)
Now we'll talk a bit about online learning
- a good tutorial [Shalev-Shwartz, 2012]
- milestone paper [Zinkevich, 2003]
and some practical comparisons.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function $C(w, \mathcal{D}) = \mathbb{E}_{x\sim\mathcal{D}}[c(w, x)]$
SGD / mini-batch learning accelerates training in real time by
- processing one datapoint / a mini-batch of datapoints each iteration
- considering gradients with data-point cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M}\sum_{m=1}^{M}\nabla c(w, x_m)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Let's forget the cost function on the batch for a moment.
Online gradient descent (OGD):
- each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$
For SGD the received gradient is defined by

$$g_t = \frac{1}{M}\sum_{m=1}^{M}\nabla c(w, x_m)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes:
- each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules
Performance is measured by regret:

$$R_T^\star = \sum_{t=1}^{T} f_t(w_t) - \inf_{w\in S}\sum_{t=1}^{T} f_t(w) \tag{1}$$

General definition: $R_T(u) = \sum_{t=1}^{T}\big[f_t(w_t) - f_t(u)\big]$
SGD $\Leftrightarrow$ "follow the regularized leader" with $L_2$ regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL):

$$w_{t+1} = \arg\min_{w\in S}\ \sum_{i=1}^{t} f_i(w) \tag{2}$$

Lemma (upper bound of regret): Let $w_1, w_2, \dots$ be the sequence produced by FTL; then $\forall u \in S$,

$$R_T(u) = \sum_{t=1}^{T}\big[f_t(w_t) - f_t(u)\big] \le \sum_{t=1}^{T}\big[f_t(w_t) - f_t(w_{t+1})\big] \tag{3}$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Discrete Gradient Descent

Examples of discretisation with ε = 0.01
[embedded animation of the discretised trajectories omitted]

20 of 27
Discrete Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies

1. $C$ has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$;
3. $\|\nabla C(x)\|_2^2 \le A + B \|x - x^\star\|_2^2$ for some $A, B \ge 0$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.

21 of 27
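To make the update concrete, here is a minimal sketch (my addition, not from the slides) of the iteration with a Robbins-Monro step-size schedule on a toy quadratic; `grad_C`, `eps` and the constants are illustrative choices only:

```python
import numpy as np

def gradient_descent(grad_C, s0, eps, n_steps):
    """Iterate s_{n+1} = s_n - eps(n) * grad_C(s_n)."""
    s = np.asarray(s0, dtype=float)
    for n in range(1, n_steps + 1):
        s = s - eps(n) * grad_C(s)
    return s

# Toy example: C(x) = ||x||^2 / 2, so grad_C(x) = x and x* = 0.
grad_C = lambda x: x
# eps_n = 0.5 / n satisfies sum eps_n = inf and sum eps_n^2 < inf.
eps = lambda n: 0.5 / n
print(gradient_descent(grad_C, s0=[5.0, -3.0], eps=eps, n_steps=10000))
# prints small values shrinking towards [0, 0]
```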
Discrete Gradient Descent

1. Define Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$.
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$.
4. Show that this implies that $h_n$ converges.
5. Show that $h_n$ must converge to 0.
6. $s_n \to x^\star$.

22 of 27
Step 1: Define the Lyapunov sequence $h_n = \|s_n - x^\star\|_2^2$.

Because of the error that the discretisation introduces, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.
Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$.

[Illustration credit: Online Learning and Stochastic Approximation, Leon Bottou (1998)]
Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

With some algebra, note that
$$\begin{aligned}
h_{n+1} - h_n &= \langle s_{n+1} - x^\star, s_{n+1} - x^\star \rangle - \langle s_n - x^\star, s_n - x^\star \rangle \\
&= \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2\,\langle s_{n+1} - s_n, x^\star \rangle \\
&= \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2\varepsilon_n \langle \nabla C(s_n), x^\star \rangle \\
&= -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2
\end{aligned}$$
Step 3: Show that $\sum_{n=1}^\infty h_n^+ < \infty$.

Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^\star\|_2^2$, we get
$$h_{n+1} - h_n \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + B h_n)$$
$$\implies h_{n+1} - (1 + \varepsilon_n^2 B)\, h_n \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$$
Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get
$$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A.$$
Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0 (since $\sum_n \log(1 + \varepsilon_n^2 B) \le B \sum_n \varepsilon_n^2 < \infty$), so the right-hand side is summable, and hence so are the positive variations $h_n^+$.
Step 4: Show that this implies that $h_n$ converges.

If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why? because $h_n \ge 0$, the cumulative 'down' movements can exceed the cumulative 'up' movements by at most $h_1$).

But $h_{n+1} = h_1 + \sum_{k=1}^n \big( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \big)$, the sum of two convergent series.

So $(h_n)_{n=1}^\infty$ converges.
Step 5: Show that $h_n$ must converge to 0.

Assume $h_n$ converges to some positive number. Previously we had
$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2.$$
This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get
$$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0,$$
contradicting $h_n$ converging away from 0.

Step 6: $s_n \to x^\star$.

22 of 27
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form
$$\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x)$$
and an approximation formed by subsampling:
$$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

23 of 27
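As a concrete illustration (a sketch I am adding, with an invented least-squares problem; all names are hypothetical), subsampled gradients plugged into the same iteration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: C(x) = (1/2N) sum_n (a_n . x - b_n)^2, minimised at x_true.
N, D, K = 1000, 5, 32
A = rng.normal(size=(N, D))
x_true = rng.normal(size=D)
b = A @ x_true

def grad_estimate(x):
    """Unbiased gradient estimate from a uniform mini-batch of size K."""
    idx = rng.choice(N, size=K, replace=False)
    return A[idx].T @ (A[idx] @ x - b[idx]) / K

x = np.zeros(D)
for n in range(1, 20001):
    x -= (0.5 / n) * grad_estimate(x)   # Robbins-Monro schedule

print(np.linalg.norm(x - x_true))       # small: iterates approach x_true
```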
Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if
- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and
$$\sum_{n=1}^\infty \mathbb{E}\big[ (\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\, \mathbb{1}_{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0} \big] < \infty,$$
then $X_n \to X_\infty$ almost surely.

³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ $\forall n$.

24 of 27
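For intuition, a quick example that is not on the slides: let $Z_i$ be i.i.d. with $\mathbb{P}(Z_i = 0) = \mathbb{P}(Z_i = 2) = \frac{1}{2}$ and set $X_n = \prod_{i=1}^n Z_i$. Then $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = X_n\, \mathbb{E}[Z_{n+1}] = X_n$ and $\mathbb{E}[|X_n|] = 1$ for all $n$, so Doob's theorem applies; indeed $X_n \to 0$ almost surely (eventually some $Z_i = 0$), showing that the almost-sure limit can have a different mean from the $X_n$.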
Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$;
3. $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B \|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

25 of 27
Stochastic Gradient Descent

1. Define Lyapunov process $h_n = \|s_n - x^\star\|_2^2$.
2. Consider the variations $h_{n+1} - h_n$.
3. Show that $h_n$ converges almost surely.
4. Show that $h_n$ must converge to 0 almost surely.
5. $s_n \to x^\star$.

26 of 27
Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,
$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2.$$
So
$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2\, \mathbb{E}\big[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\big]$$
Step 3: Show that $h_n$ converges almost surely.

Exactly the same as for discrete gradient descent:
- Assume $\mathbb{E}\big[\|H_n(x)\|_2^2\big] \le A + B\|x - x^\star\|_2^2$.
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.

We get
$$\mathbb{E}\big[h'_{n+1} - h'_n \mid \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A$$
$$\implies \mathbb{E}\big[(h'_{n+1} - h'_n)\, \mathbb{1}_{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0} \mid \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A.$$
$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h'_n)_{n=1}^\infty$ converges a.s. $\implies$ $(h_n)_{n=1}^\infty$ converges a.s.
Step 4: Show that $h_n$ must converge to 0 almost surely.

From previous calculations,
$$\mathbb{E}\big[h_{n+1} - (1 + \varepsilon_n^2 B)\, h_n \mid \mathcal{F}_n\big] \le -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A.$$
$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable a.s.; assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, the $\varepsilon_n^2 A$ term is summable a.s., so the remaining term is also summable a.s.:
$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle < \infty \quad \text{almost surely.}$$
If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces
$$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely,}$$
and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

Step 5: $s_n \to x^\star$.

26 of 27
Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}\big[\|H_n(x)\|^k\big] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to 0 almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.

27 of 27
Motivation

So far we've told you why SGD is "safe" :)

- but Robbins-Monro is just a sufficient condition
- so how do we choose learning rates to achieve
  - fast convergence
  - a better local optimum?

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 / 27
Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule
- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time...

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 / 27
Motivation (cont.)

idea 2: let the learning rates adapt themselves
- pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning
- a good tutorial [Shalev-Shwartz, 2012]
- milestone paper [Zinkevich, 2003]

and some practical comparisons

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 / 27
Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$

SGD/mini-batch learning accelerates training in real time by
- processing one datapoint / a mini-batch of datapoints each iteration
- considering gradients with data-point cost functions:
$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 / 27
Online learning

From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):
- each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD the received gradient is defined by
$$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 / 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning

From batch to online learning (cont.)

Online learning assumes:
- each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:
$$R_T^\star = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \qquad (1)$$
General definition: $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 / 27
Online learning

From batch to online learning (cont.)

Follow the Leader (FTL):
$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) \qquad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL. Then $\forall u \in S$:
$$R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})] \qquad (3)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 / 27
Online learning

From batch to online learning (cont.)

A game that fools FTL:

Let $w \in S = [-1, 1]$, with the loss function at time $t$ given by
$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$
FTL is easily fooled!

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 / 27
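A quick simulation of this game (my addition, not from the slides) makes the failure concrete: FTL's cumulative loss grows linearly in $T$ while the best fixed action's loss stays bounded, so the regret is $\Theta(T)$.

```python
def ftl_regret(T):
    """Regret of FTL on the adversarial linear game above, S = [-1, 1]."""
    # f_t(w) = c_t * w with the coefficients from the slide.
    coef = [-0.5 if t == 1 else (1.0 if t % 2 == 0 else -1.0)
            for t in range(1, T + 1)]
    w, cum, ftl_loss = 0.0, 0.0, 0.0    # w_1 = 0 is an arbitrary start
    for c in coef:
        ftl_loss += c * w
        cum += c
        w = -1.0 if cum > 0 else 1.0    # argmin over [-1, 1] of cum * w
    best_fixed = min(sum(coef) * u for u in (-1.0, 1.0))
    return ftl_loss - best_fixed

for T in (10, 100, 1000):
    print(T, ftl_regret(T))             # regret grows linearly with T
```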
Online learning

From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):
$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL. Then $\forall u \in S$:
$$\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 / 27
Online learning

From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:
$$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle, \quad \forall g(w) \in \partial f_t(w)$$
Use the linearisation $\tilde f_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle$ as a surrogate:
$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \quad \forall g_t \in \partial f_t(w_t)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 / 27
Online learning

From batch to online learning (cont.)

Online Gradient Descent (OGD):
$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \qquad (5)$$

From FTRL to online gradient descent:
- use the linearisation as surrogate loss (it upper bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss
$$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 / 27
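To see why this recovers OGD, a one-line check I am adding for the unconstrained case $S = \mathbb{R}^D$: setting the gradient of the FTRL objective to zero gives
$$\sum_{i=1}^t g_i + \frac{1}{\eta}(w - w_1) = 0 \;\implies\; w_{t+1} = w_1 - \eta \sum_{i=1}^t g_i,$$
so consecutive iterates differ by $w_{t+1} - w_t = -\eta g_t$, which is exactly update (5).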
Online learning

From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent (5) on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$. Then for all $u \in S$:
$$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$$

Keep in mind for the later discussion of adaGrad:
- if we instead want $\varphi$ to be strongly convex w.r.t. a general (semi-)norm $\|\cdot\|$, the L2 norm on the gradients is replaced by the dual norm $\|\cdot\|_*$, and Hölder's inequality is used in the proof.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 / 27
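A step the slides leave implicit: the bound is optimised by tuning $\eta$. If $\|u - w_1\|_2 \le D$ and $\|g_t\|_2 \le G$ for all $t$, then
$$R_T(u) \le \frac{D^2}{2\eta} + \eta T G^2, \qquad \eta = \frac{D}{G\sqrt{2T}} \;\implies\; R_T(u) \le DG\sqrt{2T},$$
the classic $O(\sqrt{T})$ regret rate.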
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and
$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$$

Primal-dual sub-gradient update:
$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \qquad (6)$$

Proximal gradient / composite mirror descent:
$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 / 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:
$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$
$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w \qquad (8)$$
$$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \qquad (9)$$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:
$$\psi_t(w) = \frac{1}{2} w^T H_t w$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 / 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:
$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T \big[\delta I + \mathrm{diag}(G_t^{1/2})\big] (w - w_t)$$
$$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \qquad (10)$$
(with the division taken element-wise)

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 / 27
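A minimal NumPy sketch of update (10) (my illustration; `eta`, `delta` and the toy objective are arbitrary choices, not values from the slides):

```python
import numpy as np

def adagrad(grad, w0, eta=0.1, delta=1e-8, n_steps=500):
    """Diagonal adaGrad, composite mirror descent form of (10)."""
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)                     # running sum of squared gradients
    for _ in range(n_steps):
        g = grad(w)
        G += g * g
        w -= eta * g / (delta + np.sqrt(G))  # per-coordinate step sizes
    return w

# Usage: a poorly scaled quadratic; adaGrad adapts to each coordinate.
grad = lambda w: np.array([100.0 * w[0], 1.0 * w[1]])
print(adagrad(grad, w0=[1.0, 1.0]))          # both coordinates head towards 0
```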
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:

we want the contribution of the $\|g_t\|_*^2$ terms induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$-type quantities:
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot \rangle}$
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that
$$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 / 27
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8), with $\delta \ge \max_t \|g_t\|_\infty$. Then for any $u \in S$:
$$R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$
For $\{w_t\}$ generated using the composite mirror descent update (9):
$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 / 27
Advanced adaptive learning rates

Improving adaGrad

possible problems of adaGrad:
- the global learning rate η still needs hand tuning
- sensitive to initial conditions

possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 / 27
Advanced adaptive learning rates

Improving adaGrad (cont.)

RMSprop (improving direction used: truncated sum / running average):
$$G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$
$$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$
$$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \qquad (11)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 / 27
Advanced adaptive learning rates

Improving adaGrad (cont.)

A second-order heuristic (improving direction used: approximate second-order information): for a diagonal pre-conditioner,
$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \;\Rightarrow\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \;\Rightarrow\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 / 27
Advanced adaptive learning rates

Improving adaGrad (cont.)

adaDelta:
$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$
$$w_{t+1} = \arg\min_w \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$$
$$\Rightarrow \quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \qquad (12)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 / 27
Advanced adaptive learning rates

Improving adaGrad (cont.)

ADAM (improving directions used: momentum and bias correction):
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$
$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}$$
$$w_{t+1} = w_t - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \delta} \qquad (13)$$

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 / 27
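For concreteness, minimal NumPy sketches of the three updates (11)-(13) as single steps (my illustration; the hyper-parameter defaults shown are common choices, not values from the slides):

```python
import numpy as np

def rmsprop_step(w, g, G, eta=1e-3, rho=0.9, delta=1e-8):
    """RMSprop (11): running average of squared gradients."""
    G = rho * G + (1 - rho) * g * g
    return w - eta * g / np.sqrt(delta + G), G

def adadelta_step(w, g, G, D, rho=0.95, delta=1e-6):
    """adaDelta (12): RMS[dw]/RMS[g] ratio, no global learning rate."""
    G = rho * G + (1 - rho) * g * g
    dw = -np.sqrt(D + delta) / np.sqrt(G + delta) * g
    D = rho * D + (1 - rho) * dw * dw
    return w + dw, G, D

def adam_step(w, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, delta=1e-8):
    """ADAM (13): momentum plus bias-corrected running averages."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)                # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    return w - eta * m_hat / (np.sqrt(v_hat) + delta), m, v
```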
Advanced adaptive learning rates

Demo time!

Wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 / 27
Advanced adaptive learning rates

Learn the learning rates

idea 3: learn the (global) learning rates as well
$$w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta; w_1, \{g_i\}_{i \le t-1})$$
- $w_t$ is a function of $\eta$
- learn $\eta$ (with gradient descent) by
$$\eta = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))$$
- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 / 27
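As a toy illustration of the on-the-fly idea (a simplified one-step variant I am sketching here, not the method of either paper; `beta` is a hypothetical meta learning rate): differentiating the current loss through the most recent update $w_t = w_{t-1} - \eta g_{t-1}$ gives $\partial f_t(w_t) / \partial \eta = -\langle g_t, g_{t-1} \rangle$, so $\eta$ can itself be updated by gradient descent.

```python
import numpy as np

def sgd_learned_eta(grad, w0, eta=0.01, beta=1e-4, n_steps=1000):
    """SGD that adapts its global eta via a one-step hypergradient."""
    w = np.asarray(w0, dtype=float)
    g_prev = np.zeros_like(w)
    for _ in range(n_steps):
        g = grad(w)
        eta += beta * (g @ g_prev)   # descend d f/d eta = -<g_t, g_{t-1}>
        w = w - eta * g
        g_prev = g
    return w, eta

grad = lambda w: w                    # C(w) = ||w||^2 / 2
print(sgd_learned_eta(grad, w0=[2.0, -1.0]))
```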
Advanced adaptive learning rates

Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods
- Theoretical analysis guarantees good behaviour
- Adaptive learning rates provide good performance in practice
- Future work will reduce the labour spent on tuning hyper-parameters

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 / 27
Advanced adaptive learning rates

References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P.-Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 / 27
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Discrete Gradient Descent
Proposition: If $s_{n+1} = s_n - \varepsilon_n \nabla C(s_n)$ and $C$ satisfies
1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$
3. $\|\nabla C(x)\|_2^2 \le A + B \|x - x^*\|_2^2$ for some $A, B \ge 0$
then, subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.
Proof structure: The proof is largely the same as for the continuous case, but the extra conditions in the statement deal with the errors introduced by discretisation.
21 of 27
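To make the proposition concrete, here is a minimal sketch (our own illustration, not from the slides) of the discrete scheme on a one-dimensional quadratic, with step sizes $\varepsilon_n = 1/n$, which satisfy both summability conditions; the function and all names are assumptions for the example.

```python
# A toy illustration: gradient descent with Robbins-Monro step sizes
# eps_n = 1/n on C(x) = (x - 3)^2, whose unique minimiser is x* = 3;
# conditions 1-3 of the proposition hold here (A = 0, B = 4).

def grad_C(x):
    return 2.0 * (x - 3.0)        # nabla C(x) for C(x) = (x - 3)^2

s = 10.0                           # arbitrary starting point s_1
for n in range(1, 10001):
    eps = 1.0 / n                  # sum eps_n = inf, sum eps_n^2 < inf
    s -= eps * grad_C(s)

print(s)                           # -> 3.0, i.e. s_n -> x*
```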
Discrete Gradient Descent
1. Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^*$
22 of 27
Discrete Gradient Descent
Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.
Because of the error introduced by discretisation, $h_n$ is not guaranteed to be decreasing, so we have to adjust the proof.
22 of 27
Discrete Gradient Descent
Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$.
Illustration credit: Online Learning and Stochastic Approximation, Leon Bottou (1998)
22 of 27
Discrete Gradient Descent
Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
With some algebra, note that
$h_{n+1} - h_n = \langle s_{n+1} - x^*, s_{n+1} - x^* \rangle - \langle s_n - x^*, s_n - x^* \rangle$
$= \langle s_{n+1}, s_{n+1} \rangle - \langle s_n, s_n \rangle - 2 \langle s_{n+1} - s_n, x^* \rangle$
$= \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n) \rangle - \langle s_n, s_n \rangle + 2 \varepsilon_n \langle \nabla C(s_n), x^* \rangle$
$= -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$
22 of 27
Discrete Gradient Descent
Show that $\sum_{n=1}^\infty h_n^+ < \infty$.
Assuming $\|\nabla C(x)\|_2^2 \le A + B \|x - x^*\|_2^2$, we get
$h_{n+1} - h_n \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 (A + B h_n)$
$\Rightarrow \; h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$
22 of 27
Discrete Gradient Descent
Show that $\sum_{n=1}^\infty h_n^+ < \infty$ (continued).
Assuming $\|\nabla C(x)\|_2^2 \le A + B \|x - x^*\|_2^2$, we had
$h_{n+1} - (1 + \varepsilon_n^2 B) h_n \le \varepsilon_n^2 A$
Writing $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get
$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A$
Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0, so the RHS is summable.
22 of 27
Discrete Gradient Descent
Show that this implies that $h_n$ converges.
If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why? note $h_n \ge 0$).
But $h_{n+1} = h_1 + \sum_{k=1}^n \big( \max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k) \big)$.
So $(h_n)_{n=1}^\infty$ converges.
22 of 27
Discrete Gradient Descent
Show that $h_n$ must converge to 0. Assume $h_n$ converges to some positive number. Previously we had
$h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2$
This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get
$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0$,
contradicting $h_n$ converging away from 0.
22 of 27
Stochastic Gradient Descent
We're now ready to introduce the stochastic approximation of the gradient into our algorithm.
If $C$ is a cost function averaged across a data set, we often have a true gradient of the form
$\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x)$
and an approximation formed by subsampling:
$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$
We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.
The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
23 of 27
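As a hedged sketch of the subsampling construction above (the dataset, the per-point gradients $f_n$, and the batch size $K$ are our own illustrative choices):

```python
# Unbiased subsampled gradient estimator: averaging f_n over a uniformly
# chosen size-K subset has the same expectation as the full-data gradient.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)     # N = 1000 datapoints

def f_n(x, n):
    return 2.0 * (x - data[n])    # per-datapoint gradient of (x - data[n])^2

def grad_full(x):
    return np.mean([f_n(x, n) for n in range(len(data))])

def grad_subsampled(x, K=32):
    # I_K ~ Unif(subsets of size K); E[estimate] = grad_full(x)
    I_K = rng.choice(len(data), size=K, replace=False)
    return np.mean([f_n(x, k) for k in I_K])
```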
Measure-theory-free martingale theory
Let $(X_n)_{n \ge 0}$ be a stochastic process.
$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.
Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if
- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$
Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.
Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and
$\sum_{n=1}^\infty \mathbb{E}\big[ (\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n) \, \mathbf{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}} \big] < \infty$
then $X_n \to X_\infty$ almost surely.
³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$
24 of 27
Stochastic Gradient Descent
Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies
1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$
then, subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.
Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
25 of 27
Stochastic Gradient Descent
1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$
26 of 27
Stochastic Gradient Descent
Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
By exactly the same calculation as in the deterministic discrete case:
$h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2$
So
$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \, \mathbb{E}[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n]$
26 of 27
Stochastic Gradient Descent
Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:
- Assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2$
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$
Get
$\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A$
$\Rightarrow \; \mathbb{E}\big[ (h'_{n+1} - h'_n) \, \mathbf{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n \big] \le \varepsilon_n^2 \mu_n A$
$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\Rightarrow$ quasimartingale convergence $\Rightarrow$ $(h'_n)_{n=1}^\infty$ converges a.s. $\Rightarrow$ $(h_n)_{n=1}^\infty$ converges a.s.
26 of 27
Stochastic Gradient Descent
Show that $h_n$ must converge to 0 almost surely. From previous calculations:
$\mathbb{E}[h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n] \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A$
$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the $\varepsilon_n^2 A$ term is summable, and hence the remaining term is also summable a.s.:
$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty$ almost surely.
If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces
$\langle s_n - x^*, \nabla C(s_n) \rangle \to 0$ almost surely,
and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.
26 of 27
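A small numerical sanity check of the proposition (our own sketch, not part of the original proof): run SGD with unbiased noisy gradients on a quadratic and watch the Lyapunov process $h_n$ shrink; the noise model and constants are assumptions.

```python
# SGD with H_n(s) = nabla C(s) + xi_n on C(x) = (x - 3)^2 and eps_n = 1/n.
# Here E[||H_n||^2] = 4(s - 3)^2 + 1, so condition 3 holds with A = 1, B = 4.
import numpy as np

rng = np.random.default_rng(1)
x_star, s = 3.0, 10.0
for n in range(1, 100001):
    eps = 1.0 / n
    H_n = 2.0 * (s - x_star) + rng.normal()   # unbiased: E[H_n | s] = nabla C(s)
    s -= eps * H_n

print((s - x_star) ** 2)                       # h_n -> 0 (almost surely)
```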
Relaxing the pseudo-convexity conditions
Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume
- $C$ is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- the gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$
The idea for the proof is then:
1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.
Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle) rather than reaching the global optimum.
27 of 27
Motivation
So far we've told you why SGD is "safe" :)
- but Robbins-Monro is just a sufficient condition
- then how to choose learning rates to achieve fast convergence and a better local optimum?
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1: search for the best learning rate schedule:
grid search, random search, cross validation, Bayesian optimization
But I'm lazy and I don't want to spend too much time...
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2: let the learning rates adapt themselves
- pre-conditioning with |diag(H)| [Becker and LeCun 1988]
- adaGrad [Duchi et al 2010]
- adaDelta [Zeiler 2012], RMSprop [Tieleman and Hinton 2012]
- ADAM [Kingma and Ba 2014]
- Equilibrated SGD [Dauphin et al 2015] (this year's NIPS)
Now we'll talk a bit about online learning:
- a good tutorial [Shalev-Shwartz 2012]; milestone paper [Zinkevich 2003]
and some practical comparisons.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function $C(w; \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$
SGD/mini-batch learning accelerates training in real time by
- processing one datapoint or a mini-batch each iteration
- considering gradients with per-datapoint cost functions:
$\nabla C(w; \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Let's forget the cost function on the batch for a moment.
Online gradient descent (OGD):
- each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$
For SGD, the received gradient is defined by
$g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Online learning assumes:
- each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules
Performance is measured by regret:
$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \qquad (1)$
General definition: $R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]$
SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL):
$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) \qquad (2)$
Lemma (upper bound of regret):
Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$:
$R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})] \qquad (3)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL:
Let $w \in S = [-1, 1]$ and the loss function at time $t$ be
$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$
FTL is easily fooled!
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
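The game can be replayed in a few lines (a sketch; the loop below just implements the loss sequence above): FTL pays roughly 1 per round while the fixed comparator $u = 0$ pays 0, so its regret grows linearly in $T$.

```python
# FTL on the adversarial game: play the minimiser of the cumulative linear
# loss over S = [-1, 1]; the adversary flips the sign each round.

def coeff(t):                      # f_t(w) = coeff(t) * w
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

w, cum, total = 0.0, 0.0, 0.0      # w_1 arbitrary in S
for t in range(1, 21):
    total += coeff(t) * w          # suffer f_t(w_t)
    cum += coeff(t)
    w = -1.0 if cum > 0 else 1.0   # FTL: argmin over [-1, 1] of cum * w

print(total)                       # ~T, while u = 0 suffers 0: linear regret
```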
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL):
$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \qquad (4)$
Lemma (upper bound of regret):
Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$:
$\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Let's assume the losses $f_t$ are convex functions:
$\forall w, u \in S$: $f_t(u) - f_t(w) \ge \langle u - w, g(w) \rangle$, $\forall g(w) \in \partial f_t(w)$
Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t) \rangle$ as a surrogate:
$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t \rangle - \langle u, g_t \rangle, \qquad \forall g_t \in \partial f_t(w_t)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD):
$w_{t+1} = w_t - \eta g_t, \qquad g_t \in \partial f_t(w_t) \qquad (5)$
From FTRL to online gradient descent:
- use the linearisation as surrogate loss (upper bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:
$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t \langle w, g_i \rangle + \varphi(w)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
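A bare-bones OGD loop as in (5) (our sketch; the stream $x_t$ and the absolute-value losses are assumed purely for illustration):

```python
# OGD on the online losses f_t(w) = |w - x_t|, whose subgradient at w is
# sign(w - x_t); each round reveals a loss, we take a subgradient step.
import numpy as np

rng = np.random.default_rng(2)
w, eta = 0.0, 0.05
for t in range(1, 1001):
    x_t = rng.normal(loc=1.0)          # loss f_t revealed at round t
    g_t = np.sign(w - x_t)             # g_t in the subdifferential of f_t at w
    w -= eta * g_t                     # w_{t+1} = w_t - eta * g_t  (eq. 5)

print(w)                               # drifts to the stream's median (~1.0)
```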
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent):
Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$; then for all $u \in S$:
$R_T(u) \le \frac{1}{2\eta} \|u - w_1\|_2^2 + \eta \sum_{t=1}^T \|g_t\|_2^2, \qquad g_t \in \partial f_t(w_t)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Keep in mind for the later discussion of adaGrad:
I want $\varphi$ to be strongly convex wrt a (semi-)norm $\|\cdot\|$; then the L2 norm will be changed to the dual norm $\|\cdot\|_*$, and we use Hölder's inequality in the proof.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms $\psi_t$:
$\psi_t$ is a strongly-convex function, and
$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla \psi_t(w_t), w - w_t \rangle$
Primal-dual sub-gradient update:
$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{t} \psi_t(w) \qquad (6)$
Proximal gradient / composite mirror descent:
$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \eta \varphi(w) + B_{\psi_t}(w, w_t) \qquad (7)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices:
$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$
$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2t} w^T H_t w \qquad (8)$
$w_{t+1} = \arg\min_{w \in S} \; \eta \langle w, g_t \rangle + \eta \varphi(w) + \frac{1}{2} (w - w_t)^T H_t (w - w_t) \qquad (9)$
To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time: $\psi_t(w) = \frac{1}{2} w^T H_t w$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:
$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T [\delta I + \mathrm{diag}(G_t^{1/2})] (w - w_t)$
$\Rightarrow \; w_{t+1} = w_t - \frac{\eta \, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \qquad (10)$
(operations taken per coordinate)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
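Update (10) is easy to state in code; the following is our rendering of the formula (not the authors' implementation), and the test function in the usage line is an assumption:

```python
# Diagonal adaGrad: per-coordinate steps eta / (delta + sqrt(sum_i g_i^2)).
import numpy as np

def adagrad(grad, w0, eta=0.5, delta=1e-8, steps=500):
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)                  # running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        G += g ** 2
        w -= eta * g / (delta + np.sqrt(G))
    return w

# usage: a badly scaled quadratic; both coordinates still make progress
w = adagrad(lambda w: np.array([2.0 * w[0], 200.0 * w[1]]), [5.0, 5.0])
print(w)   # both coordinates shrink toward 0 despite very different curvatures
```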
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY?
Time-varying learning rate vs time-varying proximal term:
I want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$:
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \cdot \rangle}$
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex wrt $\|\cdot\|_{\psi_t}$
- a little math can show that
$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^D \|g_{1:T,d}\|_2$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper):
Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$:
$R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$
For $\{w_t\}$ generated using the composite mirror descent update (9):
$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions
Possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average
Existing methods:
- RMSprop [Tieleman and Hinton 2012]
- adaDelta [Zeiler 2012]
- ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop:
$G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$
$w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$
$\Rightarrow \; w_{t+1} = w_t - \frac{\eta \, g_t}{RMS[g]_t}, \qquad RMS[g]_t = \mathrm{diag}(H_t) \qquad (11)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
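A sketch of update (11) (the values of $\rho$ and $\delta$ are our assumptions): the exponential running average $G_t$ replaces adaGrad's ever-growing sum, so old gradients are forgotten.

```python
# RMSprop: divide each coordinate by a running RMS of recent gradients.
import numpy as np

def rmsprop(grad, w0, eta=0.01, rho=0.9, delta=1e-8, steps=2000):
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)                    # running average of g^2
    for _ in range(steps):
        g = grad(w)
        G = rho * G + (1.0 - rho) * g ** 2
        w -= eta * g / np.sqrt(delta + G)   # divide by RMS[g]_t
    return w
```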
Advanced adaptive learning rates
Improving adaGrad (cont)
$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}$
$\Rightarrow \; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}$
$\Rightarrow \; \mathrm{diag}(H) \approx \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta:
$\mathrm{diag}(H_t) = \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}$
$w_{t+1} = \arg\min_w \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)$
$\Rightarrow \; w_{t+1} = w_t - \frac{RMS[\Delta w]_{t-1}}{RMS[g]_t} \, g_t \qquad (12)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
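A sketch of update (12) (with $\rho$ and $\delta$ assumed): the RMS of past parameter updates stands in for the hand-tuned global learning rate.

```python
# adaDelta: the step size is the ratio of two running RMS quantities,
# so no global learning rate eta appears in the update.
import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, steps=5000):
    w = np.asarray(w0, dtype=float).copy()
    Eg = np.zeros_like(w)     # running average of g^2
    Edw = np.zeros_like(w)    # running average of (delta w)^2
    for _ in range(steps):
        g = grad(w)
        Eg = rho * Eg + (1.0 - rho) * g ** 2
        dw = -(np.sqrt(Edw + delta) / np.sqrt(Eg + delta)) * g  # RMS ratio
        Edw = rho * Edw + (1.0 - rho) * dw ** 2
        w += dw
    return w
```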
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
$w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \qquad (13)$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
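A sketch of update (13) with the bias-corrected moments; the defaults for $\beta_1$, $\beta_2$, $\delta$ follow common practice and are otherwise our assumptions.

```python
# ADAM: momentum on the gradient (m) plus an RMSprop-style second moment (v),
# both corrected for their initialisation bias toward zero.
import numpy as np

def adam(grad, w0, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8, steps=5000):
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)                       # first moment (momentum)
    v = np.zeros_like(w)                       # second moment
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)         # bias correction of the mean
        v_hat = v / (1.0 - beta2 ** t)         # bias correction of the variance
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w
```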
Advanced adaptive learning rates
Demo time
Wikipedia page "Test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3: learn the (global) learning rates as well
$w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta; w_1, \{g_i\}_{i \le t-1})$
$w_t$ is a function of $\eta$;
learn $\eta$ (with gradient descent) by
$\eta^* = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))$
- reverse-mode differentiation to back-track the learning process [Maclaurin et al 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D., Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
References
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
- Recent successes of machine learning rely on stochastic approximation and optimisation methods
- Theoretical analysis guarantees good behaviour
- Adaptive learning rates provide good performance in practice
- Future work will reduce the labour of tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Discrete Gradient Descent
1. Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^*$

Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$.

Because of the error that the discretisation introduced, $h_n$ is not guaranteed to be decreasing - we have to adjust the proof.

Idea: $h_n$ may fluctuate, but if we can show that the cumulative 'up' movements aren't too big, we can still prove convergence of $h_n$. (Illustration credit: Online Learning and Stochastic Approximation, Leon Bottou, 1998.)

22 of 27
Discrete Gradient Descent
1. Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^*$

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

With some algebra, note that

$$\begin{aligned}
h_{n+1} - h_n &= \langle s_{n+1} - x^*, s_{n+1} - x^*\rangle - \langle s_n - x^*, s_n - x^*\rangle \\
&= \langle s_{n+1}, s_{n+1}\rangle - \langle s_n, s_n\rangle - 2\langle s_{n+1} - s_n, x^*\rangle \\
&= \langle s_n - \varepsilon_n \nabla C(s_n), s_n - \varepsilon_n \nabla C(s_n)\rangle - \langle s_n, s_n\rangle + 2\varepsilon_n\langle \nabla C(s_n), x^*\rangle \\
&= -2\varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\|\nabla C(s_n)\|_2^2
\end{aligned}$$

22 of 27
Discrete Gradient Descent
1. Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^*$

Show that $\sum_{n=1}^\infty h_n^+ < \infty$.

Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get

$$h_{n+1} - h_n \le -2\varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2(A + Bh_n)$$

$$\Rightarrow\; h_{n+1} - (1 + \varepsilon_n^2 B)h_n \le -2\varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A \le \varepsilon_n^2 A$$

22 of 27
Discrete Gradient Descent
1. Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^*$

Show that $\sum_{n=1}^\infty h_n^+ < \infty$.

Assuming $\|\nabla C(x)\|_2^2 \le A + B\|x - x^*\|_2^2$, we get

$$h_{n+1} - (1 + \varepsilon_n^2 B)h_n \le \varepsilon_n^2 A$$

Writing $\mu_n = \prod_{i=1}^n \frac{1}{1+\varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$, we get

$$h'_{n+1} - h'_n \le \varepsilon_n^2 \mu_n A$$

Assuming $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, $\mu_n$ converges away from 0 (since $\sum_i \log(1+\varepsilon_i^2 B) \le B\sum_i \varepsilon_i^2 < \infty$), so the RHS is summable.

22 of 27
Discrete Gradient Descent
1. Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^*$

Show that this implies that $h_n$ converges.

If $\sum_{n=1}^\infty \max(0, h_{n+1} - h_n) < \infty$, then $\sum_{n=1}^\infty \min(0, h_{n+1} - h_n) > -\infty$ (why? since $h_n \ge 0$, the total downward movement is bounded by $h_1$ plus the total upward movement).

But $h_{n+1} = h_1 + \sum_{k=1}^n \big(\max(0, h_{k+1} - h_k) + \min(0, h_{k+1} - h_k)\big)$.

So $(h_n)_{n=1}^\infty$ converges.

22 of 27
Discrete Gradient Descent
1. Define Lyapunov sequence $h_n = \|s_n - x^*\|_2^2$
2. Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$
3. Show that $\sum_{n=1}^\infty h_n^+ < \infty$
4. Show that this implies that $h_n$ converges
5. Show that $h_n$ must converge to 0
6. $s_n \to x^*$

Show that $h_n$ must converge to 0. Assume $h_n$ converges to some positive number. Previously we had

$$h_{n+1} - h_n = -2\varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\|\nabla C(s_n)\|_2^2$$

This is summable, and if we assume further that $\sum_{n=1}^\infty \varepsilon_n = \infty$, then we get

$$\langle s_n - x^*, \nabla C(s_n)\rangle \to 0,$$

contradicting $h_n$ converging away from 0.

22 of 27
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$$\nabla C(x) = \frac{1}{N}\sum_{n=1}^N f_n(x)$$

and an approximation is formed by subsampling:

$$\widehat{\nabla C}(x) = \frac{1}{K}\sum_{k\in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)$$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.

23 of 27
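A minimal sketch of the subsampled estimator above, assuming a toy one-dimensional least-squares cost; the dataset and batch size K are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=1.0, size=100)   # toy 1-d dataset (our assumption)

    def full_grad(x):
        """True gradient of C(x) = (1/2N) * sum_n (x - d_n)^2: the average of f_n(x) = x - d_n."""
        return np.mean(x - data)

    def subsampled_grad(x, K=10):
        """Unbiased estimate: average f_k over a uniform random subset I_K of size K."""
        idx = rng.choice(data.size, size=K, replace=False)
        return np.mean(x - data[idx])

    x = 0.0
    print(full_grad(x), subsampled_grad(x))   # the second is an unbiased, noisy version of the first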
Measure-theory-free martingale theory

Let $(X_n)_{n\ge 0}$ be a stochastic process.

$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^\infty$ is a martingale³ if
- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^\infty$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^\infty$ is a positive stochastic process and

$$\sum_{n=1}^\infty \mathbb{E}\left[(\mathbb{E}[X_{n+1}\mid\mathcal{F}_n] - X_n)\,\mathbb{1}_{\{\mathbb{E}[X_{n+1}\mid\mathcal{F}_n] - X_n > 0\}}\right] < \infty$$

then $X_n \to X_\infty$ almost surely.

³ on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^\infty, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.

24 of 27
Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies:

1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x)\rangle > 0$
3. $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.

25 of 27
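A sketch of the proposition's setting: SGD with an unbiased noisy gradient and Robbins-Monro steps $\varepsilon_n = 1/n$ (so $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$) on a strongly convex quadratic with minimiser $x^*$; the noise scale and cost function are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    x_star = np.array([1.0, -2.0])      # unique minimiser of C(s) = 0.5 * ||s - x_star||^2

    def H(s):
        """Unbiased gradient estimate: the true gradient (s - x_star) plus zero-mean noise."""
        return (s - x_star) + rng.normal(size=2)

    s = np.zeros(2)
    for n in range(1, 100001):
        eps = 1.0 / n                   # Robbins-Monro steps: sum eps_n = inf, sum eps_n^2 < inf
        s = s - eps * H(s)
    print(s)                             # ends close to x_star, as the proposition predicts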
Stochastic Gradient Descent

1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$

26 of 27
Stochastic Gradient Descent

1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$

Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.

By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n\langle s_n - x^*, H_n(s_n)\rangle + \varepsilon_n^2\|H_n(s_n)\|_2^2$$

So

$$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2\,\mathbb{E}\left[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\right]$$

26 of 27
Stochastic Gradient Descent

1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$

Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:
- Assume $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^*\|_2^2$
- Introduce $\mu_n = \prod_{i=1}^n \frac{1}{1+\varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

Get

$$\mathbb{E}\left[h'_{n+1} - h'_n \mid \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$$

$$\Rightarrow\; \mathbb{E}\left[(h'_{n+1} - h'_n)\,\mathbb{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$$

$\sum_{n=1}^\infty \varepsilon_n^2 < \infty$ $\Rightarrow$ quasimartingale convergence $\Rightarrow$ $(h'_n)_{n=1}^\infty$ converges a.s. $\Rightarrow$ $(h_n)_{n=1}^\infty$ converges a.s.

26 of 27
Stochastic Gradient Descent

1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to 0 almost surely
5. $s_n \to x^*$

Show that $h_n$ must converge to 0 almost surely. From previous calculations,

$$\mathbb{E}\left[h_{n+1} - (1 + \varepsilon_n^2 B)h_n \mid \mathcal{F}_n\right] \le -2\varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle + \varepsilon_n^2 A$$

$(h_n)_{n=1}^\infty$ converges, so the left-hand side is summable a.s. Assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the right term is summable a.s., and hence so is the inner-product term:

$$\sum_{n=1}^\infty \varepsilon_n\langle s_n - x^*, \nabla C(s_n)\rangle < \infty \quad \text{almost surely}$$

If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces

$$\langle s_n - x^*, \nabla C(s_n)\rangle \to 0 \quad \text{almost surely}$$

And by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely.

26 of 27
Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: Assume

- $C$ is a non-negative, three-times differentiable function
- The Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$
- The gradient estimate $H$ satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$
- There exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum or saddle), rather than reaching the global optimum.

27 of 27
Motivation
So far we've told you why SGD is "safe" :)

but Robbins-Monro is just a sufficient condition.

Then how to choose learning rates to achieve
- fast convergence
- a better local optimum?
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1: search for the best learning rate schedule
- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time...

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2: let the learning rates adapt themselves
- pre-conditioning with |diag(H)| [Becker and LeCun 1988]
- adaGrad [Duchi et al 2010]
- adaDelta [Zeiler 2012], RMSprop [Tieleman and Hinton 2012]
- ADAM [Kingma and Ba 2014]
- Equilibrated SGD [Dauphin et al 2015] (this year's NIPS)

now we'll talk a bit about online learning
- a good tutorial [Shalev-Shwartz 2012]
- milestone paper [Zinkevich 2003]

and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x\sim\mathcal{D}}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by
- processing one / a mini-batch of datapoints each iteration
- considering gradients with data-point cost functions:

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Let's forget the cost function on the batch for a moment.

online gradient descent (OGD):
- each iteration $t$, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

for SGD, the received gradient is defined by

$$g_t = \frac{1}{M}\sum_{m=1}^M \nabla c(w, x_m)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes:
- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:

$$R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w\in S}\sum_{t=1}^T f_t(w) \quad (1)$$

General definition: $R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right]$

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL):

$$w_{t+1} = \arg\min_{w\in S}\sum_{i=1}^t f_i(w) \quad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL. Then $\forall u \in S$,

$$R_T(u) = \sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right] \quad (3)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be

$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}$$

FTL is easily fooled: each round it is dragged to the endpoint that the next loss punishes, so it pays loss 1 in every round after the first while the fixed point $u = 0$ pays 0 (see the sketch after this slide).
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
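To see the fooling in action, a short sketch (ours, for illustration) that plays this game against FTL; on $S = [-1, 1]$ the FTL minimiser of a cumulative linear loss is an endpoint, and we start FTL from $w_1 = 0$.

    def ftl_game(T=20):
        """Play the slide's +/- w game against Follow the Leader on S = [-1, 1].

        Each loss is linear, f_t(w) = c_t * w, so the FTL iterate
        argmin_{w in [-1, 1]} (sum_i c_i) * w is just an endpoint (or 0 on a tie).
        """
        total_ftl, cum, w = 0.0, 0.0, 0.0       # start FTL at w_1 = 0 (our choice)
        for t in range(1, T + 1):
            c = -0.5 if t == 1 else (1.0 if t % 2 == 0 else -1.0)
            total_ftl += c * w                   # loss FTL suffers this round
            cum += c                             # coefficient of the cumulative loss
            w = 0.0 if cum == 0 else (-1.0 if cum > 0 else 1.0)  # next FTL iterate
        return total_ftl                         # u = 0 would pay 0 in every round

    print(ftl_game())   # grows like T - 1: linear regret, so FTL is fooled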
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL):

$$w_{t+1} = \arg\min_{w\in S}\sum_{i=1}^t f_i(w) + \varphi(w) \quad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL. Then $\forall u \in S$,

$$\sum_{t=1}^T \left[f_t(w_t) - f_t(u)\right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T \left[f_t(w_t) - f_t(w_{t+1})\right]$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle, \quad \forall g(w) \in \partial f_t(w)$$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g_t\rangle$ as a surrogate:

$$\sum_{t=1}^T f_t(w_t) - f_t(u) \le \sum_{t=1}^T \langle w_t, g_t\rangle - \langle u, g_t\rangle, \quad \forall g_t \in \partial f_t(w_t)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \quad (5)$$

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (upper bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \arg\min_{w\in S}\sum_{i=1}^t \langle w, g_i\rangle + \varphi(w)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \quad (5)$$

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$. Then for all $u \in S$,

$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta\sum_{t=1}^T \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
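A sketch of OGD (5) on a toy online least-squares stream, with $\eta = 1/\sqrt{T}$ to match the bound above; the data distribution and the comparator are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    u_true = np.array([0.5, -0.3])               # a good fixed comparator u

    def ogd(T=5000, D=2):
        """OGD (5) on squared losses f_t(w) = 0.5 * (<w, x_t> - y_t)^2."""
        eta = 1.0 / np.sqrt(T)                   # step size matching the regret bound
        w, regret = np.zeros(D), 0.0
        for _ in range(T):
            x = rng.normal(size=D)
            y = u_true @ x + 0.1 * rng.normal()  # noisy linear targets
            g = (w @ x - y) * x                  # (sub)gradient of f_t at w_t
            regret += 0.5 * (w @ x - y) ** 2 - 0.5 * (u_true @ x - y) ** 2
            w = w - eta * g                      # the OGD update (5)
        return w, regret

    w, regret = ogd()
    print(w, regret)    # w approaches u_true; the regret stays O(sqrt(T))-sized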
Online learning
From batch to online learning (cont)
keep in mind for the later discussion of adaGrad:
- I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$
- then the L2 norm will be changed to the dual norm $\|\cdot\|_*$
- and we use Hölder's inequality for the proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle$$

Primal-dual sub-gradient update:

$$w_{t+1} = \arg\min_w \; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \quad (6)$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \arg\min_w \; \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \quad (7)$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^t g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$w_{t+1} = \arg\min_{w\in S} \; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t}w^T H_t w \quad (8)$$

$$w_{t+1} = \arg\min_{w\in S} \; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \quad (9)$$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time: $\psi_t(w) = \frac{1}{2}w^T H_t w$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \arg\min_w \; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T\left[\delta I + \mathrm{diag}(G_t^{1/2})\right](w - w_t)$$

$$\Rightarrow\; w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^t g_i^2}} \quad (10)$$

(all operations taken elementwise)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
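Update (10) in code: a minimal diagonal adaGrad sketch on a badly scaled quadratic, showing how the per-coordinate accumulator equalises the effective step sizes; $\eta$ and the test problem are illustrative choices.

    import numpy as np

    def adagrad(grad, w, eta=0.5, delta=1e-8, steps=2000):
        """Diagonal adaGrad, i.e. update (10): step eta / (delta + sqrt(sum_i g_i^2)) per coordinate."""
        G = np.zeros_like(w)             # running sum of squared gradients
        for _ in range(steps):
            g = grad(w)
            G += g**2
            w = w - eta * g / (delta + np.sqrt(G))
        return w

    # badly scaled quadratic: the accumulator equalises the per-coordinate step sizes
    grad = lambda w: np.array([100.0, 1.0]) * w
    print(adagrad(grad, np.array([3.0, -2.0])))   # should approach [0, 0]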
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY?

time-varying learning rate vs. time-varying proximal term:
- I want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper bounded by $\|g_t\|_2$
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot,\, H_t\, \cdot\rangle}$
- $\psi_t(w) = \frac{1}{2}w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that

$$\sum_{t=1}^T \|g_t\|_{\psi_t^*}^2 \le 2\sum_{d=1}^D \|g_{1:T,d}\|_2$$
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Define Lyapunov sequence hn = sn minus x22
Because of the error that the discretisation introduced hn is notguaranteed to be decreasing - have to adjust proof
Idea hn may fluctuate but if we can showthat the cumulative lsquouprsquo movements arenrsquottoo big we can still prove convergence ofhn Illustration credit Online learning
and stochastic approximation LeonBottou (1998)
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
With some algebra note that
hn+1 minus hn = 〈sn+1 minus x sn+1 minus x〉 minus 〈sn minus x sn minus x〉= 〈sn+1 sn+1〉 minus 〈sn sn〉 minus 2 〈sn+1 minus sn x
〉= 〈sn minus εnnablaC (sn) sn minus εnnablaC (sn)〉 minus 〈sn sn〉+ 2εn 〈nablaC (sn) x〉= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nnablaC (sn)22
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Show that $h_n$ must converge to $0$.
Assume $h_n$ converges to some positive number. Previously we had
\[
h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \|\nabla C(s_n)\|_2^2
\]
This is summable, and if we assume further that $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, then we get
\[
\langle s_n - x^*, \nabla C(s_n) \rangle \to 0,
\]
contradicting $h_n$ converging away from $0$.
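To make the six steps concrete, here is a minimal numerical sketch; the quadratic cost, the minimiser `x_star`, and the step-size choice $\varepsilon_n = 1/n$ are illustrative assumptions, not from the slides. It just tracks the Lyapunov sequence $h_n$ being driven to $0$:

```python
import numpy as np

# Minimal sketch: gradient descent on C(x) = 0.5 * ||x - x_star||^2,
# with Robbins-Monro step sizes eps_n = 1/n. We track the Lyapunov
# sequence h_n = ||s_n - x_star||^2 and watch it shrink towards 0.
x_star = np.array([1.0, -2.0])           # unique minimiser (assumed)
grad_C = lambda x: x - x_star            # gradient of the quadratic cost

s = np.zeros(2)                          # s_1: starting point
for n in range(1, 10001):
    eps = 1.0 / n                        # sum eps = inf, sum eps^2 < inf
    s = s - eps * grad_C(s)              # s_{n+1} = s_n - eps_n * grad C(s_n)

print(np.sum((s - x_star) ** 2))         # h_n: close to 0
```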
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.
If $C$ is a cost function averaged across a data set, we often have a true gradient of the form
\[
\nabla C(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x)
\]
and an approximation is formed by subsampling:
\[
\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K)
\]
We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent.
The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
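As an aside, here is a minimal sketch of the subsampling estimator above; the least-squares cost and all names (`A`, `b`, `grad_minibatch`) are hypothetical choices for illustration:

```python
import numpy as np

# Sketch of the subsampled gradient estimator, assuming a least-squares
# cost: grad C(x) = (1/N) sum_n a_n (a_n^T x - b_n).
rng = np.random.default_rng(0)
N, D, K = 1000, 5, 32                    # data set size, dimension, minibatch size
A, b = rng.normal(size=(N, D)), rng.normal(size=N)

def grad_full(x):
    return A.T @ (A @ x - b) / N         # true gradient: average over all N terms

def grad_minibatch(x):
    idx = rng.choice(N, size=K, replace=False)   # I_K ~ Unif(subsets of size K)
    return A[idx].T @ (A[idx] @ x - b[idx]) / K  # unbiased estimate of grad C(x)

x = np.zeros(D)
print(grad_full(x)[:3], grad_minibatch(x)[:3])   # estimate fluctuates around truth
```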
Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process.
$\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition. A stochastic process $(X_n)_{n=0}^{\infty}$ is a martingale if
- $\mathbb{E}[|X_n|] < \infty$ for all $n$
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$

(Formally: on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^{\infty}, \mathbb{P})$, with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.)

Martingale convergence theorem (Doob, 1953). If $(X_n)_{n=1}^{\infty}$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965). If $(X_n)_{n=1}^{\infty}$ is a positive stochastic process and
\[
\sum_{n=1}^{\infty} \mathbb{E}\left[ \big( \mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n \big) \, \mathbf{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}} \right] < \infty
\]
then $X_n \to X_\infty$ almost surely.
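A quick simulation, under assumptions of our own choosing (a multiplicative process with $\mathbb{E}[U_n] = 1$), illustrates the flavour of these results: a positive martingale with bounded expectation converges almost surely (here, to $0$):

```python
import numpy as np

# Illustration (not from the slides): a positive martingale
# X_{n+1} = X_n * U_n with E[U_n] = 1, so E[X_n] = 1 for all n and
# sup_n E[|X_n|] < inf. Doob's theorem says X_n converges almost surely.
rng = np.random.default_rng(1)
paths = np.ones(5)                                # five sample paths
for n in range(2000):
    U = rng.choice([0.5, 1.5], size=paths.shape)  # E[U] = 1
    paths = paths * U
print(paths)   # each path has (almost surely) collapsed towards 0
```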
Stochastic Gradient Descent

Proposition. If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^*$
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0$
3. $\mathbb{E}\big[ \|H_n(x)\|_2^2 \big] \le A + B \|x - x^*\|_2^2$ for some $A, B \ge 0$ independent of $n$

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^*$ almost surely.

Proof structure: the proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement give us the control we need over the unbiased gradient estimators.
Stochastic Gradient Descent

1. Define Lyapunov process $h_n = \|s_n - x^*\|_2^2$
2. Consider the variations $h_{n+1} - h_n$
3. Show that $h_n$ converges almost surely
4. Show that $h_n$ must converge to $0$ almost surely
5. $s_n \to x^*$
Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$.
By exactly the same calculation as in the deterministic discrete case,
\[
h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2
\]
So
\[
\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 \, \mathbb{E}\big[ \|H_n(s_n)\|_2^2 \mid \mathcal{F}_n \big]
\]
Show that $h_n$ converges almost surely. Exactly the same as for discrete gradient descent:
- assume $\mathbb{E}\big[ \|H_n(x)\|_2^2 \big] \le A + B \|x - x^*\|_2^2$
- introduce $\mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$

We get
\[
\mathbb{E}\big[ h'_{n+1} - h'_n \mid \mathcal{F}_n \big] \le \varepsilon_n^2 \mu_n A
\implies
\mathbb{E}\big[ (h'_{n+1} - h'_n) \mathbf{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \big] \le \varepsilon_n^2 \mu_n A
\]
and $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$ $\implies$ quasimartingale convergence $\implies$ $(h'_n)_{n=1}^{\infty}$ converges a.s. $\implies$ $(h_n)_{n=1}^{\infty}$ converges a.s.
Show that $h_n$ must converge to $0$ almost surely. From previous calculations,
\[
\mathbb{E}\big[ h_{n+1} - (1 + \varepsilon_n^2 B) h_n \mid \mathcal{F}_n \big] \le -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A
\]
$(h_n)_{n=1}^{\infty}$ converges, so the left-hand side is summable a.s.; we assume $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, so the $\varepsilon_n^2 A$ term is summable a.s., and hence so is the remaining term:
\[
\sum_{n=1}^{\infty} \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty \quad \text{almost surely.}
\]
If we assume in addition that $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, this forces
\[
\langle s_n - x^*, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely,}
\]
and by our initial assumptions about $C$, $(h_n)_{n=1}^{\infty}$ must converge to $0$ almost surely.
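Putting the proposition to work on a toy problem, here is a minimal sketch; the least-squares cost, the single-sample estimator, and the rate $\varepsilon_n = 1/n$ are assumptions for illustration. Conditions 1-3 hold for this cost, and the Lyapunov process ends near $0$:

```python
import numpy as np

# Minimal sketch (least-squares cost assumed, not from the slides):
# s_{n+1} = s_n - eps_n * H_n(s_n), with H_n an unbiased single-sample
# gradient estimate and Robbins-Monro rates eps_n = 1/n.
rng = np.random.default_rng(2)
N, D = 1000, 5
A = rng.normal(size=(N, D))
x_star = rng.normal(size=D)
b = A @ x_star                           # noiseless targets: unique minimiser x_star

s = np.zeros(D)
for n in range(1, 100001):
    i = rng.integers(N)                  # single random data point
    H = A[i] * (A[i] @ s - b[i])         # unbiased estimate of grad C(s)
    s = s - (1.0 / n) * H                # eps_n = 1/n

print(np.sum((s - x_star) ** 2))         # Lyapunov h_n: small
```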
Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume
- $C$ is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$
- the gradient estimate $H$ satisfies $\mathbb{E}\big[ \|H_n(x)\|^k \big] \le A_k + B_k \|x\|^k$ for $k = 2, 3, 4$
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^{\infty}$ is confined to a bounded neighbourhood of $0$ almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges to $0$ almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
Motivation

So far we've told you why SGD is "safe" :)
- but Robbins-Monro is just a sufficient condition
- so how do we choose learning rates to achieve
  - fast convergence?
  - a better local optimum?
Motivation (cont.)

[Figure from Bergstra and Bengio, 2012]

Idea 1: search for the best learning rate schedule
- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy, and I don't want to spend too much time...
Motivation (cont.)

Idea 2: let the learning rates adapt themselves
- pre-conditioning with $|\mathrm{diag}(H)|$ [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning
- a good tutorial: [Shalev-Shwartz, 2012]
- milestone paper: [Zinkevich, 2003]

...and some practical comparisons.
Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function is $C(\mathbf{w}, \mathcal{D}) = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}[c(\mathbf{w}, \mathbf{x})]$

SGD / mini-batch learning accelerates training in real time by
- processing one / a mini-batch of datapoints each iteration
- considering gradients with data point cost functions:
\[
\nabla C(\mathbf{w}, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(\mathbf{w}, \mathbf{x}_m)
\]
From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):
- each iteration $t$, we receive a loss $f_t(\mathbf{w})$ and a (noisy) (sub-)gradient $\mathbf{g}_t \in \partial f_t(\mathbf{w})$
- then we update the weights with $\mathbf{w} \leftarrow \mathbf{w} - \eta \mathbf{g}_t$

For SGD, the received gradient is defined by
\[
\mathbf{g}_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(\mathbf{w}, \mathbf{x}_m)
\]
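A minimal OGD loop, under assumptions of our own (quadratic per-round losses, a fixed comparator `u`, a hand-picked `eta`), shows the average regret shrinking:

```python
import numpy as np

# Minimal OGD sketch: at each round t we receive f_t(w) = 0.5*(a_t.w - b_t)^2,
# compute g_t = grad f_t(w_t), and update w <- w - eta * g_t.
rng = np.random.default_rng(3)
D, T, eta = 5, 5000, 0.01
w = np.zeros(D)
u = rng.normal(size=D)                   # hidden comparator generating the data

regret = 0.0
for t in range(T):
    a = rng.normal(size=D)
    b = a @ u
    f = lambda w_: 0.5 * (a @ w_ - b) ** 2
    g = a * (a @ w - b)                  # (sub-)gradient of f_t at w_t
    regret += f(w) - f(u)                # instantaneous regret against u
    w = w - eta * g

print(regret / T)                        # average regret -> 0 (sublinear regret)
```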
From batch to online learning (cont.)

Online learning assumes:
- each iteration $t$, we receive a loss $f_t(\mathbf{w})$ (in the batch learning context, $f_t(\mathbf{w}) = c(\mathbf{w}, \mathbf{x}_t)$)
- then we learn $\mathbf{w}_{t+1}$ following some rules

Performance is measured by regret:
\[
R_T^* = \sum_{t=1}^{T} f_t(\mathbf{w}_t) - \inf_{\mathbf{w} \in S} \sum_{t=1}^{T} f_t(\mathbf{w}) \tag{1}
\]
General definition: $R_T(\mathbf{u}) = \sum_{t=1}^{T} \left[ f_t(\mathbf{w}_t) - f_t(\mathbf{u}) \right]$

SGD $\Leftrightarrow$ "follow the regularized leader" with $L_2$ regularization.
From batch to online learning (cont.)

Follow the Leader (FTL):
\[
\mathbf{w}_{t+1} = \underset{\mathbf{w} \in S}{\operatorname{argmin}} \; \sum_{i=1}^{t} f_i(\mathbf{w}) \tag{2}
\]

Lemma (upper bound of regret). Let $\mathbf{w}_1, \mathbf{w}_2, \dots$ be the sequence produced by FTL; then $\forall \mathbf{u} \in S$,
\[
R_T(\mathbf{u}) = \sum_{t=1}^{T} \left[ f_t(\mathbf{w}_t) - f_t(\mathbf{u}) \right] \le \sum_{t=1}^{T} \left[ f_t(\mathbf{w}_t) - f_t(\mathbf{w}_{t+1}) \right] \tag{3}
\]
From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and the loss function at time $t$ be
\[
f_t(w) =
\begin{cases}
-0.5\,w & t = 1 \\
w & t \text{ is even} \\
-w & t > 1 \text{ and } t \text{ is odd}
\end{cases}
\]
FTL is easily fooled: the sign of the cumulative loss alternates, so FTL's iterate jumps between the endpoints $\pm 1$ and incurs loss $1$ on every round after the first, giving linear regret.
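A few lines of simulation confirm the linear regret (the starting point and the argmin tie-breaking rule are our own assumptions):

```python
def coeff(t):                            # f_t(w) = coeff(t) * w, from the slide
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T, w, total_loss, cum = 1000, 0.0, 0.0, 0.0
for t in range(1, T + 1):
    total_loss += coeff(t) * w           # loss FTL suffers at round t
    cum += coeff(t)                      # coefficient of the cumulative loss
    w = -1.0 if cum > 0 else 1.0         # FTL: argmin over [-1, 1] of cum * w

S = sum(coeff(t) for t in range(1, T + 1))
best_fixed = -abs(S)                     # loss of the best fixed w in [-1, 1]
print(total_loss - best_fixed)           # regret ~ T (linear), as claimed
```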
From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):
\[
\mathbf{w}_{t+1} = \underset{\mathbf{w} \in S}{\operatorname{argmin}} \; \sum_{i=1}^{t} f_i(\mathbf{w}) + \varphi(\mathbf{w}) \tag{4}
\]

Lemma (upper bound of regret). Let $\mathbf{w}_1, \mathbf{w}_2, \dots$ be the sequence produced by FTRL; then $\forall \mathbf{u} \in S$,
\[
\sum_{t=1}^{T} \left[ f_t(\mathbf{w}_t) - f_t(\mathbf{u}) \right] \le \varphi(\mathbf{u}) - \varphi(\mathbf{w}_1) + \sum_{t=1}^{T} \left[ f_t(\mathbf{w}_t) - f_t(\mathbf{w}_{t+1}) \right]
\]
From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex:
\[
\forall \mathbf{w}, \mathbf{u} \in S: \quad f_t(\mathbf{u}) - f_t(\mathbf{w}) \ge \langle \mathbf{u} - \mathbf{w}, \mathbf{g}(\mathbf{w}) \rangle \quad \forall \mathbf{g}(\mathbf{w}) \in \partial f_t(\mathbf{w})
\]
and use the linearisation $\tilde{f}_t(\mathbf{w}) = f_t(\mathbf{w}_t) + \langle \mathbf{w}, \mathbf{g}(\mathbf{w}_t) \rangle$ as a surrogate:
\[
\sum_{t=1}^{T} f_t(\mathbf{w}_t) - f_t(\mathbf{u}) \le \sum_{t=1}^{T} \langle \mathbf{w}_t, \mathbf{g}_t \rangle - \langle \mathbf{u}, \mathbf{g}_t \rangle \quad \forall \mathbf{g}_t \in \partial f_t(\mathbf{w}_t)
\]
From batch to online learning (cont.)

Online Gradient Descent (OGD):
\[
\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \mathbf{g}_t, \qquad \mathbf{g}_t \in \partial f_t(\mathbf{w}_t) \tag{5}
\]

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (it upper bounds the LHS)
- use the $L_2$ regularizer $\varphi(\mathbf{w}) = \frac{1}{2\eta} \|\mathbf{w} - \mathbf{w}_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:
\[
\mathbf{w}_{t+1} = \underset{\mathbf{w} \in S}{\operatorname{argmin}} \; \sum_{i=1}^{t} \langle \mathbf{w}, \mathbf{g}_i \rangle + \varphi(\mathbf{w})
\]
From batch to online learning (cont.)

Theorem (regret bound for online gradient descent). Assume we run online gradient descent on convex loss functions $f_1, f_2, \dots, f_T$ with regularizer $\varphi(\mathbf{w}) = \frac{1}{2\eta} \|\mathbf{w} - \mathbf{w}_1\|_2^2$; then for all $\mathbf{u} \in S$,
\[
R_T(\mathbf{u}) \le \frac{1}{2\eta} \|\mathbf{u} - \mathbf{w}_1\|_2^2 + \eta \sum_{t=1}^{T} \|\mathbf{g}_t\|_2^2, \qquad \mathbf{g}_t \in \partial f_t(\mathbf{w}_t)
\]

Keep in mind for the later discussion of adaGrad:
- I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$
- then the $L_2$ norm will be changed to the dual norm $\|\cdot\|_*$
- and we use Hölder's inequality in the proof
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and
\[
B_{\psi_t}(\mathbf{w}, \mathbf{w}_t) = \psi_t(\mathbf{w}) - \psi_t(\mathbf{w}_t) - \langle \nabla \psi_t(\mathbf{w}_t), \mathbf{w} - \mathbf{w}_t \rangle
\]

Primal-dual sub-gradient update:
\[
\mathbf{w}_{t+1} = \underset{\mathbf{w}}{\operatorname{argmin}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + \frac{1}{t} \psi_t(\mathbf{w}) \tag{6}
\]

Proximal gradient / composite mirror descent:
\[
\mathbf{w}_{t+1} = \underset{\mathbf{w}}{\operatorname{argmin}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + B_{\psi_t}(\mathbf{w}, \mathbf{w}_t) \tag{7}
\]
adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:
\[
G_t = \sum_{i=1}^{t} \mathbf{g}_i \mathbf{g}_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})
\]
\[
\mathbf{w}_{t+1} = \underset{\mathbf{w} \in S}{\operatorname{argmin}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + \frac{1}{2t} \mathbf{w}^T H_t \mathbf{w} \tag{8}
\]
\[
\mathbf{w}_{t+1} = \underset{\mathbf{w} \in S}{\operatorname{argmin}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^T H_t (\mathbf{w} - \mathbf{w}_t) \tag{9}
\]

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:
\[
\psi_t(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T H_t \mathbf{w}
\]
adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(\mathbf{w}) = 0$:
\[
\mathbf{w}_{t+1} = \underset{\mathbf{w}}{\operatorname{argmin}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^T \big[ \delta I + \mathrm{diag}(G_t)^{1/2} \big] (\mathbf{w} - \mathbf{w}_t)
\]
\[
\implies \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta \, \mathbf{g}_t}{\delta + \sqrt{\sum_{i=1}^{t} \mathbf{g}_i^2}} \tag{10}
\]
(all operations elementwise)
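Update (10) is a one-liner in code. Here is a minimal sketch on a toy quadratic; the objective, the noise level, and the hyper-parameter values are our own assumptions:

```python
import numpy as np

# Minimal sketch of the adaGrad update (10) on a noisy toy quadratic.
rng = np.random.default_rng(4)
D, eta, delta = 5, 0.5, 1e-8
u = rng.normal(size=D)                   # minimiser of the toy objective
w = np.zeros(D)
G_diag = np.zeros(D)                     # running sum of squared gradients

for t in range(2000):
    g = w - u + 0.1 * rng.normal(size=D)          # noisy gradient of 0.5*||w-u||^2
    G_diag += g ** 2                              # diag(G_t) accumulates g_i^2
    w = w - eta * g / (delta + np.sqrt(G_diag))   # per-coordinate step sizes

print(np.abs(w - u).max())               # close to u despite the fixed global eta
```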
adaGrad [Duchi et al., 2010] (cont.)

WHY? Time-varying learning rate vs. time-varying proximal term:
- I want the contribution of the dual norms $\|\mathbf{g}_t\|_*^2$ induced by $\psi_t$ to be upper bounded by $\|\cdot\|_2$-norms of the gradient history
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \, \cdot \rangle}$
- $\psi_t(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T H_t \mathbf{w}$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math can show that
\[
\sum_{t=1}^{T} \|\mathbf{g}_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|\mathbf{g}_{1:T,d}\|_2
\]
adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper). Assume the sequence $\mathbf{w}_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|\mathbf{g}_t\|_\infty$; then for any $\mathbf{u} \in S$,
\[
R_T(\mathbf{u}) \le \frac{\delta}{\eta} \|\mathbf{u}\|_2^2 + \frac{1}{\eta} \|\mathbf{u}\|_\infty^2 \sum_{d=1}^{D} \|\mathbf{g}_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|\mathbf{g}_{1:T,d}\|_2
\]
For $\mathbf{w}_t$ generated using the composite mirror descent update (9),
\[
R_T(\mathbf{u}) \le \frac{1}{2\eta} \max_{t \le T} \|\mathbf{u} - \mathbf{w}_t\|_2^2 \sum_{d=1}^{D} \|\mathbf{g}_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|\mathbf{g}_{1:T,d}\|_2
\]
Improving adaGrad

Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning
- sensitive to initial conditions

Possible improving directions:
- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:
- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]
Improving adaGrad (cont.)

RMSprop (a running average in place of adaGrad's growing sum):
\[
G_t = \rho G_{t-1} + (1 - \rho) \mathbf{g}_t \mathbf{g}_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}
\]
\[
\mathbf{w}_{t+1} = \underset{\mathbf{w}}{\operatorname{argmin}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^T H_t (\mathbf{w} - \mathbf{w}_t)
\]
\[
\implies \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta \, \mathbf{g}_t}{\mathrm{RMS}[\mathbf{g}]_t}, \qquad \mathrm{RMS}[\mathbf{g}]_t = \mathrm{diag}(H_t) \tag{11}
\]
Improving adaGrad (cont.)

Using (approximate) second-order information: a Newton-like step satisfies
\[
\Delta \mathbf{w} \propto H^{-1} \mathbf{g} \propto \frac{\partial f / \partial \mathbf{w}}{\partial^2 f / \partial \mathbf{w}^2}
\;\implies\;
\frac{\partial^2 f}{\partial \mathbf{w}^2} \propto \frac{\partial f / \partial \mathbf{w}}{\Delta \mathbf{w}}
\;\implies\;
\mathrm{diag}(H) \approx \frac{\mathrm{RMS}[\mathbf{g}]_t}{\mathrm{RMS}[\Delta \mathbf{w}]_{t-1}}
\]
Improving adaGrad (cont.)

adaDelta (running averages plus the second-order approximation above):
\[
\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[\mathbf{g}]_t}{\mathrm{RMS}[\Delta \mathbf{w}]_{t-1}}
\]
\[
\mathbf{w}_{t+1} = \underset{\mathbf{w}}{\operatorname{argmin}} \; \langle \mathbf{w}, \mathbf{g}_t \rangle + \frac{1}{2} (\mathbf{w} - \mathbf{w}_t)^T H_t (\mathbf{w} - \mathbf{w}_t)
\]
\[
\implies \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\mathrm{RMS}[\Delta \mathbf{w}]_{t-1}}{\mathrm{RMS}[\mathbf{g}]_t} \, \mathbf{g}_t \tag{12}
\]
Improving adaGrad (cont.)

ADAM (momentum plus bias-corrected running averages):
\[
\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2
\]
\[
\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}
\]
\[
\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta \, \hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \delta} \tag{13}
\]
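For reference, here is a side-by-side sketch of update rules (11)-(13) on one noisy toy quadratic; the objective and the hyper-parameter values are common defaults assumed for illustration, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(5)
D, u = 5, np.arange(5.0)
grad = lambda w: w - u + 0.1 * rng.normal(size=D)   # noisy gradient oracle

def run(update, T=3000):
    w, state = np.zeros(D), {}
    for t in range(1, T + 1):
        w = update(w, grad(w), t, state)
    return np.abs(w - u).max()

def rmsprop(w, g, t, s, eta=0.01, rho=0.9, delta=1e-8):
    s["G"] = rho * s.get("G", np.zeros(D)) + (1 - rho) * g**2
    return w - eta * g / np.sqrt(delta + s["G"])            # eq. (11)

def adadelta(w, g, t, s, rho=0.95, delta=1e-6):
    s["G"] = rho * s.get("G", np.zeros(D)) + (1 - rho) * g**2
    dw = -np.sqrt(s.get("X", np.zeros(D)) + delta) / np.sqrt(s["G"] + delta) * g
    s["X"] = rho * s.get("X", np.zeros(D)) + (1 - rho) * dw**2
    return w + dw                                           # eq. (12)

def adam(w, g, t, s, eta=0.01, b1=0.9, b2=0.999, delta=1e-8):
    s["m"] = b1 * s.get("m", np.zeros(D)) + (1 - b1) * g
    s["v"] = b2 * s.get("v", np.zeros(D)) + (1 - b2) * g**2
    m_hat, v_hat = s["m"] / (1 - b1**t), s["v"] / (1 - b2**t)
    return w - eta * m_hat / (np.sqrt(v_hat) + delta)       # eq. (13)

for name, f in [("RMSprop", rmsprop), ("adaDelta", adadelta), ("ADAM", adam)]:
    print(name, run(f))
```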
Demo time

Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

Idea 3: learn the (global) learning rates as well.
\[
\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \mathbf{g}_t \implies \mathbf{w}_t = \mathbf{w}(t, \eta; \mathbf{w}_1, \{\mathbf{g}_i\}_{i \le t-1})
\]
so $\mathbf{w}_t$ is a function of $\eta$; learn $\eta$ (with gradient descent) by
\[
\eta^* = \underset{\eta}{\operatorname{argmin}} \sum_{s \le t} f_s(\mathbf{w}(s, \eta))
\]
- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
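One simple instantiation of the on-the-fly idea is a one-step-unrolled "hypergradient" update; the toy loss and the meta-rate `beta` are our own assumptions, and this is only in the spirit of the cited papers, not their exact algorithms:

```python
import numpy as np

# Since w_t = w_{t-1} - eta * g_{t-1}, the derivative of f_t(w_t) with
# respect to eta is <grad f_t(w_t), -g_{t-1}>. Doing SGD on eta itself:
#   eta <- eta + beta * <g_t, g_{t-1}>.
rng = np.random.default_rng(6)
D = 10
u = rng.normal(size=D)
grad = lambda w: w - u + 0.05 * rng.normal(size=D)   # toy quadratic loss

w, eta, beta = np.zeros(D), 0.001, 0.002
g_prev = grad(w)
for t in range(500):
    g = grad(w)
    eta = max(eta + beta * (g @ g_prev), 0.0)        # hypergradient step on eta
    w, g_prev = w - eta * g, g
print(eta, np.abs(w - u).max())                      # eta grew from its tiny start
```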
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.
References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2n(A + Bhn)
=rArr hn+1 minus (1 + ε2nB)hn le minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
le ε2nA
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show thatsuminfin
n=1 h+n ltinfin
Assuming nablaC (x)22 le A + Bx minus x2
2 we get
hn+1 minus (1minus ε2nB)hn le ε2
nA
Writing micron =prodn
i=11
1+ε2i B
and hprimen = micronhn we get
hprimen+1 minus hprimen le ε2nmicronA
Assumingsuminfin
n=1 ε2n ltinfin micron converges away from 0 so RHS is
summable22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Step 4: show that $h_n$ must converge to 0 almost surely. From the previous calculations,
$$\mathbb{E}\left[h_{n+1} - (1 - \varepsilon_n^2 B) h_n \mid \mathcal{F}_n\right] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle + \varepsilon_n^2 A.$$
$(h_n)_{n=1}^\infty$ converges, so this sequence is summable a.s. We assume $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$, so the right term is summable a.s., and hence the left term is also summable a.s.:
$$\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^\star, \nabla C(s_n) \rangle < \infty \quad \text{almost surely.}$$
If we assume in addition that $\sum_{n=1}^\infty \varepsilon_n = \infty$, this forces
$$\langle s_n - x^\star, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely,}$$
and by our initial assumptions about $C$, $(h_n)_{n=1}^\infty$ must converge to 0 almost surely. Step 5 is then immediate: $h_n \to 0$ means exactly that $s_n \to x^\star$.
Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume
- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^\infty \varepsilon_n = \infty$, $\sum_{n=1}^\infty \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$.

The idea for the proof is then:
1. For a given start position, the sequence $(s_n)_{n=1}^\infty$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this only guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
Motivation

So far we've told you why SGD is "safe" :)

But Robbins-Monro is just a sufficient condition; how, then, should we choose learning rates to achieve
- fast convergence
- a better local optimum?
(Figure from [Bergstra and Bengio, 2012].)

idea 1: search for the best learning rate schedule
- grid search, random search, cross validation
- Bayesian optimization

But I'm lazy and I don't want to spend too much time!
idea 2: let the learning rates adapt themselves
- pre-conditioning with $|\operatorname{diag}(H)|$ [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning
- a good tutorial: [Shalev-Shwartz, 2012]
- a milestone paper: [Zinkevich, 2003]

and some practical comparisons.
Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$
- the cost function is $C(\mathbf{w}, \mathcal{D}) = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}[c(\mathbf{w}, \mathbf{x})]$

SGD/mini-batch learning accelerates training in real time by
- processing one datapoint/a mini-batch of datapoints each iteration
- considering gradients with data-point cost functions:
$$\nabla C(\mathbf{w}, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(\mathbf{w}, \mathbf{x}_m)$$
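As a concrete illustration (a sketch using an assumed synthetic least-squares dataset, not from the slides), an unbiased mini-batch gradient estimate looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))                       # assumed dataset D
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

def minibatch_grad(w, M=32):
    # Unbiased estimate of grad C(w, D) for the mean-squared-error cost,
    # using a uniformly subsampled mini-batch of size M.
    idx = rng.choice(len(X), size=M, replace=False)
    Xb, yb = X[idx], y[idx]
    return (2.0 / M) * Xb.T @ (Xb @ w - yb)
```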
Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD):
- each iteration $t$, we receive a loss $f_t(\mathbf{w})$ and a (noisy) (sub-)gradient $\mathbf{g}_t \in \partial f_t(\mathbf{w})$
- then we update the weights with $\mathbf{w} \leftarrow \mathbf{w} - \eta \mathbf{g}_t$

For SGD, the received gradient is defined by
$$\mathbf{g}_t = \frac{1}{M} \sum_{m=1}^M \nabla c(\mathbf{w}, \mathbf{x}_m)$$
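A minimal OGD loop, reusing the `minibatch_grad` sketch above (the constant step size η here is an illustrative choice):

```python
import numpy as np

def ogd(w0, grad_fn, eta=0.05, T=500):
    # Online gradient descent: at each round t, receive a gradient g_t
    # and update w <- w - eta * g_t.
    w = w0.copy()
    for _ in range(T):
        w -= eta * grad_fn(w)
    return w

w_hat = ogd(np.zeros(5), minibatch_grad)   # approximately recovers the regression weights
```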
Online learning assumes:
- each iteration $t$, we receive a loss $f_t(\mathbf{w})$ (in the batch learning context, $f_t(\mathbf{w}) = c(\mathbf{w}, \mathbf{x}_t)$)
- then we learn $\mathbf{w}_{t+1}$ following some rules

Performance is measured by regret:
$$R_T^\star = \sum_{t=1}^T f_t(\mathbf{w}_t) - \inf_{\mathbf{w} \in S} \sum_{t=1}^T f_t(\mathbf{w}) \qquad (1)$$
General definition: $R_T(\mathbf{u}) = \sum_{t=1}^T \left[f_t(\mathbf{w}_t) - f_t(\mathbf{u})\right]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.
Follow the Leader (FTL):
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w} \in S} \sum_{i=1}^t f_i(\mathbf{w}) \qquad (2)$$

Lemma (upper bound of regret): Let $\mathbf{w}_1, \mathbf{w}_2, \ldots$ be the sequence produced by FTL; then for all $\mathbf{u} \in S$,
$$R_T(\mathbf{u}) = \sum_{t=1}^T \left[f_t(\mathbf{w}_t) - f_t(\mathbf{u})\right] \le \sum_{t=1}^T \left[f_t(\mathbf{w}_t) - f_t(\mathbf{w}_{t+1})\right]. \qquad (3)$$
A game that fools FTL: let $w \in S = [-1, 1]$, and let the loss function at time $t$ be
$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd.} \end{cases}$$
FTL is easily fooled: it always jumps to the endpoint that the next loss penalises, as the small simulation below makes visible.
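A simulation of this game (our own sketch; FTL's first play $w_1 = 0$ is an arbitrary assumption, since round 1 has no history to minimise):

```python
def loss(t, w):
    # The loss sequence from the slide, on S = [-1, 1].
    if t == 1:
        return -0.5 * w
    return w if t % 2 == 0 else -w

T = 100
w, ftl_loss, coef = 0.0, 0.0, 0.0    # coef: a_t in sum_{i<=t} f_i(w) = a_t * w
for t in range(1, T + 1):
    ftl_loss += loss(t, w)
    coef += -0.5 if t == 1 else (1.0 if t % 2 == 0 else -1.0)
    w = -1.0 if coef > 0 else 1.0    # FTL: minimise a_t * w over [-1, 1]

best_fixed = min(sum(loss(t, u) for t in range(1, T + 1)) for u in (-1.0, 0.0, 1.0))
print(ftl_loss, best_fixed)          # ~T versus a bounded constant: linear regret
```

FTL lurches between the endpoints one step behind the adversary, paying loss 1 every round, while the best fixed point in hindsight pays almost nothing; this motivates regularisation.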
Follow the Regularized Leader (FTRL):
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w} \in S} \sum_{i=1}^t f_i(\mathbf{w}) + \varphi(\mathbf{w}) \qquad (4)$$

Lemma (upper bound of regret): Let $\mathbf{w}_1, \mathbf{w}_2, \ldots$ be the sequence produced by FTRL; then for all $\mathbf{u} \in S$,
$$\sum_{t=1}^T \left[f_t(\mathbf{w}_t) - f_t(\mathbf{u})\right] \le \varphi(\mathbf{u}) - \varphi(\mathbf{w}_1) + \sum_{t=1}^T \left[f_t(\mathbf{w}_t) - f_t(\mathbf{w}_{t+1})\right].$$
Let's assume the losses $f_t$ are convex:
$$\forall \mathbf{w}, \mathbf{u} \in S: \quad f_t(\mathbf{u}) - f_t(\mathbf{w}) \ge \langle \mathbf{u} - \mathbf{w}, \mathbf{g}(\mathbf{w}) \rangle \quad \forall \mathbf{g}(\mathbf{w}) \in \partial f_t(\mathbf{w}),$$
and use the linearisation $\tilde{f}_t(\mathbf{w}) = f_t(\mathbf{w}_t) + \langle \mathbf{w}, \mathbf{g}(\mathbf{w}_t) \rangle$ as a surrogate:
$$\sum_{t=1}^T f_t(\mathbf{w}_t) - f_t(\mathbf{u}) \le \sum_{t=1}^T \langle \mathbf{w}_t, \mathbf{g}_t \rangle - \langle \mathbf{u}, \mathbf{g}_t \rangle \quad \forall \mathbf{g}_t \in \partial f_t(\mathbf{w}_t).$$
Online Gradient Descent (OGD):
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \mathbf{g}_t, \quad \mathbf{g}_t \in \partial f_t(\mathbf{w}_t) \qquad (5)$$

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (it upper bounds the LHS);
- use the L2 regularizer $\varphi(\mathbf{w}) = \frac{1}{2\eta}\|\mathbf{w} - \mathbf{w}_1\|_2^2$;
- apply the FTRL regret bound to the surrogate loss:
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w} \in S} \sum_{i=1}^t \langle \mathbf{w}, \mathbf{g}_i \rangle + \varphi(\mathbf{w})$$
Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(\mathbf{w}) = \frac{1}{2\eta}\|\mathbf{w} - \mathbf{w}_1\|_2^2$. Then for all $\mathbf{u} \in S$,
$$R_T(\mathbf{u}) \le \frac{1}{2\eta}\|\mathbf{u} - \mathbf{w}_1\|_2^2 + \eta \sum_{t=1}^T \|\mathbf{g}_t\|_2^2, \quad \mathbf{g}_t \in \partial f_t(\mathbf{w}_t).$$
Keep in mind for the later discussion of adaGrad: I want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$; then the L2 norm in the bound is changed to the dual norm $\|\cdot\|_\star$, and we use Hölder's inequality in the proof.
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$, where $\psi_t$ is a strongly-convex function and
$$B_{\psi_t}(\mathbf{w}, \mathbf{w}_t) = \psi_t(\mathbf{w}) - \psi_t(\mathbf{w}_t) - \langle \nabla \psi_t(\mathbf{w}_t), \mathbf{w} - \mathbf{w}_t \rangle.$$

Primal-dual sub-gradient update:
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + \frac{1}{t}\psi_t(\mathbf{w}) \qquad (6)$$

Proximal gradient/composite mirror descent:
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + B_{\psi_t}(\mathbf{w}, \mathbf{w}_t) \qquad (7)$$
adaGrad with diagonal matrices:
$$G_t = \sum_{i=1}^t \mathbf{g}_i \mathbf{g}_i^T, \qquad H_t = \delta I + \operatorname{diag}(G_t^{1/2})$$

Primal-dual update:
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w} \in S} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + \frac{1}{2t}\mathbf{w}^T H_t \mathbf{w} \qquad (8)$$

Composite mirror descent update:
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w} \in S} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \eta \varphi(\mathbf{w}) + \frac{1}{2}(\mathbf{w} - \mathbf{w}_t)^T H_t (\mathbf{w} - \mathbf{w}_t) \qquad (9)$$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective;
- adapt the proximal term through time:
$$\psi_t(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T H_t \mathbf{w}$$
Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(\mathbf{w}) = 0$:
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \frac{1}{2}(\mathbf{w} - \mathbf{w}_t)^T \left[\delta I + \operatorname{diag}(G_t)^{1/2}\right](\mathbf{w} - \mathbf{w}_t)$$
$$\implies \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta\, \mathbf{g}_t}{\delta + \sqrt{\sum_{i=1}^t \mathbf{g}_i^2}} \qquad (10)$$
(division and square root taken elementwise)
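A minimal sketch of the diagonal update (10) (our illustration; `grad_fn` can be any stochastic gradient, e.g. the `minibatch_grad` sketch earlier, and η, δ are illustrative values):

```python
import numpy as np

def adagrad(w0, grad_fn, eta=0.5, delta=1e-8, T=500):
    # Diagonal adaGrad, eq. (10): each coordinate's step size shrinks with
    # its accumulated squared gradients sum_{i<=t} g_i^2.
    w, G = w0.copy(), np.zeros_like(w0)
    for _ in range(T):
        g = grad_fn(w)
        G += g * g
        w -= eta * g / (delta + np.sqrt(G))
    return w
```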
WHY?

Time-varying learning rate vs. time-varying proximal term:
- I want the contribution of $\|\mathbf{g}_t\|_\star^2$ induced by $\psi_t$ to be upper bounded by $\|\mathbf{g}_t\|_2$;
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot\,, H_t\, \cdot \rangle}$;
- $\psi_t(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T H_t \mathbf{w}$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$;
- a little math can show that
$$\sum_{t=1}^T \|\mathbf{g}_t\|_{\psi_t^\star}^2 \le 2\sum_{d=1}^D \|\mathbf{g}_{1:T,d}\|_2.$$
Theorem (Thm. 5 in the paper): Assume the sequence $\mathbf{w}_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|\mathbf{g}_t\|_\infty$. Then for any $\mathbf{u} \in S$,
$$R_T(\mathbf{u}) \le \frac{\delta}{\eta}\|\mathbf{u}\|_2^2 + \frac{1}{\eta}\|\mathbf{u}\|_\infty^2 \sum_{d=1}^D \|\mathbf{g}_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|\mathbf{g}_{1:T,d}\|_2.$$
For $\mathbf{w}_t$ generated using the composite mirror descent update (9),
$$R_T(\mathbf{u}) \le \frac{1}{2\eta}\max_{t \le T}\|\mathbf{u} - \mathbf{w}_t\|_2^2 \sum_{d=1}^D \|\mathbf{g}_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|\mathbf{g}_{1:T,d}\|_2.$$
Improving adaGrad

Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning;
- it is sensitive to initial conditions.

Possible improving directions:
- use a truncated sum / running average;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods: RMSprop [Tieleman and Hinton, 2012], adaDelta [Zeiler, 2012], ADAM [Kingma and Ba, 2014].
RMSprop:
$$G_t = \rho G_{t-1} + (1 - \rho)\mathbf{g}_t \mathbf{g}_t^T, \qquad H_t = (\delta I + \operatorname{diag}(G_t))^{1/2}$$
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}} \; \eta \langle \mathbf{w}, \mathbf{g}_t \rangle + \frac{1}{2}(\mathbf{w} - \mathbf{w}_t)^T H_t (\mathbf{w} - \mathbf{w}_t)$$
$$\implies \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta\, \mathbf{g}_t}{RMS[\mathbf{g}]_t}, \qquad RMS[\mathbf{g}]_t = \operatorname{diag}(H_t) \qquad (11)$$
This implements the first improving direction above: an exponential running average in place of adaGrad's ever-growing sum.
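A matching sketch of (11), under the same assumptions as the adaGrad sketch (ρ, η, δ illustrative):

```python
import numpy as np

def rmsprop(w0, grad_fn, eta=0.01, rho=0.9, delta=1e-8, T=500):
    # RMSprop, eq. (11): an exponential running average of squared gradients
    # replaces adaGrad's growing sum, so step sizes can recover over time.
    w, G = w0.copy(), np.zeros_like(w0)
    for _ in range(T):
        g = grad_fn(w)
        G = rho * G + (1 - rho) * g * g
        w -= eta * g / np.sqrt(delta + G)
    return w
```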
adaDelta starts from a second-order heuristic (the second improving direction above):
$$\Delta \mathbf{w} \propto H^{-1}\mathbf{g} \propto \frac{\partial f / \partial \mathbf{w}}{\partial^2 f / \partial \mathbf{w}^2} \;\implies\; \frac{\partial^2 f}{\partial \mathbf{w}^2} \propto \frac{\partial f / \partial \mathbf{w}}{\Delta \mathbf{w}} \;\implies\; \operatorname{diag}(H) \approx \frac{RMS[\mathbf{g}]_t}{RMS[\Delta \mathbf{w}]_{t-1}}$$
adaDelta:
$$\operatorname{diag}(H_t) = \frac{RMS[\mathbf{g}]_t}{RMS[\Delta \mathbf{w}]_{t-1}}$$
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}} \; \langle \mathbf{w}, \mathbf{g}_t \rangle + \frac{1}{2}(\mathbf{w} - \mathbf{w}_t)^T H_t (\mathbf{w} - \mathbf{w}_t)$$
$$\implies \mathbf{w}_{t+1} = \mathbf{w}_t - \frac{RMS[\Delta \mathbf{w}]_{t-1}}{RMS[\mathbf{g}]_t}\, \mathbf{g}_t \qquad (12)$$
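A sketch of (12), same conventions (ρ, δ illustrative); note that there is no global η at all:

```python
import numpy as np

def adadelta(w0, grad_fn, rho=0.95, delta=1e-6, T=500):
    # adaDelta, eq. (12): the ratio RMS[dw]_{t-1} / RMS[g]_t sets a
    # per-coordinate step size with no hand-tuned global learning rate.
    w = w0.copy()
    Eg2, Edw2 = np.zeros_like(w0), np.zeros_like(w0)
    for _ in range(T):
        g = grad_fn(w)
        Eg2 = rho * Eg2 + (1 - rho) * g * g
        dw = -(np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta)) * g
        Edw2 = rho * Edw2 + (1 - rho) * dw * dw
        w += dw
    return w
```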
ADAM:
$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1)\mathbf{g}_t, \qquad \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2)\mathbf{g}_t^2$$
$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}$$
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta\, \hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \delta} \qquad (13)$$
ADAM combines the remaining improving directions: momentum on the gradient, plus bias correction of the running averages.
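A sketch of (13), same conventions (β₁, β₂, η, δ illustrative):

```python
import numpy as np

def adam(w0, grad_fn, eta=0.01, beta1=0.9, beta2=0.999, delta=1e-8, T=500):
    # ADAM, eq. (13): momentum on the gradient (m_t) plus an RMS-style
    # denominator (v_t), each corrected for its initialisation bias.
    w = w0.copy()
    m, v = np.zeros_like(w0), np.zeros_like(w0)
    for t in range(1, T + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w
```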
Demo time

Wikipedia page "Test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

idea 3: learn the (global) learning rates as well.
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \mathbf{g}_t \;\implies\; \mathbf{w}_t = \mathbf{w}(t, \eta, \mathbf{w}_1, \{\mathbf{g}_i\}_{i \le t-1})$$
- $\mathbf{w}_t$ is a function of $\eta$
- learn $\eta$ (with gradient descent) by
$$\eta^\star = \operatorname*{argmin}_\eta \sum_{s \le t} f_s(\mathbf{w}(s, \eta))$$

Two instantiations:
- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]; a simplified sketch follows below.
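A heavily simplified sketch of the on-the-fly idea (our illustration only, not the actual algorithm of either paper): since $\mathbf{w}_t = \mathbf{w}_{t-1} - \eta \mathbf{g}_{t-1}$, the one-step derivative of the current loss w.r.t. η is $\langle \mathbf{g}_t, -\mathbf{g}_{t-1} \rangle$, so we can take a gradient step on η itself; the hyper-step size β is an assumed constant.

```python
import numpy as np

def sgd_learned_eta(w0, grad_fn, eta=0.01, beta=1e-4, T=500):
    # Adapt the global learning rate online: nudge eta along the one-step
    # hypergradient d f_t(w_t) / d eta = <g_t, -g_{t-1}>.
    w, g_prev = w0.copy(), np.zeros_like(w0)
    for _ in range(T):
        g = grad_fn(w)
        eta = max(eta + beta * float(np.dot(g, g_prev)), 0.0)
        w -= eta * g
        g_prev = g
    return w, eta
```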
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour spent tuning hyper-parameters.
References

- Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P.-Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that this implies that hn convergesIfsuminfin
n=1 max(0 hn+1 minus hn) ltinfin thensuminfin
n=1 min(0 hn+1 minus hn) ltinfin(why)
But hn+1 = h0 +sumn
k=1(max(0 hk+1 minus hk) + min(0 hk+1 minus hk))
So (hn)infinn=1 converges
22 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
Show that hn must converge to 0 Assume hn converges to somepositive number Previously we had
hn+1 minus hn = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nnablaC (sn)2
2
This is summable and if we assume further thatsuminfin
n=1 εn =infin thenwe get
〈sn minus xnablaC (sn)〉 rarr 0
contradicting hn converging away from 022 of 27
Discrete Gradient Descent
1 Define Lyapunov sequence hn = sn minus x22
2 Consider the positive variations h+n = max(0 hn+1 minus hn)
3 Show thatsuminfin
n=1 h+n ltinfin
4 Show that this implies that hn converges
5 Show that hn must converge to 0
6 sn rarr x
22 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm. 5 in the paper)

Assume the sequence w_t is generated by adaGrad using the primal-dual update (8) with \delta \ge \max_t \|g_t\|_\infty; then for any u \in S,

R_T(u) \le \frac{\delta}{\eta} \|u\|_2^2 + \frac{1}{\eta} \|u\|_\infty^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2

For w_t generated using the composite mirror descent update (9),

R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^D \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^D \|g_{1:T,d}\|_2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad:

global learning rate \eta still needs hand tuning;
sensitive to initial conditions.

possible improving directions:

use a truncated sum / running average;
use (approximate) second-order information;
add momentum;
correct the bias of the running average.

existing methods:

RMSprop [Tieleman and Hinton, 2012]
adaDelta [Zeiler, 2012]
ADAM [Kingma and Ba, 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop:

G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \quad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}

w_{t+1} = \arg\min_w \; \eta \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

\Rightarrow \quad w_{t+1} = w_t - \frac{\eta g_t}{RMS[g]_t}, \quad RMS[g]_t = \mathrm{diag}(H_t) \quad (11)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
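The same sketch with adaGrad's cumulative sum swapped for the exponential running average of (11); \rho and the toy objective are assumed values, not from the slides.

```python
import numpy as np

def rmsprop(grad_fn, w1, eta=0.01, rho=0.9, delta=1e-8, T=2000):
    w, G = w1.astype(float).copy(), np.zeros_like(w1, dtype=float)
    for _ in range(T):
        g = grad_fn(w)
        G = rho * G + (1 - rho) * g * g    # running average of squared gradients
        w -= eta * g / np.sqrt(delta + G)  # divide by RMS[g]_t, update (11)
    return w

grad = lambda w: np.array([1.0, 100.0]) * w  # same ill-conditioned quadratic
print(rmsprop(grad, np.array([1.0, 1.0])))
```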
Advanced adaptive learning rates
Improving adaGrad (cont)
\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}

\Rightarrow \quad \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}

\Rightarrow \quad \mathrm{diag}(H) \approx \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta:

\mathrm{diag}(H_t) = \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}

w_{t+1} = \arg\min_w \; \langle w, g_t \rangle + \frac{1}{2} (w - w_t)^T H_t (w - w_t)

\Rightarrow \quad w_{t+1} = w_t - \frac{RMS[\Delta w]_{t-1}}{RMS[g]_t} g_t \quad (12)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
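A sketch of step (12) (ours): the step size is the ratio of two running RMS estimates, so no global \eta appears. Placing \delta inside both RMS terms follows Zeiler's paper; \rho and the test objective are illustrative assumptions.

```python
import numpy as np

def adadelta(grad_fn, w1, rho=0.95, delta=1e-6, T=5000):
    w = w1.astype(float).copy()
    Eg2 = np.zeros_like(w)   # running average of g^2
    Edw2 = np.zeros_like(w)  # running average of (Delta w)^2
    for _ in range(T):
        g = grad_fn(w)
        Eg2 = rho * Eg2 + (1 - rho) * g * g
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g  # RMS ratio, update (12)
        Edw2 = rho * Edw2 + (1 - rho) * dw * dw
        w += dw
    return w

grad = lambda w: np.array([1.0, 100.0]) * w
print(adadelta(grad, np.array([1.0, 1.0])))  # note: no global step size eta anywhere
```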
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

\hat m_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat v_t = \frac{v_t}{1 - \beta_2^t}

w_{t+1} = w_t - \frac{\eta \hat m_t}{\sqrt{\hat v_t} + \delta} \quad (13)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
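And a sketch of (13) with the bias-corrected moments (our code; \beta_1, \beta_2, \eta and the test objective are illustrative defaults).

```python
import numpy as np

def adam(grad_fn, w1, eta=0.01, beta1=0.9, beta2=0.999, delta=1e-8, T=3000):
    w = w1.astype(float).copy()
    m = np.zeros_like(w)  # running first moment (momentum)
    v = np.zeros_like(w)  # running second moment
    for t in range(1, T + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)  # update (13)
    return w

grad = lambda w: np.array([1.0, 100.0]) * w
print(adam(grad, np.array([1.0, 1.0])))
```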
Advanced adaptive learning rates
Demo time
Wikipedia page "Test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3: learn the (global) learning rates as well

w_{t+1} = w_t - \eta g_t \quad \Rightarrow \quad w_t = w(t, \eta; w_1, \{g_i\}_{i \le t-1})

so w_t is a function of \eta; learn \eta (with gradient descent) by

\eta^* = \arg\min_\eta \sum_{s \le t} f_s(w(s, \eta))
Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]

Stochastic gradient descent to learn \eta on the fly [Masse and Ollivier, 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
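A heavily simplified sketch of this idea (entirely ours, not the procedure of either paper): differentiate f_t(w_t) with respect to \eta through only the single most recent update w_t = w_{t-1} - \eta g_{t-1}, which gives the hypergradient -\langle g_t, g_{t-1} \rangle, and update \eta online. The meta-step size alpha and the one-step truncation are assumptions.

```python
import numpy as np

def sgd_learned_eta(grad_fn, w1, eta0=0.01, alpha=1e-4, T=1000):
    """SGD that also adapts its global step size eta.

    One-step-truncated hypergradient: through w_t = w_{t-1} - eta * g_{t-1},
    d f_t(w_t) / d eta = -<g_t, g_{t-1}>, so take a gradient step on eta too.
    """
    w, eta = w1.astype(float).copy(), eta0
    g_prev = np.zeros_like(w, dtype=float)
    for _ in range(T):
        g = grad_fn(w)
        eta = max(eta + alpha * float(g @ g_prev), 0.0)  # hypergradient step on eta
        w = w - eta * g
        g_prev = g
    return w, eta

grad = lambda w: w  # toy quadratic C(w) = 0.5 * ||w||^2
print(sgd_learned_eta(grad, np.ones(3)))
```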
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochastic approximation and optimisation methods.

Theoretical analysis guarantees good behaviour.

Adaptive learning rates provide good performance in practice.

Future work will reduce the labour of tuning hyper-parameters.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.

Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.

Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.

Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
References
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.

Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based Hyperparameter Optimization through Reversible Learning. arXiv preprint arXiv:1502.03492, 2015.

Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Discrete Gradient Descent
1. Define the Lyapunov sequence h_n = \|s_n - x^*\|_2^2.

2. Consider the positive variations h_n^+ = \max(0, h_{n+1} - h_n).

3. Show that \sum_{n=1}^\infty h_n^+ < \infty.

4. Show that this implies that h_n converges.

5. Show that h_n must converge to 0.

6. s_n \to x^*
22 of 27
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm.

If C is a cost function averaged across a data set, we often have a true gradient of the form

\nabla C(x) = \frac{1}{N} \sum_{n=1}^N f_n(x)

and an approximation formed by subsampling:

\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \quad I_K \sim \mathrm{Unif}(\text{subsets of size } K)

We'll treat the more general case where the gradient estimates \widehat{\nabla C}(x) are unbiased and independent.

The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
23 of 27
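A quick sketch (ours) checking the unbiasedness claim on an assumed least-squares cost, where the per-datum gradient contributions f_n(x) are explicit; the data and subset size K are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(1000, 5)), rng.normal(size=1000)

def full_grad(x):
    """grad C(x) = (1/N) * sum_n f_n(x) for C(x) = (1/2N) * ||Ax - b||^2."""
    return A.T @ (A @ x - b) / len(b)

def stoch_grad(x, K=32):
    """Unbiased estimate from a uniformly drawn subset I_K of size K."""
    idx = rng.choice(len(b), size=K, replace=False)
    return A[idx].T @ (A[idx] @ x - b[idx]) / K

x = np.zeros(5)
avg = np.mean([stoch_grad(x) for _ in range(2000)], axis=0)
print(np.linalg.norm(avg - full_grad(x)))  # small: the subsampled gradient is unbiased
```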
Measure-theory-free martingale theory
Let (X_n)_{n \ge 0} be a stochastic process.

F_n denotes "the information describing the stochastic process up to time n", denoted mathematically by F_n = \sigma(X_m | m \le n).

Definition: A stochastic process (X_n)_{n=0}^\infty is a martingale^3 if

I E[|X_n|] < \infty for all n
I E[X_n | F_m] = X_m for n \ge m

Martingale convergence theorem (Doob, 1953): If (X_n)_{n=1}^\infty is a martingale and \sup_n E[|X_n|] < \infty, then X_n \to X_\infty almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If (X_n)_{n=1}^\infty is a positive stochastic process and

\sum_{n=1}^\infty E[(E[X_{n+1} | F_n] - X_n) 1_{\{E[X_{n+1} | F_n] - X_n > 0\}}] < \infty

then X_n \to X_\infty almost surely.

^3 on a filtered probability space (\Omega, F, (F_n)_{n=0}^\infty, P) with F_n = \sigma(X_m | m \le n) \forall n
24 of 27
Stochastic Gradient Descent
Proposition: If s_{n+1} = s_n - \varepsilon_n H_n(s_n), with H_n(s_n) an unbiased estimator for \nabla C(s_n), and C satisfies

1. C has a unique minimiser x^*
2. \forall \varepsilon > 0, \inf_{\|x - x^*\|_2^2 > \varepsilon} \langle x - x^*, \nabla C(x) \rangle > 0
3. E[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2 for some A, B \ge 0 independent of n

then subject to \sum_n \varepsilon_n = \infty and \sum_n \varepsilon_n^2 < \infty, we have s_n \to x^*.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement give the control we need over the unbiased gradient estimators.
25 of 27
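A sketch of the proposition in action (ours): SGD with step sizes \varepsilon_n = 1/n, which satisfy the two summability conditions, on a toy C(s) = \frac{1}{2}\|s - x^*\|^2 with Gaussian gradient noise; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
x_star = np.array([2.0, -1.0])

def H(s):
    """Unbiased estimator of grad C(s) for C(s) = 0.5 * ||s - x_star||^2."""
    return (s - x_star) + rng.normal(scale=0.5, size=s.shape)

s = np.zeros(2)
for n in range(1, 20001):
    eps = 1.0 / n  # sum(eps_n) = inf, sum(eps_n^2) < inf
    s = s - eps * H(s)
print(s)  # close to x_star, as the proposition predicts
```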
Stochastic Gradient Descent
1. Define the Lyapunov process h_n = \|s_n - x^*\|_2^2.

2. Consider the variations h_{n+1} - h_n.

3. Show that h_n converges almost surely.

4. Show that h_n must converge to 0 almost surely.

5. s_n \to x^*
26 of 27
Stochastic Gradient Descent

Consider the positive variations h_n^+ = \max(0, h_{n+1} - h_n).

By exactly the same calculation as in the deterministic discrete case,

h_{n+1} - h_n = -2 \varepsilon_n \langle s_n - x^*, H_n(s_n) \rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2

So

E[h_{n+1} - h_n | F_n] = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 E[\|H_n(s_n)\|_2^2 \,|\, F_n]

26 of 27
Stochastic Gradient Descent

Show that h_n converges almost surely. Exactly the same as for discrete gradient descent:

I Assume E[\|H_n(x)\|_2^2] \le A + B \|x - x^*\|_2^2.
I Introduce \mu_n = \prod_{i=1}^n \frac{1}{1 + \varepsilon_i^2 B} and h'_n = \mu_n h_n.
I Get

E[h'_{n+1} - h'_n \,|\, F_n] \le \varepsilon_n^2 \mu_n A

\Rightarrow \quad E[(h'_{n+1} - h'_n) 1_{\{E[h'_{n+1} - h'_n | F_n] > 0\}} \,|\, F_n] \le \varepsilon_n^2 \mu_n A

\sum_{n=1}^\infty \varepsilon_n^2 < \infty \Rightarrow quasimartingale convergence \Rightarrow (h'_n)_{n=1}^\infty converges a.s. \Rightarrow (h_n)_{n=1}^\infty converges a.s.

26 of 27
Stochastic Gradient Descent

Show that h_n must converge to 0 almost surely. From previous calculations,

E[h_{n+1} - (1 - \varepsilon_n^2 B) h_n \,|\, F_n] = -2 \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle + \varepsilon_n^2 A

(h_n)_{n=1}^\infty converges, so the left-hand side is summable a.s. We assume \sum_{n=1}^\infty \varepsilon_n^2 < \infty, so the right term is summable a.s., and hence the left term is also summable a.s.:

\sum_{n=1}^\infty \varepsilon_n \langle s_n - x^*, \nabla C(s_n) \rangle < \infty \quad \text{almost surely}

If we assume in addition that \sum_{n=1}^\infty \varepsilon_n = \infty, this forces

\langle s_n - x^*, \nabla C(s_n) \rangle \to 0 \quad \text{almost surely}

And by our initial assumptions about C, (h_n)_{n=1}^\infty must converge to 0 almost surely.

26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: assume

I C is a non-negative, three-times differentiable function;
I the Robbins-Monro learning rate conditions hold: \sum_{n=1}^\infty \varepsilon_n = \infty, \sum_{n=1}^\infty \varepsilon_n^2 < \infty;
I the gradient estimate H satisfies E[\|H_n(x)\|^k] \le A_k + B_k \|x\|^k for k = 2, 3, 4;
I there exists D > 0 such that \inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0.

The idea for the proof is then:

1. For a given start position, the sequence (s_n)_{n=1}^\infty is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence.
3. Prove that \nabla C(s_n) necessarily converges almost surely.

Note that this guarantees only that we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
27 of 27
Motivation
So far we've told you why SGD is "safe" :)

...but Robbins-Monro is just a sufficient condition.

Then how do we choose learning rates to achieve

fast convergence;
a better local optimum?
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio, 2012]

idea 1: search for the best learning rate schedule:

grid search, random search, cross validation;
Bayesian optimization.

But I'm lazy and I don't want to spend too much time...

Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2: let the learning rates adapt themselves:

pre-conditioning with |diag(H)| [Becker and LeCun, 1988]
adaGrad [Duchi et al., 2010]
adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
ADAM [Kingma and Ba, 2014]
Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

now we'll talk a bit about online learning:

a good tutorial [Shalev-Shwartz, 2012];
a milestone paper [Zinkevich, 2003];
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes:

we've got the whole dataset D;
the cost function is C(w; D) = E_{x \sim D}[c(w, x)].

SGD/mini-batch learning accelerates training in real time by

processing one data point / a mini-batch of data points each iteration;
considering gradients of per-data-point cost functions:

\nabla C(w; D) \approx \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Let's forget the cost function on the batch for a moment.

online gradient descent (OGD):

each iteration t, we receive a loss f_t(w) and a (noisy) (sub-)gradient g_t \in \partial f_t(w);
then we update the weights with w \leftarrow w - \eta g_t.

for SGD, the received gradient is defined by

g_t = \frac{1}{M} \sum_{m=1}^M \nabla c(w, x_m)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes:

each iteration t, we receive a loss f_t(w) (in the batch learning context, f_t(w) = c(w, x_t));
then we learn w_{t+1} following some rules.

Performance is measured by regret:

R_T^* = \sum_{t=1}^T f_t(w_t) - \inf_{w \in S} \sum_{t=1}^T f_t(w) \quad (1)

General definition: R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)]

SGD \Leftrightarrow "follow the regularized leader" with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL):

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) \quad (2)

Lemma (upper bound of regret)

Let w_1, w_2, ... be the sequence produced by FTL; then \forall u \in S,

R_T(u) = \sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})] \quad (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL:

Let w \in S = [-1, 1], and the loss function at time t be

f_t(w) = \begin{cases} -0.5\, w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd} \end{cases}

FTL is easily fooled!
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
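A tiny simulation (ours) makes the failure concrete: FTL plays the minimiser of the cumulative linear loss over [-1, 1], and its total loss grows linearly in T, while the fixed comparator u = 0 pays nothing, so the regret is linear in T.

```python
import numpy as np

def loss_coeff(t):
    """f_t(w) = c_t * w, with c_1 = -0.5, c_t = 1 for even t, c_t = -1 for odd t > 1."""
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T, cum, ftl_loss = 100, 0.0, 0.0
w = 0.0  # arbitrary first play
for t in range(1, T + 1):
    ftl_loss += loss_coeff(t) * w  # loss FTL suffers this round
    cum += loss_coeff(t)           # coefficient of the cumulative loss so far
    w = -np.sign(cum)              # FTL: argmin over [-1, 1] of cum * w
print(ftl_loss)  # ~ T - 1, while the fixed point u = 0 pays 0
```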
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL):

w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^t f_i(w) + \varphi(w) \quad (4)

Lemma (upper bound of regret)

Let w_1, w_2, ... be the sequence produced by FTRL; then \forall u \in S,

\sum_{t=1}^T [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^T [f_t(w_t) - f_t(w_{t+1})]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Stochastic Gradient DescentWersquore now ready to introduce the stochastic approximation of thegradient into our algorithmIf C is a cost function averaged across a data set often have truegradient of the form
nablaC (x) =1
N
Nsumn=1
fn(x)
and an approximation is formed by subsampling
nablaC (x) =1
K
sumkisinIK
fk(x) (Ik sim Unif (subsets of size K))
Wersquoll treat the more general case where the gradient estimates nablaC (x)are unbiased and independentThe introduced randomness means that the Lyapunov sequence in ourproof will now be a stochastic process and wersquoll need some additionalmachinery to deal with it
23 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent

We're now ready to introduce the stochastic approximation of the gradient into our algorithm. If $C$ is a cost function averaged across a data set, we often have a true gradient of the form

$\nabla C(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x),$

and an approximation is formed by subsampling:

$\widehat{\nabla C}(x) = \frac{1}{K} \sum_{k \in I_K} f_k(x), \qquad I_K \sim \mathrm{Unif}(\text{subsets of size } K).$

We'll treat the more general case where the gradient estimates $\widehat{\nabla C}(x)$ are unbiased and independent. The introduced randomness means that the Lyapunov sequence in our proof will now be a stochastic process, and we'll need some additional machinery to deal with it.
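As a concrete illustration (my own sketch, not from the slides), here is a minimal NumPy example of the subsampled gradient estimator above for a least-squares cost; the data and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative least-squares setup: C(x) = (1/2N) * sum_n (a_n^T x - b_n)^2
N, D = 1000, 5
A = rng.normal(size=(N, D))
b = rng.normal(size=N)

def full_gradient(x):
    # True gradient: average of per-datapoint gradients f_n(x) = a_n (a_n^T x - b_n)
    return A.T @ (A @ x - b) / N

def subsampled_gradient(x, K):
    # Unbiased estimate: average over a uniformly chosen subset I_K of size K
    idx = rng.choice(N, size=K, replace=False)
    return A[idx].T @ (A[idx] @ x - b[idx]) / K

x = np.zeros(D)
print(np.linalg.norm(full_gradient(x) - subsampled_gradient(x, K=64)))
```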
Measure-theory-free martingale theory

Let $(X_n)_{n \ge 0}$ be a stochastic process. $\mathcal{F}_n$ denotes "the information describing the stochastic process up to time $n$", denoted mathematically by $\mathcal{F}_n = \sigma(X_m \mid m \le n)$.

Definition: A stochastic process $(X_n)_{n=0}^{\infty}$ is a martingale³ if
- $\mathbb{E}[|X_n|] < \infty$ for all $n$;
- $\mathbb{E}[X_n \mid \mathcal{F}_m] = X_m$ for $n \ge m$.

Martingale convergence theorem (Doob, 1953): If $(X_n)_{n=1}^{\infty}$ is a martingale and $\sup_n \mathbb{E}[|X_n|] < \infty$, then $X_n \to X_\infty$ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If $(X_n)_{n=1}^{\infty}$ is a positive stochastic process and

$\sum_{n=1}^{\infty} \mathbb{E}\big[(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n)\,\mathbf{1}_{\{\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] - X_n > 0\}}\big] < \infty,$

then $X_n \to X_\infty$ almost surely.

³ On a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_n)_{n=0}^{\infty}, \mathbb{P})$ with $\mathcal{F}_n = \sigma(X_m \mid m \le n)$ for all $n$.
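To make Doob's theorem tangible, here is a small simulation I wrote (an illustration under my own assumptions, not from the slides) of a positive martingale with bounded expectation, whose paths settle to almost-sure limits.

```python
import numpy as np

rng = np.random.default_rng(2)

# X_{n+1} = X_n * Y_n with Y_n i.i.d., positive, E[Y_n] = 1, so (X_n) is a
# positive martingale with E[|X_n|] = 1 for all n; Doob gives X_n -> X_inf a.s.
n_steps, n_paths = 5000, 5
Y = rng.choice([0.5, 1.5], size=(n_steps, n_paths))
X = np.vstack([np.ones(n_paths), np.cumprod(Y, axis=0)])
print(X[-1])  # each path settles to a limit (here 0, since E[log Y] < 0)
```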
Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies
1. $C$ has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$: $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x) \rangle > 0$;
3. $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$;

then subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement give us the control we need over the unbiased gradient estimators.
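A minimal simulation of the proposition's setting (my own illustration, with assumed toy data): SGD on a strongly convex quadratic with noisy but unbiased gradients and Robbins-Monro step sizes $\varepsilon_n = 1/n$.

```python
import numpy as np

rng = np.random.default_rng(1)
x_star = np.array([2.0, -1.0])   # unique minimiser of C(x) = 0.5 * ||x - x_star||^2

def H(s):
    # Unbiased gradient estimate: grad C(s) = s - x_star, plus zero-mean noise.
    # Condition 3 holds: E||H(s)||^2 = ||s - x_star||^2 + 2, i.e. A = 2, B = 1.
    return (s - x_star) + rng.normal(size=2)

s = np.zeros(2)
for n in range(1, 20001):
    s = s - (1.0 / n) * H(s)     # Robbins-Monro: sum 1/n = inf, sum 1/n^2 < inf
print("final iterate:", s, " target:", x_star)
```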
Stochastic Gradient Descent: proof

The proof proceeds in five steps:
1. Define the Lyapunov process $h_n = \|s_n - x^\star\|_2^2$.
2. Consider the variations $h_{n+1} - h_n$.
3. Show that $h_n$ converges almost surely.
4. Show that $h_n$ must converge to 0 almost surely.
5. Conclude $s_n \to x^\star$.
Step 2: consider the variations. Define the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. By exactly the same calculation as in the deterministic discrete case,

$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, H_n(s_n)\rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2,$

so

$\mathbb{E}[h_{n+1} - h_n \mid \mathcal{F}_n] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2\, \mathbb{E}\big[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\big].$
Step 3: show that $h_n$ converges almost surely. Exactly as for discrete gradient descent:
- Assume $\mathbb{E}[\|H_n(x)\|_2^2] \le A + B\|x - x^\star\|_2^2$.
- Introduce $\mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.
- Get $\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] \le \varepsilon_n^2 \mu_n A$, and hence

$\mathbb{E}\big[(h'_{n+1} - h'_n)\,\mathbf{1}_{\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\}} \mid \mathcal{F}_n\big] \le \varepsilon_n^2 \mu_n A.$

Since $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, quasimartingale convergence applies, so $(h'_n)_{n=1}^{\infty}$ converges almost surely, and therefore $(h_n)_{n=1}^{\infty}$ converges almost surely.
Step 4: show that $h_n$ must converge to 0 almost surely. From the previous calculations,

$\mathbb{E}\big[h_{n+1} - (1 - \varepsilon_n^2 B) h_n \mid \mathcal{F}_n\big] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2 A.$

$(h_n)_{n=1}^{\infty}$ converges, so the left-hand side is summable almost surely. We assume $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, so the $\varepsilon_n^2 A$ term is summable almost surely, and hence the remaining term is too:

$\sum_{n=1}^{\infty} \varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle < \infty \quad \text{almost surely.}$

If we assume in addition that $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, this forces $\langle s_n - x^\star, \nabla C(s_n)\rangle \to 0$ almost surely, and by our initial assumptions about $C$, $(h_n)_{n=1}^{\infty}$ must converge to 0 almost surely.
Step 5: since $h_n = \|s_n - x^\star\|_2^2 \to 0$ almost surely, we conclude $s_n \to x^\star$ almost surely, completing the proof.
Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above is to assume:
- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}[\|H_n(x)\|^k] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0$.

The idea for the proof is then:
1. For a given start position, the sequence $(s_n)_{n=1}^{\infty}$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
Motivation

So far we've told you why SGD is "safe" :) but Robbins-Monro is just a sufficient condition. How, then, to choose learning rates to achieve:
- fast convergence;
- a better local optimum?
Motivation (cont.)

[Figure from [Bergstra and Bengio 2012].]

Idea 1: search for the best learning rate schedule, via grid search, random search, cross-validation, or Bayesian optimization. But I'm lazy and I don't want to spend too much time.
Motivation (cont.)

Idea 2: let the learning rates adapt themselves:
- pre-conditioning with |diag(H)| [Becker and LeCun 1988];
- adaGrad [Duchi et al 2010];
- adaDelta [Zeiler 2012], RMSprop [Tieleman and Hinton 2012];
- ADAM [Kingma and Ba 2014];
- Equilibrated SGD [Dauphin et al 2015] (this year's NIPS).

Now we'll talk a bit about online learning (a good tutorial: [Shalev-Shwartz 2012]; milestone paper: [Zinkevich 2003]) and then some practical comparisons.
Online learning

From batch to online learning

Batch learning often assumes:
- we've got the whole dataset $\mathcal{D}$;
- the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$.

SGD/mini-batch learning accelerates training in real time by:
- processing one datapoint / a mini-batch of datapoints each iteration;
- considering gradients of the per-datapoint cost functions:

$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m).$
From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD): at each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$; then we update the weights with $w \leftarrow w - \eta g_t$. For SGD, the received gradient is defined by

$g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m).$
From batch to online learning (cont.)

Online learning assumes: at each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$); then we learn $w_{t+1}$ following some rules.

Performance is measured by regret:

$R^\star_T = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w). \qquad (1)$

General definition: $R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.
From batch to online learning (cont.)

Follow the Leader (FTL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w). \qquad (2)$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,

$R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})]. \qquad (3)$
From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$, and define the loss function at time $t$ by

$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd.} \end{cases}$

FTL is easily fooled: it suffers loss 1 at every round after the first, while the fixed choice $u = 0$ suffers none.
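To see the failure concretely, here is a small simulation I wrote of FTL on this game (illustrative, not from the slides): the cumulative loss grows linearly in $T$ while the comparator $u = 0$ incurs zero loss.

```python
import numpy as np

T = 20
coeffs = [-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, T + 1)]  # f_t(w) = c_t * w

w, ftl_loss = 0.0, 0.0           # play w_1 = 0, then follow the leader
for t, c in enumerate(coeffs, start=1):
    ftl_loss += c * w            # suffer this round's loss
    cum = sum(coeffs[:t])        # leader minimises cum * w over [-1, 1]
    w = -np.sign(cum) if cum != 0 else 0.0
print(f"FTL cumulative loss after {T} rounds: {ftl_loss:.1f}; u = 0 incurs 0.0")
```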
From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w). \qquad (4)$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,

$\sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})].$
From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex:

$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle \quad \forall g(w) \in \partial f_t(w).$

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g_t \rangle$ as a surrogate; then

$\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t\rangle - \langle u, g_t\rangle \quad \forall g_t \in \partial f_t(w_t).$
From batch to online learning (cont.)

Online Gradient Descent (OGD):

$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t). \qquad (5)$

From FTRL to online gradient descent:
- use the linearisation as a surrogate loss (it upper bounds the LHS);
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$;
- apply the FTRL regret bound to the surrogate loss:

$w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} \langle w, g_i \rangle + \varphi(w).$
From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,

$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t).$
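A minimal OGD implementation of eq. (5), written as a sketch under my own assumptions (the quadratic losses and helper names are illustrative, not from the slides):

```python
import numpy as np

def ogd(grad_fns, w1, eta, project=lambda w: w):
    """Online gradient descent, eq. (5): w_{t+1} = Proj_S(w_t - eta * g_t)."""
    w, iterates = np.asarray(w1, dtype=float), []
    for grad in grad_fns:
        iterates.append(w.copy())
        w = project(w - eta * grad(w))
    return iterates

# Example: losses f_t(w) = 0.5 * ||w - z_t||^2 with subgradient g_t = w - z_t,
# played on S = R^2 (so no projection is needed).
rng = np.random.default_rng(0)
zs = rng.normal(size=(100, 2))
ws = ogd([lambda w, z=z: w - z for z in zs], w1=np.zeros(2), eta=0.1)
print(ws[-1], zs.mean(axis=0))  # the iterates drift toward the mean of the z_t
```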
Keep this in mind for the later discussion of adaGrad: if instead we want $\varphi$ to be strongly convex with respect to a general (semi-)norm $\|\cdot\|$, then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_\star$, and Hölder's inequality is used in the proof.
Advanced adaptive learning rates

adaGrad [Duchi et al 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and

$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle$

is the associated Bregman divergence.

Primal-dual sub-gradient update:

$w_{t+1} = \arg\min_{w} \; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w). \qquad (6)$

Proximal gradient / composite mirror descent:

$w_{t+1} = \arg\min_{w} \; \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t). \qquad (7)$
adaGrad [Duchi et al 2010] (cont.)

adaGrad with diagonal matrices:

$G_t = \sum_{i=1}^{t} g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2}),$

$w_{t+1} = \arg\min_{w \in S} \; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t} w^T H_t w, \qquad (8)$

$w_{t+1} = \arg\min_{w \in S} \; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t). \qquad (9)$

To get adaGrad from online gradient descent:
- add a proximal term or a Bregman divergence to the objective;
- adapt the proximal term through time:

$\psi_t(w) = \frac{1}{2} w^T H_t w.$
adaGrad [Duchi et al 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$w_{t+1} = \arg\min_{w} \; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}}, \qquad (10)$

with the square root and division taken elementwise.
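A minimal NumPy sketch of the elementwise adaGrad update in eq. (10); the toy objective and function name are my own illustrative assumptions.

```python
import numpy as np

def adagrad_update(w, g, G_diag, eta=0.1, delta=1e-8):
    """One adaGrad step, eq. (10): per-coordinate steps from accumulated g^2."""
    G_diag += g ** 2                          # running sum of squared gradients
    w = w - eta * g / (delta + np.sqrt(G_diag))
    return w, G_diag

# Usage sketch on f(w) = 0.5 * ||w||^2 (so grad = w):
w, G = np.ones(3), np.zeros(3)
for _ in range(100):
    w, G = adagrad_update(w, w.copy(), G)
print(w)  # moves toward 0, with per-coordinate adaptive step sizes
```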
adaGrad [Duchi et al 2010] (cont.)

Why? Compare a time-varying learning rate with a time-varying proximal term. We want the contribution of $\|g_t\|_\star^2$ induced by $\psi_t$ to be upper bounded in terms of $\|g_t\|_2$:
- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot\rangle}$;
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex with respect to $\|\cdot\|_{\psi_t}$;
- a little math shows that

$\sum_{t=1}^{T} \|g_t\|_{\psi_t^\star}^2 \le 2\sum_{d=1}^{D} \|g_{1:T,d}\|_2.$
adaGrad [Duchi et al 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^{D}\|g_{1:T,d}\|_2 + \eta\sum_{d=1}^{D}\|g_{1:T,d}\|_2.$

For $\{w_t\}$ generated using the composite mirror descent update (9),

$R_T(u) \le \frac{1}{2\eta}\max_{t\le T}\|u - w_t\|_2^2 \sum_{d=1}^{D}\|g_{1:T,d}\|_2 + \eta\sum_{d=1}^{D}\|g_{1:T,d}\|_2.$
Improving adaGrad

Possible problems of adaGrad:
- the global learning rate $\eta$ still needs hand tuning;
- it is sensitive to initial conditions.

Possible improving directions:
- use a truncated sum / running average;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods: RMSprop [Tieleman and Hinton 2012], adaDelta [Zeiler 2012], ADAM [Kingma and Ba 2014].
Improving adaGrad (cont.)

RMSprop:

$G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2},$

$w_{t+1} = \arg\min_{w} \; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t). \qquad (11)$

(Of the improving directions above, RMSprop replaces adaGrad's growing sum with a running average.)
Improving adaGrad (cont.)

adaDelta starts from a diagonal second-order heuristic:

$\Delta w \propto H^{-1} g \propto \frac{\partial f/\partial w}{\partial^2 f/\partial w^2} \;\Rightarrow\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f/\partial w}{\Delta w} \;\Rightarrow\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}.$
Improving adaGrad (cont.)

adaDelta:

$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}},$

$w_{t+1} = \arg\min_{w} \; \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$

$\Rightarrow \quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t. \qquad (12)$

(This removes the global learning rate $\eta$ in favour of a ratio of running averages, approximating second-order information.)
Improving adaGrad (cont.)

ADAM:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t},$

$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}. \qquad (13)$

(ADAM adds momentum through $m_t$ and corrects the bias of both running averages.)
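A minimal NumPy sketch of the ADAM update in eq. (13); the default hyper-parameter values and the toy objective are my own illustrative assumptions.

```python
import numpy as np

def adam(grad, w0, eta=0.01, beta1=0.9, beta2=0.999, delta=1e-8, steps=1000):
    """Minimal ADAM, eq. (13): bias-corrected running averages of g and g^2."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)          # bias correction of the mean
        v_hat = v / (1 - beta2 ** t)          # bias correction of the 2nd moment
        w = w - eta * m_hat / (np.sqrt(v_hat) + delta)
    return w

print(adam(lambda w: 2 * (w - 3.0), w0=np.zeros(2)))  # minimises ||w - 3||^2
```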
Demo time

Wikipedia page "Test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

Idea 3: learn the (global) learning rates as well:

$w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1}),$

so $w_t$ is a function of $\eta$, and we can learn $\eta$ (with gradient descent) by

$\eta^\star = \arg\min_{\eta} \sum_{s \le t} f_s(w(s, \eta)).$

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al 2015].
- Stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier 2015].
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.
References

- Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
- Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
- Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
- Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Tieleman, T., Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
- Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
- Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
- Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
- Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
- Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Measure-theory-free martingale theory
Let (Xn)nge0 be a stochastic process
Fn denotes ldquothe information describing the stochastic process up totime nrdquo denoted mathematically by Fn = σ(Xm|m le n)
Definition A stochastic process (Xn)infinn=0 is a martingale3 if
I E [|Xn|] ltinfin for all n
I E [Xn|Fm] = Xm for n ge m
Martingale convergence theorem (Doob 1953) If (Xn)infinn=1 is amartingale and supE [|Xn|] ltinfin then Xn rarr Xinfin almost surelyQuasimartingale convergence theorem (Fisk 1965) If (Xn)infinn=1 is apositive stochastic process and
infinsumn=1
E[(E [Xn+1|Fn]minus Xn)1E[Xn+1|Fn]minusXngt0
]ltinfin
then Xn rarr Xinfin almost surely3on a filtered probability space (ΩF (Fn)infinn=0P) with Fn = σ(Xm|m le n)foralln
24 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
References

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based Hyperparameter Optimization through Reversible Learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Measure-theory-free martingale theory
Let (X_n)_{n≥0} be a stochastic process.

F_n denotes "the information describing the stochastic process up to time n", denoted mathematically by F_n = σ(X_m | m ≤ n).

Definition: A stochastic process (X_n)_{n=0}^∞ is a martingale³ if
- E[|X_n|] < ∞ for all n
- E[X_n | F_m] = X_m for n ≥ m

(For example, a random walk X_n = ∑_{i≤n} ξ_i with independent, integrable, mean-zero steps ξ_i is a martingale.)

Martingale convergence theorem (Doob, 1953): If (X_n)_{n=1}^∞ is a martingale and sup_n E[|X_n|] < ∞, then X_n → X_∞ almost surely.

Quasimartingale convergence theorem (Fisk, 1965): If (X_n)_{n=1}^∞ is a positive stochastic process and

∑_{n=1}^∞ E[(E[X_{n+1} | F_n] − X_n) 1{E[X_{n+1} | F_n] − X_n > 0}] < ∞,

then X_n → X_∞ almost surely.

³ on a filtered probability space (Ω, F, (F_n)_{n=0}^∞, P) with F_n = σ(X_m | m ≤ n) for all n.
24 of 27
Stochastic Gradient Descent
Proposition: If s_{n+1} = s_n − ε_n H_n(s_n), with H_n(s_n) an unbiased estimator for ∇C(s_n), and C satisfies

1. C has a unique minimiser x*,
2. ∀ε > 0, inf_{‖x − x*‖_2^2 > ε} ⟨x − x*, ∇C(x)⟩ > 0,
3. E[‖H_n(x)‖_2^2] ≤ A + B‖x − x*‖_2^2 for some A, B ≥ 0 independent of n,

then, subject to ∑_n ε_n = ∞ and ∑_n ε_n^2 < ∞, we have s_n → x*.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement deal with the control we need over the unbiased gradient estimators.
25 of 27
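A quick numerical sanity check of the proposition on a toy problem; every ingredient below is an illustrative choice satisfying the three conditions: C(s) = s^2/2 (unique minimiser x* = 0), H_n(s) = s + standard Gaussian noise (unbiased, with E[H_n(s)^2] = 1 + s^2), and ε_n = 1/n (so ∑ε_n = ∞ and ∑ε_n^2 < ∞):

import numpy as np

rng = np.random.default_rng(0)
s = 5.0                          # arbitrary starting point s_1
for n in range(1, 100001):
    eps = 1.0 / n                # Robbins-Monro step sizes
    H = s + rng.normal()         # unbiased estimate of grad C(s) = s
    s = s - eps * H
print(s)                         # should be close to the minimiser x* = 0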
Stochastic Gradient Descent
1. Define the Lyapunov process h_n = ‖s_n − x*‖_2^2
2. Consider the variations h_{n+1} − h_n
3. Show that h_n converges almost surely
4. Show that h_n must converge to 0 almost surely
5. Conclude s_n → x*
26 of 27
Stochastic Gradient Descent
Step 2: consider the variations h_{n+1} − h_n.

Consider the positive variations h_n^+ = max(0, h_{n+1} − h_n).

By exactly the same calculation as in the deterministic discrete case,

h_{n+1} − h_n = −2ε_n ⟨s_n − x*, H_n(s_n)⟩ + ε_n^2 ‖H_n(s_n)‖_2^2

So

E[h_{n+1} − h_n | F_n] = −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n^2 E[‖H_n(s_n)‖_2^2 | F_n]

26 of 27
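The "same calculation" being invoked is just an expansion of the square; written out for completeness under the slide's definitions:

\begin{aligned}
h_{n+1} - h_n &= \|s_{n+1} - x^\ast\|_2^2 - \|s_n - x^\ast\|_2^2 \\
              &= \|(s_n - x^\ast) - \varepsilon_n H_n(s_n)\|_2^2 - \|s_n - x^\ast\|_2^2 \\
              &= -2\varepsilon_n \langle s_n - x^\ast,\, H_n(s_n)\rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2 .
\end{aligned}

Taking E[· | F_n] and using unbiasedness, E[H_n(s_n) | F_n] = ∇C(s_n), turns the cross term into −2ε_n ⟨s_n − x*, ∇C(s_n)⟩, which gives the displayed conditional expectation.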
Stochastic Gradient Descent
Step 3: show that h_n converges almost surely. Exactly the same as for discrete gradient descent:

- assume E[‖H_n(x)‖_2^2] ≤ A + B‖x − x*‖_2^2
- introduce μ_n = ∏_{i=1}^n 1/(1 + ε_i^2 B) and h'_n = μ_n h_n

Get

E[h'_{n+1} − h'_n | F_n] ≤ ε_n^2 μ_n A

⇒ E[(h'_{n+1} − h'_n) 1{E[h'_{n+1} − h'_n | F_n] > 0} | F_n] ≤ ε_n^2 μ_n A

∑_{n=1}^∞ ε_n^2 < ∞ ⇒ quasimartingale convergence ⇒ (h'_n)_{n=1}^∞ converges a.s. ⇒ (h_n)_{n=1}^∞ converges a.s.

26 of 27
Stochastic Gradient Descent
Step 4: show that h_n must converge to 0 almost surely. From the previous calculations,

E[h_{n+1} − (1 − ε_n^2 B) h_n | F_n] = −2ε_n ⟨s_n − x*, ∇C(s_n)⟩ + ε_n^2 A

(h_n)_{n=1}^∞ converges, so the left-hand sequence is summable a.s. We assume ∑_{n=1}^∞ ε_n^2 < ∞, so the right term is summable a.s., and hence the left term is also summable a.s.:

∑_{n=1}^∞ ε_n ⟨s_n − x*, ∇C(s_n)⟩ < ∞  almost surely.

If we assume in addition that ∑_{n=1}^∞ ε_n = ∞, this forces

⟨s_n − x*, ∇C(s_n)⟩ → 0  almost surely,

and by our initial assumptions about C, (h_n)_{n=1}^∞ must converge to 0 almost surely.

26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above is to assume:

- C is a non-negative, three-times differentiable function
- the Robbins-Monro learning rate conditions hold: ∑_{n=1}^∞ ε_n = ∞, ∑_{n=1}^∞ ε_n^2 < ∞
- the gradient estimate H satisfies E[‖H_n(x)‖^k] ≤ A_k + B_k ‖x‖^k for k = 2, 3, 4
- there exists D > 0 such that inf_{‖x‖_2 > D} ⟨x, ∇C(x)⟩ > 0

The idea for the proof is then:

1. For a given start position, the sequence (s_n)_{n=1}^∞ is confined to a bounded neighbourhood of 0 almost surely
2. Introduce the Lyapunov function h_n = C(s_n) and prove its almost-sure convergence
3. Prove that ∇C(s_n) necessarily converges to 0 almost surely

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle point), rather than reaching the global optimum.
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
Proposition If sn+1 = sn minus εnHn(sn) with Hn(sn) an unbiasedestimator for nablaC (sn) and C satisfies
1 C has a unique minimiser x
2 forallε gt 0 infxminusx2
2gtε〈x minus xnablaC (x)〉 gt 0
3 E[Hn(x)2
2
]le A + Bx minus x2
2 for some AB ge 0 independentof n
then subject tosum
n εn =infin andsum
n ε2n ltinfin we have sn rarr x
Proof structure The proof is largely the same as for the deterministicdiscrete case but the extra conditions in the statement deal with thecontrol we need over the unbiased gradient estimators
25 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Stochastic Gradient Descent

Proposition: If $s_{n+1} = s_n - \varepsilon_n H_n(s_n)$, with $H_n(s_n)$ an unbiased estimator for $\nabla C(s_n)$, and $C$ satisfies

1. $C$ has a unique minimiser $x^\star$;
2. $\forall \varepsilon > 0$, $\inf_{\|x - x^\star\|_2^2 > \varepsilon} \langle x - x^\star, \nabla C(x)\rangle > 0$;
3. $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^\star\|_2^2$ for some $A, B \ge 0$ independent of $n$;

then, subject to $\sum_n \varepsilon_n = \infty$ and $\sum_n \varepsilon_n^2 < \infty$, we have $s_n \to x^\star$.

Proof structure: The proof is largely the same as for the deterministic discrete case, but the extra conditions in the statement give the control we need over the unbiased gradient estimators.
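To make the iteration concrete, here is a minimal runnable sketch (our illustration, not from the talk) of the Robbins-Monro scheme on a toy quadratic cost where every condition is easy to verify; the noise model and the $\varepsilon_n = 1/n$ schedule are assumptions chosen for the example.

```python
import numpy as np

# Robbins-Monro iteration s_{n+1} = s_n - eps_n * H_n(s_n) on the toy cost
# C(s) = 0.5 * ||s||_2^2, so grad C(s) = s and the unique minimiser is x* = 0.
# H_n adds zero-mean Gaussian noise, hence it is an unbiased gradient estimator
# with E||H_n(s)||^2 <= A + B * ||s - x*||^2 (here A = dim * sigma^2, B = 1).
rng = np.random.default_rng(0)

def H(s, sigma=1.0):
    return s + rng.normal(scale=sigma, size=s.shape)

s = np.array([5.0, -3.0])
for n in range(1, 10_001):
    eps = 1.0 / n                # sum eps_n = inf, sum eps_n^2 < inf
    s = s - eps * H(s)

print(s)                         # close to the minimiser [0, 0]
```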
The proof proceeds in five steps:

1. Define the Lyapunov process $h_n = \|s_n - x^\star\|_2^2$.
2. Consider the variations $h_{n+1} - h_n$.
3. Show that $h_n$ converges almost surely.
4. Show that $h_n$ must converge to 0 almost surely.
5. Conclude that $s_n \to x^\star$.
Step 2: Consider the positive variations $h_n^+ = \max(0, h_{n+1} - h_n)$. By exactly the same calculation as in the deterministic discrete case,

$$h_{n+1} - h_n = -2\varepsilon_n \langle s_n - x^\star, H_n(s_n)\rangle + \varepsilon_n^2 \|H_n(s_n)\|_2^2,$$

so

$$\mathbb{E}\left[h_{n+1} - h_n \mid \mathcal{F}_n\right] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2\, \mathbb{E}\left[\|H_n(s_n)\|_2^2 \mid \mathcal{F}_n\right].$$
Step 3: Show that $h_n$ converges almost surely. This works exactly as for discrete gradient descent:

- Assume $\mathbb{E}\left[\|H_n(x)\|_2^2\right] \le A + B\|x - x^\star\|_2^2$.
- Introduce $\mu_n = \prod_{i=1}^{n} \frac{1}{1 + \varepsilon_i^2 B}$ and $h'_n = \mu_n h_n$.
- Get $\mathbb{E}\left[h'_{n+1} - h'_n \mid \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A$, and hence

$$\mathbb{E}\left[(h'_{n+1} - h'_n)\,\mathbf{1}\{\mathbb{E}[h'_{n+1} - h'_n \mid \mathcal{F}_n] > 0\} \,\middle|\, \mathcal{F}_n\right] \le \varepsilon_n^2 \mu_n A.$$

Since $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, the quasimartingale convergence theorem applies, so $(h'_n)_{n=1}^{\infty}$ converges almost surely, and therefore $(h_n)_{n=1}^{\infty}$ converges almost surely.
Step 4: Show that $h_n$ must converge to 0 almost surely. From the previous calculations,

$$\mathbb{E}\left[h_{n+1} - (1 - \varepsilon_n^2 B) h_n \mid \mathcal{F}_n\right] = -2\varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle + \varepsilon_n^2 A.$$

$(h_n)_{n=1}^{\infty}$ converges, so the left-hand sequence is summable almost surely; we assume $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$, so the right term is summable almost surely, and hence the left term is also summable almost surely:

$$\sum_{n=1}^{\infty} \varepsilon_n \langle s_n - x^\star, \nabla C(s_n)\rangle < \infty \quad \text{almost surely.}$$

If we assume in addition that $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, this forces $\langle s_n - x^\star, \nabla C(s_n)\rangle \to 0$ almost surely, and by our initial assumptions about $C$, $(h_n)_{n=1}^{\infty}$ must converge to 0 almost surely.

Step 5: $h_n = \|s_n - x^\star\|_2^2 \to 0$ means precisely that $s_n \to x^\star$, completing the proof.
Relaxing the pseudo-convexity conditions

Often the conditions on $C$ in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above is to assume:

- $C$ is a non-negative, three-times differentiable function;
- the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^{\infty} \varepsilon_n = \infty$ and $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$;
- the gradient estimate $H$ satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k\|x\|^k$ for $k = 2, 3, 4$;
- there exists $D > 0$ such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x)\rangle > 0$.

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^{\infty}$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this only guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle), rather than reaching the global optimum.
Motivation

So far we've told you why SGD is "safe" :) but Robbins-Monro is just a sufficient condition. How, then, should we choose learning rates to achieve:

- fast convergence;
- a better local optimum?
Motivation (cont.)

[Figure from Bergstra and Bengio, 2012: grid search vs. random search for hyper-parameters]

Idea 1: search for the best learning rate schedule, via grid search, random search, cross-validation, or Bayesian optimization. But I'm lazy and I don't want to spend too much time.
Motivation (cont.)

Idea 2: let the learning rates adapt themselves:

- pre-conditioning with $|\mathrm{diag}(H)|$ [Becker and LeCun, 1988];
- adaGrad [Duchi et al., 2010];
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012];
- ADAM [Kingma and Ba, 2014];
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS).

Now we'll talk a bit about online learning (a good tutorial: [Shalev-Shwartz, 2012]; a milestone paper: [Zinkevich, 2003]) and give some practical comparisons.
Online learning

From batch to online learning

Batch learning often assumes:

- we've got the whole dataset $\mathcal{D}$;
- the cost function is $C(w, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}[c(w, x)]$.

SGD/mini-batch learning accelerates training in real time by processing one datapoint or a mini-batch of datapoints each iteration, using gradients of the per-datapoint cost functions (a sketch follows below):

$$\nabla C(w, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)$$
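As a small illustration, here is a sketch of that mini-batch gradient estimate; the least-squares per-datapoint cost is an assumption made purely for the example.

```python
import numpy as np

# Mini-batch estimate of grad C(w, D): average the per-datapoint gradients
# over a random batch. Here c(w, x) = 0.5 * (w . x_feat - x_target)^2.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 4))            # each row: 3 features + 1 target

def grad_c(w, x):
    feat, target = x[:-1], x[-1]
    return (w @ feat - target) * feat        # gradient of the squared error

def minibatch_grad(w, batch):
    return np.mean([grad_c(w, x) for x in batch], axis=0)

w = np.zeros(3)
batch = data[rng.choice(len(data), size=32, replace=False)]
print(minibatch_grad(w, batch))              # unbiased estimate of the full gradient
```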
From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

Online gradient descent (OGD): at each iteration $t$ we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$; then we update the weights with $w \leftarrow w - \eta g_t$.

For SGD, the received gradient is defined by

$$g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m).$$
From batch to online learning (cont.)

Online learning assumes that at each iteration $t$ we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$); then we learn $w_{t+1}$ following some rules.

Performance is measured by regret:

$$R^\star_T = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w) \quad (1)$$

General definition: $R_T(u) = \sum_{t=1}^{T} \left[f_t(w_t) - f_t(u)\right]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.
From batch to online learning (cont.)

Follow the Leader (FTL):

$$w_{t+1} = \mathrm{argmin}_{w \in S} \sum_{i=1}^{t} f_i(w) \quad (2)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,

$$R_T(u) = \sum_{t=1}^{T} \left[f_t(w_t) - f_t(u)\right] \le \sum_{t=1}^{T} \left[f_t(w_t) - f_t(w_{t+1})\right]. \quad (3)$$
From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$ and take the loss function at time $t$ to be

$$f_t(w) = \begin{cases} -0.5\,w & t = 1 \\ w & t \text{ is even} \\ -w & t > 1 \text{ and } t \text{ is odd.} \end{cases}$$

FTL is easily fooled: the cumulative loss flips sign every round, so FTL pays a loss of 1 per round while the fixed action $u = 0$ pays nothing (see the simulation below).
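A quick simulation of this game (our sketch, with the convention that ties in the argmin are broken by playing 0) shows the linear regret directly:

```python
import numpy as np

# The FTL-fooling game: linear losses f_t(w) = a_t * w on S = [-1, 1], with
# a_1 = -0.5 and a_t alternating between +1 (t even) and -1 (t odd, t > 1).
T = 20
a = np.array([-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, T + 1)])

w = np.zeros(T)                  # w[0]: arbitrary first play (0 by convention)
cum = 0.0
for t in range(1, T):
    cum += a[t - 1]              # coefficient of the cumulative loss so far
    w[t] = -np.sign(cum)         # argmin over [-1, 1] of cum * w

ftl_loss = float(np.sum(a * w))
best_fixed = min(np.sum(a) * u for u in (-1.0, 0.0, 1.0))
print(ftl_loss - best_fixed)     # regret grows linearly with T
```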
From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):

$$w_{t+1} = \mathrm{argmin}_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w) \quad (4)$$

Lemma (upper bound of regret): Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,

$$\sum_{t=1}^{T} \left[f_t(w_t) - f_t(u)\right] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} \left[f_t(w_t) - f_t(w_{t+1})\right].$$
From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:

$$\forall w, u \in S: \quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle, \quad \forall g(w) \in \partial f_t(w),$$

and use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t)\rangle$ as a surrogate:

$$\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t\rangle - \langle u, g_t\rangle, \quad \forall g_t \in \partial f_t(w_t).$$
From batch to online learning (cont.)

Online Gradient Descent (OGD):

$$w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \quad (5)$$

From FTRL to online gradient descent (a sketch of the equivalence follows below):

- use the linearisation as the surrogate loss (it upper-bounds the LHS);
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$;
- apply the FTRL regret bound to the surrogate loss:

$$w_{t+1} = \mathrm{argmin}_{w \in S} \sum_{i=1}^{t} \langle w, g_i\rangle + \varphi(w).$$
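On an unconstrained domain the equivalence is easy to see in code: the FTRL minimiser with this regularizer has the closed form $w_{t+1} = w_1 - \eta \sum_{i \le t} g_i$, which matches the OGD iterates on the linear surrogates. A small sketch, assuming $S = \mathbb{R}^D$:

```python
import numpy as np

# FTRL on linearised losses with phi(w) = ||w - w1||^2 / (2 * eta):
#   w_{t+1} = argmin_w sum_{i<=t} <w, g_i> + phi(w) = w1 - eta * sum_{i<=t} g_i,
# which coincides with OGD run on the same linear surrogates.
eta = 0.1
w1 = np.zeros(3)
gs = [np.array([1.0, -2.0, 0.5]), np.array([0.3, 0.3, -1.0])]  # received (sub-)gradients

w_ftrl = w1 - eta * np.sum(gs, axis=0)        # closed-form FTRL iterate

w_ogd = w1.copy()
for g in gs:                                   # OGD: w <- w - eta * g
    w_ogd = w_ogd - eta * g

print(np.allclose(w_ftrl, w_ogd))              # True
```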
From batch to online learning (cont.)

Theorem (regret bound for online gradient descent): Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,

$$R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t).$$

Keep in mind for the later discussion of adaGrad: if we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$, then the L2 norm above is replaced by the dual norm $\|\cdot\|_*$, and Hölder's inequality is used in the proof.
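The bound is easy to check numerically; the sketch below runs OGD on the (assumed, purely for illustration) convex losses $f_t(w) = |w - x_t|$ and compares the realized regret against the bound:

```python
import numpy as np

# OGD on f_t(w) = |w - x_t| with sub-gradient sign(w - x_t); the regret
# against any fixed u is bounded by ||u - w1||^2/(2 eta) + eta * sum ||g_t||^2.
rng = np.random.default_rng(1)
T, eta = 1000, 0.05
xs = rng.normal(size=T)

w, ws, gs = 0.0, [], []
for x in xs:
    ws.append(w)
    g = np.sign(w - x)           # a valid sub-gradient of |w - x|
    gs.append(g)
    w = w - eta * g

u = np.median(xs)                # best fixed action in hindsight
regret = np.abs(np.array(ws) - xs).sum() - np.abs(u - xs).sum()
bound = (u - 0.0) ** 2 / (2 * eta) + eta * np.sum(np.square(gs))
print(regret, bound, regret <= bound)
```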
Advanced adaptive learning rates

adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, with Bregman divergence

$$B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle.$$

Primal-dual sub-gradient update:

$$w_{t+1} = \mathrm{argmin}_w\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \quad (6)$$

Proximal gradient / composite mirror descent:

$$w_{t+1} = \mathrm{argmin}_w\; \eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \quad (7)$$
adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:

$$G_t = \sum_{i=1}^{t} g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})$$

$$w_{t+1} = \mathrm{argmin}_{w \in S}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t}\, w^T H_t w \quad (8)$$

$$w_{t+1} = \mathrm{argmin}_{w \in S}\; \eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \quad (9)$$

To get adaGrad from online gradient descent: add a proximal term or a Bregman divergence to the objective, and adapt the proximal term through time:

$$\psi_t(w) = \frac{1}{2}\, w^T H_t w.$$
adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:

$$w_{t+1} = \mathrm{argmin}_w\; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T \left[\delta I + \mathrm{diag}(G_t^{1/2})\right](w - w_t)$$

$$\Rightarrow\; w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}} \quad \text{(element-wise)} \quad (10)$$
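Update (10) is only a few lines of code. A minimal sketch (the quadratic objective is an assumption for the example):

```python
import numpy as np

# Diagonal adaGrad, update (10): per-coordinate steps eta / (delta + sqrt(sum g_i^2)).
eta, delta = 0.5, 1e-8
w = np.array([3.0, -2.0])
sum_sq = np.zeros_like(w)            # accumulates g_i^2, i.e. diag(G_t)

for t in range(100):
    g = w                            # gradient of f(w) = 0.5 * ||w||^2
    sum_sq += g ** 2
    w = w - eta * g / (delta + np.sqrt(sum_sq))

print(w)                             # approaches the minimiser [0, 0]
```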
adaGrad [Duchi et al., 2010] (cont.)

Why a time-varying proximal term rather than a time-varying learning rate? We want the contribution of the dual norms $\|g_t\|_*^2$ induced by $\psi_t$ to be upper-bounded in terms of $\|g_t\|_2$. Define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot\rangle}$; then $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$, and a little math can show that

$$\sum_{t=1}^{T} \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2.$$
adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper): Assume the sequence $w_t$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,

$$R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2.$$

For $w_t$ generated using the composite mirror descent update (9),

$$R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2.$$
Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand tuning;
- it is sensitive to initial conditions.

Possible improving directions:

- use a truncated sum / running average instead of the full sum;
- use (approximate) second-order information;
- add momentum;
- correct the bias of the running average.

Existing methods: RMSprop [Tieleman and Hinton, 2012], adaDelta [Zeiler, 2012], ADAM [Kingma and Ba, 2014].
Improving adaGrad (cont.)

RMSprop (direction used: a running average of squared gradients):

$$G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}$$

$$w_{t+1} = \mathrm{argmin}_w\; \eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow\; w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \quad (11)$$

A minimal sketch follows below.
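Here is a minimal sketch of update (11), on the same toy quadratic as before (an assumption for illustration):

```python
import numpy as np

# RMSprop, update (11): adaGrad's growing sum is replaced by an exponential
# running average, so the effective step size no longer decays to zero.
eta, rho, delta = 0.01, 0.9, 1e-8
w = np.array([3.0, -2.0])
G = np.zeros_like(w)                     # running average of g^2

for t in range(2000):
    g = w                                # gradient of f(w) = 0.5 * ||w||^2
    G = rho * G + (1 - rho) * g ** 2
    w = w - eta * g / np.sqrt(delta + G) # divide by RMS[g]_t

print(w)
```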
Improving adaGrad (cont.)

adaDelta's heuristic (approximate second-order information): a Newton step satisfies

$$\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2} \;\Rightarrow\; \frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w} \;\Rightarrow\; \mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}.$$
adaDelta:

$$\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}$$

$$w_{t+1} = \mathrm{argmin}_w\; \langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)$$

$$\Rightarrow\; w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \quad (12)$$

A minimal sketch follows below.
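A minimal sketch of update (12); note there is no global learning rate, only the decay rate $\rho$ and the small constant $\delta$ (the values here are assumptions):

```python
import numpy as np

# adaDelta, update (12): the step size is the ratio of the running RMS of past
# updates to the running RMS of gradients, so the units of w are respected.
rho, delta = 0.95, 1e-6
w = np.array([3.0, -2.0])
Eg2 = np.zeros_like(w)                   # running average of g^2
Edw2 = np.zeros_like(w)                  # running average of (delta_w)^2

for t in range(5000):
    g = w                                # gradient of f(w) = 0.5 * ||w||^2
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g
    Edw2 = rho * Edw2 + (1 - rho) * dw ** 2
    w = w + dw

print(w)
```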
ADAM (directions used: momentum and bias correction):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \quad (13)$$

A minimal sketch follows below.
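And a minimal sketch of update (13), with the bias-corrected averages exactly as above:

```python
import numpy as np

# ADAM, update (13): momentum on the gradient (m) plus an RMSprop-style
# second moment (v), both bias-corrected for their zero initialisation.
eta, beta1, beta2, delta = 0.1, 0.9, 0.999, 1e-8
w = np.array([3.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)

for t in range(1, 1001):
    g = w                                # gradient of f(w) = 0.5 * ||w||^2
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)         # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + delta)

print(w)
```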
Demo time

See the wikipedia page "Test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization — a sketch of such a comparison follows below.
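In the spirit of that demo, here is a small sketch comparing plain gradient descent and ADAM on the Rosenbrock function from that page (the step sizes and iteration counts are assumptions, not tuned values from the talk):

```python
import numpy as np

def rosenbrock_grad(p):
    # Gradient of f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2; minimum at (1, 1).
    x, y = p
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                     200 * (y - x ** 2)])

def run_gd(p, steps=20_000, eta=1e-4):
    for _ in range(steps):
        p = p - eta * rosenbrock_grad(p)
    return p

def run_adam(p, steps=20_000, eta=0.01, b1=0.9, b2=0.999, delta=1e-8):
    m, v = np.zeros_like(p), np.zeros_like(p)
    for t in range(1, steps + 1):
        g = rosenbrock_grad(p)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        p = p - eta * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + delta)
    return p

start = np.array([-1.5, 2.0])
print("GD:  ", run_gd(start))
print("ADAM:", run_adam(start))
```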
Learn the learning rates

Idea 3: learn the (global) learning rates as well. Since

$$w_{t+1} = w_t - \eta g_t \;\Rightarrow\; w_t = w(t, \eta;\, w_1, \{g_i\}_{i \le t-1}),$$

$w_t$ is a function of $\eta$, so we can learn $\eta$ (with gradient descent) by

$$\eta^\star = \mathrm{argmin}_\eta \sum_{s \le t} f_s(w(s, \eta)).$$

- Reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015].
- Stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015] (a rough sketch follows below).
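A rough sketch of the on-the-fly idea: since $w_t = w_{t-1} - \eta g_{t-1}$, we have $\partial w_t / \partial \eta \approx -g_{t-1}$, so a gradient step on $\eta$ uses the "hypergradient" $-\langle g_t, g_{t-1}\rangle$. The concrete update below is our simplified illustration, not the exact algorithm of Masse and Ollivier (2015):

```python
import numpy as np

# Learn the global learning rate on the fly: eta <- eta + beta * <g_t, g_{t-1}>,
# i.e. gradient descent on f_t(w_t) through w_t's dependence on eta.
rng = np.random.default_rng(0)
eta, beta = 0.01, 1e-4
w = np.array([3.0, -2.0])
g_prev = np.zeros_like(w)

for t in range(1000):
    g = w + rng.normal(scale=0.1, size=w.shape)  # noisy gradient of 0.5 * ||w||^2
    eta = max(eta + beta * float(g @ g_prev), 0.0)
    w = w - eta * g
    g_prev = g

print(w, eta)
```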
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour of tuning hyper-parameters.
References

Bergstra, J., Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S., LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Tieleman, T., Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D., Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D., Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y., Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Consider the positive variations h+n = max(0 hn+1 minus hn)
By exactly the same calculation as in the deterministic discrete case
hn+1 minus hn = minus2εn 〈sn minus xHn(sn)〉+ ε2nHn(sn)2
2
So
E [hn+1 minus hn|Fn] = minus2εn 〈sn minus xnablaC (sn)〉+ ε2nE[Hn(sn)2
2
∣∣Fn
]26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont.)

idea 2: let the learning rates adapt themselves

- pre-conditioning with $|\mathrm{diag}(H)|$ [Becker and LeCun, 1988]
- adaGrad [Duchi et al., 2010]
- adaDelta [Zeiler, 2012], RMSprop [Tieleman and Hinton, 2012]
- ADAM [Kingma and Ba, 2014]
- Equilibrated SGD [Dauphin et al., 2015] (this year's NIPS)

Now we'll talk a bit about online learning:

- a good tutorial: [Shalev-Shwartz, 2012]
- a milestone paper: [Zinkevich, 2003]

...and then give some practical comparisons.
Online learning
From batch to online learning

Batch learning often assumes:

- we've got the whole dataset $D$
- the cost function is $C(w, D) = \mathbb{E}_{x \sim D}[c(w, x)]$

SGD / mini-batch learning accelerates training in real time by:

- processing one datapoint or a mini-batch of datapoints each iteration
- considering gradients of the per-datapoint cost functions:
\[
\nabla C(w, D) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)
\]
Online learning
From batch to online learning (cont.)

Let's forget the cost function on the batch for a moment.

online gradient descent (OGD):

- each iteration $t$, we receive a loss $f_t(w)$ and a (noisy) (sub-)gradient $g_t \in \partial f_t(w)$
- then we update the weights with $w \leftarrow w - \eta g_t$

For SGD, the received gradient is defined by
\[
g_t = \frac{1}{M} \sum_{m=1}^{M} \nabla c(w, x_m)
\]
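As a small illustration (ours, with made-up toy losses), the OGD protocol in code: each round supplies a subgradient oracle for $f_t$, and the learner takes one step. Here $f_t(w) = \frac{1}{2}\|w - x_t\|_2^2$, so $g_t = w - x_t$.

import numpy as np

def ogd(gradient_oracles, w0, eta):
    # online gradient descent: w <- w - eta * g_t, with g_t a subgradient of f_t at w
    w = np.asarray(w0, dtype=float)
    for grad in gradient_oracles:
        w = w - eta * grad(w)
    return w

xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
w_T = ogd([lambda w, x=x: w - x for x in xs], w0=[0.0, 0.0], eta=0.5)
print(w_T)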
Online learning
From batch to online learning (cont.)

Online learning assumes:

- each iteration $t$, we receive a loss $f_t(w)$ (in the batch learning context, $f_t(w) = c(w, x_t)$)
- then we learn $w_{t+1}$ following some rules

Performance is measured by regret:
\[
R^*_T = \sum_{t=1}^{T} f_t(w_t) - \inf_{w \in S} \sum_{t=1}^{T} f_t(w) \tag{1}
\]

General definition: $R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)]$.

SGD $\Leftrightarrow$ "follow the regularized leader" with L2 regularization.
Online learning
From batch to online learning (cont.)

Follow the Leader (FTL):
\[
w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) \tag{2}
\]

Lemma (upper bound of regret)
Let $w_1, w_2, \ldots$ be the sequence produced by FTL; then $\forall u \in S$,
\[
R_T(u) = \sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})] \tag{3}
\]
Online learning
From batch to online learning (cont.)

A game that fools FTL: let $w \in S = [-1, 1]$, and let the loss function at time $t$ be
\[
f_t(w) =
\begin{cases}
-0.5\,w & t = 1 \\
w & t \text{ is even} \\
-w & t > 1 \text{ and } t \text{ is odd}
\end{cases}
\]

FTL is easily fooled! (See the simulation below.)
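A quick simulation of this game (our sketch): on linear losses $f_t(w) = a_t w$ over $S = [-1, 1]$, FTL plays an endpoint of the interval determined by the sign of the cumulative coefficient, and the adversary flips it every round. FTL pays roughly 1 per round while the fixed comparator $u = 0$ pays nothing, so the regret grows linearly in $T$.

import numpy as np

def a(t):
    # loss coefficients of the game: f_t(w) = a_t * w on S = [-1, 1]
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

T, cum, ftl_loss, w = 20, 0.0, 0.0, 0.0   # w_1 = 0 (arbitrary first play)
for t in range(1, T + 1):
    ftl_loss += a(t) * w
    cum += a(t)
    w = -np.sign(cum)                      # FTL: argmin over [-1, 1] of cum * w
print(ftl_loss)                            # roughly T - 1, while u = 0 suffers 0 loss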
Online learning
From batch to online learning (cont.)

Follow the Regularized Leader (FTRL):
\[
w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} f_i(w) + \varphi(w) \tag{4}
\]

Lemma (upper bound of regret)
Let $w_1, w_2, \ldots$ be the sequence produced by FTRL; then $\forall u \in S$,
\[
\sum_{t=1}^{T} [f_t(w_t) - f_t(u)] \le \varphi(u) - \varphi(w_1) + \sum_{t=1}^{T} [f_t(w_t) - f_t(w_{t+1})]
\]
Online learning
From batch to online learning (cont.)

Let's assume the losses $f_t$ are convex functions:
\[
\forall w, u \in S:\quad f_t(u) - f_t(w) \ge \langle u - w, g(w)\rangle, \quad \forall g(w) \in \partial f_t(w)
\]

Use the linearisation $\tilde{f}_t(w) = f_t(w_t) + \langle w, g(w_t)\rangle$ as a surrogate:
\[
\sum_{t=1}^{T} f_t(w_t) - f_t(u) \le \sum_{t=1}^{T} \langle w_t, g_t\rangle - \langle u, g_t\rangle, \quad \forall g_t \in \partial f_t(w_t)
\]
Online learning
From batch to online learning (cont.)

Online Gradient Descent (OGD):
\[
w_{t+1} = w_t - \eta g_t, \quad g_t \in \partial f_t(w_t) \tag{5}
\]

From FTRL to online gradient descent:

- use the linearisation as a surrogate loss (it upper-bounds the LHS)
- use the L2 regularizer $\varphi(w) = \frac{1}{2\eta} \|w - w_1\|_2^2$
- apply the FTRL regret bound to the surrogate loss:
\[
w_{t+1} = \arg\min_{w \in S} \sum_{i=1}^{t} \langle w, g_i\rangle + \varphi(w)
\]
Online learning
From batch to online learning (cont.)

Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions $f_1, f_2, \ldots, f_T$ with regularizer $\varphi(w) = \frac{1}{2\eta}\|w - w_1\|_2^2$; then for all $u \in S$,
\[
R_T(u) \le \frac{1}{2\eta}\|u - w_1\|_2^2 + \eta \sum_{t=1}^{T} \|g_t\|_2^2, \quad g_t \in \partial f_t(w_t)
\]
Keep this in mind for the later discussion of adaGrad: if we want $\varphi$ to be strongly convex w.r.t. a (semi-)norm $\|\cdot\|$, then the L2 norm in the bound is replaced by the dual norm $\|\cdot\|_*$, and Hölder's inequality is used in the proof.
Advanced adaptive learning rates
adaGrad [Duchi et al., 2010]

Use proximal terms $\psi_t$: $\psi_t$ is a strongly-convex function, and
\[
B_{\psi_t}(w, w_t) = \psi_t(w) - \psi_t(w_t) - \langle \nabla\psi_t(w_t), w - w_t\rangle
\]

Primal-dual sub-gradient update:
\[
w_{t+1} = \arg\min_{w} \;\eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{t}\psi_t(w) \tag{6}
\]

Proximal gradient / composite mirror descent:
\[
w_{t+1} = \arg\min_{w} \;\eta\langle w, g_t\rangle + \eta\varphi(w) + B_{\psi_t}(w, w_t) \tag{7}
\]
Advanced adaptive learning rates
adaGrad [Duchi et al., 2010] (cont.)

adaGrad with diagonal matrices:
\[
G_t = \sum_{i=1}^{t} g_i g_i^T, \qquad H_t = \delta I + \mathrm{diag}(G_t^{1/2})
\]
\[
w_{t+1} = \arg\min_{w \in S} \;\eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2t} w^T H_t w \tag{8}
\]
\[
w_{t+1} = \arg\min_{w \in S} \;\eta\langle w, g_t\rangle + \eta\varphi(w) + \frac{1}{2}(w - w_t)^T H_t (w - w_t) \tag{9}
\]

To get adaGrad from online gradient descent:

- add a proximal term or a Bregman divergence to the objective
- adapt the proximal term through time:
\[
\psi_t(w) = \frac{1}{2} w^T H_t w
\]
Advanced adaptive learning rates
adaGrad [Duchi et al., 2010] (cont.)

Example: composite mirror descent on $S = \mathbb{R}^D$ with $\varphi(w) = 0$:
\[
w_{t+1} = \arg\min_{w} \;\eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T \big[\delta I + \mathrm{diag}(G_t^{1/2})\big](w - w_t)
\]
\[
\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}} \tag{10}
\]
(the division is element-wise)
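Update (10) in code (a minimal sketch under the same assumptions, $\varphi = 0$ and $S = \mathbb{R}^D$): $G$ accumulates the element-wise squared gradients, and each coordinate gets its own effective learning rate $\eta / (\delta + \sqrt{G_d})$.

import numpy as np

def adagrad_step(w, g, G, eta=0.5, delta=1e-8):
    # diagonal adaGrad, update (10): element-wise accumulation and division
    G = G + g ** 2
    w = w - eta * g / (delta + np.sqrt(G))
    return w, G

w, G = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    g = w                                   # gradient of f(w) = 0.5 * ||w||_2^2
    w, G = adagrad_step(w, g, G)
print(w)                                    # approaches the minimiser at 0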
Advanced adaptive learning rates
adaGrad [Duchi et al., 2010] (cont.)

WHY? A time-varying learning rate vs. a time-varying proximal term:

we want the contribution of $\|g_t\|_*^2$ induced by $\psi_t$ to be upper-bounded by $\|g_t\|_2$:

- define $\|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t\, \cdot\rangle}$
- $\psi_t(w) = \frac{1}{2} w^T H_t w$ is strongly convex w.r.t. $\|\cdot\|_{\psi_t}$
- a little math then shows that
\[
\sum_{t=1}^{T} \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2
\]
Advanced adaptive learning rates
adaGrad [Duchi et al., 2010] (cont.)

Theorem (Thm. 5 in the paper)
Assume the sequence $\{w_t\}$ is generated by adaGrad using the primal-dual update (8) with $\delta \ge \max_t \|g_t\|_\infty$; then for any $u \in S$,
\[
R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2
\]

For $\{w_t\}$ generated using the composite mirror descent update (9),
\[
R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2
\]
Advanced adaptive learning rates
Improving adaGrad

Possible problems of adaGrad:

- the global learning rate $\eta$ still needs hand-tuning
- it is sensitive to initial conditions

Possible improving directions:

- use a truncated sum / running average
- use (approximate) second-order information
- add momentum
- correct the bias of the running average

Existing methods:

- RMSprop [Tieleman and Hinton, 2012]
- adaDelta [Zeiler, 2012]
- ADAM [Kingma and Ba, 2014]
Advanced adaptive learning rates
Improving adaGrad (cont.)

RMSprop (improving direction: replace the full sum with a running average):
\[
G_t = \rho G_{t-1} + (1 - \rho)\, g_t g_t^T, \qquad H_t = (\delta I + \mathrm{diag}(G_t))^{1/2}
\]
\[
w_{t+1} = \arg\min_{w} \;\eta\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)
\]
\[
\Rightarrow \quad w_{t+1} = w_t - \frac{\eta\, g_t}{\mathrm{RMS}[g]_t}, \qquad \mathrm{RMS}[g]_t = \mathrm{diag}(H_t) \tag{11}
\]
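The same step as a code sketch of update (11) (ours; the default $\rho$, $\eta$, $\delta$ values are conventional choices, not from the slides):

import numpy as np

def rmsprop_step(w, g, G, eta=0.01, rho=0.9, delta=1e-8):
    # RMSprop, update (11): running average of squared gradients, element-wise
    G = rho * G + (1 - rho) * g ** 2
    w = w - eta * g / np.sqrt(delta + G)    # RMS[g]_t = sqrt(delta + G)
    return w, G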
Advanced adaptive learning rates
Improving adaGrad (cont.)

Improving direction: use (approximate) second-order information. A Newton-style step satisfies
\[
\Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}
\quad\Rightarrow\quad
\frac{\partial^2 f}{\partial w^2} \propto \frac{\partial f / \partial w}{\Delta w}
\quad\Rightarrow\quad
\mathrm{diag}(H) \approx \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}
\]
Advanced adaptive learning rates
Improving adaGrad (cont.)

adaDelta:
\[
\mathrm{diag}(H_t) = \frac{\mathrm{RMS}[g]_t}{\mathrm{RMS}[\Delta w]_{t-1}}
\]
\[
w_{t+1} = \arg\min_{w} \;\langle w, g_t\rangle + \frac{1}{2}(w - w_t)^T H_t (w - w_t)
\]
\[
\Rightarrow \quad w_{t+1} = w_t - \frac{\mathrm{RMS}[\Delta w]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t \tag{12}
\]
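Update (12) as a sketch (ours): both running averages use the same decay $\rho$, and the $\Delta w$ average is updated after the step is computed; $\rho = 0.95$ and $\delta = 10^{-6}$ are typical values, our choice here rather than the slides'.

import numpy as np

def adadelta_step(w, g, Eg, Edw, rho=0.95, delta=1e-6):
    # adaDelta, update (12): unit-consistent step RMS[dw]_{t-1} / RMS[g]_t
    Eg = rho * Eg + (1 - rho) * g ** 2
    dw = -(np.sqrt(Edw + delta) / np.sqrt(Eg + delta)) * g
    Edw = rho * Edw + (1 - rho) * dw ** 2
    return w + dw, Eg, Edw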
Advanced adaptive learning rates
Improving adaGrad (cont.)

ADAM (improving directions: add momentum, and correct the bias of the running averages):
\[
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\]
\[
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\]
\[
w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \delta} \tag{13}
\]
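Update (13) as a sketch (ours); note $t$ starts at 1 so the bias corrections $1 - \beta^t$ are non-zero:

import numpy as np

def adam_step(w, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    # ADAM, update (13): momentum on g, running average of g^2, bias correction
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + delta)
    return w, m, v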
Advanced adaptive learning rates
Demo time

See the Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Advanced adaptive learning rates
Learn the learning rates

idea 3: learn the (global) learning rates as well:
\[
w_{t+1} = w_t - \eta g_t \quad\Rightarrow\quad w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1})
\]

- $w_t$ is a function of $\eta$
- learn $\eta$ (with gradient descent) by
\[
\eta^* = \arg\min_{\eta} \sum_{s \le t} f_s(w(s, \eta))
\]

- reverse-mode differentiation to back-track the learning process [Maclaurin et al., 2015]
- stochastic gradient descent to learn $\eta$ on the fly [Masse and Ollivier, 2015]
Advanced adaptive learning rates
Summary

- Recent successes of machine learning rely on stochastic approximation and optimisation methods.
- Theoretical analysis guarantees good behaviour.
- Adaptive learning rates provide good performance in practice.
- Future work will reduce the labour spent tuning hyper-parameters.
Advanced adaptive learning rates
References

Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Maclaurin, D., Duvenaud, D., and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P.-Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
Show that hn converges almost surelyExactly the same as for discrete gradient descent
I Assume E[Hn(x)2
2
]le A + Bx minus x2
2I Introduce micron =
prodni=1
11+ε2
i Band hprimen = micronhn
Get
E[hprimen+1 minus hprimen
∣∣Fn
]le ε2
nmicronA
=rArr E[(hprimen+1 minus hprimen)1E[hprimen+1minushprimen|Fn]gt0
∣∣∣Fn
]le ε2
nmicronAsuminfinn=1 ε
2n ltinfin =rArr Quasimartingale convergence =rArr (hprimen)infinn=1
converges as =rArr (hn)infinn=1 converges as26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn3 Show that hn converges almost surely4 Show that hn must converge to 0 almost surely5 sn rarr x
Show that hn must converge to 0 almost surelyFrom previous calculations
E[hn+1 minus (1minus ε2
nB)hn∣∣Fn
]= minus2εn 〈sn minus xnablaC (sn)〉+ ε2
nA
(hn)infinn=1 converges so sequence is summable as Assumesuminfin
n=1 ε2n ltinfin so
right term is summable as so left term side is also summable as infinsumn=1
εn 〈sn minus xnablaC (sn)〉 ltinfinalmost surely
If we assume in addition thatsuminfin
n=1 εn =infin this forces
〈sn minus xnablaC (sn)〉 rarr 0almost surely
And by our initial assumptions about C (hn)infinn=1 must converge to 0 almost
surely 26 of 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Stochastic Gradient Descent
1 Define Lyapunov process hn = sn minus x22
2 Consider the variations hn+1 minus hn
3 Show that hn converges almost surely
4 Show that hn must converge to 0 almost surely
5 sn rarr x
26 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochastic approximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provide good performance in practice
Future work will reduce the labour of tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
References

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based Hyperparameter Optimization through Reversible Learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfied in practice. One possible way of extending the ideas above: Assume

I C is a non-negative, three-times differentiable function
I the Robbins-Monro learning rate conditions hold: $\sum_{n=1}^{\infty} \varepsilon_n = \infty$, $\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty$
I the gradient estimate H satisfies $\mathbb{E}\left[\|H_n(x)\|^k\right] \le A_k + B_k \|x\|^k$ for k = 2, 3, 4
I there exists D > 0 such that $\inf_{\|x\|_2 > D} \langle x, \nabla C(x) \rangle > 0$

The idea for the proof is then:

1. For a given start position, the sequence $(s_n)_{n=1}^{\infty}$ is confined to a bounded neighbourhood of 0 almost surely.
2. Introduce the Lyapunov function $h_n = C(s_n)$ and prove its almost-sure convergence.
3. Prove that $\nabla C(s_n)$ necessarily converges almost surely.

Note that this guarantees we settle at some critical point of the function (which may be a maximum, minimum, or saddle) rather than reaching the global optimum.
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Relaxing the pseudo-convexity conditions
Often the conditions on C in the preceding theorems are not satisfiedin practice One possible way of extending ideas above Assume
I C is a non-negative three-times differentiable functionI Robbins-Monro learning rate conditions hold
suminfinn=1 εn =infinsuminfin
n=1 ε2n ltinfin
I Gradient estimate H satisfies E[Hn(x)k
]le Ak + Bkxk for
k = 2 3 4I There exists D gt 0 such that infx2gtD 〈x nablaC (x)〉 gt 0
Idea for proof is then
1 For a given start position the sequence (sn)infinn=1 is confined to abounded neighbourhood of 0 almost surely
2 Introduce the Lyapunov function hn = C (sn) and prove itsalmost-sure convergence
3 Prove that nablaC (sn) necessarily converges almost surely
Note that this guarantees we settle at some critical point for thefunction (which may be a maximum minimum or saddle) rather thanreaching the global optimum
27 of 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Motivation
So far wersquove told you why SGD is ldquosaferdquo )
but Robbins-Monro is just a sufficient condition
then how to choose learning rates to achieve
fast convergencebetter local optimum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 1 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Motivation (cont)
Figure from [Bergstra and Bengio 2012]
idea 1 search the best learning rate schedulegrid search random search cross validationBayesian optimization
But Irsquom lazy and I donrsquot want to spend too much timeYingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 2 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Motivation (cont)
idea 2 let the learning rates adapt themselves
pre-conditioning with |diag(H)| [Becker and LeCun 1988]adaGrad [Duchi et al 2010]adaDelta [Zeiler 2012] RMSprop [Tieleman and Hinton 2012]ADAM [Kingma and Ba 2014]Equilibrated SGD [Dauphin et al 2015] (this yearrsquos NIPS)
now wersquoll talk a bit about online learning
a good tutorial [Shalev-Shwartz 2012]milestone paper [Zinkevich 2003]
and some practical comparisons
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 3 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Let's forget the cost function on the batch for a moment.

online gradient descent (OGD):

each iteration t we receive a loss f_t(w)
and a (noisy) (sub-)gradient g_t ∈ ∂f_t(w)
then we update the weights with w ← w − ηg_t

for SGD the received gradient is defined by

g_t = (1/M) ∑_{m=1}^M ∇c(w, x_m)

5 of 27
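A sketch of the OGD loop under the same conventions (grad_f(t, w) is an assumed oracle returning a (sub-)gradient g_t ∈ ∂f_t at w; for SGD it would be the mini-batch estimate above):

import numpy as np

def ogd(grad_f, w0, eta, T):
    # w <- w - eta * g_t each round; returns the whole trajectory.
    w = np.asarray(w0, dtype=float).copy()
    iterates = [w.copy()]
    for t in range(T):
        g = np.asarray(grad_f(t, w), dtype=float)
        w -= eta * g
        iterates.append(w.copy())
    return iterates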
Online learning
From batch to online learning (cont)
6 of 27
Online learning
From batch to online learning (cont)
Online learning assumes:

each iteration t we receive a loss f_t(w)
(in the batch learning context, f_t(w) = c(w, x_t))
then we learn w_{t+1} following some rules

Performance is measured by regret:

R*_T = ∑_{t=1}^T f_t(w_t) − inf_{w∈S} ∑_{t=1}^T f_t(w)   (1)

General definition: R_T(u) = ∑_{t=1}^T [f_t(w_t) − f_t(u)]

SGD ⇔ "follow the regularized leader" with L2 regularization

7 of 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL):

w_{t+1} = argmin_{w∈S} ∑_{i=1}^t f_i(w)   (2)

Lemma (upper bound of regret)
Let w_1, w_2, ... be the sequence produced by FTL; then ∀u ∈ S,

R_T(u) = ∑_{t=1}^T [f_t(w_t) − f_t(u)] ≤ ∑_{t=1}^T [f_t(w_t) − f_t(w_{t+1})]   (3)

8 of 27
Online learning
From batch to online learning (cont)
A game that fools FTL:

Let w ∈ S = [−1, 1] and the loss function at time t be

f_t(w) =  −0.5w  if t = 1;   w  if t is even;   −w  if t > 1 and t is odd.

FTL is easily fooled: the cumulative loss ∑_{i≤t} f_i alternates sign, so FTL jumps between w = +1 and w = −1 and pays loss 1 every round after the first, while the fixed point u = 0 pays nothing — linear regret.

9 of 27
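The game can be checked in a few lines (a self-contained sketch; it plays FTL on S = [−1, 1] and shows cumulative loss growing linearly while u = 0 pays nothing):

def ftl_demo(T=20):
    # round-t loss: f_t(w) = a_t * w with the coefficients above
    def a(t):
        return -0.5 if t == 1 else (1.0 if t % 2 == 0 else -1.0)
    w, cum, loss = 0.0, 0.0, 0.0            # start from w_1 = 0
    for t in range(1, T + 1):
        loss += a(t) * w                     # pay f_t(w_t)
        cum += a(t)                          # FTL minimises cum * w over [-1, 1],
        w = 1.0 if cum < 0 else -1.0         # so it jumps to the endpoint -sign(cum)
    return loss

print(ftl_demo(20))   # 19.0: loss grows like T, while u = 0 pays 0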
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL):

w_{t+1} = argmin_{w∈S} ∑_{i=1}^t f_i(w) + ϕ(w)   (4)

Lemma (upper bound of regret)
Let w_1, w_2, ... be the sequence produced by FTRL; then ∀u ∈ S,

∑_{t=1}^T [f_t(w_t) − f_t(u)] ≤ ϕ(u) − ϕ(w_1) + ∑_{t=1}^T [f_t(w_t) − f_t(w_{t+1})]

10 of 27
Online learning
From batch to online learning (cont)
Let's assume the losses f_t are convex functions:

∀w, u ∈ S: f_t(u) − f_t(w) ≥ 〈u − w, g(w)〉, ∀g(w) ∈ ∂f_t(w)

use the linearisation f̃_t(w) = f_t(w_t) + 〈w, g_t〉 as a surrogate:

∑_{t=1}^T f_t(w_t) − f_t(u) ≤ ∑_{t=1}^T 〈w_t, g_t〉 − 〈u, g_t〉, ∀g_t ∈ ∂f_t(w_t)

11 of 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD):

w_{t+1} = w_t − ηg_t,  g_t ∈ ∂f_t(w_t)   (5)

From FTRL to online gradient descent:

use the linearisation as surrogate loss (upper bounds the LHS)
use the L2 regularizer ϕ(w) = (1/2η)‖w − w_1‖²₂
apply the FTRL regret bound to the surrogate loss

w_{t+1} = argmin_{w∈S} ∑_{i=1}^t 〈w, g_i〉 + ϕ(w)

12 of 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD):

w_{t+1} = w_t − ηg_t,  g_t ∈ ∂f_t(w_t)   (5)

Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f_1, f_2, ..., f_T with regularizer ϕ(w) = (1/2η)‖w − w_1‖²₂; then for all u ∈ S,

R_T(u) ≤ (1/2η)‖u − w_1‖²₂ + η ∑_{t=1}^T ‖g_t‖²₂,  g_t ∈ ∂f_t(w_t)

12 of 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f_1, f_2, ..., f_T with regularizer ϕ(w) = (1/2η)‖w − w_1‖²₂; then for all u ∈ S,

R_T(u) ≤ (1/2η)‖u − w_1‖²₂ + η ∑_{t=1}^T ‖g_t‖²₂,  g_t ∈ ∂f_t(w_t)

(with bounded gradients, choosing η ∝ 1/√T balances the two terms and gives O(√T), i.e. sublinear, regret)

keep in mind for later discussion of adaGrad:

I want ϕ to be strongly convex w.r.t. a (semi-)norm ‖·‖;
then the L2 norm will be changed to the dual norm ‖·‖_*,
and we use Hölder's inequality in the proof

13 of 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψ_t:

ψ_t is a strongly-convex function, and

B_{ψt}(w, w_t) = ψ_t(w) − ψ_t(w_t) − 〈∇ψ_t(w_t), w − w_t〉

Primal-dual sub-gradient update:

w_{t+1} = argmin_w η〈w, g_t〉 + ηϕ(w) + (1/t)ψ_t(w)   (6)

Proximal gradient / composite mirror descent:

w_{t+1} = argmin_w η〈w, g_t〉 + ηϕ(w) + B_{ψt}(w, w_t)   (7)

14 of 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices:

G_t = ∑_{i=1}^t g_i g_i^T,  H_t = δI + diag(G_t)^{1/2}

w_{t+1} = argmin_{w∈S} η〈w, g_t〉 + ηϕ(w) + (1/2t) w^T H_t w   (8)

w_{t+1} = argmin_{w∈S} η〈w, g_t〉 + ηϕ(w) + (1/2)(w − w_t)^T H_t (w − w_t)   (9)

To get adaGrad from online gradient descent:
add a proximal term or a Bregman divergence to the objective
adapt the proximal term through time:

ψ_t(w) = (1/2) w^T H_t w

15 of 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example: composite mirror descent on S = R^D with ϕ(w) = 0:

w_{t+1} = argmin_w η〈w, g_t〉 + (1/2)(w − w_t)^T [δI + diag(G_t)^{1/2}](w − w_t)

⇒ w_{t+1} = w_t − ηg_t / (δ + √(∑_{i=1}^t g_i²))   (10)

(division and square root act elementwise, per dimension)

16 of 27
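A sketch of the diagonal update (10) (same assumed grad_f oracle as earlier; the eta and delta defaults are illustrative):

import numpy as np

def adagrad(grad_f, w0, eta=0.1, delta=1e-8, T=100):
    # w <- w - eta * g / (delta + sqrt(sum_i g_i^2)), elementwise
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)                  # running sum of squared gradients
    for t in range(T):
        g = np.asarray(grad_f(t, w), dtype=float)
        G += g * g
        w -= eta * g / (delta + np.sqrt(G))
    return w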
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY?

time-varying learning rate vs. time-varying proximal term:

I want the contribution of ‖g_t‖²_* induced by ψ_t to be upper-bounded by ‖g_t‖₂:

define ‖·‖_{ψt} = √〈·, H_t ·〉;
ψ_t(w) = (1/2) w^T H_t w is strongly convex w.r.t. ‖·‖_{ψt}

a little math can show that

∑_{t=1}^T ‖g_t‖²_{ψ*t} ≤ 2 ∑_{d=1}^D ‖g_{1:T,d}‖₂

17 of 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm. 5 in the paper)
Assume the sequence {w_t} is generated by adaGrad using the primal-dual update (8) with δ ≥ max_t ‖g_t‖_∞; then for any u ∈ S,

R_T(u) ≤ (δ/η)‖u‖²₂ + (1/η)‖u‖²_∞ ∑_{d=1}^D ‖g_{1:T,d}‖₂ + η ∑_{d=1}^D ‖g_{1:T,d}‖₂

For {w_t} generated using the composite mirror descent update (9),

R_T(u) ≤ (1/2η) max_{t≤T} ‖u − w_t‖²₂ ∑_{d=1}^D ‖g_{1:T,d}‖₂ + η ∑_{d=1}^D ‖g_{1:T,d}‖₂

18 of 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad:

the global learning rate η still needs hand-tuning
sensitive to initial conditions

possible improving directions:

use truncated sum / running average
use (approximate) second-order information
add momentum
correct the bias of the running average

existing methods:

RMSprop [Tieleman and Hinton 2012]
adaDelta [Zeiler 2012]
ADAM [Kingma and Ba 2014]

19 of 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop:

G_t = ρG_{t−1} + (1 − ρ)g_t g_t^T,  H_t = (δI + diag(G_t))^{1/2}

w_{t+1} = argmin_w η〈w, g_t〉 + (1/2)(w − w_t)^T H_t (w − w_t)

⇒ w_{t+1} = w_t − ηg_t / RMS[g]_t,  RMS[g]_t = diag(H_t)   (11)

possible improving directions:
use truncated sum / running average
use (approximate) second-order information
add momentum
correct the bias of the running average

20 of 27
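A matching RMSprop sketch (only the accumulator changes relative to the adaGrad sketch; the hyper-parameter defaults are illustrative, not from the slides):

import numpy as np

def rmsprop(grad_f, w0, eta=1e-3, rho=0.9, delta=1e-8, T=100):
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)                    # running average of squared gradients
    for t in range(T):
        g = np.asarray(grad_f(t, w), dtype=float)
        G = rho * G + (1 - rho) * g * g
        w -= eta * g / np.sqrt(delta + G)   # RMS[g]_t = sqrt(delta + G_t)
    return w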
Advanced adaptive learning rates
Improving adaGrad (cont)
Δw ∝ H⁻¹g ∝ (∂f/∂w) / (∂²f/∂w²)

⇒ ∂²f/∂w² ∝ (∂f/∂w) / Δw

⇒ diag(H) ≈ RMS[g]_t / RMS[Δw]_{t−1}

possible improving directions:
use truncated sum / running average
use (approximate) second-order information
add momentum
correct the bias of the running average

21 of 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta:

diag(H_t) = RMS[g]_t / RMS[Δw]_{t−1}

w_{t+1} = argmin_w 〈w, g_t〉 + (1/2)(w − w_t)^T H_t (w − w_t)

⇒ w_{t+1} = w_t − (RMS[Δw]_{t−1} / RMS[g]_t) g_t   (12)

possible improving directions:
use truncated sum / running average
use (approximate) second-order information
add momentum
correct the bias of the running average

21 of 27
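An adaDelta sketch along eq. (12): running averages of both g² and Δw² are kept, and no global η remains (the rho and delta defaults are illustrative):

import numpy as np

def adadelta(grad_f, w0, rho=0.95, delta=1e-6, T=100):
    w = np.asarray(w0, dtype=float).copy()
    Eg2 = np.zeros_like(w)     # running average of g^2
    Edw2 = np.zeros_like(w)    # running average of (delta w)^2
    for t in range(T):
        g = np.asarray(grad_f(t, w), dtype=float)
        Eg2 = rho * Eg2 + (1 - rho) * g * g
        dw = -np.sqrt(Edw2 + delta) / np.sqrt(Eg2 + delta) * g   # RMS ratio * g
        Edw2 = rho * Edw2 + (1 - rho) * dw * dw
        w += dw
    return w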
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM:

m_t = β₁m_{t−1} + (1 − β₁)g_t,  v_t = β₂v_{t−1} + (1 − β₂)g_t²

m̂_t = m_t / (1 − β₁^t),  v̂_t = v_t / (1 − β₂^t)

w_{t+1} = w_t − ηm̂_t / (√v̂_t + δ)   (13)

possible improving directions:
use truncated sum / running average
use (approximate) second-order information
add momentum
correct the bias of the running average

22 of 27
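An ADAM sketch following eq. (13) (the defaults are those suggested by Kingma and Ba; delta plays the role of their ε):

import numpy as np

def adam(grad_f, w0, eta=1e-3, beta1=0.9, beta2=0.999, delta=1e-8, T=100):
    w = np.asarray(w0, dtype=float).copy()
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, T + 1):
        g = np.asarray(grad_f(t, w), dtype=float)
        m = beta1 * m + (1 - beta1) * g        # momentum on g
        v = beta2 * v + (1 - beta2) * g * g    # RMS on g^2
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w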
Advanced adaptive learning rates
Demo time
wikipedia page "test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization

23 of 27
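In lieu of the live demo, the sketches above can be pointed at one of those test functions; here a hand-derived gradient of the Beale function (minimum at (3, 0.5)). This assumes the adagrad/rmsprop/adam sketches defined earlier, and where each method ends up after 5000 steps depends on the illustrative defaults:

import numpy as np

def beale_grad(_, w):
    # Beale: f(x,y) = (1.5 - x + xy)^2 + (2.25 - x + xy^2)^2 + (2.625 - x + xy^3)^2
    x, y = w
    t1, t2, t3 = 1.5 - x + x * y, 2.25 - x + x * y**2, 2.625 - x + x * y**3
    dx = 2 * t1 * (y - 1) + 2 * t2 * (y**2 - 1) + 2 * t3 * (y**3 - 1)
    dy = 2 * t1 * x + 4 * t2 * x * y + 6 * t3 * x * y**2
    return np.array([dx, dy])

for opt in (adagrad, rmsprop, adam):
    print(opt.__name__, opt(beale_grad, [1.0, 1.0], T=5000))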
Advanced adaptive learning rates
Learn the learning rates
idea 3: learn the (global) learning rates as well

w_{t+1} = w_t − ηg_t ⇒ w_t = w(t, η, w_1, {g_i}_{i≤t−1})

so w_t is a function of η

learn η (with gradient descent) by

η = argmin_η ∑_{s≤t} f_s(w(s, η))

reverse-mode differentiation to back-track the learning process [Maclaurin et al. 2015]
stochastic gradient descent to learn η on the fly [Masse and Ollivier 2015]

24 of 27
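A crude sketch of the second idea (learning η on the fly): since w_t = w_{t−1} − ηg_{t−1}, a one-step approximation gives ∂f_t(w_t)/∂η ≈ −〈g_t, g_{t−1}〉, so η itself can be adapted by a (hyper-)gradient step. Here β is an assumed hyper-learning-rate, and this is only a simplification of the cited schemes, not their exact algorithm:

import numpy as np

def sgd_learn_eta(grad_f, w0, eta0=0.01, beta=1e-4, T=100):
    w = np.asarray(w0, dtype=float).copy()
    eta, g_prev = eta0, np.zeros_like(w)
    for t in range(T):
        g = np.asarray(grad_f(t, w), dtype=float)
        eta += beta * float(g @ g_prev)    # descend, since d f_t/d eta = -<g_t, g_{t-1}>
        w -= eta * g
        g_prev = g
    return w, eta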
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochastic approximation and optimisation methods

Theoretical analysis guarantees good behaviour

Adaptive learning rates provide good performance in practice

Future work will reduce the labour of tuning hyper-parameters

25 of 27
Advanced adaptive learning rates
References
Bergstra J., Bengio Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker S., LeCun Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi J., Hazan E., Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

26 of 27
Advanced adaptive learning rates
References
Tieleman T., Hinton G. Lecture 6.5 - rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
Kingma D., Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin Y. N., de Vries H., Chung J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich M. Online convex programming and generalized infinitesimal gradient ascent. ICML 2003.

27 of 27
Advanced adaptive learning rates
References

Shalev-Shwartz S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin D., Duvenaud D., Adams R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse P.-Y., Ollivier Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.

28 of 27
Online learning
From batch to online learning
Batch learning often assumes
Wersquove got the whole dataset Dthe cost function C (wD) = ExsimD[c(w x)]
SGDmini-batch learning accelerate training in real time by
processing onea mini-batch of datapoint each iterationconsidering gradients with data point cost functions
nablaC (wD) asymp 1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 4 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Online learning
From batch to online learning (cont)
Letrsquos forget the cost function on the batch for a moment
online gradient descent (OGD)
each iteration t we receive a loss ft(w)and a (noisy) (sub-)gradient gt isin partft(w)then we update the weights with wlarr w minus ηgt
for SGD the received gradient is defined by
gt =1
M
Msumm=1
nablac(w xm)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 5 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Online learning
From batch to online learning (cont)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 6 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Online learning
From batch to online learning (cont)
Online learning assumes
each iteration t we receive a loss ft(w)(in batch learning context ft(w) = c(w xt))then we learn wt+1 following some rules
Performance is measured by regret
RlowastT =Tsumt=1
ft(wt)minus infwisinS
Tsumt=1
ft(w) (1)
General definition RT (u) =sumT
t=1 [ft(wt)minus ft(u)]
SGD hArr ldquofollow the regularized leaderrdquo with L2 regularization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 7 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Online learning
From batch to online learning (cont)
Follow the Leader (FTL)
wt+1 = argminwisinS
tsumi=1
fi(w) (2)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTL then forallu isin S
RT (u) =Tsumt=1
[ft(wt)minus ft(u)] leTsumt=1
[ft(wt)minus ft(wt+1)] (3)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 8 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Let's assume the losses f_t are convex functions:
∀w, u ∈ S: f_t(u) − f_t(w) ≥ ⟨u − w, g(w)⟩, ∀g(w) ∈ ∂f_t(w)
use the linearisation f̃_t(w) = f_t(w_t) + ⟨w, g(w_t)⟩ as a surrogate:
Σ_{t=1}^T f_t(w_t) − f_t(u) ≤ Σ_{t=1}^T ⟨w_t, g_t⟩ − ⟨u, g_t⟩, ∀g_t ∈ ∂f_t(w_t)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
w_{t+1} = w_t − η g_t,  g_t ∈ ∂f_t(w_t)   (5)
From FTRL to online gradient descent:
▶ use the linearisation as surrogate loss (upper bound on the LHS)
▶ use the L2 regularizer ϕ(w) = (1/2η)||w − w_1||_2^2
▶ apply the FTRL regret bound to the surrogate loss:
w_{t+1} = argmin_{w∈S} Σ_{i=1}^t ⟨w, g_i⟩ + ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
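A direct transcription of update (5), with a Euclidean projection back onto S; the absolute-loss stream, the dimensions and the step size are assumptions chosen for the demo.

import numpy as np

rng = np.random.default_rng(0)
D, T, eta = 10, 500, 0.05
w = np.zeros(D)
w_star = rng.normal(size=D)
w_star /= np.linalg.norm(w_star)          # comparator inside the unit ball

total = 0.0
for t in range(T):
    x = rng.normal(size=D)
    y = x @ w_star
    total += abs(x @ w - y)               # f_t(w) = |<x_t, w> - y_t|
    g = np.sign(x @ w - y) * x            # a subgradient of f_t at w_t
    w = w - eta * g                       # OGD step: w_{t+1} = w_t - eta * g_t
    n = np.linalg.norm(w)                 # project onto S = {||w||_2 <= 1}
    if n > 1.0:
        w /= n

print(total / T)                          # average loss decays, consistent with sublinear regret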
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f_1, f_2, . . . , f_T with regularizer ϕ(w) = (1/2η)||w − w_1||_2^2. Then for all u ∈ S,
R_T(u) ≤ (1/2η)||u − w_1||_2^2 + η Σ_{t=1}^T ||g_t||_2^2,  g_t ∈ ∂f_t(w_t)
keep in mind for later discussion of adaGrad:
▶ want ϕ to be strongly convex w.r.t. a (semi-)norm ||·||
▶ then the L2 norm is replaced by the dual norm ||·||_*
▶ and we use Hölder's inequality in the proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψ_t:
ψ_t is a strongly convex function and
B_{ψt}(w, w_t) = ψ_t(w) − ψ_t(w_t) − ⟨∇ψ_t(w_t), w − w_t⟩
Primal-dual sub-gradient update:
w_{t+1} = argmin_w η⟨w, g_t⟩ + ηϕ(w) + (1/t)ψ_t(w)   (6)
Proximal gradient / composite mirror descent:
w_{t+1} = argmin_w η⟨w, g_t⟩ + ηϕ(w) + B_{ψt}(w, w_t)   (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices:
G_t = Σ_{i=1}^t g_i g_i^T,  H_t = δI + diag(G_t)^{1/2}
w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηϕ(w) + (1/2t) w^T H_t w   (8)
w_{t+1} = argmin_{w∈S} η⟨w, g_t⟩ + ηϕ(w) + (1/2)(w − w_t)^T H_t (w − w_t)   (9)
To get adaGrad from online gradient descent:
▶ add a proximal term or a Bregman divergence to the objective
▶ adapt the proximal term through time:
ψ_t(w) = (1/2) w^T H_t w
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example: composite mirror descent on S = R^D with ϕ(w) = 0
w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2)(w − w_t)^T [δI + diag(G_t)^{1/2}](w − w_t)
⇒ w_{t+1} = w_t − η g_t / (δ + √(Σ_{i=1}^t g_i^2))  (elementwise)   (10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
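Update (10) in a few lines of numpy (the badly scaled quadratic and all constants are illustrative assumptions): each coordinate receives its own effective step size η/(δ + √(Σ_i g_i^2)).

import numpy as np

def adagrad(grad, w0, eta=0.5, delta=1e-8, steps=200):
    w = w0.copy()
    g2_sum = np.zeros_like(w)              # running sum of squared gradients, diag(G_t)
    for _ in range(steps):
        g = grad(w)
        g2_sum += g * g
        w -= eta * g / (delta + np.sqrt(g2_sum))   # per-coordinate update (10)
    return w

# Badly scaled quadratic f(w) = 0.5 * (100 w_0^2 + w_1^2):
# adaGrad equalises the effective step size across the two coordinates.
grad = lambda w: np.array([100.0, 1.0]) * w
print(adagrad(grad, np.array([1.0, 1.0])))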
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY?
time-varying learning rate vs. time-varying proximal term:
▶ want the contribution of ||g_t||_*^2 induced by ψ_t to be upper-bounded by ||g_t||_2
▶ define ||·||_{ψt} = √⟨·, H_t ·⟩
▶ ψ_t(w) = (1/2) w^T H_t w is strongly convex w.r.t. ||·||_{ψt}
▶ a little math can show that
Σ_{t=1}^T ||g_t||_{ψt,*}^2 ≤ 2 Σ_{d=1}^D ||g_{1:T,d}||_2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
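The inequality is easy to check numerically on a random gradient stream (the stream, sizes and seed are assumptions; δ = 0 relies on the convention 0/0 = 0, which random data never triggers):

import numpy as np

rng = np.random.default_rng(1)
T, D, delta = 200, 5, 0.0
G = rng.normal(size=(T, D))            # stand-in gradient stream g_1 .. g_T

g2_sum, lhs = np.zeros(D), 0.0
for t in range(T):
    g = G[t]
    g2_sum += g * g
    h = delta + np.sqrt(g2_sum)        # diagonal of H_t
    lhs += np.sum(g * g / h)           # ||g_t||^2 in the dual norm of psi_t

rhs = 2 * np.sum(np.sqrt(g2_sum))      # 2 * sum_d ||g_{1:T,d}||_2
print(lhs <= rhs, lhs, rhs)            # the bound holds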
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm. 5 in the paper)
Assume the sequence {w_t} is generated by adaGrad using the primal-dual update (8) with δ ≥ max_t ||g_t||_∞. Then for any u ∈ S,
R_T(u) ≤ (δ/η)||u||_2^2 + (1/η)||u||_∞^2 Σ_{d=1}^D ||g_{1:T,d}||_2 + η Σ_{d=1}^D ||g_{1:T,d}||_2
For {w_t} generated using the composite mirror descent update (9),
R_T(u) ≤ (1/2η) max_{t≤T} ||u − w_t||_2^2 Σ_{d=1}^D ||g_{1:T,d}||_2 + η Σ_{d=1}^D ||g_{1:T,d}||_2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad:
▶ the global learning rate η still needs hand-tuning
▶ sensitive to initial conditions
possible improving directions:
▶ use a truncated sum / running average
▶ use (approximate) second-order information
▶ add momentum
▶ correct the bias of the running average
existing methods:
▶ RMSprop [Tieleman and Hinton 2012]
▶ adaDelta [Zeiler 2012]
▶ ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
G_t = ρG_{t−1} + (1 − ρ) g_t g_t^T,  H_t = (δI + diag(G_t))^{1/2}
w_{t+1} = argmin_w η⟨w, g_t⟩ + (1/2)(w − w_t)^T H_t (w − w_t)
⇒ w_{t+1} = w_t − η g_t / RMS[g]_t,  RMS[g]_t = diag(H_t)   (11)
improving direction used here: a running average in place of adaGrad's cumulative sum
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
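The corresponding numpy sketch (ρ, η, δ and the test quadratic are assumptions): the exponential moving average keeps the denominator from growing without bound, unlike adaGrad's cumulative sum.

import numpy as np

def rmsprop_step(w, g, msq, eta=0.01, rho=0.9, delta=1e-8):
    msq = rho * msq + (1.0 - rho) * g * g      # G_t = rho * G_{t-1} + (1 - rho) g^2
    w = w - eta * g / np.sqrt(delta + msq)     # divide by RMS[g]_t, as in (11)
    return w, msq

w, msq = np.array([1.0, 1.0]), np.zeros(2)
grad = lambda w: np.array([100.0, 1.0]) * w    # same badly scaled quadratic
for _ in range(500):
    w, msq = rmsprop_step(w, grad(w), msq)
print(w)                                       # both coordinates are driven near 0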
Advanced adaptive learning rates
Improving adaGrad (cont)
Δw ∝ H^{−1}g ∝ (∂f/∂w) / (∂²f/∂w²)
⇒ ∂²f/∂w² ∝ (∂f/∂w) / Δw
⇒ diag(H) ≈ RMS[g]_t / RMS[Δw]_{t−1}
improving direction used here: (approximate) second-order information
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta:
diag(H_t) = RMS[g]_t / RMS[Δw]_{t−1}
w_{t+1} = argmin_w ⟨w, g_t⟩ + (1/2)(w − w_t)^T H_t (w − w_t)
⇒ w_{t+1} = w_t − (RMS[Δw]_{t−1} / RMS[g]_t) g_t   (12)
improving directions used here: a running average plus (approximate) second-order information
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
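A compact numpy sketch of update (12) in the spirit of [Zeiler 2012] (ρ, δ, the quadratic and the step count are assumptions); note that no global learning rate η appears in the update.

import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, steps=2000):
    w = w0.copy()
    msq_g = np.zeros_like(w)                   # running average of g^2
    msq_dw = np.zeros_like(w)                  # running average of dw^2
    for _ in range(steps):
        g = grad(w)
        msq_g = rho * msq_g + (1 - rho) * g * g
        dw = -np.sqrt(msq_dw + delta) / np.sqrt(msq_g + delta) * g
        msq_dw = rho * msq_dw + (1 - rho) * dw * dw
        w = w + dw
    return w

grad = lambda w: np.array([100.0, 1.0]) * w    # same badly scaled quadratic
print(adadelta(grad, np.array([1.0, 1.0])))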
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM:
m_t = β_1 m_{t−1} + (1 − β_1) g_t,  v_t = β_2 v_{t−1} + (1 − β_2) g_t^2
m̂_t = m_t / (1 − β_1^t),  v̂_t = v_t / (1 − β_2^t)
w_{t+1} = w_t − η m̂_t / (√v̂_t + δ)   (13)
improving directions used here: momentum plus bias correction of the running averages
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
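Update (13) in numpy; β_1 = 0.9 and β_2 = 0.999 follow the paper's defaults, while the test problem, η and the step count are assumptions.

import numpy as np

def adam(grad, w0, eta=0.1, beta1=0.9, beta2=0.999, delta=1e-8, steps=500):
    w = w0.copy()
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w

grad = lambda w: np.array([100.0, 1.0]) * w    # same badly scaled quadratic
print(adam(grad, np.array([1.0, 1.0])))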
Advanced adaptive learning rates
Demo time
wikipedia page "test functions for optimization": https://en.wikipedia.org/wiki/Test_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
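In lieu of the live demo, here is a self-contained run of the ADAM update (13) on one function from that page, the Rosenbrock function; the start point, η and the step count are arbitrary choices.

import numpy as np

def rosen_grad(w):
    # gradient of f(x, y) = (1 - x)^2 + 100 (y - x^2)^2
    x, y = w
    return np.array([-2 * (1 - x) - 400 * x * (y - x * x),
                     200 * (y - x * x)])

w = np.array([-1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 20001):                      # ADAM as defined in eq. (13)
    g = rosen_grad(w)
    m = 0.9 * m + 0.1 * g
    v = 0.999 * v + 0.001 * g * g
    w -= 1e-3 * (m / (1 - 0.9 ** t)) / (np.sqrt(v / (1 - 0.999 ** t)) + 1e-8)
print(w)                                       # approaches the minimum at (1, 1)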
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochastic approximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provide good performance in practice
Future work will reduce the labour of tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.
Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
References
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based Hyperparameter Optimization through Reversible Learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Online learning
From batch to online learning (cont)
A game that fools FTL
Let w isin S = [minus1 1] and the loss function at time t
ft(w) =
minus05w t = 1
w t is even
minusw t gt 1 and t is odd
FTL is easily fooled
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 9 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Online learning
From batch to online learning (cont)
Follow the Regularized Leader (FTRL)
wt+1 = argminwisinS
tsumi=1
fi(w) + ϕ(w) (4)
Lemma (upper bound of regret)
Let w1 w2 be the sequence produced by FTRL then forallu isin S
Tsumt=1
[ft(wt)minus ft(u)] le ϕ(u)minus ϕ(w1) +Tsumt=1
[ft(wt)minus ft(wt+1)]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 10 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Online learning
From batch to online learning (cont)
Letrsquos assume the loss ft are convex functions
forallwu isin S ft(u)minus ft(w) ge 〈uminusw g(w)〉forallg(w) isin partft(w)
use linearisation ft(w) = ft(wt) + 〈w g(w)〉 as a surrogate
Tsumt=1
ft(wt)minus ft(u) leTsumt=1
〈wt gt〉 minus 〈u gt〉 forallgt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 11 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
From FTRL to online gradient descent
use linearisation as surrogate loss (upper bound LHS)use L2 regularizer ϕ(w) = 1
2η ||w minusw1||22apply FTRL regret bound to the surrogate loss
wt+1 = argminwisinS
tsumi=1
〈w gi〉+ ϕ(w)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Online learning
From batch to online learning (cont)
Online Gradient Descent (OGD)
wt+1 = wt minus ηgt gt isin partft(wt) (5)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 12 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Online learning
From batch to online learning (cont)
Theorem (regret bound for online gradient descent)
Assume we run online gradient descent on convex loss functions f1f2 fT with regularizer ϕ(w) = 1
2η||w minusw1||22 then for all u isin S
RT (u) le 1
2η||uminusw1||22 + η
Tsumt=1
||gt ||22 gt isin partft(wt)
keep in mind for later discussion of adaGrad
I want ϕ to be strongly convex wrt (semi-)norm || middot ||then the L2 norm will be changed to || middot ||lowastand we use Holderrsquos inequality for proof
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 13 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
adaGrad [Duchi et al 2010] (cont)

WHY? Time-varying learning rate vs. time-varying proximal term.

We want the contribution of \|g_t\|_*^2 induced by \psi_t to be upper bounded by \|g_t\|_2:
I define \|\cdot\|_{\psi_t} = \sqrt{\langle \cdot, H_t \cdot \rangle}
I \psi_t(w) = \tfrac{1}{2} w^T H_t w is strongly convex w.r.t. \|\cdot\|_{\psi_t}
I a little algebra then shows that

  \sum_{t=1}^{T} \|g_t\|_{\psi_t^*}^2 \le 2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2
adaGrad [Duchi et al 2010] (cont)

Theorem (Thm. 5 in the paper)
Assume the sequence {w_t} is generated by adaGrad using the primal-dual update (8) with \delta \ge \max_t \|g_t\|_\infty; then for any u \in S,

  R_T(u) \le \frac{\delta}{\eta}\|u\|_2^2 + \frac{1}{\eta}\|u\|_\infty^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2.

For {w_t} generated using the composite mirror descent update (9),

  R_T(u) \le \frac{1}{2\eta} \max_{t \le T} \|u - w_t\|_2^2 \sum_{d=1}^{D} \|g_{1:T,d}\|_2 + \eta \sum_{d=1}^{D} \|g_{1:T,d}\|_2.
Improving adaGrad

Possible problems of adaGrad:
I the global learning rate \eta still needs hand tuning
I sensitive to initial conditions

Possible improving directions:
I use a truncated sum / running average
I use (approximate) second-order information
I add momentum
I correct the bias of the running average

Existing methods:
I RMSprop [Tieleman and Hinton 2012]
I adaDelta [Zeiler 2012]
I ADAM [Kingma and Ba 2014]
Improving adaGrad (cont)

RMSprop:

  G_t = \rho G_{t-1} + (1 - \rho) g_t g_t^T, \quad H_t = (\delta I + diag(G_t))^{1/2}

  w_{t+1} = argmin_w \; \eta\langle w, g_t \rangle + \tfrac{1}{2}(w - w_t)^T H_t (w - w_t)

  \Rightarrow w_{t+1} = w_t - \frac{\eta g_t}{RMS[g]_t}, \quad RMS[g]_t = diag(H_t)   (11)

Improving direction used here: replace the full sum by a running average (see the sketch below).
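A minimal numpy sketch of update (11), reusing the same toy objective as the adaGrad sketch above; names are illustrative.

```python
import numpy as np

def rmsprop(grad, w0, eta=0.01, rho=0.9, delta=1e-8, num_steps=200):
    """RMSprop, eq. (11): divide the gradient by a running RMS of its magnitude."""
    w = w0.astype(float).copy()
    avg_sq = np.zeros_like(w)                    # running average of g^2, diag(G_t)
    for _ in range(num_steps):
        g = grad(w)
        avg_sq = rho * avg_sq + (1 - rho) * g ** 2
        w -= eta * g / np.sqrt(delta + avg_sq)   # RMS[g]_t = sqrt(delta + avg_sq)
    return w

scales = np.array([100.0, 1.0])
w_star = rmsprop(lambda w: scales * w, w0=np.array([1.0, 1.0]))
```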
Improving adaGrad (cont)

Second-order motivation (leading to adaDelta): for a Newton-like step,

  \Delta w \propto H^{-1} g \propto \frac{\partial f / \partial w}{\partial^2 f / \partial w^2}

  \Rightarrow \partial^2 f / \partial w^2 \propto \frac{\partial f / \partial w}{\Delta w}

  \Rightarrow diag(H) \approx \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}

Improving direction used here: (approximate) second-order information.
Improving adaGrad (cont)

adaDelta:

  diag(H_t) = \frac{RMS[g]_t}{RMS[\Delta w]_{t-1}}

  w_{t+1} = argmin_w \; \langle w, g_t \rangle + \tfrac{1}{2}(w - w_t)^T H_t (w - w_t)

  \Rightarrow w_{t+1} = w_t - \frac{RMS[\Delta w]_{t-1}}{RMS[g]_t} g_t   (12)

Improving directions used here: running averages plus (approximate) second-order information (see the sketch below).
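A minimal numpy sketch of update (12); the placement of the small constant inside both RMS terms follows my reading of Zeiler's paper, so treat that detail as an assumption.

```python
import numpy as np

def adadelta(grad, w0, rho=0.95, delta=1e-6, num_steps=500):
    """adaDelta, eq. (12): step size RMS[dw]_{t-1} / RMS[g]_t, no global eta."""
    w = w0.astype(float).copy()
    avg_sq_g = np.zeros_like(w)    # running average of g^2
    avg_sq_dw = np.zeros_like(w)   # running average of (delta w)^2
    for _ in range(num_steps):
        g = grad(w)
        avg_sq_g = rho * avg_sq_g + (1 - rho) * g ** 2
        dw = -np.sqrt(avg_sq_dw + delta) / np.sqrt(avg_sq_g + delta) * g
        avg_sq_dw = rho * avg_sq_dw + (1 - rho) * dw ** 2
        w += dw
    return w

scales = np.array([100.0, 1.0])
w_star = adadelta(lambda w: scales * w, w0=np.array([1.0, 1.0]))
```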
Improving adaGrad (cont)

ADAM:

  m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

  \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}

  w_{t+1} = w_t - \frac{\eta \hat{m}_t}{\sqrt{\hat{v}_t} + \delta}   (13)

Improving directions used here: running averages, momentum, and bias correction of the running averages (see the sketch below).
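A minimal numpy sketch of update (13), same toy setup as above; hyper-parameter defaults are the commonly cited ones and are an assumption here.

```python
import numpy as np

def adam(grad, w0, eta=0.1, beta1=0.9, beta2=0.999, delta=1e-8, num_steps=500):
    """ADAM, eq. (13): bias-corrected running averages of g and g^2."""
    w = w0.astype(float).copy()
    m = np.zeros_like(w)   # first-moment (momentum) running average
    v = np.zeros_like(w)   # second-moment running average
    for t in range(1, num_steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction: m_t starts from zero
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (np.sqrt(v_hat) + delta)
    return w

scales = np.array([100.0, 1.0])
w_star = adam(lambda w: scales * w, w0=np.array([1.0, 1.0]))
```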
Demo time

See the Wikipedia page "Test functions for optimization":
https://en.wikipedia.org/wiki/Test_functions_for_optimization
Learn the learning rates

Idea 3: learn the (global) learning rate as well.

  w_{t+1} = w_t - \eta g_t \; \Rightarrow \; w_t = w(t, \eta, w_1, \{g_i\}_{i \le t-1})

so w_t is a function of \eta, and we can learn \eta (with gradient descent) by

  \eta^* = argmin_\eta \sum_{s \le t} f_s(w(s, \eta))

I reverse-mode differentiation to back-track the learning process [Maclaurin et al 2015]
I stochastic gradient descent to learn \eta on the fly [Masse and Ollivier 2015]

A sketch of the second idea follows below.
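A minimal sketch of learning \eta on the fly, in the spirit of [Masse and Ollivier 2015]. The one-step hypergradient approximation d f_t(w_t)/d\eta \approx \langle g_t, -g_{t-1} \rangle (since w_t = w_{t-1} - \eta g_{t-1}) and all names here are my own simplification, not the paper's exact algorithm.

```python
import numpy as np

def sgd_with_learned_eta(grad, w0, eta0=0.01, beta=1e-4, num_steps=500):
    """SGD that also adapts its global learning rate eta by gradient descent.

    Grow eta when successive gradients align (we could have stepped further),
    shrink it when they oppose (we overshot).
    """
    w = w0.astype(float).copy()
    eta = eta0
    g_prev = np.zeros_like(w)
    for _ in range(num_steps):
        g = grad(w)
        eta -= beta * np.dot(g, -g_prev)  # one-step hypergradient step on eta
        eta = max(eta, 0.0)               # keep the learning rate non-negative
        w -= eta * g
        g_prev = g
    return w, eta

scales = np.array([100.0, 1.0])
w_star, eta_star = sgd_with_learned_eta(lambda w: scales * w, w0=np.array([1.0, 1.0]))
```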
Summary

I Recent successes of machine learning rely on stochastic approximation and optimisation methods
I Theoretical analysis guarantees good behaviour
I Adaptive learning rates provide good performance in practice
I Future work will reduce the labour of tuning hyper-parameters
References

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 2012, 13(1): 281-305.

Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.

Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 2011, 12: 2121-2159.

Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012, 4.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2011, 4(2): 107-194.

Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.

Masse, P. Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Advanced adaptive learning rates
adaGrad [Duchi et al 2010]
Use proximal terms ψt
ψt is a strongly-convex function and
Bψt (wwt) = ψt(w)minus ψt(wt)minus 〈nablaψt(wt)w minuswt〉
Primal-dual sub-gradient update
wt+1 = argminw
η〈w gt〉+ ηϕ(w) +1
tψt(w) (6)
Proximal gradientcomposite mirror descent
wt+1 = argminw
η〈w gt〉+ ηϕ(w) + Bψt (wwt) (7)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 14 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
adaGrad with diagonal matrices
Gt =tsum
i=1
gtgTt Ht = δI + diag(G
12t )
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2twTHtw (8)
wt+1 = argminwisinS
η〈w gt〉+ ηϕ(w) +1
2(w minuswt)
THt(w minuswt)
(9)
To get adaGrad from online gradient descentadd a proximal term or a Bregman divergence to the objectiveadapt the proximal term through time
ψt(w) =1
2wTHtw
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 15 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
example composite mirror descent on S = RD with ϕ(w) = 0
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
T [δI + diag(Gt)](w minuswt)
rArr wt+1 = wt minusηgt
δ +radicsumt
i=1 g2i
(10)
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 16 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
WHY
time-varying learning rate vs time-varying proximal term
I want the contribution of ||gt ||2lowast induced by ψt be upperbounded by ||gt ||2
define || middot ||ψt =radic〈middotHt middot〉
ψt(w) = 12w
THtw is strongly convex wrt || middot ||ψt
a little math can show thatTsumt=1
||gt ||2ψtlowast le 2Dsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 17 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Advanced adaptive learning rates
adaGrad [Duchi et al 2010] (cont)
Theorem (Thm 5 in the paper)
Assume sequence wt is generated by adaGrad using primal-dualupdate (8) with δ ge maxt ||gt ||infin then for any u isin S
RT (u) le δ
η||u||22 +
1
η||u||2infin
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
For wt generated using composite mirror descent update (9)
RT (u) le 1
2ηmaxtleT||uminuswt ||22
Dsumd=1
||g1T d ||2 + ηDsum
d=1
||g1T d ||2
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 18 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Advanced adaptive learning rates
Improving adaGrad
possible problems of adaGrad
global learning rate η still need hand tuningsensitive to initial conditions
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
existing methods
RMSprop [Tieleman and Hinton 2012]adaDelta [Zeiler 2012]ADAM [Kingma and Ba 2014]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 19 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Advanced adaptive learning rates
Improving adaGrad (cont)
RMSprop
Gt = ρGtminus1 + (1minus ρ)gtgTt Ht = (δI + diag(Gt))12
wt+1 = argminw
η〈w gt〉+1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusηgt
RMS[g]t RMS[g]t = diag(Ht) (11)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 20 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Advanced adaptive learning rates
Improving adaGrad (cont)
∆w prop Hminus1g prop partf partw
part2f partw2
rArr part2f partw2 prop partf partw
∆w
rArr diag(H) asymp RMS[g]tRMS[∆w]tminus1
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Advanced adaptive learning rates
Improving adaGrad (cont)
adaDelta
diag(Ht) =RMS[g]t
RMS[∆w]tminus1
wt+1 = argminw〈w gt〉+
1
2(w minuswt)
THt(w minuswt)
rArr wt+1 = wt minusRMS[∆w]tminus1RMS[g]t
gt (12)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 21 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochasticapproximation and optimisation methods
Theoretical analysis guarantees good behaviour
Adaptive learning rates provides good performance in practice
Future work will reduce labour on tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 27
Advanced adaptive learning rates
References
Bergstra J Bengio Y Random search for hyper-parameteroptimization The Journal of Machine Learning Research 201213(1) 281-305S Becker and Y LeCun Improving the convergence ofback-propagation learning with second order methods Tech RepDepartment of Computer Science University of Toronto TorontoON Canada 1988Duchi J Hazan E Singer Y Adaptive subgradient methods for onlinelearning and stochastic optimization The Journal of MachineLearning Research 2011 12 2121-2159Zeiler M D ADADELTA An adaptive learning rate method arXivpreprint arXiv12125701 2012
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 27
Advanced adaptive learning rates
References
Tieleman T Hinton G Lecture 65-rmsprop Divide the gradient by arunning average of its recent magnitude COURSERA NeuralNetworks for Machine Learning 2012 4Kingma D Ba J Adam A method for stochastic optimization arXivpreprint arXiv14126980 2014Dauphin Y N de Vries H Chung J et al RMSProp and equilibratedadaptive learning rates for non-convex optimization arXiv preprintarXiv150204390 2015Zinkevich M Online convex programming and generalizedinfinitesimal gradient ascent ICML 2003
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 27
Advanced adaptive learning rates
Reference
Shalev-Shwartz S Online learning and online convex optimizationFoundations and Trends in Machine Learning 2011 4(2) 107-194Maclaurin D Duvenaud D Adams R P Gradient-basedHyperparameter Optimization through Reversible Learning arXivpreprint arXiv150203492 2015Masse P Y Ollivier Y Speed learning on the fly arXiv preprintarXiv151102540 2015
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 27
Advanced adaptive learning rates
Improving adaGrad (cont)
ADAM
mt = β1mtminus1 + (1minus β1)gt vt = β2vtminus1 + (1minus β2)g2t
mt =mt
1minus βt1
vt =vt
1minus βt2
wt+1 = wt minusηmt
vt + δ (13)
possible improving directions
use truncated sum running averageuse (approximate) second-order informationadd momentumcorrect the bias of running average
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 22 27
Advanced adaptive learning rates
Demo time
wikipedia page ldquotest functions for optimizationrdquohttpsenwikipediaorgwikiTest_functions_for_optimization
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 23 27
Advanced adaptive learning rates
Learn the learning rates
idea 3 learn the (global) learning rates as well
wt+1 = wt minus ηgt rArr wt = w(t ηw1 giiletminus1)
wt is a function of η
learn η (with gradient descent) by
η = argminsumslet
fs(w(s η))
Reverse-mode differentiation to back-track the learning process[Maclaurin et al 2015]
stochastic gradient descent to learn η on the fly [Masse andOllivier 2015]
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 24 27
Advanced adaptive learning rates
Summary
Recent successes of machine learning rely on stochastic approximation and optimisation methods
Theoretical analysis provides guarantees of good behaviour
Adaptive learning rates provide good performance in practice
Future work will reduce the labour of tuning hyper-parameters
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 25 of 27
Advanced adaptive learning rates
References
Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(1): 281-305, 2012.
Becker, S. and LeCun, Y. Improving the convergence of back-propagation learning with second order methods. Tech. Rep., Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1988.
Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12: 2121-2159, 2011.
Zeiler, M. D. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 26 of 27
Advanced adaptive learning rates
References
Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
Kingma, D. and Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Dauphin, Y. N., de Vries, H., Chung, J., et al. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. ICML, 2003.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 27 of 27
Advanced adaptive learning rates
References
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2): 107-194, 2011.
Maclaurin, D., Duvenaud, D. and Adams, R. P. Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492, 2015.
Masse, P.-Y. and Ollivier, Y. Speed learning on the fly. arXiv preprint arXiv:1511.02540, 2015.
Yingzhen Li (University of Cambridge) Adaptive learning rates Nov 26 2015 28 of 27