
Adam: A Method for Stochastic Optimization
Diederik P. Kingma and Jimmy Lei Ba

Nadav Cohen

The Hebrew University of Jerusalem

Advanced Seminar in Deep Learning (#67679)

October 18, 2015


Optimization in Deep Learning

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion


Optimization in Deep Learning

Loss Minimization

Loss minimization problem:

$$\min_W \;\Big\{\, L(W) := \frac{1}{m}\sum_{i=1}^m \ell(W; x_i, y_i) + \lambda\, r(W) \,\Big\}$$

$\{(x_i, y_i)\}_{i=1}^m$ – training instances ($x_i$) and corresponding labels ($y_i$)

$W$ – network parameters to learn

$\ell(W; x_i, y_i)$ – loss of the network parameterized by $W$ w.r.t. $(x_i, y_i)$

$r(W)$ – regularization function (e.g. $\|W\|_2^2$)

$\lambda > 0$ – regularization weight


Optimization in Deep Learning

Large-Scale ⟶ First-Order Stochastic Methods

Large-scale setting:

Many network parameters (e.g. $\dim(W) \sim 10^8$) ⟹ computing the Hessian (second-order derivatives) is expensive

Many training examples (e.g. $m \sim 10^6$) ⟹ computing the full objective at every iteration is expensive

Optimization methods must be:

First-order – update based on objective value and gradient only

Stochastic – update based on subset of training examples:

$$L_t(W) := \frac{1}{b}\sum_{j=1}^b \ell(W; x_{i_j}, y_{i_j}) + \lambda\, r(W)$$

$\{(x_{i_j}, y_{i_j})\}_{j=1}^b$ – random mini-batch chosen at iteration $t$
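As a concrete illustration of the mini-batch objective above, here is a minimal NumPy sketch of a stochastic gradient computation. The squared loss of a linear model and the $\|W\|_2^2$ regularizer are placeholder choices for this sketch only (the slides do not fix a particular loss); `batch_size`, `lam`, and `rng` are illustrative parameters.

```python
import numpy as np

def minibatch_gradient(W, X, y, batch_size, lam, rng):
    """Gradient of L_t(W) = (1/b) sum_j loss(W; x_ij, y_ij) + lam * r(W)
    on a freshly sampled mini-batch; squared loss and r(W) = ||W||_2^2 as placeholders."""
    idx = rng.choice(len(X), size=batch_size, replace=False)   # random mini-batch at iteration t
    Xb, yb = X[idx], y[idx]
    residual = Xb @ W - yb                                     # per-example prediction errors
    grad_loss = (2.0 / batch_size) * (Xb.T @ residual)         # gradient of the averaged loss
    grad_reg = 2.0 * lam * W                                   # gradient of lam * ||W||_2^2
    return grad_loss + grad_reg

# Example: g = minibatch_gradient(np.zeros(d), X, y, 128, 1e-4, np.random.default_rng(0))
```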


Optimization in Deep Learning

Stochastic Gradient Descent (SGD)

Update rule:

$$V_t = \mu V_{t-1} - \alpha\, \nabla L_t(W_{t-1})$$
$$W_t = W_{t-1} + V_t$$

α > 0 – learning rate (typical choices: 0.01, 0.1)

µ ∈ [0, 1) – momentum (typical choices: 0.9, 0.95, 0.99)

Momentum smooths updates, enhancing stability and speed.
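A minimal sketch of this update; `grad` stands for $\nabla L_t(W_{t-1})$ computed on the current mini-batch and is assumed to be supplied by the caller:

```python
def sgd_momentum_step(W, V, grad, lr=0.01, momentum=0.9):
    """One SGD-with-momentum step: V_t = mu*V_{t-1} - alpha*grad, W_t = W_{t-1} + V_t."""
    V = momentum * V - lr * grad
    W = W + V
    return W, V
```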


Optimization in Deep Learning

Nesterov’s Accelerated Gradient (NAG)

Update rule:

$$V_t = \mu V_{t-1} - \alpha\, \nabla L_t(W_{t-1} + \mu V_{t-1})$$
$$W_t = W_{t-1} + V_t$$

The only difference from SGD is the partial update ($+\mu V_{t-1}$) inside the gradient computation. This may increase stability and speed in ill-conditioned problems.¹

1See "On the Importance of Initialization and Momentum in Deep Learning" bySutskever et al.


Optimization in Deep Learning

Adaptive Gradient (AdaGrad)

Update rule:

$$W_t = W_{t-1} - \alpha\, \frac{\nabla L_t(W_{t-1})}{\sqrt{\sum_{t'=1}^{t} \nabla L_{t'}(W_{t'-1})^2}}$$

Learning rate adapted per coordinate:
Highly varying coordinate ⟶ suppress
Rarely varying coordinate ⟶ enhance

Disadvantage in non-stationary settings:
All gradients (recent and old) weighted equally
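A minimal sketch of the AdaGrad update above; the small `eps` in the denominator is an addition for numerical safety and does not appear in the slide's formula:

```python
import numpy as np

def adagrad_step(W, G, grad, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate squared gradients; each coordinate's step shrinks with its history."""
    G = G + grad ** 2                        # sum of squared gradients, per coordinate
    W = W - lr * grad / (np.sqrt(G) + eps)
    return W, G
```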


Optimization in Deep Learning

Root Mean Square Propagation (RMSProp)

Update rule:

$$R_t = \gamma R_{t-1} + (1-\gamma)\, \nabla L_t(W_{t-1})^2$$
$$W_t = W_{t-1} - \alpha\, \frac{\nabla L_t(W_{t-1})}{\sqrt{R_t}}$$

Similar to AdaGrad, but with an exponential moving average controlled by $\gamma \in [0, 1)$ (smaller $\gamma$ ⟹ more emphasis on recent gradients).

May be combined with NAG:

$$R_t = \gamma R_{t-1} + (1-\gamma)\, \nabla L_t(W_{t-1} + \mu V_{t-1})^2$$
$$V_t = \mu V_{t-1} - \frac{\alpha}{\sqrt{R_t}}\, \nabla L_t(W_{t-1} + \mu V_{t-1})$$
$$W_t = W_{t-1} + V_t$$
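A sketch of the plain RMSProp update (the first form above, without NAG); as in the AdaGrad sketch, `eps` is added for numerical safety and is not part of the slide's formula:

```python
import numpy as np

def rmsprop_step(W, R, grad, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients rescales each coordinate."""
    R = gamma * R + (1.0 - gamma) * grad ** 2
    W = W - lr * grad / (np.sqrt(R) + eps)
    return W, R
```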


Adaptive Moment Estimation (Adam)

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion


Adaptive Moment Estimation (Adam)

Rationale

Motivation

Combine the advantages of:

AdaGrad – works well with sparse gradients

RMSProp – works well in non-stationary settings

Idea

Maintain exponential moving averages of the gradient and its square

Update proportional to $\dfrac{\text{average gradient}}{\sqrt{\text{average squared gradient}}}$


Adaptive Moment Estimation (Adam)

Algorithm: Adam

$M_0 = 0,\; R_0 = 0$ (Initialization)
For $t = 1, \dots, T$:
  $M_t = \beta_1 M_{t-1} + (1-\beta_1)\, \nabla L_t(W_{t-1})$ (1st moment estimate)
  $R_t = \beta_2 R_{t-1} + (1-\beta_2)\, \nabla L_t(W_{t-1})^2$ (2nd moment estimate)
  $\hat M_t = M_t / \big(1-(\beta_1)^t\big)$ (1st moment bias correction)
  $\hat R_t = R_t / \big(1-(\beta_2)^t\big)$ (2nd moment bias correction)
  $W_t = W_{t-1} - \alpha\, \hat M_t / \sqrt{\hat R_t + \varepsilon}$ (Update)
Return $W_T$

Hyper-parameters:
  $\alpha > 0$ – learning rate (typical choice: 0.001)
  $\beta_1 \in [0, 1)$ – 1st moment decay rate (typical choice: 0.9)
  $\beta_2 \in [0, 1)$ – 2nd moment decay rate (typical choice: 0.999)
  $\varepsilon > 0$ – numerical term (typical choice: $10^{-8}$)
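A minimal NumPy sketch of one Adam iteration, following the update as written on this slide (with $\varepsilon$ inside the square root; the Kingma & Ba paper places it outside, as $\sqrt{\hat R_t} + \varepsilon$, which makes a negligible difference in practice). $M$ and $R$ start as zero arrays of the same shape as $W$, and $t$ starts at 1:

```python
import numpy as np

def adam_step(W, M, R, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: biased moment estimates, bias correction, rescaled update."""
    M = beta1 * M + (1.0 - beta1) * grad          # 1st moment estimate
    R = beta2 * R + (1.0 - beta2) * grad ** 2     # 2nd moment estimate
    M_hat = M / (1.0 - beta1 ** t)                # bias correction (t = 1, 2, ...)
    R_hat = R / (1.0 - beta2 ** t)
    W = W - lr * M_hat / np.sqrt(R_hat + eps)     # update; eps inside the sqrt as on the slide
    return W, M, R
```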


Adaptive Moment Estimation (Adam)

Parameter Updates

Adam’s step at iteration t (assuming ε = 0):

$$\Delta_t = -\alpha\, \frac{\hat M_t}{\sqrt{\hat R_t}}$$

Properties:

Scale-invariance:

$$L(W) \to c \cdot L(W) \;\Longrightarrow\; \hat M_t \to c \cdot \hat M_t \;\wedge\; \hat R_t \to c^2 \cdot \hat R_t \;\Longrightarrow\; \Delta_t \text{ does not change}$$

Bounded norm:

$$\|\Delta_t\|_\infty \le \begin{cases} \alpha \cdot (1-\beta_1)/\sqrt{1-\beta_2}, & (1-\beta_1) > \sqrt{1-\beta_2} \\ \alpha, & \text{otherwise} \end{cases}$$
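A quick numerical check of the scale-invariance property, on an arbitrary fixed gradient sequence and with $\varepsilon = 0$ so that the identity holds exactly (an illustrative check, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = [rng.standard_normal(5) for _ in range(10)]   # a fixed gradient sequence

def adam_deltas(c, lr=0.001, beta1=0.9, beta2=0.999):
    """Adam's steps (eps = 0) when every gradient is scaled by c."""
    M, R, deltas = np.zeros(5), np.zeros(5), []
    for t, g in enumerate(grads, start=1):
        g = c * g
        M = beta1 * M + (1 - beta1) * g
        R = beta2 * R + (1 - beta2) * g ** 2
        deltas.append(-lr * (M / (1 - beta1 ** t)) / np.sqrt(R / (1 - beta2 ** t)))
    return deltas

# Scaling L (hence all gradients) by c leaves every step unchanged:
assert np.allclose(adam_deltas(1.0), adam_deltas(100.0))
```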


Adaptive Moment Estimation (Adam)

Bias Correction

Taking into account the initialization M0 = 0, we have:

$$M_t = \beta_1 M_{t-1} + (1-\beta_1)\, \nabla L_t(W_{t-1}) = \sum_{i=1}^{t} (1-\beta_1)(\beta_1)^{t-i}\, \nabla L_i(W_{i-1})$$

Since $\sum_{i=1}^{t}(1-\beta_1)(\beta_1)^{t-i} = 1-(\beta_1)^t$, to obtain an unbiased estimate we divide by $1-(\beta_1)^t$:

$$\hat M_t = M_t / \big(1-(\beta_1)^t\big)$$

An analogous argument yields the bias correction of $R_t$.
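A tiny numeric illustration of why the correction matters: with a constant "gradient" of 1 (chosen purely for illustration) and $\beta_1 = 0.9$, the raw average $M_t$ starts near 0.1 and approaches 1 only slowly, while the corrected $\hat M_t$ equals 1 at every step:

```python
beta1, g, M = 0.9, 1.0, 0.0
for t in range(1, 6):
    M = beta1 * M + (1 - beta1) * g        # biased estimate, initialized at zero
    M_hat = M / (1 - beta1 ** t)           # bias-corrected estimate
    print(t, round(M, 4), round(M_hat, 4)) # M: 0.1, 0.19, 0.271, ...; M_hat: 1.0 every time
```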


Convergence Analysis

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion


Convergence Analysis

Convergence in Online Convex Regime

Regret at iteration T :

$$R(T) := \sum_{t=1}^{T} \big[L_t(W_t) - L_t(W^*)\big], \qquad \text{where } W^* := \operatorname*{argmin}_W \sum_{t=1}^{T} L_t(W)$$

In the convex regime, Adam attains a regret bound comparable to the best known:

Theorem
If all batch objectives $L_t(W)$ are convex and have bounded gradients, and all points $W_t$ generated by Adam are within bounded distance from each other, then for every $T \in \mathbb{N}$:

$$\frac{R(T)}{T} = O\!\left(\frac{1}{\sqrt{T}}\right)$$


Relations to Existing Algorithms

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion


Relations to Existing Algorithms

Relation to SGD

Setting β2 = 1, Adam’s update rule may be written as:

$$V_t = \mu_t V_{t-1} - \eta_t\, \nabla L_t(W_{t-1})$$
$$W_t = W_{t-1} + V_t$$

where:

$$\mu_t := \frac{\beta_1\big(1-(\beta_1)^{t-1}\big)}{1-(\beta_1)^t}, \qquad \eta_t := \frac{\alpha(1-\beta_1)}{\big(1-(\beta_1)^t\big)\sqrt{\varepsilon}}$$

Conclusion
Disabling 2nd moment estimation ($\beta_2 = 1$) reduces Adam to SGD with:

Learning rate descending towards $\alpha(1-\beta_1)/\sqrt{\varepsilon}$

Momentum ascending towards $\beta_1$
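A short numerical check of this correspondence on a fixed gradient sequence (illustrative only): Adam with $\beta_2 = 1$ (so $R$ stays 0 and the denominator is $\sqrt{\varepsilon}$) produces exactly the same iterates as SGD with the time-varying momentum $\mu_t$ and learning rate $\eta_t$ given above:

```python
import numpy as np

rng = np.random.default_rng(1)
grads = [rng.standard_normal(3) for _ in range(20)]
alpha, beta1, eps = 0.01, 0.9, 1e-8

# Adam with beta2 = 1: R_t stays 0, so each step is -alpha * M_hat_t / sqrt(eps).
W_adam, M = np.zeros(3), np.zeros(3)
for t, g in enumerate(grads, start=1):
    M = beta1 * M + (1 - beta1) * g
    W_adam = W_adam - alpha * (M / (1 - beta1 ** t)) / np.sqrt(eps)

# SGD with the time-varying momentum mu_t and learning rate eta_t from this slide.
W_sgd, V = np.zeros(3), np.zeros(3)
for t, g in enumerate(grads, start=1):
    mu = beta1 * (1 - beta1 ** (t - 1)) / (1 - beta1 ** t)
    eta = alpha * (1 - beta1) / ((1 - beta1 ** t) * np.sqrt(eps))
    V = mu * V - eta * g
    W_sgd = W_sgd + V

assert np.allclose(W_adam, W_sgd)   # identical parameter trajectories
```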


Relations to Existing Algorithms

Relation to AdaGrad

Setting $\beta_1 = 0$, letting $\beta_2 \to 1^-$ (and assuming $\varepsilon = 0$), and annealing the learning rate as $\alpha_t = \alpha \cdot t^{-1/2}$, Adam's update rule may be written as:

$$W_t = W_{t-1} - \alpha \cdot t^{-1/2}\, \frac{\nabla L_t(W_{t-1})}{\sqrt{t^{-1}\sum_{i=1}^{t} \nabla L_i(W_{i-1})^2}} = W_{t-1} - \alpha\, \frac{\nabla L_t(W_{t-1})}{\sqrt{\sum_{i=1}^{t} \nabla L_i(W_{i-1})^2}}$$

Conclusion
In the limit $\beta_2 \to 1^-$, with $\beta_1 = 0$ and learning rate annealed as $\alpha \cdot t^{-1/2}$, Adam reduces exactly to AdaGrad.


Relations to Existing Algorithms

Relation to RMSProp

RMSProp with momentum is the method most closely related to Adam.

Main differences:

RMSProp rescales the gradient and then applies momentum; Adam first applies momentum (a moving average) and then rescales.

RMSProp lacks bias correction, often leading to large step sizes in the early stages of a run (especially when $\beta_2$ is close to 1).


Experiments

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion


Experiments

Logistic Regression on MNIST

L2-regularized logistic regression applied directly to image pixels.


Experiments

Logistic Regression on IMDB

Dropout-regularized logistic regression applied to sparse bag-of-words features.


Experiments

Multi-Layer Neural Networks (Fully-Connected) on MNIST

2 hidden layers, 1000 units each, ReLU activation, dropout regularization.


Experiments

Convolutional Neural Networks on CIFAR-10

Network architecture:
conv-5x5@64 → pool-3x3(stride-2) → conv-5x5@64 → pool-3x3(stride-2) → conv-5x5@128 → pool-3x3(stride-2) → dense@1000 → dense@10


Experiments

Bias Correction Term on Variational Auto-Encoder

Training a variational auto-encoder (single hidden layer network) with (red) and without (green) bias correction, for different values of $\beta_1$, $\beta_2$, $\alpha$.


AdaMax

Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion


AdaMax

Lp Generalization

Adam may be generalized by replacing the gradient's L2 norm with an Lp norm:

Adam – Lp Generalization

$M_0 = 0,\; R_0 = 0$ (Initialization)
For $t = 1, \dots, T$:
  $M_t = \beta_1 M_{t-1} + (1-\beta_1)\, \nabla L_t(W_{t-1})$ (1st moment estimate)
  $R_t = (\beta_2)^p R_{t-1} + \big(1-(\beta_2)^p\big)\, |\nabla L_t(W_{t-1})|^p$ (p-th moment estimate)
  $\hat M_t = M_t / \big(1-(\beta_1)^t\big)$ (1st moment bias correction)
  $\hat R_t = R_t / \big(1-(\beta_2)^{pt}\big)$ (p-th moment bias correction)
  $W_t = W_{t-1} - \alpha\, \hat M_t / (\hat R_t + \varepsilon)^{1/p}$ (Update)
Return $W_T$

($\beta_2$ re-parameterized as $(\beta_2)^p$ for convenience)


AdaMax

p → ∞ ⟹ AdaMax

When $p \to \infty$, $\|\cdot\|_p \to \max\{\cdot\}$ and we get:

AdaMax

$M_0 = 0,\; U_0 = 0$ (Initialization)
For $t = 1, \dots, T$:
  $M_t = \beta_1 M_{t-1} + (1-\beta_1)\, \nabla L_t(W_{t-1})$ (1st moment estimate)
  $U_t = \max\{\beta_2 U_{t-1},\, |\nabla L_t(W_{t-1})|\}$ ("∞" moment estimate)
  $\hat M_t = M_t / \big(1-(\beta_1)^t\big)$ (1st moment bias correction)
  $W_t = W_{t-1} - \alpha\, \hat M_t / U_t$ (Update)
Return $W_T$

In AdaMax the step size is always bounded by $\alpha$:

$$\|\Delta_t\|_\infty \le \alpha$$
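A minimal sketch of one AdaMax iteration as written above. The default α = 0.002 follows the paper's suggested setting; note that $U_t$ sits in the denominator without an ε, matching the slide (a tiny constant could be added if a gradient coordinate stays exactly zero):

```python
import numpy as np

def adamax_step(W, M, U, grad, t, lr=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax step; U tracks an exponentially weighted infinity norm of past gradients."""
    M = beta1 * M + (1.0 - beta1) * grad       # 1st moment estimate
    U = np.maximum(beta2 * U, np.abs(grad))    # "infinity" moment estimate
    M_hat = M / (1.0 - beta1 ** t)             # only the 1st moment needs bias correction
    W = W - lr * M_hat / U                     # update (the slide notes ||Delta_t||_inf <= alpha)
    return W, M, U
```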


Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion


Conclusion

Adam:

Efficient first-order stochastic optimization method

Combines the advantages of:
  AdaGrad – works well with sparse gradients
  RMSProp – deals with non-stationary objectives

Parameter updates:
  Have bounded norm
  Are scale-invariant

Widely used in the deep learning community (e.g. at Google DeepMind)


Outline

1 Optimization in Deep Learning

2 Adaptive Moment Estimation (Adam)

3 Convergence Analysis

4 Relations to Existing Algorithms

5 Experiments

6 AdaMax

7 Conclusion


Thank You
