Adam: A Method for Stochastic Optimization
Diederik P. Kingma and Jimmy Lei Ba
Nadav Cohen
The Hebrew University of Jerusalem
Advanced Seminar in Deep Learning (#67679)
October 18, 2015
Optimization in Deep Learning
Outline
1 Optimization in Deep Learning
2 Adaptive Moment Estimation (Adam)
3 Convergence Analysis
4 Relations to Existing Algorithms
5 Experiments
6 AdaMax
7 Conclusion
Optimization in Deep Learning
Loss Minimization
Loss minimization problem:
min_W { L(W) := (1/m) · ∑_{i=1}^{m} ℓ(W; xi, yi) + λ·r(W) }
{(xi, yi)}, i = 1, …, m – training instances (xi) and corresponding labels (yi)
W – network parameters to learn
ℓ(W; xi, yi) – loss of the network parameterized by W w.r.t. (xi, yi)
r(W) – regularization function (e.g. ‖W‖₂²)
λ > 0 – regularization weight
Optimization in Deep Learning
Large-Scale −→ First-Order Stochastic Methods
Large-scale setting:
Many network parameters (e.g. dim(W) ∼ 10⁸) =⇒ computing the Hessian (second-order derivatives) is expensive
Many training examples (e.g. m ∼ 10⁶) =⇒ computing the full objective at every iteration is expensive
Optimization methods must be:
First-order – update based on objective value and gradient only
Stochastic – update based on subset of training examples:
Lt(W) := (1/b) · ∑_{j=1}^{b} ℓ(W; xij, yij) + λ·r(W)
{(xij, yij)}, j = 1, …, b – random mini-batch chosen at iteration t
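As a concrete illustration (not from the slides), here is a minimal NumPy sketch of forming Lt(W) and its gradient on a random mini-batch; the ridge-regularized least-squares loss and all names are illustrative assumptions.

```python
import numpy as np

def stochastic_loss_and_grad(W, X, y, batch_size, lam, rng):
    """Evaluate L_t(W) and its gradient on a random mini-batch.
    The per-example loss here is squared error (an illustrative choice)."""
    idx = rng.choice(len(X), size=batch_size, replace=False)  # mini-batch indices
    Xb, yb = X[idx], y[idx]
    residual = Xb @ W - yb
    loss = 0.5 * np.mean(residual ** 2) + lam * np.sum(W ** 2)  # (1/b) sum of losses + lambda * r(W)
    grad = Xb.T @ residual / batch_size + 2 * lam * W
    return loss, grad

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.normal(size=1000)
loss, grad = stochastic_loss_and_grad(np.zeros(20), X, y, batch_size=64, lam=1e-3, rng=rng)
```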
Optimization in Deep Learning
Stochastic Gradient Descent (SGD)
Update rule:
Vt = µVt−1 − α∇Lt(Wt−1)
Wt = Wt−1 + Vt
α > 0 – learning rate (typical choices: 0.01, 0.1)
µ ∈ [0, 1) – momentum (typical choices: 0.9, 0.95, 0.99)
Momentum smooths updates, enhancing stability and speed.
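A minimal NumPy sketch of one SGD-with-momentum step under the slide's sign convention; the gradient is assumed to come from the current mini-batch objective Lt.

```python
import numpy as np

def sgd_momentum_step(W, V, grad, alpha=0.01, mu=0.9):
    """V_t = mu * V_{t-1} - alpha * grad,  W_t = W_{t-1} + V_t."""
    V = mu * V - alpha * grad
    W = W + V
    return W, V

# toy usage on the quadratic 0.5 * ||W||^2, whose gradient at W is simply W
W, V = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    W, V = sgd_momentum_step(W, V, grad=W)
```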
Optimization in Deep Learning
Nesterov’s Accelerated Gradient (NAG)
Update rule:
Vt = µVt−1 − α∇Lt(Wt−1+µVt−1)
Wt = Wt−1 + Vt
The only difference from SGD is the partial update (+µVt−1) in the gradient computation. May increase stability and speed in ill-conditioned problems¹.
¹See "On the Importance of Initialization and Momentum in Deep Learning" by Sutskever et al.
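A matching hedged sketch of NAG; note the gradient is taken at the look-ahead point Wt−1 + µVt−1.

```python
import numpy as np

def nag_step(W, V, grad_fn, alpha=0.01, mu=0.9):
    """NAG: evaluate the gradient at the partially updated point W + mu * V."""
    g = grad_fn(W + mu * V)      # look-ahead gradient
    V = mu * V - alpha * g
    W = W + V
    return W, V

# toy usage on the quadratic 0.5 * ||W||^2 (gradient of the objective is W itself)
W, V = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    W, V = nag_step(W, V, grad_fn=lambda w: w)
```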
Optimization in Deep Learning
Adaptive Gradient (AdaGrad)
Update rule:
Wt = Wt−1 − α · ∇Lt(Wt−1) / √( ∑_{t′=1}^{t} ∇Lt′(Wt′−1)² )
Learning rate adapted per coordinate:
Highly varying coordinate −→ suppress
Rarely varying coordinate −→ enhance
Disadvantage in non-stationary settings: all gradients (recent and old) are weighted equally
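A hedged NumPy sketch of one AdaGrad step; the small eps added to the denominator is an assumption to avoid division by zero and is not part of the slide's formula.

```python
import numpy as np

def adagrad_step(W, G, grad, alpha=0.01, eps=1e-8):
    """G accumulates squared gradients over all past iterations, so each
    coordinate's effective learning rate can only shrink over time."""
    G = G + grad ** 2
    W = W - alpha * grad / (np.sqrt(G) + eps)
    return W, G

W, G = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    W, G = adagrad_step(W, G, grad=W)
```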
Optimization in Deep Learning
Root Mean Square Propagation (RMSProp)
Update rule:
Rt = γ·Rt−1 + (1 − γ)·∇Lt(Wt−1)²
Wt = Wt−1 − α · ∇Lt(Wt−1) / √Rt
Similar to AdaGrad but with an exponential moving average controlled by γ ∈ [0, 1) (smaller γ =⇒ more emphasis on recent gradients).
May be combined with NAG:
Rt = γ·Rt−1 + (1 − γ)·∇Lt(Wt−1 + µVt−1)²
Vt = µVt−1 − (α/√Rt)·∇Lt(Wt−1 + µVt−1)
Wt = Wt−1 + Vt
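A corresponding sketch of plain RMSProp (without the NAG combination); eps is again an added numerical guard, not part of the slide's formula.

```python
import numpy as np

def rmsprop_step(W, R, grad, alpha=0.001, gamma=0.9, eps=1e-8):
    """An exponential moving average of squared gradients replaces AdaGrad's
    full sum, so old gradients are gradually forgotten."""
    R = gamma * R + (1 - gamma) * grad ** 2
    W = W - alpha * grad / (np.sqrt(R) + eps)
    return W, R

W, R = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    W, R = rmsprop_step(W, R, grad=W)
```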
Adaptive Moment Estimation (Adam)
Outline
1 Optimization in Deep Learning
2 Adaptive Moment Estimation (Adam)
3 Convergence Analysis
4 Relations to Existing Algorithms
5 Experiments
6 AdaMax
7 Conclusion
Adaptive Moment Estimation (Adam)
Rationale
Motivation
Combine the advantages of:
AdaGrad – works well with sparse gradients
RMSProp – works well in non-stationary settings
Idea
Maintain exponential moving averages of gradient and its square
Update proportional to (average gradient) / √(average squared gradient)
Adaptive Moment Estimation (Adam)
Algorithm: Adam
M0 = 0, R0 = 0 (Initialization)
For t = 1, …, T:
  Mt = β1·Mt−1 + (1 − β1)·∇Lt(Wt−1)   (1st moment estimate)
  Rt = β2·Rt−1 + (1 − β2)·∇Lt(Wt−1)²   (2nd moment estimate)
  M̂t = Mt / (1 − (β1)^t)   (1st moment bias correction)
  R̂t = Rt / (1 − (β2)^t)   (2nd moment bias correction)
  Wt = Wt−1 − α · M̂t / √(R̂t + ε)   (Update)
Return WT
Hyper-parameters:
α > 0 – learning rate (typical choice: 0.001)
β1 ∈ [0, 1) – 1st moment decay rate (typical choice: 0.9)
β2 ∈ [0, 1) – 2nd moment decay rate (typical choice: 0.999)
ε > 0 – numerical term (typical choice: 10⁻⁸)
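A minimal NumPy sketch of one Adam iteration as written on this slide (with ε inside the square root; the published pseudocode adds it outside, which behaves almost identically). The toy quadratic gradient is only for illustration.

```python
import numpy as np

def adam_step(W, M, R, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    M = beta1 * M + (1 - beta1) * grad          # 1st moment estimate
    R = beta2 * R + (1 - beta2) * grad ** 2     # 2nd moment estimate
    M_hat = M / (1 - beta1 ** t)                # 1st moment bias correction
    R_hat = R / (1 - beta2 ** t)                # 2nd moment bias correction
    W = W - alpha * M_hat / np.sqrt(R_hat + eps)
    return W, M, R

# toy usage on the quadratic 0.5 * ||W||^2 (gradient of the objective is W itself)
W = np.array([5.0, -3.0])
M, R = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 2001):
    W, M, R = adam_step(W, M, R, grad=W, t=t)
```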
Adaptive Moment Estimation (Adam)
Parameter Updates
Adam’s step at iteration t (assuming ε = 0):
∆t = −α · M̂t / √R̂t
Properties:
Scale-invariance:
L(W) → c·L(W) (for c > 0) =⇒ M̂t → c·M̂t and R̂t → c²·R̂t =⇒ ∆t does not change
Bounded norm:
‖∆t‖∞ ≤ α·(1 − β1)/√(1 − β2) if (1 − β1) > √(1 − β2), and ‖∆t‖∞ ≤ α otherwise
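A quick numeric check of the scale-invariance property (a sketch, not from the slides): rescaling every gradient by c > 0, as happens when the loss is rescaled, leaves the step unchanged when ε = 0.

```python
import numpy as np

def adam_delta(grads, alpha=0.001, beta1=0.9, beta2=0.999):
    """Adam's step Delta_t after the gradient sequence `grads`, with eps = 0."""
    M = R = 0.0
    for t, g in enumerate(grads, start=1):
        M = beta1 * M + (1 - beta1) * g
        R = beta2 * R + (1 - beta2) * g ** 2
    return -alpha * (M / (1 - beta1 ** t)) / np.sqrt(R / (1 - beta2 ** t))

g = np.random.default_rng(0).normal(size=10)
c = 7.3  # rescaling the loss by c rescales every gradient by c
print(np.allclose(adam_delta(g), adam_delta(c * g)))  # True: the step does not change
```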
Adaptive Moment Estimation (Adam)
Bias Correction
Taking into account the initialization M0 = 0, we have:
Mt = β1·Mt−1 + (1 − β1)·∇Lt(Wt−1)
   = ∑_{i=1}^{t} (1 − β1)·(β1)^(t−i) · ∇Li(Wi−1)
Since ∑_{i=1}^{t} (1 − β1)·(β1)^(t−i) = 1 − (β1)^t, to obtain an unbiased estimate we divide by 1 − (β1)^t:
M̂t = Mt / (1 − (β1)^t)
An analogous argument derives the bias correction of Rt .
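A tiny numeric illustration (an added sketch): with a constant gradient and β2 = 0.999, the raw moving average Rt stays close to its zero initialization for many iterations, while the bias-corrected estimate recovers the true second moment immediately.

```python
beta2 = 0.999
g = 1.0            # constant gradient, so the true second moment is g**2 = 1
R = 0.0            # zero initialization, as in the algorithm
for t in range(1, 11):
    R = beta2 * R + (1 - beta2) * g ** 2
    print(t, round(R, 5), round(R / (1 - beta2 ** t), 5))   # raw vs. bias-corrected
# after 10 steps the raw estimate is ~0.01 while the corrected one is exactly 1.0
```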
Convergence Analysis
Outline
1 Optimization in Deep Learning
2 Adaptive Moment Estimation (Adam)
3 Convergence Analysis
4 Relations to Existing Algorithms
5 Experiments
6 AdaMax
7 Conclusion
Convergence Analysis
Convergence in Online Convex Regime
Regret at iteration T :
R(T) := ∑_{t=1}^{T} [ Lt(Wt) − Lt(W∗) ]
where:
W∗ := argmin_W ∑_{t=1}^{T} Lt(W)
In convex regime, Adam gives regret bound comparable to best known:
Theorem. If all batch objectives Lt(W) are convex and have bounded gradients, and all points Wt generated by Adam are within bounded distance from each other, then for every T ∈ ℕ:
R(T)/T = O(1/√T)
Relations to Existing Algorithms
Outline
1 Optimization in Deep Learning
2 Adaptive Moment Estimation (Adam)
3 Convergence Analysis
4 Relations to Existing Algorithms
5 Experiments
6 AdaMax
7 Conclusion
Relations to Existing Algorithms
Relation to SGD
Setting β2 = 1, Adam’s update rule may be written as:
Mt = µMt−1 − η∇Lt(Wt−1)
Wt = Wt−1 + Mt
where:
µ := β1·(1 − (β1)^(t−1)) / (1 − (β1)^t)
η := α·(1 − β1) / ( (1 − (β1)^t)·√ε )
Conclusion: Disabling 2nd moment estimation (β2 = 1) reduces Adam to SGD with:
Learning rate descending towards α·(1 − β1)/√ε
Momentum ascending towards β1
Relations to Existing Algorithms
Relation to AdaGrad
Setting β1 = 0 and β2 → 1− (and assuming ε = 0), Adam's update rule may be written as:
Wt = Wt−1 − α · ∇Lt(Wt−1) / ( t^(1/2) · √( ∑_{i=1}^{t} ∇Li(Wi−1)² ) )
Conclusion: In the limit β2 → 1−, with β1 = 0, Adam reduces to AdaGrad with annealing learning rate α·t^(−1/2).
Relations to Existing Algorithms
Relation to RMSProp
RMSProp with momentum is the method most closely related to Adam.
Main differences:
RMSProp rescales the gradient and then applies momentum; Adam first applies momentum (the moving average) and then rescales.
RMSProp lacks bias correction, often leading to large step sizes in the early stages of a run (especially when β2 is close to 1).
Experiments
Outline
1 Optimization in Deep Learning
2 Adaptive Moment Estimation (Adam)
3 Convergence Analysis
4 Relations to Existing Algorithms
5 Experiments
6 AdaMax
7 Conclusion
Experiments
Logistic Regression on MNIST
L2-regularized logistic regression applied directly to image pixels:
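A hedged sketch of this kind of setup, with synthetic data standing in for MNIST pixels and every hyper-parameter chosen only for illustration (this is not the authors' experimental code).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 784))                 # stand-in for flattened 28x28 images
y = rng.integers(0, 10, size=512)               # stand-in for digit labels
W = np.zeros((784, 10))
M, R = np.zeros_like(W), np.zeros_like(W)
alpha, beta1, beta2, eps, lam, batch = 0.001, 0.9, 0.999, 1e-8, 1e-4, 128

for t in range(1, 201):
    idx = rng.choice(len(X), size=batch, replace=False)
    logits = X[idx] @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(batch), y[idx]] -= 1.0                    # d(cross-entropy)/d(logits)
    grad = X[idx].T @ probs / batch + 2 * lam * W             # loss gradient + L2 term
    M = beta1 * M + (1 - beta1) * grad
    R = beta2 * R + (1 - beta2) * grad ** 2
    W -= alpha * (M / (1 - beta1 ** t)) / np.sqrt(R / (1 - beta2 ** t) + eps)
```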
Experiments
Logistic Regression on IMDB
Dropout-regularized logistic regression applied to sparse Bag-of-Words features:
Experiments
Multi-Layer Neural Networks (Fully-Connected) on MNIST
2 hidden layers, 1000 units each, ReLU activation, dropout regularization:
Experiments
Convolutional Neural Networks on CIFAR-10
Network architecture:
conv-5x5@64 → pool-3x3(stride-2) → conv-5x5@64 → pool-3x3(stride-2) → conv-5x5@128 → pool-3x3(stride-2) → dense@1000 → dense@10:
Experiments
Bias Correction Term on Variational Auto-Encoder
Training a variational auto-encoder (single-hidden-layer network) with (red) and without (green) bias correction, for different values of β1, β2, α:
AdaMax
Outline
1 Optimization in Deep Learning
2 Adaptive Moment Estimation (Adam)
3 Convergence Analysis
4 Relations to Existing Algorithms
5 Experiments
6 AdaMax
7 Conclusion
AdaMax
Lp Generalization
Adam may be generalized by replacing gradient L2 norm with Lp norm:
Adam – Lp Generalization
M0 = 0, R0 = 0 (Initialization)
For t = 1, …, T:
  Mt = β1·Mt−1 + (1 − β1)·∇Lt(Wt−1)   (1st moment estimate)
  Rt = (β2)^p·Rt−1 + (1 − (β2)^p)·|∇Lt(Wt−1)|^p   (p'th moment estimate)
  M̂t = Mt / (1 − (β1)^t)   (1st moment bias correction)
  R̂t = Rt / (1 − (β2)^(pt))   (p'th moment bias correction)
  Wt = Wt−1 − α · M̂t / (R̂t + ε)^(1/p)   (Update)
Return WT
(β2 re-parameterized as (β2)^p for convenience)
AdaMax
p →∞ =⇒ AdaMax
When p →∞, ‖·‖p → max{·} and we get:
AdaMax
M0 = 0, U0 = 0 (Initialization)
For t = 1, …, T:
  Mt = β1·Mt−1 + (1 − β1)·∇Lt(Wt−1)   (1st moment estimate)
  Ut = max{ β2·Ut−1, |∇Lt(Wt−1)| }   ("∞" moment estimate)
  M̂t = Mt / (1 − (β1)^t)   (1st moment bias correction)
  Wt = Wt−1 − α · M̂t / Ut   (Update)
Return WT
In AdaMax, the step size is always bounded by α:
‖∆t‖∞ ≤ α
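A hedged NumPy sketch of one AdaMax iteration following the pseudocode above; the toy quadratic gradient is only for illustration.

```python
import numpy as np

def adamax_step(W, M, U, grad, t, alpha=0.002, beta1=0.9, beta2=0.999):
    M = beta1 * M + (1 - beta1) * grad         # 1st moment estimate
    U = np.maximum(beta2 * U, np.abs(grad))    # "infinity" moment estimate (no bias correction needed)
    W = W - alpha * (M / (1 - beta1 ** t)) / U # update (the slide notes this step is bounded by alpha)
    return W, M, U

# toy usage on the quadratic 0.5 * ||W||^2 (gradient of the objective is W itself)
W = np.array([5.0, -3.0])
M, U = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 1001):
    W, M, U = adamax_step(W, M, U, grad=W, t=t)
```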
Outline
1 Optimization in Deep Learning
2 Adaptive Moment Estimation (Adam)
3 Convergence Analysis
4 Relations to Existing Algorithms
5 Experiments
6 AdaMax
7 Conclusion
Conclusion
Adam:
Efficient first-order stochastic optimization method
Combines the advantages of:
AdaGrad – works well with sparse gradients
RMSProp – deals with non-stationary objectives
Parameter updates:
Have bounded norm
Are scale-invariant
Widely used in the deep learning community (e.g. at Google DeepMind)
Thank You