On Momentum Methods and Acceleration in Stochastic Optimization
Praneeth Netrapalli, MSR India
Based on joint work with Prateek Jain, Sham M. Kakade, Rahul Kidambi and Aaron Sidford
(MSR India; Univ. of Washington, Seattle; Stanford University)
Overview
• Optimization is a big part of large-scale machine learning
• Stochastic gradient descent (SGD) is the workhorse
• Three key aspects:
  ➢ Minibatching (parallelization)
  ➢ Model averaging (distributing over multiple machines)
  ➢ Acceleration (speeding up convergence)
• Empirically widely used, but poorly understood
Our work
Thorough investigation of these three aspects for linear regression
[accepted to JMLR] [COLT 2018 & ICLR 2018]
Gradient descent (GD) (Cauchy 1847)

Problem: min_w f(w)

Gradient descent: w_{t+1} = w_t − δ·∇f(w_t), where δ is the stepsize and ∇f(w_t) is the gradient.

[Figure: one gradient step from w_t to w_{t+1} along −δ·∇f(w_t) on the graph of f(w)]
Linear regression

f(w) = ‖X^⊤ w − Y‖₂²

• Basic problem that arises pervasively in applications
• Deeply studied in the literature
Gradient descent for linear regression

w_{t+1} = w_t − δ·X(X^⊤ w_t − Y)   (gradient step)

• f* = min_w f(w); condition number κ ≝ σ_max(XX^⊤)/σ_min(XX^⊤)
• Find w such that f(w) − f* ≤ ε
• Convergence rate: Θ(κ log((f(w_0) − f*)/ε)) iterations
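A minimal NumPy sketch of this update (the synthetic data, stepsize, and iteration count below are illustrative choices, not from the talk):

```python
import numpy as np

def gradient_descent_ls(X, Y, delta, num_iters):
    """Gradient descent for f(w) = ||X^T w - Y||_2^2, with X a d x n design matrix."""
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        grad = X @ (X.T @ w - Y)   # gradient direction (constant factor absorbed into delta)
        w = w - delta * grad       # w_{t+1} = w_t - delta * grad
    return w

# Illustrative usage on a synthetic, noiseless problem
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200))
w_star = rng.standard_normal(5)
Y = X.T @ w_star
w_hat = gradient_descent_ls(X, Y, delta=1e-3, num_iters=5000)
print(np.allclose(w_hat, w_star, atol=1e-6))
```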
Question: Is it possible to do better?

Gradient descent: w_{t+1} = w_t − δ·∇f(w_t)

• Hope: GD does not reuse past gradients
• Answer: Yes!
  ➢ Conjugate gradient (Hestenes and Stiefel 1952)
  ➢ Heavy ball method (Polyak 1964)
  ➢ Accelerated gradient (Nemirovsky and Yudin 1977, Nesterov 1983)
  ➢ Also known as momentum methods
Nesterov’s accelerated gradient (NAG)

w_{t+1} = v_t − δ·∇f(v_t)             (gradient step)
v_{t+1} = w_{t+1} + γ·(w_{t+1} − w_t)   (momentum step)

[Figure: iterates w_t, w_{t+1}, w_{t+2} and v_t, v_{t+1}, alternating gradient steps and momentum steps]

• f* = min_w f(w); condition number κ ≝ σ_max(XX^⊤)/σ_min(XX^⊤)
• Find w such that f(w) − f* ≤ ε
• Convergence rate: Θ(√κ log((f(w_0) − f*)/ε)) iterations
• Compared to: Θ(κ log((f(w_0) − f*)/ε)) for GD
Source: http://blog.mrtz.org/2014/08/18/robustness-versus-acceleration.html
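A minimal NumPy sketch of the two NAG updates above for the least-squares objective; the momentum parameter γ is left as an argument (a common choice in the strongly convex case is (√κ − 1)/(√κ + 1), stated here as general background rather than something from the slides):

```python
import numpy as np

def nag_ls(X, Y, delta, gamma, num_iters):
    """Nesterov's accelerated gradient for f(w) = ||X^T w - Y||_2^2."""
    d = X.shape[0]
    w_prev = np.zeros(d)
    v = np.zeros(d)
    for _ in range(num_iters):
        grad = X @ (X.T @ v - Y)       # gradient at the look-ahead point v_t
        w = v - delta * grad           # gradient step:  w_{t+1} = v_t - delta * grad f(v_t)
        v = w + gamma * (w - w_prev)   # momentum step:  v_{t+1} = w_{t+1} + gamma * (w_{t+1} - w_t)
        w_prev = w
    return w_prev
```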
Optimization in machine learning
• Observe n samples (x_1, y_1), …, (x_n, y_n) ∼ 𝒟 over ℝ^d × ℝ
• Empirical objective: min_w f̂(w) ≜ (1/n) Σ_i (x_i^⊤ w − y_i)²
• Ultimate goal: min_w f(w) ≜ 𝔼_{(x,y)∼𝒟} (x^⊤ w − y)²
• GD not applicable: cannot compute exact gradients of f
• Well studied: “stochastic approximation”
Stochastic algorithms (Robbins & Monro 1951)
• Replace ∇f(w_t) by a stochastic gradient ∇̂f(w_t), computed using just (x_t, y_t)
• Unbiased: 𝔼[∇̂f(w_t)] = ∇f(w_t)
• Return the average (1/n) Σ_i w_i (Polyak and Juditsky 1992)
• For linear regression, SGD: w_{t+1} = w_t − δ·(x_t^⊤ w_t − y_t) x_t
• Streaming algorithm: extremely efficient and widely used in practice
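A minimal sketch of this streaming SGD update with iterate averaging (the data format and stepsize are illustrative):

```python
import numpy as np

def sgd_ls(samples, delta):
    """Streaming SGD for least squares with Polyak-Juditsky iterate averaging.

    `samples` is a sequence of (x, y) pairs with x a length-d NumPy vector.
    """
    d = samples[0][0].shape[0]
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for x, y in samples:
        w = w - delta * (x @ w - y) * x   # w_{t+1} = w_t - delta * (x_t^T w_t - y_t) x_t
        w_sum += w
    return w_sum / len(samples)            # return the average of the iterates
```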
Examples
• Gradient descent → Stochastic GD
• Heavy ball → Stochastic HB
• Nesterov → Stochastic NAG
Convergence rate of SGD
• Consider the special case y = x^⊤ w*, i.e., no noise
• Convergence rate: Θ̃(κ log((f(w_0) − f*)/ε)) iterations
• f* = min_w f(w); ε = target suboptimality
• Condition number: κ ≝ max ‖x‖₂² / σ_min(𝔼[x x^⊤])
• Noisy case y = x^⊤ w* + η: additional additive term based on σ² ≝ 𝔼[η²]
State of the art (# iterations)

Deterministic case (O(nd) work per iteration):
• GD:  Θ(κ log((f(w_0) − f*)/ε))
• NAG: Θ(√κ log((f(w_0) − f*)/ε))

Stochastic approximation (O(d) work per iteration):
• SGD: Θ̃(κ log((f(w_0) − f*)/ε) + σ²d/ε)
  ➢ The σ²d/ε term is optimal (van der Vaart 2000); is the κ term optimal?
• Accelerated SGD? Unknown

Main question: Is accelerating SGD possible?
Is this really important?
• Extremely important in practice:
  ➢ As we saw, acceleration can give orders-of-magnitude improvement
  ➢ Almost all deep learning packages use SGD + momentum
  ➢ Our earlier work shows acceleration leads to more parallelizability
• No understanding of stochastic HB/NAG in the ML setting
  ➢ Studied in non-ML settings by [d’Aspremont 2008, Ghadimi & Lan 2010]
Outline of our results
1. Is it always possible to improve κ to √κ?
   ➢ No
2. Is improvement ever possible?
   ➢ Perhaps; it depends on other problem parameters
3. Do existing algorithms (stochastic HB/NAG) achieve this improvement?
   ➢ No; in fact they are no better than SGD
4. Can we design an algorithm improving over SGD?
   ➢ Yes, we design such an algorithm (ASGD)
Question 1: Is acceleration always possible?
• y = x^⊤ w* (noiseless)
• SGD convergence rate: Θ̃(κ log((f(w_0) − f*)/ε))
• Is an accelerated rate Õ(√κ log((f(w_0) − f*)/ε)) possible?
Example I: Discrete distribution
• x = (1, 0) with probability 0.9999 and x = (0, 1) with probability 0.0001
• In this case, κ ≝ max ‖x‖₂² / σ_min(𝔼[x x^⊤]) = 1/0.0001 = 10⁴
• Is Õ(√κ log((f(w_0) − f*)/ε)) possible? Or even: can we halve the error using ∼ √κ = 100 samples?
• With fewer than ∼ κ samples we typically do not observe the (0, 1) direction ⇒ cannot estimate the component of w* along that direction at all
• So we cannot do better than O(κ): acceleration is not possible for this distribution
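A quick NumPy illustration of this example (the sample sizes below are illustrative): with ∼√κ = 100 samples the rare direction is almost never observed, so no algorithm can estimate the corresponding component of w*.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 1e-4                       # probability of the rare direction; kappa = 1/p = 1e4
kappa = int(1 / p)

def sample_xs(n):
    """Draw n samples: (1, 0) w.p. 1-p and (0, 1) w.p. p."""
    rare = rng.random(n) < p
    xs = np.zeros((n, 2))
    xs[~rare, 0] = 1.0
    xs[rare, 1] = 1.0
    return xs

for n in [int(np.sqrt(kappa)), kappa, 10 * kappa]:
    xs = sample_xs(n)
    print(n, "samples -> rare direction seen", int(xs[:, 1].sum()), "times")
# With ~sqrt(kappa) = 100 samples the count is almost always 0,
# so w* cannot be estimated along that direction.
```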
Question 1: Is it always possible to improve κ to √κ?
Answer: No
Example II: Gaussian
• x ∼ 𝒩(0, H), with H = diag(0.9999, 0.0001)
• In this case, κ ∼ Tr(H)/σ_min(H) = 10⁴ ≫ 2
• However, after just 2 samples, (1/n) Σ_i x_i x_i^⊤ is invertible
• That is, it is possible to find w* = (Σ_i x_i x_i^⊤)^{−1} Σ_i x_i y_i after 2 samples!
• Acceleration might be possible for this distribution
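A NumPy illustration of the Gaussian example (noiseless, with the H from the slide; the ground-truth w* is an illustrative choice): two samples already determine w* exactly, even though κ ≈ 10⁴.

```python
import numpy as np

rng = np.random.default_rng(1)
H = np.diag([0.9999, 0.0001])
w_star = np.array([1.0, -2.0])              # illustrative ground truth

n = 2
xs = rng.multivariate_normal(np.zeros(2), H, size=n)
ys = xs @ w_star                             # noiseless labels y = x^T w*

# The empirical second-moment matrix is invertible with probability 1 once n >= d = 2
A = xs.T @ xs
w_hat = np.linalg.solve(A, xs.T @ ys)
print(np.allclose(w_hat, w_star))            # True: w* recovered from 2 samples
```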
Question 2: Is acceleration ever possible?
Answer: Perhaps, but it depends on other problem parameters
Matrix spectral concentration
• Recall x_i ∼ 𝒟; let H ≝ 𝔼[x_i x_i^⊤]
• Need (1/n) Σ_{i=1}^{n} x_i x_i^⊤ ≈ H; how many samples n are required?
• For scalars, this is governed by the variance 𝔼[x²] − 𝔼[x]²
• For matrices, by the matrix variance (Tropp 2015), which leads to the statistical condition number κ̃
Statistical vs computational condition number
• Let H ≝ 𝔼[x_i x_i^⊤]. For a unit vector e, define the random variable (e^⊤ x_i)² / 𝔼[(e^⊤ x_i)²], where 𝔼[(e^⊤ x_i)²] = e^⊤ H e
• Computational condition number: κ ≝ max ‖x_i‖₂² / min_e 𝔼[(e^⊤ x_i)²] = max ‖x_i‖₂² / σ_min(H), a worst-case bound on these random variables (note κ ≥ σ_max(H)/σ_min(H))
• Statistical condition number: κ̃, the analogous average-case (matrix-variance) quantity over the same random variables, so that κ̃ ≤ κ
• Acceleration might be possible if κ̃ ≪ κ
Discrete vs Gaussian

        Discrete distribution   Gaussian distribution
κ̃       ∼ 10⁴                   ∼ 100
κ       10⁴                     10⁴
        (κ̃ ∼ κ)                 (κ̃ ≪ κ)
Question 3: Do existing algorithms (stochastic HB/NAG) achieve this improvement?
• Answer: No. There exist distributions where κ̃ ≪ κ but the rate of stochastic HB is Ω̃(κ log((f(w_0) − f*)/ε))
• That is, no better than SGD
• The same statement seems true empirically for NAG as well
• These are fairly natural distributions
Empirical behavior of stochastic HB/NAG
[Figure: empirical convergence of stochastic HB/NAG on a Gaussian distribution]
Question 4: Can we design an algorithm improving over SGD?
• Yes!
• Convergence rate of ASGD: Õ(√(κ κ̃) log((f(w_0) − f*)/ε) + σ²d/ε)
• Compared to SGD: Θ̃(κ log((f(w_0) − f*)/ε) + σ²d/ε)
• Improvement since κ̃ ≤ κ, so √(κ κ̃) ≤ κ
Simulations
[Figure: convergence of ASGD vs SGD in simulations]
• ASGD does not achieve the lower bound κ̃ ∼ 100; we believe that is not possible
Algorithm
• Several equivalent versions of NAG in the deterministic setting:
  ➢ 2-iterate version – most popular (gradient and momentum interpretation)
  ➢ 4-iterate version – less popular (upper and lower bound interpretation)
• These versions are not equivalent in the stochastic setting

Our algorithm (ASGD). Parameters: α, β, γ, δ
1. u_0 = w_0
2. v_{t−1} = α·w_{t−1} + (1 − α)·u_{t−1}
3. w_t = v_{t−1} − δ·∇̂_t f(v_{t−1})
4. z_{t−1} = β·v_{t−1} + (1 − β)·u_{t−1}
5. u_t = z_{t−1} − γ·∇̂_t f(v_{t−1})
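A minimal NumPy sketch of these five steps for streaming least squares. The parameters α, β, γ, δ are left as arguments; the paper sets them from κ, κ̃ and the stepsize, and those settings are not reproduced here.

```python
import numpy as np

def asgd_ls(samples, alpha, beta, gamma, delta):
    """ASGD (4-iterate accelerated SGD) for least squares, one sample per step."""
    d = samples[0][0].shape[0]
    w = np.zeros(d)
    u = w.copy()                                  # step 1: u_0 = w_0
    for x, y in samples:
        v = alpha * w + (1 - alpha) * u           # step 2: coupling
        g = (x @ v - y) * x                       # stochastic gradient at v_{t-1}
        w = v - delta * g                         # step 3: short (gradient) step
        z = beta * v + (1 - beta) * u             # step 4: coupling
        u = z - gamma * g                         # step 5: long (gradient) step
    return w
```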
Proof overview
• Recall our guarantee: Õ(√(κ κ̃) log((f(w_0) − f*)/ε) + σ²d/ε)
• The first term depends on the initial error; the second is the statistical error
• Different analyses for the two terms:
  ➢ For the first term, analyze assuming σ = 0
  ➢ For the second term, analyze assuming w_0 = w*
Part I: Potential function
• Iterates w_t, u_t of ASGD; H ≝ 𝔼[x x^⊤]
• Existing analyses use the potential function ‖w_t − w*‖_H² + σ_min(H)·‖u_t − w*‖₂²
• We use ‖w_t − w*‖₂² + σ_min(H)·‖u_t − w*‖_{H^{−1}}²
• We show: ‖w_t − w*‖₂² + σ_min(H)·‖u_t − w*‖_{H^{−1}}² ≤ (1 − 1/√(κ κ̃)) · (‖w_{t−1} − w*‖₂² + σ_min(H)·‖u_{t−1} − w*‖_{H^{−1}}²)
Part II: Stochastic process analysis
• The coupled iterates satisfy a linear recursion:
  (w_{t+1} − w*, u_{t+1} − w*) = C·(w_t − w*, u_t − w*) + noise
• Let θ_t ≝ 𝔼[(w_t − w*, u_t − w*)(w_t − w*, u_t − w*)^⊤]
• Then θ_{t+1} = 𝔅 θ_t + noise·noise^⊤
• So θ_n → Σ_i 𝔅^i (noise·noise^⊤) = (𝕀 − 𝔅)^{−1}(noise·noise^⊤)
Part II: Stochastic process analysis (continued)
• Need to understand (𝕀 − 𝔅)^{−1}(noise·noise^⊤)
• 𝔅 has singular values > 1, but fortunately its eigenvalues are < 1
• Solve the 1-dimensional version of (𝕀 − 𝔅)^{−1}(noise·noise^⊤) via explicit computations
• Combine the 1-dimensional bounds with (statistical) condition number bounds
• e.g., 𝔼[w w^⊤] ≼ κ̃·H^{−1} + δ·𝐼
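A small NumPy sketch of the phenomenon in the second bullet (the matrix below is an arbitrary illustrative example, not the operator 𝔅 from the paper): a matrix can have largest singular value above 1 while its spectral radius is below 1, and then the Neumann series Σ_i B^i still converges to (I − B)^{−1}.

```python
import numpy as np

# Illustrative non-normal matrix: spectral radius < 1 but largest singular value > 1
B = np.array([[0.9, 1.0],
              [0.0, 0.5]])

print("eigenvalue magnitudes:", np.abs(np.linalg.eigvals(B)))            # all < 1
print("largest singular value:", np.linalg.svd(B, compute_uv=False)[0])  # > 1

# Neumann series sum_i B^i converges to (I - B)^{-1} despite ||B|| > 1
S = np.zeros_like(B)
P = np.eye(2)
for _ in range(2000):
    S += P
    P = P @ B
print(np.allclose(S, np.linalg.inv(np.eye(2) - B)))                      # True
```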
Recap so far
• For linear regression:
  ➢ We (theoretically) understand the behavior of the various algorithms
  ➢ Stochastic HB and NAG do not provide any improvement over SGD
  ➢ Our algorithm (ASGD) improves over SGD, HB and NAG
• In practice:
  ➢ What does this mean for other problems, e.g., training neural nets?
  ➢ No proofs, but do these intuitions help?
Stochastic HB and NAG in practice
• Why do stochastic HB and NAG still work better than SGD in practice?
• Conjecture: minibatching; a large minibatch brings us close to the deterministic setting
• The precise quantification of small vs large depends on the dataset
• Our algorithm ASGD improves over SGD even for small minibatches
ASGD (our algorithm) vs others
[Figure: training curves for a deep autoencoder on MNIST, small batch size (1)]

ASGD (our algorithm) vs NAG
[Figure: training curves for ResNet on CIFAR-10, small batch size (8)]

ASGD vs NAG
[Figure: training curves for ResNet on CIFAR-10, usual batch size (128)]
• Fast initial convergence – e.g., 1.5x faster than others to reach 90% accuracy
• Important for speeding up exploratory hyperparameter search
Recap
• Stochastic HB/NAG do not accelerate in the stochastic setting
• ASGD: acceleration is possible in the stochastic setting
• Theoretical results for linear regression
• Works well for training neural nets (even for small batch sizes)
• Code in PyTorch: https://github.com/rahulkidambi/AccSGD
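A hedged sketch of how such an optimizer would be used as a drop-in replacement in a PyTorch training loop. The import path and constructor arguments below (AccSGD, lr, kappa, xi) are assumptions based on the repository name and the algorithm's parameters, not verified against its README; the repository itself is the authoritative reference.

```python
import torch
import torch.nn as nn

# Hypothetical import; the actual module/class names come from the AccSGD repo
# (https://github.com/rahulkidambi/AccSGD) and may differ from what is shown here.
from AccSGD import AccSGD

model = nn.Linear(784, 10)
criterion = nn.CrossEntropyLoss()

# The argument names (lr, kappa, xi) are assumed labels for the stepsize and the
# long-step/coupling parameters of ASGD; consult the repo for the real signature.
optimizer = AccSGD(model.parameters(), lr=0.1, kappa=1000.0, xi=10.0)

def train_step(x, y):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()        # same optimizer interface as torch.optim.SGD
    return loss.item()
```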
Deterministic case:
• GD:  Θ(κ log((f(w_0) − f*)/ε))
• AGD: Θ(√κ log((f(w_0) − f*)/ε))

Stochastic approximation:
• SGD:  Θ̃(κ log((f(w_0) − f*)/ε) + σ²d/ε)
• ASGD: Õ(√(κ κ̃) log((f(w_0) − f*)/ε) + σ²d/ε)
Optimization in neural networks
• Optimization methods used in training large-scale neural networks are not well understood, even for convex problems
• E.g., Adam, RMSProp, etc.; there is interest in other methods as well
• Stochastic approximation provides a good framework
• Strong non-asymptotic results are of interest
Thank you!