On Momentum Methods and Acceleration in Stochastic Optimization
Praneeth Netrapalli, MSR India
Based on joint work with Prateek Jain, Sham M. Kakade, Rahul Kidambi and Aaron Sidford
(MSR India; Univ. of Washington, Seattle; Stanford University)
Overview
• Optimization is a big part of large-scale machine learning
• Stochastic gradient descent (SGD) is the workhorse
• Three key aspects:
  ➢ Minibatching (parallelization)
  ➢ Model averaging (distributing over multiple machines)
  ➢ Acceleration (speeding up convergence)
• Empirically widely used, but poorly understood
Our work
Thorough investigation of these three aspects for linear regression
[accepted to JMLR] [COLT 2018 & ICLR 2018]
Gradient descent (GD) (Cauchy 1847)

Problem: min_w f(w)

Gradient descent: w_{t+1} = w_t − δ·∇f(w_t), where δ is the stepsize and ∇f(w_t) is the gradient.

[Figure: one gradient step from w_t to w_{t+1} along −δ·∇f(w_t) on the graph of f(w)]
Linear regression

f(w) = ‖X^⊤ w − Y‖₂²

• Basic problem that arises pervasively in applications
• Deeply studied in the literature
Gradient descent for linear regression

w_{t+1} = w_t − δ·X(X^⊤ w_t − Y)   (gradient step)

• f* = min_w f(w); condition number κ ≝ σ_max(XX^⊤)/σ_min(XX^⊤)
• Find w such that f(w) − f* ≤ ε
• Convergence rate: Θ(κ log((f(w_0) − f*)/ε)) iterations
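A minimal NumPy sketch of this update (the synthetic data, stepsize, and iteration count below are illustrative choices, not from the talk):

```python
import numpy as np

def gradient_descent_ls(X, Y, delta, num_iters):
    """Gradient descent for f(w) = ||X^T w - Y||_2^2, with X a d x n design matrix."""
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        grad = X @ (X.T @ w - Y)   # gradient direction (constant factor absorbed into delta)
        w = w - delta * grad       # w_{t+1} = w_t - delta * grad
    return w

# Illustrative usage on a synthetic, noiseless problem
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200))
w_star = rng.standard_normal(5)
Y = X.T @ w_star
w_hat = gradient_descent_ls(X, Y, delta=1e-3, num_iters=5000)
print(np.allclose(w_hat, w_star, atol=1e-6))
```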
Question: Is it possible to do better?

Gradient descent: w_{t+1} = w_t − δ·∇f(w_t)

• Hope: GD does not reuse past gradients
• Answer: Yes!
  ➢ Conjugate gradient (Hestenes and Stiefel 1952)
  ➢ Heavy ball method (Polyak 1964)
  ➢ Accelerated gradient (Nemirovsky and Yudin 1977, Nesterov 1983)
  ➢ Also known as momentum methods
Nesterov’s accelerated gradient (NAG)

w_{t+1} = v_t − δ·∇f(v_t)             (gradient step)
v_{t+1} = w_{t+1} + γ·(w_{t+1} − w_t)   (momentum step)

[Figure: iterates w_t, w_{t+1}, w_{t+2} and v_t, v_{t+1}, alternating gradient steps and momentum steps]

• f* = min_w f(w); condition number κ ≝ σ_max(XX^⊤)/σ_min(XX^⊤)
• Find w such that f(w) − f* ≤ ε
• Convergence rate: Θ(√κ log((f(w_0) − f*)/ε)) iterations
• Compared to: Θ(κ log((f(w_0) − f*)/ε)) for GD
Source: http://blog.mrtz.org/2014/08/18/robustness-versus-acceleration.html
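A minimal NumPy sketch of the two NAG updates above for the least-squares objective; the momentum parameter γ is left as an argument (a common choice in the strongly convex case is (√κ − 1)/(√κ + 1), stated here as general background rather than something from the slides):

```python
import numpy as np

def nag_ls(X, Y, delta, gamma, num_iters):
    """Nesterov's accelerated gradient for f(w) = ||X^T w - Y||_2^2."""
    d = X.shape[0]
    w_prev = np.zeros(d)
    v = np.zeros(d)
    for _ in range(num_iters):
        grad = X @ (X.T @ v - Y)       # gradient at the look-ahead point v_t
        w = v - delta * grad           # gradient step:  w_{t+1} = v_t - delta * grad f(v_t)
        v = w + gamma * (w - w_prev)   # momentum step:  v_{t+1} = w_{t+1} + gamma * (w_{t+1} - w_t)
        w_prev = w
    return w_prev
```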
Optimization in machine learning
• Observe n samples (x_1, y_1), …, (x_n, y_n) ∼ 𝒟 over ℝ^d × ℝ
• Empirical objective: min_w f̂(w) ≜ (1/n) Σ_i (x_i^⊤ w − y_i)²
• Ultimate goal: min_w f(w) ≜ 𝔼_{(x,y)∼𝒟} (x^⊤ w − y)²
• GD not applicable: cannot compute exact gradients of f
• Well studied: “stochastic approximation”
Stochastic algorithms (Robbins & Monro 1951)
• Replace ∇f(w_t) by a stochastic gradient ∇̂f(w_t), computed using just (x_t, y_t)
• Unbiased: 𝔼[∇̂f(w_t)] = ∇f(w_t)
• Return the average (1/n) Σ_i w_i (Polyak and Juditsky 1992)
• For linear regression, SGD: w_{t+1} = w_t − δ·(x_t^⊤ w_t − y_t) x_t
• Streaming algorithm: extremely efficient and widely used in practice
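A minimal sketch of this streaming SGD update with iterate averaging (the data format and stepsize are illustrative):

```python
import numpy as np

def sgd_ls(samples, delta):
    """Streaming SGD for least squares with Polyak-Juditsky iterate averaging.

    `samples` is a sequence of (x, y) pairs with x a length-d NumPy vector.
    """
    d = samples[0][0].shape[0]
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for x, y in samples:
        w = w - delta * (x @ w - y) * x   # w_{t+1} = w_t - delta * (x_t^T w_t - y_t) x_t
        w_sum += w
    return w_sum / len(samples)            # return the average of the iterates
```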
Examples
• Gradient descent → Stochastic GD
• Heavy ball → Stochastic HB
• Nesterov → Stochastic NAG
Convergence rate of SGD
• Consider the special case y = x^⊤ w*, i.e., no noise
• Convergence rate: Θ̃(κ log((f(w_0) − f*)/ε)) iterations
• f* = min_w f(w); ε = target suboptimality
• Condition number: κ ≝ max ‖x‖₂² / σ_min(𝔼[x x^⊤])
• Noisy case y = x^⊤ w* + η: additional additive term based on σ² ≝ 𝔼[η²]
State of the art (# iterations)

Deterministic case (O(nd) work per iteration):
• GD:  Θ(κ log((f(w_0) − f*)/ε))
• NAG: Θ(√κ log((f(w_0) − f*)/ε))

Stochastic approximation (O(d) work per iteration):
• SGD: Θ̃(κ log((f(w_0) − f*)/ε) + σ²d/ε)
  ➢ The σ²d/ε term is optimal (van der Vaart 2000); is the κ term optimal?
• Accelerated SGD? Unknown

Main question: Is accelerating SGD possible?
Is this really important?
• Extremely important in practice:
  ➢ As we saw, acceleration can give orders-of-magnitude improvement
  ➢ Almost all deep learning packages use SGD + momentum
  ➢ Our earlier work shows acceleration leads to more parallelizability
• No understanding of stochastic HB/NAG in the ML setting
  ➢ Studied in non-ML settings by [d’Aspremont 2008, Ghadimi & Lan 2010]
Outline of our results
1. Is it always possible to improve κ to √κ?
   ➢ No
2. Is improvement ever possible?
   ➢ Perhaps; it depends on other problem parameters
3. Do existing algorithms (stochastic HB/NAG) achieve this improvement?
   ➢ No; in fact they are no better than SGD
4. Can we design an algorithm improving over SGD?
   ➢ Yes, we design such an algorithm (ASGD)
Question 1: Is acceleration always possible?
• y = x^⊤ w* (noiseless)
• SGD convergence rate: Θ̃(κ log((f(w_0) − f*)/ε))
• Is an accelerated rate Õ(√κ log((f(w_0) − f*)/ε)) possible?
Example I: Discrete distribution
• x = (1, 0) with probability 0.9999 and x = (0, 1) with probability 0.0001
• In this case, κ ≝ max ‖x‖₂² / σ_min(𝔼[x x^⊤]) = 1/0.0001 = 10⁴
• Is Õ(√κ log((f(w_0) − f*)/ε)) possible? Or even: can we halve the error using ∼ √κ = 100 samples?
• With fewer than ∼ κ samples we typically do not observe the (0, 1) direction ⇒ cannot estimate the component of w* along that direction at all
• So we cannot do better than O(κ): acceleration is not possible for this distribution
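A quick NumPy illustration of this example (the sample sizes below are illustrative): with ∼√κ = 100 samples the rare direction is almost never observed, so no algorithm can estimate the corresponding component of w*.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 1e-4                       # probability of the rare direction; kappa = 1/p = 1e4
kappa = int(1 / p)

def sample_xs(n):
    """Draw n samples: (1, 0) w.p. 1-p and (0, 1) w.p. p."""
    rare = rng.random(n) < p
    xs = np.zeros((n, 2))
    xs[~rare, 0] = 1.0
    xs[rare, 1] = 1.0
    return xs

for n in [int(np.sqrt(kappa)), kappa, 10 * kappa]:
    xs = sample_xs(n)
    print(n, "samples -> rare direction seen", int(xs[:, 1].sum()), "times")
# With ~sqrt(kappa) = 100 samples the count is almost always 0,
# so w* cannot be estimated along that direction.
```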
Question 1: Is it always possible to improve κ to √κ?
Answer: No
Example II: Gaussian
• x ∼ 𝒩(0, H), with H = diag(0.9999, 0.0001)
• In this case, κ ∼ Tr(H)/σ_min(H) = 10⁴ ≫ 2
• However, after just 2 samples, (1/n) Σ_i x_i x_i^⊤ is invertible
• That is, it is possible to find w* = (Σ_i x_i x_i^⊤)^{−1} Σ_i x_i y_i after 2 samples!
• Acceleration might be possible for this distribution
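A NumPy illustration of the Gaussian example (noiseless, with the H from the slide; the ground-truth w* is an illustrative choice): two samples already determine w* exactly, even though κ ≈ 10⁴.

```python
import numpy as np

rng = np.random.default_rng(1)
H = np.diag([0.9999, 0.0001])
w_star = np.array([1.0, -2.0])              # illustrative ground truth

n = 2
xs = rng.multivariate_normal(np.zeros(2), H, size=n)
ys = xs @ w_star                             # noiseless labels y = x^T w*

# The empirical second-moment matrix is invertible with probability 1 once n >= d = 2
A = xs.T @ xs
w_hat = np.linalg.solve(A, xs.T @ ys)
print(np.allclose(w_hat, w_star))            # True: w* recovered from 2 samples
```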
Question 2: Is acceleration ever possible?
Answer: Perhaps, but it depends on other problem parameters
Matrix spectral concentration
• Recall x_i ∼ 𝒟; let H ≝ 𝔼[x_i x_i^⊤]
• Need (1/n) Σ_{i=1}^{n} x_i x_i^⊤ ≈ H; how many samples n are required?
• For scalars, this is governed by the variance 𝔼[x²] − 𝔼[x]²
• For matrices, by the matrix variance (Tropp 2015), which leads to the statistical condition number κ̃
Statistical vs computational condition number
• Let H ≝ 𝔼[x_i x_i^⊤]. For a unit vector e, define the random variable (e^⊤ x_i)² / 𝔼[(e^⊤ x_i)²], where 𝔼[(e^⊤ x_i)²] = e^⊤ H e
• Computational condition number: κ ≝ max ‖x_i‖₂² / min_e 𝔼[(e^⊤ x_i)²] = max ‖x_i‖₂² / σ_min(H), a worst-case bound on these random variables (note κ ≥ σ_max(H)/σ_min(H))
• Statistical condition number: κ̃, the analogous average-case (matrix-variance) quantity over the same random variables, so that κ̃ ≤ κ
• Acceleration might be possible if κ̃ ≪ κ
Discrete vs Gaussian

        Discrete distribution   Gaussian distribution
κ̃       ∼ 10⁴                   ∼ 100
κ       10⁴                     10⁴
        (κ̃ ∼ κ)                 (κ̃ ≪ κ)
Question 3: Do existing algorithms (stochastic HB/NAG) achieve this improvement?
• Answer: No. There exist distributions where κ̃ ≪ κ but the rate of stochastic HB is Ω̃(κ log((f(w_0) − f*)/ε))
• That is, no better than SGD
• The same statement seems true empirically for NAG as well
• These are fairly natural distributions
Empirical behavior of stochastic HB/NAG
[Figure: empirical convergence of stochastic HB/NAG on a Gaussian distribution]
Question 4: Can we design an algorithm improving over SGD?
• Yes!
• Convergence rate of ASGD: Õ(√(κ κ̃) log((f(w_0) − f*)/ε) + σ²d/ε)
• Compared to SGD: Θ̃(κ log((f(w_0) − f*)/ε) + σ²d/ε)
• Improvement since κ̃ ≤ κ, so √(κ κ̃) ≤ κ
Simulations
[Figure: convergence of ASGD vs SGD in simulations]
• ASGD does not achieve the lower bound κ̃ ∼ 100; we believe that is not possible
Algorithm
• Several equivalent versions of NAG in the deterministic setting:
  ➢ 2-iterate version – most popular (gradient and momentum interpretation)
  ➢ 4-iterate version – less popular (upper and lower bound interpretation)
• These versions are not equivalent in the stochastic setting

Our algorithm (ASGD). Parameters: α, β, γ, δ
1. u_0 = w_0
2. v_{t−1} = α·w_{t−1} + (1 − α)·u_{t−1}
3. w_t = v_{t−1} − δ·∇̂_t f(v_{t−1})
4. z_{t−1} = β·v_{t−1} + (1 − β)·u_{t−1}
5. u_t = z_{t−1} − γ·∇̂_t f(v_{t−1})
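A minimal NumPy sketch of these five steps for streaming least squares. The parameters α, β, γ, δ are left as arguments; the paper sets them from κ, κ̃ and the stepsize, and those settings are not reproduced here.

```python
import numpy as np

def asgd_ls(samples, alpha, beta, gamma, delta):
    """ASGD (4-iterate accelerated SGD) for least squares, one sample per step."""
    d = samples[0][0].shape[0]
    w = np.zeros(d)
    u = w.copy()                                  # step 1: u_0 = w_0
    for x, y in samples:
        v = alpha * w + (1 - alpha) * u           # step 2: coupling
        g = (x @ v - y) * x                       # stochastic gradient at v_{t-1}
        w = v - delta * g                         # step 3: short (gradient) step
        z = beta * v + (1 - beta) * u             # step 4: coupling
        u = z - gamma * g                         # step 5: long (gradient) step
    return w
```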
Proof overview
• Recall our guarantee: Õ(√(κ κ̃) log((f(w_0) − f*)/ε) + σ²d/ε)
• The first term depends on the initial error; the second is the statistical error
• Different analyses for the two terms:
  ➢ For the first term, analyze assuming σ = 0
  ➢ For the second term, analyze assuming w_0 = w*
Part I: Potential function
• Iterates w_t, u_t of ASGD; H ≝ 𝔼[x x^⊤]
• Existing analyses use the potential function ‖w_t − w*‖_H² + σ_min(H)·‖u_t − w*‖₂²
• We use ‖w_t − w*‖₂² + σ_min(H)·‖u_t − w*‖_{H^{−1}}²
• We show: ‖w_t − w*‖₂² + σ_min(H)·‖u_t − w*‖_{H^{−1}}² ≤ (1 − 1/√(κ κ̃)) · (‖w_{t−1} − w*‖₂² + σ_min(H)·‖u_{t−1} − w*‖_{H^{−1}}²)
Part II: Stochastic process analysis
• The coupled iterates satisfy a linear recursion:
  (w_{t+1} − w*, u_{t+1} − w*) = C·(w_t − w*, u_t − w*) + noise
• Let θ_t ≝ 𝔼[(w_t − w*, u_t − w*)(w_t − w*, u_t − w*)^⊤]
• Then θ_{t+1} = 𝔅 θ_t + noise·noise^⊤
• So θ_n → Σ_i 𝔅^i (noise·noise^⊤) = (𝕀 − 𝔅)^{−1}(noise·noise^⊤)
Part II: Stochastic process analysis (continued)
• Need to understand (𝕀 − 𝔅)^{−1}(noise·noise^⊤)
• 𝔅 has singular values > 1, but fortunately its eigenvalues are < 1
• Solve the 1-dimensional version of (𝕀 − 𝔅)^{−1}(noise·noise^⊤) via explicit computations
• Combine the 1-dimensional bounds with (statistical) condition number bounds
• e.g., 𝔼[w w^⊤] ≼ κ̃·H^{−1} + δ·𝐼
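A small NumPy sketch of the phenomenon in the second bullet (the matrix below is an arbitrary illustrative example, not the operator 𝔅 from the paper): a matrix can have largest singular value above 1 while its spectral radius is below 1, and then the Neumann series Σ_i B^i still converges to (I − B)^{−1}.

```python
import numpy as np

# Illustrative non-normal matrix: spectral radius < 1 but largest singular value > 1
B = np.array([[0.9, 1.0],
              [0.0, 0.5]])

print("eigenvalue magnitudes:", np.abs(np.linalg.eigvals(B)))            # all < 1
print("largest singular value:", np.linalg.svd(B, compute_uv=False)[0])  # > 1

# Neumann series sum_i B^i converges to (I - B)^{-1} despite ||B|| > 1
S = np.zeros_like(B)
P = np.eye(2)
for _ in range(2000):
    S += P
    P = P @ B
print(np.allclose(S, np.linalg.inv(np.eye(2) - B)))                      # True
```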
Recap so far
• For linear regression:
  ➢ We (theoretically) understand the behavior of the various algorithms
  ➢ Stochastic HB and NAG do not provide any improvement over SGD
  ➢ Our algorithm (ASGD) improves over SGD, HB and NAG
• In practice:
  ➢ What does this mean for other problems, e.g., training neural nets?
  ➢ No proofs, but do these intuitions help?
Stochastic HB and NAG in practice
• Why do stochastic HB and NAG still work better than SGD in practice?
• Conjecture: minibatching; a large minibatch brings us close to the deterministic setting
• The precise quantification of small vs large depends on the dataset
• Our algorithm ASGD improves over SGD even for small minibatches
ASGD (our algorithm) vs others
[Figure: training curves for a deep autoencoder on MNIST, small batch size (1)]

ASGD (our algorithm) vs NAG
[Figure: training curves for ResNet on CIFAR-10, small batch size (8)]

ASGD vs NAG
[Figure: training curves for ResNet on CIFAR-10, usual batch size (128)]
• Fast initial convergence – e.g., 1.5x faster than others to reach 90% accuracy
• Important for speeding up exploratory hyperparameter search
Recap
• Stochastic HB/NAG do not accelerate in the stochastic setting
• ASGD: acceleration is possible in the stochastic setting
• Theoretical results for linear regression
• Works well for training neural nets (even for small batch sizes)
• Code in PyTorch: https://github.com/rahulkidambi/AccSGD
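A hedged sketch of how such an optimizer would be used as a drop-in replacement in a PyTorch training loop. The import path and constructor arguments below (AccSGD, lr, kappa, xi) are assumptions based on the repository name and the algorithm's parameters, not verified against its README; the repository itself is the authoritative reference.

```python
import torch
import torch.nn as nn

# Hypothetical import; the actual module/class names come from the AccSGD repo
# (https://github.com/rahulkidambi/AccSGD) and may differ from what is shown here.
from AccSGD import AccSGD

model = nn.Linear(784, 10)
criterion = nn.CrossEntropyLoss()

# The argument names (lr, kappa, xi) are assumed labels for the stepsize and the
# long-step/coupling parameters of ASGD; consult the repo for the real signature.
optimizer = AccSGD(model.parameters(), lr=0.1, kappa=1000.0, xi=10.0)

def train_step(x, y):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()        # same optimizer interface as torch.optim.SGD
    return loss.item()
```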
Deterministic case:
• GD:  Θ(κ log((f(w_0) − f*)/ε))
• AGD: Θ(√κ log((f(w_0) − f*)/ε))

Stochastic approximation:
• SGD:  Θ̃(κ log((f(w_0) − f*)/ε) + σ²d/ε)
• ASGD: Õ(√(κ κ̃) log((f(w_0) − f*)/ε) + σ²d/ε)
Optimization in neural networks
• Optimization methods used in training large-scale neural networks are not well understood, even for convex problems
• E.g., Adam, RMSProp, etc.; there is interest in other methods as well
• Stochastic approximation provides a good framework
• Strong non-asymptotic results are of interest
Thank you!