
Page 1:

On Momentum Methods and Acceleration in Stochastic Optimization

Praneeth Netrapalli
MSR India

Based on joint work with
Prateek Jain, Sham M. Kakade, Rahul Kidambi and Aaron Sidford
(MSR India; Univ. of Washington, Seattle; Stanford University)

Page 2:

Overview

• Optimization is a big part of large scale machine learning
• Stochastic gradient descent (SGD) is the workhorse
• Three key aspects:
  ➢ Minibatching (parallelization)
  ➢ Model averaging (distributed over multiple machines)
  ➢ Acceleration (speeding up convergence)
• Empirically widely used but little understood

Page 3:

Our work: a thorough investigation of these three aspects for linear regression
[accepted to JMLR], [accepted to JMLR], [COLT 2018 & ICLR 2018]

Page 4:

Gradient descent (GD) (Cauchy 1847)

Problem: $\min_w f(w)$

Gradient descent: $w_{t+1} = w_t - \delta \cdot \nabla f(w_t)$, where $\delta$ is the step size and $\nabla f(w_t)$ is the gradient.

[Figure: one GD step on the graph of $f(w)$, moving from $w_t$ to $w_{t+1}$ along $-\delta \cdot \nabla f(w_t)$]

Page 5:

Linear regression

$$f(w) = \|X^\top w - Y\|_2^2$$

• Basic problem that arises pervasively in applications
• Deeply studied in the literature

Page 6:

Gradient descent for linear regression

$$w_{t+1} = w_t - \delta \cdot X\left(X^\top w_t - Y\right)$$

(code sketch below)

• $f^* = \min_w f(w)$; condition number $\kappa \triangleq \dfrac{\sigma_{\max}(XX^\top)}{\sigma_{\min}(XX^\top)}$
• Goal: find $w$ such that $f(w) - f^* \le \epsilon$
• Convergence rate: $\Theta\!\left(\kappa \log \dfrac{f(w_0) - f^*}{\epsilon}\right)$ iterations
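To make the update concrete, here is a minimal numpy sketch of gradient descent for $f(w) = \|X^\top w - Y\|_2^2$. The step-size choice is an illustrative placeholder; the slides do not specify one.

```python
import numpy as np

# Minimal sketch of gradient descent for f(w) = ||X^T w - Y||^2.
# The step size below (1 / (2 * sigma_max(X X^T))) is a conservative,
# illustrative choice, not taken from the slides.
def gradient_descent(X, Y, num_iters=1000):
    d, n = X.shape                                     # columns of X are samples
    w = np.zeros(d)
    delta = 1.0 / (2 * np.linalg.norm(X @ X.T, 2))     # spectral norm of X X^T
    for _ in range(num_iters):
        grad = X @ (X.T @ w - Y)                       # gradient up to a factor of 2
        w = w - delta * grad
    return w

# Usage (illustrative): X = np.random.randn(5, 100); w_star = np.random.randn(5)
# Y = X.T @ w_star; print(np.linalg.norm(gradient_descent(X, Y) - w_star))
```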

Page 7:

Question: Is it possible to do better?

Gradient descent: $w_{t+1} = w_t - \delta \cdot \nabla f(w_t)$

• Hope: GD does not reuse past gradients
• Answer: Yes!
  ➢ Conjugate gradient (Hestenes and Stiefel 1952)
  ➢ Heavy ball method (Polyak 1964) (sketch below)
  ➢ Accelerated gradient (Nemirovsky and Yudin 1977, Nesterov 1983)
  ➢ Also known as momentum methods
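For reference, a minimal sketch of Polyak's heavy ball update for the same least-squares objective. The step size and momentum coefficient are illustrative placeholders, not values from the slides.

```python
import numpy as np

# Minimal sketch of Polyak's heavy ball method for f(w) = ||X^T w - Y||^2.
# delta and beta are illustrative placeholders.
def heavy_ball(X, Y, delta=1e-3, beta=0.9, num_iters=1000):
    d = X.shape[0]
    w_prev = np.zeros(d)
    w = np.zeros(d)
    for _ in range(num_iters):
        grad = X @ (X.T @ w - Y)
        # gradient step plus a momentum term that reuses the previous displacement
        w, w_prev = w - delta * grad + beta * (w - w_prev), w
    return w
```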

Page 8:

Nesterov's accelerated gradient (NAG)

• $f^* = \min_w f(w)$; condition number $\kappa \triangleq \dfrac{\sigma_{\max}(XX^\top)}{\sigma_{\min}(XX^\top)}$
• Goal: find $w$ such that $f(w) - f^* \le \epsilon$
• Convergence rate: $\Theta\!\left(\sqrt{\kappa} \log \dfrac{f(w_0) - f^*}{\epsilon}\right)$ iterations

[Figure: iterates $w_t, w_{t+1}, w_{t+2}$ and $v_t, v_{t+1}$, alternating gradient steps and momentum steps]

Page 9:

Nesterov's accelerated gradient (NAG)

$$w_{t+1} = v_t - \delta \nabla f(v_t), \qquad v_{t+1} = w_{t+1} + \gamma (w_{t+1} - w_t)$$

(gradient steps and momentum steps; code sketch below)

• Convergence rate: $\Theta\!\left(\sqrt{\kappa} \log \dfrac{f(w_0) - f^*}{\epsilon}\right)$ iterations
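A minimal sketch of the two-iterate NAG update above, applied to the least-squares objective. The values of $\delta$ and $\gamma$ are illustrative placeholders rather than tuned settings.

```python
import numpy as np

# Minimal sketch of two-iterate NAG for f(w) = ||X^T w - Y||^2.
# delta and gamma are illustrative placeholders.
def nag(X, Y, delta=1e-3, gamma=0.9, num_iters=1000):
    d = X.shape[0]
    w = np.zeros(d)
    v = np.zeros(d)
    for _ in range(num_iters):
        grad_v = X @ (X.T @ v - Y)          # gradient at the lookahead point v_t
        w_next = v - delta * grad_v         # gradient step
        v = w_next + gamma * (w_next - w)   # momentum step
        w = w_next
    return w
```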

Page 10:

Nesterov's accelerated gradient (NAG)

• Convergence rate: $\Theta\!\left(\sqrt{\kappa} \log \dfrac{f(w_0) - f^*}{\epsilon}\right)$ iterations, compared to $O\!\left(\kappa \log \dfrac{f(w_0) - f^*}{\epsilon}\right)$ for GD

Page 11:

Source: http://blog.mrtz.org/2014/08/18/robustness-versus-acceleration.html

Page 12:

Optimization in machine learning

• Observe $n$ samples $(x_1, y_1), \ldots, (x_n, y_n) \sim \mathcal{D}$ over $\mathbb{R}^d \times \mathbb{R}$

$$\min_w \hat{f}(w) \triangleq \frac{1}{n} \sum_i \left(x_i^\top w - y_i\right)^2$$

• Ultimate goal: $\min_w f(w) \triangleq \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\left(x^\top w - y\right)^2\right]$
• GD not applicable to this goal: exact gradients of $f$ cannot be computed
• Well studied setting: "stochastic approximation"

Page 13:

Stochastic algorithms (Robbins & Monro 1951)

• Replace $\nabla f(w_t)$ by a stochastic gradient $\widehat{\nabla} f(w_t)$, computed just using $(x_t, y_t)$
• $\mathbb{E}\big[\widehat{\nabla} f(w_t)\big] = \nabla f(w_t)$
• Return the average $\frac{1}{n} \sum_i w_i$ (Polyak and Juditsky 1992)
• For linear regression, SGD: $w_{t+1} = w_t - \delta \cdot \left(x_t^\top w_t - y_t\right) x_t$ (code sketch below)
• Streaming algorithm: extremely efficient and widely used in practice

Examples: Gradient descent → Stochastic GD; Heavy ball → Stochastic HB; Nesterov → Stochastic NAG
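A minimal sketch of the streaming SGD update above with Polyak-Juditsky iterate averaging. The constant step size is an illustrative placeholder.

```python
import numpy as np

# Minimal sketch of streaming SGD for linear regression with iterate averaging.
# delta is an illustrative placeholder; the slides do not specify a step size.
def sgd_linear_regression(samples, d, delta=0.01):
    """samples: iterable of (x, y) pairs with x a length-d numpy array."""
    w = np.zeros(d)
    w_sum = np.zeros(d)
    n = 0
    for x, y in samples:
        grad = (x @ w - y) * x      # stochastic gradient from a single sample
        w = w - delta * grad
        w_sum += w
        n += 1
    return w_sum / n                # averaged iterate (Polyak-Juditsky)
```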

Page 14:

Convergence rate of SGD

• Consider the special case $y = x^\top w^*$, i.e., no noise
• Convergence rate: $\widetilde{\Theta}\!\left(\kappa \log \dfrac{f(w_0) - f^*}{\epsilon}\right)$ iterations
• $f^* = \min_w f(w)$; $\epsilon$ = target suboptimality
• Condition number: $\kappa \triangleq \dfrac{\max \|x\|_2^2}{\sigma_{\min}\!\left(\mathbb{E}[x x^\top]\right)}$ (estimation sketch below)
• Noisy case: $y = x^\top w^* + \eta$; an additive term based on $\sigma^2 \triangleq \mathbb{E}[\eta^2]$ appears in the rate
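As a side note, the condition number defined above can be estimated from a sample. The plug-in estimator and function name below are my own illustrative choices, not from the slides.

```python
import numpy as np

# Sketch: plug-in estimate of kappa = max ||x||^2 / sigma_min(E[x x^T])
# from a sample matrix X (n x d, rows are samples). Purely illustrative.
def estimate_kappa(X):
    max_sq_norm = np.max(np.sum(X**2, axis=1))
    H_hat = (X.T @ X) / X.shape[0]                 # empirical second moment
    return max_sq_norm / np.min(np.linalg.eigvalsh(H_hat))
```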

Page 15:

State of the art (# iterations)

Deterministic case ($O(nd)$ work per iteration):
  • GD: $\Theta\!\left(\kappa \log \frac{f(w_0) - f^*}{\epsilon}\right)$
  • NAG: $\Theta\!\left(\sqrt{\kappa} \log \frac{f(w_0) - f^*}{\epsilon}\right)$

Stochastic approximation ($O(d)$ work per iteration):
  • SGD: $\widetilde{\Theta}\!\left(\kappa \log \frac{f(w_0) - f^*}{\epsilon}\right) + \frac{\sigma^2 d}{\epsilon}$
  • Accelerated SGD? Unknown

Main question: Is accelerating SGD possible?

Page 16:

State of the art (# iterations), continued:

• The statistical error term $\frac{\sigma^2 d}{\epsilon}$ for SGD is optimal (van der Vaart 2000)
• Is the $\widetilde{\Theta}\!\left(\kappa \log \frac{f(w_0) - f^*}{\epsilon}\right)$ term optimal?

Main question: Is accelerating SGD possible?

Page 17:

Is this really important?

• Extremely important in practice:
  ➢ As we saw, acceleration can give orders-of-magnitude improvement
  ➢ Almost all deep learning packages use SGD + momentum
  ➢ Our earlier work shows acceleration leads to more parallelizability
• No understanding of stochastic HB/NAG in the ML setting
  • Studied in non-ML settings by [d'Aspremont 2008, Ghadimi & Lan 2010]

Page 18:

Outline of our results

1. Is it always possible to improve $\kappa$ to $\sqrt{\kappa}$?
   ➢ No
2. Is improvement ever possible?
   ➢ Perhaps; it depends on other problem parameters
3. Do existing algorithms (stochastic HB/NAG) achieve this improvement?
   ➢ No; in fact they are no better than SGD
4. Can we design an algorithm improving over SGD?
   ➢ Yes, we design such an algorithm (ASGD)

Page 19:

Question 1: Is acceleration always possible?

• $y = x^\top w^*$ (noiseless)
• SGD convergence rate: $\widetilde{\Theta}\!\left(\kappa \log \dfrac{f(w_0) - f^*}{\epsilon}\right)$
• Accelerated rate $\widetilde{O}\!\left(\sqrt{\kappa} \log \dfrac{f(w_0) - f^*}{\epsilon}\right)$?

Page 20:

Example I: Discrete distribution

$$x = \begin{cases} (1, 0)^\top & \text{w.p. } 0.9999 \\ (0, 1)^\top & \text{w.p. } 0.0001 \end{cases}$$

• In this case, $\kappa \triangleq \dfrac{\max \|x\|_2^2}{\sigma_{\min}\!\left(\mathbb{E}[x x^\top]\right)} = \dfrac{1}{0.0001} = 10^4$
• Is $\widetilde{O}\!\left(\sqrt{\kappa} \log \dfrac{f(w_0) - f^*}{\epsilon}\right)$ possible?
• Or even: can we halve the error using $\sim \sqrt{\kappa} = 100$ samples?

Page 21:

Example I: Discrete distribution (continued)

$$x = \begin{cases} (1, 0)^\top & \text{w.p. } 0.9999 \\ (0, 1)^\top & \text{w.p. } 0.0001 \end{cases}, \qquad \kappa = 10^4$$

• Fewer than $\sim \kappa$ samples ⇒ the rare direction $(0, 1)^\top$ is not observed ⇒ the corresponding component of $w^*$ cannot be estimated at all
• Cannot do better than $O(\kappa)$
• Acceleration is not possible for this distribution

Page 22:

Example I: Discrete distribution (continued)

Question 1: Is it always possible to improve $\kappa$ to $\sqrt{\kappa}$?
Answer: No

Page 23:

Example II: Gaussian

• $x \sim \mathcal{N}(0, H)$, $H = \begin{pmatrix} 0.9999 & 0 \\ 0 & 0.0001 \end{pmatrix}$
• In this case, $\kappa \sim \dfrac{\operatorname{Tr}(H)}{\sigma_{\min}(H)} = 10^4 \gg 2$
• However, after 2 samples, $\frac{1}{n} \sum_i x_i x_i^\top$ is invertible
• That is, it is possible to find $w^* = \left(\sum_i x_i x_i^\top\right)^{-1} \sum_i x_i y_i$ (numerical check below)
• Possible to find $w^*$ after just 2 samples!
• Acceleration might be possible for this distribution
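A quick numerical check of this observation, assuming the noiseless model $y = x^\top w^*$. The ground-truth vector and random seed are arbitrary illustrative choices.

```python
import numpy as np

# In the noiseless Gaussian case with d = 2, two samples already determine w*
# exactly via the normal equations (almost surely).
rng = np.random.default_rng(0)
H = np.diag([0.9999, 0.0001])
w_star = np.array([1.0, -2.0])                        # illustrative ground truth

X = rng.multivariate_normal(np.zeros(2), H, size=2)   # 2 samples, rows are x_i
y = X @ w_star                                        # y_i = x_i^T w*  (no noise)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)             # (sum x_i x_i^T)^{-1} sum x_i y_i
print(np.allclose(w_hat, w_star))                     # True (almost surely)
```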

Page 24:

Example II: Gaussian (continued)

Question 2: Is acceleration ever possible?
Answer: Perhaps, but it depends on other problem parameters

Page 25:

Matrix spectral concentration

Recall $x_i \sim \mathcal{D}$. Let $H \triangleq \mathbb{E}\left[x_i x_i^\top\right]$.

Need $\left\| \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top - H \right\|$ to be small.

How many samples ($n$) are required?

• For scalars: governed by the variance, $\mathbb{E}[x^2] - \mathbb{E}[x]^2$
• For matrices: governed by the matrix variance (Tropp 2015), captured here by $\tilde{\kappa}$, the statistical condition number

Page 26:

Statistical vs computational condition number

Let $H \triangleq \mathbb{E}\left[x_i x_i^\top\right]$; for comparison, the condition number of $H$ itself is $\sigma_{\max}(H)/\sigma_{\min}(H)$. For a direction $e$, define the random variable $(e^\top x_i)^2$, whose mean is $\mathbb{E}\left[(e^\top x_i)^2\right] = e^\top H e$.

• Computational condition number: $\kappa = \dfrac{\max \|x_i\|^2}{\min_e \mathbb{E}\left[(e^\top x_i)^2\right]}$
• Statistical condition number: $\tilde{\kappa}$, which compares $(e^\top x_i)^2$ with its own mean $\mathbb{E}\left[(e^\top x_i)^2\right]$ direction by direction, rather than the worst-case value against the easiest direction

Page 27:

Statistical vs computational condition number (continued)

$$\tilde{\kappa} \le \kappa$$

Acceleration might be possible if $\tilde{\kappa} \ll \kappa$.

Page 28:

Discrete vs Gaussian

• Discrete distribution: $\tilde{\kappa} \sim \kappa = 10^4$
• Gaussian distribution: $\tilde{\kappa} \ll \kappa = 10^4$

Page 29:

Discrete vs Gaussian

[Figure: $\tilde{\kappa}$ vs $\kappa$ for the discrete distribution and the Gaussian distribution]

Page 30:

Question 3: Do existing algorithms (stochastic HB/NAG) achieve this improvement?

• Answer: No. There exist distributions where $\tilde{\kappa} \ll \kappa$ but the rate of stochastic HB is $\widetilde{\Omega}\!\left(\kappa \log \dfrac{f(w_0) - f^*}{\epsilon}\right)$
• That is, no better than SGD
• The same statement appears to hold empirically for NAG as well
• These are fairly natural distributions

Page 31:

Empirical behavior of stochastic HB/NAG

[Figure: Gaussian distribution]

Page 32:

Question 4: Can we design an algorithm improving over SGD?

• Yes!
• Convergence rate of ASGD: $\widetilde{O}\!\left(\sqrt{\kappa \tilde{\kappa}} \log \dfrac{f(w_0) - f^*}{\epsilon} + \dfrac{\sigma^2 d}{\epsilon}\right)$
• Compared to SGD: $\widetilde{\Theta}\!\left(\kappa \log \dfrac{f(w_0) - f^*}{\epsilon} + \dfrac{\sigma^2 d}{\epsilon}\right)$
• Improvement since $\tilde{\kappa} \le \kappa$

Page 33:

Simulations

• Does not achieve the lower bound ($\tilde{\kappa} \sim 100$); we believe this is not possible.

Page 34:

Algorithm

• Several equivalent versions of NAG exist in the deterministic setting:
  • 2-iterate version – most popular (gradient and momentum interpretation)
  • 4-iterate version – less popular (upper and lower bound interpretation)
• These versions are not equivalent in the stochastic setting

Our algorithm (ASGD). Parameters: $\alpha, \beta, \gamma, \delta$ (code sketch below)
1. $u_0 = w_0$
2. $v_{t-1} = \alpha w_{t-1} + (1 - \alpha) u_{t-1}$
3. $w_t = v_{t-1} - \delta\, \widehat{\nabla}_t f(v_{t-1})$
4. $z_{t-1} = \beta v_{t-1} + (1 - \beta) u_{t-1}$
5. $u_t = z_{t-1} - \gamma\, \widehat{\nabla}_t f(v_{t-1})$
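A sketch of the five-step recursion above, specialized to streaming linear regression with single-sample stochastic gradients. The parameters $\alpha, \beta, \gamma, \delta$ are left as inputs: the slide lists them as parameters, and their tuned settings (in terms of $\kappa$ and $\tilde{\kappa}$) come from the paper, not this slide.

```python
import numpy as np

# Sketch of the ASGD recursion from the slide for streaming linear regression.
# alpha, beta, gamma, delta must be supplied by the caller.
def asgd(samples, d, alpha, beta, gamma, delta):
    """samples: iterable of (x, y) pairs with x a length-d numpy array."""
    w = np.zeros(d)
    u = w.copy()                                 # step 1: u_0 = w_0
    for x, y in samples:
        v = alpha * w + (1 - alpha) * u          # step 2
        grad = (x @ v - y) * x                   # stochastic gradient at v_{t-1}
        w = v - delta * grad                     # step 3
        z = beta * v + (1 - beta) * u            # step 4
        u = z - gamma * grad                     # step 5 (same stochastic gradient)
    return w
```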

Page 35:

Proof overview

• Recall our guarantee: $\widetilde{O}\!\left(\sqrt{\kappa \tilde{\kappa}} \log \dfrac{f(w_0) - f^*}{\epsilon} + \dfrac{\sigma^2 d}{\epsilon}\right)$
• The first term depends on the initial error; the second is the statistical error
• Different analyses for the two terms:
  • For the first term, analyze assuming $\sigma = 0$
  • For the second term, analyze assuming $w_0 = w^*$

Page 36:

Part I: Potential function

• Iterates $w_t, u_t$ of ASGD; $H \triangleq \mathbb{E}[x x^\top]$
• Existing analyses use the potential function $\|w_t - w^*\|_H^2 + \sigma_{\min}(H) \cdot \|u_t - w^*\|_2^2$
• We instead use $\|w_t - w^*\|_2^2 + \sigma_{\min}(H) \cdot \|u_t - w^*\|_{H^{-1}}^2$
• We show

$$\|w_t - w^*\|_2^2 + \sigma_{\min}(H) \cdot \|u_t - w^*\|_{H^{-1}}^2 \;\le\; \left(1 - \tfrac{1}{\sqrt{\kappa \tilde{\kappa}}}\right) \left[\, \|w_{t-1} - w^*\|_2^2 + \sigma_{\min}(H) \cdot \|u_{t-1} - w^*\|_{H^{-1}}^2 \,\right]$$

Page 37:

Part II: Stochastic process analysis

(ASGD recursion with parameters $\alpha, \beta, \gamma, \delta$ as on Page 34.)

$$\begin{pmatrix} w_{t+1} - w^* \\ u_{t+1} - w^* \end{pmatrix} = C \begin{pmatrix} w_t - w^* \\ u_t - w^* \end{pmatrix} + \text{noise}$$

Let $\theta_t \triangleq \mathbb{E}\!\left[ \begin{pmatrix} w_t - w^* \\ u_t - w^* \end{pmatrix} \begin{pmatrix} w_t - w^* \\ u_t - w^* \end{pmatrix}^{\!\top} \right]$. Then

$$\theta_{t+1} = \mathfrak{B}\, \theta_t + \text{noise} \cdot \text{noise}^\top, \qquad \theta_n \to \sum_i \mathfrak{B}^i \left(\text{noise} \cdot \text{noise}^\top\right) = (\mathbb{I} - \mathfrak{B})^{-1} \left(\text{noise} \cdot \text{noise}^\top\right)$$

Page 38:

Part II: Stochastic process analysis (continued)

• Need to understand $(\mathbb{I} - \mathfrak{B})^{-1}\left(\text{noise} \cdot \text{noise}^\top\right)$
• $\mathfrak{B}$ has singular values $> 1$, but fortunately eigenvalues $< 1$
• Solve the 1-dimensional version of $(\mathbb{I} - \mathfrak{B})^{-1}\left(\text{noise} \cdot \text{noise}^\top\right)$ via explicit computations
• Combine the 1-dimensional bounds with (statistical) condition number bounds
• $\mathbb{E}\left[w w^\top\right] \preceq \tilde{\kappa}\, H^{-1} + \delta \cdot I$

Page 39:

Recap so far

• For linear regression:
  • We completely (theoretically) understand the behavior of the various algorithms
  • Stochastic HB and NAG do not provide any improvement
  • Our algorithm (ASGD) improves over SGD, HB and NAG
• In practice:
  • What does this mean for other problems, e.g., training neural nets?
  • No proofs, but do these intuitions help?

Page 40:

Stochastic HB and NAG in practice

• Why do stochastic HB and NAG still work better than SGD in practice?
• Conjecture: minibatching; a large minibatch brings SGD close to the deterministic setting (minibatch sketch below)
• The precise quantification of small vs. large depends on the dataset
• Our algorithm ASGD improves over SGD even for small minibatches
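To make the minibatching conjecture concrete, here is a minimal sketch of minibatch SGD for linear regression: as the batch size grows, the averaged gradient approaches the full (deterministic) gradient. The batch size and step size are illustrative placeholders.

```python
import numpy as np

# Minimal sketch of minibatch SGD for linear regression.
# batch_size and delta are illustrative placeholders.
def minibatch_sgd(X, Y, batch_size=8, delta=0.01, num_iters=1000, seed=0):
    """X: n x d array of samples (rows), Y: length-n targets."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        idx = rng.choice(n, size=batch_size, replace=False)
        Xb, Yb = X[idx], Y[idx]
        grad = Xb.T @ (Xb @ w - Yb) / batch_size   # averaged minibatch gradient
        w = w - delta * grad
    return w
```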

Page 41:

Deep autoencoder for MNIST, small batch size (1)

Page 42:

ASGD (our algorithm) vs others

Deep autoencoder for MNIST, small batch size (1)

Page 43:

ResNet for CIFAR-10, small batch size (8)

Page 44:

ResNet for CIFAR-10, small batch size (8)

ASGD (our algorithm) vs NAG

Page 45:

ResNet for CIFAR-10, usual batch size (128): ASGD vs NAG

• Fast initial convergence – e.g., 1.5x faster than the others to reach 90% accuracy
• Important for speeding up exploratory hyperparameter search

Page 46:

Recap

• Stochastic HB/NAG do not accelerate in the stochastic setting
• ASGD – acceleration is possible in the stochastic setting
• Theoretical results for linear regression
• Works well for training neural nets (even for small batch sizes)
• Code in PyTorch: https://github.com/rahulkidambi/AccSGD

Deterministic case:
  • GD: $\Theta\!\left(\kappa \log \frac{f(w_0) - f^*}{\epsilon}\right)$
  • AGD: $\Theta\!\left(\sqrt{\kappa} \log \frac{f(w_0) - f^*}{\epsilon}\right)$

Stochastic approximation:
  • SGD: $\widetilde{\Theta}\!\left(\kappa \log \frac{f(w_0) - f^*}{\epsilon}\right) + \frac{\sigma^2 d}{\epsilon}$
  • ASGD: $\widetilde{O}\!\left(\sqrt{\kappa \tilde{\kappa}} \log \frac{f(w_0) - f^*}{\epsilon}\right) + \frac{\sigma^2 d}{\epsilon}$

Page 47:

Optimization in neural networks

• Optimization methods used in training large-scale neural networks are not well understood, even for convex problems
• E.g., Adam, RMSProp, etc.; there is interest in other methods as well
• Stochastic approximation provides a good framework
• Strong non-asymptotic results are of interest

Page 48:

Thank you!