
Stochastic Optimization for Large-scale Machine Learning

Wei-Cheng Chang

Language Technologies Institute, Carnegie Mellon University

September 13, 2016


Outline

1 Introduction

2 Analysis of Stochastic Gradient Methods
    Preliminaries
    Two Assumptions and Fundamental Lemmas
    SG for Strongly Convex Objectives
    SG for General Objectives

3 Noise Reduction Methods
    Dynamic Sampling
    Gradient Aggregation

4 Second-Order Methods
    Hessian-Free Methods
    Stochastic Quasi-Newton Methods
    Gauss-Newton
    Diagonal Scaling

5 Conclusions


Problem Settings and Notations

Expected Risk: Consider a prediction function h parameterized by w ∈ R^d and a loss function ℓ(·, ·):

R(w) = ∫ ℓ(h(x; w), y) dP(x, y) = E[ℓ(h(x; w), y)].

Empirical Risk: In supervised learning, we only have access to a set of n i.i.d. samples {(x_i, y_i)} ⊆ R^{d_x} × R^{d_y}:

R_n(w) = (1/n) Σ_{i=1}^n ℓ(h(x_i; w), y_i).

We take R_n as the objective function from now on. Note that all methods in the subsequent slides can be applied readily when a smooth regularization term is added.
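To make the notation concrete, here is a minimal numpy sketch of the empirical risk for a linear least-squares model; the model, data, and parameter values are illustrative placeholders, not part of the original slides.

```python
import numpy as np

def empirical_risk(w, X, y):
    """R_n(w) = (1/n) * sum_i 0.5 * (x_i^T w - y_i)^2 for a linear model h(x; w) = x^T w."""
    residuals = X @ w - y
    return 0.5 * np.mean(residuals ** 2)

# Toy data: n = 1000 i.i.d. samples in R^5 (placeholder values).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=1000)
print(empirical_risk(np.zeros(5), X, y))
```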


Problem Settings and Notations (Cont’d)

A set of samples is represented by a random seed ξ. For example, a realization of ξ can be a single sample (x, y) or a set of samples {(x_i, y_i)}_{i∈S}.

We refer to the loss incurred for a given (w, ξ) as f(w; ξ). That is, f is the composition of the loss function ℓ and the prediction function h.

Then we have,

R(w) = E[f(w; ξ)],
R_n(w) = (1/n) Σ_{i=1}^n f_i(w),

where f_i(w) ≡ f(w; ξ_[i]) and {ξ_[i]}_{i=1}^n is a set of realizations of ξ corresponding to the sample set {(x_i, y_i)}.


Why Stochastic Optimization in Large-scale Machine Learning?

Batch gradient methods achieve a linear rate of convergence:

R_n(w_k) − R_n^* ≤ O(ρ^k).

To obtain ε-optimality, the total work is proportional to n log(1/ε).

SG achieves a sublinear rate of convergence in expectation:

E[R_n(w_k) − R_n^*] = O(1/k).

To obtain ε-optimality, the total work is proportional to 1/ε. When n is large and the computational time budget is limited, SG is favorable. (More on later slides.)

Due to the stochastic nature of SG, it enjoys the same rate of convergence for the generalization error R − R^*.


Rate of Convergence

Consider the optimization problem

f^* = min_x f(x).

A linear rate of convergence means that, for some c ∈ (0, 1),

f(x_k) − f^* ≤ c (f(x_{k−1}) − f^*) ≤ ... ≤ c^k (f(x_0) − f^*) ≤ ε.

To achieve ε-optimality, the number of iterations needed is at most

k ≤ log((f(x_0) − f^*)/ε) / log(1/c) = O(log(1/ε)).
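For example, with c = 0.9 and (f(x_0) − f^*)/ε = 10^6, this bound gives k ≤ log(10^6)/log(1/0.9) ≈ 131 iterations.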


Stochastic Gradient (SG) Method

Objective function F : R^d → R represents

R(w) = E[f(w; ξ)]   or   R_n(w) = (1/n) Σ_{i=1}^n f_i(w).

SG update:

w_{k+1} ← w_k − α_k g(w_k, ξ_k).

Our analyses cover the choices

g(w_k, ξ_k) = ∇f(w_k; ξ_k),
g(w_k, ξ_k) = (1/n_k) Σ_{i=1}^{n_k} ∇f(w_k; ξ_{k,i}),
g(w_k, ξ_k) = H_k (1/n_k) Σ_{i=1}^{n_k} ∇f(w_k; ξ_{k,i}),

where H_k is PSD and n_k is the mini-batch size.
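A minimal sketch of the mini-batch variant in numpy, assuming the linear least-squares loss f_i(w) = (1/2)(x_i^T w − y_i)^2 and a constant stepsize; the data X, y and all hyperparameter values are placeholders.

```python
import numpy as np

def minibatch_sg(X, y, alpha=0.1, batch_size=32, num_iters=500, seed=0):
    """Mini-batch SG: w_{k+1} = w_k - alpha * (1/n_k) * sum_{i in S_k} grad f_i(w_k)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        idx = rng.integers(0, n, size=batch_size)   # realization of the seed xi_k
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / batch_size    # stochastic gradient g(w_k, xi_k)
        w -= alpha * grad
    return w
```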


Some Intuitions of SG

{ξk} is a sequence of independent random variables.

Sampling uniformly from the training set, SG optimizes the empirical risk F = R_n.

Picking samples according to the distribution P, SG optimizes the expected risk F = R.

Our goal is to show that the expected angle between g(w_k, ξ_k) and ∇F(w_k) is sufficiently small.


Smoothness of objective function

Assumption 4.1 (Lipschitz-continuous objective gradients)

F is continuously differentiable and the gradient function of F, namely ∇F, is Lipschitz continuous with Lipschitz constant L > 0, i.e.,

‖∇F(w) − ∇F(w̄)‖_2 ≤ L ‖w − w̄‖_2   for all {w, w̄} ⊂ R^d.

This assumption ensures that the gradient of F does not change arbitrarily quickly with respect to the parameter w. From this assumption we can derive

F(w) ≤ F(w̄) + ∇F(w̄)^T (w − w̄) + (1/2) L ‖w − w̄‖_2^2.   (1)


Proof of equation (1)

Define g(t) = F(w̄ + t(w − w̄)) and notice that

g′(t) = ∇F(w̄ + t(w − w̄))^T (w − w̄).

Then

F(w) = g(1) = g(0) + ∫_0^1 g′(t) dt
    = F(w̄) + ∫_0^1 ∇F(w̄ + t(w − w̄))^T (w − w̄) dt
    = F(w̄) + ∇F(w̄)^T (w − w̄) + ∫_0^1 [∇F(w̄ + t(w − w̄)) − ∇F(w̄)]^T (w − w̄) dt
    ≤ F(w̄) + ∇F(w̄)^T (w − w̄) + ∫_0^1 L ‖t(w − w̄)‖_2 ‖w − w̄‖_2 dt
    = F(w̄) + ∇F(w̄)^T (w − w̄) + (1/2) L ‖w − w̄‖_2^2.


Bound expected function decrease by gradient information

Lemma 4.2

Under Assumption 4.1, the iterates of SG satisfy the following inequality:

E_{ξ_k}[F(w_{k+1})] − F(w_k) ≤ −α_k ∇F(w_k)^T E_{ξ_k}[g(w_k, ξ_k)] + (1/2) α_k^2 L E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2].   (2)

Proof: By Assumption 4.1, the iterates of SG satisfy

F(w_{k+1}) − F(w_k) ≤ ∇F(w_k)^T (w_{k+1} − w_k) + (1/2) L ‖w_{k+1} − w_k‖_2^2
    ≤ −α_k ∇F(w_k)^T g(w_k, ξ_k) + (1/2) α_k^2 L ‖g(w_k, ξ_k)‖_2^2.

Taking expectations with respect to the distribution of ξ_k, we obtain the desired result.


Intuitions of Lemma 4.2

The RHS of (2) depends on the first and second moments of g(w_k, ξ_k).

If g(w_k, ξ_k) is an unbiased estimate of ∇F(w_k), then

E_{ξ_k}[F(w_{k+1})] − F(w_k) ≤ −α_k ‖∇F(w_k)‖_2^2 + (1/2) α_k^2 L E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2].

Convergence is guaranteed if the RHS of (2) can be suitably bounded.

We therefore need to restrict the variance of g(w_k, ξ_k), i.e.,

V_{ξ_k}[g(w_k, ξ_k)] := E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2] − ‖E_{ξ_k}[g(w_k, ξ_k)]‖_2^2.   (3)


First and Second Moment limits

Assumption 4.3 (First and Second Moment limits)

The objective function and SG satisfy the following

1 The sequence of iterates {w_k} is contained in an open set over which F is bounded below by F_inf.

2 There exist scalars µ_G ≥ µ > 0 such that

∇F(w_k)^T E_{ξ_k}[g(w_k, ξ_k)] ≥ µ ‖∇F(w_k)‖_2^2,   (4)
‖E_{ξ_k}[g(w_k, ξ_k)]‖_2 ≤ µ_G ‖∇F(w_k)‖_2.   (5)

3 There exist scalars M > 0 and M_V > 0 such that

V_{ξ_k}[g(w_k, ξ_k)] ≤ M + M_V ‖∇F(w_k)‖_2^2.   (6)


More on Assumption 4.3

The properties in Assumption 4.3-(2) hold immediately if g(w_k, ξ_k) is an unbiased estimate of ∇F(w_k), and are maintained if such an unbiased estimate is multiplied by a PSD matrix H_k.

Combining the definition of variance (3) with Assumption 4.3,

E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2] ≤ M + M_G ‖∇F(w_k)‖_2^2,   (7)

where M_G := M_V + µ_G^2 ≥ µ^2 > 0.


Bounding Expected Function Decreases by Gradient

Lemma 4.4

Under Assumptions 4.1 and 4.3, the iterates of SG satisfy the following inequality:

E_{ξ_k}[F(w_{k+1})] − F(w_k) ≤ −(µ − (1/2) α_k L M_G) α_k ‖∇F(w_k)‖_2^2 + (1/2) α_k^2 L M.

Proof. By Lemma 4.2 and (4), it follows that

E_{ξ_k}[F(w_{k+1})] − F(w_k) ≤ −µ α_k ‖∇F(w_k)‖_2^2 + (1/2) α_k^2 L E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2].

Together with (7), this yields the desired result.


Strong Convexity

Assumption 4.5 (Strong Convexity)

The objective function F is strongly convex, i.e., there exists a constant c > 0 such that

∇²F(w) ⪰ cI   for all w ∈ R^d.

Strong convexity has several interesting consequences.

Better lower bound on F(w̄):

F(w̄) ≥ F(w) + ∇F(w)^T (w̄ − w) + (c/2) ‖w̄ − w‖_2^2.

Bound on the optimality gap F(w) − F(w_*):

2c (F(w) − F(w_*)) ≤ ‖∇F(w)‖_2^2   for all w ∈ R^d.   (8)


Strong Convexity (Cont’d)

For any w̄ ∈ R^d, strong convexity gives

F(w̄) ≥ F(w) + ∇F(w)^T (w̄ − w) + (c/2) ‖w̄ − w‖_2^2
    ≥ F(w) + ∇F(w)^T (w̃ − w) + (c/2) ‖w̃ − w‖_2^2
    = F(w) − (1/(2c)) ‖∇F(w)‖_2^2,

where w̃ = w − (1/c) ∇F(w) is the minimizer of the quadratic model on the right-hand side. Since this holds for any w̄ ∈ R^d, in particular

F(w_*) ≥ F(w) − (1/(2c)) ‖∇F(w)‖_2^2.


Strongly Convex Objective for Fixed Stepsizes SG

Theorem 4.6 (Strongly Convex Objective, Fixed Stepsize)

Under Assumptions 4.1, 4.3, and 4.5, suppose that the SG method is run with a fixed stepsize α_k = α satisfying

0 < α ≤ µ / (L M_G).   (9)

Then the expected optimality gap satisfies

E[F(w_k) − F_*] ≤ αLM/(2cµ) + (1 − αcµ)^{k−1} (F(w_1) − F_* − αLM/(2cµ))   (10)
    → αLM/(2cµ)   as k → ∞,   (11)

where E[F(w_k)] = E_{ξ_1} E_{ξ_2} ... E_{ξ_{k−1}}[F(w_k)].


Intuition of Theorem 4.6

If there is no noise (M = 0), Theorem 4.6 gives a linear rate of convergence, recovering the standard result for the full-gradient method.

To make the residual gap αLM/(2cµ) small, α needs to be small.


Strongly Convex Objective for Diminishing Stepsizes SG

Theorem 4.7 (Strongly Convex Objective, Diminishing Stepsizes)

Under Assumptions 4.1, 4.3, and 4.5, suppose that the SG method is run with a stepsize sequence such that

α_k = β / (γ + k)   for some β > 1/(cµ) and γ > 0 such that α_1 ≤ µ/(L M_G).   (12)

Then the expected optimality gap satisfies

E[F(w_k) − F_*] ≤ ν / (γ + k),   (13)

where

ν := max{ β^2 L M / (2(βcµ − 1)), (γ + 1)(F(w_1) − F_*) }.   (14)


Intuition of Theorem 4.7

For a constant stepsize, the choice of α does not depend on c. For diminishing stepsizes, the initial stepsize α_1 depends on c (through β > 1/(cµ)).

For a constant stepsize, the initial point eventually does not matter. For diminishing stepsizes, the initial point matters (through the (γ + 1)(F(w_1) − F_*) term in ν).

The effect of mini-batching is "subtle".


Nonconvex Objective for Fixed Stepsizes SG

Theorem 4.8 (Nonconvex Objective, Fixed Stepsize)

Under Assumptions 4.1 and 4.3, suppose that the SG method is run with a fixed stepsize α_k = α satisfying

0 < α ≤ µ / (L M_G).   (15)

Then the expected sum-of-squares and average-squared gradients of F satisfy

E[ Σ_{k=1}^K ‖∇F(w_k)‖_2^2 ] ≤ K αLM/µ + 2(F(w_1) − F_inf)/(µα),   (16)

E[ (1/K) Σ_{k=1}^K ‖∇F(w_k)‖_2^2 ] ≤ αLM/µ + 2(F(w_1) − F_inf)/(Kµα)   (17)
    → αLM/µ   as K → ∞,   (18)

where E[F(w_k)] = E_{ξ_1} E_{ξ_2} ... E_{ξ_{k−1}}[F(w_k)].


Nonconvex Objective for Diminishing Stepsizes SG

Theorem 4.9 (Nonconvex Objective, Diminishing Stepsizes)

Under Assumptions 4.1 and 4.3, suppose that SG is run with a stepsize sequence satisfying Σ_{k=1}^∞ α_k = ∞ and Σ_{k=1}^∞ α_k^2 < ∞. Then, with A_K := Σ_{k=1}^K α_k,

E[ Σ_{k=1}^K α_k ‖∇F(w_k)‖_2^2 ] < ∞   (19)

and

E[ (1/A_K) Σ_{k=1}^K α_k ‖∇F(w_k)‖_2^2 ] → 0   as K → ∞.   (20)


An Ideal case: reducing noise at Geometric Rate

Recall that by Lemma 4.2 we have

E_{ξ_k}[F(w_{k+1})] − F(w_k) ≤ −α_k ∇F(w_k)^T E_{ξ_k}[g(w_k, ξ_k)] + (L/2) α_k^2 E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2].

If g(w_k, ξ_k) is a descent direction in expectation and we can decrease E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2] fast enough, then a good convergence rate is still possible.


Linear Convergence Rate for Ideal Noise Reduction

Theorem 5.1 (Strongly Convex Objective, Noise Reduction)

Suppose that Assumptions 4.1, 4.3, and 4.5 hold, and that

V_{ξ_k}[g(w_k, ξ_k)] ≤ M ζ^{k−1}   for some M ≥ 0 and ζ ∈ (0, 1).   (21)

In addition, suppose SG is run with a fixed stepsize α_k = α such that

0 < α ≤ min{ µ/(L µ_G^2), 1/(cµ) }.

Then the expected optimality gap satisfies

E[F(w_k) − F_*] ≤ ω ρ^{k−1},

where

ω := max{ αLM/(cµ), F(w_1) − F_* }   and   ρ := max{ 1 − αcµ/2, ζ } < 1.


Dynamic Sampling

Consider the iteration

w_{k+1} ← w_k − (α/n_k) Σ_{i∈S_k} ∇f(w_k; ξ_{k,i})

with n_k := |S_k| = ⌈τ^{k−1}⌉ for some τ > 1. This update falls under the class of methods covered by Theorem 5.1, because

V_{ξ_k}[g(w_k, ξ_k)] ≤ V_{ξ_k}[∇f(w_k; ξ_{k,i})] / n_k ≤ M / n_k ≤ M (1/τ)^{k−1},

so (21) holds with ζ = 1/τ ∈ (0, 1).
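A numpy sketch of this geometrically growing mini-batch schedule, again assuming the linear least-squares loss; τ, the stepsize, and the data are illustrative placeholders.

```python
import numpy as np

def dynamic_sampling_sg(X, y, alpha=0.1, tau=1.1, num_iters=100, seed=0):
    """SG with mini-batch size n_k = ceil(tau**(k-1)), so the gradient variance decays geometrically."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for k in range(1, num_iters + 1):
        nk = min(n, int(np.ceil(tau ** (k - 1))))   # |S_k|, capped at the dataset size
        idx = rng.integers(0, n, size=nk)
        Xb, yb = X[idx], y[idx]
        w -= alpha * (Xb.T @ (Xb @ w - yb)) / nk
    return w
```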


Intuition

Rather than computing more new stochastic gradients, is it possible to achieve a lower variance by reusing and/or revising previously computed stochastic gradient information?

Stochastic variance reduced gradient (SVRG); a sketch follows below.

Stochastic average gradient (SAG).

Iterate averaging:

w_{k+1} ← w_k − α_k g(w_k, ξ_k),
w̃_{k+1} ← (1/(k+1)) Σ_{j=1}^{k+1} w_j.
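As one concrete instance of gradient aggregation, here is a minimal SVRG sketch for the linear least-squares setting; the epoch length m, stepsize, and data are illustrative placeholders rather than values from the slides.

```python
import numpy as np

def svrg(X, y, alpha=0.05, m=200, num_epochs=10, seed=0):
    """SVRG: correct each stochastic gradient with a full gradient computed at a snapshot point."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        w_snap = w.copy()
        full_grad = X.T @ (X @ w_snap - y) / n       # full gradient at the snapshot
        for _ in range(m):
            i = rng.integers(0, n)
            xi, yi = X[i], y[i]
            gi = xi * (xi @ w - yi)                  # grad f_i at the current iterate
            gi_snap = xi * (xi @ w_snap - yi)        # grad f_i at the snapshot
            w -= alpha * (gi - gi_snap + full_grad)  # variance-reduced step
    return w
```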


Preliminaries

To address the adverse effects of high nonlinearity and ill-conditioning of the objective function.

At each iteration, solve a quadratic model

q_k(s) = F(w_k) + ∇F(w_k)^T s + (1/2) s^T H_k s.

The Hessian matrix H_k ∈ R^{d×d} (i.e., ∇²F(w_k)) can be very large; we need to solve the linear system

H_k s = −∇F(w_k).   (22)

The update is

w_{k+1} ← w_k + α_k s_k.


Hessian Free Inexact Newton

When d is large (especially in deep learning), it is impossible to explicitly compute and store H.

Apply the conjugate gradient (CG) method to iteratively solve the linear system (22).

We only need the matrix-vector product Hs, which is about as expensive to compute as a gradient.

CG needs at most d iterations to solve the linear system exactly.

Running up to d CG iterations is still not feasible for deep neural nets...
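A sketch of the Hessian-free idea: a plain CG solver that touches the Hessian only through Hessian-vector products, here approximated by a finite difference of gradients. The routine grad_fn is a placeholder for whatever gradient oracle is available; none of this code is from the original slides.

```python
import numpy as np

def hess_vec(grad_fn, w, s, eps=1e-6):
    """Hessian-vector product via a finite difference of gradients: H s ~ (grad(w + eps*s) - grad(w)) / eps."""
    return (grad_fn(w + eps * s) - grad_fn(w)) / eps

def newton_cg_direction(grad_fn, w, max_cg_iters=50, tol=1e-8):
    """Approximately solve H s = -grad F(w) with conjugate gradients, using only Hessian-vector products."""
    g = grad_fn(w)
    s = np.zeros_like(w)
    r = -g                          # residual of H s = -g at s = 0
    p = r.copy()
    rs_old = r @ r
    if np.sqrt(rs_old) < tol:
        return s
    for _ in range(max_cg_iters):
        Hp = hess_vec(grad_fn, w, p)
        alpha = rs_old / (p @ Hp)
        s = s + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return s                        # use as w_{k+1} = w_k + alpha_k * s
```

Truncating CG to far fewer than d iterations is what makes the Newton step "inexact" and affordable.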


Hessian Free Inexact Newton (Cont’d)

Inexact Newton methods are more tolerant of noise in the Hessian estimate than of noise in the gradient estimate.

One can therefore employ a subsampled Hessian-free inexact Newton method:

∇²f_{S_k^H}(w_k; ξ_k^H) s = −∇f_{S_k}(w_k; ξ_k),

where S_k^H ⊂ S_k.

The gradient sample size |S_k| needs to be large enough that taking subsamples for the Hessian estimate is sensible.


Quasi-Newton (BFGS)

Approximate Hessian using only gradient information

BFGS is one of the most successful quasi-Newton methods. Defining

s_k := w_{k+1} − w_k   and   v_k := ∇F(w_{k+1}) − ∇F(w_k),

the update rule of BFGS is

w_{k+1} ← w_k − H_k ∇F(w_k),
H_{k+1} ← (I − (v_k s_k^T)/(s_k^T v_k))^T H_k (I − (v_k s_k^T)/(s_k^T v_k)) + (s_k s_k^T)/(s_k^T v_k),

where H_k is a PSD approximation of (∇²F(w_k))^{−1}.


BFGS (Cont’d)

This update ensures that the secant equation H_{k+1}^{−1} s_k = v_k holds, meaning that a second-order Taylor expansion is satisfied along the most recent displacement.

BFGS enjoys a local superlinear rate of convergence.

The matrices H_{k+1} are dense, even if the exact Hessians are sparse.

L-BFGS to the rescue: do not form H_k explicitly; instead compute H_k ∇F(w_k) from a short history of displacement pairs {(s_k, v_k)} (see the two-loop recursion sketched below). L-BFGS only achieves a linear rate of convergence, though...
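A sketch of the standard two-loop recursion that applies the implicit L-BFGS inverse-Hessian approximation to a gradient; the initial scaling (s^T v / v^T v) I is the usual heuristic and is an assumption here, not something stated on the slide.

```python
import numpy as np

def lbfgs_direction(grad, s_hist, v_hist):
    """Two-loop recursion: return an approximation of H_k @ grad using the stored
    displacement pairs (s_i, v_i), ordered from oldest to newest."""
    if not s_hist:
        return grad.copy()
    q = grad.copy()
    rhos = [1.0 / (v @ s) for s, v in zip(s_hist, v_hist)]
    alphas = []
    for s, v, rho in zip(reversed(s_hist), reversed(v_hist), reversed(rhos)):
        a = rho * (s @ q)
        q -= a * v
        alphas.append(a)
    # Initial inverse-Hessian scaling H_k^0 = gamma * I with gamma = s^T v / v^T v.
    s_last, v_last = s_hist[-1], v_hist[-1]
    q *= (s_last @ v_last) / (v_last @ v_last)
    for s, v, rho, a in zip(s_hist, v_hist, rhos, reversed(alphas)):
        b = rho * (v @ q)
        q += (a - b) * s
    return q    # descent step: w_{k+1} = w_k - alpha_k * q
```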


How about Stochastic L-BFGS ?

Iteration update: w_{k+1} ← w_k − α_k H_k g(w_k, ξ_k).

Stochastic L-BFGS can only attain a sublinear rate, just like SG. What is the benefit? It improves the constant in the rate, from one that depends on the conditioning of the Hessian to one that is independent of it.

Each iteration costs O(4md); with m = 5 this is roughly 20 times the cost of a stochastic gradient estimate. This is not a big problem with mini-batch SG, though.

The effects of even a single bad update may linger for numerous iterations.


Online L-BFGS

Replace the gradient with a stochastic gradient:

s_k := w_{k+1} − w_k   and   v_k := ∇f_{S_k}(w_{k+1}; ξ_k) − ∇f_{S_k}(w_k; ξ_k).

Note the use of the same seed ξ_k in the two gradient estimates.

Updating the inverse Hessian approximation at every step may not be warranted when using a small sample size |S_k|.

Alternatively, since ∇F(w_{k+1}) − ∇F(w_k) ≈ ∇²F(w_k)(w_{k+1} − w_k), define

v_k := ∇²f_{S_k^H}(w_k; ξ_k^H) s_k,

where ∇²f_{S_k^H}(w_k; ξ_k^H) is a subsampled Hessian.


Intuition

Construct an approximation to the Hessian using only first-order information.

It is always PSD, even when the Hessian is not (useful for deep learning).

It ignores second-order interactions between elements of the parameter vector w.


Classical Gauss-Newton Method

Consider the following problem:

f(w; ξ) = ℓ(h(x_ξ; w), y_ξ) = (1/2) ‖h(x_ξ; w) − y_ξ‖_2^2.

The Gauss-Newton approximation makes an affine approximation of the prediction function inside the quadratic loss function,

h(x_ξ; w) ≈ h(x_ξ; w_k) + J_h(w_k; ξ)(w − w_k),

where J_h(·; ξ) is the Jacobian of h(x_ξ; ·) with respect to w. This leads to

f(w; ξ) ≈ (1/2) ‖h(x_ξ; w_k) + J_h(w_k; ξ)(w − w_k) − y_ξ‖_2^2
    = (1/2) ‖h(x_ξ; w_k) − y_ξ‖_2^2 + (h(x_ξ; w_k) − y_ξ)^T J_h(w_k; ξ)(w − w_k)
      + (1/2)(w − w_k)^T J_h(w_k; ξ)^T J_h(w_k; ξ)(w − w_k).


Gauss-Newton

We can replace the subsampled Hessian with the Gauss-Newton matrix

G_{S_k^H}(w_k; ξ_k^H) = (1/|S_k^H|) Σ_{i∈S_k^H} J_h(w_k; ξ_{k,i})^T J_h(w_k; ξ_{k,i}).   (23)

For the least-squares loss, the Gauss-Newton matrix is always PSD, so it can serve as another approximation of the Hessian in stochastic L-BFGS and inexact Newton methods.

It can also be generalized to other loss functions:

G_{S_k^H}(w_k; ξ_k^H) = (1/|S_k^H|) Σ_{i∈S_k^H} J_h(w_k; ξ_{k,i})^T H_ℓ(w_k; ξ_{k,i}) J_h(w_k; ξ_{k,i}),

where H_ℓ(w_k; ξ) = ∂²ℓ(h(x_ξ; w), y_ξ)/∂h².
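For a model small enough that the per-sample Jacobians fit in memory, a direct numpy sketch of the subsampled Gauss-Newton matrix (23); jacobian_fn is a hypothetical placeholder that returns J_h(w_k; ξ_{k,i}) for one sample.

```python
import numpy as np

def gauss_newton_matrix(jacobian_fn, w, sample_indices):
    """Subsampled Gauss-Newton matrix G = (1/|S|) * sum_{i in S} J_i^T J_i for a squared loss."""
    d = w.shape[0]
    G = np.zeros((d, d))
    for i in sample_indices:
        J = jacobian_fn(w, i)        # Jacobian of the prediction for sample i, shape (output_dim, d)
        G += J.T @ J
    return G / len(sample_indices)
```

In practice one rarely forms G explicitly; matrix-vector products G v = J^T (J v) are used inside CG, in the same spirit as the Hessian-free sketch earlier.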


Diagonal Scaling Intuition

Iteration update:

w_{k+1} ← w_k − α_k D_k g(w_k, ξ_k),

where either

D_k^{−1} ≈ diag(Hessian) or diag(Gauss-Newton), or
D_k^{−1} ≈ a running average/sum of the gradient components.


An Example: RMSPROP and ADAGRAD

RMSPROP estimates the average magnitude of each element of the stochastic gradient vector g(w_k, ξ_k) by maintaining the running averages

R_k = (1 − λ) R_{k−1} + λ (g(w_k, ξ_k) ◦ g(w_k, ξ_k)),
w_{k+1} = w_k − α/(√R_k + µ) ◦ g(w_k, ξ_k),

where ◦ denotes the element-wise product.

ADAGRAD simply replaces the running average by a running sum:

R_k = R_{k−1} + g(w_k, ξ_k) ◦ g(w_k, ξ_k).
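A minimal numpy sketch of the two updates; the stepsize α, decay λ, and the small constant µ are illustrative defaults, not values from the slides.

```python
import numpy as np

def rmsprop_update(w, g, R, alpha=1e-3, lam=0.1, mu=1e-8):
    """One RMSPROP step: running average of squared gradients, element-wise rescaled step."""
    R = (1.0 - lam) * R + lam * g * g
    w = w - alpha / (np.sqrt(R) + mu) * g
    return w, R

def adagrad_update(w, g, R, alpha=1e-2, mu=1e-8):
    """One ADAGRAD step: same rescaling, but with a running sum instead of an average."""
    R = R + g * g
    w = w - alpha / (np.sqrt(R) + mu) * g
    return w, R
```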


Take Home Message

RMSPROP and ADAGRAD are considered state-of-the-art stochastic optimization methods in the practical training of deep learning models.

Note that they only approximate the diagonal of the Hessian.

There is still plenty of room to explore stochastic second-order methods for deep learning models!
