Stochastic Optimization for Large-scale Machine Learning
Wei-Cheng Chang
Language Technologies Institute, Carnegie Mellon University
September 13, 2016
Wei-Cheng Chang (LTI, CMU), Stochastic Optimization for Large-scale Machine Learning, September 13, 2016
Outline

1 Introduction
2 Analysis of Stochastic Gradient Methods
  Preliminaries
  Two Assumptions and Fundamental Lemmas
  SG for Strongly Convex Objectives
  SG for General Objectives
3 Noise Reduction Methods
  Dynamic Sampling
  Gradient Aggregation
4 Second-Order Methods
  Hessian-Free Methods
  Stochastic Quasi-Newton Methods
  Gauss-Newton
  Diagonal Scaling
5 Conclusions
Problem Settings and Notations
Expected Risk: Consider a prediction function h parameterized by w ∈ R^d and some loss function ℓ(·, ·),

R(w) = ∫ ℓ(h(x; w), y) dP(x, y) = E[ℓ(h(x; w), y)].
Empirical Risk: In supervised learning, we only have access to a set of n i.i.d. samples {(x_i, y_i)} ⊆ R^{d_x} × R^{d_y}, and

R_n(w) = (1/n) Σ_{i=1}^n ℓ(h(x_i; w), y_i).
We consider R_n as the objective function from now on. Note that all methods in the subsequent slides apply readily when a smooth regularization term is added.
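As a concrete sketch of R_n (my own illustration, assuming a linear prediction function h(x; w) = w^T x and the squared loss, neither of which is fixed by the slides):

```python
import numpy as np

def empirical_risk(w, X, y):
    """R_n(w) = (1/n) sum_i l(h(x_i; w), y_i) for h(x; w) = w^T x
    and squared loss l(p, y) = 0.5 * (p - y)**2."""
    preds = X @ w                          # h(x_i; w) for every sample
    return 0.5 * np.mean((preds - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                             # noiseless labels
print(empirical_risk(w_true, X, y))        # 0.0 at the generating parameters
```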
Problem Settings and Notations (Cont’d)
A set of samples is represented by a random seed ξ. For example, a realization of ξ may be a single sample (x, y) or a set of samples {(x_i, y_i)}_{i∈S}.
We refer to the loss incurred for a given (w, ξ) as f(w; ξ). That is, f is the composition of the loss function ℓ and the prediction function h.
Then we have

R(w) = E[f(w; ξ)],
R_n(w) = (1/n) Σ_{i=1}^n f_i(w),

where f_i(w) ≡ f(w; ξ[i]) and {ξ[i]}_{i=1}^n is a set of realizations of ξ corresponding to a sample set {(x_i, y_i)}.
Why Stochastic Optimization in Large-scale Machine Learning?
Batch gradient methods can achieve a linear rate of convergence,

R_n(w_k) − R_n^* ≤ O(ρ^k).

To obtain ε-optimality, the total work is proportional to n log(1/ε).
SG achieves a sublinear rate of convergence in expectation,

E[R_n(w_k) − R_n^*] = O(1/k).

To obtain ε-optimality, the total work is O(1/ε), independent of n. When n is large and the computational time budget is limited, SG is favorable. (More on later slides.)
Due to the stochastic nature of SG, it enjoys the same rate of convergence for the generalization error R − R^*.
Rate of Convergence
Consider the optimization problem

f^* = min_x f(x).

A linear rate of convergence means that, for some c ∈ (0, 1),

f(x_k) − f^* ≤ c (f(x_{k−1}) − f^*) ≤ ... ≤ c^k (f(x_0) − f^*) ≤ ε.

To achieve ε-optimality, the number of iterations needed is at most

k ≤ log((f(x_0) − f^*)/ε) / log(1/c) = O(log(1/ε)).
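A quick numerical check of this iteration bound (my own sketch; the contraction factor c = 0.5 is an arbitrary illustrative choice):

```python
import math

def iters_to_eps(gap0, c, eps):
    """Smallest k with c**k * gap0 <= eps, i.e. k >= log(gap0/eps) / log(1/c)."""
    return math.ceil(math.log(gap0 / eps) / math.log(1.0 / c))

# Halving the gap each iteration (c = 0.5) needs about log2(1/eps) iterations.
k = iters_to_eps(1.0, 0.5, 1e-6)
print(k, 0.5 ** k <= 1e-6)   # 20 True
```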
Stochastic Gradient (SG) Method
The objective function F: R^d → R represents either

R(w) = E[f(w; ξ)] or R_n(w) = (1/n) Σ_{i=1}^n f_i(w).
SG updates: w_{k+1} ← w_k − α_k g(w_k, ξ_k).

Our analyses cover the choices

g(w_k, ξ_k) = ∇f(w_k; ξ_k), or
g(w_k, ξ_k) = (1/n_k) Σ_{i=1}^{n_k} ∇f(w_k; ξ_{k,i}), or
g(w_k, ξ_k) = H_k (1/n_k) Σ_{i=1}^{n_k} ∇f(w_k; ξ_{k,i}),

where H_k is PSD and n_k is the mini-batch size.
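The second choice (a mini-batch average of sample gradients) can be sketched as follows; the least-squares f_i and all constants here are my own assumptions for illustration:

```python
import numpy as np

def sg_step(w, X, y, batch_idx, alpha):
    """One SG update w <- w - alpha * g(w, xi), where g averages the
    gradients of f_i(w) = 0.5*(x_i^T w - y_i)**2 over the mini-batch."""
    Xb, yb = X[batch_idx], y[batch_idx]
    g = Xb.T @ (Xb @ w - yb) / len(batch_idx)   # (1/n_k) sum_i grad f_i(w)
    return w - alpha * g

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true                                  # noiseless regression data
w = np.zeros(5)
for _ in range(500):
    idx = rng.integers(0, 200, size=16)         # mini-batch, n_k = 16
    w = sg_step(w, X, y, idx, alpha=0.05)
print(np.linalg.norm(w - w_true))               # small: SG approaches w_true
```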
Some Intuitions of SG
{ξk} is a sequence of independent random variables.
Sample uniformly from training set, SG optimizes the empirical riskF = Rn.
Picking samples according to P, SG optimizes the expected riskF = R.
Our goal is to show the expected angle between g(wk , ξk) and∇F (wk) is sufficiently close.
Smoothness of objective function
Assumption 4.1 (Lipschitz-continuous objective gradients)

F is continuously differentiable and the gradient function of F, namely ∇F, is Lipschitz continuous with Lipschitz constant L > 0, i.e.,

‖∇F(w) − ∇F(w̄)‖_2 ≤ L ‖w − w̄‖_2 for all {w, w̄} ⊂ R^d.

This assumption ensures that the gradient of F does not change arbitrarily quickly with respect to the parameter w. From this assumption we can derive

F(w) ≤ F(w̄) + ∇F(w̄)^T (w − w̄) + (L/2) ‖w − w̄‖_2^2. (1)
Proof of equation (1)
Define g(t) = F(w̄ + t(w − w̄)) and notice that

g'(t) = ∇F(w̄ + t(w − w̄))^T (w − w̄).

Then

F(w) = g(1) = g(0) + ∫_0^1 g'(t) dt
= F(w̄) + ∫_0^1 ∇F(w̄ + t(w − w̄))^T (w − w̄) dt
= F(w̄) + ∇F(w̄)^T (w − w̄) + ∫_0^1 [∇F(w̄ + t(w − w̄)) − ∇F(w̄)]^T (w − w̄) dt
≤ F(w̄) + ∇F(w̄)^T (w − w̄) + ∫_0^1 L ‖t(w − w̄)‖_2 ‖w − w̄‖_2 dt
= F(w̄) + ∇F(w̄)^T (w − w̄) + (L/2) ‖w − w̄‖_2^2.
Bound expected function decrease by gradient information
Lemma 4.2

Under Assumption 4.1, the iterates of SG satisfy

E_{ξ_k}[F(w_{k+1})] − F(w_k) ≤ −α_k ∇F(w_k)^T E_{ξ_k}[g(w_k, ξ_k)] + (1/2) α_k^2 L E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2]. (2)

Proof: By Assumption 4.1, the iterates of SG satisfy

F(w_{k+1}) − F(w_k) ≤ ∇F(w_k)^T (w_{k+1} − w_k) + (L/2) ‖w_{k+1} − w_k‖_2^2
= −α_k ∇F(w_k)^T g(w_k, ξ_k) + (1/2) α_k^2 L ‖g(w_k, ξ_k)‖_2^2.

Taking expectations with respect to the distribution of ξ_k, we obtain the desired result.
Intuitions of Lemma 4.2
The RHS of (2) depends on the first and second moments of g(w_k, ξ_k).
If g(w_k, ξ_k) is an unbiased estimate of ∇F(w_k), then

E_{ξ_k}[F(w_{k+1})] − F(w_k) ≤ −α_k ‖∇F(w_k)‖_2^2 + (1/2) α_k^2 L E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2].

Convergence is guaranteed if the RHS of (2) can be bounded.
We therefore need to restrict the variance of g(w_k, ξ_k), i.e.,

V_{ξ_k}[g(w_k, ξ_k)] := E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2] − ‖E_{ξ_k}[g(w_k, ξ_k)]‖_2^2. (3)
First and Second Moment limits
Assumption 4.3 (First and Second Moment Limits)

The objective function and SG satisfy the following:

1. The sequence of iterates {w_k} is contained in an open set over which F is bounded below by F_inf.
2. There exist scalars μ_G ≥ μ > 0 such that

∇F(w_k)^T E_{ξ_k}[g(w_k, ξ_k)] ≥ μ ‖∇F(w_k)‖_2^2, (4)
‖E_{ξ_k}[g(w_k, ξ_k)]‖_2 ≤ μ_G ‖∇F(w_k)‖_2. (5)

3. There exist scalars M > 0 and M_V > 0 such that

V_{ξ_k}[g(w_k, ξ_k)] ≤ M + M_V ‖∇F(w_k)‖_2^2. (6)
More on Assumption 4.3
The properties in Assumption 4.3-(2) immediately hold if g(w_k, ξ_k) is an unbiased estimate of ∇F(w_k), and are maintained if such an unbiased estimate is multiplied by a PSD matrix H_k.
Combining the definition of variance (3) with Assumption 4.3,

E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2] ≤ M + M_G ‖∇F(w_k)‖_2^2, (7)

where M_G := M_V + μ_G^2 ≥ μ^2 > 0.
Bounding the Expected Function Decrease by the Gradient

Lemma 4.4

Under Assumptions 4.1 and 4.3, the iterates of SG satisfy

E_{ξ_k}[F(w_{k+1})] − F(w_k) ≤ −(μ − (1/2) α_k L M_G) α_k ‖∇F(w_k)‖_2^2 + (1/2) α_k^2 L M.

Proof: By Lemma 4.2 and (4), it follows that

E_{ξ_k}[F(w_{k+1})] − F(w_k) ≤ −μ α_k ‖∇F(w_k)‖_2^2 + (1/2) α_k^2 L E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2].

Together with (7), this yields the desired result.
Strong Convexity
Assumption 4.5 (Strong Convexity)

The objective function F is strongly convex: there exists a constant c > 0 such that

∇^2 F(w) ⪰ cI for all w ∈ R^d.

Strong convexity has several interesting consequences.

A better lower bound on F(w):

F(w) ≥ F(w̄) + ∇F(w̄)^T (w − w̄) + (c/2) ‖w − w̄‖_2^2.

A bound on the optimality gap F(w) − F(w^*):

2c (F(w) − F(w^*)) ≤ ‖∇F(w)‖_2^2 for all w ∈ R^d. (8)
Strong Convexity (Cont’d)
For any w,

F(w) ≥ F(w̄) + ∇F(w̄)^T (w − w̄) + (c/2) ‖w − w̄‖_2^2
≥ F(w̄) + ∇F(w̄)^T (w̃ − w̄) + (c/2) ‖w̃ − w̄‖_2^2
= F(w̄) − (1/2c) ‖∇F(w̄)‖_2^2,

where w̃ = w̄ − (1/c) ∇F(w̄) is the minimizer of the quadratic model. Since this holds for any w ∈ R^d, in particular

F(w^*) ≥ F(w̄) − (1/2c) ‖∇F(w̄)‖_2^2,

which rearranges to (8).
Strongly Convex Objective for Fixed Stepsizes SG
Theorem 4.6 (Strongly Convex Objective, Fixed Stepsize)

Under Assumptions 4.1, 4.3, and 4.5, suppose that the SG method is run with a fixed stepsize α_k = α satisfying

0 < α ≤ μ / (L M_G). (9)

Then the expected optimality gap satisfies

E[F(w_k) − F^*] ≤ αLM/(2cμ) + (1 − αcμ)^{k−1} (F(w_1) − F^* − αLM/(2cμ)) (10)
→ αLM/(2cμ) as k → ∞, (11)

where E[F(w_k)] = E_{ξ_1} E_{ξ_2} ... E_{ξ_{k−1}}[F(w_k)].
Intuition of Theorem 4.6
If there is no noise (M = 0), Theorem 4.6 gives a linear rate of convergence, recovering the standard result for full-gradient descent.
The stepsize α must be small: a smaller α lowers the residual noise floor αLM/(2cμ) in (11), at the price of a slower contraction factor (1 − αcμ) in (10).
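A 1-D sanity check of the noise floor αLM/(2cμ) (my own toy construction: F(w) = (c/2)w^2 with unbiased gradients plus N(0, M) noise, so L = c and μ = μ_G = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
c, M, alpha = 1.0, 1.0, 0.1
w, gaps = 5.0, []
for k in range(20000):
    g = c * w + np.sqrt(M) * rng.normal()   # unbiased gradient plus noise
    w -= alpha * g                          # fixed-stepsize SG
    gaps.append(0.5 * c * w ** 2)           # optimality gap F(w_k) - F*
floor = alpha * c * M / (2 * c * 1.0)       # alpha*L*M/(2*c*mu) with L = c, mu = 1
print(np.mean(gaps[-5000:]) <= floor)       # True: gap settles below the bound
```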
Strongly Convex Objective for Diminishing Stepsizes SG
Theorem 4.7 (Strongly Convex Objective, Diminishing Stepsizes)

Under Assumptions 4.1, 4.3, and 4.5, suppose that the SG method is run with a stepsize sequence such that

α_k = β/(γ + k) for some β > 1/(cμ) and γ > 0 such that α_1 ≤ μ/(L M_G). (12)

Then the expected optimality gap satisfies

E[F(w_k) − F^*] ≤ ν/(γ + k), (13)

where

ν := max{ β^2 L M / (2(βcμ − 1)), (γ + 1)(F(w_1) − F^*) }. (14)
Intuition of Theorem 4.7

With a constant stepsize, α does not depend on c; with diminishing stepsizes, the initial stepsize α_1 depends on c.
With a constant stepsize, the initial point eventually does not matter; with diminishing stepsizes, the initial point enters the bound through ν.
The effect of mini-batching is "subtle".
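A small simulation of the O(1/k) rate (my own toy setup: a 1-D quadratic F(w) = (c/2)w^2 with N(0, M) gradient noise, averaged over 500 independent runs; β and γ satisfy (12) for c = μ = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
c, M = 1.0, 1.0
beta, gamma = 2.0, 10.0                     # beta > 1/(c*mu), gamma > 0
w = np.full(500, 5.0)                       # 500 independent SG runs
mean_gap = {}
for k in range(1, 20001):
    alpha = beta / (gamma + k)              # diminishing stepsizes (12)
    g = c * w + np.sqrt(M) * rng.normal(size=w.size)
    w -= alpha * g
    if k in (2000, 20000):
        mean_gap[k] = float(np.mean(0.5 * c * w ** 2))
# 10x more iterations => roughly 10x smaller expected gap, matching O(1/k)
print(mean_gap[2000] / mean_gap[20000])
```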
Nonconvex Objective for Fixed Stepsizes SG
Theorem 4.8 (Nonconvex Objective, Fixed Stepsize)

Under Assumptions 4.1 and 4.3, suppose that the SG method is run with a fixed stepsize α_k = α satisfying

0 < α ≤ μ / (L M_G). (15)

Then the expected sum-of-squares and average-squared gradients of F satisfy

E[ Σ_{k=1}^K ‖∇F(w_k)‖_2^2 ] ≤ K αLM/μ + 2(F(w_1) − F_inf)/(μα), (16)
E[ (1/K) Σ_{k=1}^K ‖∇F(w_k)‖_2^2 ] ≤ αLM/μ + 2(F(w_1) − F_inf)/(Kμα) (17)
→ αLM/μ as K → ∞, (18)

where E[F(w_k)] = E_{ξ_1} E_{ξ_2} ... E_{ξ_{k−1}}[F(w_k)].
Nonconvex Objective for Diminishing Stepsizes SG

Theorem 4.9 (Nonconvex Objective, Diminishing Stepsizes)

Under Assumptions 4.1 and 4.3, suppose SG is run with a stepsize sequence satisfying Σ_{k=1}^∞ α_k = ∞ and Σ_{k=1}^∞ α_k^2 < ∞. Then, with A_K = Σ_{k=1}^K α_k,

E[ Σ_{k=1}^K α_k ‖∇F(w_k)‖_2^2 ] < ∞, (19)
E[ (1/A_K) Σ_{k=1}^K α_k ‖∇F(w_k)‖_2^2 ] → 0 as K → ∞. (20)
An Ideal Case: Reducing Noise at a Geometric Rate

Recall from Lemma 4.2 that

E_{ξ_k}[F(w_{k+1})] − F(w_k) ≤ −α_k ∇F(w_k)^T E_{ξ_k}[g(w_k, ξ_k)] + (L/2) α_k^2 E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2].

If g(w_k, ξ_k) is a descent direction in expectation and we can decrease E_{ξ_k}[‖g(w_k, ξ_k)‖_2^2] fast enough, then a good convergence rate is still possible.
Linear Convergence Rate for Ideal Noise Reduction
Theorem 5.1 (Strongly Convex Objective, Noise Reduction)

Suppose that Assumptions 4.1, 4.3, and 4.5 hold, and that

V_{ξ_k}[g(w_k, ξ_k)] ≤ M ζ^{k−1} for some ζ ∈ (0, 1). (21)

In addition, suppose SG is run with a fixed stepsize α_k = α such that

0 < α ≤ min{ μ/(L μ_G^2), 1/(cμ) }.

Then the expected optimality gap satisfies

E[F(w_k) − F^*] ≤ ω ρ^{k−1},

where

ω := max{ αLM/(cμ), F(w_1) − F^* } and ρ := max{ 1 − αcμ/2, ζ }.
Dynamic Sampling
Consider the iteration

w_{k+1} ← w_k − (α/n_k) Σ_{i∈S_k} ∇f(w_k; ξ_{k,i})

with n_k := |S_k| = ⌈τ^{k−1}⌉ for some τ > 1. This update falls under the class of methods in Theorem 5.1, because

V_{ξ_k}[g(w_k, ξ_k)] ≤ V_{ξ_k}[∇f(w_k; ξ_{k,i})]/n_k ≤ M/n_k ≤ M (1/τ)^{k−1},

so (21) holds with ζ = 1/τ.
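A sketch of dynamic sampling on a toy 1-D quadratic F(w) = (c/2)w^2 (my own construction; per-sample gradient noise has variance M, so the batch average has variance M/n_k):

```python
import numpy as np

rng = np.random.default_rng(0)
c, M, alpha, tau = 1.0, 1.0, 0.5, 1.5
w, gaps = 5.0, []
for k in range(1, 41):
    nk = int(np.ceil(tau ** (k - 1)))        # |S_k| grows geometrically
    # batch gradient: average of nk noisy per-sample gradients
    g = c * w + np.sqrt(M) * rng.normal(size=nk).mean()
    w -= alpha * g
    gaps.append(0.5 * c * w ** 2)
print(gaps[-1] < 1e-2)                       # True: near-linear convergence
```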
Intuition
Rather than computing ever more new stochastic gradients, is it possible to achieve lower variance by reusing and/or revising previously computed stochastic gradient information?

Stochastic variance reduced gradient (SVRG)
Stochastic average gradient (SAG)
Iterative averaging:

w_{k+1} ← w_k − α_k g(w_k, ξ_k),
w̄_{k+1} ← (1/(k+1)) Σ_{j=1}^{k+1} w_j.
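A sketch of iterative averaging on a toy 1-D quadratic with noisy gradients (my own construction): the raw SG iterate w_k rattles around w^* = 0 with variance O(α), while the running average w̄_k is far more accurate.

```python
import numpy as np

rng = np.random.default_rng(0)
c, M, alpha = 1.0, 1.0, 0.2
w, wbar = 5.0, 5.0
recent_sq = []
for k in range(1, 20001):
    g = c * w + np.sqrt(M) * rng.normal()    # noisy gradient of 0.5*c*w**2
    w -= alpha * g                           # plain SG iterate
    wbar += (w - wbar) / (k + 1)             # running average of iterates
    if k > 19000:
        recent_sq.append(w ** 2)
print(wbar ** 2 < np.mean(recent_sq) / 10)   # True: average is much closer to 0
```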
Preliminaries
To address the adverse effects of high nonlinearity and ill-conditioning of the objective function, at each iteration solve a quadratic model

q_k(s) = F(w_k) + ∇F(w_k)^T s + (1/2) s^T H_k s.

The Hessian matrix H_k ∈ R^{d×d} (i.e., ∇^2 F(w_k)) can be very large, and we need to solve the linear system

H_k s = −∇F(w_k). (22)

The update: w_{k+1} ← w_k + α_k s_k.
Hessian Free Inexact Newton
When d is large (especially in deep learning), it is impossible to explicitly compute H_k.
Apply the conjugate gradient (CG) method to iteratively solve the linear system (22).
Only matrix-vector products Hs are needed, each about as expensive as computing a gradient.
CG needs at most d iterations to solve the linear system.
This approach is still not practical for deep neural nets...
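A minimal sketch of the Hessian-free idea (my own toy quadratic; the Hessian-vector product here is formed by finite differences of gradients, one of several standard choices):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 6))
A = Q @ Q.T + 6 * np.eye(6)                  # SPD Hessian of a test quadratic
b = rng.normal(size=6)
grad = lambda w: A @ w - b                   # grad of F(w) = 0.5 w^T A w - b^T w

def hess_vec(w, v, eps=1e-6):
    """Hessian-vector product without forming H: (grad(w+eps*v)-grad(w))/eps."""
    return (grad(w + eps * v) - grad(w)) / eps

def cg(hv, rhs, iters):
    """Conjugate gradient for H s = rhs using only H-vector products."""
    s, r = np.zeros_like(rhs), rhs.copy()
    p = r.copy()
    for _ in range(iters):
        Hp = hv(p)
        a = (r @ r) / (p @ Hp)
        s = s + a * p
        r_new = r - a * Hp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return s

w = rng.normal(size=6)
s = cg(lambda v: hess_vec(w, v), -grad(w), iters=6)   # <= d iterations suffice
print(np.linalg.norm(A @ s + grad(w)))                # residual near zero
```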
Hessian Free Inexact Newton (Cont’d)
Inexact Newton analysis suggests the method is more tolerant to noise in the Hessian estimate than to noise in the gradient estimate.
One can therefore employ subsampled Hessian-free inexact Newton:

∇^2 f_{S_k^H}(w_k; ξ_k^H) s = −∇f_{S_k}(w_k; ξ_k),

where S_k^H ⊂ S_k.
The gradient sample size |S_k| needs to be large enough that taking subsamples for the Hessian estimate is sensible.
Quasi-Newton (BFGS)
Approximate the Hessian using only gradient information.
BFGS is one of the most successful quasi-Newton methods. Defining

s_k := w_{k+1} − w_k and v_k := ∇F(w_{k+1}) − ∇F(w_k),

the update rule of BFGS is

w_{k+1} ← w_k − H_k ∇F(w_k),
H_{k+1} ← (I − (v_k s_k^T)/(s_k^T v_k))^T H_k (I − (v_k s_k^T)/(s_k^T v_k)) + (s_k s_k^T)/(s_k^T v_k),

where H_k is a PSD approximation of (∇^2 F(w_k))^{−1}.
BFGS (Cont’d)
This update ensures that the secant equation H_{k+1}^{−1} s_k = v_k holds, meaning that a second-order Taylor expansion is satisfied along the most recent displacement.
BFGS enjoys a local superlinear rate of convergence.
The matrices H_k are dense, even if the exact Hessians are sparse.
L-BFGS to the rescue: do not explicitly form H_k, but compute H_k ∇F(w_k) from a short history of displacement pairs {(s_j, v_j)}. L-BFGS only achieves a linear rate of convergence, though...
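The standard way to compute H_k ∇F(w_k) from the stored pairs is the two-loop recursion. The sketch below (my own, on a quadratic test problem) checks the secant property H_k v_last = s_last, which the BFGS update guarantees for the most recent pair:

```python
import numpy as np

def two_loop(grad_w, pairs):
    """L-BFGS two-loop recursion: returns H_k @ grad_w using only the
    displacement pairs (s_j, v_j), oldest first, never forming H_k."""
    q = grad_w.astype(float).copy()
    alphas = []
    for s, v in reversed(pairs):             # first loop: newest to oldest
        a = (s @ q) / (s @ v)
        q -= a * v
        alphas.append(a)
    s, v = pairs[-1]
    q *= (s @ v) / (v @ v)                   # initial scaling H_k^0 = gamma*I
    for (s, v), a in zip(pairs, reversed(alphas)):   # second loop: oldest first
        beta = (v @ q) / (s @ v)
        q += (a - beta) * s
    return q

# Pairs from a quadratic F(w) = 0.5 w^T A w, where v_j = A s_j exactly.
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 5))
A = Q @ Q.T + 5 * np.eye(5)                  # SPD, so s^T v > 0 for all pairs
pairs = [(s, A @ s) for s in rng.normal(size=(4, 5))]
s_last, v_last = pairs[-1]
print(np.allclose(two_loop(v_last, pairs), s_last))   # secant equation: True
```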
How about Stochastic L-BFGS?

Iteration update: w_{k+1} ← w_k − α_k H_k g(w_k, ξ_k).
Stochastic L-BFGS can only attain a sublinear rate, like SG. What is the benefit? Improving the constant in the rate, from Hessian-dependent to Hessian-independent!
Each iteration costs O(4md); with m = 5, that is about 20 times the cost of a stochastic gradient estimate. This is not a big problem with mini-batch SG, though.
The effects of even a single bad update may linger for numerous iterations.
Online L-BFGS
Replace gradients with stochastic gradients:

s_k := w_{k+1} − w_k and v_k := ∇f_{S_k}(w_{k+1}; ξ_k) − ∇f_{S_k}(w_k; ξ_k).

Note the use of the same seed ξ_k in the two gradient estimates.
Updating the inverse Hessian approximation at every step may not be warranted when the sample size |S_k| is small.
Alternatively, since ∇F(w_{k+1}) − ∇F(w_k) ≈ ∇^2 F(w_k)(w_{k+1} − w_k), define

v_k := ∇^2 f_{S_k^H}(w_k; ξ_k^H) s_k,

where ∇^2 f_{S_k^H}(w_k; ξ_k^H) is a subsampled Hessian.
Intuition

Construct an approximation to the Hessian using only first-order information.
It is always PSD, even when the Hessian is not (useful for deep learning).
It ignores second-order interactions between elements of the parameter vector w.
Classical Gauss-Newton Method

Consider the least-squares problem

f(w; ξ) = ℓ(h(x_ξ; w), y_ξ) = (1/2) ‖h(x_ξ; w) − y_ξ‖_2^2.

The Gauss-Newton approximation is an affine approximation of the prediction function inside the quadratic loss function,

h(x_ξ; w) ≈ h(x_ξ; w_k) + J_h(w_k; ξ)(w − w_k),

where J_h(·; ξ) represents the Jacobian of h(x_ξ; ·) with respect to w. This leads to

f(w; ξ) ≈ (1/2) ‖h(x_ξ; w_k) + J_h(w_k; ξ)(w − w_k) − y_ξ‖_2^2
= (1/2) ‖h(x_ξ; w_k) − y_ξ‖_2^2 + (h(x_ξ; w_k) − y_ξ)^T J_h(w_k; ξ)(w − w_k) + (1/2)(w − w_k)^T J_h(w_k; ξ)^T J_h(w_k; ξ)(w − w_k).
Gauss-Newton
We can replace the subsampled Hessian with the Gauss-Newton matrix

G_{S_k^H}(w_k; ξ_k^H) = (1/|S_k^H|) Σ_{i∈S_k^H} J_h(w_k; ξ_{k,i})^T J_h(w_k; ξ_{k,i}). (23)

For the least-squares loss, the Gauss-Newton matrix is always PSD, so it can serve as another Hessian approximation in stochastic L-BFGS and inexact Newton methods.
It also generalizes to other loss functions:

G_{S_k^H}(w_k; ξ_k^H) = (1/|S_k^H|) Σ_{i∈S_k^H} J_h(w_k; ξ_{k,i})^T H_ℓ(w_k; ξ_{k,i}) J_h(w_k; ξ_{k,i}),

where H_ℓ(w_k; ξ) = ∂^2 ℓ(h(x_ξ; w), y_ξ) / ∂h^2.
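As a small sketch (my own toy model h(x; w) = tanh(x^T w) with scalar output, not from the slides), the matrix in (23) is just an average of per-sample J^T J terms and is PSD by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                  # a mini-batch S_k^H of 8 samples
w = rng.normal(size=3)

def jacobians(w):
    """Per-sample Jacobians of h(x_i; w) = tanh(x_i^T w); row i is dh_i/dw."""
    z = X @ w
    return (1.0 - np.tanh(z) ** 2)[:, None] * X      # chain rule, shape (8, 3)

J = jacobians(w)
G = J.T @ J / X.shape[0]                     # (1/|S|) sum_i J_i^T J_i, as in (23)
print(np.linalg.eigvalsh(G).min() >= -1e-12) # True: Gauss-Newton matrix is PSD
```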
Diagonal Scaling Intuition
Iteration update:

w_{k+1} ← w_k − α_k D_k g(w_k, ξ_k), with D_k diagonal.

D_k^{−1} ≈ diag(Hessian) or diag(Gauss-Newton).
D_k^{−1} ≈ a running average/sum of the gradient components.
An Example: RMSPROP and ADAGRAD
RMSPROP estimates the average magnitude of each element of the stochastic gradient vector g(w_k, ξ_k) by maintaining the running average

R_k = (1 − λ) R_{k−1} + λ (g(w_k, ξ_k) ◦ g(w_k, ξ_k)),
w_{k+1} = w_k − (α / (√R_k + μ)) ◦ g(w_k, ξ_k),

where ◦ represents the element-wise product (with the square root and division also taken element-wise).
ADAGRAD simply replaces the running average with a sum:

R_k = R_{k−1} + g(w_k, ξ_k) ◦ g(w_k, ξ_k).
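These two updates can be sketched directly from the formulas above (my own toy demo on a badly scaled quadratic; α, λ, and μ are arbitrary illustrative values):

```python
import numpy as np

def rmsprop_step(w, g, R, alpha=0.01, lam=0.1, mu=1e-8):
    """R_k = (1-lam)*R_{k-1} + lam*(g o g); element-wise scaled step."""
    R = (1.0 - lam) * R + lam * g * g
    return w - alpha / (np.sqrt(R) + mu) * g, R

def adagrad_step(w, g, R, alpha=0.01, mu=1e-8):
    """Same step, but R_k accumulates the sum of squared gradients."""
    R = R + g * g
    return w - alpha / (np.sqrt(R) + mu) * g, R

H = np.array([100.0, 1.0])                   # F(w) = 0.5 * sum_i H_i * w_i**2
w, R = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(3000):
    w, R = rmsprop_step(w, H * w, R)         # deterministic gradient H o w
print(np.linalg.norm(w) < 0.1)               # True: hovers near w* = 0
```

Note how the element-wise scaling handles the 100:1 curvature mismatch that a single scalar stepsize cannot.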
Take-Home Message

RMSPROP and ADAGRAD are considered state-of-the-art stochastic optimization methods in practical deep learning.
Note that they are just approximating the diagonal of the Hessian.
There is still big room for stochastic second-order methods to be explored with deep learning models!