
Accelerating Stochastic Optimization

Shai Shalev-Shwartz

School of CS and Engineering, The Hebrew University of Jerusalem

and Mobileye

"Master Class at Tel-Aviv", Tel-Aviv University, November 2014

Shalev-Shwartz (HUJI&Mobileye) Fast Stochastic TAU’14 1 / 34

Learning

Goal (informal): Learn an accurate mapping h : X → Y based on examples ((x1, y1), . . . , (xn, yn)) ∈ (X × Y)^n

Deep learning: Each mapping h : X → Y is parameterized by a weight vector w ∈ R^d, so our goal is to learn the vector w


Regularized Loss Minimization

A popular learning approach is Regularized Loss Minimization (RLM) with Euclidean regularization:

Sample S = ((x1, y1), . . . , (xn, yn)) ∼ D^n and approximately solve the RLM problem

  min_{w∈R^d}  (1/n) ∑_{i=1}^n φ_i(w) + (λ/2) ‖w‖²

where φ_i(w) = ℓ_{y_i}(h_w(x_i)) is the loss of predicting h_w(x_i) when the true target is y_i

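The RLM objective can be made concrete with a minimal sketch, assuming a linear predictor h_w(x) = ⟨w, x⟩ and the logistic loss; both choices, and the names `rlm_objective` and `logistic`, are illustrative, not fixed by the slides:

```python
import numpy as np

def rlm_objective(w, X, y, lam, loss):
    """(1/n) * sum_i phi_i(w) + (lam/2) * ||w||^2,
    with phi_i(w) = loss(<w, x_i>, y_i) for a linear predictor."""
    data_term = np.mean([loss(x @ w, yi) for x, yi in zip(X, y)])
    return data_term + 0.5 * lam * np.dot(w, w)

# Logistic loss: phi_i(w) = log(1 + exp(-y_i * <w, x_i>))
logistic = lambda z, y: np.log1p(np.exp(-y * z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(rng.normal(size=100))
print(rlm_objective(np.zeros(5), X, y, lam=0.01, loss=logistic))  # log(2) ≈ 0.6931
```

At w = 0 every logistic loss equals log 2 and the regularizer vanishes, which gives a quick sanity check on the implementation.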

How to solve RLM for Deep Learning?

Stochastic Gradient Descent (SGD):

Advantages:

Works well in practice

Per-iteration cost independent of n

Disadvantage: slow convergence

[Figure: objective (log scale, 10⁻² to 10⁰) vs. number of backpropagations for SGD]


How to improve SGD convergence rate?

1 Stochastic Dual Coordinate Ascent (SDCA):

Same per iteration cost as SGD... but converges exponentially faster

Designed for convex problems... but can be adapted to deep learning

2 SelfieBoost:

AdaBoost, with SGD as the weak learner, converges exponentially faster than vanilla SGD

But it yields an ensemble of networks, which is very expensive at prediction time

I'll describe a new boosting algorithm that boosts the performance of the same network

I'll show faster convergence under an "SGD success" assumption



SDCA vs. SGD

On CCAT dataset, shallow architecture

[Figure: objective (log scale, 10⁻⁶ to 10⁰) for SDCA, SDCA-Perm, and SGD]

SelfieBoost vs. SGD

On MNIST dataset, depth 5 network

[Figure: error (log scale, 10⁻⁴ to 10⁰) vs. number of backpropagations for SGD and SelfieBoost]

Gradient Descent vs. Stochastic Gradient Descent

Define: P_I(w) = (1/|I|) ∑_{i∈I} φ_i(w) + (λ/2) ‖w‖²

                        GD                            SGD
  rule                  w_{t+1} = w_t − η ∇P(w_t)     w_{t+1} = w_t − η ∇P_I(w_t), for random I ⊂ [n]
  per-iteration cost    O(n)                          O(1)
  convergence rate      log(1/ε)                      1/ε

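The GD/SGD contrast in the table can be sketched numerically; the squared loss, step-size schedule, and random data are illustrative assumptions, and `grad_P` is a hypothetical helper name:

```python
import numpy as np

def grad_P(w, X, y, lam, I=None):
    """Gradient of P_I(w) = (1/|I|) sum_{i in I} phi_i(w) + (lam/2)*||w||^2,
    here with the squared loss phi_i(w) = 0.5*(<w, x_i> - y_i)^2."""
    if I is None:                                   # I = [n]: the full gradient used by GD
        I = np.arange(X.shape[0])
    XI, yI = X[I], y[I]
    return XI.T @ (XI @ w - yI) / len(I) + lam * w

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)
lam = 0.1
w_gd, w_sgd = np.zeros(5), np.zeros(5)
for t in range(1000):
    w_gd = w_gd - 0.1 * grad_P(w_gd, X, y, lam)                          # O(n) per iteration
    I = rng.integers(0, len(y), size=1)                                  # random singleton I
    w_sgd = w_sgd - 0.01 / np.sqrt(t + 1) * grad_P(w_sgd, X, y, lam, I)  # O(1) per iteration
print(np.linalg.norm(grad_P(w_gd, X, y, lam)),    # ~0: GD has converged
      np.linalg.norm(grad_P(w_sgd, X, y, lam)))   # noticeably larger: SGD is still far off
```

This mirrors the table's trade-off: each GD step touches all n examples but contracts linearly, while each SGD step is O(1) but the noise keeps it away from the optimum.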

Hey, wait, but what about ...

Decaying learning rate

Nesterov’s momentum


SGD — More powerful oracle is crucial

Theorem

Any algorithm for solving RLM that accesses the objective only through a stochastic gradient oracle and has a log(1/ε) rate must perform Ω(n²) iterations

Proof idea:

Consider two objectives (in both, λ = 1): for i ∈ {±1},

  P_i(w) = (1/(2n)) [ ((n−1)/2) (w − i)²  +  ((n+1)/2) (w + i)² ]

A stochastic gradient oracle returns w ± i with probability 1/2 ± 1/(2n)

It is easy to see that w*_i = −i/n,  P_i(0) = 1/2,  P_i(w*_i) = 1/2 − 1/(2n²)

Therefore, solving to accuracy ε < 1/(2n²) amounts to determining the bias of the coin

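The coin-bias reduction can be illustrated with a quick simulation; the sampling setup below is a hypothetical stand-in for the oracle, not the formal construction:

```python
import numpy as np

# Hypothetical illustration: the oracle's answers encode a coin with bias
# 1/2 + i/(2n).  Determining the sign of i (and hence solving to accuracy
# eps < 1/(2n^2)) requires on the order of n^2 oracle calls.
rng = np.random.default_rng(0)
n = 50
bias = 0.5 + 1.0 / (2 * n)               # the coin for i = +1
flips = rng.random(10 * n * n) < bias    # ~n^2 samples suffice to resolve the bias
print(flips.mean())                      # concentrates near 0.51 = 1/2 + 1/(2n)
```

With far fewer than n² flips the empirical mean cannot reliably separate 1/2 + 1/(2n) from 1/2 − 1/(2n), which is the content of the lower bound.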


Outline

1 SDCA
  Description and analysis for convex problems
  SDCA for Deep Learning
  Accelerated SDCA

2 SelfieBoost


Stochastic Dual Coordinate Ascent

Primal problem:

  min_{w∈R^d}  P(w) := (1/n) ∑_{i=1}^n φ_i(w) + (λ/2) ‖w‖²

(Fenchel) Dual problem:

  max_{α∈R^{d×n}}  D(α) := (1/n) ∑_{i=1}^n −φ*_i(−α_i)  −  (1/(2λn²)) ‖ ∑_{i=1}^n α_i ‖²

(where α_i is the i-th column of α)

DCA: At each iteration, optimize D(α) with respect to a single column of α, while the rest of the columns are kept intact.

Stochastic Dual Coordinate Ascent (SDCA): Choose the updated column uniformly at random


Fenchel Conjugate

Two equivalent representations of a convex function:

As a set of points (w, f(w))

As a set of tangents, each with slope θ and intercept −f*(θ), where

  f*(θ) = max_w ⟨w, θ⟩ − f(w)

[Figure: a convex curve shown once as points (w, f(w)) and once as the envelope of tangent lines with slope θ and vertical intercept −f*(θ)]

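The definition f*(θ) = max_w ⟨w, θ⟩ − f(w) can be checked numerically with a grid search; the choice f(w) = w²/2, which is its own conjugate, is an illustrative assumption:

```python
import numpy as np

def fenchel_conjugate(f, theta, grid):
    """f*(theta) = max_w <w, theta> - f(w), approximated over a grid of w values."""
    return max(w * theta - f(w) for w in grid)

f = lambda w: 0.5 * w**2               # f(w) = w^2/2 is self-conjugate: f*(theta) = theta^2/2
grid = np.linspace(-10, 10, 100001)
for theta in (-2.0, 0.5, 3.0):
    print(theta, fenchel_conjugate(f, theta, grid), 0.5 * theta**2)  # the two values agree
```

The agreement reflects the slide's picture: each slope θ picks out the tangent of f, and its intercept recovers −f*(θ).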

SDCA Analysis

Theorem

Assume each φ_i is convex and smooth. Then, after

  O( (n + 1/λ) · log(1/ε) )

iterations of SDCA we have, with high probability, P(w_t) − P(w*) ≤ ε.

                          GD                SGD           SDCA
  iteration cost          nd                d             d
  convergence rate        (1/λ) log(1/ε)    1/(λε)        (n + 1/λ) log(1/ε)
  runtime                 nd · (1/λ)        d · 1/(λε)    d · (n + 1/λ)
  runtime for λ = 1/n     n²d               nd/ε          nd


SDCA vs. SGD — experimental observations

On CCAT dataset, shallow architecture, λ = 10−6

[Figure: objective (log scale, 10⁻⁶ to 10⁰) for SDCA, SDCA-Perm, and SGD]

SDCA vs. DCA — Randomization is crucial

[Figure: objective (log scale, 10⁻⁶ to 10⁰) for SDCA, DCA-Cyclic, SDCA-Perm, and the theoretical bound]

Outline

1 SDCA
  Description and analysis for convex problems
  SDCA for Deep Learning
  Accelerated SDCA

2 SelfieBoost


Deep Networks are Non-Convex

A 2-dim slice of a network with hidden layers 10, 10, 10, 10, on MNIST, with the clamped ReLU activation function and the logistic loss.

The slice is defined by finding a global minimum (using SGD) and creating two random permutations of the first hidden layer.

[Figure: loss surface over the 2-dim slice, showing a highly non-convex landscape]

But Deep Networks Seem Convex Near a Minimum

Now the slice is based on 2 random points at distance 1 around a global minimum

[Figure: loss surface over this slice (values roughly 4–8 × 10⁻²), appearing convex near the minimum]

SDCA for Deep Learning

For non-convex φ_i, the Fenchel conjugate often becomes meaningless

But our analysis implies that an approximate dual update suffices, that is,

  α_i^(t) = α_i^(t−1) − ηλn ( ∇φ_i(w^(t−1)) + α_i^(t−1) )

The relation between primal and dual vectors is

  w^(t−1) = (1/(λn)) ∑_{i=1}^n α_i^(t−1)

and therefore the corresponding primal update is

  w^(t) = w^(t−1) − η ( ∇φ_i(w^(t−1)) + α_i^(t−1) )

These updates can be implemented for deep learning as well and do not require computing the Fenchel conjugate

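These dual-free updates can be sketched on a convex toy problem; the squared loss (ridge regression, which has a closed-form minimizer to check against) and the step-size choices are illustrative assumptions, not the deep-learning setting of the talk:

```python
import numpy as np

def sdca_primal_dual(X, y, lam, eta, epochs, rng):
    """Sketch of the dual-free SDCA updates above, for the squared loss
    phi_i(w) = 0.5*(<w, x_i> - y_i)^2 (an illustrative convex stand-in)."""
    n, d = X.shape
    alpha = np.zeros((n, d))                        # one pseudo-dual vector per example
    w = alpha.sum(axis=0) / (lam * n)               # primal-dual relation w = sum(alpha)/(lam*n)
    for _ in range(epochs):
        for i in rng.permutation(n):
            v = X[i] * (X[i] @ w - y[i]) + alpha[i]   # grad phi_i(w) + alpha_i
            alpha[i] = alpha[i] - eta * lam * n * v   # dual update
            w = w - eta * v                           # matching primal update
    return w

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
lam = 1.0
w = sdca_primal_dual(X, y, lam, eta=0.01, epochs=100, rng=rng)
w_star = np.linalg.solve(X.T @ X / 50 + lam * np.eye(3), X.T @ y / 50)
print(np.linalg.norm(w - w_star))   # small: the iterates approach the RLM minimizer
```

Note that the primal step w ← w − ηv equals (1/(λn)) times the dual step, so the relation w = (1/(λn)) ∑ α_i is preserved at every iteration, exactly as in the slides.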

Intuition: Why SDCA is better than SGD

Recall that the SDCA primal update rule is

  w^(t) = w^(t−1) − η ( ∇φ_i(w^(t−1)) + α_i^(t−1) )  =:  w^(t−1) − η v^(t)

and that w^(t−1) = (1/(λn)) ∑_{i=1}^n α_i^(t−1).

Observe: v^(t) is an unbiased estimate of the gradient:

  E[ v^(t) | w^(t−1) ] = (1/n) ∑_{i=1}^n ( ∇φ_i(w^(t−1)) + α_i^(t−1) )
                       = ( ∇P(w^(t−1)) − λ w^(t−1) ) + λ w^(t−1)
                       = ∇P(w^(t−1))

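The unbiasedness computation can be verified numerically for any w and any pseudo-duals obeying the primal-dual relation; the squared loss and random data below are illustrative assumptions:

```python
import numpy as np

# Check: if mean(alpha_i) = lam * w (i.e. w = (1/(lam*n)) * sum_i alpha_i),
# then the average of v = grad phi_i(w) + alpha_i equals grad P(w).
rng = np.random.default_rng(0)
n, d, lam = 30, 4, 0.5
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
w = rng.normal(size=d)

alpha = rng.normal(size=(n, d))
alpha += lam * w - alpha.mean(axis=0)        # enforce the primal-dual relation

grads = X * (X @ w - y)[:, None]             # rows: grad phi_i(w), squared loss
v_mean = (grads + alpha).mean(axis=0)        # E[v | w] over a uniformly random i
full_grad = grads.mean(axis=0) + lam * w     # grad P(w)
print(np.allclose(v_mean, full_grad))        # True
```

The −λw and +λw terms cancel exactly as in the derivation, so the estimate is unbiased regardless of how far the individual α_i are from their optimal values.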


Intuition: Why SDCA is better than SGD

The update step of both SGD and SDCA is w^(t) = w^(t−1) − η v^(t), where

  v^(t) = ∇φ_i(w^(t−1)) + λ w^(t−1)      for SGD
  v^(t) = ∇φ_i(w^(t−1)) + α_i^(t−1)      for SDCA

In both cases, E[ v^(t) | w^(t−1) ] = ∇P(w^(t−1))

What about the variance?

For SGD, even if w(t−1) = w∗, the variance of v(t) is still constant

For SDCA, we'll show that the variance of v^(t) goes to zero as w^(t−1) → w∗


SDCA as variance reduction technique (Johnson & Zhang)

For w* we have ∇P(w*) = 0, which means

  (1/n) ∑_{i=1}^n ∇φ_i(w*) + λ w* = 0
  ⇒  w* = (1/(λn)) ∑_{i=1}^n ( −∇φ_i(w*) )  =  (1/(λn)) ∑_{i=1}^n α*_i

Therefore, if α_i^(t−1) → α*_i and w^(t−1) → w*, then the update vector satisfies

  v^(t) = ∇φ_i(w^(t−1)) + α_i^(t−1)  →  ∇φ_i(w*) + α*_i = 0

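The vanishing-variance argument can be demonstrated at the optimum itself; the squared loss is an illustrative choice because its RLM minimizer has a closed form:

```python
import numpy as np

# At w*, the per-example SGD estimates grad phi_i(w*) + lam*w* still fluctuate
# across i, while the SDCA estimates grad phi_i(w*) + alpha*_i are all zero.
rng = np.random.default_rng(0)
n, d, lam = 100, 5, 1.0
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
grads = X * (X @ w_star - y)[:, None]      # rows: grad phi_i(w*)
alpha_star = -grads                        # optimal pseudo-duals alpha*_i = -grad phi_i(w*)

v_sgd = grads + lam * w_star               # per-example SGD estimates at w*
v_sdca = grads + alpha_star                # per-example SDCA estimates at w*

print(np.mean(np.linalg.norm(v_sgd, axis=1)))   # bounded away from zero
print(np.mean(np.linalg.norm(v_sdca, axis=1)))  # exactly zero
```

Both estimators average to ∇P(w*) = 0, but only SDCA's individual terms vanish, which is exactly why a constant step size can yield linear convergence.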

Issues with SDCA for Deep Learning

Needs to maintain the matrix α

Can significantly reduce storage if working with mini-batches

Another approach is the SVRG algorithm of Johnson and Zhang


Accelerated SDCA

Nesterov's Accelerated (deterministic) Gradient Descent (AGD): combine two gradients to accelerate the convergence rate

AGD runtime: O( dn / √λ )

SDCA runtime: O( d (n + 1/λ) )

Can we accelerate SDCA?

Yes! The main idea is to iterate:

  Use SDCA to approximately minimize P_t(w) = P(w) + (κ/2) ‖w − y^(t−1)‖²

  Update y^(t) = w^(t) + β ( w^(t) − w^(t−1) )

Accelerated SDCA runtime: O( d (n + √(n/λ)) )

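The two-step iteration can be sketched schematically; plain gradient descent stands in for the inner SDCA solver, and all parameter values and the quadratic test objective are illustrative, not the tuned choices behind the stated runtime:

```python
import numpy as np

def accelerated_outer_loop(grad_P, d, kappa, beta, eta, inner_steps, outer_steps):
    """Acceleration by inexact proximal point, as in the slides: repeatedly
    (approximately) minimize P(w) + (kappa/2)*||w - y||^2, then extrapolate."""
    w_prev = w = y = np.zeros(d)
    for _ in range(outer_steps):
        w_prev = w
        for _ in range(inner_steps):              # inner solver (stand-in for SDCA)
            w = w - eta * (grad_P(w) + kappa * (w - y))
        y = w + beta * (w - w_prev)               # Nesterov-style extrapolation
    return w

# Toy check on P(w) = 0.5*||w - c||^2, whose gradient is w - c
c = np.array([1.0, -2.0, 3.0])
w = accelerated_outer_loop(lambda w: w - c, d=3, kappa=1.0, beta=0.5,
                           eta=0.2, inner_steps=20, outer_steps=30)
print(np.linalg.norm(w - c))   # ~0: converged to the minimizer
```

The design point is that the inner problem is κ-strongly convex around y, so the cheap stochastic solver handles it quickly, and the outer extrapolation supplies the acceleration.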

Experimental Demonstration

Smoothed hinge-loss with `1, `2 regularization, λ = 10−7:

[Figure: three panels (astro-ph, cov1, CCAT) plotting the objective vs. iterations for AccProxSDCA, ProxSDCA, and FISTA]

Outline

1 SDCA
  Description and analysis for convex problems
  SDCA for Deep Learning
  Accelerated SDCA

2 SelfieBoost


SelfieBoost Motivation

[Figure: objective (log scale, 10⁻² to 10⁰) vs. number of backpropagations for SGD]

Why is SGD slow at the end?

High variance, even close to the optimum

Rare mistakes: Suppose all but 1% of the examples are correctly classified. SGD will now waste 99% of its time on examples that the model already classifies correctly


SelfieBoost Motivation

For simplicity, consider a binary classification problem in the realizable case

For a fixed ε0 (not too small), a few SGD iterations find a solution with P(w) − P(w∗) ≤ ε0

However, for a small ε, SGD requires many iterations

Smells like we need to use boosting ....


First idea: learn an ensemble using AdaBoost

Fix ε0 (say 0.05), and assume SGD can find a solution with error < ε0 quite fast

Let's apply AdaBoost with SGD as the weak learner:

At iteration t, we sub-sample a training set based on a distribution D_t over [n]

We feed the sub-sample to an SGD learner and get a weak classifier h_t

Update D_{t+1} based on the predictions of h_t

The output of AdaBoost is an ensemble with prediction ∑_{t=1}^T α_t h_t(x)

The celebrated Freund & Schapire theorem states that if T = O(log(1/ε)), then the error of the ensemble classifier is at most ε

Observe that each boosting iteration involves calling SGD on relatively little data and then updating the distribution on the entire big data set. The latter step can be performed in parallel

Disadvantage of learning an ensemble: at prediction time, we need to apply many networks

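The reweighting step at the heart of this loop can be sketched; the margin-based form of the update and the example numbers below are illustrative assumptions, not the talk's code:

```python
import numpy as np

def boosting_round(D, margins):
    """One AdaBoost-style reweighting step: `margins` holds y_i * h_t(x_i)
    for the current weak classifier h_t, and mistakes get upweighted."""
    eps = np.sum(D[margins <= 0])                   # weighted error of h_t
    a_t = 0.5 * np.log((1 - eps) / eps)             # ensemble weight alpha_t
    D_new = D * np.exp(-a_t * np.sign(margins))     # shrink correct, grow wrong
    return D_new / D_new.sum(), a_t

# Hypothetical round: 2 of 10 uniformly weighted examples are misclassified
D = np.full(10, 0.1)
margins = np.array([1.0] * 8 + [-1.0] * 2)
D, a_t = boosting_round(D, margins)
print(D[0], D[-1])   # ≈ 0.0625 and ≈ 0.25: mistakes now carry more weight
```

After the update the correctly and incorrectly classified groups each carry half the total mass, which is what forces the next weak learner to attend to the rare mistakes.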

Boosting the Same Network

Can we obtain "boosting-like" convergence while learning a single network?

The SelfieBoost Algorithm:

Start with an initial network f_1

At iteration t, define weights over the n examples according to D_i ∝ e^(−y_i f_t(x_i))

Sub-sample a training set S ∼ D

Use SGD to approximately solve the problem

  f_{t+1} ≈ argmin_g  ∑_{i∈S} y_i ( f_t(x_i) − g(x_i) )  +  (1/2) ∑_{i∈S} ( g(x_i) − f_t(x_i) )²

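The two ingredients of the algorithm, the example weights and the sub-problem objective, can be written out as a sketch; evaluating them on plain prediction vectors (rather than actual networks) is an illustrative simplification:

```python
import numpy as np

def selfieboost_weights(f_vals, y):
    """Example weights D_i ∝ exp(-y_i * f_t(x_i)) from the algorithm above."""
    D = np.exp(-y * f_vals)
    return D / D.sum()

def selfieboost_objective(g_vals, f_vals, y):
    """The sampled sub-problem's objective, on prediction vectors over S:
    sum_i y_i*(f_t(x_i) - g(x_i)) + 0.5 * sum_i (g(x_i) - f_t(x_i))^2."""
    diff = g_vals - f_vals
    return float(np.sum(-y * diff) + 0.5 * np.sum(diff ** 2))

# Hypothetical check: moving each prediction by y_i (i.e. g = f_t + y) gives
# objective -|S|/2, well below the -1/4 that the analysis asks of SGD
y = np.array([1.0, -1.0, 1.0, -1.0])
f = np.zeros(4)
print(selfieboost_weights(f, y))            # uniform: all 0.25
print(selfieboost_objective(f + y, f, y))   # -2.0
```

The linear term rewards increasing the margin of f_t, while the quadratic term keeps the new network g close to the old one, so the same network is improved rather than a new ensemble member added.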

Analysis of the SelfieBoost Algorithm

Lemma: At each iteration, with high probability over the choice of S, there exists a network g with objective value of at most −1/4

Theorem: If at each iteration the SGD algorithm finds a solution with objective value of at most −ρ, then after

  log(1/ε) / ρ

SelfieBoost iterations, the error of f_t will be at most ε

To summarize: we have obtained log(1/ε) convergence, assuming that the SGD algorithm can solve each sub-problem to a fixed accuracy (which seems to hold in practice)


SelfieBoost vs. SGD

On MNIST dataset, depth 5 network

[Figure: error (log scale, 10⁻⁴ to 10⁰) vs. number of backpropagations for SGD and SelfieBoost]

Summary

SGD converges quickly to an o.k. solution, but then slows down:

1 High variance, even at w∗

2 Wastes time on cases that are already solved

There is a need for stochastic methods that have per-iteration complexity similar to SGD's but converge faster:

1 SDCA reduces the variance and therefore converges faster

2 SelfieBoost focuses on the hard cases

Future Work and Open Questions:

Evaluate the empirical performance of SDCA and SelfieBoost on challenging deep learning tasks

Bridge the gap between the empirical success of SGD and worst-case hardness results

