COCOA: Communication-Efficient Coordinate Ascent


"COCOA: Communication-Efficient Coordinate Ascent" presentation from AMPCamp 5 by Virginia Smith

Virginia Smith, with Martin Jaggi, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, & Michael I. Jordan

Outline:
- Large-Scale Optimization
- CoCoA
- Results

Machine Learning with Large Datasets

Examples: image/music/video tagging, document categorization, item recommendation, click-through rate prediction, sequence tagging, protein structure prediction, sensor data prediction, spam classification, fraud detection


Machine Learning Workflow
- DATA & PROBLEM: classification, regression, collaborative filtering, …
- MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …
- OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton's method, …


Example: SVM Classification

[Figure: max-margin linear classifier; hyperplanes w·x − b = 1, w·x − b = 0, w·x − b = −1; normal vector w; margin width 2/||w||]

$$\min_{w \in \mathbb{R}^d} \;\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_{\mathrm{hinge}}\big(y_i\, w^T x_i\big)$$
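To make the objective concrete, here is a minimal numpy sketch (not from the slides) that evaluates this regularized hinge-loss objective; the names `svm_primal`, `X`, `y`, and `lam` are illustrative.

```python
import numpy as np

def svm_primal(w, X, y, lam):
    """Regularized hinge-loss SVM objective:
    (lam/2)*||w||^2 + (1/n) * sum_i max(0, 1 - y_i * w^T x_i)."""
    margins = y * (X @ w)                        # y_i * w^T x_i for all i
    hinge = np.maximum(0.0, 1.0 - margins)       # hinge loss per example
    return 0.5 * lam * np.dot(w, w) + hinge.mean()

# tiny usage example with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(rng.normal(size=100))
w = np.zeros(5)
print(svm_primal(w, X, y, lam=0.01))             # equals 1.0 at w = 0
```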

Many methods exist for solving this objective:
- Descent algorithms and line search methods
- Acceleration, momentum, and conjugate gradients
- Newton and quasi-Newton methods
- Coordinate descent
- Stochastic and incremental gradient methods
- SMO
- SVMlight
- LIBLINEAR


Machine Learning Workflow
- DATA & PROBLEM: classification, regression, collaborative filtering, …
- MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …
- OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton's method, …
- SYSTEMS SETTING: multi-core, cluster, cloud, supercomputer, …

Open Problem: efficiently solving the objective when the data is distributed


Distributed Optimization

“always communicate”: reduce $w = w - \alpha \sum_k \Delta w$
  ✔ convergence guarantees
  ✗ high communication

“never communicate”: average $w := \frac{1}{K} \sum_k w_k$ [ZDWJ, 2012]
  ✔ low communication
  ✗ convergence not guaranteed
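The two extremes above can be sketched in a few lines of numpy. This is an illustrative single-process simulation (the helpers `svm_subgrad`, `always_communicate`, and `never_communicate` are hypothetical, not the methods benchmarked later), assuming each element of `shards` is one machine's `(X_k, y_k)` block.

```python
import numpy as np

def svm_subgrad(w, X, y, lam):
    """Subgradient of (lam/2)||w||^2 + (1/n_local) * sum_i hinge(y_i w^T x_i)."""
    margins = y * (X @ w)
    active = margins < 1.0                               # margin violators
    g_loss = -(X[active] * y[active, None]).sum(axis=0) / len(y)
    return lam * w + g_loss

def always_communicate(shards, lam, alpha, iters):
    """Reduce every iteration: w = w - alpha * sum_k Δw_k."""
    w = np.zeros(shards[0][0].shape[1])
    for _ in range(iters):
        w = w - alpha * sum(svm_subgrad(w, Xk, yk, lam) for Xk, yk in shards)
    return w

def never_communicate(shards, lam, alpha, iters):
    """Each machine optimizes locally, then average once: w := (1/K) sum_k w_k."""
    w_locals = []
    for Xk, yk in shards:
        wk = np.zeros(Xk.shape[1])
        for _ in range(iters):
            wk = wk - alpha * svm_subgrad(wk, Xk, yk, lam)
        w_locals.append(wk)
    return np.mean(w_locals, axis=0)                     # one-shot average
```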


Mini-batch: a natural middle-ground

reduce: $w = w - \frac{\alpha}{|b|} \sum_{i \in b} \Delta w$

✔ convergence guarantees
✔ tunable communication
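For completeness, here is a sketch of the mini-batch reduce step above, again on the hinge-loss SVM; all names are illustrative and the per-example update is a plain subgradient contribution, not the exact methods compared later.

```python
import numpy as np

def minibatch_step(w, X, y, lam, alpha, batch_size, rng):
    """One mini-batch reduce: w = w - (alpha/|b|) * sum_{i in b} Δw_i,
    where Δw_i = lam*w + subgradient of hinge(y_i w^T x_i)."""
    b = rng.choice(len(y), size=batch_size, replace=False)
    margins = y[b] * (X[b] @ w)
    active = margins < 1.0
    # sum of per-example updates over the batch
    delta = batch_size * lam * w - (X[b][active] * y[b][active, None]).sum(axis=0)
    return w - (alpha / batch_size) * delta

# usage: w = minibatch_step(w, X, y, lam=0.01, alpha=0.1, batch_size=64,
#                           rng=np.random.default_rng(0))
```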

Are we done?


Mini-batch Limitations

1. METHODS BEYOND SGD

2. STALE UPDATES

3. AVERAGE OVER BATCH SIZE


Mini-batch Limitations, and how CoCoA addresses them:
1. METHODS BEYOND SGD: use a primal-dual framework
2. STALE UPDATES: immediately apply local updates
3. AVERAGE OVER BATCH SIZE: average over K << batch size


Communication-Efficient Distributed Dual Coordinate Ascent (CoCoA)


1. Primal-Dual Framework

PRIMAL ≥ DUAL

$$\min_{w \in \mathbb{R}^d} \left[ P(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_i(w^T x_i) \right]$$

$$\max_{\alpha \in \mathbb{R}^n} \left[ D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^{n} \ell_i^*(-\alpha_i) \right], \qquad A_i = \frac{1}{\lambda n}\, x_i$$

- Stopping criteria given by the duality gap
- Good performance in practice
- Default in software packages, e.g. liblinear
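As a concrete instance, here is a small numpy sketch (illustrative, not from the talk) of the primal, the dual, and the duality-gap stopping criterion for the hinge-loss SVM, using the fact that the hinge conjugate satisfies ℓ*_i(−α_i) = −y_i α_i on its domain (0 ≤ y_i α_i ≤ 1).

```python
import numpy as np

def duality_gap(alpha, X, y, lam):
    """P(w(alpha)) - D(alpha) for the hinge-loss SVM, with w(alpha) = A @ alpha
    and A_i = x_i / (lam * n). Assumes alpha is dual-feasible: 0 <= y_i*alpha_i <= 1."""
    n = len(y)
    w = X.T @ alpha / (lam * n)                                   # w = A alpha
    primal = 0.5 * lam * (w @ w) + np.maximum(0.0, 1.0 - y * (X @ w)).mean()
    dual = -0.5 * lam * (w @ w) + (y * alpha).mean()              # -l*_i(-a_i) = y_i*a_i
    return primal - dual    # always >= 0; usable as a stopping criterion
```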


2. Immediately Apply Updates

STALE:
  for i ∈ b:
    Δw ← Δw − α ∇_i P(w)
  end
  w ← w + Δw

FRESH:
  for i ∈ b:
    Δw ← Δw − α ∇_i P(w)
    w ← w + Δw
  end
  Output: Δα_[k] and Δw := A_[k] Δα_[k]
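The difference between the two loops can be sketched generically; `grad_i(w, i)` is a hypothetical callback returning the i-th (sub)gradient/update direction at the current w.

```python
import numpy as np

def batch_update_stale(w, batch, grad_i, alpha):
    """All updates are computed at the same stale w, then applied together."""
    delta = np.zeros_like(w)
    for i in batch:
        delta -= alpha * grad_i(w, i)        # every step sees the old w
    return w + delta                         # w <- w + Δw

def batch_update_fresh(w, batch, grad_i, alpha):
    """Each update is applied immediately, so later steps see the fresh w."""
    for i in batch:
        w = w - alpha * grad_i(w, i)         # step i uses the already-updated w
    return w
```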

3. Average over K

reduce: $w = w + \frac{1}{K}\sum_k \Delta w_k$


CoCoA

Algorithm 1: CoCoA
  Input: T ≥ 1, scaling parameter 1 ≤ β_K ≤ K (default: β_K := 1)
  Data: {(x_i, y_i)}_{i=1}^n distributed over K machines
  Initialize: α^(0)_[k] ← 0 for all machines k, and w^(0) ← 0
  for t = 1, 2, …, T
    for all machines k = 1, 2, …, K in parallel
      (Δα_[k], Δw_k) ← LocalDualMethod(α^(t−1)_[k], w^(t−1))
      α^(t)_[k] ← α^(t−1)_[k] + (β_K / K) Δα_[k]
    end
    reduce: w^(t) ← w^(t−1) + (β_K / K) Σ_{k=1}^{K} Δw_k
  end

Procedure A: LocalDualMethod: dual algorithm on machine k
  Input: local α_[k] ∈ R^{n_k}, and w ∈ R^d consistent with the other coordinate blocks of α, s.t. w = Aα
  Data: local {(x_i, y_i)}_{i=1}^{n_k}
  Output: Δα_[k] and Δw := A_[k] Δα_[k]

✔ <10 lines of code in Spark
✔ primal-dual framework allows for any internal optimization method
✔ local updates applied immediately
✔ average over K
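The following is a minimal single-process numpy simulation of Algorithm 1, using H steps of SDCA on the hinge loss as the LocalDualMethod (the closed-form coordinate update from Shalev-Shwartz & Zhang). It is a sketch, not the Spark implementation from the talk; all names (`shards`, `local_sdca`, `cocoa`, …) are illustrative.

```python
import numpy as np

def local_sdca(alpha_k, w, X_k, y_k, lam, n, H, rng):
    """A possible LocalDualMethod: H iterations of SDCA on machine k's block
    of the hinge-loss SVM dual. Returns (Δalpha_k, Δw) with Δw = A_[k] @ Δalpha_k."""
    alpha_k, w = alpha_k.copy(), w.copy()
    delta_alpha = np.zeros_like(alpha_k)
    delta_w = np.zeros_like(w)
    for _ in range(H):
        i = rng.integers(len(y_k))
        x, yi = X_k[i], y_k[i]
        q = (x @ x) / (lam * n)
        if q == 0.0:
            continue
        # closed-form hinge-loss SDCA coordinate step
        da = yi * np.clip((1.0 - yi * (w @ x)) / q + alpha_k[i] * yi, 0.0, 1.0) - alpha_k[i]
        alpha_k[i] += da
        delta_alpha[i] += da
        upd = da * x / (lam * n)              # A_i * da
        w += upd                              # apply the local update immediately
        delta_w += upd
    return delta_alpha, delta_w

def cocoa(shards, d, lam, T, H, beta_K=1.0, seed=0):
    """Outer CoCoA loop: run LocalDualMethod on every shard, then reduce by
    adding (beta_K / K) times each machine's Δw (averaging when beta_K = 1)."""
    rng = np.random.default_rng(seed)
    K = len(shards)
    n = sum(len(yk) for _, yk in shards)
    alphas = [np.zeros(len(yk)) for _, yk in shards]
    w = np.zeros(d)
    for _ in range(T):
        updates = [local_sdca(alphas[k], w, Xk, yk, lam, n, H, rng)
                   for k, (Xk, yk) in enumerate(shards)]      # "in parallel"
        for k, (da, _) in enumerate(updates):
            alphas[k] += (beta_K / K) * da
        w = w + (beta_K / K) * sum(dw for _, dw in updates)   # reduce
    return w, alphas
```

Each element of `shards` is one machine's `(X_k, y_k)` block; the last line of the outer loop mirrors the reduce step w^(t) = w^(t−1) + (β_K/K) Σ_k Δw_k.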


Convergence

$$\mathbb{E}\big[D(\alpha^*) - D(\alpha^{(T)})\big] \le \left(1 - (1-\Theta)\,\frac{1}{K}\,\frac{\lambda n \gamma}{\sigma + \lambda n \gamma}\right)^{T} \Big(D(\alpha^*) - D(\alpha^{(0)})\Big)$$

Assumptions: the losses $\ell_i$ are $(1/\gamma)$-smooth, and LocalDualMethod makes $\Theta$ improvement per step; e.g. for SDCA, $\Theta = \left(1 - \frac{\lambda n \gamma}{1 + \lambda n \gamma}\,\frac{1}{\tilde n}\right)^{H}$, with $\tilde n$ the number of local data points and $H$ the number of local iterations.

Here $\sigma$ is a measure of the difficulty of the data partition, $0 \le \sigma \le n/K$. The bound applies also to the duality gap.

✔ it converges!
✔ convergence rate matches current state-of-the-art mini-batch methods
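To get a feel for the bound, the short sketch below plugs in purely illustrative parameter values (none of them come from the talk) and computes the per-round contraction factor and the number of outer rounds T it predicts; since it uses the worst-case partition measure σ = n/K, the prediction is an upper bound and far more pessimistic than the empirical behavior in the next section.

```python
import math

# Illustrative values only (not from the talk)
n, K, H = 500_000, 8, 100_000          # examples, machines, local SDCA steps
lam, gamma = 1e-4, 1.0                 # regularizer, smoothness of the losses
sigma = n / K                          # worst-case partition-difficulty measure
n_local = n / K                        # data points per machine

theta = (1 - (lam * n * gamma) / (1 + lam * n * gamma) / n_local) ** H   # SDCA quality
rate = 1 - (1 - theta) / K * (lam * n * gamma) / (sigma + lam * n * gamma)

eps0, eps = 1.0, 1e-6                  # initial and target dual suboptimality
T = math.ceil(math.log(eps / eps0) / math.log(rate))
print(f"Theta = {theta:.3f}, per-round contraction = {rate:.6f}, rounds T <= {T}")
```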


Empirical Results in Spark

Dataset    Training (n)   Features (d)   Sparsity   Workers (K)
Cov        522,911        54             22.22%     4
Rcv1       677,399        47,236         0.16%      8
Imagenet   32,751         160,000        100%       32

[Figure: log primal suboptimality vs. time (s), one panel per dataset]
- Imagenet: COCOA (H=1e3), mini-batch-CD (H=1), local-SGD (H=1e3), mini-batch-SGD (H=10)
- Cov: COCOA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e5), batch-SGD (H=1)
- RCV1: COCOA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e4), batch-SGD (H=100)


COCOA Take-Aways
- A framework for distributed optimization
- Uses state-of-the-art primal-dual methods
- Reduces communication: applies updates immediately, averages over # of machines
- Strong empirical results & convergence guarantees

To appear in NIPS '14
www.berkeley.edu/~vsmith
