COCOA: Communication-Efficient Coordinate Ascent


"COCOA: Communication-Efficient Coordinate Ascent" presentation from AMPCamp 5 by Virginia Smith

Virginia Smith, with Martin Jaggi, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, & Michael I. Jordan

Outline:
- Large-Scale Optimization
- CoCoA
- Results

Machine Learning with Large Datasets

Examples: image/music/video tagging, document categorization, item recommendation, click-through rate prediction, sequence tagging, protein structure prediction, sensor data prediction, spam classification, fraud detection


Machine Learning Workflow
- DATA & PROBLEM: classification, regression, collaborative filtering, …
- MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …
- OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton's method, …


Example: SVM Classification

[Figure: max-margin linear classifier; hyperplanes w·x − b = 1, w·x − b = 0, w·x − b = −1; normal vector w; margin width 2/||w||]

$$\min_{w \in \mathbb{R}^d} \;\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_{\mathrm{hinge}}\big(y_i\, w^T x_i\big)$$
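To make the objective concrete, here is a minimal numpy sketch (not from the slides) that evaluates this regularized hinge-loss objective; the names `svm_primal`, `X`, `y`, and `lam` are illustrative.

```python
import numpy as np

def svm_primal(w, X, y, lam):
    """Regularized hinge-loss SVM objective:
    (lam/2)*||w||^2 + (1/n) * sum_i max(0, 1 - y_i * w^T x_i)."""
    margins = y * (X @ w)                        # y_i * w^T x_i for all i
    hinge = np.maximum(0.0, 1.0 - margins)       # hinge loss per example
    return 0.5 * lam * np.dot(w, w) + hinge.mean()

# tiny usage example with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(rng.normal(size=100))
w = np.zeros(5)
print(svm_primal(w, X, y, lam=0.01))             # equals 1.0 at w = 0
```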

Many methods exist for solving this objective:
- Descent algorithms and line search methods
- Acceleration, momentum, and conjugate gradients
- Newton and quasi-Newton methods
- Coordinate descent
- Stochastic and incremental gradient methods
- SMO
- SVMlight
- LIBLINEAR


Machine Learning Workflow
- DATA & PROBLEM: classification, regression, collaborative filtering, …
- MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …
- OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton's method, …
- SYSTEMS SETTING: multi-core, cluster, cloud, supercomputer, …

Open Problem: efficiently solving the objective when the data is distributed


Distributed Optimization

“always communicate”: reduce $w = w - \alpha \sum_k \Delta w$
  ✔ convergence guarantees
  ✗ high communication

“never communicate”: average $w := \frac{1}{K} \sum_k w_k$ [ZDWJ, 2012]
  ✔ low communication
  ✗ convergence not guaranteed
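The two extremes above can be sketched in a few lines of numpy. This is an illustrative single-process simulation (the helpers `svm_subgrad`, `always_communicate`, and `never_communicate` are hypothetical, not the methods benchmarked later), assuming each element of `shards` is one machine's `(X_k, y_k)` block.

```python
import numpy as np

def svm_subgrad(w, X, y, lam):
    """Subgradient of (lam/2)||w||^2 + (1/n_local) * sum_i hinge(y_i w^T x_i)."""
    margins = y * (X @ w)
    active = margins < 1.0                               # margin violators
    g_loss = -(X[active] * y[active, None]).sum(axis=0) / len(y)
    return lam * w + g_loss

def always_communicate(shards, lam, alpha, iters):
    """Reduce every iteration: w = w - alpha * sum_k Δw_k."""
    w = np.zeros(shards[0][0].shape[1])
    for _ in range(iters):
        w = w - alpha * sum(svm_subgrad(w, Xk, yk, lam) for Xk, yk in shards)
    return w

def never_communicate(shards, lam, alpha, iters):
    """Each machine optimizes locally, then average once: w := (1/K) sum_k w_k."""
    w_locals = []
    for Xk, yk in shards:
        wk = np.zeros(Xk.shape[1])
        for _ in range(iters):
            wk = wk - alpha * svm_subgrad(wk, Xk, yk, lam)
        w_locals.append(wk)
    return np.mean(w_locals, axis=0)                     # one-shot average
```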


Mini-batch: a natural middle-ground

reduce: $w = w - \frac{\alpha}{|b|} \sum_{i \in b} \Delta w$

✔ convergence guarantees
✔ tunable communication
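For completeness, here is a sketch of the mini-batch reduce step above, again on the hinge-loss SVM; all names are illustrative and the per-example update is a plain subgradient contribution, not the exact methods compared later.

```python
import numpy as np

def minibatch_step(w, X, y, lam, alpha, batch_size, rng):
    """One mini-batch reduce: w = w - (alpha/|b|) * sum_{i in b} Δw_i,
    where Δw_i = lam*w + subgradient of hinge(y_i w^T x_i)."""
    b = rng.choice(len(y), size=batch_size, replace=False)
    margins = y[b] * (X[b] @ w)
    active = margins < 1.0
    # sum of per-example updates over the batch
    delta = batch_size * lam * w - (X[b][active] * y[b][active, None]).sum(axis=0)
    return w - (alpha / batch_size) * delta

# usage: w = minibatch_step(w, X, y, lam=0.01, alpha=0.1, batch_size=64,
#                           rng=np.random.default_rng(0))
```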

Are we done?


Mini-batch Limitations

1. METHODS BEYOND SGD

2. STALE UPDATES

3. AVERAGE OVER BATCH SIZE


Mini-batch Limitations, and how CoCoA addresses them:
1. METHODS BEYOND SGD: use a primal-dual framework
2. STALE UPDATES: immediately apply local updates
3. AVERAGE OVER BATCH SIZE: average over K << batch size


Communication-Efficient Distributed Dual Coordinate Ascent (CoCoA)


1. Primal-Dual Framework

PRIMAL ≥ DUAL

$$\min_{w \in \mathbb{R}^d} \left[ P(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_i(w^T x_i) \right]$$

$$\max_{\alpha \in \mathbb{R}^n} \left[ D(\alpha) := -\frac{\lambda}{2}\|A\alpha\|^2 - \frac{1}{n}\sum_{i=1}^{n} \ell_i^*(-\alpha_i) \right], \qquad A_i = \frac{1}{\lambda n}\, x_i$$

- Stopping criteria given by the duality gap
- Good performance in practice
- Default in software packages, e.g. liblinear
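As a concrete instance, here is a small numpy sketch (illustrative, not from the talk) of the primal, the dual, and the duality-gap stopping criterion for the hinge-loss SVM, using the fact that the hinge conjugate satisfies ℓ*_i(−α_i) = −y_i α_i on its domain (0 ≤ y_i α_i ≤ 1).

```python
import numpy as np

def duality_gap(alpha, X, y, lam):
    """P(w(alpha)) - D(alpha) for the hinge-loss SVM, with w(alpha) = A @ alpha
    and A_i = x_i / (lam * n). Assumes alpha is dual-feasible: 0 <= y_i*alpha_i <= 1."""
    n = len(y)
    w = X.T @ alpha / (lam * n)                                   # w = A alpha
    primal = 0.5 * lam * (w @ w) + np.maximum(0.0, 1.0 - y * (X @ w)).mean()
    dual = -0.5 * lam * (w @ w) + (y * alpha).mean()              # -l*_i(-a_i) = y_i*a_i
    return primal - dual    # always >= 0; usable as a stopping criterion
```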


2. Immediately Apply Updates

STALE:
  for i ∈ b:
    Δw ← Δw − α ∇_i P(w)
  end
  w ← w + Δw

FRESH:
  for i ∈ b:
    Δw ← Δw − α ∇_i P(w)
    w ← w + Δw
  end
  Output: Δα_[k] and Δw := A_[k] Δα_[k]
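The difference between the two loops can be sketched generically; `grad_i(w, i)` is a hypothetical callback returning the i-th (sub)gradient/update direction at the current w.

```python
import numpy as np

def batch_update_stale(w, batch, grad_i, alpha):
    """All updates are computed at the same stale w, then applied together."""
    delta = np.zeros_like(w)
    for i in batch:
        delta -= alpha * grad_i(w, i)        # every step sees the old w
    return w + delta                         # w <- w + Δw

def batch_update_fresh(w, batch, grad_i, alpha):
    """Each update is applied immediately, so later steps see the fresh w."""
    for i in batch:
        w = w - alpha * grad_i(w, i)         # step i uses the already-updated w
    return w
```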

3. Average over K

reduce: $w = w + \frac{1}{K}\sum_k \Delta w_k$


CoCoA

Algorithm 1: CoCoA
  Input: T ≥ 1, scaling parameter 1 ≤ β_K ≤ K (default: β_K := 1)
  Data: {(x_i, y_i)}_{i=1}^n distributed over K machines
  Initialize: α^(0)_[k] ← 0 for all machines k, and w^(0) ← 0
  for t = 1, 2, …, T
    for all machines k = 1, 2, …, K in parallel
      (Δα_[k], Δw_k) ← LocalDualMethod(α^(t−1)_[k], w^(t−1))
      α^(t)_[k] ← α^(t−1)_[k] + (β_K / K) Δα_[k]
    end
    reduce: w^(t) ← w^(t−1) + (β_K / K) Σ_{k=1}^{K} Δw_k
  end

Procedure A: LocalDualMethod: dual algorithm on machine k
  Input: local α_[k] ∈ R^{n_k}, and w ∈ R^d consistent with the other coordinate blocks of α, s.t. w = Aα
  Data: local {(x_i, y_i)}_{i=1}^{n_k}
  Output: Δα_[k] and Δw := A_[k] Δα_[k]

✔ <10 lines of code in Spark
✔ primal-dual framework allows for any internal optimization method
✔ local updates applied immediately
✔ average over K
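The following is a minimal single-process numpy simulation of Algorithm 1, using H steps of SDCA on the hinge loss as the LocalDualMethod (the closed-form coordinate update from Shalev-Shwartz & Zhang). It is a sketch, not the Spark implementation from the talk; all names (`shards`, `local_sdca`, `cocoa`, …) are illustrative.

```python
import numpy as np

def local_sdca(alpha_k, w, X_k, y_k, lam, n, H, rng):
    """A possible LocalDualMethod: H iterations of SDCA on machine k's block
    of the hinge-loss SVM dual. Returns (Δalpha_k, Δw) with Δw = A_[k] @ Δalpha_k."""
    alpha_k, w = alpha_k.copy(), w.copy()
    delta_alpha = np.zeros_like(alpha_k)
    delta_w = np.zeros_like(w)
    for _ in range(H):
        i = rng.integers(len(y_k))
        x, yi = X_k[i], y_k[i]
        q = (x @ x) / (lam * n)
        if q == 0.0:
            continue
        # closed-form hinge-loss SDCA coordinate step
        da = yi * np.clip((1.0 - yi * (w @ x)) / q + alpha_k[i] * yi, 0.0, 1.0) - alpha_k[i]
        alpha_k[i] += da
        delta_alpha[i] += da
        upd = da * x / (lam * n)              # A_i * da
        w += upd                              # apply the local update immediately
        delta_w += upd
    return delta_alpha, delta_w

def cocoa(shards, d, lam, T, H, beta_K=1.0, seed=0):
    """Outer CoCoA loop: run LocalDualMethod on every shard, then reduce by
    adding (beta_K / K) times each machine's Δw (averaging when beta_K = 1)."""
    rng = np.random.default_rng(seed)
    K = len(shards)
    n = sum(len(yk) for _, yk in shards)
    alphas = [np.zeros(len(yk)) for _, yk in shards]
    w = np.zeros(d)
    for _ in range(T):
        updates = [local_sdca(alphas[k], w, Xk, yk, lam, n, H, rng)
                   for k, (Xk, yk) in enumerate(shards)]      # "in parallel"
        for k, (da, _) in enumerate(updates):
            alphas[k] += (beta_K / K) * da
        w = w + (beta_K / K) * sum(dw for _, dw in updates)   # reduce
    return w, alphas
```

Each element of `shards` is one machine's `(X_k, y_k)` block; the last line of the outer loop mirrors the reduce step w^(t) = w^(t−1) + (β_K/K) Σ_k Δw_k.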


Convergence

$$\mathbb{E}\big[D(\alpha^*) - D(\alpha^{(T)})\big] \le \left(1 - (1-\Theta)\,\frac{1}{K}\,\frac{\lambda n \gamma}{\sigma + \lambda n \gamma}\right)^{T} \Big(D(\alpha^*) - D(\alpha^{(0)})\Big)$$

Assumptions: the losses $\ell_i$ are $(1/\gamma)$-smooth, and LocalDualMethod makes $\Theta$ improvement per step; e.g. for SDCA, $\Theta = \left(1 - \frac{\lambda n \gamma}{1 + \lambda n \gamma}\,\frac{1}{\tilde n}\right)^{H}$, with $\tilde n$ the number of local data points and $H$ the number of local iterations.

Here $\sigma$ is a measure of the difficulty of the data partition, $0 \le \sigma \le n/K$. The bound applies also to the duality gap.

✔ it converges!
✔ convergence rate matches current state-of-the-art mini-batch methods
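To get a feel for the bound, the short sketch below plugs in purely illustrative parameter values (none of them come from the talk) and computes the per-round contraction factor and the number of outer rounds T it predicts; since it uses the worst-case partition measure σ = n/K, the prediction is an upper bound and far more pessimistic than the empirical behavior in the next section.

```python
import math

# Illustrative values only (not from the talk)
n, K, H = 500_000, 8, 100_000          # examples, machines, local SDCA steps
lam, gamma = 1e-4, 1.0                 # regularizer, smoothness of the losses
sigma = n / K                          # worst-case partition-difficulty measure
n_local = n / K                        # data points per machine

theta = (1 - (lam * n * gamma) / (1 + lam * n * gamma) / n_local) ** H   # SDCA quality
rate = 1 - (1 - theta) / K * (lam * n * gamma) / (sigma + lam * n * gamma)

eps0, eps = 1.0, 1e-6                  # initial and target dual suboptimality
T = math.ceil(math.log(eps / eps0) / math.log(rate))
print(f"Theta = {theta:.3f}, per-round contraction = {rate:.6f}, rounds T <= {T}")
```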


Empirical Results in Spark

Dataset    Training (n)   Features (d)   Sparsity   Workers (K)
Cov        522,911        54             22.22%     4
Rcv1       677,399        47,236         0.16%      8
Imagenet   32,751         160,000        100%       32

[Figure: log primal suboptimality vs. time (s), one panel per dataset]
- Imagenet: COCOA (H=1e3), mini-batch-CD (H=1), local-SGD (H=1e3), mini-batch-SGD (H=10)
- Cov: COCOA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e5), batch-SGD (H=1)
- RCV1: COCOA (H=1e5), mini-batch-CD (H=100), local-SGD (H=1e4), batch-SGD (H=100)


COCOA Take-Aways
- A framework for distributed optimization
- Uses state-of-the-art primal-dual methods
- Reduces communication: applies updates immediately, averages over # of machines
- Strong empirical results & convergence guarantees

To appear in NIPS '14
www.berkeley.edu/~vsmith
