75
PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department of Computer Science University of Texas at Austin Joint work with H.-F. Yu and I. S. Dhillon Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 1 / 29

PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

PASSCoDe: Parallel ASynchronous Stochasticdual Co-ordinate Descent

Cho-Jui Hsieh

Department of Computer ScienceUniversity of Texas at Austin

Joint work with H.-F. Yu and I. S. Dhillon

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 1 / 29

Page 2: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Outline

L2-regularized Empirical Risk Minimization

Dual Coordinate Descent (Hsieh et al., 2008)

Parallel Dual Coordinate Descent (on multi-core machines)

Theoretical Analysis

Experimental Results

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 2 / 29

Page 3: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

L2-regularized ERM

w∗ = arg minw∈Rd

P(w) :=1

2‖w‖2 +

n∑i=1

`i (wTx i )

SVM with hinge loss: `i (zi ) = C max (1− zi , 0)

SVM with squared hinge loss: `i (zi ) = C max (1− zi , 0)2

Logistic regression: `i (zi ) = C log (1 + e−zi )

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 3 / 29

Page 4: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Primal and Dual Formulations

Primal Problem

w∗ = arg minw∈Rd

P(w) :=1

2‖w‖2 +

n∑i=1

`i (wTx i )

Dual Problem

α∗ = arg minα∈Rn

D(α) :=1

2

∥∥∥∥∥n∑

i=1

αix i

∥∥∥∥∥2

+n∑

i=1

`∗i (−αi ),

`∗i (·): the conjugate of `i (·)Primal-Dual Relationship between w∗ and α∗

w∗ = w (α∗) :=n∑

i=1

α∗i x i

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 4 / 29

Page 5: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Coordinate Descent on the Dual Problem

Randomly select an i ∈ {1, . . . , n} and update αi ← αi + δ∗, where

δ∗ = arg minδ

D(α + δe i )

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 5 / 29

Page 6: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Coordinate Descent on the Dual Problem

Randomly select an i ∈ {1, . . . , n} and update αi ← αi + δ∗, where

δ∗ = arg minδ

D(α + δe i )

= arg minδ

1

2

(δ +

(∑n

i=1 αix i )T x i

‖x i‖2

)2

+1

‖x i‖2`∗i (− (αi + δ))

= Ti

( n∑i=1

αix i

)T

x i , αi

Simple univariate problem, but O(nnz) construction time

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 6 / 29

Page 7: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Coordinate Descent on the Dual Problem

Randomly select an i ∈ {1, . . . , n} and update αi ← αi + δ∗, where

δ∗ = arg minδ

D(α + δe i )

= arg minδ

1

2

(δ +

(∑n

i=1 αix i )T x i

‖x i‖2

)2

+1

‖x i‖2`∗i (− (αi + δ))

= Ti

( n∑i=1

αix i

)T

x i , αi

Simple univariate problem, but O(nnz) construction time ⇒ O(ni )

DCD: [Hsieh et al 2008]

Maintain primal variable w =∑n

i=1 αix i and δ∗ = Ti

(wTx i , αi

)O(ni ) construction time: ni = nnz of x i

O(ni ) maintenance cost: w ← w + δ∗x i

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 7 / 29

Page 8: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

Stochastic Dual Coordinate Descent

For t = 1, 2, . . .

1 Randomly pick an index i

2 Compute wTx i

3 Update αi ← αi + δ∗ where δ∗ = Ti (wTx i , αi )

4 Update w ← w + δ∗x i .

Implemented in LIBLINEAR:

Linear SVM (Hsieh et al., 2008), multi-class SVM (Keerthi et al.,2008), Logistic regression (Yu et al., 2011).

Analysis: (Nesterov et al., 2012; Shalev-Shwartz et al., 2013)

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 8 / 29

Page 9: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

DCD step: compute wTx i

0

0

0

0

0

0

0 0 0 0 0 0

0

R1

0

R2

0

R3

0

R4

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 10: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: Load x32 to R2

DCD step: compute wTx i

0

0

0

0

0

0

0 0 0 0 0 0

0

R1

0.2

x32

0

R3

0

R4

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 11: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: Load w2 to R1

DCD step: compute wTx i

0

0

0

0

0

0

0 0 0 0 0 0

0

w2

0.2

x32

0

R3

0

R4

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 12: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: R3 = R3 + R1×R2

DCD step: compute wTx i

0

0

0

0

0

0

0 0 0 0 0 0

0

w2

0.2

x32

0

wT xi

0

R4

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 13: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: Load x34 to R2

DCD step: compute wTx i

0

0

0

0

0

0

0 0 0 0 0 0

0

w2

0.4

x34

0

wT xi

0

R4

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 14: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: Load w4 to R1

DCD step: compute wTx i

0

0

0

0

0

0

0 0 0 0 0 0

0

w4

0.4

x34

0

wT xi

0

R4

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 15: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: R3 = R3 + R1×R2

DCD step: compute wTx i

0

0

0

0

0

0

0 0 0 0 0 0

0

w4

0.4

x34

0

wT xi

0

R4

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 16: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: Load α3 to R2

DCD step: compute δ∗ = Ti

(wTx , αi

)

0

0

0

0

0

0

0 0 0 0 0 0

0

w4

0

α3

0

wT xi

0

R4

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 17: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: R4 = Ti (R2,R3)

DCD step: compute δ∗ = Ti

(wTx , αi

)

0

0

0

0

0

0

0 0 0 0 0 0

0

w4

0

α3

0

wT xi

1

δ∗

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 18: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: R2 = R2 + R4

DCD step: update αi = αi + δ∗

0

0

0

0

0

0

0 0 0 0 0 0

0

w4

1

α3

0

wT xi

1

δ∗

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 19: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: Save R2 to α3

DCD step: update αi = αi + δ∗

0

0

0

0

0

0

0 0 1 0 0 0

0

w4

1

α3

0

wT xi

1

δ∗

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 20: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: Load x32 to R2

DCD step: update w = w + δ∗x i

0

0

0

0

0

0

0 0 1 0 0 0

0

w4

0.2

x32

0

wT xi

1

δ∗

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 21: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: Load w2 to R1

DCD step: update w = w + δ∗x i

0

0

0

0

0

0

0 0 1 0 0 0

0

w2

0.2

x32

0

wT xi

1

δ∗

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 22: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: R1 = R1 + R2×R4

DCD step: update w = w + δ∗x i

0

0

0

0

0

0

0 0 1 0 0 0

0.2

w2

0.2

x32

0

wT xi

1

δ∗

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 23: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: Save R1 to w2

DCD step: update w = w + δ∗x i

0

0.2

0

0

0

0

0 0 1 0 0 0

0.2

w2

0.2

x32

0

wT xi

1

δ∗

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 24: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: Load x34 to R2

DCD step: update w = w + δ∗x i

0

0.2

0

0

0

0

0 0 1 0 0 0

0.2

w2

0.4

x34

0

wT xi

1

δ∗

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 25: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: Load w4 to R1

DCD step: update w = w + δ∗x i

0

0.2

0

0

0

0

0 0 1 0 0 0

0

w4

0.4

x34

0

wT xi

1

δ∗

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 26: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: R1 = R1 + R2×R4

DCD step: update w = w + δ∗x i

0

0.2

0

0

0

0

0 0 1 0 0 0

0.4

w4

0.4

x34

0

wT xi

1

δ∗

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 27: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 3)

CPU2

operation: Save R1 to w4

DCD step: update w = w + δ∗x i

0

0.2

0

0.4

0

0

0 0 1 0 0 0

0.4

w4

0.4

x34

0

wT xi

1

δ∗

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 9 / 29

Page 28: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Parallel DCD in Shared-memory Multi-core System

Serial DCD updates:

For t = 1, 2, . . .

1 Randomly pick an index i

2 Compute wTx i

3 Update αi ← αi + δ∗ where δ∗ = Ti (wTx i , αi )

4 Update w ← w + δ∗x i .

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 10 / 29

Page 29: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Parallel DCD in Shared-memory Multi-core System

Parallel DCD updates:

Each thread repeatedly performs the following updates.For t = 1, 2, . . .

1 Randomly pick an index i

2 Compute wTx i

3 Update αi ← αi + δ∗ where δ∗ = Ti (wTx i , αi )

4 Update w ← w + δ∗x i .

Easy to implement using OpenMP.

Variables α and w stored in shared memory.

Distributed Dual Coordinate Descent:

Each machine has local copy of α,w(Yang, 2013; Jaggi et al, 2014; Lee and Roth, 2015; Ma et al., 2015).

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 11 / 29

Page 30: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Parallel DCD in Shared-memory Multi-core System

Parallel DCD updates:

Each thread repeatedly performs the following updates.For t = 1, 2, . . .

1 Randomly pick an index i

2 Compute wTx i

3 Update αi ← αi + δ∗ where δ∗ = Ti (wTx i , αi )

4 Update w ← w + δ∗x i .

Easy to implement using OpenMP.

Variables α and w stored in shared memory.

Distributed Dual Coordinate Descent:

Each machine has local copy of α,w(Yang, 2013; Jaggi et al, 2014; Lee and Roth, 2015; Ma et al., 2015).

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 11 / 29

Page 31: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Parallel DCD in Shared-memory Multi-core System

Parallel DCD updates:

Each thread repeatedly performs the following updates.For t = 1, 2, . . .

1 Randomly pick an index i

2 Compute wTx i

3 Update αi ← αi + δ∗ where δ∗ = Ti (wTx i , αi )

4 Update w ← w + δ∗x i .

Easy to implement using OpenMP.

Variables α and w stored in shared memory.

Distributed Dual Coordinate Descent:

Each machine has local copy of α,w(Yang, 2013; Jaggi et al, 2014; Lee and Roth, 2015; Ma et al., 2015).

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 11 / 29

Page 32: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Parallel Dual Coordinate Descent: Two Issues forCorrectness

Inconsistent Read of wConflict Write of w

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 12 / 29

Page 33: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Inconsistant Read

Thread 2 reads w andthread 1 writes to wsimultaneously.

There may not exist anyα such thatw =

∑i αix i .

The “bounded delay”analysis in (Liu andWright, 2014) cannot bedirectly applied.

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 13 / 29

Page 34: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Conflict Write

Thread 1 and 2 write tow simultaneously.

Updates to w can beoverwritten, so theconverged solution w andα may be inconsistent:

w 6=∑i

αix i .

CPU1: CPU2:w = w + 0.2 w = w + 0.5OP R1 w OP R2

0 0.0 1.0 0.01 load w 1.0 1.0 load w 1.02 add 0.2 1.2 1.0 add 0.5 1.53 save w 1.2 1.2 1.54 1.2 1.5 save w 1.5

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 14 / 29

Page 35: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

DCD step: compute δ∗ = Ti

(wTx , αi

)

AA

0

0.2

0

0.4

0

0

0 0 1 0 0 0

0

R1

0

R2

0

R3

0

R4

0

R5

0

R6

0

R7

0

R8

w (1)

0

0

0

0

0

0

w (2)

0

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 36: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: R4 = Ti (R2,R3)

DCD step: compute δ∗ = Ti

(wTx , αi

)

AA

0

0.2

0

0.4

0

0

0 0 1 0 0 0

0

R1

0

R2

0

R3

2

δ∗

0

R5

0

R6

0

R7

0

R8

w (1)

0

0

0

0

0

0

w (2)

0

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 37: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Load x11 to R2

DCD step: update w = w + δ∗x i

AA

0

0.2

0

0.4

0

0

0 0 1 0 0 0

0

R1

0.1

x11

0

R3

2

δ∗

0

R5

0

R6

0

R7

0

R8

w (1)

0

0

0

0

0

0

w (2)

0

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 38: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Load w1 to R1

DCD step: update w = w + δ∗x i

AA

0

0.2

0

0.4

0

0

0 0 1 0 0 0

0

w1

0.1

x11

0

R3

2

δ∗

0

R5

0

R6

0

R7

0

R8

w (1)

0

0

0

0

0

0

w (2)

0

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 39: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: R1 = R1 + R2×R4

DCD step: update w = w + δ∗x i

operation: Load x61 to R6

DCD step: compute wTx i

AA

0

0.2

0

0.4

0

0

0 0 1 0 0 0

0.2

w1

0.1

x11

0

R3

2

δ∗

0

R5

0.1

x61

0

R7

0

R8

w (1)

0

0

0

0

0

0

w (2)

0

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 40: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Save R1 to w1

DCD step: update w = w + δ∗x i

operation: Load w1 to R5

DCD step: compute wTx i

AA

0.2

0.2

0

0.4

0

0

0 0 1 0 0 0

0.2

w1

0.1

x11

0

R3

2

δ∗

0.2

w1

0.1

x61

0

R7

0

R8

w (1)

0

0

0

0

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 41: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Load x12 to R2

DCD step: update w = w + δ∗x i

operation: R7 = R7 + R5×R6

DCD step: compute wTx i

AA

0.2

0.2

0

0.4

0

0

0 0 1 0 0 0

0.2

w1

0.2

x12

0

R3

2

δ∗

0.2

w1

0.1

x61

0.02

wT xi

0

R8

w (1)

0

0

0

0

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 42: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Load w2 to R1

DCD step: update w = w + δ∗x i

operation: Load x66 to R6

DCD step: compute wTx i

AA

0.2

0.2

0

0.4

0

0

0 0 1 0 0 0

0.2

w2

0.2

x12

0

R3

2

δ∗

0.2

w1

0.6

x66

0.02

wT xi

0

R8

w (1)

0

0.2

0

0

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 43: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: R1 = R1 + R2×R4

DCD step: update w = w + δ∗x i

operation: Load w6 to R5

DCD step: compute wTx i

Inconsistent Read of w

0.2

0.2

0

0.4

0

0

0 0 1 0 0 0

0.6

w2

0.2

x12

0

R3

2

δ∗

0

w6

0.6

x66

0.02

wT xi

0

R8

w (1)

0

0.2

0

0

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 44: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Save R1 to w2

DCD step: update w = w + δ∗x i

operation: R7 = R7 + R5×R6

DCD step: compute wTx i

Inconsistent Read of w

0.2

0.6

0

0.4

0

0

0 0 1 0 0 0

0.6

w2

0.2

x12

0

R3

2

δ∗

0

w6

0.6

x66

0.02

wT xi

0

R8

w (1)

0

0.2

0

0

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 45: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Load x13 to R2

DCD step: update w = w + δ∗x i

AA

operation: Load α6 to R6

DCD step: compute δ∗ = Ti

(wTx , αi

)

0.2

0.6

0

0.4

0

0

0 0 1 0 0 0

0.6

w2

0.3

x13

0

R3

2

δ∗

0

w6

0

α6

0.02

wT xi

0

R8

w (1)

0

0.2

0

0

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 46: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Load w3 to R1

DCD step: update w = w + δ∗x i

AA

operation: R8 = Ti (R6,R7)

DCD step: compute δ∗ = Ti

(wTx , αi

)

0.2

0.6

0

0.4

0

0

0 0 1 0 0 0

0

w3

0.3

x13

0

R3

2

δ∗

0

w6

0

α6

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 47: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: R1 = R1 + R2×R4

DCD step: update w = w + δ∗x i

AA

DCD step: compute δ∗ = Ti

(wTx , αi

)

0.2

0.6

0

0.4

0

0

0 0 1 0 0 0

0.6

w3

0.3

x13

0

R3

2

δ∗

0

w6

0

α6

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 48: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Save R1 to w3

DCD step: update w = w + δ∗x i

AA

DCD step: compute δ∗ = Ti

(wTx , αi

)

0.2

0.6

0.6

0.4

0

0

0 0 1 0 0 0

0.6

w3

0.3

x13

0

R3

2

δ∗

0

w6

0

α6

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 49: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Load x14 to R2

DCD step: update w = w + δ∗x i

AA

DCD step: compute δ∗ = Ti

(wTx , αi

)

0.2

0.6

0.6

0.4

0

0

0 0 1 0 0 0

0.6

w3

0.4

x14

0

R3

2

δ∗

0

w6

0

α6

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 50: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Load w4 to R1

DCD step: update w = w + δ∗x i

AA

operation: R6 = R6 + R8

DCD step: update αi = αi + δ∗

0.2

0.6

0.6

0.4

0

0

0 0 1 0 0 0

0.4

w4

0.4

x14

0

R3

2

δ∗

0

w6

5

α6

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0.4

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 51: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: R1 = R1 + R2×R4

DCD step: update w = w + δ∗x i

AA

operation: Save R6 to α6

DCD step: update αi = αi + δ∗

0.2

0.6

0.6

0.4

0

0

0 0 1 0 0 5

1.2

w4

0.4

x14

0

R3

2

δ∗

0

w6

5

α6

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0.4

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 52: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Save R1 to w4

DCD step: update w = w + δ∗x i

AA

operation: Load x61 to R6

DCD step: update w = w + δ∗x i

0.2

0.6

0.6

1.2

0

0

0 0 1 0 0 5

1.2

w4

0.4

x14

0

R3

2

δ∗

0

w6

0.1

x61

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0.4

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 53: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Load x15 to R2

DCD step: update w = w + δ∗x i

AA

operation: Load w1 to R5

DCD step: update w = w + δ∗x i

0.2

0.6

0.6

1.2

0

0

0 0 1 0 0 5

1.2

w4

0.5

x15

0

R3

2

δ∗

0.2

w1

0.1

x61

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0.4

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 54: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Load w5 to R1

DCD step: update w = w + δ∗x i

AA

operation: R5 = R5 + R6×R8

DCD step: update w = w + δ∗x i

0.2

0.6

0.6

1.2

0

0

0 0 1 0 0 5

0

w5

0.5

x15

0

R3

2

δ∗

0.7

w1

0.1

x61

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0.4

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 55: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: R1 = R1 + R2×R4

DCD step: update w = w + δ∗x i

AA

operation: Save R5 to w1

DCD step: update w = w + δ∗x i

0.7

0.6

0.6

1.2

0

0

0 0 1 0 0 5

1

w5

0.5

x15

0

R3

2

δ∗

0.7

w1

0.1

x61

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0.4

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 56: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Save R1 to w5

DCD step: update w = w + δ∗x i

AA

operation: Load x66 to R6

DCD step: update w = w + δ∗x i

0.7

0.6

0.6

1.2

1

0

0 0 1 0 0 5

1

w5

0.5

x15

0

R3

2

δ∗

0.7

w1

0.6

x66

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0.4

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 57: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Load x16 to R2

DCD step: update w = w + δ∗x i

Conflict Write of w

operation: Load w6 to R5

DCD step: update w = w + δ∗x i

0.7

0.6

0.6

1.2

1

0

0 0 1 0 0 5

1

w5

0.6

x16

0

R3

2

δ∗

0

w6

0.6

x66

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0.4

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 58: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Load w6 to R1

DCD step: update w = w + δ∗x i

Conflict Write of w

operation: R5 = R5 + R6×R8

DCD step: update w = w + δ∗x i

0.7

0.6

0.6

1.2

1

0

0 0 1 0 0 5

0

w6

0.6

x16

0

R3

2

δ∗

3

w6

0.6

x66

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0.4

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 59: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: R1 = R1 + R2×R4

DCD step: update w = w + δ∗x i

Conflict Write of w

operation: Save R5 to w6

DCD step: update w = w + δ∗x i

0.7

0.6

0.6

1.2

1

3

0 0 1 0 0 5

1.2

w6

0.6

x16

0

R3

2

δ∗

3

w6

0.6

x66

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0.4

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 60: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

operation: Save R1 to w6

DCD step: update w = w + δ∗x i

Conflict Write of w

0.7

0.6

0.6

1.2

1

1.2

0 0 1 0 0 5

1.2

w6

0.6

x16

0

R3

2

δ∗

3

w6

0.6

x66

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0.4

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 61: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Dual Coordinate Descent in Parallel

α :

0.1

0.2

0.3

0.4

0.5

0.6

x1

0.4

0.5

x2

0.2

0.4

x3 w

0.2

0.4

x4

0.2

0.5

x5

0.1

0.6

x6Memory:

Registers:

CPU1(i = 1)

CPU2(i = 6)

AA

0.7

0.6

0.6

1.2

1

1.2

0 0 1 0 0 5

1.2

w6

0.6

x16

0

R3

2

δ∗

3

w6

0.6

x66

0.02

wT xi

5

δ∗

w (1)

0

0.2

0

0.4

0

0

w (2)

0.2

0

0

0

0

0

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 15 / 29

Page 62: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

PASSCoDe-Lock

Each thread repeatedly performs the following updates.For t = 1, 2, . . .

1 Randomly pick an index i

2 Lock {wj | (x i )j 6= 0}3 Compute wTx i

4 Update αi ← αi + δ∗ where δ∗ = Ti (wTx i , αi )

5 Update w ← w + δ∗x i .

6 Unlock the variables.

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 16 / 29

Page 63: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

How to Resolve the Issues

Three PASSCoDe approaches:

lock: acquire locks for all necessary wj before the update

inconsistent read conflict write

PASSCoDe-Lock resolved resolved

Scaling (on rcv1 with 100 epochs):

# threads Lock

2 98.03s / 0.27x

4 106.11s / 0.25x

10 114.43s / 0.23x

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 17 / 29

Page 64: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

PASSCoDe-Atomic

Each thread repeatedly performs the following updates.For t = 1, 2, . . .

1 Randomly pick an index i

2 Compute wTx i

3 Update αi ← αi + δ∗ where δ∗ = Ti (wTx i , αi )

4 For each j ∈ N(i)

5 Update wj ← wj + δ∗(x i )j atomically

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 18 / 29

Page 65: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

How to Resolve the Issues

Three PASSCoDe approaches:

lock: acquire locks for all necessary wj before the update

atomic: apply atomic operation for wj = wj + δ∗xij

inconsistent read conflict write

PASSCoDe-Lock resolved resolvedPASSCoDe-Atomic remained resolved

Scaling (on rcv1 with 100 epochs):

# threads Lock Atomic

2 98.03s / 0.27x 15.28s / 1.75x

4 106.11s / 0.25x 8.35s / 3.20x

10 114.43s / 0.23x 3.86s / 6.91x

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 19 / 29

Page 66: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Analysis for PASSCoDe-Atomic

Atomic operations guarantee:

all updates to w will be performed eventuallyw =

∑ni=1 αix i holds for the outputted (w , α)

Bounded delay assumption: to handle inconsistent read of wall updates of w before τ iterations must be performed

Theorem

Under certain conditions on τ , PASSCoDe− Atomic has global linearconvergence rate in expectation:

E[D(αj+1)− D(α∗)

]≤ ηE

[D(αj)− D(α∗)

]Our analysis covers logistic regression and SVM with hinge loss(where the dual problem is not strictly convex).

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 20 / 29

Page 67: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

PASSCoDe-Wild

Each thread repeatedly performs the following updates.For t = 1, 2, . . .

1 Randomly pick an index i

2 Compute wTx i

3 Update αi ← αi + δ∗ where δ∗ = Ti (wTx i , αi )

4 Update w ← w + δ∗x i

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 21 / 29

Page 68: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

How to Resolve the Issues

Three PASSCoDe approaches:

lock: acquire locks for all necessary wj before the update

atomic: apply atomic operation for wj = wj + δ∗xijwild: do nothing to resolve either issue

inconsistent read conflict write

PASSCoDe-Lock resolved resolvedPASSCoDe-Atomic remained resolvedPASSCoDe-Wild remained remained

Scaling (on rcv1 with 100 epochs):

# threads Lock Atomic Wild

2 98.03s / 0.27x 15.28s / 1.75x 14.08s / 1.90x4 106.11s / 0.25x 8.35s / 3.20x 7.61s / 3.50x

10 114.43s / 0.23x 3.86s / 6.91x 3.59s / 7.43x

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 22 / 29

Page 69: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Analysis for PASSCoDe-Wild

Some updates are missing due to memory conflicts

for the final (w , α):

w 6=n∑

i=1

αix i

construct w from thefinal α:

w =n∑

i=1

αix i

Which one for prediction, w or w?

Prediction Accuracy (%) by# threads w w LIBLINEAR

news204 97.1 96.1

97.18 97.2 93.3

covtype4 67.8 38.0

66.38 67.6 38.0

rcv14 97.7 97.5

97.78 97.7 97.4

webspam4 99.1 93.1

99.18 99.1 88.4

kddb4 88.8 79.7

88.88 88.8 87.7

Question: why w is better than w?

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 23 / 29

Page 70: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Backward Analysis for PASSCoDe-Wild

Recall the primal problem

w∗ = arg minw

P(w) :=1

2‖w‖2 +

n∑i=1

`i

(wTx i

)

Theorem

Let ε be the error caused by the memory conflicts.

w = arg minw

P(w) :=1

2‖w + ε‖2 +

n∑i=1

`i (wTx i )

w = arg minw

P(w) :=1

2‖w‖2 +

n∑i=1

`i

((w − ε) Tx i

)P(w) is the problem with the perturbation on the regularization term

P(w) is the problem with the perturbation on the prediction term

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 24 / 29

Page 71: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Datasets and Experimental Settings

Datasets.

n n d d C

news20 16,000 3,996 1,355,191 455.5 2rcv1 677,399 20,242 47,236 73.2 1webspam 280,000 70,000 16,609,143 3727.7 1kddb 19,264,097 748,401 29,890,095 29.4 1

Compared Implementation.

LIBLINEAR: serial baseline

PASSCoDe-Wild and PASSCoDe-Atomic: our methods

CoCoA: a multi-core version of [Jaggi et al, 2014]

AsySCD: [Liu & Wright, 2014]

Machine: Intel Multi-core machine with 256 GB Memory

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 25 / 29

Page 72: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Convergence in terms of Walltime

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 26 / 29

Page 73: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Accuracy

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 27 / 29

Page 74: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Speedup

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 28 / 29

Page 75: PASSCoDe Parallel ASynchronous Stochastic dual Co-ordinate ...cjhsieh/passcode_icml.pdf · PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent Cho-Jui Hsieh Department

Conclusions

PASSCoDe: an simple but effective asynchronous dual coordinatedescent

Analysis three variants

PASSCoDe-LockPASSCoDe-Atomic: established global linear convergencePASSCoDe-Wild: backward analysis

Future work: extend the analysis to L1-regularized problems

LASSOL1-regularized Logistic Regression

Cho-Jui Hsieh (UT Austin) PASSCoDe July 7, 2015 29 / 29