Scaling Distributed Machine Learning with System and Algorithm Co-design

Mu Li, Thesis Defense
CSD, CMU, Feb 2nd, 2017


Page 1

Scaling Distributed Machine Learning
with System and Algorithm Co-design

Mu Li
Thesis Defense
CSD, CMU, Feb 2nd, 2017

Page 2

2

\min_w \sum_{i=1}^n f_i(w)

Large-scale problems

✧ Distributed systems
✧ Large scale optimization methods

Page 3

Large Scale Machine Learning

✧ Machine learning learns from data
✧ More data
  ✓ better accuracy
  ✓ can use more complex models

3

[Figure: accuracy vs. data size, with curves for more complex models]

Page 4

Ads Click Prediction

✧ Predict if an ad will be clicked
✧ Each ad impression is an example
✧ Logistic regression
  ✓ Single machine processes 1 million examples per second

4

Page 5

Ads Click Prediction

✧ Predict if an ad will be clicked
✧ Each ad impression is an example
✧ Logistic regression
  ✓ Single machine processes 1 million examples per second
✧ A typical industrial size problem has
  ✓ 100 billion examples
  ✓ 10 billion unique features

5

[Figure: training data size (TB) by year, 2010-2014]

Page 6

6

Image Recognition

✧ Recognize the object in an image
✧ Convolutional neural network
✧ A state-of-the-art network
  ✓ Hundreds of layers
  ✓ Billions of floating-point operations for processing a single image

Page 7

7

Distributed Computing for Large Scale Problems

✧ Distribute workload among many machines
✧ Widely available thanks to cloud providers (AWS, GCP, Azure)

[Diagram: machines connected through a hierarchy of switches]

Page 8

8

Distributed Computing for Large Scale Problems

✧ Distribute workload among many machines
✧ Widely available thanks to cloud providers (AWS, GCP, Azure)
✧ Challenges
  ✓ Limited communication bandwidth (10x less than memory bandwidth)
  ✓ Large synchronization cost (1ms latency)
  ✓ Job failures

[Diagram: machines connected through a hierarchy of switches]
[Figure: job failure rate (%) vs. machine time (100-10,000 hours)]

Page 9

Distributed Optimization for Large Scale ML

9

\bigcup_{i=1}^m I_i = \{1, 2, \dots, n\}

[Diagram: machine k computes the partial gradient \sum_{i \in I_k} \partial f_i(w_t); the partial gradients are aggregated to update w_t \to w_{t+1}]

✧ Challenges
  ✓ Massive communication traffic
  ✓ Expensive global synchronization

\min_w \sum_{i=1}^n f_i(w)
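To make this concrete, here is a minimal single-process simulation of the data-parallel scheme above, assuming a least-squares f_i purely for illustration; the partition, toy data, and step size are all hypothetical:

import numpy as np

def grad_fi(w, x, y):
    # Gradient of one least-squares term f_i(w) = 0.5 * (<x, w> - y)^2
    return (x @ w - y) * x

rng = np.random.default_rng(0)
n, d, m = 12, 4, 3                      # n examples, d features, m machines
X, Y = rng.normal(size=(n, d)), rng.normal(size=n)
w = np.zeros(d)

# Partition the example indices {0, ..., n-1} across the m machines
parts = np.array_split(np.arange(n), m)

eta = 0.01
for t in range(200):
    # Each "machine" k computes a partial gradient over its part I_k ...
    partial = [sum(grad_fi(w, X[i], Y[i]) for i in I) for I in parts]
    # ... and the partial gradients are aggregated into the update
    w = w - eta * sum(partial)          # w_t -> w_{t+1}
print(w)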

Page 10

10

Scaling Distributed Machine Learning

Distributed Systems
  ✓ Large data size, complex models
  ✓ Fault tolerant
  ✓ Easy to use

Large Scale Optimization
  ✓ Communication efficient
  ✓ Convergence guarantee

Page 11

11

Scaling Distributed Machine Learning

Distributed Systems
  ✓ Parameter Server for machine learning
  ✓ MXNet for deep learning

Large Scale Optimization
  ✓ DBPG for non-convex non-smooth f_i
  ✓ EMSO for efficient minibatch SGD

Page 12

12

Scaling Distributed Machine Learning

Distributed Systems
  ✓ Parameter Server for machine learning
  ✓ MXNet for deep learning

Large Scale Optimization
  ✓ DBPG for non-convex non-smooth f_i
  ✓ EMSO for efficient minibatch SGD

Page 13

Existing Open Source Systems in 2012

13

✧ MPI (message passing interface)
  ✓ Hard to use for sparse problems
  ✓ No fault tolerance
✧ Key-value stores, e.g. Redis
  ✓ Expensive individual key-value pair communication
  ✓ Difficult to program on the server side
✧ Hadoop/Spark
  ✓ BSP data consistency makes efficient implementation challenging

Page 14

Parameter Server Architecture

14

[Diagram: worker machines hold the training data; they push gradients to and pull the model from server machines, which update the model] [Smola'10, Dean'12]

Page 15

Key Features of our Implementation

✧ Trade off data consistency for speed
  ✓ Flexible consistency models
  ✓ User-defined filters
✧ Fault tolerance with chain replication

15

[Li et al, OSDI’14]


Page 16

Flexible Consistency Model

16

✧ Sequential / BSP: each iteration executes after its dependency finishes
✧ Bounded delay / SSP [Langford 09, Cipar 13]
✧ Eventual / Total asynchronous [Smola 10]: no dependency between iterations

[Diagram: task dependency graphs over iterations 1-5; each iteration is a gradient computation followed by a push & pull]

Flexible models via task dependency graph
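To illustrate, a toy sketch of how a bounded-delay dependency could be checked before an iteration starts; this scheduler is a simplification for exposition, not the parameter server's actual implementation:

def can_start(t, finished, tau):
    # Iteration t may start only once every iteration older than t - tau
    # has finished; tau = 0 recovers sequential/BSP, and an unbounded tau
    # recovers the eventual (totally asynchronous) model.
    return all(s in finished for s in range(max(0, t - tau)))

finished = {0, 1, 3}                    # iterations whose push & pull completed
print(can_start(4, finished, tau=2))    # True: only iterations < 2 must be done
print(can_start(5, finished, tau=2))    # False: iteration 2 has not finished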

Page 17

User-defined Filters

✧ User-defined encoder/decoder for efficient communication
✧ Lossless compression
  ✓ General data compression: LZ, LZR, ...
✧ Lossy compression
  ✓ Random skip
  ✓ Fixed-point encoding

17

[Diagram: data passes through a user-defined encoder on the sending machine and a decoder on the receiving machine for efficient communication]
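As a concrete example of a lossy filter, here is a generic fixed-point encoding sketch that quantizes 32-bit floats to 8-bit integers before transmission; the actual codec used in the system may differ:

import numpy as np

def encode(values, scale=127.0):
    # Lossy fixed-point encoding: clip to [-1, 1] and store as int8,
    # shrinking each entry from 4 bytes to 1 byte on the wire.
    return np.clip(values * scale, -127, 127).astype(np.int8)

def decode(packet, scale=127.0):
    return packet.astype(np.float32) / scale

g = np.array([0.5, -0.01, 0.99], dtype=np.float32)
print(decode(encode(g)))                # approximately [0.496, -0.0079, 0.984]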

Page 18

Fault Tolerance with Chain Replication

✧ Model is partitioned by consistent hashing
✧ Chain replication

18

[Diagram: worker 0 pushes to server 0, which forwards the push to server 1; acks flow back along the chain. With aggregation, pushes from worker 0 and worker 1 are combined before being replicated]

✧ Option: aggregation reduces backup traffic
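A toy single-process sketch of the chain-replication idea, in which a push is acknowledged only after it has been applied and replicated down the chain; the classes and method names are hypothetical simplifications:

class Server:
    def __init__(self, backup=None):
        self.store, self.backup = {}, backup

    def push(self, key, value):
        # Apply the update, replicate it down the chain, then acknowledge;
        # the worker's push is durable once the whole chain has acked.
        self.store[key] = self.store.get(key, 0.0) + value
        if self.backup is not None:
            self.backup.push(key, value)    # server 0 -> server 1
        return "ack"

server1 = Server()
server0 = Server(backup=server1)
assert server0.push("w[3]", 0.25) == "ack"  # worker 0's push
assert server1.store["w[3]"] == 0.25        # replicated copy on the backup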

Page 19

19

Scaling Distributed Machine Learning

Distributed Systems
  ✓ Parameter Server for machine learning
  ✓ MXNet for deep learning

Large Scale Optimization
  ✓ DBPG for non-convex non-smooth f_i
  ✓ EMSO for efficient minibatch SGD

Page 20

Proximal Gradient Method

✓ f_i: continuously differentiable but not necessarily convex
✓ h: convex but possibly non-smooth

20

\min_{w \in \Omega} \sum_{i=1}^n f_i(w) + h(w)

✧ Iterative update [Combettes'09]

w_{t+1} = \mathrm{Prox}_{\eta_t}\left[ w_t - \eta_t \sum_{i=1}^n \partial f_i(w_t) \right]

where

\mathrm{Prox}_\eta(x) := \mathrm{argmin}_{y \in \Omega}\; h(y) + \frac{1}{2\eta} \|x - y\|^2
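For example, with h(w) = λ||w||_1 and Ω = R^d, the proximal operator has the closed-form soft-thresholding solution. A minimal sketch of the resulting proximal gradient loop, using a least-squares gradient as a stand-in for the sum of ∂f_i (data and step size are illustrative):

import numpy as np

def prox_l1(x, eta, lam):
    # Prox_eta(x) for h(w) = lam * ||w||_1: soft-thresholding.
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 10)), rng.normal(size=50)
w, eta, lam = np.zeros(10), 0.005, 0.5
for t in range(300):
    grad = X.T @ (X @ w - y)                # sum_i grad f_i(w)
    w = prox_l1(w - eta * grad, eta, lam)   # w_{t+1} = Prox[w_t - eta * grad]
print(w)                                    # several coordinates are exactly zero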

Page 21

Delayed Block Proximal Gradient

✧ Algorithm design tailored for parameter server implementation
  ✓ Update a block of coordinates each time
  ✓ Allow delay among blocks
  ✓ Use filters during communication
✧ Only 300 lines of code

21

[Li et al, NIPS’14]

[Diagram: data partitioned across workers, model partitioned across servers]
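A simplified serial sketch of the block-wise update, with the delay between blocks and the communication filters omitted; the block layout and step size are illustrative:

import numpy as np

def prox_l1(x, eta, lam):
    # Soft-thresholding, as in the previous sketch
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 10)), rng.normal(size=50)
w, eta, lam = np.zeros(10), 0.005, 0.5
blocks = np.array_split(np.arange(10), 3)   # coordinate blocks

for t in range(100):
    for blk in blocks:                      # DBPG runs these block updates
        g = X.T @ (X @ w - y)               # in parallel, with bounded delay
        w[blk] = prox_l1(w[blk] - eta * g[blk], eta, lam)
print(w)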

Page 22

Convergence Analysis

✧ Assumptions:
  ✓ Block Lipschitz continuity: L_var within a block, L_cor across blocks
  ✓ Delay is bounded by 𝜏
  ✓ Lossy compression such as the random skip and significantly-modified filters
✧ DBPG converges to a stationary point if the learning rate is chosen as

\eta_t < \frac{1}{L_{var} + \tau L_{cor}}

22

Page 23

Experiments on Ads Click Prediction

✧ Real dataset used in production
  ✓ 170 billion examples, 65 billion unique features, 636 TB in total
✧ 1000 machines
✧ Sparse logistic regression

23

[Figure: time (hours) to achieve the same objective value, split into computing and waiting, for maximal delay 𝜏 = 0 (sequential), 1, 2, 4, 8, 16; the best trade-off is 1.6x faster than sequential]

\min_w \sum_{i=1}^n \log(1 + \exp(-y_i \langle x_i, w \rangle)) + \lambda \|w\|_1

Page 24

Filters to Reduce Communication Traffic

✧ Key caching
  ✓ Cache feature IDs on both sender and receiver
✧ Data compression
✧ KKT filter
  ✓ Shrink gradient entries to 0 based on the KKT condition

24

[Figure: communication traffic (%) for servers and workers under baseline, key caching, compression, and the KKT filter; labeled reductions range from 2x and 2.5x up to 12x and 40x]
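One plausible reading of the KKT filter for the l1-regularized objective, sketched here with hypothetical names: at a zero weight, a gradient entry within [-λ, λ] leaves the optimum at zero, so that entry need not be communicated:

import numpy as np

def kkt_filter(w, grad, lam):
    # Keep only coordinates that can actually move the model: at w_j = 0,
    # the subgradient condition |grad_j| <= lam means the optimal w_j
    # stays 0, so that entry is shrunk to 0 and skipped on the wire.
    inactive = (w == 0) & (np.abs(grad) <= lam)
    return np.where(inactive, 0.0, grad)

w = np.array([0.0, 0.0, 0.7])
g = np.array([0.2, 1.5, -0.3])
print(kkt_filter(w, g, lam=0.5))            # [0.  1.5 -0.3]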

Page 25

25

Scaling Distributed Machine Learning

Distributed Systems
  ✓ Parameter Server for machine learning
  ✓ MXNet for deep learning

Large Scale Optimization
  ✓ DBPG for non-convex non-smooth f_i
  ✓ EMSO for efficient minibatch SGD

Page 26

Deep Learning is Unique

✧ Complex workloads

✧ Heterogeneous computing

✧ Easy-to-use programming interface

26

[Figure: "deep learning" trend in the past 5 years]

Page 27

Key Features of MXNet

✧ Easy-to-use front-end
  ✓ Mixed programming
✧ Scalable and efficient back-end
  ✓ Computation and memory optimization
  ✓ Auto-parallelization
  ✓ Scaling to multiple GPUs/machines

27

[Chen et al, NIPS’15 workshop] (corresponding author)

Page 28

28

Mixed Programming

✧ Imperative programming is flexible
  ✓ e.g. NumPy, Matlab, Torch, ...

import mxnet as mx
a = mx.nd.zeros((100, 50))
b = mx.nd.ones((100, 50))
c = a * b
print(c)
c += 1

Good for updating and interacting with the neural network

✧ Declarative programs are easy to optimize
  ✓ e.g. TensorFlow, Theano, Caffe, ...

import mxnet as mx
net = mx.symbol.Variable('data')
net = mx.symbol.FullyConnected(data=net, num_hidden=128)
net = mx.symbol.SoftmaxOutput(data=net)
model = mx.module.Module(net)
model.forward(data=c)
model.backward()

Good for defining the neural network

Page 29

Back-end System

29

Front-end

import mxnet as mx
a = mx.nd.zeros((100, 50))
b = mx.nd.ones((100, 50))
c = a * b
c += 1

import mxnet as mx
net = mx.symbol.Variable('data')
net = mx.symbol.FullyConnected(data=net, num_hidden=128)
net = mx.symbol.SoftmaxOutput(data=net)
texec = mx.module.Module(net)
texec.forward(data=c)
texec.backward()

Back-end

[Diagram: the front-end programs compile into a single dataflow graph (a, b, multiply, +1 producing c; data through fullc and softmax with weight and bias) executed by the back-end]

✧ Optimization
  ✓ Memory optimization
  ✓ Operator fusion and runtime compilation
✧ Scheduling
  ✓ Auto-parallelization

Page 30

Scale to Multiple GPU Machines

30

[Diagram: within a machine, 4 GPUs share a PCIe switch (4x PCIe 3.0 16x, 63 GB/s); the CPU connects over PCIe 3.0 16x (15.75 GB/s); machines connect through a network switch over 10 Gbit Ethernet (1.25 GB/s)]

Hierarchical parameter server
  ✓ GPUs are the workers
  ✓ CPUs act as level-1 servers
  ✓ level-2 servers aggregate across machines

✧ 1000 lines of code
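A sketch of the two-level aggregation this layout suggests: sums within a machine travel over fast PCIe, so only one vector per machine crosses the slow Ethernet link (shapes and counts are illustrative):

import numpy as np

rng = np.random.default_rng(0)
# grads[m][g] = gradient from GPU g on machine m (4 machines x 8 GPUs)
grads = [[rng.normal(size=1000) for _ in range(8)] for _ in range(4)]

# Level 1: each machine's CPU sums its own GPUs' gradients over PCIe,
# so a single vector per machine crosses the 10 GbE link.
level1 = [np.sum(machine, axis=0) for machine in grads]

# Level 2: servers aggregate the per-machine sums across machines.
total = np.sum(level1, axis=0)
print(total.shape)                          # one gradient, as if from all 32 GPUs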

Page 31

Experiment Setup

✧ ImageNet dataset
  ✓ 1.2 million images with 1000 classes
✧ ResNet 152-layer model
✧ EC2 p2.8xlarge
  ✓ 8 K80 GPUs per machine

31

[Diagram: GPUs 0-7 connected to the CPU through PCIe switches]

✧ Minibatch SGD
  ✓ Draw a random set of examples I_t at iteration t
  ✓ Update

w_{t+1} = w_t - \frac{\eta_t}{|I_t|} \sum_{i \in I_t} \partial f_i(w_t)

✧ Synchronous updating
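Written out as a minimal sketch, with a least-squares stand-in for f_i (data and hyperparameters are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.normal(size=1000)
w, eta, b = np.zeros(20), 0.1, 32

for t in range(200):
    I = rng.choice(len(X), size=b, replace=False)   # random minibatch I_t
    grad = X[I].T @ (X[I] @ w - y[I]) / b           # (1/|I_t|) sum of gradients
    w = w - eta * grad                              # synchronous update
print(w[:3])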

Page 32

Communication Cost

✧ Fix #GPUs per machine

32

[Figure: time (sec) vs. number of GPUs (up to 128) for 1, 2, 4, and 8 GPUs/machine]

Page 33

Scalability

33

[Figure: time (sec) vs. number of GPUs (up to 128), showing communication cost and batch size/GPU = 4, 8, 16; 115x speedup]

Page 34

Convergence

✧ Increase learning rate by 5x

✧ Increase learning rate by 10x, decrease it at epochs 50 and 80

34

[Figure: top-1 validation accuracy over 120 epochs for batch sizes 256, 2,560, and 5,120]

Page 35

35

Scaling Distributed Machine Learning

Distributed Systems
  ✓ Parameter Server for machine learning
  ✓ MXNet for deep learning

Large Scale Optimization
  ✓ DBPG for non-convex non-smooth f_i
  ✓ EMSO for efficient minibatch SGD

Page 36

Minibatch SGD

✧ Large batch size b in SGD
  ✓ Better parallelization within a batch
  ✓ Less switching/communication cost
✧ Small batch size b
  ✓ Faster convergence: O(1/\sqrt{N} + b/N), where N is the number of examples processed

36

[Figure: as batch size b grows, system performance gets better but the convergence rate gets worse]
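Plugging numbers into the O(1/sqrt(N) + b/N) bound makes the trade-off visible (constants dropped; purely illustrative):

N = 1_000_000                       # examples processed
for b in (1, 100, 10_000, 1_000_000):
    bound = 1 / N**0.5 + b / N      # the b/N term grows with batch size
    print(f"b={b:>9,}: error bound ~ {bound:.4f}")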

Page 37

Motivation

✧ Improve convergence rate for large batch sizes
  ✓ Example variance decreases with batch size
  ✓ Solve a more "accurate" optimization subproblem over each batch

37

[Figure: as batch size b grows, system performance gets better but the convergence rate gets worse]

[Li et al, KDD'14]

Page 38

Efficient Minibatch SGD

✧ Define f_{I_t}(w) := \sum_{i \in I_t} f_i(w). Minibatch SGD solves

w_t = \mathrm{argmin}_{w \in \Omega}\; \underbrace{f_{I_t}(w_{t-1}) + \langle \partial f_{I_t}(w_{t-1}), w - w_{t-1} \rangle}_{\text{first-order approximation}} + \underbrace{\frac{1}{2\eta_t} \|w - w_{t-1}\|_2^2}_{\text{conservative penalty}}

✧ EMSO solves the subproblem with the exact objective at each iteration

w_t = \mathrm{argmin}_{w \in \Omega}\; f_{I_t}(w) + \frac{1}{2\eta_t} \|w - w_{t-1}\|_2^2

✧ For convex f_i, choose \eta_t = O(b/\sqrt{N}). EMSO has convergence rate O(1/\sqrt{N}) (compared to O(1/\sqrt{N} + b/N))

38
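A sketch contrasting the two updates: the SGD step is the closed form of the approximate subproblem, while the EMSO step approximately minimizes the exact subproblem, here with a few inner gradient steps (the inner solver is an assumption, not necessarily the paper's):

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)

def grad_batch(w, I):
    # Gradient of f_It(w) = sum_{i in I_t} f_i(w), least-squares stand-in
    return X[I].T @ (X[I] @ w - y[I])

def sgd_step(w, I, eta):
    # Closed form of: first-order approximation + conservative penalty
    return w - eta * grad_batch(w, I)

def emso_step(w_prev, I, eta, inner=20, lr=1e-3):
    # Approximately minimize f_It(w) + (1 / (2 * eta)) * ||w - w_prev||^2
    w = w_prev.copy()
    for _ in range(inner):
        w -= lr * (grad_batch(w, I) + (w - w_prev) / eta)
    return w

I = rng.choice(1000, size=200, replace=False)
w0 = np.zeros(10)
print(sgd_step(w0, I, eta=1e-3)[:3])
print(emso_step(w0, I, eta=1e-3)[:3])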

Page 39

Experiment

✧ Ads click prediction with fixed run time

39

Extended to deep learning in [Keskar et al, arXiv'16]

[Figure: objective vs. batch size (1e3 to 1e5) for SGD and EMSO, on a single machine (left) and on 10 machines (right)]

Page 40

40

\min_w \sum_{i=1}^n f_i(w)

Large-scale problems

✧ Distributed systems
✧ Large scale optimization

Co-design to reduce communication cost
  ✓ Communicate less
  ✓ Message compression
  ✓ Relaxed data consistency

With appropriate computational frameworks and algorithm design, distributed machine learning can be made simple, fast, and scalable, both in theory and in practice.

Page 41

Acknowledgement

41

with 13 other collaborators

Advisors

QQ & Alex

Committee members

Page 42

Backup slides

42

Page 43

Scaling to 16 GPUs in a Single Machine

43

[Figure: time (sec) vs. number of GPUs (up to 16) for batch size/GPU = 2, 4, 8, 16, plus communication cost; communication dominates, 15x speedup]

Page 44

Comparison to an L-BFGS Based System

44

[Figure: left, objective vs. time (hours, log scale 0.1-10) for System A and Parameter Server; right, computing vs. waiting time (hours) for each system]

Page 45

Sections not Covered

✧ AdaDelay: model the actual delay for asynchronous SGD
✧ Parsa: data placement to reduce communication cost
✧ Difacto: large scale factorization machines

45