
Scaling Distributed Machine Learning
with System and Algorithm Co-design

Mu Li
Thesis Defense, CSD, CMU
Feb 2nd, 2017

Large-scale problems

$$\min_w \sum_{i=1}^n f_i(w)$$

✧ Distributed systems
✧ Large-scale optimization methods

Large Scale Machine Learning

✧ Machine learning learns from data
✧ More data
  ✓ better accuracy
  ✓ can use more complex models

[Figure: accuracy vs. data size, with more complex models]

Ads Click Prediction

✧ Predict whether an ad will be clicked
✧ Each ad impression is an example
✧ Logistic regression
  ✓ A single machine processes 1 million examples per second
✧ A typical industrial-scale problem has
  ✓ 100 billion examples
  ✓ 10 billion unique features

[Figure: training data size (TB) per year, 2010–2014]

Image Recognition

✧ Recognize the object in an image
✧ Convolutional neural networks
✧ A state-of-the-art network has
  ✓ Hundreds of layers
  ✓ Billions of floating-point operations to process a single image

Distributed Computing for Large Scale Problems

✧ Distribute the workload among many machines
✧ Widely available thanks to cloud providers (AWS, GCP, Azure)
✧ Challenges
  ✓ Limited communication bandwidth (10x less than memory bandwidth)
  ✓ Large synchronization cost (1 ms latency)
  ✓ Job failures

[Figure: machines connected through a hierarchy of switches]
[Figure: job failure rate (%) vs. total machine time (hours)]

Distributed Optimization for Large Scale ML

$$\min_w \sum_{i=1}^n f_i(w), \qquad \bigcup_{k=1}^m I_k = \{1, 2, \ldots, n\}$$

✧ The examples are partitioned into m parts; worker k computes the partial gradient $\sum_{i \in I_k} \partial f_i(w_t)$, and the aggregated gradient updates $w_t$ to $w_{t+1}$
✧ Challenges
  ✓ Massive communication traffic
  ✓ Expensive global synchronization
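
To make the data-parallel scheme concrete, here is a minimal, hypothetical NumPy sketch (the names partial_gradient and distributed_sgd_step are illustrative, not the thesis code): each worker sums the gradients over its own partition I_k, the partial gradients are aggregated, and every worker applies the same update.

    import numpy as np

    def partial_gradient(grad_fn, w, partition):
        # Worker k: sum the example gradients over its partition I_k.
        return sum(grad_fn(w, i) for i in partition)

    def distributed_sgd_step(grad_fn, w, partitions, lr):
        # Aggregating the partial gradients stands in for the communication
        # step (an allreduce or a push/pull to a parameter server).
        g = sum(partial_gradient(grad_fn, w, p) for p in partitions)
        return w - lr * g

    # Toy usage: least-squares gradients on random data split over 4 workers.
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
    grad_fn = lambda w, i: (X[i] @ w - y[i]) * X[i]
    w = distributed_sgd_step(grad_fn, np.zeros(5), np.array_split(range(100), 4), lr=0.01)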

Scaling Distributed Machine Learning

✧ Distributed systems
  ✓ Large data size, complex models
  ✓ Fault tolerant
  ✓ Easy to use
✧ Large-scale optimization
  ✓ Communication efficient
  ✓ Convergence guarantee

Scaling Distributed Machine Learning

✧ Distributed systems
  ✓ Parameter Server for machine learning
  ✓ MXNet for deep learning
✧ Large-scale optimization
  ✓ DBPG for non-convex, non-smooth f_i
  ✓ EMSO for efficient minibatch SGD

Existing Open Source Systems in 2012

✧ MPI (Message Passing Interface)
  ✓ Hard to use for sparse problems
  ✓ No fault tolerance
✧ Key-value stores, e.g. Redis
  ✓ Expensive per-key-value-pair communication
  ✓ Difficult to program on the server side
✧ Hadoop/Spark
  ✓ BSP data consistency makes an efficient implementation challenging

Parameter Server Architecture [Smola'10, Dean'12]

✧ Training data is partitioned across worker machines
✧ The model is partitioned across server machines
✧ Workers push gradients to the servers; the servers update the model; workers pull the updated weights back

[Figure: worker machines holding training data exchange push/pull messages with server machines holding the model]
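
As a rough illustration of the push/pull interface (a single-process toy sketch with hypothetical names, not the actual parameter server API): the servers hold model shards keyed by feature ID, workers push (key, gradient) pairs that the server folds into the weights with its update rule, and pull returns the current values.

    from collections import defaultdict

    class ToyServer:
        """Single-process stand-in for a pool of server machines."""
        def __init__(self, lr=0.1):
            self.weights = defaultdict(float)   # model, keyed by feature ID
            self.lr = lr

        def push(self, grads):
            # Fold the pushed (key, gradient) pairs into the model.
            for key, g in grads.items():
                self.weights[key] -= self.lr * g

        def pull(self, keys):
            # Return the current weights for the requested keys.
            return {k: self.weights[k] for k in keys}

    # A worker computes gradients on its data shard, pushes, then pulls.
    server = ToyServer()
    server.push({3: 0.5, 17: -1.2})
    print(server.pull([3, 17]))    # weights after one lr=0.1 update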

Key Features of our Implementation [Li et al, OSDI'14]

✧ Trade off data consistency for speed
  ✓ Flexible consistency models
  ✓ User-defined filters
✧ Fault tolerance with chain replication

Flexible Consistency Model

✧ Each iteration is a task: a gradient computation followed by a push & pull
✧ Flexible consistency models via a task dependency graph
  ✓ Sequential / BSP: task t+1 executes only after task t has finished
  ✓ Eventual / totally asynchronous [Smola 10]: no dependency between tasks
  ✓ Bounded delay / SSP [Langford 09, Cipar 13]: task t may run once every task up to t − τ has finished

[Figure: dependency graphs over tasks 1, 2, 3, ... for the sequential, eventual, and bounded-delay models]
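
A minimal sketch of how a bounded-delay dependency could be enforced (a hypothetical helper, and one possible off-by-one convention: iteration t waits until iteration t − τ − 1 has completed, so τ = 0 recovers sequential/BSP execution and a very large τ behaves asynchronously):

    import threading

    class BoundedDelay:
        """Block iteration t until iteration t - tau - 1 has finished."""
        def __init__(self, tau):
            self.tau = tau
            self.done = set()
            self.cv = threading.Condition()

        def wait_for(self, t):
            # Called before starting iteration t (iterations are 0-indexed).
            with self.cv:
                self.cv.wait_for(lambda: t - self.tau - 1 < 0
                                 or (t - self.tau - 1) in self.done)

        def finished(self, t):
            # Called by the communication layer once iteration t's
            # push & pull have completed.
            with self.cv:
                self.done.add(t)
                self.cv.notify_all()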

User-defined Filters

✧ User-defined encoder/decoder pairs for efficient communication
✧ Lossless compression
  ✓ General data compression: LZ, LZR, ...
✧ Lossy compression
  ✓ Random skip
  ✓ Fixed-point encoding

[Figure: data passes through the encoder on the sending machine and the decoder on the receiving machine]
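
As an illustration of a lossy fixed-point filter (my own sketch, not the library's filter implementation): values in a known range are scaled and rounded to 8-bit integers before transmission and rescaled on receipt, cutting traffic by roughly 4x versus float32 at the cost of quantization error.

    import numpy as np

    def fixed_point_encode(values, clip=1.0, scale=127.0):
        # Clip to [-clip, clip] and map to int8: one byte per value
        # instead of four.
        clipped = np.clip(values, -clip, clip)
        return np.round(clipped / clip * scale).astype(np.int8)

    def fixed_point_decode(encoded, clip=1.0, scale=127.0):
        # Invert the scaling; the rounding and clipping error is not recoverable.
        return encoded.astype(np.float32) / scale * clip

    grad = np.array([0.03, -0.9, 2.5], dtype=np.float32)
    recovered = fixed_point_decode(fixed_point_encode(grad))
    # recovered ≈ [0.031, -0.898, 1.0]; the last entry was clipped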

Fault Tolerance with Chain Replication

✧ The model is partitioned across servers by consistent hashing
✧ Chain replication: a worker's push to server 0 is forwarded to its replica server 1, and the ack flows back along the chain
✧ Option: aggregating pushes from multiple workers before replicating reduces the backup traffic

[Figure: workers 0 and 1 push to server 0, which replicates to server 1; acks return along the chain]
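
A toy sketch of the replication path under the aggregation option (hypothetical classes; real servers are separate processes): the primary aggregates the pushes from its workers, forwards one aggregated update to the backup, and acks the workers only after the backup has acked.

    class Backup:
        def __init__(self):
            self.weights = {}

        def replicate(self, update):
            # Apply the replicated update, then ack the primary.
            for k, v in update.items():
                self.weights[k] = self.weights.get(k, 0.0) + v
            return "ack"

    class Primary:
        def __init__(self, backup):
            self.weights = {}
            self.backup = backup

        def push(self, updates_from_workers):
            # Aggregate first, so a single message is replicated.
            agg = {}
            for upd in updates_from_workers:
                for k, v in upd.items():
                    agg[k] = agg.get(k, 0.0) + v
            for k, v in agg.items():
                self.weights[k] = self.weights.get(k, 0.0) + v
            assert self.backup.replicate(agg) == "ack"
            return "ack"    # workers are acked only after replication succeeds

    primary = Primary(Backup())
    primary.push([{1: 0.2}, {1: 0.3, 7: -0.1}])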

Scaling Distributed Machine Learning

✧ Distributed systems: Parameter Server for machine learning · MXNet for deep learning
✧ Large-scale optimization: DBPG for non-convex non-smooth f_i · EMSO for efficient minibatch SGD

Proximal Gradient Method [Combettes'09]

$$\min_{w \in \Omega} \sum_{i=1}^n f_i(w) + h(w)$$

  ✓ f_i: continuously differentiable but not necessarily convex
  ✓ h: convex but possibly non-smooth

✧ Iterative update

$$w_{t+1} = \mathrm{Prox}_{\eta_t}\!\Big[\, w_t - \eta_t \sum_{i=1}^n \partial f_i(w_t) \Big],
\qquad \text{where } \mathrm{Prox}_\eta(x) := \operatorname*{argmin}_{y \in \Omega}\; h(y) + \frac{1}{2\eta}\,\|x - y\|^2$$
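
For the common special case h(w) = λ‖w‖₁ with Ω = ℝ^d, the proximal operator is soft-thresholding; a minimal NumPy sketch of one iteration (illustrative names, and full_gradient(w) stands for the sum of ∂f_i(w)):

    import numpy as np

    def prox_l1(x, thresh):
        # Proximal operator of thresh * ||.||_1: soft-thresholding.
        return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

    def proximal_gradient_step(w, full_gradient, lr, lam):
        # w_{t+1} = Prox_{lr}[ w_t - lr * sum_i grad f_i(w_t) ]
        return prox_l1(w - lr * full_gradient(w), lr * lam)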

Delayed Block Proximal Gradient [Li et al, NIPS'14]

✧ Algorithm design tailored for the parameter server implementation
  ✓ Update a block of coordinates at a time
  ✓ Allow delay among blocks
  ✓ Use filters during communication
✧ Only 300 lines of code

[Figure: data is partitioned across workers and the model across servers; each update touches one block of coordinates]

Convergence Analysis

✧ Assumptions
  ✓ Block Lipschitz continuity: L_var within a block, L_cor across blocks
  ✓ The delay is bounded by τ
  ✓ Lossy compression such as the random-skip and significantly-modified filters
✧ DBPG converges to a stationary point if the learning rate satisfies

$$\eta_t < \frac{1}{L_{\mathrm{var}} + \tau L_{\mathrm{cor}}}$$

Experiments on Ads Click Prediction

✧ Real dataset used in production
  ✓ 170 billion examples, 65 billion unique features, 636 TB in total
✧ 1000 machines
✧ Sparse logistic regression

$$\min_w \sum_{i=1}^n \log\big(1 + \exp(-y_i \langle x_i, w\rangle)\big) + \lambda \|w\|_1$$

[Figure: time (hours, split into computing and waiting) to reach the same objective value vs. the maximal delay τ ∈ {0, 1, 2, 4, 8, 16}; τ = 0 corresponds to the sequential setting, and the best trade-off is 1.6x faster]
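
A small sketch of this objective on sparse data (SciPy CSR matrix; the function name is illustrative): the smooth part has gradient -Σᵢ yᵢ σ(-yᵢ⟨xᵢ, w⟩) xᵢ, and the L1 term is handled separately by the proximal update.

    import numpy as np
    from scipy import sparse

    def sparse_logreg_loss_grad(w, X, y, lam):
        """Return sum_i log(1 + exp(-y_i <x_i, w>)) + lam * ||w||_1 and the
        gradient of the smooth part. X is a CSR matrix, y is in {-1, +1}."""
        margins = y * (X @ w)                         # y_i <x_i, w>
        loss = np.sum(np.logaddexp(0.0, -margins)) + lam * np.abs(w).sum()
        sigma = 1.0 / (1.0 + np.exp(margins))         # sigmoid(-margin)
        grad_smooth = -(X.T @ (y * sigma))
        return loss, grad_smooth

    # Tiny usage example with random sparse data.
    X = sparse.random(1000, 50, density=0.05, format="csr", random_state=0)
    y = np.where(np.random.default_rng(0).random(1000) > 0.5, 1.0, -1.0)
    loss, g = sparse_logreg_loss_grad(np.zeros(50), X, y, lam=0.1)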

Filters to Reduce Communication Traffic

✧ Key caching
  ✓ Cache the feature IDs (keys) on both sender and receiver, so only the values need to be resent
✧ Data compression
✧ KKT filter
  ✓ Shrink gradients to 0 based on the KKT condition

[Figure: server and worker traffic relative to the baseline; key caching cuts traffic by about 2x on both, compression by about 40x on the server and 2.5x on the worker, and the KKT filter by about 40x on the server and 12x on the worker]
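
As a hedged sketch of the idea behind a KKT-style filter for the L1-regularized objective (my own simplification, not the exact production heuristic): a coordinate whose weight is zero and whose gradient lies safely inside [-λ, λ] already satisfies the optimality (KKT) condition, so its gradient need not be sent.

    import numpy as np

    def kkt_filter(grad, w, lam):
        # Keep only the coordinates that can change the solution: nonzero
        # weights, or zero weights whose gradient violates |g| <= lam.
        active = (w != 0) | (np.abs(grad) > lam)
        keys = np.nonzero(active)[0]
        return keys, grad[keys]          # sparse (key, value) message to push

    grad = np.array([0.02, -0.8, 0.3])
    w = np.array([0.0, 0.5, 0.0])
    keys, vals = kkt_filter(grad, w, lam=0.4)   # only coordinate 1 is sent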

Scaling Distributed Machine Learning

✧ Distributed systems: Parameter Server for machine learning · MXNet for deep learning
✧ Large-scale optimization: DBPG for non-convex non-smooth f_i · EMSO for efficient minibatch SGD

Deep Learning is Unique

✧ Complex workloads
✧ Heterogeneous computing
✧ Easy-to-use programming interface

[Figure: search trend for "deep learning" over the past 5 years]

Key Features of MXNet [Chen et al, NIPS'15 workshop] (corresponding author)

✧ Easy-to-use front-end
  ✓ Mixed programming
✧ Scalable and efficient back-end
  ✓ Computation and memory optimization
  ✓ Auto-parallelization
  ✓ Scaling to multiple GPUs/machines

Mixed Programming

✧ Imperative programming is flexible
  ✓ e.g. NumPy, Matlab, Torch, ...
  ✓ Good for updating and interacting with the neural network

    import mxnet as mx
    a = mx.nd.zeros((100, 50))
    b = mx.nd.ones((100, 50))
    c = a * b
    print(c)
    c += 1

✧ Declarative programs are easy to optimize
  ✓ e.g. TensorFlow, Theano, Caffe, ...
  ✓ Good for defining the neural network

    import mxnet as mx
    net = mx.symbol.Variable('data')
    net = mx.symbol.FullyConnected(data=net, num_hidden=128)
    net = mx.symbol.SoftmaxOutput(data=net)
    model = mx.module.Module(net)
    model.forward(data=c)
    model.backward()

Back-end System

✧ Front-end programs (both imperative and declarative) are handed to the back-end as a dependency graph
✧ Optimization
  ✓ Memory optimization
  ✓ Operator fusion and runtime compilation
✧ Scheduling
  ✓ Auto-parallelization

[Figure: the imperative snippet becomes the graph a, b → * → c → +1, and the declarative snippet becomes data → fullc (weight, bias) → softmax; the back-end optimizes and schedules the combined graph]

Scale to Multiple GPU Machines

✧ Bandwidth hierarchy
  ✓ 63 GB/s: 4 x PCIe 3.0 16x between the GPUs and the PCIe switch
  ✓ 15.75 GB/s: PCIe 3.0 16x between the PCIe switch and the CPU
  ✓ 1.25 GB/s: 10 Gbit Ethernet between the machine and the network switch
✧ Hierarchical parameter server
  ✓ Workers on the GPUs, level-1 servers on the CPUs, and level-2 servers across the network
✧ Only 1000 lines of code

[Figure: four GPUs behind a PCIe switch connect to the CPU and then to the network switch; the hierarchical parameter server aggregates within a machine before communicating across machines]
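
A rough sketch of the two-level aggregation (my own simplification with hypothetical names, not the MXNet kvstore code): gradients from the GPUs inside a machine are first summed over the fast PCIe links (level 1), and only one message per machine crosses the slow Ethernet link to the level-2 server.

    import numpy as np

    def level1_aggregate(gpu_grads):
        # Sum the gradients of all GPUs inside one machine (PCIe traffic).
        return np.sum(gpu_grads, axis=0)

    def level2_aggregate(machine_grads):
        # Sum one message per machine (Ethernet traffic).
        return np.sum(machine_grads, axis=0)

    # 4 machines x 8 GPUs, each holding a gradient of dimension 10.
    grads = np.random.default_rng(0).normal(size=(4, 8, 10))
    total = level2_aggregate([level1_aggregate(g) for g in grads])
    assert np.allclose(total, grads.sum(axis=(0, 1)))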

Experiment Setup

✧ Dataset with 1.2 million images and 1000 classes
✧ ResNet 152-layer model
✧ EC2 p2.8xlarge
  ✓ 8 K80 GPUs per machine
✧ Minibatch SGD with synchronized updating
  ✓ Draw a random set of examples I_t at iteration t
  ✓ Update

$$w_{t+1} = w_t - \frac{\eta_t}{|I_t|} \sum_{i \in I_t} \partial f_i(w_t)$$
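
The update above as a minimal NumPy sketch (illustrative names; grad_fn(w, i) stands for ∂f_i(w)):

    import numpy as np

    def minibatch_sgd_step(w, grad_fn, n, batch_size, lr, rng):
        # Draw a random minibatch I_t and apply the averaged gradient.
        batch = rng.choice(n, size=batch_size, replace=False)
        g = sum(grad_fn(w, i) for i in batch) / batch_size
        return w - lr * g

    # Usage: w = minibatch_sgd_step(w, grad_fn, n, 256, 0.1, np.random.default_rng(0))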

Communication Cost

✧ Fix the number of GPUs per machine

[Figure: time (sec) per batch vs. total number of GPUs (up to 128), for 1, 2, 4, and 8 GPUs per machine]

Scalability

[Figure: time (sec) per batch vs. number of GPUs (up to 128) for batch sizes of 4, 8, and 16 per GPU, with the communication cost shown for reference; the largest configuration reaches a 115x speedup]

Convergence

✧ Increase the learning rate by 5x
✧ Increase the learning rate by 10x, and decrease it at epochs 50 and 80

[Figure: top-1 validation accuracy vs. number of epochs (up to 120) for batch sizes 256, 2,560, and 5,120]

Scaling Distributed Machine Learning

✧ Distributed systems: Parameter Server for machine learning · MXNet for deep learning
✧ Large-scale optimization: DBPG for non-convex non-smooth f_i · EMSO for efficient minibatch SGD

Minibatch SGD

✧ Large batch size b
  ✓ Better parallelization within a batch
  ✓ Less switching/communication cost
✧ Small batch size b
  ✓ Faster convergence: rate O(1/√N + b/N), where N is the number of examples processed

[Figure: as the batch size b grows, system performance gets better while the convergence rate gets worse]

Motivation [Li et al, KDD'14]

✧ Improve the convergence rate for large batch sizes
  ✓ Example variance decreases with batch size
  ✓ Solve a more "accurate" optimization subproblem over each batch

[Figure: the batch-size trade-off between system performance and convergence rate]

Efficient Minibatch SGD

✧ Define $f_{I_t}(w) := \sum_{i \in I_t} f_i(w)$. Minibatch SGD solves a first-order approximation plus a conservative penalty:

$$w_t = \operatorname*{argmin}_{w \in \Omega}\; f_{I_t}(w_{t-1}) + \langle \partial f_{I_t}(w_{t-1}),\, w - w_{t-1}\rangle + \frac{1}{2\eta_t}\|w - w_{t-1}\|_2^2$$

✧ EMSO instead solves the subproblem with the exact objective at each iteration:

$$w_t = \operatorname*{argmin}_{w \in \Omega}\; f_{I_t}(w) + \frac{1}{2\eta_t}\|w - w_{t-1}\|_2^2$$

✧ For convex f_i, choosing $\eta_t = O(b/\sqrt{N})$ gives EMSO a convergence rate of $O(1/\sqrt{N})$, compared to $O(1/\sqrt{N} + b/N)$ for minibatch SGD
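
A minimal sketch of one EMSO iteration (my own illustration, not the thesis implementation): the proximal subproblem over the minibatch is solved approximately by a few inner gradient steps on f_{I_t}(w) + ‖w - w_{t-1}‖² / (2η_t), instead of taking a single first-order step.

    import numpy as np

    def emso_step(w_prev, batch_grad, eta, inner_steps=10, inner_lr=0.05):
        """Approximately solve argmin_w f_It(w) + ||w - w_prev||^2 / (2*eta),
        where batch_grad(w) returns the gradient of f_It at w."""
        w = w_prev.copy()
        for _ in range(inner_steps):
            g = batch_grad(w) + (w - w_prev) / eta   # gradient of the subproblem
            w = w - inner_lr * g
        return w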

Experiment

✧ Ads click prediction with a fixed run time
✧ Extended to deep learning in [Keskar et al, arXiv'16]

[Figure: objective value after the fixed run time vs. batch size (1e3 to 1e5) for SGD and EMSO, on a single machine and on 10 machines]

Large-scale problems

$$\min_w \sum_{i=1}^n f_i(w)$$

✧ Distributed systems
✧ Large-scale optimization
✧ Co-design to reduce the communication cost
  ✓ Communicate less
  ✓ Message compression
  ✓ Relaxed data consistency

With appropriate computational frameworks and algorithm design, distributed machine learning can be made simple, fast, and scalable, both in theory and in practice.

Acknowledgement

✧ Advisors: QQ & Alex
✧ Committee members
✧ And 13 other collaborators

Backup slides


Scaling to 16 GPUs in a Single Machine

✧ Communication dominates
✧ 15x speedup

[Figure: time (sec) per batch vs. number of GPUs (up to 16) for batch sizes of 2, 4, 8, and 16 per GPU, with the communication cost shown for reference]

Compare to an L-BFGS Based System

[Figure: objective value vs. time (hours, 0.1 to 10) for System A and the Parameter Server, and the computing vs. waiting time breakdown for both systems]

Sections not Covered

✧ AdaDelay: model the actual delay for asynchronous SGD
✧ Parsa: data placement to reduce communication cost
✧ Difacto: large-scale factorization machines
