
Page 1:

Scaling Deep Learning on Multi-GPU Servers

Peter Pietzuch (with Alexandros Koliousis, Luo Mai, Pijika Watcharapichat, Matthias Weidlich, Paolo Costa)

Imperial College London

http://lsds.doc.ic.ac.uk – <[email protected]>

SICS – Sweden – September 2018

Large-Scale Data & Systems Group

Page 2:

Deep Neural Networks (DNN) at Scale

[Figure: speech recognition maps audio data to “Hello”; image classification maps image data to “Person”.]

Train DNNs with a lot of data


Page 3:

Training DNNs with Stochastic Gradient Descent

• Goal: Obtain DNN model that minimises classification error

• Stochastic Gradient Descent (SGD):
  – Consider mini-batch of training data
  – Iteratively calculate gradients and update model parameters w

[Figure: error as a function of model parameters w; starting from random parameters, SGD converges towards the optimal parameters with the lowest error.]
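As a concrete rendering of the update loop described above, here is a minimal NumPy sketch of mini-batch SGD; the linear model, squared loss, and learning rate are illustrative assumptions, not the networks used in the talk.

```python
import numpy as np

def sgd(X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])                 # random initial parameters
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            b = order[start:start + batch_size]     # one mini-batch
            err = X[b] @ w - y[b]                   # prediction error
            grad = X[b].T @ err / len(b)            # gradient of the squared loss
            w -= lr * grad                          # update model parameters w
    return w

# Usage on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10)
w = sgd(X, y)
```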


Page 4:

The Problem With Large Batch Sizes

• “Training with large mini-batches is bad for your health. More importantly, it’s bad for your test error. Friends don’t let friends use mini-batches larger than 32.”

  – Y. LeCun, @ylecun, April 26, 2018


Page 5:

Scaling DNN Training on GPUs

• Manager: Need a highly-accurate image classification model ASAP

• Highly-Paid Data Scientist: OK. I’ll use ResNet-50. It will take ½ month on 1 GPU

• M: Throw more hardware at the problem! Need this sooner.

• HPDS: …


Page 6:

Parallel DNN Training

• With large training datasets, speed up by calculating gradients in parallel

[Figure: the mini-batch is split into Shard 1, Shard 2, and Shard 3; GPU 1, GPU 2, and GPU 3 each compute a gradient against their own model replica. Model replicas diverge over time.]
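The data-parallel pattern in the figure can be sketched as follows; NumPy arrays stand in for GPUs, and the model, data, and shard sizes are illustrative assumptions. Because each replica applies only its own local updates here, the replicas drift apart over time, which is the divergence problem noted above.

```python
import numpy as np

def local_step(w, X, y, lr=0.01):
    grad = X.T @ (X @ w - y) / len(X)   # gradient on this shard's mini-batch
    return w - lr * grad                # local update, no synchronisation

rng = np.random.default_rng(0)
X, y = rng.normal(size=(3000, 10)), rng.normal(size=3000)
shards = np.array_split(np.arange(len(X)), 3)        # one data shard per "GPU"
replicas = [np.zeros(10) for _ in shards]            # replicas start identical

for step in range(100):
    for g, shard in enumerate(shards):
        batch = rng.choice(shard, size=32, replace=False)   # local mini-batch
        replicas[g] = local_step(replicas[g], X[batch], y[batch])

print(np.std(replicas, axis=0).mean())               # > 0: the replicas have diverged
```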

Page 7:

Synchronisation Among GPUs

• Parameter server: Maintains global model

[Figure: GPU 1, GPU 2, and GPU 3 each hold a model replica and exchange gradients with the global model held by the parameter server.]

• GPUs:
  1. Send gradients to update the global model
  2. Synchronise local model replicas with the global model
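A minimal sketch of this parameter-server exchange, assuming a toy NumPy model and synthetic worker batches (nothing specific to the talk's systems):

```python
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)                        # global model
        self.lr = lr

    def push(self, gradients):
        # Step 1: apply the averaged worker gradients to the global model
        self.w -= self.lr * np.mean(gradients, axis=0)

    def pull(self):
        # Step 2: workers resynchronise their replicas with the global model
        return self.w.copy()

def worker_gradient(w, rng):
    X, y = rng.normal(size=(32, 10)), rng.normal(size=32)
    return X.T @ (X @ w - y) / len(X)                 # gradient on a local batch

rng = np.random.default_rng(0)
ps = ParameterServer(dim=10)
replicas = [ps.pull() for _ in range(3)]              # one replica per GPU

for step in range(100):
    grads = [worker_gradient(w, rng) for w in replicas]
    ps.push(grads)                                    # 1. send gradients
    replicas = [ps.pull() for _ in replicas]          # 2. sync replicas
```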


Page 8:

What is the Best Batch Size?

• ResNet-32 on Titan X GPU


[Chart: time to accuracy (sec) vs. batch size b for TensorFlow; reported times of 1134, 361, 302, 354, 379, and 445 seconds across b = 32, 64, 128, 256, 512, and 1024.]

Page 9:

Intuition for Small Batch Sizes

• Practical considerations:
  – More frequent model updates minimise the bias to initial conditions faster (i.e. the initial weights are forgotten faster)

• Theoretical considerations:
  – Small batch sizes better explore the wider minima, which are known to exhibit better test accuracy


Page 10:

Statistical Efficiency Needs Small Batch Sizes


[Chart: test accuracy vs. epochs (0–140) for batch sizes b = 64, 128, 256, 512, 1024, 2048, and 4096.]

Page 11:

Hardware Efficiency Needs Large Batch Sizes

Keep work per GPU constant => scale batch size with #GPUs


[Figure: timeline for GPU 1 and GPU 2; each GPU computes a gradient on batch N, the gradients are aggregated (blocking) and averaged, each replica is updated, and only then do both GPUs proceed to batch N+1.]
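The synchronous step in the figure can be sketched as follows; the per-GPU batch size b stays fixed, so the effective batch grows to b × #GPUs. The NumPy stand-ins and the toy gradient are assumptions, not the real training setup.

```python
import numpy as np

def synchronous_step(replicas, batches, lr=0.01):
    # Each "GPU" computes a gradient on its own local batch of fixed size b
    grads = [X.T @ (X @ w - y) / len(X) for w, (X, y) in zip(replicas, batches)]
    avg = np.mean(grads, axis=0)             # blocking aggregation across GPUs
    return [w - lr * avg for w in replicas]  # identical update on every replica

b, num_gpus = 32, 4                          # effective batch size: 32 * 4 = 128
rng = np.random.default_rng(0)
replicas = [np.zeros(10) for _ in range(num_gpus)]
batches = [(rng.normal(size=(b, 10)), rng.normal(size=b)) for _ in range(num_gpus)]
replicas = synchronous_step(replicas, batches)
```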

Page 12:

Hardware & Statistical Efficiency

• “Training with large mini-batches is bad for your health. More importantly, it’s bad for your test error. Friends don’t let friends use mini-batches larger than 32.”
  – Y. LeCun, @ylecun, April 26, 2018

• The only reason practitioners increase batch size is hardware efficiency

• But best batch size depends on both hardware efficiency & statistical efficiency


Page 13:

Limits of Scaling DNN Training

• HPDS: Managed to train it on 100s of GPUs in 1h!

• M: Great! Make it faster. Use as many resources as you need!

• HPDS: Can’t :( Beyond this point statistical efficiency collapses. Actually, I struggled to maintain it up to this point…

• M: …

• The story continues. New hardware becomes available. Communication between workers improves. The latest results train ResNet-50 in just a few minutes. But the problem remains.


Hyper-Parameter Tuning

Page 14:

The Fundamental Challenge of GPU Scaling

• “If batch size could be made arbitrarily large while still training effectively, then training is amenable to standard weak scaling approaches. However, if the training rate of some models is restricted to small batch sizes, then we will need to find other algorithmic and architectural approaches to their acceleration.”
  – J. Dean, D. Patterson, X. […], “The Golden Age”, IEEE Micro

• How to design a system that can scale training with multiple GPUs even when the preferred batch size is small?


Page 15:

Problem: Small Batch Sizes Underutilise GPUs


[Chart: CDF (%) of GPU occupancy (%) during training with a small batch size, highlighting under-used resources.]

Page 16:

Idea: Train Multiple Model Replicas per GPU

• Fully exploits task parallelism on GPU

• But we now need to synchronise a large number of model replicas...

[Chart: throughput increase (%) vs. batch size b (32–1024) when training multiple model replicas per GPU, highlighting the regained resources.]

Page 17:

How to Synchronise Many Model Replicas?

• Synchronised SGD:
  – Average the gradients of multiple workers
  – Start the next training iteration with the average model

• Workers start the next exploration from the same point in weight space


[Figure: replica X’s and replica Y’s trajectories start from the same initial weights; the average gradient trajectory combines them.]

Page 18:

Idea: Sample Multiple Nearby Points in Space

• Synchronisation using Elastic Averaging SGD (EA-SGD)

• Benefits:
  1. Each replica reuses the learning set-up of a small batch size
  2. Increased exploration through parallelism
  3. The average replica helps with the direction to good minima


[Figure: replica X’s and replica Y’s trajectories start from the same initial weights; a correction rule pulls them towards the average model trajectory.]
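One way to write down the elastic-averaging correction (elastic averaging SGD, Zhang et al. 2015) is the hedged NumPy sketch below; the values of alpha, the learning rate, and grad_fn are assumptions rather than the talk’s exact settings.

```python
import numpy as np

def easgd_round(replicas, center, grad_fn, lr=0.01, alpha=0.1):
    # Each replica takes its own SGD step on a small local batch
    replicas = [w - lr * grad_fn(w) for w in replicas]
    # Elastic correction: replicas are pulled towards the average (centre) model,
    # and the centre model is pulled towards the replicas
    diffs = [w - center for w in replicas]
    replicas = [w - alpha * d for w, d in zip(replicas, diffs)]
    center = center + alpha * sum(diffs)
    return replicas, center

# Tiny usage example with a quadratic loss, purely illustrative
grad_fn = lambda w: 2 * (w - 1.0)                   # gradient of ||w - 1||^2
replicas = [np.random.default_rng(i).normal(size=5) for i in range(4)]
center = np.zeros(5)
for _ in range(200):
    replicas, center = easgd_round(replicas, center, grad_fn)
```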

Page 19:

When To Apply Corrections?

• Synchronously apply corrections to model replicas

• Other techniques:
  – Control the impact of the average model using momentum
  – Auto-tune the number of model replicas


[Figure: as above, the correction rule pulls replica X’s and replica Y’s trajectories towards the average model trajectory.]

Page 20:

Crossbow: Multi-GPU Deep Learning System

(1) Train multiple models in parallel
(2) Synchronous hierarchical model averaging
(3) Efficient fine-grained task engine

• Execute concurrent tasks to leverage GPU concurrency
• Schedule fine-grained compute & synchronisation tasks
• Minimise the amount of data transfer among different GPUs

[Figure: on each GPU, tasks train multiple model replicas using reusable data buffers (buffer-1 to buffer-4, some currently in use); the replicas on GPU-1 and GPU-2 are combined into an average model.]

Page 21:

GPU Parallelism with Multiple Model Replicas

• On each GPU, use different streams for training tasks
• Task: series of operations based on the computation of model layers
  – A task is associated with (a) a model replica and (b) a batch of training data

[Figure: each task (task1–task3) combines a model replica (model1–model3) with a batch of data (data1–data3) and is dispatched to its own stream (stream 1–3) on the GPU.]
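A minimal PyTorch-style sketch of dispatching independent training tasks to separate CUDA streams on one GPU, in the spirit of the figure; the toy nn.Linear model and batch shapes are assumptions, not Crossbow’s actual engine.

```python
import torch
import torch.nn.functional as F

device = torch.device("cuda")
replicas = [torch.nn.Linear(128, 10).to(device) for _ in range(3)]   # model1..model3
optims = [torch.optim.SGD(m.parameters(), lr=0.01) for m in replicas]
streams = [torch.cuda.Stream() for _ in replicas]                    # stream 1..3

def train_task(model, opt, x, y):
    # One learning task: forward, backward, and update for one replica
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

x = torch.randn(32, 128, device=device)            # a small batch (data1..data3)
y = torch.randint(0, 10, (32,), device=device)

for model, opt, stream in zip(replicas, optims, streams):
    with torch.cuda.stream(stream):                # dispatch the task to its own stream
        train_task(model, opt, x, y)
torch.cuda.synchronize()                           # wait for all streams to finish
```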


Page 22:

Synchronous Hierarchical Elastic Model Averaging

• Intra-GPU, cross-GPU, and CPU-GPU communication have different performance characteristics

• Crossbow uses hierarchical aggregation to avoid bottlenecks

[Figure: model replicas are first synchronised within each GPU (GPU-1, GPU-2) and then synchronised across GPUs into an average model.]
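A sketch of the two-level averaging shown above: replicas on the same GPU are averaged first (cheap, local memory), and only the per-GPU results are averaged across GPUs. Plain NumPy arrays stand in for device buffers; the shapes and counts are assumptions.

```python
import numpy as np

def hierarchical_average(replicas_per_gpu):
    # Stage 1: intra-GPU average, one local reduction per device
    per_gpu = [np.mean(replicas, axis=0) for replicas in replicas_per_gpu]
    # Stage 2: cross-GPU average over the much smaller set of partial results
    return np.mean(per_gpu, axis=0)

rng = np.random.default_rng(0)
replicas_per_gpu = [[rng.normal(size=10) for _ in range(4)]   # 4 replicas per GPU
                    for _ in range(2)]                        # 2 GPUs
average_model = hierarchical_average(replicas_per_gpu)
```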


Page 23:

CrossBow Task Execution Engine


[Figure: task engine timeline for GPU 1 and GPU 2 over batches N, N+1, and N+2; worker threads (W) schedule the next available replica from the dataset, while per-GPU replica queues and synchronisation queues hold fine-grained tasks such as “compute gradient”, “update replica”, “elastic average”, “average gradients”, and “update device model”; results pass through mapped memory, a monitor and auto-tuner create new replica queues on the fly, and a finished replica is immediately ready to compute another gradient.]

• Many small tasks for (1) replica training and (2) synchronisation

• CrossBow has a GPU/CPU task engine for multiplexing & overlapping small tasks
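As a rough illustration of such a fine-grained task engine, here is a pure-Python sketch in which worker threads pull small “compute gradient” and “synchronise” tasks from a queue so that they can be multiplexed and overlapped; the task bodies are placeholders, not CrossBow’s real GPU tasks.

```python
import queue
import threading

task_queue = queue.Queue()

def worker():
    # Worker thread: repeatedly pull the next small task and run it
    while True:
        task = task_queue.get()
        if task is None:          # shutdown signal
            break
        task()                    # e.g. compute a gradient or average replicas
        task_queue.task_done()

threads = [threading.Thread(target=worker, daemon=True) for _ in range(4)]
for t in threads:
    t.start()

for step in range(8):
    # Interleave training and synchronisation tasks so they can overlap
    task_queue.put(lambda s=step: print(f"compute gradient for batch {s}"))
    task_queue.put(lambda s=step: print(f"synchronise replicas after batch {s}"))

task_queue.join()                 # wait until all queued tasks have run
for _ in threads:
    task_queue.put(None)          # stop the workers
```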

Page 24:

CrossBow: Benefit of Synchronous Model Averaging


[Chart: time to 69% test accuracy (seconds) vs. number of GPUs (1, 2, 4, 8) for TensorFlow, CrossBow with 1 replica per GPU, and CrossBow.]

• VGG-16 with Cifar-100 dataset on Titan X GPUs

Page 25:

CrossBow: Statistical Efficiency with Many Models


[Chart: test accuracy (%) vs. epochs (0–140) for m = 1, 2, 4, 8, 16, and 32 model replicas.]

Increasing parallelism does not collapse convergence, while each replica reuses the same learning set-up

• ResNet-50 with ImageNet dataset on Titan X GPUs

Page 26:

Summary: Scaling Deep Learning on Multi-GPU Servers

• Need to make training throughput independent of hyper-parameters
  – The current need for hyper-parameter tuning is too complex
  – Need new designs for deep learning systems

• Crossbow: Scaling DNN training with small batch sizes on many GPUs
  – Multiple model replicas per GPU for high hardware efficiency
  – Synchronous hierarchical model averaging for high statistical efficiency
  – Requires a new CPU/GPU task engine design


Peter Pietzuch – https://lsds.doc.ic.ac.uk – [email protected]

Thank You — Any Questions?
