Crossbow: Scaling Deep Learning on Multi-GPU Servers
Peter Pietzuch, with Alexandros Koliousis, Luo Mai, Pijika Watcharapichat, Matthias Weidlich, Paolo Costa
Imperial College London – http://lsds.doc.ic.ac.uk – [email protected]
CASTOR Software Days – Stockholm, Sweden – October 2019
Large-Scale Data & Systems Group
Machine Learning with Deep Neural Networks (DNN)
Revolutionised solutions in vision, speech recognition, …
DNN models are trained by giving examples (instead of programming)
[Figure: DNNs map audio to words ("hello audience"), text to topics, and images to labels]
When DNN output is wrong, tweak its parameters
Training DNNs
• Obtain DNN model that minimises classification error
• Use Stochastic Gradient Descent (SGD) for training:
  1. Begin with random model
  2. Consider mini-batch of training data
  3. Iteratively calculate gradients & update model parameters w
[Figure: error vs. model parameters w – SGD converges from a random model towards the optimal parameters with the lowest error]
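The three steps above form a short mini-batch SGD loop. A minimal NumPy sketch for a linear least-squares model (the model, data, and hyper-parameter values are illustrative placeholders, not the DNNs in this talk):

```python
import numpy as np

def sgd_train(X, y, batch_size=32, lr=0.01, steps=1000):
    """Mini-batch SGD for a linear model y ~ X @ w with squared loss."""
    rng = np.random.default_rng(seed=0)
    w = rng.normal(size=X.shape[1])                  # 1. begin with a random model
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size)    # 2. draw a mini-batch
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / batch_size     # 3. gradient of the loss...
        w -= lr * grad                               # ...and parameter update
    return w

# toy usage: recover a ground-truth weight vector from noisy samples
X = np.random.rand(1000, 10)
w_true = np.random.rand(10)
y = X @ w_true + 0.01 * np.random.randn(1000)
w_hat = sgd_train(X, y)
```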
Training DNNs on GPUs
• GPUs are good at parallelising gradient computation
Training DNNs in Parallel with GPUs
• With large datasets, speed up by calculating gradients on multiple GPUs
• Every GPU has a model replica with a copy of the model parameters (or weights)
[Figure: a mini-batch of training data is split into shards 1–3; GPUs 1–3 each compute a gradient against their local model replica]
But model replicas would diverge over time…
Model Synchronisation among GPUs
• Parameter server: Maintains global model
[Figure: GPUs 1–3 each hold a local model replica and send gradients to the global model]
• GPUs:
  1. Send gradients to update global model
  2. Synchronise local model replicas with global model
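A minimal single-process sketch of this parameter-server protocol (the gradients are random stand-ins for GPU-side computation; this is not Crossbow's or TensorFlow's implementation):

```python
import numpy as np

class ParameterServer:
    """Holds the global model; replicas push gradients and pull fresh weights."""
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)
        self.lr = lr

    def push(self, grad):
        self.w -= self.lr * grad      # 1. gradient updates the global model

    def pull(self):
        return self.w.copy()          # 2. local replica syncs with global model

ps = ParameterServer(dim=10)
replicas = [ps.pull() for _ in range(3)]          # one replica per GPU
for w_local in replicas:
    grad = np.random.rand(10)                     # stand-in for a GPU's gradient
    ps.push(grad)
replicas = [ps.pull() for _ in range(3)]          # all replicas match the global model again
```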
The Problem with Large Batch Sizes
Training with large mini-batches is bad for your health.
More importantly, it’s bad for your test error.
Friends don’t let friends use mini-batches larger than 32.
Yann LeCun @ylecun
2:00 PM – 26 Apr 2018
Why Use Large Batch Sizes?
[Figure: a batch, a bigger batch, an even bigger batch drawn from the dataset; the per-batch gradients are averaged to update the weights]
Keep work per GPU constant to scale
E.g. ~32 to 256 labelled images
What is the Best Batch Size on a GPU?
• ResNet-32 on NVIDIA Titan X GPU
[Chart: time to accuracy (sec) with TensorFlow vs. batch size b (32–1024) – batch size 32 is slowest (~1134 sec); the best time (~302 sec) is reached at an intermediate batch size]
Training DNNs Favours Small Batch Sizes
We want frequent, less “noisy” updates
[Figure: small batches give frequent weight updates from a single gradient, while a large batch averages several gradients into one, less frequent update]
Statistical Efficiency Needs Small Batch Sizes
[Chart: test accuracy (%) vs. epochs for batch sizes b=64 to b=4096 – small batch sizes reach higher test accuracy than large batch sizes]
Hardware Efficiency Needs Large Batch Sizes
Keep work per GPU constant → increase batch size with #GPUs
[Figure: one GPU alternates computing a gradient and updating its replica with batches of size ½b, whereas two GPUs each compute a gradient on a batch of size ½b, average the gradients, and update their replicas]
Tension between Hardware & Statistical Efficiency
• Practitioners increase batch size due to hardware efficiency
• But best batch size depends on both hardware & statistical efficiency
Current Practice: Hyper-Parameter Tuning
• Adjust hyper-parameters (e.g. learning rate, momentum) to avoid reduction in statistical efficiency
• Linear scaling rule: "When mini-batch size is multiplied by k, multiply learning rate by k"
  – Goyal et al. (2017); see the snippet below
• Drawbacks
  – Manual, labour-intensive process
  – Highly model specific – not portable and does not work for some models
  – Less effective for very large batch sizes…
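As a snippet (hypothetical helper function; the rule itself is from Goyal et al., 2017):

```python
def scaled_learning_rate(base_lr, base_batch, batch):
    """Linear scaling rule: when the mini-batch size is multiplied by k,
    multiply the learning rate by k (Goyal et al., 2017)."""
    return base_lr * (batch / base_batch)

# e.g. a base rate of 0.1 at batch size 256 scales to 0.1 * 32 = 3.2 at batch size 8,192
lr = scaled_learning_rate(0.1, 256, 8192)
```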
Limits of Hyper-Parameter Tuning
“When mini-batch size is multiplied by k, multiply learning rate by k”
[Chart: ResNet-50, 32 images/GPU – top-1 validation error (%) vs. mini-batch size (256 to 65,536); validation error rises sharply beyond a mini-batch size of 8,192]
Fundamental Challenge of GPU Scaling
• “If batch size could be made arbitrarily large while still training effectively, then training is amenable to standard weak scaling approaches. However, if the training rate of some models is restricted to small batch sizes, then we will need to find other algorithmic and architectural approaches to their acceleration.”
– J. Dean, D. Patterson et al., “A New Golden Age in Computer Architecture”, IEEE Micro, 2018
• How to design a deep learning system that scales training with multiple GPUs, even when the preferred batch size is small?
(1) How to increase hardware efficiency with small batches?
(2) How to synchronise model replicas?
(3) How to reduce scheduling & synchronisation overheads?
[Figure: multiple tasks and model replicas per GPU; model synchronisation across GPUs; reusable data buffers and task scheduling]
Problem: Small Batch Sizes Underutilise GPUs
[Chart: CDF (%) of GPU occupancy (%) when training with small batches – GPU resources are under-used]
How to Process Small Batches Efficiently?
One batch per GPU → not enough data and instruction parallelism for every operator
[Figure: with one batch per GPU, per-operation GPU utilisation (%) remains low]
Idea: Train Multiple Model Replicas per GPU
One learning process (or learner) per GPU stream
[Figure: within 1 GPU, a scheduler dispatches small batches to multiple learners, each computing gradients against its own weights on a separate stream]
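A minimal sketch of the idea with CuPy streams (assumes a CUDA GPU and CuPy; the matrix multiply is a stand-in for a learner's forward/backward pass, not Crossbow's code):

```python
import cupy as cp

n_learners = 4
streams  = [cp.cuda.Stream(non_blocking=True) for _ in range(n_learners)]
replicas = [cp.random.rand(1024, 1024) for _ in range(n_learners)]   # one model replica per learner
batches  = [cp.random.rand(1024, 64) for _ in range(n_learners)]     # one small batch per learner

outputs = []
for stream, w, x in zip(streams, replicas, batches):
    with stream:                       # this learner's work is issued on its own stream,
        outputs.append(w @ x)          # so the GPU can overlap several small-batch learners
cp.cuda.Device().synchronize()         # wait for all learners to finish
```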
Effect of Training Multiple Model Replicas per GPU
[Chart: throughput increase (%) vs. batch size b (32–1024) when training multiple replicas per GPU – the under-used resources are regained]
• But now we must synchronise a large number of learners/model replicas...
(1) How to increase efficiency with small batches?
• Train multiple model replicas per GPU
(2) How to synchronise model replicas?
Problem: Why not Synchronous Parallel SGD?
All learners always start from the same point
Limited exploration of parameter space
Idea: Maintain Independent Model Replicas
• Benefits:
  – Increased exploration of space through parallelism
  – Each model replica uses small batch size
[Figure: from the initial weights, replica X's and replica Y's trajectories explore the parameter space around the average model trajectory]
Crossbow: Synchronous Model Averaging
Allow learners to diverge but correct trajectories based on average model
Accelerate average model trajectory with momentum to find minima faster
[Figure: corrections pull each replica's trajectory towards the momentum-accelerated average model trajectory]
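A minimal NumPy sketch of one such synchronous model averaging step (the coefficients, names, and exact update order are simplified assumptions; the Crossbow paper gives the precise algorithm):

```python
import numpy as np

def sma_step(replicas, grads, avg, velocity, lr=0.01, alpha=0.1, momentum=0.9):
    """One synchronous model averaging step over all learners' replicas."""
    # each replica is pulled back towards the current average model
    corrections = [alpha * (w - avg) for w in replicas]
    # learners diverge (their own gradient) but receive a correction
    replicas = [w - lr * g - c for w, g, c in zip(replicas, grads, corrections)]
    # the average model absorbs the corrections and moves with momentum
    velocity = momentum * velocity + sum(corrections)
    avg = avg + velocity
    return replicas, avg, velocity
```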
GPUs with Synchronous Model Averaging
[Figure: GPUs 1–3 each run several learners with their own model replicas; every GPU keeps a reference model, and one GPU additionally maintains the average model]
• Synchronously apply corrections to model replicas
Synchronous model averaging:
• Ensures consistent view of average model
• Takes GPU bandwidth into account during synchronisation
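One way to read the figure: corrections are aggregated within each GPU against its local reference model first, and only per-GPU results cross the slower inter-GPU links to form the average model. A minimal sketch of such two-level averaging (the structure is inferred from the figure, not taken from Crossbow's code):

```python
import numpy as np

def two_level_average(replicas_per_gpu):
    """replicas_per_gpu: one list of model replicas per GPU.
    Average locally first, then across GPUs, so that only one
    model per GPU has to cross the inter-GPU links."""
    reference_models = [np.mean(replicas, axis=0) for replicas in replicas_per_gpu]
    return np.mean(reference_models, axis=0)

# 3 GPUs, 2 learners each, 4-parameter models
gpus = [[np.random.rand(4) for _ in range(2)] for _ in range(3)]
average_model = two_level_average(gpus)
```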
(1) How to increase efficiency with small batches?
• Train multiple model replicas per GPU
(2) How to synchronise model replicas?
• Use synchronous model averaging
(3) How to reduce scheduling and synchronisation overheads?
Crossbow Architecture
[Architecture figure: data pre-processors read the dataset and produce input batches; a task scheduler with ready queues dispatches dataflow tasks to learner and synchronisation streams on GPU 1 and GPU 2, each with its own model replicas; an auto-tuner adjusts the configuration at runtime]
• Implementation: ~24K lines of Java and ~15K lines of C/C++
• Integration with TensorFlow
• github.com/lsds/crossbow
Efficient Task Scheduling
• Execute compute and synchronisation tasks
• Fine-grained concurrency
• Need efficient scheduler to feed all GPUs with tasks
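A minimal sketch of such a scheduler, with Python threads and queues standing in for Crossbow's worker threads, GPU streams, and replica pools (illustrative only):

```python
import queue
import threading

task_queue = queue.Queue()        # learning tasks that are ready to run
free_replicas = queue.Queue()     # model replicas not currently in use

def worker(gpu_id):
    """Repeatedly pair the next task with a free replica and run it."""
    while True:
        task = task_queue.get()
        if task is None:                       # shutdown signal
            break
        replica = free_replicas.get()          # take a replica from the pool
        task(gpu_id, replica)                  # compute gradient / synchronise
        free_replicas.put(replica)             # replica is ready again

for _ in range(4):
    free_replicas.put({})                      # placeholder model replicas

workers = [threading.Thread(target=worker, args=(g,)) for g in range(2)]
for t in workers:
    t.start()
for i in range(8):                             # enqueue a few dummy tasks
    task_queue.put(lambda gpu, rep, i=i: print(f"task {i} ran on GPU {gpu}"))
for _ in workers:
    task_queue.put(None)                       # stop the workers
for t in workers:
    t.join()
```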
[Figure: data pre-processors (multiple threads) read the dataset and fill input batches; the task scheduler takes dataflow tasks from ready queues and assigns them to model replicas and learner streams, guided by the auto-tuner]
Interleaving Compute & Synchronisation Tasks
[Figure: task schedule for batches N, N+1, N+2 on GPU 1 and GPU 2 – each learner computes a gradient, updates its replica, and applies an elastic-average correction; gradients are averaged and the device model is updated; replica and synchronisation queues are created on the fly so that a replica is scheduled for the next batch as soon as it is ready; worker threads, a monitor, and mapped memory feed data from the dataset]
Auto-Tuning the Number of Model Replicas
• Monitor training throughput
• Dynamically adjust number of learners
• Uses object pooling & lazy materialization
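A minimal sketch of such a tuning loop (the greedy policy and the callback names are illustrative assumptions, not Crossbow's exact auto-tuner):

```python
def autotune_learners(measure_throughput, add_learner, remove_learner,
                      max_learners=32):
    """Greedily add learners per GPU while measured throughput keeps improving."""
    learners = 1
    best = measure_throughput()
    while learners < max_learners:
        add_learner()                 # replica comes from a pre-allocated pool
        learners += 1
        current = measure_throughput()
        if current <= best:           # no further improvement: back off and stop
            remove_learner()
            learners -= 1
            break
        best = current
    return learners
```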
[Figure: the auto-tuner monitors the pipeline between the data pre-processors and the task manager, adding or removing model replicas and learner streams at runtime]
Experimental Evaluation
Does Crossbow Train Effectively with Small Batch Sizes?
Multiple learners per GPU improve hardware efficiency
[Chart: time to test accuracy (sec) – TensorFlow with b=64 and b=256 vs. Crossbow with b=64 (1 learner) and b=64 (4 learners)]
• ResNet-32 with ImageNet dataset on 1 Titan X GPU
[Chart: time to 53% test accuracy, TTA(53%) in sec, for TensorFlow, Crossbow (m=1), and Crossbow]
Does Crossbow Scale to Multiple GPUs?
• ResNet-50 with ImageNet dataset on 8 Titan X GPUs
Synchronous Model Averaging improves statistical efficiency
[Chart: time to test accuracy (sec) vs. number of GPUs for ResNet-50]
Does Crossbow Train Effectively Across Models?
Training with multiple learners is always better than training with large batches
[Chart: speed-up over TensorFlow on 1 GPU and 8 GPUs for LeNet, ResNet-32, VGG, ResNet-50, and ResNet-101 – speed-ups range from 1.2× to 4.3×]
• ResNet-50 with ImageNet dataset on Titan X GPUs
What is the Statistical Efficiency with Many Learners?
[Chart: test accuracy (%) vs. epochs for m=1 to m=32 learners]
Increasing parallelism has less effect on statistical efficiency
Crossbow: Scaling GPU Deep Learning
• Need to make training throughput independent from hyper-parameters
  – Rethink the design of future deep learning systems
• Crossbow: Scaling DNN training with small batch sizes on many GPUs
  – Multiple model replicas per GPU for high hardware efficiency
  – Synchronous model averaging for high statistical efficiency
• Exciting research challenges for next generation deep learning systems
Peter Pietzuch – https://lsds.doc.ic.ac.uk – [email protected]
Thank You — Any Questions?
github.com/lsds/crossbow