
Crossbow: Scaling Deep Learning on Multi-GPU Servers


Page 1: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Crossbow: Scaling Deep Learning on Multi-GPU Servers

Peter Pietzuch with Alexandros Koliousis, Luo Mai, Pijika Watcharapichat, Matthias Weidlich, Paolo Costa

Imperial College London – http://lsds.doc.ic.ac.uk – <[email protected]>

CASTOR Software Days – Stockholm, Sweden – October 2019

Large-Scale Data & Systems Group

Page 2: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Machine Learning with Deep Neural Networks (DNN)

• Revolutionised solutions in vision, speech recognition, …
• DNN models are trained by giving examples (instead of programming)

[Figure: example DNN tasks – audio → words ("hello audience"), text → topics, images → labels]

When DNN output is wrong, tweak its parameters

Page 3: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Training DNNs

• Obtain DNN model that minimises classification error

• Use Stochastic Gradient Descent (SGD) for training:

1. Begin with random model
2. Consider mini-batch of training data
3. Iteratively calculate gradients & update model parameters w

[Figure: error vs. model parameters w – starting from a random model, SGD converges towards the optimal parameters with the lowest error]
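The three steps above can be summarised in a short sketch. Below is a minimal Python/NumPy illustration of mini-batch SGD for a hypothetical linear model (not Crossbow code); `X` and `y` are assumed to be the training examples and labels.

```python
import numpy as np

def train_sgd(X, y, batch_size=32, lr=0.01, epochs=10):
    """Minimal mini-batch SGD for a linear least-squares model (illustration only)."""
    n, d = X.shape
    w = 0.01 * np.random.randn(d)                      # 1. begin with a random model
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]    # 2. mini-batch of training data
            error = X[batch] @ w - y[batch]
            grad = X[batch].T @ error / len(batch)     # 3. calculate gradient ...
            w -= lr * grad                             #    ... and update parameters w
    return w
```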

Page 4: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Training DNNs on GPUs

• GPUs are good at parallelising gradient computation


Page 5: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Training DNNs in Parallel with GPUs

• With large datasets, speed up training by calculating gradients on multiple GPUs
• Every GPU has a model replica with a copy of the model parameters (or weights)

[Figure: the training data is split into Shards 1–3; each of GPUs 1–3 holds a model replica and computes a gradient on its own mini-batch of training data]

But model replicas would diverge over time…

Page 6: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Model Synchronisation among GPUs

• Parameter server: Maintains global model

[Figure: GPUs 1–3 each hold a local model; their gradients update a global model maintained by the parameter server]

• GPUs:
  1. Send gradients to update global model
  2. Synchronise local model replicas with global model
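As a rough sketch of the parameter-server pattern on this slide (illustrative Python, not the actual system; `compute_gradient` stands for a hypothetical function that runs forward/backward on the local replica), each GPU worker pushes its gradient to the global model and then pulls the updated weights back:

```python
import numpy as np

class ParameterServer:
    """Holds the global model; workers push gradients and pull weights."""
    def __init__(self, init_weights, lr=0.01):
        self.weights = np.array(init_weights, dtype=float)
        self.lr = lr

    def push_gradient(self, grad):
        self.weights -= self.lr * grad        # 1. gradient updates the global model

    def pull_weights(self):
        return self.weights.copy()            # 2. used to resynchronise a replica

def worker_step(ps, replica_weights, batch, compute_gradient):
    """One step of a worker (one per GPU)."""
    grad = compute_gradient(replica_weights, batch)   # computed on the local replica
    ps.push_gradient(grad)
    return ps.pull_weights()                          # replica now matches global model
```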

Page 7: Crossbow: Scaling Deep Learning on Multi-GPU Servers

The Problem with Large Batch Sizes

Training with large mini-batches is bad for your health.

More importantly, it’s bad for your test error.

Friends don’t let friends use mini-batches larger than 32.

Yann LeCun @ylecun

2:00 PM – 26 Apr 2018

Page 8: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Why Use Large Batch Sizes?

[Figure: a batch, a bigger batch, an even bigger batch drawn from the dataset; per-batch gradients are averaged into a single update of the weights]

Keep work per GPU constant to scale
E.g. ~32 to 256 labelled images

Page 9: Crossbow: Scaling Deep Learning on Multi-GPU Servers

What is the Best Batch Size on a GPU?

• ResNet-32 on NVIDIA Titan X GPU

[Figure: time to accuracy (sec) vs. batch size b (32 to 1024) for TensorFlow; measured times range from 302 to 1134 sec]

Page 10: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Training DNNs Favours Small Batch Sizes

We want frequent, less “noisy” updates

[Figure: with a small batch, a GPU computes a gradient and updates the weights w frequently; with a large batch, several gradients are averaged before a single update of w]

Page 11: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Statistical Efficiency Needs Small Batch Sizes

[Figure: test accuracy (%) vs. epochs for batch sizes b = 64 to 4096 – small batch sizes reach higher test accuracy than large batch sizes]

Page 12: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Hardware Efficiency Needs Large Batch Sizes

Keep work per GPU constant → increase batch size with #GPUs

[Figure: GPUs 1 and 2 each compute gradients on batches of size ½ b, average them, and then update their replicas]
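The equivalence behind this weak-scaling picture can be checked with a small NumPy sketch (an illustration for a simple least-squares model, not Crossbow code): averaging the gradients of two half-batches of size b/2 yields the same update as one batch of size b, which is why the effective batch size grows with the number of GPUs when per-GPU work is held constant.

```python
import numpy as np

def grad(w, X, y):
    """Mean-squared-error gradient of a linear model over one batch."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=3)

g_one_gpu  = grad(w, X, y)                                        # one batch of size b
g_two_gpus = (grad(w, X[:4], y[:4]) + grad(w, X[4:], y[4:])) / 2  # two batches of b/2
assert np.allclose(g_one_gpu, g_two_gpus)                         # identical update
```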

Page 13: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Tension between Hardware & Statistical Efficiency

• Practitioners increase batch size due to hardware efficiency

• But best batch size depends on both hardware & statistical efficiency


Training with large mini-batches is bad for your health.

More importantly, it’s bad for your test error.

Friends don’t let friends use mini-batches larger than 32.

Yann LeCun @ylecun

2:00 PM – 26 Apr 2018


Page 14: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Current Practice: Hyper-Parameter Tuning

• Adjust hyper-parameters (e.g. learning rate, momentum, etc.) to avoid a reduction in statistical efficiency

• Linear scaling rule:"When mini-batch size is multiplied by k, multiply learning rate by k”

• Goyal et al. (2017)

• Drawbacks
  – Manual, labour-intensive process
  – Highly model specific – not portable and does not work for some models
  – Less effective for very large batch sizes…
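The linear scaling rule quoted above amounts to a one-line adjustment. The sketch below is illustrative only (the baseline learning rate and batch size are hypothetical values); Goyal et al. also combine the rule with a gradual warm-up phase, which is omitted here.

```python
def scale_learning_rate(base_lr, base_batch, batch):
    """Linear scaling rule: when the mini-batch size is multiplied by k,
    multiply the learning rate by k = batch / base_batch."""
    return base_lr * (batch / base_batch)

# Example: a baseline of lr = 0.1 at batch size 256, scaled to batch size 2048:
print(scale_learning_rate(0.1, 256, 2048))   # 0.8
```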

Page 15: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Limits of Hyper-Parameter Tuning

“When mini-batch size is multiplied by k, multiply learning rate by k”

[Figure: Top-1 validation error vs. mini-batch size (256 to 65,536) for ResNet-50 with 32 images/GPU; validation error grows once the mini-batch size exceeds 8,192]

Page 16: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Fundamental Challenge of GPU Scaling

• “If batch size could be made arbitrarily large while still training effectively, then training is amenable to standard weak scaling approaches. However, if the training rate of some models is restricted to small batch sizes, then we will need to find other algorithmic and architectural approaches to their acceleration.”

– J. Dean, D. Patterson et al., “A New Golden Age in Computer Architecture”, IEEE Micro, 2018

• How to design a deep learning system that scales training with multiple GPUs, even when the preferred batch size is small?


Page 17: Crossbow: Scaling Deep Learning on Multi-GPU Servers

(1) How to increase hardware efficiency with small batches?

(2) How to synchronise model replicas?

(3) How to reduce scheduling & synchronisation overheads?

[Figure: (1) multiple training tasks sharing model replicas on one GPU; (2) model replicas on GPU-1 and GPU-2 synchronised with a central model; (3) reusable data buffers, some currently in use by tasks]

Page 18: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Problem: Small Batch Sizes Underutilise GPUs

[Figure: CDF (%) of GPU occupancy (%) – much of the time is spent at low occupancy, i.e. under-used resources]

Page 19: Crossbow: Scaling Deep Learning on Multi-GPU Servers

How to Process Small Batches Efficiently?

One batch per GPU → not enough data and instruction parallelism for every operator

[Figure: with one batch per GPU, per-operation GPU utilisation (%) remains low]

Page 20: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Idea: Train Multiple Model Replicas per GPU

One learning process (or learner) per GPU stream

[Figure: one GPU runs multiple learners, each with its own batch, gradient and weights, mapped by a scheduler onto separate GPU streams]
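One way to picture "one learner per GPU stream" is the PyTorch sketch below (a hedged illustration only, not Crossbow's Java/C++ implementation; the replica models and batches are assumed to already live on the GPU). Each learner trains its own small batch on a separate CUDA stream so that kernels from different learners can overlap on a single GPU.

```python
import torch
import torch.nn.functional as F

def multi_learner_step(replicas, batches, lr=0.01):
    """One training step with several model replicas (learners) on one GPU,
    each on its own CUDA stream so their kernels can overlap."""
    streams = [torch.cuda.Stream() for _ in replicas]
    for replica, (x, y), stream in zip(replicas, batches, streams):
        with torch.cuda.stream(stream):
            loss = F.cross_entropy(replica(x), y)    # forward on this learner's batch
            loss.backward()                          # gradients for this replica only
            with torch.no_grad():
                for p in replica.parameters():       # plain SGD update per replica
                    p -= lr * p.grad
                    p.grad = None
    torch.cuda.synchronize()   # wait for all learners before any model averaging
```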

Page 21: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Effect of Training Multiple Model Replicas per GPU

• But now we must synchronise a large number of learners/model replicas…

[Figure: throughput increase (%) from training multiple model replicas per GPU, for batch sizes 32 to 1024 – regained resources]

Page 22: Crossbow: Scaling Deep Learning on Multi-GPU Servers

(1) How to increase efficiency with small batches?

(2) How to synchronise model replicas?

[Figure: (1) multiple training tasks and model replicas on one GPU; (2) model replicas on GPU 1 and GPU 2 synchronised with a central model]

• Train multiple model replicas per GPU

Page 23: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Problem: Why not Synchronous Parallel SGD?

All learners always start from the same point
Limited exploration of parameter space

Page 24: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Idea: Maintain Independent Model Replicas

• Benefits:
  – Increased exploration of space through parallelism
  – Each model replica uses small batch size

[Figure: trajectories of replica X and replica Y from the initial weights, alongside the average model trajectory]

Page 25: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Crossbow: Synchronous Model Averaging

• Allow learners to diverge, but correct trajectories based on average model
• Accelerate average model trajectory with momentum to find minima faster

[Figure: corrections pull replica trajectories towards the momentum-accelerated average model]
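A simplified NumPy sketch of the synchronous model averaging idea on this slide is given below; it is not Crossbow's exact update rule, and the correction coefficient `alpha` and momentum `mu` are illustrative hyper-parameters. Each replica is pulled towards the central average model, while the central model itself advances with momentum applied to the aggregate correction.

```python
import numpy as np

def sma_step(replicas, central, velocity, alpha=0.1, mu=0.9):
    """One synchronous model averaging step over a list of replica weight vectors."""
    # how far each replica has drifted from the central average model
    corrections = [alpha * (r - central) for r in replicas]
    # correct each replica's trajectory towards the average model
    replicas = [r - c for r, c in zip(replicas, corrections)]
    # accelerate the average model's trajectory with momentum
    velocity = mu * velocity + sum(corrections)
    central = central + velocity
    return replicas, central, velocity
```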

Page 26: Crossbow: Scaling Deep Learning on Multi-GPU Servers

GPUs with Synchronous Model Averaging

[Figure: GPUs 1–3 each run multiple learners with their own model replicas]

• Synchronously apply corrections to model replicas

Page 27: Crossbow: Scaling Deep Learning on Multi-GPU Servers

GPUs with Synchronous Model Averaging

[Figure: GPUs 1–3 each run multiple learners with their own model replicas; GPU 2 holds the average model, while GPUs 1 and 3 hold reference models]

• Synchronously apply corrections to model replicas

Page 28: Crossbow: Scaling Deep Learning on Multi-GPU Servers

GPUs with Synchronous Model Averaging

[Figure: GPUs 1–3 each run multiple learners with their own model replicas; GPU 2 holds the average model, while GPUs 1 and 3 hold reference models]

Synchronous model averaging:
• Ensures consistent view of average model
• Takes GPU bandwidth into account during synchronisation

Page 29: Crossbow: Scaling Deep Learning on Multi-GPU Servers

• Train multiple model replicas per GPU

• Use synchronous model averaging

(1) How to increase efficiency with small batches?

(2) How to synchronise model replicas?

(3) How to reduce scheduling and synchronisation overheads?

[Figure: (1) multiple training tasks and model replicas on one GPU; (2) model replicas on GPU-1 and GPU-2 synchronised with a central model; (3) reusable data buffers, some currently in use by tasks]

Page 30: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Crossbow Architecture

[Figure: Crossbow architecture – data pre-processors read the dataset into input batches; a task scheduler takes dataflows and ready queues and dispatches learner and synchronisation tasks onto model replicas and learner streams on GPUs 1 and 2; an auto-tuner adjusts the number of learners]

• Implementation: 24K LOC Java, 15K LOC C/C++
• Integration with TensorFlow
• github.com/lsds/crossbow

Page 31: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Efficient Task Scheduling

• Execute compute and synchronisation tasks

• Fine-grained concurrency

• Need efficient scheduler to feed all GPUs with tasks

[Figure: the task scheduler uses multiple threads to take dataflows, input batches from the data pre-processors and ready queues, and dispatches tasks onto model replicas and learner streams, guided by the auto-tuner]
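A very rough Python sketch of this scheduling pattern is shown below (Crossbow itself implements the scheduler in Java/C++); tasks are plain callables here, standing in for learning and synchronisation tasks, and `None` is used as a stop signal.

```python
import queue
import threading

def start_scheduler(ready_queue: "queue.Queue", num_workers: int = 4):
    """Worker threads repeatedly pull tasks from the ready queue and run them,
    keeping the GPU streams fed with compute and synchronisation work."""
    def worker():
        while True:
            task = ready_queue.get()
            if task is None:        # stop signal
                break
            task()                  # e.g. launch a learner on a free GPU stream
            ready_queue.task_done()

    workers = [threading.Thread(target=worker, daemon=True) for _ in range(num_workers)]
    for w in workers:
        w.start()
    return workers
```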

Page 32: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Interleaving Compute & Synchronisation Tasks

[Figure: per-GPU timelines on GPUs 1 and 2 interleaving compute-gradient, update-replica, elastic-average, average-gradients and update-device-model tasks across batches N, N+1 and N+2; worker threads feed replica queues and synchronisation queues via mapped memory, a monitor creates new replica queues on the fly, and the scheduler dispatches the next available replica as soon as it is ready to compute another gradient]

Page 33: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Auto-Tuning the Number of Model Replicas

• Monitor training throughput

• Dynamically adjust number of learners

• Uses object pooling & lazy materialization

[Figure: the auto-tuner sits alongside the task manager, data pre-processors, input batches, model replicas and learner streams]
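The auto-tuner could be approximated by a simple throughput-driven search like the sketch below (an assumption-laden illustration, not the actual algorithm); `measure_throughput` and `add_learner` are hypothetical callbacks into the training engine.

```python
def autotune_learners(measure_throughput, add_learner, max_learners=16, min_gain=1.05):
    """Keep adding model replicas (learners) per GPU while the measured training
    throughput still improves by at least `min_gain`; otherwise stop growing."""
    learners = 1
    best = measure_throughput()
    while learners < max_learners:
        add_learner()                      # e.g. materialise a pooled replica lazily
        learners += 1
        current = measure_throughput()
        if current < best * min_gain:      # no significant improvement: stop
            break
        best = current
    return learners
```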

Page 34: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Experimental Evaluation


Page 35: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Does Crossbow Train Effectively with Small Batch Sizes?

Multiple learners per GPU improve hardware efficiency

[Figure: time to test accuracy (sec) for TensorFlow with b = 64 and b = 256 versus Crossbow with b = 64 and 1 learner and with b = 64 and 4 learners]

• ResNet-32 with ImageNet dataset on 1 Titan X GPU

Page 36: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Does Crossbow Scale to multiple GPUs?

• ResNet-50 with ImageNet dataset on 8 Titan X GPUs

[Figure: time to 53% test accuracy (TTA, sec) versus number of GPUs for ResNet-50, comparing TensorFlow, Crossbow (m=1) and Crossbow]

Synchronous model averaging improves statistical efficiency

Page 37: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Does Crossbow Train Effectively Across Models?

Training with multiple learners always better than training with large batches

[Figure: speed-up over TensorFlow for LeNet, ResNet-32, VGG, ResNet-50 and ResNet-101 on 1 GPU and on 8 GPUs; speed-ups range from 1.2x to 4.3x]

Page 38: Crossbow: Scaling Deep Learning on Multi-GPU Servers

What is the Statistical Efficiency with Many Learners?

• ResNet-50 with ImageNet dataset on Titan X GPUs

[Figure: test accuracy (%) vs. epochs for m = 1, 2, 4, 8, 16 and 32 learners]

Increasing parallelism has less effect on statistical efficiency

Page 39: Crossbow: Scaling Deep Learning on Multi-GPU Servers

Crossbow: Scaling GPU Deep Learning

• Need to make training throughput independent from hyper-parameters
  – Rethink the design of future deep learning systems

• Crossbow: Scaling DNN training with small batch sizes on many GPUs
  – Multiple model replicas per GPU for high hardware efficiency
  – Synchronous model averaging for high statistical efficiency

• Exciting research challenges for next generation deep learning systems


Peter Pietzuch – https://lsds.doc.ic.ac.uk – [email protected]

Thank You — Any Questions?

github.com/lsds/crossbow