
Page 1:

Scaling Deep Learning on Multi-GPU Servers

Peter Pietzuch (with Alexandros Koliousis, Luo Mai, Pijika Watcharapichat, Matthias Weidlich, Paolo Costa)

Imperial College London

http://lsds.doc.ic.ac.uk – <[email protected]>

SICS – Sweden – September 2018

Large-Scale Data & Systems Group

Page 2:

Deep Neural Networks (DNN) at Scale

[Figure: speech recognition maps audio data to “Hello”; image classification maps image data to “Person”.]

Train DNNs with a lot of data


Page 3:

Training DNNs with Stochastic Gradient Descent

• Goal: Obtain DNN model that minimises classification error

• Stochastic Gradient Descent (SGD):
  – Consider mini-batch of training data
  – Iteratively calculate gradients and update model parameters w

[Figure: error as a function of model parameters w; starting from random parameters, SGD converges towards the optimal parameters with the lowest error.]
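As a concrete rendering of the update loop described above, here is a minimal NumPy sketch of mini-batch SGD; the linear model, squared loss, and learning rate are illustrative assumptions, not the networks used in the talk.

```python
import numpy as np

def sgd(X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])                 # random initial parameters
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            b = order[start:start + batch_size]     # one mini-batch
            err = X[b] @ w - y[b]                   # prediction error
            grad = X[b].T @ err / len(b)            # gradient of the squared loss
            w -= lr * grad                          # update model parameters w
    return w

# Usage on synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10)
w = sgd(X, y)
```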


Page 4:

The Problem With Large Batch Sizes

• “Training with large mini-batches is bad for your health. More importantly, it’s bad for your test error. Friends don’t let friends use mini-batches larger than 32.”

  – Y. LeCun, @ylecun, April 26, 2018


Page 5:

Scaling DNN Training on GPUs

• Manager: Need a highly-accurate image classification model ASAP

• Highly-Paid Data Scientist: OK. I’ll use ResNet-50. It will take ½ month on 1 GPU

• M: Throw more hardware at the problem! Need this sooner.

• HPDS: …


Page 6:

Parallel DNN Training

• With large training datasets, speed up by calculating gradients in parallel

[Figure: the mini-batch is split into Shard 1, Shard 2, and Shard 3; GPU 1, GPU 2, and GPU 3 each compute a gradient against their own model replica. Model replicas diverge over time.]
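The data-parallel pattern in the figure can be sketched as follows; NumPy arrays stand in for GPUs, and the model, data, and shard sizes are illustrative assumptions. Because each replica applies only its own local updates here, the replicas drift apart over time, which is the divergence problem noted above.

```python
import numpy as np

def local_step(w, X, y, lr=0.01):
    grad = X.T @ (X @ w - y) / len(X)   # gradient on this shard's mini-batch
    return w - lr * grad                # local update, no synchronisation

rng = np.random.default_rng(0)
X, y = rng.normal(size=(3000, 10)), rng.normal(size=3000)
shards = np.array_split(np.arange(len(X)), 3)        # one data shard per "GPU"
replicas = [np.zeros(10) for _ in shards]            # replicas start identical

for step in range(100):
    for g, shard in enumerate(shards):
        batch = rng.choice(shard, size=32, replace=False)   # local mini-batch
        replicas[g] = local_step(replicas[g], X[batch], y[batch])

print(np.std(replicas, axis=0).mean())               # > 0: the replicas have diverged
```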

Page 7:

Synchronisation Among GPUs

• Parameter server: Maintains global model

[Figure: GPU 1, GPU 2, and GPU 3 each hold a model replica and exchange gradients with the global model held by the parameter server.]

• GPUs:
  1. Send gradients to update the global model
  2. Synchronise local model replicas with the global model
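A minimal sketch of this parameter-server exchange, assuming a toy NumPy model and synthetic worker batches (nothing specific to the talk's systems):

```python
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)                        # global model
        self.lr = lr

    def push(self, gradients):
        # Step 1: apply the averaged worker gradients to the global model
        self.w -= self.lr * np.mean(gradients, axis=0)

    def pull(self):
        # Step 2: workers resynchronise their replicas with the global model
        return self.w.copy()

def worker_gradient(w, rng):
    X, y = rng.normal(size=(32, 10)), rng.normal(size=32)
    return X.T @ (X @ w - y) / len(X)                 # gradient on a local batch

rng = np.random.default_rng(0)
ps = ParameterServer(dim=10)
replicas = [ps.pull() for _ in range(3)]              # one replica per GPU

for step in range(100):
    grads = [worker_gradient(w, rng) for w in replicas]
    ps.push(grads)                                    # 1. send gradients
    replicas = [ps.pull() for _ in replicas]          # 2. sync replicas
```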


Page 8:

What is the Best Batch Size?

• ResNet-32 on Titan X GPU


[Chart: time to accuracy (sec) vs. batch size b for TensorFlow; reported times of 1134, 361, 302, 354, 379, and 445 seconds across b = 32, 64, 128, 256, 512, and 1024.]

Page 9:

Intuition for Small Batch Sizes

• Practical considerations:
  – More frequent model updates minimise the bias to initial conditions faster (i.e. the initial weights are forgotten faster)

• Theoretical considerations:
  – Small batch sizes better explore the wider minima, which are known to exhibit better test accuracy


Page 10:

Statistical Efficiency Needs Small Batch Sizes


[Chart: test accuracy vs. epochs (0–140) for batch sizes b = 64, 128, 256, 512, 1024, 2048, and 4096.]

Page 11:

Hardware Efficiency Needs Large Batch Sizes

Keep work per GPU constant => scale batch size with #GPUs


[Figure: timeline for GPU 1 and GPU 2; each GPU computes a gradient on batch N, the gradients are aggregated (blocking) and averaged, each replica is updated, and only then do both GPUs proceed to batch N+1.]
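The synchronous step in the figure can be sketched as follows; the per-GPU batch size b stays fixed, so the effective batch grows to b × #GPUs. The NumPy stand-ins and the toy gradient are assumptions, not the real training setup.

```python
import numpy as np

def synchronous_step(replicas, batches, lr=0.01):
    # Each "GPU" computes a gradient on its own local batch of fixed size b
    grads = [X.T @ (X @ w - y) / len(X) for w, (X, y) in zip(replicas, batches)]
    avg = np.mean(grads, axis=0)             # blocking aggregation across GPUs
    return [w - lr * avg for w in replicas]  # identical update on every replica

b, num_gpus = 32, 4                          # effective batch size: 32 * 4 = 128
rng = np.random.default_rng(0)
replicas = [np.zeros(10) for _ in range(num_gpus)]
batches = [(rng.normal(size=(b, 10)), rng.normal(size=b)) for _ in range(num_gpus)]
replicas = synchronous_step(replicas, batches)
```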

Page 12:

Hardware & Statistical Efficiency

• “Training with large mini-batches is bad for your health. More importantly, it’s bad for your test error. Friends don’t let friends use mini-batches larger than 32.”
  – Y. LeCun, @ylecun, April 26, 2018

• The only reason practitioners increase batch size is hardware efficiency

• But best batch size depends on both hardware efficiency & statistical efficiency


Page 13:

Limits of Scaling DNN Training

• HPDS: Managed to train it on 100s of GPUs in 1h!

• M: Great! Make it faster. Use as many resources as you need!

• HPDS: Can’t :( Beyond this point statistical efficiency collapses. Actually, I struggled to maintain it up to this point…

• M: …

• The story continues. New hardware becomes available. Communication between workers improves. The latest results train ResNet-50 in just a few minutes. But the problem remains.


Hyper-Parameter Tuning

Page 14:

The Fundamental Challenge of GPU Scaling

• “If batch size could be made arbitrarily large while still training effectively, then training is amenable to standard weak scaling approaches. However, if the training rate of some models is restricted to small batch sizes, then we will need to find other algorithmic and architectural approaches to their acceleration.”
  – J. Dean, D. Patterson, X. […], “The Golden Age”, IEEE Micro

• How to design a system that can scale training with multiple GPUs even when the preferred batch size is small?


Page 15:

Problem: Small Batch Sizes Underutilise GPUs


[Chart: CDF (%) of GPU occupancy (%) during training with a small batch size, highlighting under-used resources.]

Page 16:

Idea: Train Multiple Model Replicas per GPU

• Fully exploits task parallelism on GPU

• But we now need to synchronise a large number of model replicas...

[Chart: throughput increase (%) vs. batch size b (32–1024) when training multiple model replicas per GPU, highlighting the regained resources.]

Page 17:

How to Synchronise Many Model Replicas?

• Synchronised SGD:
  – Average the gradients of multiple workers
  – Start the next training iteration with the average model

• Workers start the next exploration from the same point in weight space


[Figure: replica X’s and replica Y’s trajectories start from the same initial weights; the average gradient trajectory combines them.]

Page 18:

Idea: Sample Multiple Nearby Points in Space

• Synchronisation using Elastic Averaging SGD (EA-SGD)

• Benefits:
  1. Each replica reuses the learning set-up of a small batch size
  2. Increased exploration through parallelism
  3. The average replica helps with the direction to good minima


[Figure: replica X’s and replica Y’s trajectories start from the same initial weights; a correction rule pulls them towards the average model trajectory.]
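One way to write down the elastic-averaging correction (elastic averaging SGD, Zhang et al. 2015) is the hedged NumPy sketch below; the values of alpha, the learning rate, and grad_fn are assumptions rather than the talk’s exact settings.

```python
import numpy as np

def easgd_round(replicas, center, grad_fn, lr=0.01, alpha=0.1):
    # Each replica takes its own SGD step on a small local batch
    replicas = [w - lr * grad_fn(w) for w in replicas]
    # Elastic correction: replicas are pulled towards the average (centre) model,
    # and the centre model is pulled towards the replicas
    diffs = [w - center for w in replicas]
    replicas = [w - alpha * d for w, d in zip(replicas, diffs)]
    center = center + alpha * sum(diffs)
    return replicas, center

# Tiny usage example with a quadratic loss, purely illustrative
grad_fn = lambda w: 2 * (w - 1.0)                   # gradient of ||w - 1||^2
replicas = [np.random.default_rng(i).normal(size=5) for i in range(4)]
center = np.zeros(5)
for _ in range(200):
    replicas, center = easgd_round(replicas, center, grad_fn)
```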

Page 19:

When To Apply Corrections?

• Synchronously apply corrections to model replicas

• Other techniques:
  – Control the impact of the average model using momentum
  – Auto-tune the number of model replicas


[Figure: as above, the correction rule pulls replica X’s and replica Y’s trajectories towards the average model trajectory.]

Page 20:

Crossbow: Multi-GPU Deep Learning System

(1) Train multiple models in parallel
(2) Synchronous hierarchical model averaging
(3) Efficient fine-grained task engine

• Execute concurrent tasks to leverage GPU concurrency
• Schedule fine-grained compute & synchronisation tasks
• Minimise the amount of data transfer among different GPUs

[Figure: on each GPU, tasks train multiple model replicas using reusable data buffers (buffer-1 to buffer-4, some currently in use); the replicas on GPU-1 and GPU-2 are combined into an average model.]

Page 21:

GPU Parallelism with Multiple Model Replicas

• On each GPU, use different streams for training tasks
• Task: series of operations based on the computation of model layers
  – A task is associated with (a) a model replica and (b) a batch of training data

[Figure: each task (task1–task3) combines a model replica (model1–model3) with a batch of data (data1–data3) and is dispatched to its own stream (stream 1–3) on the GPU.]
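A minimal PyTorch-style sketch of dispatching independent training tasks to separate CUDA streams on one GPU, in the spirit of the figure; the toy nn.Linear model and batch shapes are assumptions, not Crossbow’s actual engine.

```python
import torch
import torch.nn.functional as F

device = torch.device("cuda")
replicas = [torch.nn.Linear(128, 10).to(device) for _ in range(3)]   # model1..model3
optims = [torch.optim.SGD(m.parameters(), lr=0.01) for m in replicas]
streams = [torch.cuda.Stream() for _ in replicas]                    # stream 1..3

def train_task(model, opt, x, y):
    # One learning task: forward, backward, and update for one replica
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

x = torch.randn(32, 128, device=device)            # a small batch (data1..data3)
y = torch.randint(0, 10, (32,), device=device)

for model, opt, stream in zip(replicas, optims, streams):
    with torch.cuda.stream(stream):                # dispatch the task to its own stream
        train_task(model, opt, x, y)
torch.cuda.synchronize()                           # wait for all streams to finish
```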


Page 22:

Synchronous Hierarchical Elastic Model Averaging

• Intra-GPU, cross-GPU, and CPU-GPU communication have different performance characteristics

• Crossbow uses hierarchical aggregation to avoid bottlenecks

[Figure: model replicas are first synchronised within each GPU (GPU-1, GPU-2) and then synchronised across GPUs into an average model.]
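A sketch of the two-level averaging shown above: replicas on the same GPU are averaged first (cheap, local memory), and only the per-GPU results are averaged across GPUs. Plain NumPy arrays stand in for device buffers; the shapes and counts are assumptions.

```python
import numpy as np

def hierarchical_average(replicas_per_gpu):
    # Stage 1: intra-GPU average, one local reduction per device
    per_gpu = [np.mean(replicas, axis=0) for replicas in replicas_per_gpu]
    # Stage 2: cross-GPU average over the much smaller set of partial results
    return np.mean(per_gpu, axis=0)

rng = np.random.default_rng(0)
replicas_per_gpu = [[rng.normal(size=10) for _ in range(4)]   # 4 replicas per GPU
                    for _ in range(2)]                        # 2 GPUs
average_model = hierarchical_average(replicas_per_gpu)
```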


Page 23:

CrossBow Task Execution Engine


[Figure: task engine timeline for GPU 1 and GPU 2 over batches N, N+1, and N+2; worker threads (W) schedule the next available replica from the dataset, while per-GPU replica queues and synchronisation queues hold fine-grained tasks such as “compute gradient”, “update replica”, “elastic average”, “average gradients”, and “update device model”; results pass through mapped memory, a monitor and auto-tuner create new replica queues on the fly, and a finished replica is immediately ready to compute another gradient.]

• Many small tasks for (1) replica training and (2) synchronisation

• CrossBow has a GPU/CPU task engine for multiplexing & overlapping small tasks
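As a rough illustration of such a fine-grained task engine, here is a pure-Python sketch in which worker threads pull small “compute gradient” and “synchronise” tasks from a queue so that they can be multiplexed and overlapped; the task bodies are placeholders, not CrossBow’s real GPU tasks.

```python
import queue
import threading

task_queue = queue.Queue()

def worker():
    # Worker thread: repeatedly pull the next small task and run it
    while True:
        task = task_queue.get()
        if task is None:          # shutdown signal
            break
        task()                    # e.g. compute a gradient or average replicas
        task_queue.task_done()

threads = [threading.Thread(target=worker, daemon=True) for _ in range(4)]
for t in threads:
    t.start()

for step in range(8):
    # Interleave training and synchronisation tasks so they can overlap
    task_queue.put(lambda s=step: print(f"compute gradient for batch {s}"))
    task_queue.put(lambda s=step: print(f"synchronise replicas after batch {s}"))

task_queue.join()                 # wait until all queued tasks have run
for _ in threads:
    task_queue.put(None)          # stop the workers
```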

Page 24:

CrossBow: Benefit of Synchronous Model Averaging


[Chart: time to 69% test accuracy (seconds) vs. number of GPUs (1, 2, 4, 8) for TensorFlow, CrossBow with 1 replica per GPU, and CrossBow.]

• VGG-16 with Cifar-100 dataset on Titan X GPUs

Page 25:

CrossBow: Statistical Efficiency with Many Models


[Chart: test accuracy (%) vs. epochs (0–140) for m = 1, 2, 4, 8, 16, and 32 model replicas.]

Increasing parallelism does not collapse convergence, while each replica reuses the same learning set-up

• ResNet-50 with ImageNet dataset on Titan X GPUs

Page 26:

Summary: Scaling Deep Learning on Multi-GPU Servers

• Need to make training throughput independent of hyper-parameters
  – The current need for hyper-parameter tuning is too complex
  – Need new designs for deep learning systems

• Crossbow: Scaling DNN training with small batch sizes on many GPUs
  – Multiple model replicas per GPU for high hardware efficiency
  – Synchronous hierarchical model averaging for high statistical efficiency
  – Requires a new CPU/GPU task engine design


Peter Pietzuch – https://lsds.doc.ic.ac.uk – [email protected]

Thank You — Any Questions?
