Crossbow: Scaling Deep Learning on Multi-GPU Servers
Peter Pietzuch, with Alexandros Koliousis, Luo Mai, Pijika Watcharapichat, Matthias Weidlich, Paolo Costa
Imperial College London – http://lsds.doc.ic.ac.uk – [email protected]
CASTOR Software Days – Stockholm, Sweden – October 2019
Large-Scale Data & Systems Group
Machine Learning with Deep Neural Networks (DNN)
Revolutionised solutions in vision, speech recognition, …
DNN models are trained by giving examples (instead of programming)
[Figure: DNNs map audio to words ("hello audience"), text to topics, and images to labels]
When DNN output is wrong, tweak its parameters
Training DNNs
• Obtain DNN model that minimises classification error
• Use Stochastic Gradient Descent (SGD) for training:
  1. Begin with random model
  2. Consider mini-batch of training data
  3. Iteratively calculate gradients & update model parameters w
[Figure: error vs. model parameters w – SGD converges from a random model towards the optimal parameters with the lowest error]
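The three steps above form a short mini-batch SGD loop. A minimal NumPy sketch for a linear least-squares model (the model, data, and hyper-parameter values are illustrative placeholders, not the DNNs in this talk):

```python
import numpy as np

def sgd_train(X, y, batch_size=32, lr=0.01, steps=1000):
    """Mini-batch SGD for a linear model y ~ X @ w with squared loss."""
    rng = np.random.default_rng(seed=0)
    w = rng.normal(size=X.shape[1])                  # 1. begin with a random model
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size)    # 2. draw a mini-batch
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / batch_size     # 3. gradient of the loss...
        w -= lr * grad                               # ...and parameter update
    return w

# toy usage: recover a ground-truth weight vector from noisy samples
X = np.random.rand(1000, 10)
w_true = np.random.rand(10)
y = X @ w_true + 0.01 * np.random.randn(1000)
w_hat = sgd_train(X, y)
```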
Training DNNs on GPUs
• GPUs are good at parallelising gradient computation
Training DNNs in Parallel with GPUs
• With large datasets, speed up by calculating gradients on multiple GPUs
• Every GPU has a model replica with a copy of the model parameters (or weights)
[Figure: a mini-batch of training data is split into shards 1–3; GPUs 1–3 each compute a gradient against their local model replica]
But model replicas would diverge over time…
Model Synchronisation among GPUs
• Parameter server: Maintains global model
[Figure: GPUs 1–3 each hold a local model replica and send gradients to the global model]
• GPUs:
  1. Send gradients to update global model
  2. Synchronise local model replicas with global model
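A minimal single-process sketch of this parameter-server protocol (the gradients are random stand-ins for GPU-side computation; this is not Crossbow's or TensorFlow's implementation):

```python
import numpy as np

class ParameterServer:
    """Holds the global model; replicas push gradients and pull fresh weights."""
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)
        self.lr = lr

    def push(self, grad):
        self.w -= self.lr * grad      # 1. gradient updates the global model

    def pull(self):
        return self.w.copy()          # 2. local replica syncs with global model

ps = ParameterServer(dim=10)
replicas = [ps.pull() for _ in range(3)]          # one replica per GPU
for w_local in replicas:
    grad = np.random.rand(10)                     # stand-in for a GPU's gradient
    ps.push(grad)
replicas = [ps.pull() for _ in range(3)]          # all replicas match the global model again
```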
The Problem with Large Batch Sizes
Training with large mini-batches is bad for your health.
More importantly, it’s bad for your test error.
Friends don’t let friends use mini-batches larger than 32.
Yann LeCun @ylecun
2:00 PM – 26 Apr 2018
Why Use Large Batch Sizes?
[Figure: a batch, a bigger batch, an even bigger batch drawn from the dataset; the per-batch gradients are averaged to update the weights]
Keep work per GPU constant to scale
E.g. ~32 to 256 labelled images
What is the Best Batch Size on a GPU?
• ResNet-32 on NVIDIA Titan X GPU
[Chart: time to accuracy (sec) with TensorFlow vs. batch size b (32–1024) – batch size 32 is slowest (~1134 sec); the best time (~302 sec) is reached at an intermediate batch size]
Training DNNs Favours Small Batch Sizes
We want frequent, less “noisy” updates
[Figure: small batches give frequent weight updates from a single gradient, while a large batch averages several gradients into one, less frequent update]
Statistical Efficiency Needs Small Batch Sizes
[Chart: test accuracy (%) vs. epochs for batch sizes b=64 to b=4096 – small batch sizes reach higher test accuracy than large batch sizes]
Hardware Efficiency Needs Large Batch Sizes
Keep work per GPU constant → increase batch size with #GPUs
[Figure: one GPU alternates computing a gradient and updating its replica with batches of size ½b, whereas two GPUs each compute a gradient on a batch of size ½b, average the gradients, and update their replicas]
Tension between Hardware & Statistical Efficiency
• Practitioners increase batch size due to hardware efficiency
• But best batch size depends on both hardware & statistical efficiency
Current Practice: Hyper-Parameter Tuning
• Adjust hyper-parameters (e.g. learning rate, momentum) to avoid reduction in statistical efficiency
• Linear scaling rule: "When mini-batch size is multiplied by k, multiply learning rate by k"
  – Goyal et al. (2017); see the snippet below
• Drawbacks
  – Manual, labour-intensive process
  – Highly model specific – not portable and does not work for some models
  – Less effective for very large batch sizes…
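As a snippet (hypothetical helper function; the rule itself is from Goyal et al., 2017):

```python
def scaled_learning_rate(base_lr, base_batch, batch):
    """Linear scaling rule: when the mini-batch size is multiplied by k,
    multiply the learning rate by k (Goyal et al., 2017)."""
    return base_lr * (batch / base_batch)

# e.g. a base rate of 0.1 at batch size 256 scales to 0.1 * 32 = 3.2 at batch size 8,192
lr = scaled_learning_rate(0.1, 256, 8192)
```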
Limits of Hyper-Parameter Tuning
“When mini-batch size is multiplied by k, multiply learning rate by k”
[Chart: ResNet-50, 32 images/GPU – top-1 validation error (%) vs. mini-batch size (256 to 65,536); validation error rises sharply beyond a mini-batch size of 8,192]
Fundamental Challenge of GPU Scaling
• “If batch size could be made arbitrarily large while still training effectively, then training is amenable to standard weak scaling approaches. However, if the training rate of some models is restricted to small batch sizes, then we will need to find other algorithmic and architectural approaches to their acceleration.”
– J. Dean, D. Patterson et al., “A New Golden Age in Computer Architecture”, IEEE Micro, 2018
• How to design a deep learning system that scales training with multiple GPUs, even when the preferred batch size is small?
(1) How to increase hardware efficiency with small batches?
(2) How to synchronise model replicas?
(3) How to reduce scheduling & synchronisation overheads?
[Figure: multiple tasks and model replicas per GPU; model synchronisation across GPUs; reusable data buffers and task scheduling]
Problem: Small Batch Sizes Underutilise GPUs
[Chart: CDF (%) of GPU occupancy (%) when training with small batches – GPU resources are under-used]
How to Process Small Batches Efficiently?
One batch per GPU → not enough data and instruction parallelism for every operator
[Figure: with one batch per GPU, per-operation GPU utilisation (%) remains low]
Idea: Train Multiple Model Replicas per GPU
One learning process (or learner) per GPU stream
[Figure: within 1 GPU, a scheduler dispatches small batches to multiple learners, each computing gradients against its own weights on a separate stream]
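A minimal sketch of the idea with CuPy streams (assumes a CUDA GPU and CuPy; the matrix multiply is a stand-in for a learner's forward/backward pass, not Crossbow's code):

```python
import cupy as cp

n_learners = 4
streams  = [cp.cuda.Stream(non_blocking=True) for _ in range(n_learners)]
replicas = [cp.random.rand(1024, 1024) for _ in range(n_learners)]   # one model replica per learner
batches  = [cp.random.rand(1024, 64) for _ in range(n_learners)]     # one small batch per learner

outputs = []
for stream, w, x in zip(streams, replicas, batches):
    with stream:                       # this learner's work is issued on its own stream,
        outputs.append(w @ x)          # so the GPU can overlap several small-batch learners
cp.cuda.Device().synchronize()         # wait for all learners to finish
```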
Effect of Training Multiple Model Replicas per GPU
[Chart: throughput increase (%) vs. batch size b (32–1024) when training multiple replicas per GPU – the under-used resources are regained]
• But now we must synchronise a large number of learners/model replicas...
(1) How to increase efficiency with small batches?
• Train multiple model replicas per GPU
(2) How to synchronise model replicas?
Problem: Why not Synchronous Parallel SGD?
All learners always start from the same point
Limited exploration of parameter space
Idea: Maintain Independent Model Replicas
• Benefits:
  – Increased exploration of space through parallelism
  – Each model replica uses small batch size
[Figure: from the initial weights, replica X's and replica Y's trajectories explore the parameter space around the average model trajectory]
Crossbow: Synchronous Model Averaging
Allow learners to diverge but correct trajectories based on average model
Accelerate average model trajectory with momentum to find minima faster
[Figure: corrections pull each replica's trajectory towards the momentum-accelerated average model trajectory]
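A minimal NumPy sketch of one such synchronous model averaging step (the coefficients, names, and exact update order are simplified assumptions; the Crossbow paper gives the precise algorithm):

```python
import numpy as np

def sma_step(replicas, grads, avg, velocity, lr=0.01, alpha=0.1, momentum=0.9):
    """One synchronous model averaging step over all learners' replicas."""
    # each replica is pulled back towards the current average model
    corrections = [alpha * (w - avg) for w in replicas]
    # learners diverge (their own gradient) but receive a correction
    replicas = [w - lr * g - c for w, g, c in zip(replicas, grads, corrections)]
    # the average model absorbs the corrections and moves with momentum
    velocity = momentum * velocity + sum(corrections)
    avg = avg + velocity
    return replicas, avg, velocity
```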
GPUs with Synchronous Model Averaging
[Figure: GPUs 1–3 each run several learners with their own model replicas; every GPU keeps a reference model, and one GPU additionally maintains the average model]
• Synchronously apply corrections to model replicas
Synchronous model averaging:
• Ensures consistent view of average model
• Takes GPU bandwidth into account during synchronisation
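One way to read the figure: corrections are aggregated within each GPU against its local reference model first, and only per-GPU results cross the slower inter-GPU links to form the average model. A minimal sketch of such two-level averaging (the structure is inferred from the figure, not taken from Crossbow's code):

```python
import numpy as np

def two_level_average(replicas_per_gpu):
    """replicas_per_gpu: one list of model replicas per GPU.
    Average locally first, then across GPUs, so that only one
    model per GPU has to cross the inter-GPU links."""
    reference_models = [np.mean(replicas, axis=0) for replicas in replicas_per_gpu]
    return np.mean(reference_models, axis=0)

# 3 GPUs, 2 learners each, 4-parameter models
gpus = [[np.random.rand(4) for _ in range(2)] for _ in range(3)]
average_model = two_level_average(gpus)
```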
(1) How to increase efficiency with small batches?
• Train multiple model replicas per GPU
(2) How to synchronise model replicas?
• Use synchronous model averaging
(3) How to reduce scheduling and synchronisation overheads?
Crossbow Architecture
[Architecture figure: data pre-processors read the dataset and produce input batches; a task scheduler with ready queues dispatches dataflow tasks to learner and synchronisation streams on GPU 1 and GPU 2, each with its own model replicas; an auto-tuner adjusts the configuration at runtime]
• Implementation: ~24K lines of Java and ~15K lines of C/C++
• Integration with TensorFlow
• github.com/lsds/crossbow
Efficient Task Scheduling
• Execute compute and synchronisation tasks
• Fine-grained concurrency
• Need efficient scheduler to feed all GPUs with tasks
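A minimal sketch of such a scheduler, with Python threads and queues standing in for Crossbow's worker threads, GPU streams, and replica pools (illustrative only):

```python
import queue
import threading

task_queue = queue.Queue()        # learning tasks that are ready to run
free_replicas = queue.Queue()     # model replicas not currently in use

def worker(gpu_id):
    """Repeatedly pair the next task with a free replica and run it."""
    while True:
        task = task_queue.get()
        if task is None:                       # shutdown signal
            break
        replica = free_replicas.get()          # take a replica from the pool
        task(gpu_id, replica)                  # compute gradient / synchronise
        free_replicas.put(replica)             # replica is ready again

for _ in range(4):
    free_replicas.put({})                      # placeholder model replicas

workers = [threading.Thread(target=worker, args=(g,)) for g in range(2)]
for t in workers:
    t.start()
for i in range(8):                             # enqueue a few dummy tasks
    task_queue.put(lambda gpu, rep, i=i: print(f"task {i} ran on GPU {gpu}"))
for _ in workers:
    task_queue.put(None)                       # stop the workers
for t in workers:
    t.join()
```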
[Figure: data pre-processors (multiple threads) read the dataset and fill input batches; the task scheduler takes dataflow tasks from ready queues and assigns them to model replicas and learner streams, guided by the auto-tuner]
Interleaving Compute & Synchronisation Tasks
[Figure: task schedule for batches N, N+1, N+2 on GPU 1 and GPU 2 – each learner computes a gradient, updates its replica, and applies an elastic-average correction; gradients are averaged and the device model is updated; replica and synchronisation queues are created on the fly so that a replica is scheduled for the next batch as soon as it is ready; worker threads, a monitor, and mapped memory feed data from the dataset]
Auto-Tuning the Number of Model Replicas
• Monitor training throughput
• Dynamically adjust number of learners
• Uses object pooling & lazy materialization
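A minimal sketch of such a tuning loop (the greedy policy and the callback names are illustrative assumptions, not Crossbow's exact auto-tuner):

```python
def autotune_learners(measure_throughput, add_learner, remove_learner,
                      max_learners=32):
    """Greedily add learners per GPU while measured throughput keeps improving."""
    learners = 1
    best = measure_throughput()
    while learners < max_learners:
        add_learner()                 # replica comes from a pre-allocated pool
        learners += 1
        current = measure_throughput()
        if current <= best:           # no further improvement: back off and stop
            remove_learner()
            learners -= 1
            break
        best = current
    return learners
```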
[Figure: the auto-tuner monitors the pipeline between the data pre-processors and the task manager, adding or removing model replicas and learner streams at runtime]
Experimental Evaluation
Does Crossbow Train Effectively with Small Batch Sizes?
Multiple learners per GPU improve hardware efficiency
[Chart: time to test accuracy (sec) – TensorFlow with b=64 and b=256 vs. Crossbow with b=64 (1 learner) and b=64 (4 learners)]
• ResNet-32 with ImageNet dataset on 1 Titan X GPU
[Chart: time to 53% test accuracy, TTA(53%) in sec, for TensorFlow, Crossbow (m=1), and Crossbow]
Does Crossbow Scale to Multiple GPUs?
• ResNet-50 with ImageNet dataset on 8 Titan X GPUs
Synchronous Model Averaging improves statistical efficiency
[Chart: time to test accuracy (sec) vs. number of GPUs for ResNet-50]
Does Crossbow Train Effectively Across Models?
Training with multiple learners is always better than training with large batches
[Chart: speed-up over TensorFlow on 1 GPU and 8 GPUs for LeNet, ResNet-32, VGG, ResNet-50, and ResNet-101 – speed-ups range from 1.2× to 4.3×]
• ResNet-50 with ImageNet dataset on Titan X GPUs
What is the Statistical Efficiency with Many Learners?
[Chart: test accuracy (%) vs. epochs for m=1 to m=32 learners]
Increasing parallelism has less effect on statistical efficiency
Crossbow: Scaling GPU Deep Learning
• Need to make training throughput independent from hyper-parameters
  – Rethink the design of future deep learning systems
• Crossbow: Scaling DNN training with small batch sizes on many GPUs
  – Multiple model replicas per GPU for high hardware efficiency
  – Synchronous model averaging for high statistical efficiency
• Exciting research challenges for next generation deep learning systems
Peter Pietzuch – https://lsds.doc.ic.ac.uk – [email protected]
Thank You — Any Questions?
github.com/lsds/crossbow