Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Scalable and Distributed Deep Learning (DL): Co-Design MPI Runtimes and DL Frameworks
Ammar Ahmad Awan, Hari Subramoni, and Dhabaleswar K. Panda
Network Based Computing Laboratory
Dept. of Computer Science and Engineering
The Ohio State University
OSU Booth Talk (SC ’18)
OSU Booth - SC ‘18 2Network Based Computing Laboratory High-Performance Deep Learning
• Introduction
– Deep Learning Trends
– CPUs and GPUs for Deep Learning
– Message Passing Interface (MPI)
• Research Challenges: Exploiting HPC for Deep Learning
• Proposed Solutions
• Conclusion
Agenda
OSU Booth - SC ‘18 3Network Based Computing Laboratory High-Performance Deep Learning
• Easily implement and experiment with Deep Neural Networks
– Several Deep Learning (DL) frameworks have emerged
• Caffe, Microsoft Cognitive Toolkit (CNTK), TensorFlow, PyTorch, and
counting....
– Focus on CUDA-Aware MPI based DL frameworks
• Most frameworks have been optimized for NVIDIA GPUs and the
CUDA programming model
– However, distributed training (MPI+CUDA) is still emerging
– Fragmentation in efforts also exists – gRPC, MPI, NCCL, Gloo, etc.
Deep Learning Frameworks
OSU Booth - SC ‘18 4Network Based Computing Laboratory High-Performance Deep Learning
• NVIDIA GPUs - main driving force for faster training of Deep
Neural Networks (DNNs)
• The ImageNet Challenge - (ILSVRC)
– DL models like AlexNet, ResNet, and VGG
– 90% of the ImageNet teams used GPUs in 2014*
– GPUs: A natural fit for DL –throughput-oriented (dense-compute)
– And, GPUs are growing in the HPC arena as well! – Top500 (Jun ‘18)
Deep Learning and GPUs
*https://blogs.nvidia.com/blog/2014/09/07/imagenet/
https://www.top500.org/
OSU Booth - SC ‘18 5Network Based Computing Laboratory High-Performance Deep Learning
And CPUs are catching up fast
1- https://dl.acm.org/citation.cfm?id=19935162- http://ieeexplore.ieee.org/abstract/document/5762730/3- https://dspace.mit.edu/bitstream/handle/1721.1/51839/MIT-CSAIL-TR-2010-013.pdf?sequence=1
• Intel CPUs are everywhere and many-core
CPUs are emerging according to Top500.org
• Host CPUs exist even on the GPU nodes
– Many-core Xeon Phis are increasing
• Usually, we hear CPUs are 10x – 100x slower
than GPUs? [1-3]
– But, CPU-based ML/DL is getting attention and
performance has significantly improved now
https://www.top500.org/statistics/list/
System Count for Xeon Phi
OSU Booth - SC ‘18 6Network Based Computing Laboratory High-Performance Deep Learning
• What is Message Passing Interface (MPI)?
– a de-facto standard for expressing distributed-memory parallel programming
– used for communication between processes in multi-process applications
• MVAPICH2 is a high performance implementation of the MPI standard
• What can MPI do for Deep Learning?
– MPI has been used for large scale scientific applications
– Deep Learning can also exploit MPI to perform high-performance communication
• Why do I need communication in Deep Learning?
– If you use one GPU or one CPU, you do not need communication
– But, one GPU or CPU is not enough!
– DL wants as many compute elements as it can get!
– MPI is a great fit – Broadcast, Reduce, and Allreduce is what most DL workloads require
What to use for scale-out? (Distributed training of Neural Nets.)
OSU Booth - SC ‘18 7Network Based Computing Laboratory High-Performance Deep Learning
Overview of the MVAPICH2 Project• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,925 organizations in 86 countries
– More than 489,000 (> 0.48 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Jul ‘18 ranking)
• 2nd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
• 12th, 556,104 cores (Oakforest-PACS) in Japan
• 15th, 367,024 cores (Stampede2) at TACC
• 24th, 241,108-core (Pleiades) at NASA and many others
– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
OSU Booth - SC ‘18 8Network Based Computing Laboratory High-Performance Deep Learning
• There are several Deep Learning (DL) or DNN Training frameworks
• Every (almost every) framework has been optimized for NVIDIA GPUs
– cuBLAS and cuDNN have led to significant performance gains!
• But every framework is able to execute on a CPU as well
– So why are we not using them?
– Performance has been “terrible” and several studies have reported
significant degradation when using CPUs (see nvidia.qwiklab.com)
• But there is hope, actually a lot of great progress here!
– And MKL-DNN, just like cuDNN, has definitely rekindled this!!
– The landscape for CPU-based DL looks promising..
Deep Learning Frameworks – CPUs or GPUs?
OSU Booth - SC ‘18 9Network Based Computing Laboratory High-Performance Deep Learning
• Introduction
• Research Challenges: Exploiting HPC for
Deep Learning
• Proposed Solutions
• Conclusion
Agenda
OSU Booth - SC ‘18 10Network Based Computing Laboratory High-Performance Deep Learning
How to efficiently exploit
heterogeneous High Performance
Computing (HPC) resources for high-
performance and high-productivity
Deep Learning?
The Key Question!
OSU Booth - SC ‘18 11Network Based Computing Laboratory High-Performance Deep Learning
1. What are the fundamental
issues in designing DL
frameworks?
– Memory Requirements
– Computation
Requirements
– Communication Overhead
2. Why do we need to support
distributed training?
– To overcome the limits of
single-node training
– To better utilize hundreds
of existing HPC Clusters
Research Challenges to Exploit HPC Technologies
InfiniBand GPUCPU
CNTK
Gradient Aggregation
Model PropagationForward
Backward
Deep Learning and Machine Learning Frameworks
Caffe/OSU-Caffe
Caffe2 TensorFlow MXNet
Communication Runtimes to support Distributed Training
HPC Platforms
Major Computation and Communication Phases in DL Frameworks
1
2
OSU Booth - SC ‘18 12Network Based Computing Laboratory High-Performance Deep Learning
3. What are the new design challenges
brought forward by DL frameworks for
Communication runtimes?
– Large Message Collective
Communication and Reductions
– GPU Buffers (CUDA-Awareness)
4. Can a Co-design approach help in
achieving Scale-up and Scale-out efficiently?
– Co-Design the support at Runtime
level and Exploit it at the DL
Framework level
– What performance benefits can
be observed?
– What needs to be fixed at the
communication runtime layer?
5.
Research Challenges to Exploit HPC Technologies (Cont’d)
CUDA-Awareness
InfiniBand GPUCPU
Large-message Collectives
CNTK
Point-to-PointOperations
Gradient Aggregation
Model PropagationForward
Backward
Deep Learning and Machine Learning Frameworks
Caffe/OSU-Caffe
Caffe2 TensorFlow MXNet
Communication Runtimes (MPI/NCCL/Gloo/MLSL)
HPC Platforms
Major Computation and Communication Phases in DL Frameworks
3
4 Co-Design Opportunities
OSU Booth - SC ‘18 13Network Based Computing Laboratory High-Performance Deep Learning
• Introduction
• Research Challenges: Exploiting HPC for
Deep Learning
• Proposed Solutions
• Conclusion
Agenda
OSU Booth - SC ‘18 14Network Based Computing Laboratory High-Performance Deep Learning
Overview of the Proposed Solutions
Application Layer
OSU-CaffeDistributed Training
with TensorFlowOut-of-Core
DNN TrainingDNN Training on
CPUs/GPUsCA-CNTK
Communication Middleware Layer (Deep Learning Aware MVAPICH2-GDR)
CUDA-Aware Large Msg.
MPI_Reduce
Co-Design Layer
CUDA-Aware Large Msg.
MPI_Allreduce
Co-Design OSU-Caffe and MVAPICH2-GDR
CUDA-Aware MPI_Bcast
NCCL-basedBcast
Pure MPIBcast
MPI Bcast vs. NCCL2 Bcast
Hardware PlatformsCPU GPU InfiniBand HCA
OSU Booth - SC ‘18 15Network Based Computing Laboratory High-Performance Deep Learning
Bcast (GPU 0)
packed_comm_buff
L1
L2
..
Ln
F
L1
L2
..
Ln
L1
L2
..
Ln
L1
L2
..
Ln
Params
GP
U 0
Params
GP
U 1
Params
GP
U 2
Params
GP
U 3
Gradients
1. Data
Propagation
2. Forward
Backward Pass
3. Gradient
Aggregation
B F B F B F B
packed_reduce_
buffpacked_reduce_
buff
packed_reduce_
buff
packed_reduce_
buff
ApplyUpdates
Reduce (GPU 0)
Loop {}
Caffe Architecture
http://hidl.cse.ohio-state.edu
OSU Booth - SC ‘18 16Network Based Computing Laboratory High-Performance Deep Learning
• Deep Learning frameworks are a different game
– Unusually large message sizes (order of megabytes)
– Most communication based on GPU buffers
• Existing State-of-the-art
– cuDNN, cuBLAS, NCCL --> scale-up performance
– CUDA-Aware MPI --> scale-out performance
• Proposed: Can we co-design the MPI runtime (MVAPICH2-
GDR) and the DL framework (Caffe) to achieve both?
– Efficient Overlap of Computation and Communication
– Efficient Large-Message Communication (Reductions)
– What application co-designs are needed to exploit
communication-runtime co-designs?
OSU-Caffe: Co-design to Tackle New Challenges for MPI Runtimes
Scal
e-u
p P
erfo
rman
ce
Scale-out Performance
cuDNN
NCCL
gRPC
Hadoop
ProposedCo-Designs
MPIcuBLAS
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
OSU Booth - SC ‘18 17Network Based Computing Laboratory High-Performance Deep Learning
• Caffe : A flexible and layered Deep Learning
framework.
• Benefits and Weaknesses
– Multi-GPU Training within a single node
– Performance degradation for GPUs across
different sockets
– Limited Scale-out
• OSU-Caffe: MPI-based Parallel Training
– Enable Scale-up (within a node) and Scale-out
(across multi-GPU nodes)
– Scale-out on 64 GPUs for training CIFAR-10
network on CIFAR-10 dataset
– Scale-out on 128 GPUs for training GoogLeNet
network on ImageNet dataset
OSU-Caffe 0.9: Scalable Deep Learning on GPU Clusters
0
50
100
150
200
250
8 16 32 64 128
Trai
nin
g Ti
me
(sec
on
ds)
No. of GPUs
GoogLeNet (ImageNet) on 128 GPUs
Caffe OSU-Caffe (1024) OSU-Caffe (2048)
Invalid use case
OSU-Caffe 0.9 available from HiDL site
OSU Booth - SC ‘18 18Network Based Computing Laboratory High-Performance Deep Learning
• Several Approaches to Distributed Training
– Google RPC (gRPC)
– gRPC+X (X= Verbs API, GDR, and MPI)
– No-gRPC
• No-gRPC designs use:
– MPI or
– NCCL
• Performance is heavily influenced by
– MPI_Allreduce
Scalable Distributed DNN Training using TensorFlow and MPI
Distributed TensorFlow
gRPCAccelerated
gRPC
gRPC+X
gRPC+MPI
gRPC+Verbs
gRPC+GDR
No-gRPC
Baidu-MPI
Horovod
MPI
NCCL
A. A. Awan et al. “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs and Performance Evaluation”, Submitted to IPDPS-19 for peer-review, Available from: https://arxiv.org/abs/1810.11112
OSU Booth - SC ‘18 19Network Based Computing Laboratory High-Performance Deep Learning
Faster Allreduce in the proposed MPI-Opt
implemented in MVAPICH2-GDR
TensorFlow with CUDA-Aware MPI: NCCL vs. MVAPICH2-GDR
A. A. Awan et al. “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs and Performance Evaluation”, Submitted to IPDPS-19 for peer-review, Available from: https://arxiv.org/abs/1810.11112
1
10
100
1000
10000
100000
10000008 32 128
512 2K 8K 32K
128K
512K 2M 8M 32M
128M
Late
ncy
(µ
s)
Message Size (Bytes)
MPI NCCL2 MPI-Opt
1
4
16
64
256
1024
4096
16384
1 2 4 8 16 32 64
Imag
es/s
eco
nd (
Hig
her
is
bet
ter)
No. of Nodes (GPUs)
Horovod-NCCL2 Horovod-MPI-Opt Ideal
Faster (near-ideal) DNN Training speed-ups in TensorFlow-Horovod
–>
OSU Booth - SC ‘18 20Network Based Computing Laboratory High-Performance Deep Learning
0
100
200
300
2 4 8 16 32 64 128
Late
ncy
(m
s)
Message Size (MB)
Reduce – 192 GPUs
Large Message Optimized Collectives for Deep Learning
0
100
200
128 160 192
Late
ncy
(m
s)
No. of GPUs
Reduce – 64 MB
0
100
200
300
16 32 64
Late
ncy
(m
s)
No. of GPUs
Allreduce - 128 MB
0
50
100
2 4 8 16 32 64 128
Late
ncy
(m
s)
Message Size (MB)
Bcast – 64 GPUs
0
50
100
16 32 64
Late
ncy
(m
s)
No. of GPUs
Bcast 128 MB
• MVAPICH2-GDR provides
optimized collectives for
large message sizes
• Optimized Reduce,
Allreduce, and Bcast
• Good scaling with large
number of GPUs
• Available in MVAPICH2-
GDR 2.2 and higher
0
100
200
300
2 4 8 16 32 64 128La
ten
cy (
ms)
Message Size (MB)
Allreduce – 64 GPUs
OSU Booth - SC ‘18 21Network Based Computing Laboratory High-Performance Deep Learning
• NCCL has some limitations
– Only works for a single node, thus, no scale-out on
multiple nodes
– Degradation across IOH (socket) for scale-up (within a
node)
• We propose optimized MPI_Bcast
– Communication of very large GPU buffers (order of
megabytes)
– Scale-out on large number of dense multi-GPU nodes
• Hierarchical Communication that efficiently exploits:
– CUDA-Aware MPI_Bcast in MV2-GDR
– NCCL Broadcast primitive
Efficient Broadcast for MVAPICH2-GDR using NVIDIA NCCL
1
10
100
1000
10000
100000
1 8
64
512
4K
32K
256
K
2M
16M
128
M
Late
ncy
(u
s)Lo
g Sc
ale
Message Size
MV2-GDR MV2-GDR-Opt
100x
0
10
20
30
2 4 8 16 32 64Tim
e (s
eco
nd
s)
Number of GPUs
MV2-GDR MV2-GDR-Opt
Performance Benefits: Microsoft CNTK DL framework (25% avg. improvement )
Performance Benefits: OSU Micro-benchmarks
Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning, A. Awan , K. Hamidouche , A. Venkatesh , and D. K. Panda, EuroMPI 16[Best Paper Runner-Up]
OSU Booth - SC ‘18 22Network Based Computing Laboratory High-Performance Deep Learning
• MPI_Bcast: Design and Performance Tuning for DL
Workloads
– Design ring-based algorithms for large messages
– Harness a multitude of algorithms and techniques for best
performance across the full range of message size and
process/GPU count
• Performance Benefits
– Performance comparable or better than NCCL-
augmented approaches for large messages
– Up to 10X improvement for small/medium message sizes
with micro-benchmarks and up to 7% improvement for
VGG training
Pure MPI Large Message Broadcast
0.001
0.01
0.1
1
10
100
1000
1 8
64
51
2
4K
32
K
25
6K
2M
16
M
12
8M
Late
ncy
(m
s)-
logs
cale
MessageSize(bytes)
MV2-GDR-NCCL MV2-GDR-Opt
0
5
10
15
20
25
30
2 4 8 32 64 128
Trai
nin
gTi
me
(sec
on
ds)
No.ofGPUs
MV2-GDR-NCCL MV2-GDR-Opt
VGG Training with CNTK
A. A. Awan, C-H. Chu, H. Subramoni, and D. K. Panda. Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?, EuroMPI ‘18
MPI Bcast Benchmark: 128 GPUs (8 nodes)
OSU Booth - SC ‘18 23Network Based Computing Laboratory High-Performance Deep Learning
• Performance depends on
many factors
• Hardware Architectures
– GPUs
– Multi-/Many-core CPUs
– Software Libraries: cuDNN (for
GPUs), MKL-DNN/MKL 2017
(for CPUs)
• Hardware and Software co-
design
– Software libraries optimized
for one platform will not help
the other!
– cuDNN vs. MKL-DNN
Understanding the Impact of Execution EnvironmentsDLApplications(ImageRecognition,SpeechProcessing,etc.)
DLFrameworks(Caffe,TensorFlow,etc.)
BLASLibraries
Hardware
Many-coreGPU(PascalP100)
GenericConvolutionLayer
MKLOptimizedConvolutionLayer
MKL2017 cuDNN/cuBLAS
Multi-/Many-core(Xeon,XeonPhi)
cuDNN OptimizedConvolutionLayer
OtherBLASLibraries
OpenBLASATLAS
OtherProcessors
A. A. Awan, H. Subramoni, D. Panda, “An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures” 3rd Workshop on Machine Learning in High Performance Computing Environments, held in conjunction with SC17, Nov 2017.
OSU Booth - SC ‘18 24Network Based Computing Laboratory High-Performance Deep Learning
• We use MCDRAM as Cache for all the
subsequent results
• On average, DDR-All is up to 1.5X slower
than MCDRAM
• MKL engine is up to 3X better than default
Caffe engine
• Biggest gains for Intel Xeon Phi (many-
core) architecture
• Both Haswell and Broadwell architectures
get significant speedups (up to 1.5X)
Impact of MKL engine and MC-DRAM for Intel-Caffe
0200400600800
10001200140016001800
Trai
nin
gT
ime
(m
s)
CPUArchitectures
0
100
200
300
400
500
600
700
DDR-All MCDRAM-All MCDRAMasCache
Tra
inin
gT
ime
(m
s)
MemoryConfigurations
Forward Backward
OSU Booth - SC ‘18 25Network Based Computing Laboratory High-Performance Deep Learning
The Full Landscape for AlexNet Training on CPU/GPU
0
50
100
150
200
250
300
350
400
450
500
Tim
e(m
s)
conv1 conv2 conv3 conv4 conv5
• Convolutions in the Forward and Backward Pass
• Faster Convolutions Faster Training
• Most performance gains are based on conv2 and conv3.
0
100
200
300
400
500
600
700
800
Tim
e(
ms)
conv1 conv2 conv3 conv4 conv5
OSU Booth - SC ‘18 26Network Based Computing Laboratory High-Performance Deep Learning
• What if your Neural Net is bigger than the GPU memory (out-of-core)?
– Use our proposed Unified Memory solution called OC-DNN :-)
Out-of-core DNN Training
FileSystem
Data Layer (L1) Layer 2
<<Kernel>>
………. Layer N
<<Kernel>> …
D2D H2D D2HF2H/H2F I/O
…
FileSystem
<<Kernel>>
<<Kernel>> …
…
D2D
H2D
D2H
F2H/H2F
I/O
M2M
F2M/M2F
Prefetch () Advise (Evict())
Prefetch ()
Advise (Evict()) ………. ……….
Prefetch ()
Advise (Evict())……….……….
Forward Propagation (Existing)
Backward Propagation (Existing)
Forward Propagation (Proposed)
Backward Propagation (Proposed)
Unified Data Layer
A. A. Awan, C-H Chu, X. Lu, H. Subramoni, D.K. Panda, “OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training”, 25th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC) 2018.
OSU Booth - SC ‘18 27Network Based Computing Laboratory High-Performance Deep Learning
0
200
400
600
800
1000
1200
1400
1600
Img
/se
c (H
igh
er
is b
ett
er)
caffe-gpu oc-caffe-naïve oc-caffe-opt
caffe-cpu intel-caffe intel-caffe-opt
oc-caffe-opt is 5Xbetter than
intel-caffe-opt
caffe-gpucannot
run
X
Out-of-core DNN Training
A. A. Awan, C-H Chu, X. Lu, H. Subramoni, D.K. Panda, “OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training”, 25th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC) 2018.
0
50
100
150
200
250
300
Img
/sec
(H
igh
er i
s b
ett
er)
caffe-gpu oc-caffe-naïve oc-caffe-opt
caffe-cpu intel-caffe intel-caffe-opt
caffe-gpucannot
run
X
oc-caffe-opt is2.7X better than
intel-caffe-opt
0
5
10
15
20
Img
/se
c (H
igh
er
is b
ett
er)
caffe-gpu oc-caffe-naïve oc-caffe-opt
caffe-cpu intel-caffe intel-caffe-opt
oc-caffe-optis 80%
better than intel-caffe
caffe-gpucannot
run
X
intel-caffe-opt
(N/A)
X
AlexNet GoogLeNet ResNet-50
• We exploit Unified Memory designs in CUDA 9 and hardware support in Volta GPU
– Better performance than CPU-based solutions for all state-of-the-art Image models
OSU Booth - SC ‘18 28Network Based Computing Laboratory High-Performance Deep Learning
• Introduction
• Research Challenges: Exploiting HPC for
Deep Learning
• Proposed Solutions
• Conclusion
Agenda
OSU Booth - SC ‘18 29Network Based Computing Laboratory High-Performance Deep Learning
Summary
• Deep Learning is on the rise
– Rapid advances in software, hardware, and availability of large datasets
• Single node or single GPU is not enough for Deep Learning workloads
• We need to focus on distributed Deep Learning but there are many
challenges
• MPI offers a great abstraction for communication in DL Training tasks
• A co-design of Deep Learning frameworks and communication runtimes
will be required to make DNN Training scalable
OSU Booth - SC ‘18 30Network Based Computing Laboratory High-Performance Deep Learning
Thank You!
Network-Based Computing Laboratoryhttp://nowlab.cse.ohio-state.edu/
High Performance Deep Learninghttp://hidl.cse.ohio-state.edu/
http://web.cse.ohio-state.edu/~awan.10
The High-Performance MPI/PGAS Projecthttp://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Projecthttp://hidl.cse.ohio-state.edu/