Scalable and Distributed Deep Learning (DL): Co …mvapich.cse.ohio-state.edu/static/media/talks/slide/awan...A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing

Scalable and Distributed Deep Learning (DL): Co-Design MPI Runtimes and DL Frameworks

Ammar Ahmad Awan, Hari Subramoni, and Dhabaleswar K. Panda

Network Based Computing Laboratory

Dept. of Computer Science and Engineering

The Ohio State University

OSU Booth Talk (SC ’18)

OSU Booth - SC ‘18 2Network Based Computing Laboratory High-Performance Deep Learning

• Introduction

– Deep Learning Trends

– CPUs and GPUs for Deep Learning

– Message Passing Interface (MPI)

• Research Challenges: Exploiting HPC for Deep Learning

• Proposed Solutions

• Conclusion

Agenda


• Easily implement and experiment with Deep Neural Networks

– Several Deep Learning (DL) frameworks have emerged

• Caffe, Microsoft Cognitive Toolkit (CNTK), TensorFlow, PyTorch, and

counting....

– Focus on CUDA-Aware MPI based DL frameworks

• Most frameworks have been optimized for NVIDIA GPUs and the

CUDA programming model

– However, distributed training (MPI+CUDA) is still emerging

– Fragmentation in efforts also exists – gRPC, MPI, NCCL, Gloo, etc.

Deep Learning Frameworks


• NVIDIA GPUs - main driving force for faster training of Deep

Neural Networks (DNNs)

• The ImageNet Challenge - (ILSVRC)

– DL models like AlexNet, ResNet, and VGG

– 90% of the ImageNet teams used GPUs in 2014*

– GPUs: A natural fit for DL –throughput-oriented (dense-compute)

– And, GPUs are growing in the HPC arena as well! – Top500 (Jun ‘18)

Deep Learning and GPUs

*https://blogs.nvidia.com/blog/2014/09/07/imagenet/

https://www.top500.org/

https://blogs.nvidia.com/blog/2014/09/07/imagenet/

https://www.top500.org/statistics/list/


And CPUs are catching up fast

1- https://dl.acm.org/citation.cfm?id=19935162- http://ieeexplore.ieee.org/abstract/document/5762730/3- https://dspace.mit.edu/bitstream/handle/1721.1/51839/MIT-CSAIL-TR-2010-013.pdf?sequence=1

• Intel CPUs are everywhere and many-core

CPUs are emerging according to Top500.org

• Host CPUs exist even on the GPU nodes

– Many-core Xeon Phis are increasing

• Usually, we hear CPUs are 10x – 100x slower

than GPUs? [1-3]

– But, CPU-based ML/DL is getting attention and

performance has significantly improved now


System Count for Xeon Phi

https://dl.acm.org/citation.cfm?id=1993516

http://ieeexplore.ieee.org/abstract/document/5762730/

https://dspace.mit.edu/bitstream/handle/1721.1/51839/MIT-CSAIL-TR-2010-013.pdf?sequence=1



• What is Message Passing Interface (MPI)?

– a de-facto standard for expressing distributed-memory parallel programming

– used for communication between processes in multi-process applications

• MVAPICH2 is a high performance implementation of the MPI standard

• What can MPI do for Deep Learning?

– MPI has been used for large scale scientific applications

– Deep Learning can also exploit MPI to perform high-performance communication

• Why do I need communication in Deep Learning?

– If you use one GPU or one CPU, you do not need communication

– But, one GPU or CPU is not enough!

– DL wants as many compute elements as it can get!

– MPI is a great fit – Broadcast, Reduce, and Allreduce is what most DL workloads require

What to use for scale-out? (Distributed training of Neural Nets.)


Overview of the MVAPICH2 Project• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 2,925 organizations in 86 countries

– More than 489,000 (> 0.48 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Jul ‘18 ranking)

• 2nd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China

• 12th, 556,104 cores (Oakforest-PACS) in Japan

• 15th, 367,024 cores (Stampede2) at TACC

• 24th, 241,108-core (Pleiades) at NASA and many others

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade

http://mvapich.cse.ohio-state.edu/


• There are several Deep Learning (DL) or DNN Training frameworks

• Every (almost every) framework has been optimized for NVIDIA GPUs

– cuBLAS and cuDNN have led to significant performance gains!

• But every framework is able to execute on a CPU as well

– So why are we not using them?

– Performance has been “terrible” and several studies have reported

significant degradation when using CPUs (see nvidia.qwiklab.com)

• But there is hope, actually a lot of great progress here!

– And MKL-DNN, just like cuDNN, has definitely rekindled this!!

– The landscape for CPU-based DL looks promising..

Deep Learning Frameworks – CPUs or GPUs?


• Introduction

• Research Challenges: Exploiting HPC for

Deep Learning


• Conclusion

Agenda


How to efficiently exploit

heterogeneous High Performance

Computing (HPC) resources for high-

performance and high-productivity

Deep Learning?

The Key Question!


1. What are the fundamental

issues in designing DL

frameworks?

– Memory Requirements

– Computation

Requirements

– Communication Overhead

2. Why do we need to support

distributed training?

– To overcome the limits of

single-node training

– To better utilize hundreds

of existing HPC Clusters

Research Challenges to Exploit HPC Technologies

InfiniBand GPUCPU

CNTK

Gradient Aggregation

Model PropagationForward

Backward

Deep Learning and Machine Learning Frameworks

Caffe/OSU-Caffe

Caffe2 TensorFlow MXNet

Communication Runtimes to support Distributed Training

HPC Platforms

Major Computation and Communication Phases in DL Frameworks

1

2


3. What are the new design challenges

brought forward by DL frameworks for

Communication runtimes?

– Large Message Collective

Communication and Reductions

– GPU Buffers (CUDA-Awareness)

4. Can a Co-design approach help in

achieving Scale-up and Scale-out efficiently?

– Co-Design the support at Runtime

level and Exploit it at the DL

Framework level

– What performance benefits can

be observed?

– What needs to be fixed at the

communication runtime layer?

5.

Research Challenges to Exploit HPC Technologies (Cont’d)

CUDA-Awareness

InfiniBand GPUCPU

Large-message Collectives

CNTK

Point-to-PointOperations

Gradient Aggregation

Model PropagationForward

Backward

Deep Learning and Machine Learning Frameworks

Caffe/OSU-Caffe

Caffe2 TensorFlow MXNet

Communication Runtimes (MPI/NCCL/Gloo/MLSL)

HPC Platforms

Major Computation and Communication Phases in DL Frameworks

3

4 Co-Design Opportunities


• Introduction


Deep Learning


• Conclusion

Agenda


Overview of the Proposed Solutions

Application Layer

OSU-CaffeDistributed Training

with TensorFlowOut-of-Core

DNN TrainingDNN Training on

CPUs/GPUsCA-CNTK

Communication Middleware Layer (Deep Learning Aware MVAPICH2-GDR)

CUDA-Aware Large Msg.

MPI_Reduce

Co-Design Layer

CUDA-Aware Large Msg.

MPI_Allreduce

Co-Design OSU-Caffe and MVAPICH2-GDR

CUDA-Aware MPI_Bcast

NCCL-basedBcast

Pure MPIBcast

MPI Bcast vs. NCCL2 Bcast

Hardware PlatformsCPU GPU InfiniBand HCA


Bcast (GPU 0)

packed_comm_buff

L1

L2

..

Ln

F

L1

L2

..

Ln

L1

L2

..

Ln

L1

L2

..

Ln

Params

GP

U 0

Params

GP

U 1

Params

GP

U 2

Params

GP

U 3

Gradients

1. Data

Propagation

2. Forward

Backward Pass

3. Gradient

Aggregation

B F B F B F B

packed_reduce_

buffpacked_reduce_

buff

packed_reduce_

buff

packed_reduce_

buff

ApplyUpdates

Reduce (GPU 0)

Loop {}

Caffe Architecture

http://hidl.cse.ohio-state.edu

http://hidl.cse.ohio-state.edu/


• Deep Learning frameworks are a different game

– Unusually large message sizes (order of megabytes)

– Most communication based on GPU buffers

• Existing State-of-the-art

– cuDNN, cuBLAS, NCCL --> scale-up performance

– CUDA-Aware MPI --> scale-out performance

• Proposed: Can we co-design the MPI runtime (MVAPICH2-

GDR) and the DL framework (Caffe) to achieve both?

– Efficient Overlap of Computation and Communication

– Efficient Large-Message Communication (Reductions)

– What application co-designs are needed to exploit

communication-runtime co-designs?

OSU-Caffe: Co-design to Tackle New Challenges for MPI Runtimes

Scal

e-u

p P

erfo

rman

ce

Scale-out Performance

cuDNN

NCCL

gRPC

Hadoop

ProposedCo-Designs

MPIcuBLAS

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)


• Caffe : A flexible and layered Deep Learning

framework.

• Benefits and Weaknesses

– Multi-GPU Training within a single node

– Performance degradation for GPUs across

different sockets

– Limited Scale-out

• OSU-Caffe: MPI-based Parallel Training

– Enable Scale-up (within a node) and Scale-out

(across multi-GPU nodes)

– Scale-out on 64 GPUs for training CIFAR-10

network on CIFAR-10 dataset

– Scale-out on 128 GPUs for training GoogLeNet

network on ImageNet dataset

OSU-Caffe 0.9: Scalable Deep Learning on GPU Clusters

0

50

100

150

200

250

8 16 32 64 128

Trai

nin

g Ti

me

(sec

on

ds)

No. of GPUs

GoogLeNet (ImageNet) on 128 GPUs

Caffe OSU-Caffe (1024) OSU-Caffe (2048)

Invalid use case

OSU-Caffe 0.9 available from HiDL site


• Several Approaches to Distributed Training

– Google RPC (gRPC)

– gRPC+X (X= Verbs API, GDR, and MPI)

– No-gRPC

• No-gRPC designs use:

– MPI or

– NCCL

• Performance is heavily influenced by

– MPI_Allreduce

Scalable Distributed DNN Training using TensorFlow and MPI

Distributed TensorFlow

gRPCAccelerated

gRPC

gRPC+X

gRPC+MPI

gRPC+Verbs

gRPC+GDR

No-gRPC

Baidu-MPI

Horovod

MPI

NCCL

A. A. Awan et al. “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs and Performance Evaluation”, Submitted to IPDPS-19 for peer-review, Available from: https://arxiv.org/abs/1810.11112

https://arxiv.org/abs/1810.11112


Faster Allreduce in the proposed MPI-Opt

implemented in MVAPICH2-GDR

TensorFlow with CUDA-Aware MPI: NCCL vs. MVAPICH2-GDR

A. A. Awan et al. “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs and Performance Evaluation”, Submitted to IPDPS-19 for peer-review, Available from: https://arxiv.org/abs/1810.11112

1

10

100

1000

10000

100000

10000008 32 128

512 2K 8K 32K

128K

512K 2M 8M 32M

128M

Late

ncy

(µ

s)

Message Size (Bytes)

MPI NCCL2 MPI-Opt

1

4

16

64

256

1024

4096

16384

1 2 4 8 16 32 64

Imag

es/s

eco

nd (

Hig

her

is

bet

ter)

No. of Nodes (GPUs)

Horovod-NCCL2 Horovod-MPI-Opt Ideal

Faster (near-ideal) DNN Training speed-ups in TensorFlow-Horovod

–>

https://arxiv.org/abs/1810.11112


0

100

200

300

2 4 8 16 32 64 128

Late

ncy

(m

s)

Message Size (MB)

Reduce – 192 GPUs

Large Message Optimized Collectives for Deep Learning

0

100

200

128 160 192

Late

ncy

(m

s)

No. of GPUs

Reduce – 64 MB

0

100

200

300

16 32 64

Late

ncy

(m

s)

No. of GPUs

Allreduce - 128 MB

0

50

100

2 4 8 16 32 64 128

Late

ncy

(m

s)

Message Size (MB)

Bcast – 64 GPUs

0

50

100

16 32 64

Late

ncy

(m

s)

No. of GPUs

Bcast 128 MB

• MVAPICH2-GDR provides

optimized collectives for

large message sizes

• Optimized Reduce,

Allreduce, and Bcast

• Good scaling with large

number of GPUs

• Available in MVAPICH2-

GDR 2.2 and higher

0

100

200

300

2 4 8 16 32 64 128La

ten

cy (

ms)

Message Size (MB)

Allreduce – 64 GPUs


• NCCL has some limitations

– Only works for a single node, thus, no scale-out on

multiple nodes

– Degradation across IOH (socket) for scale-up (within a

node)

• We propose optimized MPI_Bcast

– Communication of very large GPU buffers (order of

megabytes)

– Scale-out on large number of dense multi-GPU nodes

• Hierarchical Communication that efficiently exploits:

– CUDA-Aware MPI_Bcast in MV2-GDR

– NCCL Broadcast primitive

Efficient Broadcast for MVAPICH2-GDR using NVIDIA NCCL

1

10

100

1000

10000

100000

1 8

64

512

4K

32K

256

K

2M

16M

128

M

Late

ncy

(u

s)Lo

g Sc

ale

Message Size

MV2-GDR MV2-GDR-Opt

100x

0

10

20

30

2 4 8 16 32 64Tim

e (s

eco

nd

s)

Number of GPUs

MV2-GDR MV2-GDR-Opt

Performance Benefits: Microsoft CNTK DL framework (25% avg. improvement )

Performance Benefits: OSU Micro-benchmarks

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning, A. Awan , K. Hamidouche , A. Venkatesh , and D. K. Panda, EuroMPI 16[Best Paper Runner-Up]


• MPI_Bcast: Design and Performance Tuning for DL

Workloads

– Design ring-based algorithms for large messages

– Harness a multitude of algorithms and techniques for best

performance across the full range of message size and

process/GPU count

• Performance Benefits

– Performance comparable or better than NCCL-

augmented approaches for large messages

– Up to 10X improvement for small/medium message sizes

with micro-benchmarks and up to 7% improvement for

VGG training

Pure MPI Large Message Broadcast

0.001

0.01

0.1

1

10

100

1000

1 8

64

51

2

4K

32

K

25

6K

2M

16

M

12

8M

Late

ncy

(m

s)-

logs

cale

MessageSize(bytes)

MV2-GDR-NCCL MV2-GDR-Opt

0

5

10

15

20

25

30

2 4 8 32 64 128

Trai

nin

gTi

me

(sec

on

ds)

No.ofGPUs

MV2-GDR-NCCL MV2-GDR-Opt

VGG Training with CNTK

A. A. Awan, C-H. Chu, H. Subramoni, and D. K. Panda. Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?, EuroMPI ‘18

MPI Bcast Benchmark: 128 GPUs (8 nodes)


• Performance depends on

many factors

• Hardware Architectures

– GPUs

– Multi-/Many-core CPUs

– Software Libraries: cuDNN (for

GPUs), MKL-DNN/MKL 2017

(for CPUs)

• Hardware and Software co-

design

– Software libraries optimized

for one platform will not help

the other!

– cuDNN vs. MKL-DNN

Understanding the Impact of Execution EnvironmentsDLApplications(ImageRecognition,SpeechProcessing,etc.)

DLFrameworks(Caffe,TensorFlow,etc.)

BLASLibraries

Hardware

Many-coreGPU(PascalP100)

GenericConvolutionLayer

MKLOptimizedConvolutionLayer

MKL2017 cuDNN/cuBLAS

Multi-/Many-core(Xeon,XeonPhi)

cuDNN OptimizedConvolutionLayer

OtherBLASLibraries

OpenBLASATLAS

OtherProcessors

A. A. Awan, H. Subramoni, D. Panda, “An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures” 3rd Workshop on Machine Learning in High Performance Computing Environments, held in conjunction with SC17, Nov 2017.


• We use MCDRAM as Cache for all the

subsequent results

• On average, DDR-All is up to 1.5X slower

than MCDRAM

• MKL engine is up to 3X better than default

Caffe engine

• Biggest gains for Intel Xeon Phi (many-

core) architecture

• Both Haswell and Broadwell architectures

get significant speedups (up to 1.5X)

Impact of MKL engine and MC-DRAM for Intel-Caffe

0200400600800

10001200140016001800

Trai

nin

gT

ime

(m

s)

CPUArchitectures

0

100

200

300

400

500

600

700

DDR-All MCDRAM-All MCDRAMasCache

Tra

inin

gT

ime

(m

s)

MemoryConfigurations

Forward Backward


The Full Landscape for AlexNet Training on CPU/GPU

0

50

100

150

200

250

300

350

400

450

500

Tim

e(m

s)

conv1 conv2 conv3 conv4 conv5

• Convolutions in the Forward and Backward Pass

• Faster Convolutions Faster Training

• Most performance gains are based on conv2 and conv3.

0

100

200

300

400

500

600

700

800

Tim

e(

ms)

conv1 conv2 conv3 conv4 conv5


• What if your Neural Net is bigger than the GPU memory (out-of-core)?

– Use our proposed Unified Memory solution called OC-DNN :-)

Out-of-core DNN Training

FileSystem

Data Layer (L1) Layer 2

<<Kernel>>

………. Layer N

<<Kernel>> …

D2D H2D D2HF2H/H2F I/O

…

FileSystem

<<Kernel>>

<<Kernel>> …

…

D2D

H2D

D2H

F2H/H2F

I/O

M2M

F2M/M2F

Prefetch () Advise (Evict())

Prefetch ()

Advise (Evict()) ………. ……….

Prefetch ()

Advise (Evict())……….……….

Forward Propagation (Existing)

Backward Propagation (Existing)

Forward Propagation (Proposed)

Backward Propagation (Proposed)

Unified Data Layer

A. A. Awan, C-H Chu, X. Lu, H. Subramoni, D.K. Panda, “OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training”, 25th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC) 2018.


0

200

400

600

800

1000

1200

1400

1600

Img

/se

c (H

igh

er

is b

ett

er)

caffe-gpu oc-caffe-naïve oc-caffe-opt

caffe-cpu intel-caffe intel-caffe-opt

oc-caffe-opt is 5Xbetter than

intel-caffe-opt

caffe-gpucannot

run

X

Out-of-core DNN Training

A. A. Awan, C-H Chu, X. Lu, H. Subramoni, D.K. Panda, “OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training”, 25th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC) 2018.

0

50

100

150

200

250

300

Img

/sec

(H

igh

er i

s b

ett

er)



caffe-gpucannot

run

X

oc-caffe-opt is2.7X better than

intel-caffe-opt

0

5

10

15

20

Img

/se

c (H

igh

er

is b

ett

er)



oc-caffe-optis 80%

better than intel-caffe

caffe-gpucannot

run

X

intel-caffe-opt

(N/A)

X

AlexNet GoogLeNet ResNet-50

• We exploit Unified Memory designs in CUDA 9 and hardware support in Volta GPU

– Better performance than CPU-based solutions for all state-of-the-art Image models

mailto:[email protected]




• Introduction


Deep Learning


• Conclusion

Agenda


Summary

• Deep Learning is on the rise

– Rapid advances in software, hardware, and availability of large datasets

• Single node or single GPU is not enough for Deep Learning workloads

• We need to focus on distributed Deep Learning but there are many

challenges

• MPI offers a great abstraction for communication in DL Training tasks

• A co-design of Deep Learning frameworks and communication runtimes

will be required to make DNN Training scalable


Thank You!

Network-Based Computing Laboratoryhttp://nowlab.cse.ohio-state.edu/

High Performance Deep Learninghttp://hidl.cse.ohio-state.edu/

[email protected]

http://web.cse.ohio-state.edu/~awan.10

The High-Performance MPI/PGAS Projecthttp://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Projecthttp://hidl.cse.ohio-state.edu/

http://nowlab.cse.ohio-state.edu/

http://nowlab.cse.ohio-state.edu/



Documents

Scalable and Distributed Deep Learning (DL): Co …mvapich.cse.ohio-state.edu/static/media/talks/slide/awan...A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing