
HPC, GPUS AND NETWORKING

Davide Rossetti, Senior Engineer, CUDA Driver


INDEX

GPU-friendly networks

history: APEnet+

research: GGAS, PEACH2,

corporate: w/ MLNX OFED 1.5.4.1+GDR, OFED 2.1/2.2

GDR enabler for low-perf CPUs

current limitations and workarounds (WARs)

accelerate eager for GPU

perspective: I/O bus limited -> NVLINK

dream: Frankenstein board


SCALING PROBLEM

many interesting apps bang on the scaling wall

single big problem = capability computing

fixed physical problem size

parallel decomposition

lots of GPUs O(10^2-10^3)

different reasons

imbalance

poor coding

overheads -> GPU<->network interaction

even optimized apps

let’s move the wall further away…


GPU-friendly networks


RESEARCH

APEnet+ card:

Network Processor, SoC design

FPGA-based

8-port switch

6 bidirectional links (max 34 Gbps)

PCIe x8 Gen2 in an x16 slot (4+4 GB/s)

32-bit RISC

Accelerators:

Zero-copy RDMA host interface

GPU P2P support

P2P/GPUDirectRDMA on APEnet+

[*] R.Ammendola et al, “GPU peer-to-peer techniques applied to a cluster interconnect”, CASS 2013


DIRECT VS (PIPELINED) STAGING

• Staging to host memory:

• 14us latency

• 7-10us penalty for each cudaMemcpy

• < 1us NIC HW latency

• SW complexity

• Pipelined algorithm

• MVAPICH2, OpenMPI

Direct path:

7us latency (11/2011)

GPU<->GPU path

both NV P2P and GPUDirectRDMA

GPU<->3rd party devices

HPC scaling

Other applications: HEP, Astronomy
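For the staging side of this comparison, here is a minimal sketch of a pipelined staging send, in the spirit of what MVAPICH2 and OpenMPI do internally: the GPU buffer is moved to a pinned host buffer in chunks, so the cudaMemcpyAsync of one chunk overlaps the MPI_Isend of the previous one. The chunk size, tag and the helper name pipelined_send are illustrative, not the libraries' actual code.

#include <mpi.h>
#include <cuda_runtime.h>

/* Pipelined staging of a GPU buffer through pinned host memory (illustrative sketch). */
void pipelined_send(const char *d_src, size_t total, size_t chunk,
                    int dst, MPI_Comm comm, cudaStream_t stream)
{
    char *h_stage;
    cudaMallocHost((void **)&h_stage, total);      /* pinned host staging buffer */

    MPI_Request req = MPI_REQUEST_NULL;
    for (size_t off = 0; off < total; off += chunk) {
        size_t len = (off + chunk <= total) ? chunk : total - off;

        /* stage this chunk to host; while it copies, the previous Isend is in flight */
        cudaMemcpyAsync(h_stage + off, d_src + off, len,
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);

        if (req != MPI_REQUEST_NULL)
            MPI_Wait(&req, MPI_STATUS_IGNORE);     /* retire the previous chunk */
        MPI_Isend(h_stage + off, (int)len, MPI_BYTE, dst, 0, comm, &req);
    }
    if (req != MPI_REQUEST_NULL)
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    cudaFreeHost(h_stage);
}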


APENET


GPU-NETWORK FUSION CARD

Frankenstein board

Take half of a “GTX 590”

Add a dedicated APEnet

Tune the network bandwidth

Put 1, 2, ... per server

Perfect for HPC

[Diagram: board block diagram including a PLX PCIe switch]


RESEARCH

PEACH2 @ Tsukuba

GGAS = Global GPU Address Spaces @ Heidelberg

PEACH2 & GGAS

Center for Computational Sciences, Univ. of Tsukuba

PEACH2 board (production version for HA-PACS/TCA)

PCI Express Gen2 x8 peripheral board

Compatible with the PCIe spec.

[Photos: top and side view of the PEACH2 board]


GPUDIRECTRDMA ON MELLANOX IB BOARDS

GPUDirect RDMA support

early prototype started in Feb 2013

presented at GTC 2013 on MVAPICH2 1.9

rework for Mellanox SW stack

May 2013 - Dec 2013

showcased at SC’13 at Mellanox booth

shipping since 1/2014 MOFED 2.1

OpenMPI 1.7.4

MVAPICH2 2.0b


GPUDIRECT RDMA + INFINIBAND + CUDA-AWARE MPI

Less overhead = lower latency and more effective bandwidth

When GDRDMA is effective:

3x more BW

~70% less latency

Not always good:

Experimentally, cross-over behavior at 8-16KB

GDRDMA reading BW cap at 800MB/s on SNB

GDRDMA writing BW saturates.

2-node MPI benchmarks

with and without GPUDirect RDMA

Intel SNB, K20c, MLNX FDR PCIe X8 gen3
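At the application level, CUDA-aware MPI means device pointers go straight into MPI calls and the library picks the GPUDirect RDMA or staged path internally. A minimal sketch, assuming an MPI build with CUDA support (such as the MVAPICH2 and OpenMPI releases above); the buffer size and tag are illustrative.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int nbytes = 4096;               /* below the observed 8-16KB cross-over */
    void *d_buf;
    cudaMalloc(&d_buf, nbytes);      /* plain device memory, no host staging in the app */

    if (rank == 0)
        MPI_Send(d_buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}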


ON IVY BRIDGE XEON


MPI USING GPUDIRECT RDMA

any space for improvement?

[Figure: GPU-GPU internode MPI latency with MVAPICH2-GDR 2.0b, comparing 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR. Small messages (1 B - 4 KB): 5.49 us with GDR, a 67% improvement; large messages (8 KB - 2 MB): ~10% improvement. Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect RDMA plug-in. Performance of MVAPICH2 with GPUDirect RDMA back at GTC14 time. (Source: MVAPICH2 webinar, June 2014.)]


BENCHMARK RESULTS

MV2-GDR 2.0b: original version

MV2-GDR-New: experimental version

latency

[Figure: GPU-GPU internode MPI small-message latency (1 B - 4 KB): MV2-GDR 2.0b at 7.3 us, MV2-GDR-New (Loopback) at 5.0 us, MV2-GDR-New (Fast Copy) at 3.5 us, a 52% improvement. Intel Ivy Bridge (E5-2630 v2) node with 12 cores, NVIDIA Tesla K20c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 6.0, Mellanox OFED 2.1 with GPUDirect RDMA plug-in. (Source: MVAPICH2 webinar, June 2014.)]


RESTORING EAGER

Eager protocol on IB:

sender:

copy on pre-registered TX buffers

ibv_post_send on UD, UC, RC, …

receiver:

pre-post temp RX buffers, use credits for proper bookkeeping

RX matching then copy to final destination

recalling the Eager protocol (sender side sketched below)
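A minimal sketch of the sender side, assuming the connected QP, the pre-registered TX bounce buffer and the credit accounting already exist; the names are illustrative, not MVAPICH2/OpenMPI internals.

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Eager send: copy the payload into a pre-registered bounce buffer and post a SEND;
 * the receiver has pre-posted matching RX buffers and does the matching + final copy. */
int eager_send(struct ibv_qp *qp, void *tx_buf, struct ibv_mr *tx_mr,
               const void *payload, size_t len)
{
    memcpy(tx_buf, payload, len);                 /* step 1: copy to TX bounce buffer */

    struct ibv_sge sge = {
        .addr   = (uintptr_t)tx_buf,
        .length = (uint32_t)len,
        .lkey   = tx_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,                /* works on UD, UC, RC QPs alike */
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);       /* step 2: hand it to the HCA */
}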


RESTORING EAGER

get rid of rendezvous because of:

bad interplay with apps

excessive sync among nodes

more round-trips

Problem:

staging: moving data from host temp buf to GPU final dest

cudaMemcpy has >4us overhead (~24KB threshold at 6GB/s)

Eager is an essential tool for low latency on small messages


RESTORING EAGER

IB loopback

use NIC as low-latency copy engine

GDRCopy

CPU-driven zero-latency BAR1 copy

Two tricks


RX IB LOOPBACK FLOW

[Diagram: RX IB loopback flow. TX side: src_buf in GPU memory is read directly by the Mellanox Connect-IB HCA. RX side: the data lands in a pre-posted staging buffer in CPU memory; the local HCA then performs a loopback RDMA write from that buffer into dst_buf in GPU memory, standing in for cudaMemcpy. A sketch of the loopback write follows.]
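A minimal sketch of that RX-side loopback copy, assuming a QP connected back to the same HCA, a registered host staging buffer, and GPU memory registered through GPUDirect RDMA; addresses, rkeys and names are illustrative.

#include <stdint.h>
#include <infiniband/verbs.h>

/* Instead of cudaMemcpy, post an RDMA WRITE through the local HCA from a registered
 * host bounce buffer into GDR-registered GPU memory; the NIC acts as the copy engine. */
int loopback_copy_to_gpu(struct ibv_qp *loopback_qp,
                         void *host_buf, struct ibv_mr *host_mr,
                         uint64_t gpu_addr, uint32_t gpu_rkey, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)host_buf,
        .length = (uint32_t)len,
        .lkey   = host_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id               = 2,
        .sg_list             = &sge,
        .num_sge             = 1,
        .opcode              = IBV_WR_RDMA_WRITE,
        .send_flags          = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = gpu_addr,          /* GPU buffer, GDR-registered */
        .wr.rdma.rkey        = gpu_rkey,
    };
    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(loopback_qp, &wr, &bad_wr); /* completion polled on the CQ */
}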


EAGER WITH GDRCOPY

experimental super low-latency copy library

three sets of primitives:

Pin/Unpin: set up / tear down BAR1 mappings of GPU memory

Map/Unmap: memory-map the BAR1 range into the user-space CPU address range, so the CPU can use standard load/store instructions (MMIO) to access the GPU memory

Copy to/from PCIe BAR: highly tuned read/write functions

(ab)use the CPU to copy data to/from GPU memory (see the sketch below)
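A minimal sketch of an RX-side copy with GDRCopy. The function names follow the gdrapi.h interface of the library linked below, but the exact signatures are quoted from memory and may differ; real code would also align to 64 KB GPU pages (via gdr_get_info) and cache the pin/map instead of redoing it per message.

#include <stddef.h>
#include <gdrapi.h>

/* Copy len bytes from host memory into a cudaMalloc'ed GPU buffer via its BAR1 mapping. */
int gdrcopy_eager_rx(void *d_dst, const void *h_src, size_t len)
{
    gdr_t g = gdr_open();                          /* talks to the gdrdrv kernel module */
    if (!g) return -1;

    gdr_mh_t mh;
    /* Pin: create a (page-granular) BAR1 mapping of the GPU destination buffer */
    if (gdr_pin_buffer(g, (unsigned long)d_dst, len, 0, 0, &mh)) return -1;

    /* Map: expose that BAR1 range in the CPU address space */
    void *bar_ptr = NULL;
    if (gdr_map(g, mh, &bar_ptr, len)) return -1;

    /* Copy: plain CPU stores (MMIO writes) into GPU memory, no cudaMemcpy, no DMA */
    gdr_copy_to_mapping(mh, bar_ptr, h_src, len);

    gdr_unmap(g, mh, bar_ptr, len);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    return 0;
}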


MPI RX GDRCOPY FLOW

[Diagram: MPI RX GDRCopy flow, same TX/RX layout as the loopback flow, but on the RX side the GDRCopy data path is used: the CPU copies the received data from the staging buffer in CPU memory into dst_buf in GPU memory through the BAR1 mapping.]


GDRCOPY

zero latency vs ~1us for loopback vs 4-6us for cudaMemcpy

D-H: 20-30MB/s

H-D: 6GB/s vs 9GB/s for cudaMemcpy

sensitive to

CPU MMIO performance

NUMA effects, e.g. on IVB D-H = 3 GB/s when on the wrong socket

available soon on https://github.com/drossetti/gdrcopy

performance


BENCHMARK RESULTS

on IVB Xeon + K40m + MLNX C-IB

Small-message latency (us):

Size (B)   MV-2.0b   MV2-GDR-2.0b   MV2-GDR   loopback   gdrcopy
0          1.27      1.31           1.16      1.22       1.21
1          19.23     7.03           5.29      4.77       3.19
2          19.26     7.02           5.28      4.78       3.18
4          19.35     7.01           5.26      4.77       3.17
8          19.45     7.00           5.26      4.79       3.17

* with MVAPICH2 GDR pre-release


BENCHMARK RESULTS

bandwidth

[Figure: GPU-GPU internode MPI bandwidth and bi-directional bandwidth for small messages (1 B - 4 KB), comparing MV2-GDR 2.0b, MV2-GDR-New (Loopback) and MV2-GDR-New (Fast Copy): 2.2x bandwidth and 2.1x bi-bandwidth improvement. Intel Ivy Bridge (E5-2630 v2) node with 12 cores, NVIDIA Tesla K20c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 6.0, Mellanox OFED 2.1 with GPUDirect RDMA plug-in. (Source: MVAPICH2 webinar, June 2014.)]


coming soon…


MULTI-GPU APPS SCALING PROBLEM

The Glotzer Group @ University of Michigan

Time spent in subroutines: LJ Pair, Ghost Update, Neighbor List

MVAPICH2 2.0b, K20X GPU, 64000 particles

• Compute kernels scale
• Communication cost is constant

[Figure: time per step (ms) vs number of GPUs (1-16) for the NVT1, NVT2 and NetForce series.]


HOOMD-BLUE PROFILING

MV2-GDR 2.0b

[Profiler timeline annotations: MPI launch latency; the GPU BAR1 copy happens here but is invisible; CPU idle.]


LIMITED SCALING

CPU management, e.g. cudaStreamQuery() & MPI_Test():

poll GPU kernels and trigger NIC comm

poll NIC comm then schedule GPU kernels

Today, scaling needs a fast CPU

implications for ARM

a slow CPU is not enough at ~10^2 GPUs

how it works today (see the sketch below)
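A minimal sketch of that CPU progress role, with illustrative names and a CUDA-aware MPI assumed for the device-pointer Isend.

#include <mpi.h>
#include <cuda_runtime.h>

/* One iteration of the CPU-driven progress loop: the CPU polls the GPU stream and the
 * network, and is the agent that triggers the next NIC operation or kernel launch. */
void progress_step(cudaStream_t halo_stream, MPI_Request *send_req,
                   void *d_halo, int len, int peer, MPI_Comm comm)
{
    /* GPU done producing the halo?  Then the CPU kicks the NIC. */
    if (*send_req == MPI_REQUEST_NULL && cudaStreamQuery(halo_stream) == cudaSuccess)
        MPI_Isend(d_halo, len, MPI_BYTE, peer, 0, comm, send_req);

    /* NIC done?  Then the CPU is free to schedule the next GPU kernel. */
    int done = 0;
    if (*send_req != MPI_REQUEST_NULL)
        MPI_Test(send_req, &done, MPI_STATUS_IGNORE);
    /* if (done) launch the next kernel here -- with a slow CPU this polling loop
       becomes the bottleneck at O(10^2) GPUs */
}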


1D STENCIL COMPUTATION

halo<<<stream0>>>(h0,p0);
MPIX_Isend(p0,stream0,&req0[0]);
halo<<<stream1>>>(h1,p1);
MPIX_Isend(p1,stream1,&req1[0]);
bulk<<<stream2>>>(d);
MPIX_Irecv(h0,stream0,&req0[1]);
MPIX_Wait(stream0, req0, 2);         // waits for both the Isend and the Irecv
cudaEventRecord(e0,stream0);
MPIX_Irecv(h1,stream1,&req1[1]);
MPIX_Wait(stream1, req1, 2);
cudaEventRecord(e1,stream1);
cudaStreamWaitEvent(stream2, e0, 0); // the event record wakes the stream-wait-event
cudaStreamWaitEvent(stream2, e1, 0);
Energy<<<stream2>>>(d);
cudaStreamSynchronize(stream2);

expands into CUDA + IB verbs + verbs/CUDA interop


DEPENDENCY GRAPH

[Diagram: dependency graph of the stencil code across stream0, stream1 and stream2. On stream0 and stream1: halo kernel -> Isend -> Irecv -> Wait -> event record. On stream2: bulk kernel, then stream-event waits on both recorded events, then the Energy kernel. The work is split between the GPU command engine and the compute engine.]


NEXT GEN I/O

NVLink 1.0 in Pascal

beyond PCIe

an additional data path, 80-200 GB/s

an NVLink-attached network?


SUMMARY

P2P: platforms (IVB) still limited; Haswell?

challenge: manage multiple data paths and topologies, eventually NVLink

reduce overheads, use the GPU scheduler


Thank You !!!

Questions ?


PCI BAR?

BAR = Base Address Register

PCI resource

up to 6 32-bit BAR registers

a 64-bit BAR uses 2 registers

physical address range

prefetchable ~ cacheable [danger here]

most GPUs expose 3 BARs

BAR1 space varies:

K20/K40c = 256 MB

K40m = 16 GB [new!]

$ lspci -vv -s 1:0.0
01:00.0 3D controller: NVIDIA Corporation Device 1024 (rev a1)
        Subsystem: NVIDIA Corporation Device 0983
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 86
        Region 0: Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at b0000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at c0000000 (64-bit, prefetchable) [size=32M]

$ nvidia-smi

BAR1 Memory Usage


GPU BAR

BAR1 = aperture on GPU memory

GPUDirectRDMA == GPU BAR1 as used by 3rd party devices:

network cards

storage devices

SSD

reconfigurable devices

FPGAs

[Diagram: a 3rd party device (NIC, FPGA, SSD) uses its DMA engine to access GPU device memory through the BAR1 aperture, alongside the CPU and SYSMEM. A registration sketch follows.]
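From user space, with the Mellanox OFED peer-memory (nv_peer_mem) module loaded, "BAR1 as used by a 3rd party device" boils down to registering device memory with the HCA exactly as if it were host memory. A minimal sketch; the protection domain is assumed to be created elsewhere.

#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* Register a cudaMalloc'ed buffer with the HCA so it can DMA straight to/from
 * the GPU BAR1 aperture; without the peer-memory plug-in this call fails. */
struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t nbytes, void **d_buf_out)
{
    void *d_buf = NULL;
    cudaMalloc(&d_buf, nbytes);                      /* plain GPU device memory */

    struct ibv_mr *mr = ibv_reg_mr(pd, d_buf, nbytes,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    *d_buf_out = d_buf;
    return mr;                                       /* mr->lkey / rkey usable in WRs */
}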


PLATFORM-RELATED LIMITATIONS

PCIE topology and NUMA effects:

number and type of traversed chipsets/bridges/switches

GPU memory reading BW most affected

Sandy Bridge Xeon severely limited

On older chipsets, writing BW affected too

crossing the inter-CPU bus is a huge bottleneck

currently allowed, though


NUMA EFFECTS AND PCIE TOPOLOGY

GPU BAR1 reading is half PCIE peak, by design

GPUDirect RDMA performance may suffer from bottlenecks

GPU memory reading most affected

old chipsets, writing too

IVB Xeon PCIE much better


MPI STRATEGY

use GPUDirectRDMA for small/mid-size buffers

threshold depends on platform

and on NUMA (e.g. crossing QPI)

don’t tax BAR1 too much

bookkeeping needed [being implemented]

revert to pipelined staging through SYSMEM

GPU CE latency >= 6us

no Eager-send for G-to-G/H

excessive inter-node sync


PLATFORM & BENCHMARKS

SMC 2U SYS-2027GR-TRFH

IVB Xeon

PEX 8747 PCIe switch, on a riser card

SMC 1U SYS-1027GR-TRF

NVIDIA K40m

Mellanox dual-port Connect-IB

Benchmarks:

GPU-extended ibv_ud_pingpong

GPU-extended ib_rdma_bw


HOST TO GPU (IVY BRIDGE)

[Diagram: two IVB Xeon nodes, each with a GK110B GPU on x16 Gen3 and a Mellanox Connect-IB HCA, single-rail FDR (6.1 GB/s). Host memory to GPU memory: BW 6.1 GB/s, latency 1.7 us.]


GPU TO HOST (IVY BRIDGE)

[Diagram: same IVB setup. GPU memory to host memory: BW 3.4/3.7* GB/s, latency 1.7** us.]


GPU TO GPU (IVY BRIDGE)

[Diagram: same IVB setup. GPU memory to GPU memory: BW 3.4/3.7* GB/s, latency 1.9 us.]


GPU TO HOST (SANDY BRIDGE)

[Diagram: SNB Xeon nodes with GK110 GPUs on x16 Gen2 and Mellanox ConnectX-3 HCAs on x8 Gen3, single-rail FDR (6.1 GB/s). GPU memory to host memory: BW 800 MB/s.]


GPU TO GPU, TX ACROSS QPI

[Diagram: IVB setup, but on the TX side the GPU and the HCA sit on different sockets, so the traffic crosses QPI. GPU memory to GPU memory: BW 1.1 GB/s, latency 1.9 us.]


GPU TO GPU, RX ACROSS QPI

[Diagram: same, with the QPI crossing on the RX side. GPU memory to GPU memory: BW 0.25 GB/s, latency 2.2 us.]


HOST TO GPU, RX ACROSS PLX

[Diagram: IVB Xeon nodes where the GK110B GPU and the Connect-IB HCA sit behind a PLX PCIe switch (x16 Gen3), single-rail FDR (6.1 GB/s). Host memory to GPU memory, crossing the PLX on the RX side: BW 6.1 GB/s, latency 1.9 us.]


GPU TO GPU, TX/RX ACROSS PLX

[Diagram: same PLX setup on both sides, single-rail FDR (6.1 GB/s). GPU memory to GPU memory: BW 5.8 GB/s, latency 1.9 us.]