In this video from the 2014 HPC Advisory Council Europe Conference, DK Panda from Ohio State University presents: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans. This talk focuses on the latest developments and future plans for the MVAPICH2 and MVAPICH2-X projects. For the MVAPICH2 project, we focus on scalable and highly-optimized designs for point-to-point communication (two-sided and one-sided MPI-3 RMA), collective communication (blocking and MPI-3 non-blocking), support for GPGPUs and Intel MIC, support for the MPI-T interface, and schemes for fault-tolerance/fault-resilience. For the MVAPICH2-X project, we focus on efficient support for the hybrid MPI and PGAS (UPC and OpenSHMEM) programming model with a unified runtime. Watch the video presentation: http://wp.me/p3RLHQ-coF
MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans
by
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at the HPC Advisory Council European Conference (2014)
High-End Computing (HEC): PetaFlop to ExaFlop
HPC Advisory Council European Conference, June '14 2
Expected to have an ExaFlop system in 2020-2022!
100 PFlops in 2015
1 EFlops in 2018?
HPC Advisory Council European Conference, June '14 3
Trends for Commodity Computing Clusters in the Top500 List (http://www.top500.org)
[Chart: number of clusters and percentage of clusters in the Top500 over time]
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• InfiniBand very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing
[Figure labels: Multi-core Processors; High Performance Interconnects – InfiniBand (<1 usec latency, >100 Gbps bandwidth); Accelerators / Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); example systems: Tianhe-2 (1), Titan (2), Stampede (6), Tianhe-1A (10)]
4
HPC Advisory Council European Conference, June '14
Large-scale InfiniBand Installations
• 207 IB clusters (41%) in the November 2013 Top500 list (http://www.top500.org)
• Installations in the Top 40 (19 systems):
519,640 cores (Stampede) at TACC (7th)
147,456 cores (SuperMUC) in Germany (10th)
74,358 cores (Tsubame 2.5) at Japan/GSIC (11th)
194,616 cores (Cascade) at PNNL (13th)
110,400 cores (Pangea) at France/Total (14th)
96,192 cores (Pleiades) at NASA/Ames (16th)
73,584 cores (Spirit) at USA/Air Force (18th)
77,184 cores (Curie thin nodes) at France/CEA (20th)
120,640 cores (Nebulae) at China/NSCS (21st)
72,288 cores (Yellowstone) at NCAR (22nd)
70,560 cores (Helios) at Japan/IFERC (24th)
138,368 cores (Tera-100) at France/CEA (29th)
60,000 cores, iDataPlex DX360M4 at Germany/Max-Planck (31st)
53,504 cores (PRIMERGY) at Australia/NCI (32nd)
77,520 cores (Conte) at Purdue University (33rd)
48,896 cores (MareNostrum) at Spain/BSC (34th)
222,072 cores (PRIMERGY) at Japan/Kyushu (36th)
78,660 cores (Lomonosov) in Russia (37th)
137,200 cores (Sunway Blue Light) in China (40th)
and many more!
5
HPC Advisory Council European Conference, June '14 6
Parallel Programming Models Overview
[Diagram: three abstract machine models]
• Shared Memory Model – SHMEM, DSM
• Distributed Memory Model – MPI (Message Passing Interface)
• Partitioned Global Address Space (PGAS) – logical shared memory: Global Arrays, UPC, Chapel, X10, CAF, ...
• Programming models provide abstract machine models
• Models can be mapped on different types of systems – Distributed Shared Memory (DSM), MPI within a node, etc.
– Logical shared memory across nodes (PGAS)
7
Partitioned Global Address Space (PGAS) Models
HPC Advisory Council European Conference, June '14
• Key features
  – Simple shared memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages
    • Unified Parallel C (UPC)
    • Co-Array Fortran (CAF)
    • X10
    • Chapel
  – Libraries
    • OpenSHMEM
    • Global Arrays
• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) Model
  – MPI across address spaces
  – PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared memory programming models
• Can co-exist with OpenMP for offloading computation
• Applications can have kernels with different communication patterns
• Can benefit from different models
• Re-writing complete applications can be a huge effort
• Port critical kernels to the desired model instead
HPC Advisory Council European Conference, June '14 8
MPI+PGAS for Exascale Architectures and Applications
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:
  – Best of the Distributed Computing Model
  – Best of the Shared Memory Computing Model
• Exascale Roadmap*: "Hybrid Programming is a practical way to program exascale systems"
* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420
[Diagram: an HPC application composed of Kernels 1..N written in MPI, with selected kernels (e.g. Kernel 2, Kernel N) re-written in PGAS]
HPC Advisory Council European Conference, June '14 9
Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges
[Layer diagram – co-design opportunities and challenges across various layers, targeting performance, scalability, and fault-resilience:]
• Application Kernels/Applications
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
• Middleware: Communication Library or Runtime for Programming Models
  – Point-to-point Communication (two-sided & one-sided)
  – Collective Communication
  – Synchronization & Locks
  – I/O & File Systems
  – Fault Tolerance
• Networking Technologies (InfiniBand, 40/100GigE, Aries, BlueGene)
• Multi/Many-core Architectures
• Accelerators (NVIDIA and MIC)
Challenges in Designing (MPI+X) at Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
• Balancing intra-node and inter-node communication for next-generation multi-core nodes (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Support for GPGPUs and Accelerators
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
  – Power-aware
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, ...)
11 HPC Advisory Council European Conference, June '14
MVAPICH2/MVAPICH2-X Software
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Support for GPGPUs and MIC
  – Used by more than 2,150 organizations in 72 countries
  – More than 214,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 7th ranked 519,640-core cluster (Stampede) at TACC
    • 11th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
    • 16th ranked 96,192-core cluster (Pleiades) at NASA
    • 75th ranked 16,896-core cluster (Keeneland) at GaTech and many others . . .
  – Available with software stacks of many IB, HSE, and server vendors including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede system
12 HPC Advisory Council European Conference, June '14
MVAPICH2 2.0 GA and MVAPICH2-X 2.0 GA
• Released on 06/20/14
• Major features and enhancements
  – Based on MPICH-3.1
  – MPI-3 RMA support
  – Support for non-blocking collectives
  – MPI-T support
  – CMA support is default for intra-node communication
  – Optimization of collectives with CMA support
  – Large message transfer support
  – Reduced memory footprint
  – Improved job start-up time
  – Optimization of collectives and tuning for multiple platforms
  – Updated hwloc to version 1.9
• MVAPICH2-X 2.0 GA supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models
  – Based on MVAPICH2 2.0 GA including MPI-3 features
  – Compliant with UPC 2.18.0 and OpenSHMEM v1.0f
HPC Advisory Council European Conference, June '14
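The MPI-T support listed above refers to the standard MPI 3.0 tools information interface, through which tools and applications can query the library's control and performance variables. A minimal sketch of querying how many control variables the library exposes (output wording is illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, num_cvars;

        MPI_Init(&argc, &argv);

        /* Initialize the MPI tools information interface (MPI-T, MPI 3.0) */
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

        /* Ask how many control variables this MPI library exposes */
        MPI_T_cvar_get_num(&num_cvars);
        printf("MPI library exposes %d control variables\n", num_cvars);

        MPI_T_finalize();
        MPI_Finalize();
        return 0;
    }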
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and Non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with Unified Runtime
14 HPC Advisory Council European Conference, June '14
HPC Advisory Council European Conference, June '14 15
One-way Latency: MPI over IB with MVAPICH2
[Charts: small- and large-message latency vs. message size for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; small-message latencies range from 0.99 us to 1.82 us]
DDR, QDR: 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch
FDR: 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR: 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR: 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
HPC Advisory Council European Conference, June '14 16
Bandwidth: MPI over IB with MVAPICH2
[Charts: uni-directional and bi-directional bandwidth vs. message size for the same seven adapter configurations; peak uni-directional bandwidths range from about 1,706 to 12,810 MBytes/sec and peak bi-directional bandwidths from about 3,341 to 24,727 MBytes/sec]
DDR, QDR: 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch
FDR: 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR: 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR: 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
MVAPICH2 Two-Sided Intra-Node Performance (Shared memory and Kernel-based Zero-copy Support (LiMIC and CMA))
[Charts: intra-node latency and bandwidth with the latest MVAPICH2 2.0 on Intel Ivy Bridge; intra-socket latency 0.18 us and inter-socket latency 0.45 us; intra-socket and inter-socket bandwidth with CMA, Shmem, and LiMIC designs, peaking at about 14,250 MB/s and 13,749 MB/s]
HPC Advisory Council European Conference, June '14
MPI-3 RMA Get/Put with Flush Performance
Latest MVAPICH2 2.0, Intel Sandy-bridge with Connect-IB (single-port)
[Charts: inter-node Get/Put latency (about 2.04 us for Get and 1.56 us for Put at small messages), intra-socket Get/Put latency (down to about 0.08 us), intra-socket Get/Put bandwidth, and inter-node Get/Put bandwidth (about 6,881/6,876 MB/s, with intra-socket peaks around 15,000 MB/s)]
Minimizing Memory Footprint with XRC and Hybrid Mode: Memory Usage and Performance on NAMD (1024 cores)
• Memory usage for 32K processes with 8 cores per node can be 54 MB/process (for connections)
• NAMD performance improves when there is frequent communication to many peers
• Both UD and RC/XRC have benefits
• Hybrid mode for the best of both
• Available since MVAPICH2 1.7 as an integrated interface
• Runtime parameters: RC is the default; UD – MV2_USE_ONLY_UD=1; Hybrid – MV2_HYBRID_ENABLE_THRESHOLD=1
M. Koop, J. Sridhar and D. K. Panda, "Scalable MPI Design over InfiniBand using eXtended Reliable Connection," Cluster '08
[Charts: connection memory (MB/process) vs. number of connections for MVAPICH2-RC and MVAPICH2-XRC; normalized NAMD time for the apoa1, er-gre, f1atpase, and jac datasets; UD/Hybrid/RC time vs. number of processes (128-1024), with improvements of 26%, 40%, 30%, and 38% annotated]
HPC Advisory Council European Conference, June '14 20
Dynamic Connected (DC) Transport in MVAPICH2
[Diagram: processes on four nodes communicating over an IB network]
• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
  – DC Target identified by "DCT Number"
  – Messages routed with (DCT Number, LID)
  – Requires the same "DC Key" to enable communication
• Initial study done in MVAPICH2
  – More details in Research Papers 06, Wednesday June 25, 12:00 PM to 12:30 PM
• DCT support available publicly in Mellanox OFED 2.2.01
[Charts: normalized NAMD (apoa1, large data set) execution time for RC, DC-Pool, UD, and XRC at 160-620 processes; connection memory (KB) for Alltoall at 80-640 processes, where RC memory grows from about 10 KB to 97 KB while DC-Pool stays nearly constant]
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and Non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with Unified Runtime
21 HPC Advisory Council European Conference, June '14
HPC Advisory Council European Conference, June '14 22
Hardware Multicast-aware MPI_Bcast on Stampede
ConnectX-3-FDR (54 Gbps): 2.7 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch
[Charts: Default vs. Multicast MPI_Bcast latency for small (2-512 byte) and large (2 KB-128 KB) messages at 102,400 cores, and scalability of 16-byte and 32 KByte broadcasts with increasing node counts]
Application benefits with Non-Blocking Collectives based on CX-3 Collective Offload
[Charts: application run-time vs. data size and run-time vs. number of processes for PCG-Default and Modified-PCG-Offload; normalized HPL performance for HPL-Offload, HPL-1ring, and HPL-Host vs. HPL problem size (N) as % of total memory]
• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
  K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
  K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
• Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than the default version
  K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
• Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12
HPC Advisory Council European Conference, June '14 24
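The offloaded collectives above are driven through the standard MPI-3 non-blocking interface, so an application can overlap the collective with independent computation and only wait when the result is needed. A minimal sketch with MPI_Iallreduce (buffer sizes and the compute step are illustrative):

    #include <mpi.h>

    #define N 1024

    /* placeholder for work that does not depend on the allreduce result */
    static void independent_compute(double *buf, int n)
    {
        for (int i = 0; i < n; i++)
            buf[i] *= 2.0;
    }

    int main(int argc, char **argv)
    {
        double sendbuf[N], recvbuf[N], work[N];
        MPI_Request req;

        MPI_Init(&argc, &argv);

        for (int i = 0; i < N; i++) {
            sendbuf[i] = 1.0;
            work[i]    = 1.0;
        }

        /* Start the non-blocking allreduce (MPI-3); a collective-offload
           capable HCA can progress it in hardware. */
        MPI_Iallreduce(sendbuf, recvbuf, N, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* Overlap: do work that does not need recvbuf yet */
        independent_compute(work, N);

        /* Complete the collective before using recvbuf */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }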
Network-Topology-Aware Placement of Processes
• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?
[Charts: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes; default vs. topology-aware placement for a 2048-core run, showing a 15% improvement]
H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12. Best Paper and Best Student Paper Finalist
• Reduce network topology discovery time from O(N_hosts^2) to O(N_hosts)
• 15% improvement in MILC execution time @ 2048 cores
• 15% improvement in Hypre execution time @ 1024 cores
25
Power and Energy Savings with Power-Aware Collectives
[Charts: performance and power comparison for MPI_Alltoall with 64 processes on 8 nodes; CPMD application performance and power savings (5% and 32% annotated)]
K. Kandalla, E. P. Mancini, S. Sur and D. K. Panda, "Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters", ICPP '10
HPC Advisory Council European Conference, June '14
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and Non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with Unified Runtime
26 HPC Advisory Council European Conference, June '14
MVAPICH2-GPU: CUDA-Aware MPI
• Before CUDA 4: additional copies
  – Low performance and low productivity
• After CUDA 4: host-based pipeline
  – Unified Virtual Addressing
  – Pipeline CUDA copies with IB transfers
  – High performance and high productivity
• After CUDA 5.5: GPUDirect-RDMA support
  – GPU-to-GPU direct transfer
  – Bypass the host memory
  – Hybrid design to avoid PCIe bottlenecks
[Diagram: GPU, GPU memory, CPU, chipset, and InfiniBand adapter data paths]
MVAPICH2 1.8, 1.9, 2.0 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
28 HPC Advisory Council European Conference, June '14
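With the CUDA-aware support listed above, MPI calls accept GPU device pointers directly and the library chooses the staging or GPUDirect path internally. A minimal sketch of a two-rank exchange from device memory (message size and tag are illustrative; run with at least two processes):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int count = 1 << 20;      /* number of floats, illustrative */
        int rank;
        float *d_buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Allocate the message buffer directly in GPU device memory */
        cudaMalloc((void **)&d_buf, count * sizeof(float));

        if (rank == 0) {
            /* Device pointer passed straight to MPI_Send: no explicit
               cudaMemcpy to a host staging buffer is needed */
            MPI_Send(d_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

In MVAPICH2 the GPU path is typically enabled at run time through a library parameter (for example MV2_USE_CUDA=1); consult the MVAPICH2 user guide for the exact settings on a given system.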
GPUDirect RDMA (GDR) with CUDA
• Hybrid design using GPUDirect RDMA
  – GPUDirect RDMA and host-based pipelining
  – Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
• Support for communication using multi-rail
• Support for Mellanox Connect-IB and ConnectX VPI adapters
• Support for RoCE with Mellanox ConnectX VPI adapters
[Diagram: IB adapter, system memory, GPU memory, GPU, CPU, and chipset; P2P write/read bandwidths of 5.2 GB/s / <1.0 GB/s on SNB E5-2670 and 6.4 GB/s / 3.5 GB/s on IVB E5-2680 v2]
S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs, Int'l Conference on Parallel Processing (ICPP '13)
30
Performance of MVAPICH2 with GPUDirect-RDMA: Latency (GPU-GPU Internode MPI Latency)
[Charts: small- and large-message latency for 1-Rail, 2-Rail, 1-Rail-GDR, and 2-Rail-GDR; GDR improves small-message latency by up to 67% (5.49 usec annotated) and large-message latency by about 10%]
Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA plug-in
HPC Advisory Council European Conference, June '14
31
Performance of MVAPICH2 with GPUDirect-RDMA: Bandwidth (GPU-GPU Internode MPI Uni-Directional Bandwidth)
[Charts: small- and large-message bandwidth for 1-Rail, 2-Rail, 1-Rail-GDR, and 2-Rail-GDR; up to 5x small-message bandwidth improvement and a peak of about 9.8 GB/s]
Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA plug-in
32
Performance of MVAPICH2 with GPUDirect-RDMA: Bi-Bandwidth (GPU-GPU Internode MPI Bi-directional Bandwidth)
[Charts: small- and large-message bi-directional bandwidth for 1-Rail, 2-Rail, 1-Rail-GDR, and 2-Rail-GDR; up to 4.3x small-message improvement, about 19% large-message improvement, and a peak of about 19 GB/s]
Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA plug-in
Using the MVAPICH2-GPUDirect Version
• MVAPICH2-2.0b with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements:
  • Mellanox OFED 2.1 or later
  • NVIDIA Driver 331.20 or later
  • NVIDIA CUDA Toolkit 5.5 or later
  • Plugin for GPUDirect RDMA (http://www.mellanox.com/page/products_dyn?product_family=116)
• Has optimized designs for point-to-point communication using GDR
• Contact the MVAPICH help list with any questions related to the package: mvapich-[email protected]
HPC Advisory Council European Conference, June '14
New Design of MVAPICH2 with GPUDirect-RDMA (GPU-GPU Internode MPI Latency)
[Chart: small-message latency for MV2-GDR 2.0b, MV2-GDR-New (Loopback), and MV2-GDR-New (Fast Copy); latency drops from 7.3 usec to 5.0 usec (Loopback) and 3.5 usec (Fast Copy), a 52% improvement]
Intel Ivy Bridge (E5-2630 v2) node with 12 cores, NVIDIA Tesla K20c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 6.0, Mellanox OFED 2.1 with GPUDirect-RDMA plug-in
Application-Level Evaluation (HOOMD-blue)
[Charts: average TPS vs. number of GPU nodes (4-64) for MV2-2.0b-GDR, MV2-NewGDR-Loopback, and MV2-NewGDR-Fastcopy; HOOMD-blue strong and weak scaling, with 48%, 47%, 56%, and 53% improvements annotated]
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• Strong scaling (fixed 64K particles): Loopback and Fastcopy get up to 45% and 48% improvement for 32 GPUs
• Weak scaling (fixed 2K particles / GPU): Loopback and Fastcopy get up to 54% and 56% improvement for 16 GPUs
HPC Advisory Council European Conference, June '14
MPI-3 RMA Support with GPUDirect RDMA
Intel Ivy Bridge (E5-2630 v2) node with 12 cores, NVIDIA Tesla K20c GPU, Mellanox Connect-IB HCA, CUDA 6.0, Mellanox OFED 2.1 with GPUDirect-RDMA plug-in
MPI-3 RMA provides flexible synchronization and completion primitives
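A minimal sketch of the Put+Flush pattern measured here, using the standard MPI-3 RMA interface (window size and target rank are illustrative; run with at least two processes; with the GDR designs the window could equally be backed by GPU memory):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        const int count = 1024;         /* elements in the window, illustrative */
        int rank;
        double *base, value = 42.0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Expose a window of memory for one-sided access */
        MPI_Win_allocate(count * sizeof(double), sizeof(double),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

        if (rank == 0) {
            /* Passive-target epoch on rank 1 */
            MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);

            /* One-sided put of a single double into rank 1's window */
            MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);

            /* Flush guarantees the put has completed at the target
               before rank 0 continues (the pattern timed on this slide) */
            MPI_Win_flush(1, win);

            MPI_Win_unlock(1, win);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }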
[Charts: small-message latency and message rate for MV2-GDR 2.0b, MV2-GDR-New (Send/Recv), and MV2-GDR-New (Put+Flush); Put+Flush reduces small-message latency by about 39%, down to 2.8 usec]
New release with the advanced designs coming soon
HPC Advisory Council European Conference, June '14
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and Non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with Unified Runtime
37 HPC Advisory Council European Conference, June '14
MPI Applications on MIC Clusters
[Diagram: execution modes on Xeon + Xeon Phi nodes, ranging from multi-core centric to many-core centric: Host-only, Offload (/reverse offload), Symmetric, and Coprocessor-only]
• Flexibility in launching MPI jobs on clusters with Xeon Phi
38 HPC Advisory Council European Conference, June '14
Data Movement on Intel Xeon Phi Clusters
[Diagram: two nodes, each with CPUs and MICs connected via QPI, PCIe, and IB, illustrating the communication paths:]
1. Intra-Socket
2. Inter-Socket
3. Inter-Node
4. Intra-MIC
5. Intra-Socket MIC-MIC
6. Inter-Socket MIC-MIC
7. Inter-Node MIC-MIC
8. Intra-Socket MIC-Host
9. Inter-Socket MIC-Host
10. Inter-Node MIC-Host
11. Inter-Node MIC-MIC with IB adapter on remote socket
and more . . .
• Critical for runtimes to optimize data movement, hiding the complexity
• Connected as PCIe devices – flexibility but complexity
HPC Advisory Council European Conference, June '14
40
MVAPICH2-MIC Design for Clusters with IB and MIC
• Offload Mode
• Intranode Communication
  • Coprocessor-only Mode
  • Symmetric Mode
• Internode Communication
  • Coprocessor-only Mode
  • Symmetric Mode
• Multi-MIC Node Configurations
HPC Advisory Council European Conference, June '14
Direct-IB Communication
[Diagram: IB FDR HCA (MT4099), CPU (SandyBridge E5-2670 / IvyBridge E5-2680 v2), and Xeon Phi connected over PCIe and QPI; IB write to the Xeon Phi (P2P write) and IB read from the Xeon Phi (P2P read) for intra-socket and inter-socket placements, with measured bandwidths of 5280 MB/s (83%), 6396 MB/s (100%), 962 MB/s (15%), 3421 MB/s (54%), 1075 MB/s (17%), 1179 MB/s (19%), 370 MB/s (6%), and 247 MB/s (4%) against a peak IB FDR bandwidth of 6397 MB/s]
• Data movement between the IB HCA and the Xeon Phi is implemented as peer-to-peer transfers
• IB-verbs based communication from the Xeon Phi
• No porting effort for MPI libraries with IB support
HPC Advisory Council European Conference, June '14
MIC-Remote-MIC P2P Communication
[Charts: intra-socket and inter-socket P2P latency (large messages) and bandwidth vs. message size; peak bandwidths of about 5236 MB/s and 5594 MB/s]
InterNode – Symmetric Mode – P3DFFT Performance
• Note: application performance in symmetric mode is closely tied to the load balancing of computation between the host and the Xeon Phi
• Try different decompositions / threading levels to identify the best settings for the target application
[Charts: 3DStencil communication kernel (communication time per step, 8 nodes, host+MIC processes with 8 threads/proc) and P3DFFT 512x512x4K (time per step, 32 nodes, processes on host+MIC) for MV2-INTRA-OPT vs. MV2-MIC, with 50.2% and 16.2% improvements]
S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni and D. K. Panda, MVAPICH-PRISM: A Proxy-based Communication Framework using InfiniBand and SCIF for Intel MIC Clusters, Int'l Conference on Supercomputing (SC '13), November 2013.
44
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014
[Charts: 32-node Allgather small-message latency (16H + 16M), 32-node Allgather large-message latency (8H + 8M), and 32-node Alltoall large-message latency (8H + 8M) for MV2-MIC vs. MV2-MIC-Opt; P3DFFT execution time (communication and computation) on 32 nodes (8H + 8M), size 2K*2K*1K, with improvements of 76%, 58%, and 55% annotated]
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and Non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with Unified Runtime
45 HPC Advisory Council European Conference, June '14
• Component failures are common in large-scale clusters
• Imposes need on reliability and fault tolerance
• Multiple challenges:
  – Checkpoint-Restart vs. Process Migration
  – Benefits of Scalable Checkpoint-Restart (SCR) support
    • Application guided
    • Application transparent
HPC Advisory Council European Conference, June '14 46
Fault Tolerance
HPC Advisory Council European Conference, June '14 47
Checkpoint-Restart vs. Process Migration: Low Overhead Failure Prediction with IPMI
• Job-wide Checkpoint/Restart is not scalable
• A Job-pause and Process Migration framework can deliver pro-active fault-tolerance
• Also allows for cluster-wide load-balancing by means of job compaction
[Charts: time to migrate 8 MPI processes (10.7X, 2.3X, and 4.3X improvements annotated) and LU Class C benchmark (64 processes) execution time broken into job stall, checkpoint (migration), resume, and restart for RDMA-based migration vs. migration without RDMA and CR to ext3 and PVFS, with 2.03x and 4.49x speedups]
X. Ouyang, R. Rajachandrasekar, X. Besseron, D. K. Panda, High Performance Pipelined Process Migration with RDMA, CCGrid 2011
X. Ouyang, S. Marcarelli, R. Rajachandrasekar and D. K. Panda, RDMA-Based Job Migration Framework for MPI over InfiniBand, Cluster 2010
Multi-Level Checkpointing with Scalable CR (SCR)
[Diagram: storage levels ordered by checkpoint cost and resiliency, from low to high]
• LLNL's Scalable Checkpoint/Restart library
• Can be used for application-guided and application-transparent checkpointing
• Effective utilization of the storage hierarchy
  – Local: store checkpoint data on the node's local storage, e.g. local disk, ramdisk
  – Partner: write to local storage and on a partner node
  – XOR: write file to local storage and small sets of nodes collectively compute and store parity redundancy data (RAID-5)
  – Stable Storage: write to the parallel file system
Application-guided Multi-Level Checkpointing
49 HPC Advisory Council European Conference, June '14
[Chart: checkpoint writing time (s) for PFS vs. MVAPICH2+SCR with Local, Partner, and XOR levels]
• Checkpoint writing phase times of a representative SCR-enabled MPI application
• 512 MPI processes (8 procs/node)
• Approx. 51 GB checkpoints
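For the application-guided mode, SCR exposes a small C API around the checkpoint-writing phase timed above. A minimal sketch under the assumption of the standard SCR calls (the checkpoint file name and its contents are illustrative):

    #include <stdio.h>
    #include <mpi.h>
    #include "scr.h"

    int main(int argc, char **argv)
    {
        int rank;
        char scr_path[SCR_MAX_FILENAME];
        char name[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        SCR_Init();                      /* initialize SCR after MPI_Init */

        /* Start a checkpoint phase; SCR picks the level (local/partner/XOR/PFS) */
        SCR_Start_checkpoint();

        /* Ask SCR where this rank should write its checkpoint file */
        snprintf(name, sizeof(name), "ckpt_rank_%d.dat", rank);
        SCR_Route_file(name, scr_path);

        FILE *fp = fopen(scr_path, "w");
        int valid = (fp != NULL);
        if (fp) {
            fprintf(fp, "application state for rank %d\n", rank);
            fclose(fp);
        }

        /* Mark the checkpoint complete (valid = 1 on success) */
        SCR_Complete_checkpoint(valid);

        SCR_Finalize();
        MPI_Finalize();
        return 0;
    }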
Transparent Multi-Level Checkpointing
50 HPC Advisory Council European Conference, June '14
• ENZO Cosmology application – Radiation Transport workload
• Using MVAPICH2's CR protocol instead of the application's in-built CR mechanism
• 512 MPI processes running on the Stampede supercomputer at TACC
• Approx. 12.8 GB checkpoints: ~2.5X reduction in checkpoint I/O costs
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and Non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with Unified Runtime
51 HPC Advisory Council European Conference, June '14
MVAPICH2-X for Hybrid MPI + PGAS Applications
HPC Advisory Council European Conference, June '14 52
[Diagram: MPI, OpenSHMEM, UPC, and hybrid (MPI + PGAS) applications issuing MPI, OpenSHMEM, and UPC calls over the unified MVAPICH2-X runtime on InfiniBand, RoCE, and iWARP]
• Unified communication runtime for MPI, UPC, OpenSHMEM available with MVAPICH2-X 1.9 onwards! – http://mvapich.cse.ohio-state.edu
• Feature highlights
  – Supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant
  – Scalable inter-node and intra-node communication – point-to-point and collectives
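With the unified runtime, MPI and OpenSHMEM calls can be mixed in one program. A minimal hybrid sketch using the OpenSHMEM 1.0 interface (the symmetric variable, the put of the rank id, and the final reduction are illustrative):

    #include <stdio.h>
    #include <mpi.h>
    #include <shmem.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        long local = 0, global = 0;

        /* One unified runtime backs both models in MVAPICH2-X */
        MPI_Init(&argc, &argv);
        start_pes(0);                    /* OpenSHMEM 1.0 initialization */

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Symmetric heap allocation: same address on every PE,
           usable as a target for one-sided puts */
        long *sym = (long *)shmalloc(sizeof(long));
        *sym = 0;
        shmem_barrier_all();

        /* One-sided OpenSHMEM put of this rank's id to the next PE */
        long me = rank;
        shmem_long_put(sym, &me, 1, (rank + 1) % nprocs);
        shmem_barrier_all();

        /* Two-sided/collective MPI on the value delivered via OpenSHMEM */
        local = *sym;
        MPI_Allreduce(&local, &global, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of received ids = %ld\n", global);

        shfree(sym);
        MPI_Finalize();
        return 0;
    }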
OpenSHMEM Collective Communication Performance
[Charts: Reduce (1,024 processes), Broadcast (1,024 processes), Collect (1,024 processes), and Barrier times vs. message size / number of processes for MV2X-SHMEM, Scalable-SHMEM, and OMPI-SHMEM; MV2X-SHMEM is up to 180X, 18X, 12X, and 3X faster]
HPC Advisory Council European Conference, June '14
OpenSHMEM Application Evaluation
[Charts: Heat Image and Daxpy execution time (s) at 256 and 512 processes for UH-SHMEM, OMPI-SHMEM (FCA), Scalable-SHMEM (FCA), and MV2X-SHMEM]
• Improved performance for OMPI-SHMEM and Scalable-SHMEM with FCA
• Execution time for 2DHeat Image at 512 processes (sec): UH-SHMEM – 523, OMPI-SHMEM – 214, Scalable-SHMEM – 193, MV2X-SHMEM – 169
• Execution time for DAXPY at 512 processes (sec): UH-SHMEM – 57, OMPI-SHMEM – 56, Scalable-SHMEM – 9.2, MV2X-SHMEM – 5.2
J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and D. K. Panda, A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters, OpenSHMEM Workshop (OpenSHMEM '14), March 2014
HPC Advisory Council European Conference, June '14
HPC Advisory Council European Conference, June '14 55
Hybrid MPI+OpenSHMEM Graph500 Design: Execution Time
[Charts: billions of traversed edges per second (TEPS) for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) designs – strong scalability vs. scale (26-29) and weak scalability vs. number of processes (1024-8192); execution time (s) vs. number of processes (4K, 8K, 16K)]
• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
  • 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  • 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
56
Hybrid MPI+OpenSHMEM Sort Application: Execution Time and Strong Scalability
• Performance of the Hybrid (MPI+OpenSHMEM) Sort application
• Execution time: for a 4 TB input at 4,096 cores, MPI – 2408 seconds, Hybrid – 1172 seconds; 51% improvement over the MPI-based design
• Strong scalability (constant input size of 500 GB): at 4,096 cores, MPI – 0.16 TB/min, Hybrid – 0.36 TB/min; 55% improvement over the MPI-based design
[Charts: sort rate (TB/min) vs. number of processes (512-4,096) and execution time (seconds) for input data / process counts from 500GB-512 to 4TB-4K, MPI vs. Hybrid]
HPC Advisory Council European Conference, June '14
MVAPICH2/MVAPICH2-X – Plans for Exascale
• Performance and memory scalability toward 900K-1M cores
  – Enhanced design with Dynamically Connected Transport (DCT) service and Connect-IB
• Enhanced optimization for GPGPU and coprocessor support
  – Extending the GPGPU support (GPU-Direct RDMA) with CUDA 6.5 and beyond
  – Support for Intel MIC (Knights Landing)
• Taking advantage of the Collective Offload framework
  – Including support for non-blocking collectives (MPI 3.0)
• RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Support for the MPI Tools Interface (as in MPI 3.0)
• Checkpoint-Restart and migration support with in-memory checkpointing
• Hybrid MPI+PGAS programming support with GPGPUs and Accelerators
57 HPC Advisory Council European Conference, June '14
Concluding Remarks
• InfiniBand with the RDMA feature is gaining momentum in HPC systems, with best performance and greater usage
• As the HPC community moves to Exascale, new solutions are needed in the MPI and hybrid MPI+PGAS stacks for supporting GPUs and Accelerators
• Demonstrated how such solutions can be designed with MVAPICH2/MVAPICH2-X and their performance benefits
• Such designs will allow application scientists and engineers to take advantage of upcoming exascale systems
HPC Advisory Council European Conference, June '14
HPC Advisory Council European Conference, June '14
Funding Acknowledgments
Funding Support by
Equipment Support by
59
HPC Advisory Council European Conference, June '14
Personnel Acknowledgments Current Students
– S. Chakraborthy (Ph.D.)
– N. Islam (Ph.D.)
– J. Jose (Ph.D.)
– M. Li (Ph.D.)
– R. Rajachandrasekhar (Ph.D.)
Past Students – P. Balaji (Ph.D.)
– D. BunRnas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– G. Santhanaraman (Ph.D.) – A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
60
Past Research Scientist – S. Sur
Current Post-‐Docs – X. Lu
– K. Hamidouche
Current Programmers – M. Arnold
– J. Perkins
Past Post-‐Docs – H. Wang – X. Besseron – H.-‐W. Jin
– M. Luo
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– M. Rahman (Ph.D.) – D. Shankar (Ph.D.)
– R. Shir (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– E. Mancini – S. Marcarelli
– J. Vienne
Current Senior Research Associate – H. Subramoni
Past Programmers – D. Bureddy
• August 25-27, 2014; Columbus, Ohio, USA
• Keynote Talks, Invited Talks, Contributed Presentations
• Half-day tutorial on MVAPICH2, MVAPICH2-X and MVAPICH2-GDR optimization and tuning
• Speakers (Confirmed so far): – Keynote: Dan Stanzione (TACC)
– Keynote: Darren Kerbyson (PNNL)
– Sadaf Alam (CSCS, Switzerland)
– Onur Celebioglu (Dell)
– Jens Glaser (Univ. of Michigan)
– Jeff Hammond (Intel)
– Adam Moody (LLNL)
– David Race (Cray)
– Davide Rossetti (NVIDIA)
– Gilad Shainer (Mellanox)
– Sayantan Sur (Intel)
– Mahidhar Tatineni (SDSC)
– Jerome Vienne (TACC)
• Call for Contributed Presentations (Deadline: July 1, 2014)
• More details at: http://mug.mvapich.cse.ohio-state.edu 61
Upcoming 2nd Annual MVAPICH User Group (MUG) Meeting
HPC Advisory Council European Conference, June '14
• Looking for bright and enthusiastic personnel to join as
  – Visiting Scientists
  – Post-Doctoral Researchers
  – PhD Students
  – MPI Programmer/Software Engineer
  – Hadoop/Big Data Programmer/Software Engineer
• If interested, please contact me at this conference and/or send an e-mail to [email protected]
62
Multiple Positions Available in My Group
HPC Advisory Council European Conference, June '14
HPC Advisory Council European Conference, June '14
Pointers
http://nowlab.cse.ohio-state.edu
[email protected]
http://www.cse.ohio-state.edu/~panda
63
http://mvapich.cse.ohio-state.edu
http://hadoop-rdma.cse.ohio-state.edu