In this video from the 2014 HPC Advisory Council Europe Conference, DK Panda from Ohio State University presents: MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans. This talk focuses on the latest developments and future plans for the MVAPICH2 and MVAPICH2-X projects. For the MVAPICH2 project, we focus on scalable and highly-optimized designs for point-to-point communication (two-sided and one-sided MPI-3 RMA), collective communication (blocking and MPI-3 non-blocking), support for GPGPUs and Intel MIC, support for the MPI-T interface, and schemes for fault-tolerance/fault-resilience. For the MVAPICH2-X project, we focus on efficient support for the hybrid MPI and PGAS (UPC and OpenSHMEM) programming model with a unified runtime. Watch the video presentation: http://wp.me/p3RLHQ-coF
MVAPICH2 and MVAPICH2-X Projects: Latest Developments and Future Plans
by
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at the HPC Advisory Council European Conference (2014)
High-End Computing (HEC): PetaFlop to ExaFlop
HPC Advisory Council European Conference, June '14 2
Expected to have an ExaFlop system in 2020-2022!
100 PFlops in 2015
1 EFlops in 2018?
HPC Advisory Council European Conference, June '14 3
Trends for Commodity Computing Clusters in the Top500 List (http://www.top500.org)
[Chart: number of clusters and percentage of clusters in the Top500 over time]
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• InfiniBand very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing
[Figure labels: Multi-core Processors; High Performance Interconnects – InfiniBand (<1 usec latency, >100 Gbps bandwidth); Accelerators / Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); example systems: Tianhe-2 (1), Titan (2), Stampede (6), Tianhe-1A (10)]
4
HPC Advisory Council European Conference, June '14
Large-scale InfiniBand Installations
• 207 IB clusters (41%) in the November 2013 Top500 list (http://www.top500.org)
• Installations in the Top 40 (19 systems):
519,640 cores (Stampede) at TACC (7th)
147,456 cores (SuperMUC) in Germany (10th)
74,358 cores (Tsubame 2.5) at Japan/GSIC (11th)
194,616 cores (Cascade) at PNNL (13th)
110,400 cores (Pangea) at France/Total (14th)
96,192 cores (Pleiades) at NASA/Ames (16th)
73,584 cores (Spirit) at USA/Air Force (18th)
77,184 cores (Curie thin nodes) at France/CEA (20th)
120,640 cores (Nebulae) at China/NSCS (21st)
72,288 cores (Yellowstone) at NCAR (22nd)
70,560 cores (Helios) at Japan/IFERC (24th)
138,368 cores (Tera-100) at France/CEA (29th)
60,000 cores, iDataPlex DX360M4 at Germany/Max-Planck (31st)
53,504 cores (PRIMERGY) at Australia/NCI (32nd)
77,520 cores (Conte) at Purdue University (33rd)
48,896 cores (MareNostrum) at Spain/BSC (34th)
222,072 cores (PRIMERGY) at Japan/Kyushu (36th)
78,660 cores (Lomonosov) in Russia (37th)
137,200 cores (Sunway Blue Light) in China (40th)
and many more!
5
HPC Advisory Council European Conference, June '14 6
Parallel Programming Models Overview
[Diagram: three abstract machine models]
• Shared Memory Model – SHMEM, DSM
• Distributed Memory Model – MPI (Message Passing Interface)
• Partitioned Global Address Space (PGAS) – logical shared memory: Global Arrays, UPC, Chapel, X10, CAF, ...
• Programming models provide abstract machine models
• Models can be mapped on different types of systems – Distributed Shared Memory (DSM), MPI within a node, etc.
– Logical shared memory across nodes (PGAS)
7
Partitioned Global Address Space (PGAS) Models
HPC Advisory Council European Conference, June '14
• Key features
  – Simple shared memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages
    • Unified Parallel C (UPC)
    • Co-Array Fortran (CAF)
    • X10
    • Chapel
  – Libraries
    • OpenSHMEM
    • Global Arrays
• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) Model
  – MPI across address spaces
  – PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared memory programming models
• Can co-exist with OpenMP for offloading computation
• Applications can have kernels with different communication patterns
• Can benefit from different models
• Re-writing complete applications can be a huge effort
• Port critical kernels to the desired model instead
HPC Advisory Council European Conference, June '14 8
MPI+PGAS for Exascale Architectures and Applications
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:
  – Best of the Distributed Computing Model
  – Best of the Shared Memory Computing Model
• Exascale Roadmap*: "Hybrid Programming is a practical way to program exascale systems"
* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420
[Diagram: an HPC application composed of Kernels 1..N written in MPI, with selected kernels (e.g. Kernel 2, Kernel N) re-written in PGAS]
HPC Advisory Council European Conference, June '14 9
Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges
[Layer diagram – co-design opportunities and challenges across various layers, targeting performance, scalability, and fault-resilience:]
• Application Kernels/Applications
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
• Middleware: Communication Library or Runtime for Programming Models
  – Point-to-point Communication (two-sided & one-sided)
  – Collective Communication
  – Synchronization & Locks
  – I/O & File Systems
  – Fault Tolerance
• Networking Technologies (InfiniBand, 40/100GigE, Aries, BlueGene)
• Multi/Many-core Architectures
• Accelerators (NVIDIA and MIC)
Challenges in Designing (MPI+X) at Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
• Balancing intra-node and inter-node communication for next-generation multi-core nodes (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Support for GPGPUs and Accelerators
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
  – Power-aware
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, ...)
11 HPC Advisory Council European Conference, June '14
MVAPICH2/MVAPICH2-X Software
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Support for GPGPUs and MIC
  – Used by more than 2,150 organizations in 72 countries
  – More than 214,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 7th ranked 519,640-core cluster (Stampede) at TACC
    • 11th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
    • 16th ranked 96,192-core cluster (Pleiades) at NASA
    • 75th ranked 16,896-core cluster (Keeneland) at GaTech and many others . . .
  – Available with software stacks of many IB, HSE, and server vendors including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede system
12 HPC Advisory Council European Conference, June '14
MVAPICH2 2.0 GA and MVAPICH2-X 2.0 GA
• Released on 06/20/14
• Major features and enhancements
  – Based on MPICH-3.1
  – MPI-3 RMA support
  – Support for non-blocking collectives
  – MPI-T support
  – CMA support is default for intra-node communication
  – Optimization of collectives with CMA support
  – Large message transfer support
  – Reduced memory footprint
  – Improved job start-up time
  – Optimization of collectives and tuning for multiple platforms
  – Updated hwloc to version 1.9
• MVAPICH2-X 2.0 GA supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models
  – Based on MVAPICH2 2.0 GA including MPI-3 features
  – Compliant with UPC 2.18.0 and OpenSHMEM v1.0f
HPC Advisory Council European Conference, June '14
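The MPI-T support listed above refers to the standard MPI 3.0 tools information interface, through which tools and applications can query the library's control and performance variables. A minimal sketch of querying how many control variables the library exposes (output wording is illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, num_cvars;

        MPI_Init(&argc, &argv);

        /* Initialize the MPI tools information interface (MPI-T, MPI 3.0) */
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

        /* Ask how many control variables this MPI library exposes */
        MPI_T_cvar_get_num(&num_cvars);
        printf("MPI library exposes %d control variables\n", num_cvars);

        MPI_T_finalize();
        MPI_Finalize();
        return 0;
    }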
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and Non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with Unified Runtime
14 HPC Advisory Council European Conference, June '14
HPC Advisory Council European Conference, June '14 15
One-way Latency: MPI over IB with MVAPICH2
[Charts: small- and large-message latency vs. message size for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; small-message latencies range from 0.99 us to 1.82 us]
DDR, QDR: 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch
FDR: 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR: 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR: 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
HPC Advisory Council European Conference, June '14 16
Bandwidth: MPI over IB with MVAPICH2
[Charts: uni-directional and bi-directional bandwidth vs. message size for the same seven adapter configurations; peak uni-directional bandwidths range from about 1,706 to 12,810 MBytes/sec and peak bi-directional bandwidths from about 3,341 to 24,727 MBytes/sec]
DDR, QDR: 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch
FDR: 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR: 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR: 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
MVAPICH2 Two-Sided Intra-Node Performance (Shared memory and Kernel-based Zero-copy Support (LiMIC and CMA))
[Charts: intra-node latency and bandwidth with the latest MVAPICH2 2.0 on Intel Ivy Bridge; intra-socket latency 0.18 us and inter-socket latency 0.45 us; intra-socket and inter-socket bandwidth with CMA, Shmem, and LiMIC designs, peaking at about 14,250 MB/s and 13,749 MB/s]
HPC Advisory Council European Conference, June '14
MPI-3 RMA Get/Put with Flush Performance
Latest MVAPICH2 2.0, Intel Sandy-bridge with Connect-IB (single-port)
[Charts: inter-node Get/Put latency (about 2.04 us for Get and 1.56 us for Put at small messages), intra-socket Get/Put latency (down to about 0.08 us), intra-socket Get/Put bandwidth, and inter-node Get/Put bandwidth (about 6,881/6,876 MB/s, with intra-socket peaks around 15,000 MB/s)]
Minimizing Memory Footprint with XRC and Hybrid Mode: Memory Usage and Performance on NAMD (1024 cores)
• Memory usage for 32K processes with 8 cores per node can be 54 MB/process (for connections)
• NAMD performance improves when there is frequent communication to many peers
• Both UD and RC/XRC have benefits
• Hybrid mode for the best of both
• Available since MVAPICH2 1.7 as an integrated interface
• Runtime parameters: RC is the default; UD – MV2_USE_ONLY_UD=1; Hybrid – MV2_HYBRID_ENABLE_THRESHOLD=1
M. Koop, J. Sridhar and D. K. Panda, "Scalable MPI Design over InfiniBand using eXtended Reliable Connection," Cluster '08
[Charts: connection memory (MB/process) vs. number of connections for MVAPICH2-RC and MVAPICH2-XRC; normalized NAMD time for the apoa1, er-gre, f1atpase, and jac datasets; UD/Hybrid/RC time vs. number of processes (128-1024), with improvements of 26%, 40%, 30%, and 38% annotated]
HPC Advisory Council European Conference, June '14 20
Dynamic Connected (DC) Transport in MVAPICH2
[Diagram: processes on four nodes communicating over an IB network]
• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
  – DC Target identified by "DCT Number"
  – Messages routed with (DCT Number, LID)
  – Requires the same "DC Key" to enable communication
• Initial study done in MVAPICH2
  – More details in Research Papers 06, Wednesday June 25, 12:00 PM to 12:30 PM
• DCT support available publicly in Mellanox OFED 2.2.01
[Charts: normalized NAMD (apoa1, large data set) execution time for RC, DC-Pool, UD, and XRC at 160-620 processes; connection memory (KB) for Alltoall at 80-640 processes, where RC memory grows from about 10 KB to 97 KB while DC-Pool stays nearly constant]
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and Non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with Unified Runtime
21 HPC Advisory Council European Conference, June '14
HPC Advisory Council European Conference, June '14 22
Hardware Multicast-aware MPI_Bcast on Stampede
ConnectX-3-FDR (54 Gbps): 2.7 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch
[Charts: Default vs. Multicast MPI_Bcast latency for small (2-512 byte) and large (2 KB-128 KB) messages at 102,400 cores, and scalability of 16-byte and 32 KByte broadcasts with increasing node counts]
Application benefits with Non-Blocking Collectives based on CX-3 Collective Offload
[Charts: application run-time vs. data size and run-time vs. number of processes for PCG-Default and Modified-PCG-Offload; normalized HPL performance for HPL-Offload, HPL-1ring, and HPL-Host vs. HPL problem size (N) as % of total memory]
• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
  K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
  K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
• Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than the default version
  K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
• Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms? K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, IWPAPS '12
HPC Advisory Council European Conference, June '14 24
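The offloaded collectives above are driven through the standard MPI-3 non-blocking interface, so an application can overlap the collective with independent computation and only wait when the result is needed. A minimal sketch with MPI_Iallreduce (buffer sizes and the compute step are illustrative):

    #include <mpi.h>

    #define N 1024

    /* placeholder for work that does not depend on the allreduce result */
    static void independent_compute(double *buf, int n)
    {
        for (int i = 0; i < n; i++)
            buf[i] *= 2.0;
    }

    int main(int argc, char **argv)
    {
        double sendbuf[N], recvbuf[N], work[N];
        MPI_Request req;

        MPI_Init(&argc, &argv);

        for (int i = 0; i < N; i++) {
            sendbuf[i] = 1.0;
            work[i]    = 1.0;
        }

        /* Start the non-blocking allreduce (MPI-3); a collective-offload
           capable HCA can progress it in hardware. */
        MPI_Iallreduce(sendbuf, recvbuf, N, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* Overlap: do work that does not need recvbuf yet */
        independent_compute(work, N);

        /* Complete the collective before using recvbuf */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }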
Network-Topology-Aware Placement of Processes
• Can we design a highly scalable network topology detection service for IB?
• How do we design the MPI communication library in a network-topology-aware manner to efficiently leverage the topology information generated by our service?
• What are the potential benefits of using a network-topology-aware MPI library on the performance of parallel scientific applications?
[Charts: overall performance and split-up of physical communication for MILC on Ranger; performance for varying system sizes; default vs. topology-aware placement for a 2048-core run, showing a 15% improvement]
H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda, Design of a Scalable InfiniBand Topology Service to Enable Network-Topology-Aware Placement of Processes, SC '12. Best Paper and Best Student Paper Finalist
• Reduce network topology discovery time from O(N_hosts^2) to O(N_hosts)
• 15% improvement in MILC execution time @ 2048 cores
• 15% improvement in Hypre execution time @ 1024 cores
25
Power and Energy Savings with Power-Aware Collectives
[Charts: performance and power comparison for MPI_Alltoall with 64 processes on 8 nodes; CPMD application performance and power savings (5% and 32% annotated)]
K. Kandalla, E. P. Mancini, S. Sur and D. K. Panda, "Designing Power Aware Collective Communication Algorithms for InfiniBand Clusters", ICPP '10
HPC Advisory Council European Conference, June '14
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and Non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with Unified Runtime
26 HPC Advisory Council European Conference, June '14
MVAPICH2-GPU: CUDA-Aware MPI
• Before CUDA 4: additional copies
  – Low performance and low productivity
• After CUDA 4: host-based pipeline
  – Unified Virtual Addressing
  – Pipeline CUDA copies with IB transfers
  – High performance and high productivity
• After CUDA 5.5: GPUDirect-RDMA support
  – GPU-to-GPU direct transfer
  – Bypass the host memory
  – Hybrid design to avoid PCIe bottlenecks
[Diagram: GPU, GPU memory, CPU, chipset, and InfiniBand adapter data paths]
MVAPICH2 1.8, 1.9, 2.0 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
28 HPC Advisory Council European Conference, June '14
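With the CUDA-aware support listed above, MPI calls accept GPU device pointers directly and the library chooses the staging or GPUDirect path internally. A minimal sketch of a two-rank exchange from device memory (message size and tag are illustrative; run with at least two processes):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        const int count = 1 << 20;      /* number of floats, illustrative */
        int rank;
        float *d_buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Allocate the message buffer directly in GPU device memory */
        cudaMalloc((void **)&d_buf, count * sizeof(float));

        if (rank == 0) {
            /* Device pointer passed straight to MPI_Send: no explicit
               cudaMemcpy to a host staging buffer is needed */
            MPI_Send(d_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

In MVAPICH2 the GPU path is typically enabled at run time through a library parameter (for example MV2_USE_CUDA=1); consult the MVAPICH2 user guide for the exact settings on a given system.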
GPUDirect RDMA (GDR) with CUDA
• Hybrid design using GPUDirect RDMA
  – GPUDirect RDMA and host-based pipelining
  – Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
• Support for communication using multi-rail
• Support for Mellanox Connect-IB and ConnectX VPI adapters
• Support for RoCE with Mellanox ConnectX VPI adapters
[Diagram: IB adapter, system memory, GPU memory, GPU, CPU, and chipset; P2P write/read bandwidths of 5.2 GB/s / <1.0 GB/s on SNB E5-2670 and 6.4 GB/s / 3.5 GB/s on IVB E5-2680 v2]
S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs, Int'l Conference on Parallel Processing (ICPP '13)
30
Performance of MVAPICH2 with GPUDirect-RDMA: Latency (GPU-GPU Internode MPI Latency)
[Charts: small- and large-message latency for 1-Rail, 2-Rail, 1-Rail-GDR, and 2-Rail-GDR; GDR improves small-message latency by up to 67% (5.49 usec annotated) and large-message latency by about 10%]
Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA plug-in
HPC Advisory Council European Conference, June '14
31
Performance of MVAPICH2 with GPUDirect-RDMA: Bandwidth (GPU-GPU Internode MPI Uni-Directional Bandwidth)
[Charts: small- and large-message bandwidth for 1-Rail, 2-Rail, 1-Rail-GDR, and 2-Rail-GDR; up to 5x small-message bandwidth improvement and a peak of about 9.8 GB/s]
Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA plug-in
32
Performance of MVAPICH2 with GPUDirect-RDMA: Bi-Bandwidth (GPU-GPU Internode MPI Bi-directional Bandwidth)
[Charts: small- and large-message bi-directional bandwidth for 1-Rail, 2-Rail, 1-Rail-GDR, and 2-Rail-GDR; up to 4.3x small-message improvement, about 19% large-message improvement, and a peak of about 19 GB/s]
Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA plug-in
Using the MVAPICH2-GPUDirect Version
• MVAPICH2-2.0b with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements:
  • Mellanox OFED 2.1 or later
  • NVIDIA Driver 331.20 or later
  • NVIDIA CUDA Toolkit 5.5 or later
  • Plugin for GPUDirect RDMA (http://www.mellanox.com/page/products_dyn?product_family=116)
• Has optimized designs for point-to-point communication using GDR
• Contact the MVAPICH help list with any questions related to the package: mvapich-[email protected]
HPC Advisory Council European Conference, June '14
New Design of MVAPICH2 with GPUDirect-RDMA (GPU-GPU Internode MPI Latency)
[Chart: small-message latency for MV2-GDR 2.0b, MV2-GDR-New (Loopback), and MV2-GDR-New (Fast Copy); latency drops from 7.3 usec to 5.0 usec (Loopback) and 3.5 usec (Fast Copy), a 52% improvement]
Intel Ivy Bridge (E5-2630 v2) node with 12 cores, NVIDIA Tesla K20c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 6.0, Mellanox OFED 2.1 with GPUDirect-RDMA plug-in
Application-Level Evaluation (HOOMD-blue)
[Charts: average TPS vs. number of GPU nodes (4-64) for MV2-2.0b-GDR, MV2-NewGDR-Loopback, and MV2-NewGDR-Fastcopy; HOOMD-blue strong and weak scaling, with 48%, 47%, 56%, and 53% improvements annotated]
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• Strong scaling (fixed 64K particles): Loopback and Fastcopy get up to 45% and 48% improvement for 32 GPUs
• Weak scaling (fixed 2K particles / GPU): Loopback and Fastcopy get up to 54% and 56% improvement for 16 GPUs
HPC Advisory Council European Conference, June '14
MPI-3 RMA Support with GPUDirect RDMA
Intel Ivy Bridge (E5-2630 v2) node with 12 cores, NVIDIA Tesla K20c GPU, Mellanox Connect-IB HCA, CUDA 6.0, Mellanox OFED 2.1 with GPUDirect-RDMA plug-in
MPI-3 RMA provides flexible synchronization and completion primitives
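A minimal sketch of the Put+Flush pattern measured here, using the standard MPI-3 RMA interface (window size and target rank are illustrative; run with at least two processes; with the GDR designs the window could equally be backed by GPU memory):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        const int count = 1024;         /* elements in the window, illustrative */
        int rank;
        double *base, value = 42.0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Expose a window of memory for one-sided access */
        MPI_Win_allocate(count * sizeof(double), sizeof(double),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

        if (rank == 0) {
            /* Passive-target epoch on rank 1 */
            MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);

            /* One-sided put of a single double into rank 1's window */
            MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);

            /* Flush guarantees the put has completed at the target
               before rank 0 continues (the pattern timed on this slide) */
            MPI_Win_flush(1, win);

            MPI_Win_unlock(1, win);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }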
[Charts: small-message latency and message rate for MV2-GDR 2.0b, MV2-GDR-New (Send/Recv), and MV2-GDR-New (Put+Flush); Put+Flush reduces small-message latency by about 39%, down to 2.8 usec]
New release with the advanced designs coming soon
HPC Advisory Council European Conference, June '14
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and Non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with Unified Runtime
37 HPC Advisory Council European Conference, June '14
MPI Applications on MIC Clusters
[Diagram: execution modes on Xeon + Xeon Phi nodes, ranging from multi-core centric to many-core centric: Host-only, Offload (/reverse offload), Symmetric, and Coprocessor-only]
• Flexibility in launching MPI jobs on clusters with Xeon Phi
38 HPC Advisory Council European Conference, June '14
Data Movement on Intel Xeon Phi Clusters
[Diagram: two nodes, each with CPUs and MICs connected via QPI, PCIe, and IB, illustrating the communication paths:]
1. Intra-Socket
2. Inter-Socket
3. Inter-Node
4. Intra-MIC
5. Intra-Socket MIC-MIC
6. Inter-Socket MIC-MIC
7. Inter-Node MIC-MIC
8. Intra-Socket MIC-Host
9. Inter-Socket MIC-Host
10. Inter-Node MIC-Host
11. Inter-Node MIC-MIC with IB adapter on remote socket
and more . . .
• Critical for runtimes to optimize data movement, hiding the complexity
• Connected as PCIe devices – flexibility but complexity
HPC Advisory Council European Conference, June '14
40
MVAPICH2-MIC Design for Clusters with IB and MIC
• Offload Mode
• Intranode Communication
  • Coprocessor-only Mode
  • Symmetric Mode
• Internode Communication
  • Coprocessor-only Mode
  • Symmetric Mode
• Multi-MIC Node Configurations
HPC Advisory Council European Conference, June '14
Direct-IB Communication
[Diagram: IB FDR HCA (MT4099), CPU (SandyBridge E5-2670 / IvyBridge E5-2680 v2), and Xeon Phi connected over PCIe and QPI; IB write to the Xeon Phi (P2P write) and IB read from the Xeon Phi (P2P read) for intra-socket and inter-socket placements, with measured bandwidths of 5280 MB/s (83%), 6396 MB/s (100%), 962 MB/s (15%), 3421 MB/s (54%), 1075 MB/s (17%), 1179 MB/s (19%), 370 MB/s (6%), and 247 MB/s (4%) against a peak IB FDR bandwidth of 6397 MB/s]
• Data movement between the IB HCA and the Xeon Phi is implemented as peer-to-peer transfers
• IB-verbs based communication from the Xeon Phi
• No porting effort for MPI libraries with IB support
HPC Advisory Council European Conference, June '14
MIC-Remote-MIC P2P Communication
[Charts: intra-socket and inter-socket P2P latency (large messages) and bandwidth vs. message size; peak bandwidths of about 5236 MB/s and 5594 MB/s]
InterNode – Symmetric Mode – P3DFFT Performance
• Note: application performance in symmetric mode is closely tied to the load balancing of computation between the host and the Xeon Phi
• Try different decompositions / threading levels to identify the best settings for the target application
[Charts: 3DStencil communication kernel (communication time per step, 8 nodes, host+MIC processes with 8 threads/proc) and P3DFFT 512x512x4K (time per step, 32 nodes, processes on host+MIC) for MV2-INTRA-OPT vs. MV2-MIC, with 50.2% and 16.2% improvements]
S. Potluri, D. Bureddy, K. Hamidouche, A. Venkatesh, K. Kandalla, H. Subramoni and D. K. Panda, MVAPICH-PRISM: A Proxy-based Communication Framework using InfiniBand and SCIF for Intel MIC Clusters, Int'l Conference on Supercomputing (SC '13), November 2013.
44
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014
[Charts: 32-node Allgather small-message latency (16H + 16M), 32-node Allgather large-message latency (8H + 8M), and 32-node Alltoall large-message latency (8H + 8M) for MV2-MIC vs. MV2-MIC-Opt; P3DFFT execution time (communication and computation) on 32 nodes (8H + 8M), size 2K*2K*1K, with improvements of 76%, 58%, and 55% annotated]
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and Non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with Unified Runtime
45 HPC Advisory Council European Conference, June '14
• Component failures are common in large-scale clusters
• Imposes need on reliability and fault tolerance
• Multiple challenges:
  – Checkpoint-Restart vs. Process Migration
  – Benefits of Scalable Checkpoint-Restart (SCR) support
    • Application guided
    • Application transparent
HPC Advisory Council European Conference, June '14 46
Fault Tolerance
HPC Advisory Council European Conference, June '14 47
Checkpoint-Restart vs. Process Migration: Low Overhead Failure Prediction with IPMI
• Job-wide Checkpoint/Restart is not scalable
• A Job-pause and Process Migration framework can deliver pro-active fault-tolerance
• Also allows for cluster-wide load-balancing by means of job compaction
[Charts: time to migrate 8 MPI processes (10.7X, 2.3X, and 4.3X improvements annotated) and LU Class C benchmark (64 processes) execution time broken into job stall, checkpoint (migration), resume, and restart for RDMA-based migration vs. migration without RDMA and CR to ext3 and PVFS, with 2.03x and 4.49x speedups]
X. Ouyang, R. Rajachandrasekar, X. Besseron, D. K. Panda, High Performance Pipelined Process Migration with RDMA, CCGrid 2011
X. Ouyang, S. Marcarelli, R. Rajachandrasekar and D. K. Panda, RDMA-Based Job Migration Framework for MPI over InfiniBand, Cluster 2010
Multi-Level Checkpointing with Scalable CR (SCR)
[Diagram: storage levels ordered by checkpoint cost and resiliency, from low to high]
• LLNL's Scalable Checkpoint/Restart library
• Can be used for application-guided and application-transparent checkpointing
• Effective utilization of the storage hierarchy
  – Local: store checkpoint data on the node's local storage, e.g. local disk, ramdisk
  – Partner: write to local storage and on a partner node
  – XOR: write file to local storage and small sets of nodes collectively compute and store parity redundancy data (RAID-5)
  – Stable Storage: write to the parallel file system
Application-guided Multi-Level Checkpointing
49 HPC Advisory Council European Conference, June '14
[Chart: checkpoint writing time (s) for PFS vs. MVAPICH2+SCR with Local, Partner, and XOR levels]
• Checkpoint writing phase times of a representative SCR-enabled MPI application
• 512 MPI processes (8 procs/node)
• Approx. 51 GB checkpoints
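For the application-guided mode, SCR exposes a small C API around the checkpoint-writing phase timed above. A minimal sketch under the assumption of the standard SCR calls (the checkpoint file name and its contents are illustrative):

    #include <stdio.h>
    #include <mpi.h>
    #include "scr.h"

    int main(int argc, char **argv)
    {
        int rank;
        char scr_path[SCR_MAX_FILENAME];
        char name[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        SCR_Init();                      /* initialize SCR after MPI_Init */

        /* Start a checkpoint phase; SCR picks the level (local/partner/XOR/PFS) */
        SCR_Start_checkpoint();

        /* Ask SCR where this rank should write its checkpoint file */
        snprintf(name, sizeof(name), "ckpt_rank_%d.dat", rank);
        SCR_Route_file(name, scr_path);

        FILE *fp = fopen(scr_path, "w");
        int valid = (fp != NULL);
        if (fp) {
            fprintf(fp, "application state for rank %d\n", rank);
            fclose(fp);
        }

        /* Mark the checkpoint complete (valid = 1 on success) */
        SCR_Complete_checkpoint(valid);

        SCR_Finalize();
        MPI_Finalize();
        return 0;
    }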
Transparent Multi-Level Checkpointing
50 HPC Advisory Council European Conference, June '14
• ENZO Cosmology application – Radiation Transport workload
• Using MVAPICH2's CR protocol instead of the application's in-built CR mechanism
• 512 MPI processes running on the Stampede supercomputer at TACC
• Approx. 12.8 GB checkpoints: ~2.5X reduction in checkpoint I/O costs
Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale
• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely low memory footprint
  – Support with Dynamically Connected Transport (DCT)
• Collective communication
  – Hardware-multicast-based
  – Offload and Non-blocking
  – Topology-aware
  – Power-aware
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, ...) with Unified Runtime
51 HPC Advisory Council European Conference, June '14
MVAPICH2-X for Hybrid MPI + PGAS Applications
HPC Advisory Council European Conference, June '14 52
[Diagram: MPI, OpenSHMEM, UPC, and hybrid (MPI + PGAS) applications issuing MPI, OpenSHMEM, and UPC calls over the unified MVAPICH2-X runtime on InfiniBand, RoCE, and iWARP]
• Unified communication runtime for MPI, UPC, OpenSHMEM available with MVAPICH2-X 1.9 onwards! – http://mvapich.cse.ohio-state.edu
• Feature highlights
  – Supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant
  – Scalable inter-node and intra-node communication – point-to-point and collectives
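With the unified runtime, MPI and OpenSHMEM calls can be mixed in one program. A minimal hybrid sketch using the OpenSHMEM 1.0 interface (the symmetric variable, the put of the rank id, and the final reduction are illustrative):

    #include <stdio.h>
    #include <mpi.h>
    #include <shmem.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        long local = 0, global = 0;

        /* One unified runtime backs both models in MVAPICH2-X */
        MPI_Init(&argc, &argv);
        start_pes(0);                    /* OpenSHMEM 1.0 initialization */

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Symmetric heap allocation: same address on every PE,
           usable as a target for one-sided puts */
        long *sym = (long *)shmalloc(sizeof(long));
        *sym = 0;
        shmem_barrier_all();

        /* One-sided OpenSHMEM put of this rank's id to the next PE */
        long me = rank;
        shmem_long_put(sym, &me, 1, (rank + 1) % nprocs);
        shmem_barrier_all();

        /* Two-sided/collective MPI on the value delivered via OpenSHMEM */
        local = *sym;
        MPI_Allreduce(&local, &global, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of received ids = %ld\n", global);

        shfree(sym);
        MPI_Finalize();
        return 0;
    }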
OpenSHMEM Collective Communication Performance
[Charts: Reduce (1,024 processes), Broadcast (1,024 processes), Collect (1,024 processes), and Barrier times vs. message size / number of processes for MV2X-SHMEM, Scalable-SHMEM, and OMPI-SHMEM; MV2X-SHMEM is up to 180X, 18X, 12X, and 3X faster]
HPC Advisory Council European Conference, June '14
OpenSHMEM Application Evaluation
[Charts: Heat Image and Daxpy execution time (s) at 256 and 512 processes for UH-SHMEM, OMPI-SHMEM (FCA), Scalable-SHMEM (FCA), and MV2X-SHMEM]
• Improved performance for OMPI-SHMEM and Scalable-SHMEM with FCA
• Execution time for 2DHeat Image at 512 processes (sec): UH-SHMEM – 523, OMPI-SHMEM – 214, Scalable-SHMEM – 193, MV2X-SHMEM – 169
• Execution time for DAXPY at 512 processes (sec): UH-SHMEM – 57, OMPI-SHMEM – 56, Scalable-SHMEM – 9.2, MV2X-SHMEM – 5.2
J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and D. K. Panda, A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters, OpenSHMEM Workshop (OpenSHMEM '14), March 2014
HPC Advisory Council European Conference, June '14
HPC Advisory Council European Conference, June '14 55
Hybrid MPI+OpenSHMEM Graph500 Design: Execution Time
[Charts: billions of traversed edges per second (TEPS) for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) designs – strong scalability vs. scale (26-29) and weak scalability vs. number of processes (1024-8192); execution time (s) vs. number of processes (4K, 8K, 16K)]
• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
  • 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  • 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
56
Hybrid MPI+OpenSHMEM Sort Application: Execution Time and Strong Scalability
• Performance of the Hybrid (MPI+OpenSHMEM) Sort application
• Execution time: for a 4 TB input at 4,096 cores, MPI – 2408 seconds, Hybrid – 1172 seconds; 51% improvement over the MPI-based design
• Strong scalability (constant input size of 500 GB): at 4,096 cores, MPI – 0.16 TB/min, Hybrid – 0.36 TB/min; 55% improvement over the MPI-based design
[Charts: sort rate (TB/min) vs. number of processes (512-4,096) and execution time (seconds) for input data / process counts from 500GB-512 to 4TB-4K, MPI vs. Hybrid]
HPC Advisory Council European Conference, June '14
MVAPICH2/MVAPICH2-X – Plans for Exascale
• Performance and memory scalability toward 900K-1M cores
  – Enhanced design with Dynamically Connected Transport (DCT) service and Connect-IB
• Enhanced optimization for GPGPU and coprocessor support
  – Extending the GPGPU support (GPU-Direct RDMA) with CUDA 6.5 and beyond
  – Support for Intel MIC (Knights Landing)
• Taking advantage of the Collective Offload framework
  – Including support for non-blocking collectives (MPI 3.0)
• RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Support for the MPI Tools Interface (as in MPI 3.0)
• Checkpoint-Restart and migration support with in-memory checkpointing
• Hybrid MPI+PGAS programming support with GPGPUs and Accelerators
57 HPC Advisory Council European Conference, June '14
Concluding Remarks
• InfiniBand with the RDMA feature is gaining momentum in HPC systems, with best performance and greater usage
• As the HPC community moves to Exascale, new solutions are needed in the MPI and hybrid MPI+PGAS stacks for supporting GPUs and Accelerators
• Demonstrated how such solutions can be designed with MVAPICH2/MVAPICH2-X and their performance benefits
• Such designs will allow application scientists and engineers to take advantage of upcoming exascale systems
HPC Advisory Council European Conference, June '14
HPC Advisory Council European Conference, June '14
Funding Acknowledgments
Funding Support by
Equipment Support by
59
HPC Advisory Council European Conference, June '14
Personnel Acknowledgments Current Students
– S. Chakraborthy (Ph.D.)
– N. Islam (Ph.D.)
– J. Jose (Ph.D.)
– M. Li (Ph.D.)
– R. Rajachandrasekhar (Ph.D.)
Past Students – P. Balaji (Ph.D.)
– D. BunRnas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– G. Santhanaraman (Ph.D.) – A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
60
Past Research Scientist – S. Sur
Current Post-‐Docs – X. Lu
– K. Hamidouche
Current Programmers – M. Arnold
– J. Perkins
Past Post-‐Docs – H. Wang – X. Besseron – H.-‐W. Jin
– M. Luo
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– M. Rahman (Ph.D.) – D. Shankar (Ph.D.)
– R. Shir (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– E. Mancini – S. Marcarelli
– J. Vienne
Current Senior Research Associate – H. Subramoni
Past Programmers – D. Bureddy
• August 25-27, 2014; Columbus, Ohio, USA
• Keynote Talks, Invited Talks, Contributed Presentations
• Half-day tutorial on MVAPICH2, MVAPICH2-X and MVAPICH2-GDR optimization and tuning
• Speakers (Confirmed so far): – Keynote: Dan Stanzione (TACC)
– Keynote: Darren Kerbyson (PNNL)
– Sadaf Alam (CSCS, Switzerland)
– Onur Celebioglu (Dell)
– Jens Glaser (Univ. of Michigan)
– Jeff Hammond (Intel)
– Adam Moody (LLNL)
– David Race (Cray)
– Davide Rossetti (NVIDIA)
– Gilad Shainer (Mellanox)
– Sayantan Sur (Intel)
– Mahidhar Tatineni (SDSC)
– Jerome Vienne (TACC)
• Call for Contributed Presentations (Deadline: July 1, 2014)
• More details at: http://mug.mvapich.cse.ohio-state.edu 61
Upcoming 2nd Annual MVAPICH User Group (MUG) Meeting
HPC Advisory Council European Conference, June '14
• Looking for bright and enthusiastic personnel to join as
  – Visiting Scientists
  – Post-Doctoral Researchers
  – PhD Students
  – MPI Programmer/Software Engineer
  – Hadoop/Big Data Programmer/Software Engineer
• If interested, please contact me at this conference and/or send an e-mail to [email protected]
62
Multiple Positions Available in My Group
HPC Advisory Council European Conference, June '14
HPC Advisory Council European Conference, June '14
Pointers
http://nowlab.cse.ohio-state.edu
[email protected]
http://www.cse.ohio-state.edu/~panda
63
http://mvapich.cse.ohio-state.edu
http://hadoop-rdma.cse.ohio-state.edu