Introduction to MPI
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
A Presentation at HPC Advisory Council Workshop, Lugano 2011
by
Sayantan Sur
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~surs
• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• How to Use MPI
• Challenges in Designing MPI Library on Petaflop and
Exaflop Systems
• Overview of MVAPICH and MVAPICH2 MPI Stack
• Sample Performance Numbers
2
Presentation Overview
HPC Advisory Council, Lugano Switzerland '11
• Growth of High Performance Computing
– Growth in processor performance
• Chip density doubles every 18 months
– Growth in commodity networking
• Increasing speed/features at decreasing cost
• Clusters: popular choice for HPC
– Scalability, Modularity and Upgradeability
Current and Next Generation Applications and Computing Systems
3 HPC Advisory Council, Lugano Switzerland '11
PetaFlop to ExaFlop Computing
4
• 10 PFlops in 2011
• 100 PFlops in 2015
• An ExaFlop system is expected in 2018-2019!
HPC Advisory Council, Lugano Switzerland '11
Trends for Computing Clusters in the Top 500 List (http://www.top500.org)
Nov. 1996: 0/500 (0%)
Jun. 1997: 1/500 (0.2%)
Nov. 1997: 1/500 (0.2%)
Jun. 1998: 1/500 (0.2%)
Nov. 1998: 2/500 (0.4%)
Jun. 1999: 6/500 (1.2%)
Nov. 1999: 7/500 (1.4%)
Jun. 2000: 11/500 (2.2%)
Nov. 2000: 28/500 (5.6%)
Jun. 2001: 33/500 (6.6%)
Nov. 2001: 43/500 (8.6%)
Jun. 2002: 80/500 (16%)
Nov. 2002: 93/500 (18.6%)
Jun. 2003: 149/500 (29.8%)
Nov. 2003: 208/500 (41.6%)
Jun. 2004: 291/500 (58.2%)
Nov. 2004: 294/500 (58.8%)
Jun. 2005: 304/500 (60.8%)
Nov. 2005: 360/500 (72.0%)
Jun. 2006: 364/500 (72.8%)
Nov. 2006: 361/500 (72.2%)
Jun. 2007: 373/500 (74.6%)
Nov. 2007: 406/500 (81.2%)
Jun. 2008: 400/500 (80.0%)
Nov. 2008: 410/500 (82.0%)
Jun. 2009: 410/500 (82.0%)
Nov. 2009: 417/500 (83.4%)
Jun. 2010: 424/500 (84.8%)
Nov. 2010: 415/500 (83%)
Jun. 2011: To be announced
5 HPC Advisory Council, Lugano Switzerland '11
• Hardware Components
– Processing Core and Memory
sub-system
– I/O Bus
– Network Adapter
– Accelerator
– Network Switch
• Software Components
– Communication software
• Memory <-> Accelerator
• Memory <-> Network Adapter
• Memory <-> Accelerator <-> Network Adapter
Major Components in Modern Computing Systems
[Diagram: two quad-core processors (P0 and P1), each with local memory and accelerators, connected through the I/O bus to a network adapter and a network switch; processing, I/O and network bottlenecks are highlighted.]
6 HPC Advisory Council, Lugano Switzerland '11
7
InfiniBand in the Top500
Percentage share of InfiniBand is steadily increasing
HPC Advisory Council, Lugano Switzerland '11
• 214 IB Clusters (42.8%) in the Nov ‘10 Top500 list (top500.org)
• Installations in the Top 30 (13 systems):
Large-scale InfiniBand Installations
120,640 cores (Nebulae) in China (3rd)
73,278 cores (Tsubame-2.0) in Japan (4th)
138,368 cores (Tera-100) in France (6th)
122,400 cores (RoadRunner) at LANL (7th)
81,920 cores (Pleiades) at NASA Ames (11th)
42,440 cores (Red Sky) at Sandia (14th)
62,976 cores (Ranger) at TACC (15th)
35,360 cores (Lomonosov) in Russia (17th)
15,120 cores (Loewe) in Germany (22nd)
26,304 cores (Juropa) in Germany (23rd)
26,232 cores (TachyonII) in South Korea (24th)
23,040 cores (Jade) at GENCI (27th)
33,120 cores (Mole-8.5) in China (28th)
More are being installed!
8 HPC Advisory Council, Lugano Switzerland '11
• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• How to Use MPI
• Challenges in Designing MPI Library on Petaflop and
Exaflop Systems
• Overview of MVAPICH and MVAPICH2 MPI Stack
• Sample Performance Numbers
9
Presentation Overview
HPC Advisory Council, Lugano Switzerland '11
• Parallel system offers greater compute and memory
capacity than a serial system
– Tackle problems that are too big to fit in one computer
• Different types of systems
– Uniform Shared Memory (bus-based)
• Many-way symmetric multi-processor machines: SGI, Sun, …
– Non-Uniform Memory Access (NUMA) shared memory
• ccNUMA machines: Cray CX1000, AMD Magny-Cours, Intel Westmere
– Distributed Memory Machines
• Commodity clusters, Blue Gene, Cray XT5
• Similarly, there are different types of programming models
– Shared memory, Distributed memory …
HPC Advisory Council, Lugano Switzerland '11 10
Parallel Systems - History and Overview
HPC Advisory Council, Lugano Switzerland '11 11
Parallel Programming Models Overview
[Diagram: three abstract machine models]
• Shared Memory Model (SHMEM, DSM): processes P1, P2, P3 share one memory
• Distributed Memory Model (MPI — Message Passing Interface): each process has its own memory
• Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, …): distributed memories presented as a logical shared memory
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• In this presentation series, we concentrate on MPI
Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges
[Diagram: software stack for multi-Petaflop and Exaflop systems]
• Applications
• Programming models: Message Passing Interface (MPI), Sockets and PGAS (UPC, Global Arrays)
• Library or runtime for programming models: point-to-point communication, collective communication, synchronization & locks, I/O & file systems, QoS, fault tolerance
• Commodity computing system architectures (single, dual, quad, ..), multi-/many-core architectures and accelerators
• Networking technologies (InfiniBand, 1/10/40GigE, RNICs & intelligent NICs)
HPC Advisory Council, Lugano Switzerland '11 12
• Message Passing Library standardized by MPI Forum
– C, C++ and Fortran
• Goal: portable, efficient and flexible standard for writing
parallel applications
• Not an IEEE or ISO standard, but widely considered the “industry
standard” for HPC applications
• Evolution of MPI
– MPI-1: 1994
– MPI-2: 1997
– MPI-3: on-going effort (2008 – current)
13
MPI Overview and History
HPC Advisory Council, Lugano Switzerland '11
• Primarily intended for
distributed memory machines
• P2 needs value of A
– MPI-1: P1 will have to send a
message to P2 with value of A
using MPI_Send
– MPI-2: P2 can get value of A
directly using MPI_Get
• P1, P2, P3 need sum of A+B+C
– MPI_Allreduce with SUM op
– Multi-way communication
14
What does MPI do?
[Diagram: P1 (A=5) and P2 (B=6) connected by a network; P1 sends the value of A to P2 with MPI_Send (MPI-1), or P2 fetches it directly with MPI_Get (MPI-2); P1, P2 and P3 (C=4) obtain the sum A+B+C with MPI_Allreduce.]
HPC Advisory Council, Lugano Switzerland '11
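For illustration, a minimal sketch (not part of the original slides) of the Allreduce scenario above, where each rank contributes one value and every rank receives the global sum:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank contributes a value, e.g. rank 0 -> 5, rank 1 -> 6, others -> 4 */
    value = (rank == 0) ? 5 : (rank == 1) ? 6 : 4;

    /* Multi-way communication: all ranks receive the sum */
    MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("Rank %d: sum = %d\n", rank, sum);
    MPI_Finalize();
    return 0;
}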
• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• How to Use MPI
• Challenges in Designing MPI Library on Petaflop and
Exaflop Systems
• Overview of MVAPICH and MVAPICH2 MPI Stack
• Sample Performance Numbers
15
Presentation Overview
HPC Advisory Council, Lugano Switzerland '11
• Point-to-point Two-sided Communication
• Collective Communication
• One-sided Communication
• Job Startup
• Parallel I/O
• Involvement of Network in MPI operations
Using MPI
16 HPC Advisory Council, Lugano Switzerland '11
17
Types of Point-to-Point Communication
• Synchronous (MPI_Ssend)
– Sender process blocks on the send until the matching receive has started
• Blocking Send / Receive (MPI_Send, MPI_Recv)
– Block until send buffer can be re-used
– Block until receive buffer is ready to read
• Non-blocking Send / Receive (MPI_Isend, MPI_Irecv)
– Start send and receive, but don’t wait until complete
• Others: buffered send, sendrecv, ready send
– Not used very frequently
HPC Advisory Council, Lugano Switzerland '11
• How does MPI know which send is for which receive?
• The programmer (i.e. you!) needs to provide this information
– Sender side: tag, destination rank and communicator
– A communicator is a subset of the entire set of MPI processes
– MPI_COMM_WORLD represents all MPI processes
– Receiver side: tag, source rank and communicator
– The triple (tag, rank, communicator) must match
• Some special, pre-defined values: MPI_ANY_TAG,
MPI_ANY_SOURCE
HPC Advisory Council, Lugano Switzerland '11 18
Message Matching
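To make the matching rules concrete, here is a hedged sketch (not from the original slides) in which rank 0 receives from any sender with any tag and recovers the actual source and tag from the status object:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, data;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != 0) {
        data = rank;                     /* payload identifies the sender */
        MPI_Send(&data, 1, MPI_INT, 0, rank /* tag */, MPI_COMM_WORLD);
    } else {
        int numtasks, i;
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        /* Rank 0 matches messages from any sender, with any tag */
        for (i = 1; i < numtasks; i++) {
            MPI_Recv(&data, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("Got %d from rank %d (tag %d)\n",
                   data, status.MPI_SOURCE, status.MPI_TAG);
        }
    }
    MPI_Finalize();
    return 0;
}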
19
Buffering
• MPI library has internal
“system” buffers
– Optimize throughput (do not
wait for receiver)
– Opaque to programmer
– Finite Resource
• Blocking send may copy to
sender system buffer and
return
Courtesy: https://computing.llnl.gov/tutorials/mpi/
HPC Advisory Council, Lugano Switzerland '11
20
Blocking vs. Non-blocking
• Blocking
– Send returns when safe to re-use buffer (maybe in system buffer)
– Receive only returns when data is fully received
• Non-blocking
– Returns immediately (data may or may not be buffered)
– Simply request MPI library to transfer the data
– Need to “wait” on handle returned by call
– Benefit is that computation and communication can be overlapped
HPC Advisory Council, Lugano Switzerland '11
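As an illustrative sketch (not from the original slides, assuming at least two processes), the non-blocking calls return request handles that are later completed with MPI_Waitall, leaving room for overlapped computation in between:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, sendval, recvval;
    MPI_Request reqs[2];
    MPI_Status stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ranks 0 and 1 exchange one integer with each other */
    if (rank < 2) {
        int peer = 1 - rank;
        sendval = rank;
        MPI_Irecv(&recvval, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&sendval, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... overlapped computation could go here ... */

        MPI_Waitall(2, reqs, stats);     /* wait on the returned handles */
        printf("Rank %d received %d from rank %d\n", rank, recvval, peer);
    }
    MPI_Finalize();
    return 0;
}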
• Messages will not overtake each other
• If sender sends two messages M1 and M2
in succession and both match the same
receive, then M1 will be received before
M2
• Converse is also true: if receiver posts two
receives R1 and R2 and both match
message M, then R1 will be completed
first
• Note: this does not mean MPI requires in-
order delivery (although many
implementations do this for simplicity)
HPC Advisory Council, Lugano Switzerland '11 21
Ordering
[Diagram: a sender sends M1 then M2 to the same receiver and both match receive R; M1 matching first is correct (✔), M2 matching first would violate ordering (✗).]
• MPI does not guarantee fairness
• If receive R matches message M1 from P1 and M2 from P2,
MPI does not say which one will match first
HPC Advisory Council, Lugano Switzerland '11 22
Fairness
[Diagram: M1 from P1 and M2 from P2 both match a single receive R posted by P3; MPI does not specify which matches first.]
Courtesy: https://computing.llnl.gov/tutorials/mpi/
HPC Advisory Council, Lugano Switzerland '11 23
Sample code for point-to-point
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[]) {
int numtasks, rank, dest, source, rc, count, tag=1;
char inmsg, outmsg='x';
MPI_Status Stat;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
dest = 1;
source = 1;
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
}
else if (rank == 1) {
dest = 0;
source = 0;
rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
}
rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
printf ("Task %d: Received %d char(s) f rom task %d with tag %d \n",
rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);
MPI_Finalize();
}
Courtesy: https://computing.llnl.gov/tutorials/mpi/
• Point-to-point Two-sided Communication
• Collective Communication
• One-sided Communication
• Job Startup
• Parallel I/O
• Involvement of Network in MPI operations
Using MPI
24 HPC Advisory Council, Lugano Switzerland '11
25
Types of Collective Communication
• Synchronization
– Processes wait until all of them have reached a certain point
• Data Movement
– Broadcast, Scatter, All-to-all …
• Collective Computation
– Allreduce with min, max, multiply, sum … on data
• Considerations
– Blocking, no tag required, only with pre-defined datatypes
– MPI-3 considering non-blocking versions
HPC Advisory Council, Lugano Switzerland '11
HPC Advisory Council, Lugano Switzerland '11 26
Example Collective Operation: Scatter
[Diagram: a 4x4 array of floats on the root (rows 1.0–4.0, 5.0–8.0, 9.0–12.0, 13.0–16.0) is scattered row-wise, one row to each of ranks 0–3.]
• Using Scatter, an array can be distributed to multiple
processes
HPC Advisory Council, Lugano Switzerland '11 27
Example code for the Scatter Example
#include "mpi.h"
#include <stdio.h>
#define SIZE 4
int main(int argc, char *argv[]) {
int numtasks, rank, sendcount, recvcount, source;
float sendbuf[SIZE][SIZE] = {
{1.0, 2.0, 3.0, 4.0},
{5.0, 6.0, 7.0, 8.0},
{9.0, 10.0, 11.0, 12.0},
{13.0, 14.0, 15.0, 16.0} };
float recvbuf[SIZE];
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
if (numtasks == SIZE) {
source = 1;
sendcount = SIZE;
recvcount = SIZE;
MPI_Scatter(sendbuf, sendcount, MPI_FLOAT, recvbuf, recvcount,
MPI_FLOAT, source, MPI_COMM_WORLD);
printf ("rank= %d Results: %f %f %f %f\n", rank, recvbuf[0],
recvbuf[1], recvbuf[2], recvbuf[3]);
}
else
printf ("Must specify %d processors. Terminating.\n",SIZE);
MPI_Finalize();
} Courtesy: https://computing.llnl.gov/tutorials/mpi/
• Point-to-point Two-sided Communication
• Collective Communication
• One-sided Communication
• Job Startup
• Parallel I/O
• Involvement of Network in MPI operations
Using MPI
28 HPC Advisory Council, Lugano Switzerland '11
29
Benefits of one-sided communication
• Easy to express irregular pattern of communication
– Easier than request-response pattern using two-sided
• Decouple data transfer from synchronization
– Various methods of synchronization
• Active synchronization
• Passive synchronization
• Potentially better performance with overlap of
computation and communication
HPC Advisory Council, Lugano Switzerland '11
HPC Advisory Council, Lugano Switzerland '11 30
Basic model of one-sided communication
[Diagram: ranks 0–3 each contribute a region of local memory (mem); together these regions form a global “window”.]
• Each process can contribute part of its memory to form a
larger “window” of global memory
• Creation and destruction of “windows” are collective
operations (all processes participate)
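As a hedged illustration (not part of the original slides, assuming at least two processes), the sketch below creates a window collectively and then uses the fence synchronization mode described on a later slide to expose a single MPI_Put:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, numtasks;
    int local = 0;                       /* memory exposed through the window */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    /* Window creation is collective: every process contributes its buffer */
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);               /* open an epoch on all processes */
    if (rank == 0 && numtasks > 1) {
        int value = 42;
        /* Write into rank 1's window; rank 1 makes no matching call */
        MPI_Put(&value, 1, MPI_INT, 1 /* target */, 0 /* displacement */,
                1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);               /* updates visible after the fence */

    if (rank == 1)
        printf("Rank 1: window now holds %d\n", local);

    MPI_Win_free(&win);                  /* destruction is also collective */
    MPI_Finalize();
    return 0;
}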
HPC Advisory Council, Lugano Switzerland '11 31
MPI One Sided Taxonomy
MPI-2 One-Sided Model
• Communication: Put, Get, Accumulate
• Synchronization
– Active
• Collective (entire window): Fence
• Group (subset of window): Post/Wait/Start/Complete
– Passive: Lock/Unlock
• Different modes suit different applications patterns
HPC Advisory Council, Lugano Switzerland '11 32
Synchronization Modes – Active Synchronization
[Diagram: the target exposes its window with MPI_Win_post and the origin opens access with MPI_Win_start; the origin issues MPI_Put while both sides continue overlapped computation; the origin finishes with MPI_Win_complete and the target with MPI_Win_wait.]
• Window is exposed with
win_post, and access started
with win_start
• Win_complete to end access,
and win_wait to make sure
window is available to read
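A possible sketch of the post/start/complete/wait mode just described (not from the original slides, assuming at least two processes), with rank 1 exposing its window and rank 0 accessing it:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, buf = 0;
    MPI_Win win;
    MPI_Group world_group, peer_group;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 1) {
        int origin = 0;
        MPI_Group_incl(world_group, 1, &origin, &peer_group);
        MPI_Win_post(peer_group, 0, win);   /* expose window to rank 0 */
        /* ... overlapped computation ... */
        MPI_Win_wait(win);                  /* window safe to read again */
        printf("Rank 1: buf = %d\n", buf);
    } else if (rank == 0) {
        int target = 1, value = 7;
        MPI_Group_incl(world_group, 1, &target, &peer_group);
        MPI_Win_start(peer_group, 0, win);  /* begin access epoch on rank 1 */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_complete(win);              /* end access epoch */
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}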
HPC Advisory Council, Lugano Switzerland '11 33
Synchronization Modes – Active Synchronization (Fence)
[Diagram: origin and target both create the window with MPI_Win_create; matching MPI_Win_fence calls on both sides delimit the epoch, with MPI_Put from the origin, overlapped computation on both sides, and local memory access at the target between fences.]
• Collective synchronization
• Collective among members of a
window
• Updates between fences only
visible after fence is complete
HPC Advisory Council, Lugano Switzerland '11 34
Synchronization Modes – Passive Synchronization
[Diagram: origin and target both create the window with MPI_Win_create; the origin brackets MPI_Put with MPI_Win_lock and MPI_Win_unlock while both sides continue overlapped computation and the target performs local memory access.]
• Billboard model
• Lock/unlock to have dedicated
access
• Lock/unlock are not blocking
• Put executed only when lock
granted on target
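A possible sketch of passive-target synchronization (not from the original slides, assuming at least two processes); rank 1 exposes its window but makes no matching MPI call while rank 0 locks it, writes, and reads the value back:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, buf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process exposes one integer through the window */
    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        int value = 99, check = 0;

        /* Exclusive lock on the target's window; the put is complete
           at the target once the unlock returns */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1 /* target */, 0, win);
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_unlock(1, win);

        /* Read it back, again without any matching call on rank 1 */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        MPI_Get(&check, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_unlock(1, win);
        printf("Rank 0 read back %d from rank 1's window\n", check);
    }

    MPI_Win_free(&win);                  /* collective; completes all RMA */
    MPI_Finalize();
    return 0;
}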
• Point-to-point Two-sided Communication
• Collective Communication
• One-sided Communication
• Job Startup
• Parallel I/O
• Involvement of Network in MPI operations
Using MPI
35 HPC Advisory Council, Lugano Switzerland '11
• MPI process managers provide support to launch jobs
• “mpiexec” is a utility to launch jobs
• Example usage:
– mpiexec -np 2 -machinefile mf ./a.out
• Supports SPMD (single program multiple data) model along
with MPMD (multiple program multiple data)
• Different resource management systems / MPI stacks may
do things slightly differently
– SLURM, PBS, Torque
– mpirun_rsh (fastest launcher for MVAPICH and MVAPICH2), hydra
and ORTE (Open MPI)
• Launch time should not increase with the number of processes
HPC Advisory Council, Lugano Switzerland '11 36
Launching MPI jobs
• Point-to-point Two-sided Communication
• Collective Communication
• One-sided Communication
• Job Startup
• Parallel I/O
• Involvement of Network in MPI operations
Using MPI
37 HPC Advisory Council, Lugano Switzerland '11
• Parallel I/O very important for scientific applications
• Parallel file systems offer high bandwidth access to large
volumes of data
– PVFS (parallel virtual file system)
– Lustre, GPFS …
• MPI applications can use MPI-IO layer for collective I/O
– Using MPI-IO, optimal I/O access patterns are used to read data
from disks
– Fast communication network then helps re-arrange data in order
desired by end application
HPC Advisory Council, Lugano Switzerland '11 38
Parallel I/O
• Critical optimization in MPI I/O
• All processes must call the collective function
• Basic idea: build large blocks from small requests so that
requests are large from the disk's point of view
– Particularly effective when accesses by different processes are
non-contiguous and interleaved
HPC Advisory Council, Lugano Switzerland '11 39
Collective I/O in MPI
Courtesy: http://www.mcs.anl.gov/research/projects/mpi/tutorial/advmpi/sc2005-advmpi.pdf
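As a hedged sketch of collective I/O (not from the original slides; the file name "datafile" is a placeholder), each rank writes its block with a single collective call, letting the MPI-IO layer merge the requests into large accesses:

#include <mpi.h>
#include <stdio.h>

#define COUNT 1024

int main(int argc, char *argv[])
{
    int rank, i, buf[COUNT];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < COUNT; i++)
        buf[i] = rank * COUNT + i;

    /* All processes open the file collectively */
    MPI_File_open(MPI_COMM_WORLD, "datafile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: each rank writes its block at an explicit offset,
       and the MPI-IO layer can combine the requests into large accesses */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * COUNT * sizeof(int),
                          buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}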
• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• Using MPI
• Challenges in Designing MPI Library on Petaflop and
Exaflop Systems
• Overview of MVAPICH and MVAPICH2 MPI Stack
• Sample Performance Numbers
40
Presentation Overview
HPC Advisory Council, Lugano Switzerland '11
Designing MPI Using InfiniBand Features
HPC Advisory Council, Lugano Switzerland '11 41
Many different design choices
• Major components in MPI: protocol mapping, buffer management, flow control, connection management, communication progress, collective communication, multi-rail support, one-sided active/passive, checkpoint-restart, process migration, reliability and resiliency
• InfiniBand features: Send/Receive, RDMA operations, atomic operations, Unreliable Datagram, Reliable Connection, eXtended Reliable Connection (XRC), Shared Receive Queues (SRQ), multicast, end-to-end flow control, static rate control, multi-path (LMC), QoS (SL and VL)
• Optimal design choices: Performance • Scalability • Fault-Tolerance & Resiliency • Power-Aware
• Inter-node Pt-to-pt Communication
– Challenges
• Sender memory -> IB adapter (sender) -> IB switch -> IB adapter
(receiver) -> Receiver memory
• Short message (eager) and Long message (rendezvous)
• Send/Recv vs. RDMA
– Metrics
• Latency (lowest)
• Bandwidth and Bi-Directional bandwidth (highest)
• CPU utilization (lowest)
– Maximum overlap between communication and computation
• Message Rate (highest)
HPC Advisory Council, Lugano Switzerland '11 42
Performance Issues
• Added Challenges for Intra-node Pt-to-Pt Communication
– Multi-core platforms are emerging
– Cache hierarchy (shared L2 or not, L3)
– Intra-socket and Inter-socket communication cost (latency and
bandwidth) are different
• May need different scheme for intra-socket and inter-socket
communication
– Process-core mapping plays an important role
• Concurrent Communication
– Multi-rail organizations and schemes for efficient usage of the rails
– Polling scheme within the MPI library
HPC Advisory Council, Lugano Switzerland '11 43
Performance Issues (Cont’d)
• Collective Communication
– Metrics
• Minimize latency
• Maximize throughput (example: concurrent broadcasts)
– Challenges
• Optimal algorithms to minimize
– Network contention
– Contention at the source and destination adapter(s)
– CPU involvement/overhead
• Different algorithms based on system size and message size
• Multi-core-aware algorithms for the emerging multi-core platforms
• Topology-aware algorithms that dynamically adapt to the
underlying network topology
HPC Advisory Council, Lugano Switzerland '11 44
Performance Issues (Cont’d)
• Performance of an application should increase as system
size increases
– Strong-Scaling
• Problem size is kept constant as system size increases
– Weak-Scaling
• Problem size keeps on increasing as system size increases
• Depends on
– Structure of the application
– Underlying algorithms being used
– Performance of MPI library
• All Performance Issues (as indicated earlier) matter for the MPI library
• Additional Issues
– Network topology
– CPU mapping to cores (block and cyclic across nodes and within nodes)
HPC Advisory Council, Lugano Switzerland '11 45
Obtaining Scalable Performance
• Does the memory needed by the MPI library increase with system size?
• Different transport protocols with IB
– Reliable Connection (RC) is the most common
– Unreliable Datagram (UD) is used in some cases
• Buffers need to be posted at each receiver to receive message from
any sender
– Buffer requirement can increase with system size
• Connections need to be established across processes under RC
– Each connection requires certain amount of memory for handling related
data structures
– Memory required for all connections can increase with system size
• Both issues have become critical as large-scale IB deployments have
taken place
– Being addressed by IB specification (SRQ, XRC, UD/RC/XRC Hybrid) and
MPI library (Will be discussed more in Day 2)
HPC Advisory Council, Lugano Switzerland '11 46
Memory Scalability of MPI Library in large-scale systems
• Millions of cores and components in next-generation Multi-PetaFlop
and Exaflop systems
• Components are bound to fail
• Mean Time Between Failure (MTBF) has to remain high so that
Exascale applications can run efficiently
• Two broad kinds of failures
– Network failure (adapter, link and switch)
– Node or Process failure
• InfiniBand provides multiple schemes like CRC, end-to-end reliability,
reliable connection (RC) mode, Automatic Path Migration (APM) to
handle network related errors
• Can MPI library be made Resilient? (Day 2)
• Can MPI library support efficient checkpoint-restart and process
migration for process/node failure? (Day 2)
HPC Advisory Council, Lugano Switzerland '11 47
Fault-Tolerance and Resiliency
• Power consumption is becoming a significant issue for the
design and deployment of Multi-Petaflop and Exaflop
systems
• All different hardware components (CPU, memory, storage,
network adapter, switch and links) are being re-designed
with less power consumption in mind
• Targeted goal is 20MW for an Exaflop system in 2018-2020
• Can we make the MPI library power-aware?
– Polling-based schemes are common in MPI libraries to receive
messages and act upon them quickly
– Continuous polling by the CPU consumes a lot of power
– Can the CPUs run at lower speed when large collective
operations are taking place?
– Can we design power-aware collective schemes? (Day 2)
HPC Advisory Council, Lugano Switzerland '11 48
Power-Aware Designs
• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• Using MPI
• Challenges in Designing MPI Library on Petaflop and
Exaflop Systems
• Overview of MVAPICH and MVAPICH2 MPI Stack
• Sample Performance Numbers
49
Presentation Overview
HPC Advisory Council, Lugano Switzerland '11
• High Performance MPI Library for IB, 10GE/iWARP & RoCE
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
– Latest Releases: MVAPICH 1.2 and MVAPICH2 1.6
– Used by more than 1,500 organizations in 60 countries
• Registered at the OSU site voluntarily
– More than 57,000 downloads from OSU site directly
– Empowering many TOP500 production clusters during the last eight years
– Available with software stacks of many IB, 10GE and server vendors including
Open Fabrics Enterprise Distribution (OFED) and Linux Distros
– Also supports uDAPL device to work with any network supporting uDAPL
– http://mvapich.cse.ohio-state.edu/
50
MVAPICH/MVAPICH2 Software
HPC Advisory Council, Lugano Switzerland '11
MVAPICH-1 Architecture
[Diagram: MVAPICH (MPI-1) 1.2 supports five interfaces — #1 OpenFabrics/Gen2 (single-rail), #2 OpenFabrics/Gen2-Hybrid (single-rail), #3 PSM, #4 TCP/IP and #5 Shared Memory — plus VAPI, Gen2-Multirail and uDAPL (deprecated); targets InfiniBand from Mellanox (PCI-X, PCIe, PCIe-Gen2; SDR, DDR & QDR) and QLogic (PCIe & HT; SDR, DDR & QDR), any network supporting TCP/IP, and single nodes/laptops with multi-core processors.]
Major Computing Platforms: IA-32, EM64T, Nehalem, Westmere, Opteron, Magny, ..
51 HPC Advisory Council, Lugano Switzerland '11
Major Features of MVAPICH 1.2
• OpenFabrics-Gen2
– Scalable job start-up with mpirun_rsh, support for SLURM
– RC and XRC support
– Flexible message coalescing
– Multi-core-aware pt-to-pt communication
– User-defined processor affinity for multi-core platforms
– Multi-core-optimized collective communication
– Asynchronous and scalable on-demand connection management
– RDMA Write and RDMA Read-based protocols
– Lock-free asynchronous progress for better overlap between computation and communication
– Polling and blocking support for communication progress
– Multi-pathing support leveraging the LMC mechanism on large fabrics
– Network-level fault tolerance with Automatic Path Migration (APM)
– Mem-to-mem reliable data transfer mode (for detection of I/O errors with 32-bit CRC)
– Network fault resiliency
52
HPC Advisory Council, Lugano Switzerland '11
Major Features of MVAPICH 1.2 (Continued)
• OpenFabrics-Gen2-Hybrid
– Interface introduced in 1.1; replaces the UD interface in 1.0
– Targeted at emerging multi-thousand-core clusters to achieve the best performance with minimal memory footprint
– Most of the features as in Gen2
– Adaptive selection during run-time (based on application and system characteristics) to switch between RC and UD (or between XRC and UD) transports
– Multiple buffer organization with XRC support
53 HPC Advisory Council, Lugano Switzerland '11
MVAPICH2 Architecture (Latest Release 1.6)
Major Computing Platforms: IA-32, EM64T, Nehalem, Westmere, Opteron, Magny, ..
54 HPC Advisory Council, Lugano Switzerland '11
[Architecture diagram: supports all different PCI interfaces]
MVAPICH2 1.6 Features
• Support for GPUDirect
• Using LiMIC2 for true one-sided intra-node RMA transfer to avoid extra memory copies
• Upgraded to LiMIC2 version 0.5.4
• Removing the limitation on number of concurrent windows in RMA operations
• Support for InfiniBand Quality of Service (QoS) with multiple virtual lanes
• Support for 3D Torus Topology
• Enhanced support for multi-threaded applications
• Fast Checkpoint-Restart support with aggregation scheme
• Job Pause-Migration-Restart Framework for Pro-active Fault-Tolerance
• Support for new standardized Fault-Tolerance Backplane (FTB) Events for CR and Migration Frameworks
• Dynamic detection of multiple InfiniBand adapters and using these by default in multi-rail configurations
• Support for process-to-rail binding policy (bunch, scatter and user-defined) in multi-rail configurations
• Enhanced and optimized algorithms for MPI_Reduce and MPI_AllReduce operations for small and medium message sizes
• XRC support with Hydra Process Manager
55 HPC Advisory Council, Lugano Switzerland '11
56
Support for Multiple Interfaces/Adapters
• OpenFabrics/Gen2-IB and OpenFabrics/Gen2-Hybrid
– All IB adapters supporting OpenFabrics/Gen2
• QLogic/PSM
– QLogic adapters
• OpenFabrics/Gen2-iWARP
– Chelsio and Intel-NetEffect
• RoCE
– ConnectX-EN
• uDAPL
– Linux-IB, Solaris-IB, any other adapter supporting uDAPL
• TCP/IP
– Any adapter supporting the TCP/IP interface
• Shared Memory Channel
– For running applications on a node with multi-core processors (laptops, SMP systems)
HPC Advisory Council, Lugano Switzerland '11
• Trends in Designing Petaflop and Exaflop Systems
• Overview of Programming Models and MPI
• Using MPI
• Challenges in Designing MPI Library on Petaflop and
Exaflop Systems
• Overview of MVAPICH and MVAPICH2 MPI Stack
• Sample Performance Numbers
57
Presentation Overview
HPC Advisory Council, Lugano Switzerland '11
MVAPICH2 Inter-Node Performance Ping Pong Latency
58 HPC Advisory Council, Lugano Switzerland '11
[Plots: ping-pong latency vs. message size for MVAPICH2-1.6, for small and large messages; small-message latency is 1.56 us.]
Intel Westmere 2.53 GHz with Mellanox ConnectX-2 QDR Adapter
MVAPICH2 Inter-Node Performance
59 HPC Advisory Council, Lugano Switzerland '11
[Plots: uni-directional bandwidth (peak 3394 MB/s) and bi-directional bandwidth (peak 6539 MB/s) vs. message size for MVAPICH2-1.6.]
Intel Westmere 2.53 GHz with Mellanox ConnectX-2 QDR Adapter
Performance of HPC Applications on TACC Ranger using MVAPICH + IB
• Rob Farber’s facial recognition application was run on up to 60K cores using MVAPICH
• Efficiency ranges from 84% of peak at the low end to 65% of peak at the high end
http://www.tacc.utexas.edu/research/users/features/index.php?m_b_c=farber
HPC Advisory Council, Lugano Switzerland '11 60
Performance of HPC Applications on TACC Ranger: DNS/Turbulence
Courtesy: P.K. Yeung, Diego Donzis, TG 2008
HPC Advisory Council, Lugano Switzerland '11 61
Application Example: Blast Simulations
• Researchers from the
University of Utah have
developed a simulation
framework, called Uintah
• Combines advanced
mechanical, chemical and
physical models into a
novel computational
framework
• Have run > 32K MPI tasks
on Ranger
• Uses asynchronous
communication
http://www.tacc.utexas.edu/news/feature-stories/2009/explosive-science/
Courtesy: J. Luitjens, M. Berzins, Univ of Utah
HPC Advisory Council, Lugano Switzerland '11 62
Application Example: OMEN
• OMEN is a two- and three-dimensional Schrödinger-Poisson-based solver
• Used in semi-conductor
modeling
• Run to almost 60K tasks
Courtesy: Mathieu Luisier, Gerhard Klimeck, Purdue
http://www.tacc.utexas.edu/RangerImpact/pdf/Save_Our_Semiconductors.pdf
HPC Advisory Council, Lugano Switzerland '11 63
• Presented trends in Petaflop and Exaflop systems
• Presented an overview of MPI
• Discussed how to use the basic features of MPI
• Discussed challenges in designing MPI libraries
• Overview of MVAPICH and MVAPICH2 stack with sample
performance numbers
• MPI has a long-standing reputation for portability and
performance; it is likely to remain a critical component
of future Exascale machines
Concluding Remarks
64 HPC Advisory Council, Lugano Switzerland '11
• Day 2 (MPI Performance and Optimizations)
– Major components of MVAPICH and MVAPICH2 stacks
• Job start-up, Connection Management, Pt-to-pt (inter-node, intra-
node) communication, LiMIC2, One-sided communication, Collective
communication, Multi-rail, Scalability (SRQ, XRC, UD/RC/XRC Hybrid),
QoS and 3D Torus and Fault-tolerance (network-level and process-
level)
– How to use these components and carry out runtime optimizations
• Day 3 (Future of MPI)
– Advanced and Upcoming Features of MVAPICH2 stack
• Collective Offload, Topology-aware collectives, Power-aware
collectives, GPUDirect support and PGAS (UPC) support
– Upcoming MPI-3 standard and Features
Preview of Day 2 and Day 3 Presentations
65 HPC Advisory Council, Lugano Switzerland '11
Web Pointers
http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~surs
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu
66 HPC Advisory Council, Lugano Switzerland '11