
Page 1

Introduction to MPI

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

A Presentation at HPC Advisory Council Workshop, Lugano 2011, by Sayantan Sur

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~surs

Page 2

• Trends in Designing Petaflop and Exaflop Systems

• Overview of Programming Models and MPI

• How to Use MPI

• Challenges in Designing MPI Library on Petaflop and Exaflop Systems

• Overview of MVAPICH and MVAPICH2 MPI Stack

• Sample Performance Numbers

2

Presentation Overview

HPC Advisory Council, Lugano Switzerland '11

Page 3

• Growth of High Performance Computing

– Growth in processor performance

• Chip density doubles every 18 months

– Growth in commodity networking

• Increasing speed and features, with decreasing cost

• Clusters: popular choice for HPC

– Scalability, Modularity and Upgradeability

Current and Next Generation Applications and Computing Systems

3 HPC Advisory Council, Lugano Switzerland '11

Page 4

PetaFlop to ExaFlop Computing

4

10 PFlops in 2011; 100 PFlops in 2015

Expected to have an ExaFlop system in 2018-2019!

HPC Advisory Council, Lugano Switzerland '11

Page 5

Trends for Computing Clusters in the Top 500 List (http://www.top500.org)

Nov. 1996: 0/500 (0%) Nov. 2001: 43/500 (8.6%) Nov. 2006: 361/500 (72.2%)

Jun. 1997: 1/500 (0.2%) Jun. 2002: 80/500 (16%) Jun. 2007: 373/500 (74.6%)

Nov. 1997: 1/500 (0.2%) Nov. 2002: 93/500 (18.6%) Nov. 2007: 406/500 (81.2%)

Jun. 1998: 1/500 (0.2%) Jun. 2003: 149/500 (29.8%) Jun. 2008: 400/500 (80.0%)

Nov. 1998: 2/500 (0.4%) Nov. 2003: 208/500 (41.6%) Nov. 2008: 410/500 (82.0%)

Jun. 1999: 6/500 (1.2%) Jun. 2004: 291/500 (58.2%) Jun. 2009: 410/500 (82.0%)

Nov. 1999: 7/500 (1.4%) Nov. 2004: 294/500 (58.8%) Nov. 2009: 417/500 (83.4%)

Jun. 2000: 11/500 (2.2%) Jun. 2005: 304/500 (60.8%) Jun. 2010: 424/500 (84.8%)

Nov. 2000: 28/500 (5.6%) Nov. 2005: 360/500 (72.0%) Nov. 2010: 415/500 (83%)

Jun. 2001: 33/500 (6.6%) Jun. 2006: 364/500 (72.8%) Jun. 2011: To be announced

5 HPC Advisory Council, Lugano Switzerland '11

Page 6

• Hardware Components
  – Processing Core and Memory sub-system
  – I/O Bus
  – Network Adapter
  – Accelerator
  – Network Switch

• Software Components
  – Communication software
    • Memory <-> Accelerator
    • Memory <-> NW Adapter
    • Memory <-> Accelerator <-> NW Adapter

Major Components in Modern Computing Systems

[Figure: a node with two processors P0 and P1 (Cores 0-3 each), memory attached to each, an I/O bus connecting to accelerators and a network adapter, and a network switch; annotated with processing, I/O and network bottlenecks]

6 HPC Advisory Council, Lugano Switzerland '11

Page 7

7

InfiniBand in the Top500

Percentage share of InfiniBand is steadily increasing

HPC Advisory Council, Lugano Switzerland '11

Page 8

• 214 IB Clusters (42.8%) in the Nov ‘10 Top500 list (top500.org)

• Installations in the Top 30 (13 systems):

Large-scale InfiniBand Installations

120,640 cores (Nebulae) in China (3rd)
73,278 cores (Tsubame-2.0) in Japan (4th)
138,368 cores (Tera-100) in France (6th)
122,400 cores (RoadRunner) at LANL (7th)
81,920 cores (Pleiades) at NASA Ames (11th)
42,440 cores (Red Sky) at Sandia (14th)
62,976 cores (Ranger) at TACC (15th)
35,360 cores (Lomonosov) in Russia (17th)
15,120 cores (Loewe) in Germany (22nd)
26,304 cores (Juropa) in Germany (23rd)
26,232 cores (TachyonII) in South Korea (24th)
23,040 cores (Jade) at GENCI (27th)
33,120 cores (Mole-8.5) in China (28th)

More are getting installed!

8 HPC Advisory Council, Lugano Switzerland '11

Page 9

• Trends in Designing Petaflop and Exaflop Systems

• Overview of Programming Models and MPI

• How to Use MPI

• Challenges in Designing MPI Library on Petaflop and Exaflop Systems

• Overview of MVAPICH and MVAPICH2 MPI Stack

• Sample Performance Numbers

9

Presentation Overview

HPC Advisory Council, Lugano Switzerland '11

Page 10

• A parallel system offers greater compute and memory capacity than a serial system
  – Tackle problems that are too big to fit in one computer
• Different types of systems
  – Uniform Shared Memory (bus based)
    • Many-way symmetric multi-processor machines: SGI, Sun, …
  – Non-uniform Shared Memory (NUMA)
    • CC-NUMA machines: Cray CX1000, AMD Magny-Cours, Intel Westmere
  – Distributed Memory Machines
    • Commodity clusters, Blue Gene, Cray XT5
• Similarly, there are different types of programming models
  – Shared memory, Distributed memory, …

HPC Advisory Council, Lugano Switzerland '11 10

Parallel Systems - History and Overview

Page 11

HPC Advisory Council, Lugano Switzerland '11 11

Parallel Programming Models Overview

[Figure: three abstract machine models]
• Shared Memory Model (P1, P2, P3 share one memory): SHMEM, DSM
• Distributed Memory Model (P1, P2, P3 each with private memory): MPI (Message Passing Interface)
• Partitioned Global Address Space (PGAS) (private memories plus a logical shared memory): Global Arrays, UPC, Chapel, X10, CAF, …

• Programming models provide abstract machine models

• Models can be mapped on different types of systems

– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.

• In this presentation series, we concentrate on MPI

Page 12

Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: layered view of the challenges. Applications sit on top of programming models (Message Passing Interface (MPI), Sockets and PGAS (UPC, Global Arrays)); the library or runtime for these programming models must provide point-to-point communication, collective communication, synchronization & locks, I/O & file systems, QoS and fault tolerance; underneath are commodity computing system architectures (single, dual, quad, .., multi/many-core and accelerators) and networking technologies (InfiniBand, 1/10/40GigE, RNICs & intelligent NICs)]

HPC Advisory Council, Lugano Switzerland '11 12

Page 13

• Message Passing Library standardized by MPI Forum

– C, C++ and Fortran

• Goal: a portable, efficient and flexible standard for writing parallel applications

• Not an IEEE or ISO standard, but widely considered the “industry standard” for HPC applications

• Evolution of MPI

– MPI-1: 1994

– MPI-2: 1996

– MPI-3: on-going effort (2008 – current)

13

MPI Overview and History

HPC Advisory Council, Lugano Switzerland '11

Page 14

• Primarily intended for distributed memory machines
• P2 needs value of A
  – MPI-1: P1 will have to send a message to P2 with the value of A using MPI_Send
  – MPI-2: P2 can get the value of A directly using MPI_Get
• P1, P2, P3 need sum of A+B+C
  – MPI_Allreduce with SUM op (see the sketch below)
  – Multi-way communication

14

What does MPI do?

[Figure: left, P1 (A=5) and P2 (B=6) connected by a network, with A moved either by MPI_Send from P1 or by MPI_Get from P2; right, P1 (A=5), P2 (B=6) and P3 (C=4) combining their values with MPI_Allreduce]

HPC Advisory Council, Lugano Switzerland '11
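As a concrete, hedged illustration of the Allreduce case above (not taken from the slides; variable names and values are illustrative), every rank contributes one integer and all ranks receive the sum:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, local, sum;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = rank + 4;   /* illustrative values: e.g. P0 holds 4, P1 holds 5, P2 holds 6 */
    /* Every rank contributes its value and every rank receives the global sum */
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("Rank %d: sum = %d\n", rank, sum);
    MPI_Finalize();
    return 0;
}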

Page 15

• Trends in Designing Petaflop and Exaflop Systems

• Overview of Programming Models and MPI

• How to Use MPI

• Challenges in Designing MPI Library on Petaflop and Exaflop Systems

• Overview of MVAPICH and MVAPICH2 MPI Stack

• Sample Performance Numbers

15

Presentation Overview

HPC Advisory Council, Lugano Switzerland '11

Page 16

• Point-to-point Two-sided Communication

• Collective Communication

• One-sided Communication

• Job Startup

• Parallel I/O

• Involvement of Network in MPI operations

Using MPI

16 HPC Advisory Council, Lugano Switzerland '11

Page 17

17

Types of Point-to-Point Communication

• Synchronous (MPI_Ssend)

– Sender process blocks on send until receiver arrives

• Blocking Send / Receive (MPI_Send, MPI_Recv)

– Block until send buffer can be re-used

– Block until receive buffer is ready to read

• Non-blocking Send / Receive (MPI_Isend, MPI_Irecv)

– Start send and receive, but don’t wait until complete

• Others: buffered send, sendrecv, ready send

– Not used very frequently

HPC Advisory Council, Lugano Switzerland '11

Page 18

• How does MPI know which send is for which receive?

• The programmer (i.e. you!) needs to provide this information
  – Sender side: tag, destination rank and communicator
  – A communicator is a subset of the entire set of MPI processes
  – MPI_COMM_WORLD represents all MPI processes
  – Receiver side: tag, source rank and communicator
  – The triple (tag, rank, communicator) must match
• Some special, pre-defined values: MPI_ANY_TAG, MPI_ANY_SOURCE (see the sketch below)

HPC Advisory Council, Lugano Switzerland '11 18

Message Matching
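A minimal, hedged sketch (illustrative names, not from the slides) showing where the tag, rank and communicator appear, with wildcards on the receiver side; rank 0 receives one message from every other rank:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, value;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank != 0) {
        /* Sender side: destination rank 0, tag = my rank, communicator */
        MPI_Send(&rank, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
    } else {
        /* Receiver side: wildcards match any source and any tag; the status
           object reports which rank actually sent and with which tag */
        for (int i = 1; i < size; i++) {
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("Got %d from rank %d (tag %d)\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }
    }
    MPI_Finalize();
    return 0;
}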

Page 19

19

Buffering

• MPI library has internal “system” buffers
  – Optimize throughput (do not wait for receiver)
  – Opaque to programmer
  – Finite resource
• A blocking send may copy to the sender system buffer and return

Courtesy: https://computing.llnl.gov/tutorials/mpi/

HPC Advisory Council, Lugano Switzerland '11

Page 20

20

Blocking vs. Non-blocking

• Blocking

– Send returns when safe to re-use buffer (maybe in system buffer)

– Receive only returns when data is fully received

• Non-blocking

– Returns immediately (data may or may not be buffered)

– Simply request MPI library to transfer the data

– Need to “wait” on handle returned by call

– Benefit is that computation and communication can be overlapped

HPC Advisory Council, Lugano Switzerland '11
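A minimal, hedged sketch of the non-blocking pattern just described (assumes exactly two ranks; names are illustrative): each rank starts a receive and a send, can compute while the transfers proceed, and then waits on the returned request handles:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, other, sendval, recvval;
    MPI_Request reqs[2];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                 /* assumes exactly 2 ranks */
    sendval = rank;
    /* Start both transfers; neither call waits for completion */
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
    /* ... useful computation can overlap with communication here ... */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* wait on the returned handles */
    printf("Rank %d received %d\n", rank, recvval);
    MPI_Finalize();
    return 0;
}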

Page 21

• Messages will not overtake each other
• If a sender sends two messages M1 and M2 in succession and both match the same receive, then M1 will be received before M2
• The converse is also true: if a receiver posts two receives R1 and R2 and both match message M, then R1 will be completed first
• Note: this does not mean MPI requires in-order delivery (although many implementations do this for simplicity)

HPC Advisory Council, Lugano Switzerland '11 21

Ordering

[Figure: two sender/receiver timelines; messages M1 and M2 sent in succession arrive at a posted receive R in order]

Page 22

• MPI does not guarantee fairness

• If receive R matches message M1 from P1 and M2 from P2, MPI does not say which one will match first

HPC Advisory Council, Lugano Switzerland '11 22

Fairness

[Figure: P1 sends M1 and P2 sends M2 to P3, which posts a single receive R; either message may match]

Courtesy: https://computing.llnl.gov/tutorials/mpi/

Page 23

HPC Advisory Council, Lugano Switzerland '11 23

Sample code for point-to-point

#include "mpi.h"

#include <stdio.h>

int main(argc,argv)

int argc;

char *argv[]; {

int numtasks, rank, dest, source, rc, count, tag=1;

char inmsg, outmsg='x';

MPI_Status Stat;

MPI_Init(&argc,&argv);

MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (rank == 0) {

dest = 1;

source = 1;

rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);

rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);

}

else if (rank == 1) {

dest = 0;

source = 0;

rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);

rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);

}

rc = MPI_Get_count(&Stat, MPI_CHAR, &count);

printf ("Task %d: Received %d char(s) f rom task %d with tag %d \n",

rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);

MPI_Finalize();

}

Courtesy: https://computing.llnl.gov/tutorials/mpi/
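With a typical MPI installation, this example could be built and run with the usual wrapper and launcher, for instance (file name illustrative): mpicc pingpong.c -o pingpong, then mpirun -np 2 ./pingpong (or the mpiexec/mpirun_rsh equivalents discussed later).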

Page 24

• Point-to-point Two-sided Communication

• Collective Communication

• One-sided Communication

• Job Startup

• Parallel I/O

• Involvement of Network in MPI operations

Using MPI

24 HPC Advisory Council, Lugano Switzerland '11

Page 25

25

Types of Collective Communication

• Synchronization

– Processes wait until all of them have reached a certain point

• Data Movement

– Broadcast, Scatter, All-to-all …

• Collective Computation

– Allreduce with min, max, multiply, sum … on data

• Considerations

– Blocking, no tag required, only with pre-defined datatypes

– MPI-3 considering non-blocking versions

HPC Advisory Council, Lugano Switzerland '11
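As a small, hedged illustration of the synchronization and data-movement categories (array contents are illustrative, not from the slides), rank 0 broadcasts an array to all ranks and a barrier then synchronizes everyone:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, data[4] = {0, 0, 0, 0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                     /* root fills the buffer */
        for (int i = 0; i < 4; i++) data[i] = i + 1;
    }
    /* Every rank calls MPI_Bcast; after it returns, all ranks hold the root's data */
    MPI_Bcast(data, 4, MPI_INT, 0, MPI_COMM_WORLD);
    /* Synchronization: no rank passes the barrier until all have reached it */
    MPI_Barrier(MPI_COMM_WORLD);
    printf("Rank %d: data[3] = %d\n", rank, data[3]);
    MPI_Finalize();
    return 0;
}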

Page 26

HPC Advisory Council, Lugano Switzerland '11 26

Example Collective Operation: Scatter

[Figure: Scatter of a 4x4 array held on the root (rows {1.0 2.0 3.0 4.0}, {5.0 6.0 7.0 8.0}, {9.0 10.0 11.0 12.0}, {13.0 14.0 15.0 16.0}); each of Ranks 0-3 receives one row]

• Using Scatter, an array can be distributed to multiple processes

Page 27

HPC Advisory Council, Lugano Switzerland '11 27

Example code for the Scatter Example

#include "mpi.h"

#include <stdio.h>

#def ine SIZE 4

int main(argc,argv)

int argc;

char *argv[]; {

int numtasks, rank, sendcount, recvcount, source;

f loat sendbuf [SIZE][SIZE] = {

{1.0, 2.0, 3.0, 4.0},

{5.0, 6.0, 7.0, 8.0},

{9.0, 10.0, 11.0, 12.0},

{13.0, 14.0, 15.0, 16.0} };

f loat recvbuf [SIZE];

MPI_Init(&argc,&argv);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

if (numtasks == SIZE) {

source = 1;

sendcount = SIZE;

recvcount = SIZE;

MPI_Scatter(sendbuf ,sendcount,MPI_FLOAT,recvbuf ,recvcount,

MPI_FLOAT,source,MPI_COMM_WORLD);

printf ("rank= %d Results: %f %f %f %f\n",rank,recvbuf [0],

recvbuf [1],recvbuf[2],recvbuf[3]);

}

else

printf ("Must specify %d processors. Terminating.\n",SIZE);

MPI_Finalize();

} Courtesy: https://computing.llnl.gov/tutorials/mpi/

Page 28

• Point-to-point Two-sided Communication

• Collective Communication

• One-sided Communication

• Job Startup

• Parallel I/O

• Involvement of Network in MPI operations

Using MPI

28 HPC Advisory Council, Lugano Switzerland '11

Page 29

29

Benefits of one-sided communication

• Easy to express irregular patterns of communication
  – Easier than a request-response pattern using two-sided
• Decouple data transfer from synchronization
  – Various methods of synchronization
    • Active synchronization
    • Passive synchronization
• Potentially better performance with overlap of computation and communication

HPC Advisory Council, Lugano Switzerland '11

Page 30

HPC Advisory Council, Lugano Switzerland '11 30

Basic model of one-sided communication

[Figure: Ranks 0-3 each contribute part of their local memory (mem) to form one logical window]

• Each process can contribute part of its memory to form a larger “window” of global memory
• Creation and destruction of “windows” are collective operations (all processes participate)

Page 31

HPC Advisory Council, Lugano Switzerland '11 31

MPI One Sided Taxonomy

MPI-2 One Sided Model
• Communication: Put, Get, Accumulate
• Synchronization
  – Active
    • Collective (entire window): Fence
    • Group (subset of window): Post/Wait/Start/Complete
  – Passive: Lock/Unlock

• Different modes suit different application patterns

Page 32

HPC Advisory Council, Lugano Switzerland '11 32

Synchronization Modes – Active Synchronization

[Figure: timeline. The target calls MPI_Win_post to expose its window; the origin calls MPI_Win_start, issues MPI_Put (overlapped with computation on both sides), then MPI_Win_complete; the target calls MPI_Win_wait]

• Window is exposed with win_post, and access is started with win_start
• Win_complete to end access, and win_wait to make sure the window is available to read

Page 33

HPC Advisory Council, Lugano Switzerland '11 33

Synchronization Modes – Active Synchronization (Fence)

[Figure: timeline. Origin and target both call MPI_Win_create and MPI_Win_fence; the origin issues MPI_Put while computation is overlapped and the target accesses local memory; a second MPI_Win_fence on both sides completes the epoch]

• Collective synchronization
• Collective among members of a window
• Updates between fences only visible after the fence is complete (see the sketch below)
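A minimal, hedged sketch of the fence mode, assuming each rank exposes a single integer and rank 0 updates rank 1's window (names are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, winbuf = 0, value = 42;
    MPI_Win win;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Every rank exposes one int; window creation is collective */
    MPI_Win_create(&winbuf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);                 /* open the access/exposure epoch */
    if (rank == 0)
        MPI_Put(&value, 1, MPI_INT, 1 /* target */, 0 /* disp */, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                 /* complete: puts are now visible */
    if (rank == 1)
        printf("Rank 1 window now holds %d\n", winbuf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}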

Page 34

HPC Advisory Council, Lugano Switzerland '11 34

Synchronization Modes – Passive Synchronization

[Figure: timeline. Origin and target both call MPI_Win_create; the origin calls MPI_Win_lock, issues MPI_Put (overlapped with computation), then MPI_Win_unlock; the target computes and accesses local memory without making synchronization calls]

• Billboard model
• Lock/unlock to have dedicated access
• Lock/unlock are not blocking
• Put executed only when the lock is granted on the target (see the sketch below)
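Under the same assumptions as the fence sketch, a hedged sketch of passive-target synchronization: only the origin (rank 0) makes one-sided synchronization calls, locking rank 1's window around the Put:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, winbuf = 0, value = 7;
    MPI_Win win;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(&winbuf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    if (rank == 0) {
        /* Only the origin synchronizes: lock rank 1's window, put, unlock */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_unlock(1, win);     /* the put is complete after the unlock */
    }
    MPI_Barrier(MPI_COMM_WORLD);    /* rank 1 waits until the update has landed */
    if (rank == 1) {
        /* Lock our own window before reading it locally (RMA memory model) */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        printf("Rank 1 window now holds %d\n", winbuf);
        MPI_Win_unlock(1, win);
    }
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}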

Page 35

• Point-to-point Two-sided Communication

• Collective Communication

• One-sided Communication

• Job Startup

• Parallel I/O

• Involvement of Network in MPI operations

Using MPI

35 HPC Advisory Council, Lugano Switzerland '11

Page 36

• MPI process managers provide support to launch jobs

• “mpiexec” is a utility to launch jobs

• Example usage:

– mpiexec -np 2 -machinefile mf ./a.out

• Supports the SPMD (single program multiple data) model along with MPMD (multiple program multiple data)
• Different resource management systems / MPI stacks may do things slightly differently
  – SLURM, PBS, Torque
  – mpirun_rsh (fastest launcher for MVAPICH and MVAPICH2), hydra and ORTE (Open MPI)
• Launch time should not increase with the number of processes

HPC Advisory Council, Lugano Switzerland '11 36

Launching MPI jobs

Page 37

• Point-to-point Two-sided Communication

• Collective Communication

• One-sided Communication

• Job Startup

• Parallel I/O

• Involvement of Network in MPI operations

Using MPI

37 HPC Advisory Council, Lugano Switzerland '11

Page 38

• Parallel I/O is very important for scientific applications
• Parallel file systems offer high-bandwidth access to large volumes of data
  – PVFS (parallel virtual file system)
  – Lustre, GPFS …
• MPI applications can use the MPI-IO layer for collective I/O
  – Using MPI-IO, optimal I/O access patterns are used to read data from disks
  – The fast communication network then helps re-arrange data in the order desired by the end application

HPC Advisory Council, Lugano Switzerland '11 38

Parallel I/O

Page 39

• Critical optimization in MPI I/O
• All processes must call the collective function
• Basic idea: build large blocks from small requests so requests will be large from the disk's point of view
  – Particularly effective when accesses by different processes are non-contiguous and interleaved (see the sketch below)

HPC Advisory Council, Lugano Switzerland '11 39

Collective I/O in MPI

Courtesy: http://www.mcs.anl.gov/research/projects/mpi/tutorial/advmpi/sc2005-advmpi.pdf
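As a hedged illustration of collective I/O (file name and block size are illustrative, not from the slides), each rank writes its own contiguous block of a shared file with a single collective call, which lets the MPI-IO layer combine the small per-rank requests into large disk accesses:

#include <mpi.h>
#include <stdio.h>

#define N 1024                      /* elements written per rank (illustrative) */

int main(int argc, char *argv[]) {
    int rank, buf[N];
    MPI_File fh;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < N; i++) buf[i] = rank;   /* fill with this rank's id */

    /* All ranks open the same file collectively */
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* Collective write: rank r writes its block at offset r*N elements; the
       MPI-IO layer can merge the per-rank requests into large disk accesses */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * N * sizeof(int),
                          buf, N, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}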

Page 40

• Trends in Designing Petaflop and Exaflop Systems

• Overview of Programming Models and MPI

• Using MPI

• Challenges in Designing MPI Library on Petaflop and Exaflop Systems

• Overview of MVAPICH and MVAPICH2 MPI Stack

• Sample Performance Numbers

40

Presentation Overview

HPC Advisory Council, Lugano Switzerland '11

Page 41

Designing MPI Using InfiniBand Features

HPC Advisory Council, Lugano Switzerland '11 41

Many different design choices

• InfiniBand Features: Send/Receive, RDMA Operations, Atomic Operations, Reliable Connection, Unreliable Datagram, eXtended Reliable Connection (XRC), Shared Receive Queues (SRQ), Static Rate Control, Multicast, End-to-End Flow Control, Multi-Path LMC, QoS (SL and VL)
• Major Components in MPI: Protocol Mapping, Buffer Management, Flow Control, Connection Management, Communication Progress, Collective Communication, Multi-rail Support, One-sided Active/Passive, Checkpoint Restart, Process Migration, Reliability and Resiliency
• Optimal design choices: Performance, Scalability, Fault-Tolerance & Resiliency, Power-Aware

Page 42

• Inter-node Pt-to-pt Communication
  – Challenges
    • Sender memory -> IB adapter (sender) -> IB switch -> IB adapter (receiver) -> Receiver memory
    • Short message (eager) and Long message (rendezvous)
    • Send/Recv vs. RDMA
  – Metrics
    • Latency (lowest)
    • Bandwidth and Bi-Directional bandwidth (highest)
    • CPU utilization (lowest)
      – Maximum overlap between communication and computation
    • Message Rate (highest)

HPC Advisory Council, Lugano Switzerland '11 42

Performance Issues

Page 43

• Added Challenges for Intra-node Pt-to-Pt Communication
  – Multi-core platforms are emerging
  – Cache hierarchy (shared L2 or not, L3)
  – Intra-socket and inter-socket communication costs (latency and bandwidth) are different
    • May need different schemes for intra-socket and inter-socket communication
  – Process-core mapping plays an important role
• Concurrent Communication
  – Multi-rail organizations and schemes for efficient usage of the rails
  – Polling scheme within the MPI library

HPC Advisory Council, Lugano Switzerland '11 43

Performance Issues (Cont’d)

Page 44

• Collective Communication
  – Metrics
    • Minimize latency
    • Maximize throughput (example: concurrent broadcasts)
  – Challenges
    • Optimal algorithms to minimize
      – Network contention
      – Contention at the source and destination adapter(s)
      – CPU involvement/overhead
    • Different algorithms based on system size and message size
    • Multi-core-aware algorithms for the emerging multi-core platforms
    • Topology-aware algorithms to dynamically adapt based on the underlying network topology

HPC Advisory Council, Lugano Switzerland '11 44

Performance Issues (Cont’d)

Page 45

• Performance of an application should increase as system size increases
  – Strong Scaling
    • Problem size is kept constant as system size increases
  – Weak Scaling
    • Problem size keeps increasing as system size increases
• Depends on
  – Structure of the application
  – Underlying algorithms being used
  – Performance of the MPI library
    • All performance issues (as indicated earlier) matter for the MPI library
• Additional Issues
  – Network topology
  – Mapping of processes to cores (block and cyclic across nodes and within nodes)

HPC Advisory Council, Lugano Switzerland '11 45

Obtaining Scalable Performance

Page 46

• Does the memory needed by the MPI library increase with system size?
• Different transport protocols with IB
  – Reliable Connection (RC) is the most common
  – Unreliable Datagram (UD) is used in some cases
• Buffers need to be posted at each receiver to receive messages from any sender
  – Buffer requirements can increase with system size
• Connections need to be established across processes under RC
  – Each connection requires a certain amount of memory for handling related data structures
  – Memory required for all connections can increase with system size
• Both issues have become critical as large-scale IB deployments have taken place
  – Being addressed by the IB specification (SRQ, XRC, UD/RC/XRC Hybrid) and the MPI library (will be discussed more in Day 2)

HPC Advisory Council, Lugano Switzerland '11 46

Memory Scalability of MPI Library in large-scale systems

Page 47

• Millions of cores and components in next-generation Multi-PetaFlop and ExaFlop systems
• Components are bound to fail
• Mean Time Between Failures (MTBF) has to remain high so that Exascale applications can run efficiently
• Two broad kinds of failures
  – Network failure (adapter, link and switch)
  – Node or process failure
• InfiniBand provides multiple schemes such as CRC, end-to-end reliability, reliable connection (RC) mode and Automatic Path Migration (APM) to handle network-related errors
• Can the MPI library be made resilient? (Day 2)
• Can the MPI library support efficient checkpoint-restart and process migration for process/node failure? (Day 2)

HPC Advisory Council, Lugano Switzerland '11 47

Fault-Tolerance and Resiliency

Page 48

• Power consumption is becoming a significant issue for the design and deployment of Multi-Petaflop and Exaflop systems
• All hardware components (CPU, memory, storage, network adapter, switch and links) are being re-designed with lower power consumption in mind
• The targeted goal is 20 MW for an Exaflop system in 2018-2020
• Can we make the MPI library power-aware?
  – Polling-based schemes are common in MPI libraries to receive messages and act upon them quickly
  – Continuous polling by the CPU consumes a lot of power
  – Can the CPUs be run at lower speed when large collective operations are taking place?
  – Can we design power-aware collective schemes? (Day 2)

HPC Advisory Council, Lugano Switzerland '11 48

Power-Aware Designs

Page 49

• Trends in Designing Petaflop and Exaflop Systems

• Overview of Programming Models and MPI

• Using MPI

• Challenges in Designing MPI Library on Petaflop and Exaflop Systems

• Overview of MVAPICH and MVAPICH2 MPI Stack

• Sample Performance Numbers

49

Presentation Overview

HPC Advisory Council, Lugano Switzerland '11

Page 50

• High Performance MPI Library for IB, 10GE/iWARP & RoCE
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
  – Latest Releases: MVAPICH 1.2 and MVAPICH2 1.6
  – Used by more than 1,500 organizations in 60 countries
    • Registered at the OSU site voluntarily
  – More than 57,000 downloads from the OSU site directly
  – Empowering many TOP500 production clusters during the last eight years
  – Available with the software stacks of many IB, 10GE and server vendors, including the OpenFabrics Enterprise Distribution (OFED) and Linux distros
  – Also supports the uDAPL device to work with any network supporting uDAPL
  – http://mvapich.cse.ohio-state.edu/

50

MVAPICH/MVAPICH2 Software

HPC Advisory Council, Lugano Switzerland '11

Page 51

MVAPICH-1 Architecture

[Figure: MVAPICH (MPI-1) 1.2 architecture. Interfaces include OpenFabrics/Gen2 (single-rail), OpenFabrics/Gen2-Hybrid (single-rail), PSM, TCP/IP, Shared Memory, Gen2-Multirail, VAPI and uDAPL (deprecated). Targets include InfiniBand from Mellanox (PCI-X, PCIe, PCIe-Gen2; SDR, DDR & QDR), InfiniBand from QLogic (PCIe & HT; SDR, DDR & QDR) and single nodes/laptops with multi-core]

Major Computing Platforms: IA-32, EM64T, Nehalem, Westmere, Opteron, Magny, ..

51 HPC Advisory Council, Lugano Switzerland '11

Page 52

Major Features of MVAPICH 1.2

• OpenFabrics-Gen2
  – Scalable job start-up with mpirun_rsh, support for SLURM
  – RC and XRC support
  – Flexible message coalescing
  – Multi-core-aware pt-to-pt communication
  – User-defined processor affinity for multi-core platforms
  – Multi-core-optimized collective communication
  – Asynchronous and scalable on-demand connection management
  – RDMA Write and RDMA Read-based protocols
  – Lock-free asynchronous progress for better overlap between computation and communication
  – Polling and blocking support for communication progress
  – Multi-pathing support leveraging the LMC mechanism on large fabrics
  – Network-level fault tolerance with Automatic Path Migration (APM)
  – Mem-to-mem reliable data transfer mode (for detection of I/O errors with 32-bit CRC)
  – Network Fault Resiliency

52

HPC Advisory Council, Lugano Switzerland '11

Page 53

Major Features of MVAPICH 1.2 (Continued)

• OpenFabrics-Gen2-Hybrid
  – Introduced interface in 1.1
  – Replaces the UD interface in 1.0
  – Targeted for emerging multi-thousand-core clusters to achieve the best performance with minimal memory footprint
  – Most of the features as in Gen2
  – Adaptive selection during run-time (based on application and system characteristics) to switch between
    • RC and UD (or between XRC and UD) transports
  – Multiple buffer organization with XRC support

53 HPC Advisory Council, Lugano Switzerland '11

Page 54

MVAPICH2 Architecture (Latest Release 1.6)

Major Computing Platforms: IA-32, EM64T, Nehalem, Westmere, Opteron, Magny, ..

54 HPC Advisory Council, Lugano Switzerland '11

[Figure: MVAPICH2 1.6 architecture diagram; supports all different PCI interfaces]

Page 55

MVAPICH2 1.6 Features

• Support for GPUDirect

• Using LiMIC2 for true one-sided intra-node RMA transfer to avoid extra memory copies

• Upgraded to LiMIC2 version 0.5.4

• Removing the limitation on number of concurrent windows in RMA operations

• Support for InfiniBand Quality of Service (QoS) with multiple virtual lanes

• Support for 3D Torus Topology

• Enhanced support for multi-threaded applications

• Fast Checkpoint-Restart support with aggregation scheme

• Job Pause-Migration-Restart Framework for Pro-active Fault-Tolerance

• Support for new standardized Fault-Tolerance Backplane (FTB) Events for CR and Migration Frameworks

• Dynamic detection of multiple InfiniBand adapters and using these by default in multi-rail configurations

• Support for process-to-rail binding policy (bunch, scatter and user-defined) in multi-rail configurations

• Enhanced and optimized algorithms for MPI_Reduce and MPI_AllReduce operations for small and medium message sizes

• XRC support with Hydra Process Manager

55 HPC Advisory Council, Lugano Switzerland '11

Page 56

56

Support for Multiple Interfaces/Adapters

• OpenFabrics/Gen2-IB and OpenFabrics/Gen2-Hybrid
  – All IB adapters supporting OpenFabrics/Gen2
• QLogic/PSM
  – QLogic adapters
• OpenFabrics/Gen2-iWARP
  – Chelsio and Intel-NetEffect
• RoCE
  – ConnectX-EN
• uDAPL
  – Linux-IB, Solaris-IB, any other adapter supporting uDAPL
• TCP/IP
  – Any adapter supporting the TCP/IP interface
• Shared Memory Channel
  – For running applications in a node with multi-core processors (laptops, SMP systems)

HPC Advisory Council, Lugano Switzerland '11

Page 57

• Trends in Designing Petaflop and Exaflop Systems

• Overview of Programming Models and MPI

• Using MPI

• Challenges in Designing MPI Library on Petaflop and Exaflop Systems

• Overview of MVAPICH and MVAPICH2 MPI Stack

• Sample Performance Numbers

57

Presentation Overview

HPC Advisory Council, Lugano Switzerland '11

Page 58

MVAPICH2 Inter-Node Performance Ping Pong Latency

58 HPC Advisory Council, Lugano Switzerland '11

[Figure: MVAPICH2-1.6 inter-node ping-pong latency vs. message size; small-message latency is 1.56 us]

Intel Westmere 2.53 GHz with Mellanox ConnectX-2 QDR Adapter

Page 59

MVAPICH2 Inter-Node Performance

59 HPC Advisory Council, Lugano Switzerland '11

[Figure: MVAPICH2-1.6 inter-node bandwidth and bi-directional bandwidth vs. message size, peaking at 3394 MB/s and 6539 MB/s respectively]

Intel Westmere 2.53 GHz with Mellanox ConnectX-2 QDR Adapter

Page 60

Performance of HPC Applications on TACC Ranger using MVAPICH + IB

• Rob Farber’s facial recognition application was run up to 60K cores using MVAPICH
• Ranges from 84% of peak at the low end to 65% of peak at the high end

http://www.tacc.utexas.edu/research/users/features/index.php?m_b_c=farber

HPC Advisory Council, Lugano Switzerland '11 60

Page 61

Performance of HPC Applications on TACC Ranger: DNS/Turbulence

Courtesy: P.K. Yeung, Diego Donzis, TG 2008

HPC Advisory Council, Lugano Switzerland '11 61

Page 62

Application Example: Blast Simulations

• Researchers from the University of Utah have developed a simulation framework, called Uintah
• Combines advanced mechanical, chemical and physical models into a novel computational framework
• Have run > 32K MPI tasks on Ranger
• Uses asynchronous communication

http://www.tacc.utexas.edu/news/feature-stories/2009/explosive-science/

Courtesy: J. Luitjens, M. Bertzins, Univ of Utah

HPC Advisory Council, Lugano Switzerland '11 62

Page 63

Application Example: OMEN

• OMEN is a two- and three-dimensional Schrodinger-Poisson-based solver
• Used in semiconductor modeling
• Run to almost 60K tasks

Courtesy: Mathieu Luisier, Gerhard Klimeck, Purdue

http://www.tacc.utexas.edu/RangerImpact/pdf/Save_Our_Semiconductors.pdf

HPC Advisory Council, Lugano Switzerland '11 63

Page 64

• Presented trends in Petaflop and Exaflop systems
• Presented an overview of MPI
• Discussed how to use the basic features of MPI
• Discussed challenges in designing MPI libraries
• Presented an overview of the MVAPICH and MVAPICH2 stacks with sample performance numbers
• MPI has a long-standing reputation for portability and performance; it is likely to remain a critical component of future Exascale machines

Concluding Remarks

64 HPC Advisory Council, Lugano Switzerland '11

Page 65

• Day 2 (MPI Performance and Optimizations)
  – Major components of the MVAPICH and MVAPICH2 stacks
    • Job start-up, connection management, pt-to-pt (inter-node, intra-node) communication, LiMIC2, one-sided communication, collective communication, multi-rail, scalability (SRQ, XRC, UD/RC/XRC Hybrid), QoS and 3D Torus, and fault-tolerance (network-level and process-level)
  – How to use these components and carry out runtime optimizations
• Day 3 (Future of MPI)
  – Advanced and upcoming features of the MVAPICH2 stack
    • Collective offload, topology-aware collectives, power-aware collectives, GPUDirect support and PGAS (UPC) support
  – Upcoming MPI-3 standard and features

Preview of Day 2 and Day 3 Presentations

65 HPC Advisory Council, Lugano Switzerland '11