
Lecture 9: MPI continued

David Bindel

27 Sep 2011

Logistics

- Matrix multiply is done! Still have to run.
- Small HW 2 will be up before lecture on Thursday, due next Tuesday.
- Project 2 will be posted next Tuesday.
- Email me if interested in Sandia recruiting.
- Also email me if interested in MEng projects.

Previously on Parallel Programming

Can write a lot of MPI code with 6 operations we've seen:

- MPI_Init
- MPI_Finalize
- MPI_Comm_size
- MPI_Comm_rank
- MPI_Send
- MPI_Recv

... but there are sometimes better ways. Decide on communication style using simple performance models.
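For concreteness, here is a minimal sketch (not from the slides) of a complete program built from just these six calls; rank 0 sends the process count to every other rank.

/* Illustrative sketch: a complete MPI program using only the six
 * calls listed above. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int nproc, myid;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        /* Rank 0 sends the process count to everyone else */
        for (int dest = 1; dest < nproc; ++dest)
            MPI_Send(&nproc, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
    } else {
        int n;
        MPI_Recv(&n, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank %d of %d got the message\n", myid, n);
    }
    MPI_Finalize();
    return 0;
}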

Communication performance

- Basic info: latency and bandwidth
- Simplest model: t_comm = α + βM
- More realistic: distinguish CPU overhead from "gap" (∼ inverse bandwidth)
- Different networks have different parameters
- Can tell a lot via a simple ping-pong experiment (see the sketch below)
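A ping-pong harness might look roughly like the sketch below (an illustration, not the exact code behind the measurements that follow; the function and buffer names are placeholders).

/* Sketch: time NTRIALS round trips of an nbytes message between
 * ranks 0 and 1; fitting time/msg vs. nbytes gives alpha and beta. */
#include <stdio.h>
#include <mpi.h>

#define NTRIALS 1000

void pingpong(int myid, char* buf, int nbytes)
{
    double t0 = MPI_Wtime();
    for (int i = 0; i < NTRIALS; ++i) {
        if (myid == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (myid == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (myid == 0)  /* half a round trip = one-way time per message */
        printf("%d bytes: %g microsec/msg\n",
               nbytes, (t1-t0)/(2*NTRIALS)*1e6);
}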

OpenMPI on crocus

- Two quad-core chips per node, five nodes
- Heterogeneous network:
  - Crossbar switch between cores (?)
  - Bus between chips
  - Gigabit Ethernet between nodes
- Default process layout (16-process example):
  - Processes 0-3 on first chip, first node
  - Processes 4-7 on second chip, first node
  - Processes 8-11 on first chip, second node
  - Processes 12-15 on second chip, second node
- Test ping-pong from 0 to 1, 7, and 8.

Approximate α-β parameters (on node)

[Plot: time/msg (microsec) vs. message size (kilobytes); measured data and α-β model fits for ping-pong to ranks 1 and 7]

α1 ≈ 1.0 × 10^-6, β1 ≈ 5.7 × 10^-10
α2 ≈ 8.4 × 10^-7, β2 ≈ 6.8 × 10^-10

Approximate α-β parameters (cross-node)

[Plot: time/msg (microsec) vs. message size (kilobytes); measured data and α-β model fits for ping-pong to ranks 1, 7, and 8]

α3 ≈ 7.1 × 10^-5, β3 ≈ 9.7 × 10^-9

Moral

Not all links are created equal!

- Might handle with mixed paradigm
  - OpenMP on node, MPI across
  - Have to worry about thread-safety of MPI calls
- Can handle purely within MPI
- Can ignore the issue completely?

For today, we’ll take the last approach.

Reminder: basic send and recv

MPI_Send(buf, count, datatype,
         dest, tag, comm);

MPI_Recv(buf, count, datatype,
         source, tag, comm, status);

MPI_Send and MPI_Recv are blocking:
- Send does not return until data is in system
- Recv does not return until data is ready

Blocking and buffering

[Diagram: data travels from P0 through an OS buffer, across the network, into an OS buffer on the other side, and finally to P1]

Block until data “in system” — maybe in a buffer?

Blocking and buffering

[Diagram: data travels from P0 through the OS and network directly to P1, with no intermediate buffer copy]

Alternative: don’t copy, block until done.

Problem 1: Potential deadlock

[Diagram: both processes call Send and sit blocked]

Both processors wait to finish send before they can receive! May not happen if lots of buffering on both sides.

Solution 1: Alternating order

[Diagram: one process does Send then Recv, the other does Recv then Send]

Could alternate who sends and who receives.
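A sketch of the alternation, assuming each rank pairs with a partner of opposite parity and exchanges n doubles (function and buffer names are placeholders):

/* Sketch: avoid deadlock by alternating send/recv order based on
 * rank parity.  Assumes myid and partner have opposite parity. */
#include <mpi.h>

void exchange(int myid, int partner, double* sendbuf, double* recvbuf, int n)
{
    if (myid % 2 == 0) {
        MPI_Send(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
    }
}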

Solution 2: Combined send/recv

[Diagram: both processes call Sendrecv]

Common operations deserve explicit support!

Combined sendrecv

MPI_Sendrecv(sendbuf, sendcount, sendtype,
             dest, sendtag,
             recvbuf, recvcount, recvtype,
             source, recvtag,
             comm, status);

Blocking operation, combines send and recv to avoid deadlock.
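For example, a ring shift (send to the right neighbor, receive from the left) fits in one call; a sketch with placeholder buffer names:

/* Sketch: ring shift with MPI_Sendrecv.  Each rank sends n doubles
 * from x to its right neighbor and receives into y from its left
 * neighbor; no ordering gymnastics are needed. */
#include <mpi.h>

void ring_shift(int myid, int nproc, double* x, double* y, int n)
{
    int right = (myid + 1) % nproc;
    int left  = (myid + nproc - 1) % nproc;
    MPI_Sendrecv(x, n, MPI_DOUBLE, right, 0,
                 y, n, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}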

Problem 2: Communication overhead

[Diagram: both processes repeatedly call Sendrecv and spend time waiting between communication phases]

Partial solution: nonblocking communication

Blocking vs non-blocking communication

- MPI_Send and MPI_Recv are blocking
  - Send does not return until data is in system
  - Recv does not return until data is ready
  - Cons: possible deadlock, time wasted waiting
- Why blocking?
  - Overwrite buffer during send =⇒ evil!
  - Read buffer before data ready =⇒ evil!
- Alternative: nonblocking communication
  - Split into distinct initiation/completion phases
  - Initiate send/recv and promise not to touch buffer
  - Check later for operation completion

Overlap communication and computation

[Diagram: each process starts its send and recv, computes without touching the buffers, then completes the send and recv]

Nonblocking operations

Initiate message:

MPI_Isend(start, count, datatype, dest,
          tag, comm, request);

MPI_Irecv(start, count, datatype, source,
          tag, comm, request);

Wait for message completion:

MPI_Wait(request, status);

Test for message completion:

MPI_Test(request, flag, status);
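A sketch of the overlap pattern from the earlier picture (buffer names are placeholders):

/* Sketch: start a nonblocking exchange, compute while the messages
 * are in flight, then wait for completion. */
#include <mpi.h>

void exchange_and_compute(int partner, double* sendbuf,
                          double* recvbuf, int n)
{
    MPI_Request sreq, rreq;
    MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &sreq);
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &rreq);

    /* ... compute here, but do not touch sendbuf or recvbuf ... */

    MPI_Wait(&sreq, MPI_STATUS_IGNORE);
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    /* recvbuf is now valid and sendbuf may be reused */
}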

Multiple outstanding requests

Sometimes useful to have multiple outstanding messages:

MPI_Waitall(count, requests, statuses);
MPI_Waitany(count, requests, index, status);
MPI_Waitsome(count, requests, outcount, indices, statuses);

Multiple versions of test as well.

Other send/recv variants

Other variants of MPI_Send:

- MPI_Ssend (synchronous) – does not complete until receive has begun
- MPI_Bsend (buffered) – user provides buffer (via MPI_Buffer_attach)
- MPI_Rsend (ready) – user guarantees receive has already been posted
- Can combine modes (e.g. MPI_Issend)

MPI_Recv receives anything.

Another approach

- Send/recv is one-to-one communication
- An alternative is one-to-many (and vice versa):
  - Broadcast to distribute data from one process
  - Reduce to combine data from all processors
  - Operations are called by all processes in communicator

Broadcast and reduce

MPI_Bcast(buffer, count, datatype,
          root, comm);

MPI_Reduce(sendbuf, recvbuf, count, datatype,
           op, root, comm);

- buffer is copied from root to others
- recvbuf receives result only at root
- op ∈ { MPI_MAX, MPI_SUM, ... }

Example: basic Monte Carlo

#include <stdio.h>
#include <mpi.h>

void run_mc(int myid, int nproc, int ntrials);

int main(int argc, char** argv) {
    int nproc, myid, ntrials;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        printf("Trials per CPU:\n");
        scanf("%d", &ntrials);
    }
    MPI_Bcast(&ntrials, 1, MPI_INT, 0, MPI_COMM_WORLD);
    run_mc(myid, nproc, ntrials);
    MPI_Finalize();
    return 0;
}

Example: basic Monte Carlo

Let sums[0] = Σ_i X_i and sums[1] = Σ_i X_i².

#include <math.h>  /* for sqrt */

void run_mc(int myid, int nproc, int ntrials) {
    double sums[2] = {0,0};
    double my_sums[2] = {0,0};
    /* ... run ntrials local experiments ... */
    MPI_Reduce(my_sums, sums, 2, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) {
        int N = nproc*ntrials;
        double EX  = sums[0]/N;
        double EX2 = sums[1]/N;
        printf("Mean: %g; err: %g\n",
               EX, sqrt((EX2-EX*EX)/N));
    }
}
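The local experiments themselves are elided on the slide. Purely as a hypothetical illustration, they might estimate π by dart throwing, with X_i = 4 when a random point lands inside the quarter circle:

/* Hypothetical local-experiment loop (not from the slides): estimate
 * pi by sampling points in the unit square.  E[X] = pi. */
#include <stdlib.h>

void run_local_trials(int myid, int ntrials, double my_sums[2])
{
    srand(12345 + myid);          /* crude per-rank seeding */
    for (int i = 0; i < ntrials; ++i) {
        double x = rand()/(double) RAND_MAX;
        double y = rand()/(double) RAND_MAX;
        double X = (x*x + y*y <= 1.0) ? 4.0 : 0.0;
        my_sums[0] += X;          /* sum of X_i   */
        my_sums[1] += X*X;        /* sum of X_i^2 */
    }
}

Inside run_mc, the elided section could then call run_local_trials(myid, ntrials, my_sums); again, this is a made-up helper, not part of the original code.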

Collective operations

- Involve all processes in communicator
- Basic classes:
  - Synchronization (e.g. barrier)
  - Data movement (e.g. broadcast)
  - Computation (e.g. reduce)

Barrier

MPI_Barrier(comm);

Not much more to say. Not needed that often.
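One place it does show up is timing, where a barrier lines the ranks up before the clock starts (a sketch):

/* Sketch: synchronize so all ranks enter the timed phase together. */
MPI_Barrier(MPI_COMM_WORLD);
double t0 = MPI_Wtime();
/* ... timed phase ... */
MPI_Barrier(MPI_COMM_WORLD);
double elapsed = MPI_Wtime() - t0;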

Broadcast

[Diagram: Broadcast copies A from the root to all of P0-P3]

Scatter/gather

[Diagram: Scatter distributes A, B, C, D from P0 across P0-P3; Gather collects A, B, C, D from P0-P3 back onto P0]

Allgather

[Diagram: Allgather collects A, B, C, D from P0-P3 and leaves the full A B C D on every process]

Alltoall

[Diagram: Alltoall; P0 starts with A0 A1 A2 A3, P1 with B0 B1 B2 B3, P2 with C0 C1 C2 C3, P3 with D0 D1 D2 D3; afterwards P0 holds A0 B0 C0 D0, P1 holds A1 B1 C1 D1, and so on]

Reduce

[Diagram: Reduce combines A, B, C, D from P0-P3 into a single result on the root]

Scan

[Diagram: Scan (prefix reduction); P0 gets A, P1 gets AB, P2 gets ABC, P3 gets ABCD]
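For example, MPI_Scan with MPI_SUM computes prefix sums, which is handy for turning per-rank counts into global offsets (a sketch; my_count is a placeholder):

/* Sketch: an inclusive scan over my_count gives the sum over ranks
 * 0..myid; subtracting my_count yields this rank's global offset. */
#include <mpi.h>

int global_offset(int my_count)
{
    int prefix;
    MPI_Scan(&my_count, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    return prefix - my_count;
}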

The kitchen sink

- In addition to the above, there are vector variants (v suffix), more All variants (Allreduce), Reduce_scatter, ...
- MPI-2 adds one-sided communication (put/get)
- MPI is not a small library!
- But a small number of calls goes a long way:
  - Init/Finalize
  - Comm_rank, Comm_size
  - Send/Recv variants and Wait
  - Allreduce, Allgather, Bcast (example below)
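As one illustration of that short list, a global dot product is just a local loop plus one MPI_Allreduce (a sketch; assumes each rank holds nlocal matching entries of x and y):

/* Sketch: each rank computes a partial dot product; Allreduce sums
 * the partials and leaves the total on every rank. */
#include <mpi.h>

double parallel_dot(const double* x, const double* y, int nlocal)
{
    double local = 0, global = 0;
    for (int i = 0; i < nlocal; ++i)
        local += x[i]*y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}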

Recommended