Introduction to MPI Programming (Part II) Michael Griffiths, Deniz Savas & Alan Real January 2006


Page 1: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Introduction to MPI Programming

(Part II)

Michael Griffiths, Deniz Savas & Alan Real

January 2006

Page 2: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Overview

- Review point-to-point communications
  - Data types
  - Data packing
- Collective communication
  - Broadcast, scatter & gather of data
  - Reduction operations
  - Barrier synchronisation

Page 3: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Blocking operations

- Relate to when the operation has completed.
- Only return from the subroutine call when the operation has completed.

Page 4: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Non-blocking operations

- Return straight away and allow the sub-program to continue with other work.
- At some later time the sub-program should test or wait for the completion of the non-blocking operation.
- A non-blocking operation immediately followed by a matching wait is equivalent to a blocking operation.
- Non-blocking operations are not the same as sequential subroutine calls, as the operation continues after the call has returned.

Page 5: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Blocking sender modes and completion

  Sender mode        MPI Call (F/C)   Completion status
  Standard send      MPI_Send         Can be synchronous or buffered (often implementation dependent).
  Synchronous send   MPI_Ssend        Only completes when the receive has completed.
  Buffered send      MPI_Bsend        Always completes (unless an error occurs), irrespective of receiver.
  Ready send         MPI_Rsend        Always completes (unless an error occurs), irrespective of whether the receive has completed.
  Send and Receive   MPI_Sendrecv     Completes when a message arrives and is received by the paired process.

Note: MPI_Isend begins a non-blocking send; it always completes (unless an error occurs), irrespective of receiver.

Page 6: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Non-blocking communication

Separate communication into three phases:
1. Initiate non-blocking communication.
2. Do some work (perhaps involving other communications).
3. Wait for the non-blocking communication to complete.
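A minimal C sketch of these three phases; the choice of rank 0 sending a single integer to rank 1, and the tag value 0, are purely illustrative (assumes at least two processes):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
      int rank, value = 42;
      MPI_Request request;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          /* Phase 1: initiate the non-blocking send. */
          MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
          /* Phase 2: do some work that does not modify the send buffer. */
          /* Phase 3: wait for completion before reusing the buffer. */
          MPI_Wait(&request, MPI_STATUS_IGNORE);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("Rank 1 received %d\n", value);
      }
      MPI_Finalize();
      return 0;
  }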

Page 7: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Non-blocking send

- Send is initiated and returns straight away.
- The sending process can do other things.
- Can test later whether the operation has completed.

[Diagram: processes in MPI_COMM_WORLD; the sender issues a send request, carries on working, and later waits; the receiver posts a receive.]

Page 8: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real


Non-blocking receive

- Receive is initiated and returns straight away.
- The receiving process can do other things.
- Can test later whether the operation has completed.

[Diagram: processes in MPI_COMM_WORLD; the receiver issues a receive request, carries on working, and later waits; the sender sends.]

Page 9: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

The Request Handle

- Non-blocking calls take the same arguments as the corresponding blocking calls, plus an additional request handle.
- In C/C++ the handle is of type MPI_Request / MPI::Request; in Fortran it is an INTEGER.
- The request handle is allocated when a communication is initiated.
- It can be queried to test whether the non-blocking operation has completed.

Page 10: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Non-blocking synchronous send

Fortran:
  CALL MPI_ISSEND(buf, count, datatype, dest, tag, comm, request, error)
  CALL MPI_WAIT(request, status, error)

C:
  MPI_Issend(&buf, count, datatype, dest, tag, comm, &request);
  MPI_Wait(&request, &status);

C++:
  request = comm.Issend(&buf, count, datatype, dest, tag);
  request.Wait();

Page 11: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Non-blocking synchronous receive

Fortran:
  CALL MPI_IRECV(buf, count, datatype, src, tag, comm, request, error)
  CALL MPI_WAIT(request, status, error)

C:
  MPI_Irecv(&buf, count, datatype, src, tag, comm, &request);
  MPI_Wait(&request, &status);

C++:
  request = comm.Irecv(&buf, count, datatype, src, tag);
  request.Wait(status);

Page 12: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Blocking v Non-blocking

- Send and receive can be blocking or non-blocking.
- A blocking send can be used with a non-blocking receive, and vice versa.
- Non-blocking sends can use any mode; synchronous mode affects completion, not initiation.
- A non-blocking call followed by an explicit wait is identical to the corresponding blocking communication.

  Operation          MPI call (Fortran/C)   C++
  Standard send      MPI_Send(…)            Comm.Send(…)
  Synchronous send   MPI_Ssend(…)           Comm.Ssend(…)
  Buffered send      MPI_Bsend(…)           Comm.Bsend(…)
  Ready send         MPI_Rsend(…)           Comm.Rsend(…)
  Receive            MPI_Recv(…)            Comm.Recv(…)

Page 13: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Completion

Can either wait or test for completion:

Fortran (LOGICAL flag):
  CALL MPI_WAIT(request, status, ierror)
  CALL MPI_TEST(request, flag, status, ierror)

C (int flag):
  MPI_Wait(&request, &status)
  MPI_Test(&request, &flag, &status)

C++ (bool flag):
  request.Wait()            flag = request.Test();         (for sends)
  request.Wait(status);     flag = request.Test(status);   (for receives)
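A possible C sketch of the test-and-work pattern, written as a small helper function; the idea of doing a piece of work between tests is illustrative:

  #include <mpi.h>

  /* Poll a pending non-blocking operation, doing useful work between tests. */
  void wait_while_working(MPI_Request *request)
  {
      int flag = 0;
      MPI_Status status;
      while (!flag) {
          MPI_Test(request, &flag, &status);
          if (!flag) {
              /* do a small piece of useful work here */
          }
      }
  }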

Page 14: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Other related wait and test routines

If multiple non-blocking calls are issued:

- MPI_TESTANY: tests whether any one of a list of requests (they could be send or receive requests) has completed.
- MPI_WAITANY: waits until any one of the list of requests has completed.
- MPI_TESTALL: tests whether all the requests in a list have completed.
- MPI_WAITALL: waits until all the requests in a list have completed.
- MPI_PROBE, MPI_IPROBE: allow incoming messages to be checked for without actually receiving them. Note that MPI_PROBE is blocking: it waits until there is something to probe for.
- MPI_CANCEL: cancels pending communication. A last-resort, clean-up operation!

The 'all' and 'any' routines take an array of requests and can return an array of statuses; the 'any' routines return the index of the completed operation.
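For example, a hedged C sketch posting several non-blocking receives and waiting for all of them with MPI_WAITALL; the limit of 16 requests and the use of tag 0 are assumptions made for the sketch:

  #include <mpi.h>

  /* Post n non-blocking receives (one int from each listed source)
     and wait until every one of them has completed. */
  void receive_from_all(int n, const int *sources, int *values)
  {
      MPI_Request requests[16];          /* sketch assumes n <= 16 */
      MPI_Status  statuses[16];
      for (int i = 0; i < n; i++)
          MPI_Irecv(&values[i], 1, MPI_INT, sources[i], 0,
                    MPI_COMM_WORLD, &requests[i]);
      MPI_Waitall(n, requests, statuses);
  }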

Page 15: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Merging send and receive operations into a single unit

The following is the syntax of the MPI_Sendrecv command:

In C:
  int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest,
                   int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype,
                   int source, int recvtag, MPI_Comm comm, MPI_Status *status)

In Fortran:
  <sendtype> sendbuf(:)
  <recvtype> recvbuf(:)
  INTEGER sendcount, sendtype, dest, sendtag, recvcount, recvtype
  INTEGER source, recvtag, comm, status(MPI_STATUS_SIZE), ierror
  CALL MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag,
                    recvbuf, recvcount, recvtype, source, recvtag, comm, status, ierror)
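A minimal C example of MPI_Sendrecv, shifting one integer around a ring of processes; the tag value 0 is arbitrary:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
      int rank, size, sendval, recvval;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      sendval = rank;
      /* Send to the right-hand neighbour, receive from the left-hand one. */
      MPI_Sendrecv(&sendval, 1, MPI_INT, (rank + 1) % size, 0,
                   &recvval, 1, MPI_INT, (rank + size - 1) % size, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      printf("Rank %d received %d\n", rank, recvval);
      MPI_Finalize();
      return 0;
  }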

Page 16: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Important Notes about MPI_Sendrecv

Beware! A message sent by MPI_Sendrecv is receivable by a regular receive operation if the destination and tag match.

MPI_PROC_NULL can be specified as the destination or the source to allow one-directional working (useful for the end nodes of a non-circular communication pattern). Any communication with MPI_PROC_NULL returns immediately, with no effect, as if the operation had been successful. This can make programming easier.

The send and receive buffers must not overlap; they must be separate memory locations. This restriction can be avoided by using the MPI_Sendrecv_replace routine.

Page 17: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Data Packing

Up until now we have only seen contiguous data of pre-defined data types being communicated by MPI calls. This can be rather restricting if what we intend to transfer involves structures made up of mixtures of primitive data types, such as an integer count followed by a sequence of real numbers. One solution to this problem is to use the MPI_PACK and MPI_UNPACK routines. The philosophy is similar to Fortran writes/reads to/from internal buffers and the scanf function in C. MPI_PACK can be called consecutively to compress data into a send buffer; the resulting buffer of data can then be sent using MPI_SEND (or equivalent) with the datatype set to MPI_PACKED.

At the receiving end it can be received using MPI_RECV with the datatype MPI_PACKED. The received data can then be unpacked using MPI_UNPACK to recover the originally packed data. This way of working can also improve communication efficiency by reducing the number of 'send-receive' calls: there are usually fixed overheads associated with setting up a communication, which would cause inefficiencies if the sent/received messages are too small.
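A minimal C sketch of this pack/send/receive/unpack workflow, sending an integer count followed by that many reals from rank 0 to rank 1; the 256-byte buffer and tag 0 are arbitrary choices, and at least two processes are assumed:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
      int rank, position = 0, n = 3;
      double vals[3] = {1.0, 2.0, 3.0};
      char buffer[256];                  /* must be large enough for the packed data */
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          /* Pack the count, then the reals, into one contiguous buffer. */
          MPI_Pack(&n, 1, MPI_INT, buffer, sizeof(buffer), &position, MPI_COMM_WORLD);
          MPI_Pack(vals, n, MPI_DOUBLE, buffer, sizeof(buffer), &position, MPI_COMM_WORLD);
          MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(buffer, sizeof(buffer), MPI_PACKED, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          /* Unpack in the same order the data was packed. */
          MPI_Unpack(buffer, sizeof(buffer), &position, &n, 1, MPI_INT, MPI_COMM_WORLD);
          MPI_Unpack(buffer, sizeof(buffer), &position, vals, n, MPI_DOUBLE, MPI_COMM_WORLD);
          printf("Rank 1 unpacked %d values, first = %f\n", n, vals[0]);
      }
      MPI_Finalize();
      return 0;
  }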

Page 18: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

MPI_Pack

Fortran:
  <type> INBUF(:), OUTBUF(:)
  INTEGER INCOUNT, DATATYPE, OUTSIZE, POSITION, COMM, IERROR
  CALL MPI_PACK(INBUF, INCOUNT, DATATYPE, OUTBUF, OUTSIZE, POSITION, COMM, IERROR)

C:
  int MPI_Pack(void *inbuf, int incount, MPI_Datatype datatype,
               void *outbuf, int outsize, int *position, MPI_Comm comm)

Packs the message in inbuf (of type datatype and length incount) and stores it in outbuf. outsize is the maximum length of outbuf in bytes, rather than its actual size.

On entry, position indicates the starting location in outbuf where data will be written. On exit, position points to the first free position in outbuf following the location occupied by the packed message. This can then be used directly as the position parameter of the next MPI_PACK call.

Page 19: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

MPI_Unpack

Fortran:
  <type> INBUF(:), OUTBUF(:)
  INTEGER INSIZE, POSITION, OUTCOUNT, DATATYPE, COMM, IERROR
  CALL MPI_UNPACK(INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT, DATATYPE, COMM, IERROR)

C:
  int MPI_Unpack(void *inbuf, int insize, int *position,
                 void *outbuf, int outcount, MPI_Datatype datatype, MPI_Comm comm)

Unpacks the message in inbuf as data of type datatype and length outcount and stores it in outbuf.

On entry, position indicates the starting location in inbuf from which data will be read. On exit, position points to the first position of the next set of data in inbuf. This can then be used directly as the position parameter of the next MPI_UNPACK call.

Page 20: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Collective Communication

Page 21: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Overview

- Introduction & characteristics
- Barrier synchronisation
- Global reduction operations
  - Predefined operations
- Broadcast
- Scatter
- Gather
- Partial sums
- Exercise

Page 22: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Collective communications

- Higher-level routines involving several processes at a time.
- Can be built out of point-to-point communications.
- Examples are: barriers, broadcast, reduction operations.

Page 23: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Collective Communication

- Communications involving a group of processes.
- Called by all processes in a communicator.
- Examples:
  - Broadcast, scatter, gather (data distribution)
  - Global sum, global maximum, etc. (reduction operations)
  - Barrier synchronisation

Characteristics:
- Collective communication will not interfere with point-to-point communication, and vice versa.
- All processes must call the collective routine.
- Synchronisation is not guaranteed (except for barrier).
- No non-blocking collective communication.
- No tags.
- Receive buffers must be exactly the right size.

Page 24: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Collective Communications (one for all, all for one!)

Collective communication is defined as that which involves all the processes in a group. Collective communication routines can be divided into the following broad categories:

- Barrier synchronisation
- Broadcast from one to all
- Scatter from one to all
- Gather from all to one
- Scatter/gather from all to all
- Global reduction (distribute elementary operations)

IMPORTANT NOTE: the collective communication operations and the point-to-point operations we have seen earlier are invisible to each other and hence do not interfere with each other. This is important for avoiding deadlocks due to interference.

Page 25: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Barrier Synchronization

[Diagram: processes advancing in time towards a BARRIER statement.]

Here, there are seven processes running and three of them are waiting idle at the barrier statement for the other four to catch up.

Page 26: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Graphic Representations of Collective Communication Types

[Diagram: graphic representation of BROADCAST, SCATTER, GATHER, ALLGATHER and ALLTOALL; rows represent processes and columns represent data items.]

Page 27: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Barrier Synchronisation

Each process in the communicator waits at the barrier until all processes have encountered the barrier.

Fortran:
  INTEGER comm, error
  CALL MPI_BARRIER(comm, error)

C:
  MPI_Barrier(MPI_Comm comm);

C++:
  Comm.Barrier();
  e.g. MPI::COMM_WORLD.Barrier();
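One common use, sketched as a C fragment inside an initialised MPI program: synchronise before and after a timed region so that MPI_Wtime measures comparable intervals (the timed work is a placeholder):

  MPI_Barrier(MPI_COMM_WORLD);          /* make sure all processes start together */
  double t0 = MPI_Wtime();
  /* ... timed work ... */
  MPI_Barrier(MPI_COMM_WORLD);          /* wait for the slowest process to finish */
  double elapsed = MPI_Wtime() - t0;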

Page 28: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Global reduction operations

Used to compute a result involving data distributed over a group of processes:

- Global sum or product
- Global maximum or minimum
- Global user-defined operation

Page 29: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Predefined operations

  Function               MPI name (F/C)   MPI name (C++)
  Maximum                MPI_MAX          MPI::MAX
  Minimum                MPI_MIN          MPI::MIN
  Sum                    MPI_SUM          MPI::SUM
  Product                MPI_PROD         MPI::PROD
  Logical AND            MPI_LAND         MPI::LAND
  Bitwise AND            MPI_BAND         MPI::BAND
  Logical OR             MPI_LOR          MPI::LOR
  Bitwise OR             MPI_BOR          MPI::BOR
  Logical exclusive OR   MPI_LXOR         MPI::LXOR
  Bitwise exclusive OR   MPI_BXOR         MPI::BXOR
  Maximum and location   MPI_MAXLOC       MPI::MAXLOC
  Minimum and location   MPI_MINLOC       MPI::MINLOC

Page 30: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

MPI_Reduce

Performs count operations (o) on individual elements of sendbuf between processes

[Diagram: ranks 0, 1, 2 hold (A B C), (D E F), (G H I); after MPI_REDUCE with root 0, rank 0 holds (AoDoG, BoEoH, CoFoI).]

Page 31: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

MPI_Reduce syntax

Fortran:
  INTEGER count, rtype, op, root, comm, error
  CALL MPI_REDUCE(sbuf, rbuf, count, rtype, op, root, comm, error)

C:
  MPI_Reduce(void *sbuf, void *rbuf, int count,
             MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);

C++:
  Comm.Reduce(const void* sbuf, void* rbuf, int count,
              const MPI::Datatype& datatype, const MPI::Op& op, int root);

Page 32: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

MPI_Reduce example

Integer global sum:

Fortran:
  INTEGER x, result, error
  CALL MPI_REDUCE(x, result, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, error)

C:
  int x, result;
  MPI_Reduce(&x, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

C++:
  int x, result;
  MPI::COMM_WORLD.Reduce(&x, &result, 1, MPI::INT, MPI::SUM, 0);

Page 33: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

MPI_Allreduce

- No root process.
- All processes get the results of the reduction operation.

[Diagram: ranks 0, 1, 2 hold (A B C), (D E F), (G H I); after MPI_ALLREDUCE every rank holds (AoDoG, BoEoH, CoFoI).]

Page 34: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

MPI_Allreduce syntax

Fortran:
  INTEGER count, rtype, op, comm, error
  CALL MPI_ALLREDUCE(sbuf, rbuf, count, rtype, op, comm, error)

C:
  MPI_Allreduce(void *sbuf, void *rbuf, int count,
                MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);

C++:
  Comm.Allreduce(const void* sbuf, void* rbuf, int count,
                 const MPI::Datatype& datatype, const MPI::Op& op);
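For instance, a C fragment (inside an initialised MPI program) in which every process obtains the global maximum of a locally computed value; the variable names are illustrative:

  double local_max = 0.0;   /* assume this has been computed from local data */
  double global_max;
  MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
  /* every rank now holds the same global_max */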

Page 35: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Practice Session 3

Using reduction operations.

This example shows the use of the continued fraction method of calculating pi and makes each processor calculate a different portion of the expansion series.
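A simplified C sketch of the same idea, using the Leibniz series for pi rather than the continued-fraction form used in the exercise; rank k sums every size-th term and MPI_Reduce combines the partial sums on rank 0:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
      int rank, size;
      const long nterms = 1000000L;
      double partial = 0.0, pi;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      /* pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...; rank k sums terms k, k+size, ... */
      for (long i = rank; i < nterms; i += size)
          partial += (i % 2 == 0 ? 1.0 : -1.0) / (2.0 * i + 1.0);
      MPI_Reduce(&partial, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0)
          printf("pi is approximately %.10f\n", 4.0 * pi);
      MPI_Finalize();
      return 0;
  }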

Page 36: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Broadcast

Duplicates data from root process to other processes in communicator

[Diagram: ranks 0-3; the value A on the root is copied to every rank by Broadcast.]

Page 37: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Broadcast syntax

Fortran:
  INTEGER count, datatype, root, comm, error
  CALL MPI_BCAST(buffer, count, datatype, root, comm, error)

C:
  MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);

C++:
  Comm.Bcast(void* buffer, int count, const MPI::Datatype& datatype, int root);

E.g. broadcasting 10 integers from rank 0:
  int tenints[10];
  MPI::COMM_WORLD.Bcast(&tenints, 10, MPI::INT, 0);

Page 38: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Scatter

Distributes data from root process amongst processors within communicator.

[Diagram: the root holds (A B C D); after Scatter, ranks 0-3 hold A, B, C and D respectively.]

Page 39: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Scatter syntax

scount (and rcount) is the number of elements each process is sent (i.e. the number each process receives).

Fortran:
  INTEGER scount, stype, rcount, rtype, root, comm, error
  CALL MPI_SCATTER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, error)

C:
  MPI_Scatter(void *sbuf, int scount, MPI_Datatype stype,
              void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm);

C++:
  Comm.Scatter(const void* sbuf, int scount, const MPI::Datatype& stype,
               void* rbuf, int rcount, const MPI::Datatype& rtype, int root);
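A hedged C fragment (inside an initialised MPI program) distributing one double per process from rank 0; the array contents and the assumption of exactly four processes are for illustration only:

  double sbuf[4] = {1.0, 2.0, 3.0, 4.0};   /* significant only on the root; assumes 4 processes */
  double myval;
  MPI_Scatter(sbuf, 1, MPI_DOUBLE,          /* each process is sent one element...   */
              &myval, 1, MPI_DOUBLE,        /* ...and receives it into myval         */
              0, MPI_COMM_WORLD);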

Page 40: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Gather

Collects data distributed amongst processes in the communicator onto the root process (collection is done in rank order).

[Diagram: ranks 0-3 hold A, B, C and D; after Gather, the root holds (A B C D).]

Page 41: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

Gather syntax

Takes the same arguments as the Scatter operation.

Fortran:
  INTEGER scount, stype, rcount, rtype, root, comm, error
  CALL MPI_GATHER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, error)

C:
  MPI_Gather(void *sbuf, int scount, MPI_Datatype stype,
             void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm);

C++:
  Comm.Gather(const void* sbuf, int scount, const MPI::Datatype& stype,
              void* rbuf, int rcount, const MPI::Datatype& rtype, int root);
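And the reverse, as a C fragment: each process contributes one value and rank 0 collects them in rank order (assumes rank has been obtained with MPI_Comm_rank and that there are at most four processes):

  double myval = 10.0 * rank;   /* locally computed contribution (illustrative) */
  double rbuf[4];               /* significant only on the root; assumes <= 4 processes */
  MPI_Gather(&myval, 1, MPI_DOUBLE,
             rbuf, 1, MPI_DOUBLE,
             0, MPI_COMM_WORLD);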

Page 42: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

All Gather

Collects all data on all processes in communicator

[Diagram: ranks 0-3 hold A, B, C and D; after Allgather, every rank holds (A B C D).]

Page 43: Introduction to MPI Programming (Part II)  Michael Griffiths, Deniz Savas & Alan Real

All Gather syntax

As Gather, but with no root defined.

Fortran:
  INTEGER scount, stype, rcount, rtype, comm, error
  CALL MPI_ALLGATHER(sbuf, scount, stype, rbuf, rcount, rtype, comm, error)

C:
  MPI_Allgather(void *sbuf, int scount, MPI_Datatype stype,
                void *rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm);

C++:
  Comm.Allgather(const void* sbuf, int scount, const MPI::Datatype& stype,
                 void* rbuf, int rcount, const MPI::Datatype& rtype);
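A small C fragment in which every process ends up with the rank of every other process (assumes rank and size have been obtained, and that size is at most 16 for this sketch):

  int allranks[16];
  MPI_Allgather(&rank, 1, MPI_INT,
                allranks, 1, MPI_INT,
                MPI_COMM_WORLD);
  /* allranks[0..size-1] is now identical on every process */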