CS 484
Message Passing
Based on the multiprocessor model: a set of independent processors connected via some communication network.
All communication between processes is done via messages sent from one process to another.
MPI
Message Passing Interface. A computation is made up of one or more processes that communicate by calling library routines.
MIMD programming model; SPMD is most common.
MPI
Processes use point-to-point communication operations. Collective communication operations are also available. Communication can be modularized by the use of communicators; MPI_COMM_WORLD is the base communicator, and other communicators are used to identify subsets of processors.
MPI
Complex, but most problems can be solved using the 6 basic functions:
MPI_Init
MPI_Finalize
MPI_Comm_size
MPI_Comm_rank
MPI_Send
MPI_Recv
MPI Basics
Almost all calls require a communicator handle as an argument, e.g. MPI_COMM_WORLD.
MPI_Init and MPI_Finalize don't require a communicator handle. They are used to begin and end an MPI program and MUST be called.
MPI Basics
MPI_Comm_size determines the number of processors in the communicator group.
MPI_Comm_rank determines the integer identifier assigned to the current process (zero-based).
MPI Basics
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int iproc, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &iproc);
    printf("I am processor %d of %d\n", iproc, nproc);
    MPI_Finalize();
    return 0;
}
MPI Communication
MPI_Send sends an array of a given type. It requires a destination rank, count, and type.
MPI_Recv receives an array of a given type. Same requirements as MPI_Send, plus an extra parameter: an MPI_Status variable.
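A minimal sketch of the two calls, assuming MPI has already been initialized and myrank holds this process's rank; the buffer, count, and tag values are illustrative:

int buf[10];          /* illustrative buffer */
MPI_Status status;

if (myrank == 0) {
    /* MPI_Send(buffer, count, datatype, dest, tag, communicator) */
    MPI_Send(buf, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (myrank == 1) {
    /* MPI_Recv(buffer, count, datatype, source, tag, communicator, &status) */
    MPI_Recv(buf, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}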
MPI Basics
Made for both FORTRAN and C.
Standards for C:
MPI_ prefix on all calls
First letter of the function name is capitalized
Returns MPI_SUCCESS or an error code
MPI_Status structure
MPI data types for each C type
OUT parameters passed using the & operator
Using MPI
Based on rsh or ssh: requires a .rhosts file (lines of the form "hostname login") or an ssh key setup.
Path to the compiler (CS open labs):
MPI_HOME  /users/faculty/snell/mpich
MPI_CC    MPI_HOME/bin/mpicc
On Marylou5, use mpicc:
mpicc hello.c -o hello
Using MPI
Write the program.
Compile using mpicc.
Write a process file (Linux cluster) with lines of the form "host nprocs full_path_to_prog"; nprocs is 0 on the first line and 1 on all others (an illustrative process file is sketched below).
Run the program (Linux cluster):
prog -p4pg process_file args
mpirun -np #procs -machinefile machines prog
Run the program (scheduled on Marylou5 using PBS):
mpirun -np #procs -machinefile $PBS_NODEFILE prog
mpiexec prog
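For illustration only, a process file following that format might look like the sketch below; the host names and program path are hypothetical:

node01 0 /users/grad/me/MPI/hello
node02 1 /users/grad/me/MPI/hello
node03 1 /users/grad/me/MPI/hello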
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#define MAXSIZE 1000

int main(int argc, char *argv[])
{
    int myid, numprocs;
    int data[MAXSIZE], i, x, low, high, myresult = 0, result;
    char fn[255];
    FILE *fp;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) {  /* Open input file and initialize data */
        strcpy(fn, getenv("HOME"));
        strcat(fn, "/MPI/rand_data.txt");
        if ((fp = fopen(fn, "r")) == NULL) {
            printf("Can't open the input file: %s\n\n", fn);
            exit(1);
        }
        for (i = 0; i < MAXSIZE; i++)
            fscanf(fp, "%d", &data[i]);
    }

    /* broadcast data */
    MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);

    /* Add my portion of the data */
    x = MAXSIZE / numprocs;
    low = myid * x;
    high = low + x;
    for (i = low; i < high; i++)
        myresult += data[i];
    printf("I got %d from %d\n", myresult, myid);

    /* Compute global sum */
    MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) printf("The sum is %d.\n", result);

    MPI_Finalize();
    return 0;
}
MPI
Message passing programs are non-deterministic because of concurrency. Consider two processes sending messages to a third.
MPI only guarantees that two messages sent from a single process to another will arrive in order. It is the programmer's responsibility to ensure computation determinism.
MPI & Determinism
MPI: a process may specify the source of a message, and a process may specify the tag (type) of a message.
Non-determinism: MPI_ANY_SOURCE or MPI_ANY_TAG
Example
for (n = 0; n < nproc/2; n++) {
    MPI_Send(buff, BSIZE, MPI_FLOAT, rnbor, 1, MPI_COMM_WORLD);
    MPI_Recv(buff, BSIZE, MPI_FLOAT, MPI_ANY_SOURCE, 1,
             MPI_COMM_WORLD, &status);
    /* Process the data */
}
Global Operations
Coordinated communication involving multiple processes.
Can be implemented by the programmer using sends and receives.
For convenience, MPI provides a suite of collective communication functions.
All participating processes must call the same function.
Collective Communication
Barrier: synchronize all processes
Broadcast: send data from one process to all processes
Gather: gather data from all processes to one process
Scatter: distribute data from one process to all processes
Reduction: global sums, products, etc.
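As a sketch of how a few of these collective calls look in C; the buffer names, counts, and the choice of rank 0 as root are illustrative:

MPI_Barrier(MPI_COMM_WORLD);                        /* Barrier: everyone waits here          */
MPI_Bcast(data, 100, MPI_INT, 0, MPI_COMM_WORLD);   /* Broadcast: rank 0's data goes to all  */
MPI_Gather(data, 100, MPI_INT,                      /* Gather: each rank's 100 ints are      */
           all,  100, MPI_INT, 0, MPI_COMM_WORLD);  /* collected into 'all' on rank 0        */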
Collective Communication
Typical uses in an iterative solver: distribute problem size, distribute input data, exchange boundary values, find max error, collect results.
MPI_Reduce
MPI_Reduce(inbuf, outbuf, count, type, op, root, comm)

MPI_Allreduce
MPI_Allreduce(inbuf, outbuf, count, type, op, comm)
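A minimal sketch of the difference (variable names are illustrative): MPI_Reduce leaves the result only on the root, while MPI_Allreduce leaves it on every process.

double local_max, global_max;
/* Result lands only on rank 0 */
MPI_Reduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
/* Result lands on every rank */
MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);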
MPI Collective Routines
Several routines: MPI_ALLGATHER, MPI_ALLGATHERV, MPI_ALLREDUCE, MPI_ALLTOALL, MPI_ALLTOALLV, MPI_BCAST, MPI_GATHER, MPI_GATHERV, MPI_REDUCE, MPI_REDUCE_SCATTER, MPI_SCAN, MPI_SCATTER, MPI_SCATTERV
"All" versions deliver results to all participating processes.
"V" versions allow the chunks to have different sizes.
MPI_ALLREDUCE, MPI_REDUCE, MPI_REDUCE_SCATTER, and MPI_SCAN take both built-in and user-defined combination functions.
Built-In Collective Computation Operations
MPI Name Operation
MPI_MAX Maximum
MPI_MIN Minimum
MPI_PROD Product
MPI_SUM Sum
MPI_LAND Logical and
MPI_LOR Logical or
MPI_LXOR Logical exclusive or (xor)
MPI_BAND Bitwise and
MPI_BOR Bitwise or
MPI_BXOR Bitwise xor
MPI_MAXLOC Maximum value and location
MPI_MINLOC Minimum value and location
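MPI_MAXLOC and MPI_MINLOC combine a value with the rank (or index) that holds it. A minimal sketch using the predefined MPI_DOUBLE_INT pair type; my_error is a hypothetical per-process quantity:

struct { double value; int rank; } local, global;
local.value = my_error;   /* hypothetical per-process value */
local.rank  = myid;
MPI_Reduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
/* On rank 0: global.value is the maximum and global.rank is the rank that held it */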
Example: PI in C - 1

#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    int done = 0, n, myid, numprocs, i, rc;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x, a;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    while (!done) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) break;
Example: PI in C - 2

        h = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n",
                   pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
    return 0;
}
Some other things
MPI Datatypes
Data in messages are described by: address, count, datatype.
MPI predefines many datatypes: MPI_INT, MPI_FLOAT, MPI_DOUBLE, etc. There is an analog for each primitive type.
Can also construct custom datatypes for structured data.
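As one sketch of a custom datatype, MPI_Type_contiguous builds a type from consecutive elements of an existing type; the name row_type, the count 4, and the send arguments are illustrative (mixed-field structs would use MPI_Type_create_struct):

MPI_Datatype row_type;
MPI_Type_contiguous(4, MPI_DOUBLE, &row_type);   /* 4 consecutive doubles */
MPI_Type_commit(&row_type);
/* row_type can now be used anywhere a datatype argument is expected, e.g.: */
MPI_Send(rowbuf, 1, row_type, dest, tag, MPI_COMM_WORLD);  /* rowbuf, dest, tag illustrative */
MPI_Type_free(&row_type);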
MPI_Recv
Blocks until a message is received. The message is matched based on source & tag.
The MPI_Status argument gets filled with information about the message: source & tag.
Receiving fewer elements than specified is OK; receiving more elements is an error. Use MPI_Get_count to get the number of elements received.
MPI_Recv
int recvd_tag, recvd_from, recvd_count;
MPI_Status status;
MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status )
recvd_tag = status.MPI_TAG;
recvd_from = status.MPI_SOURCE;
MPI_Get_count( &status, datatype, &recvd_count );
Non-blocking communication
MPI_Send and MPI_Recv are blocking: MPI_Send does not complete until the buffer is available to be modified, and MPI_Recv does not complete until the buffer is filled.
Blocking communication can lead to deadlocks:
for (int p = 0; p < nproc; p++) {
    MPI_Send(… p …);
    MPI_Recv(… p …);
}
Non-blocking communication
MPI_Isend & MPI_Irecv return immediately (non-blocking).
MPI_Request request;
MPI_Status status;
MPI_Isend( start, count, datatype, dest, tag, comm, &request )
MPI_Irecv( start, count, datatype, src, tag, comm, &request )
MPI_Wait( &request, &status )
Used to overlap communication with computation.
Anywhere you use MPI_Send or MPI_Recv, you can use the pair MPI_Isend/MPI_Wait or MPI_Irecv/MPI_Wait.
Also can use MPI_Waitall, MPI_Waitany, MPI_Waitsome.
Can also check to see if you have any messages without actually receiving them: MPI_Probe & MPI_Iprobe. MPI_Probe blocks until there is a message; MPI_Iprobe sets a flag.
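A sketch of overlapping communication with computation; the buffers, COUNT, neighbor ranks left/right, and do_local_work are illustrative placeholders:

MPI_Request send_req, recv_req;
MPI_Status  status;

MPI_Irecv(inbuf,  COUNT, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &recv_req);
MPI_Isend(outbuf, COUNT, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &send_req);

do_local_work();                /* computation that touches neither buffer */

MPI_Wait(&recv_req, &status);   /* inbuf is now safe to read   */
MPI_Wait(&send_req, &status);   /* outbuf is now safe to reuse */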
Communicators
All MPI communication is based on a communicator, which contains a context and a group.
Contexts define a safe communication space for message passing.
Contexts can be viewed as system-managed tags.
Contexts allow different libraries to co-exist.
The group is just a set of processes.
Processes are always referred to by their unique rank within the group.
Uses of MPI_COMM_WORLD
Contains all processes available at the time the program was started.
Provides the initial safe communication space.
Simple programs communicate with MPI_COMM_WORLD; even complex programs will use MPI_COMM_WORLD for most communications.
Complex programs duplicate and subdivide copies of MPI_COMM_WORLD.
Provides a global communicator for forming smaller groups or subsets of processors for specific tasks.
Example: MPI_COMM_WORLD containing processes with ranks 0 1 2 3 4 5 6 7.
int MPI_Comm_split( MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
MPI_COMM_SPLIT( COMM, COLOR, KEY, NEWCOMM, IERR )
INTEGER COMM, COLOR, KEY, NEWCOMM, IERR
Subdividing a Communicator with MPI_COMM_SPLIT
MPI_COMM_SPLIT partitions the group associated with the given communicator into disjoint subgroups.
Each subgroup contains all processes having the same value for the argument color.
Within each subgroup, processes are ranked in the order defined by the value of the argument key, with ties broken according to their rank in the old communicator.
Subdividing a Communicator
To divide a communicator into two non-overlapping groups:

color = (rank < size/2) ? 0 : 1;
MPI_Comm_split(comm, color, 0, &newcomm);
comm:    ranks 0 1 2 3 4 5 6 7
newcomm: ranks 0 1 2 3 (old ranks 0-3)    newcomm: ranks 0 1 2 3 (old ranks 4-7)
Subdividing a Communicator
To divide a communicator such that all processes with even ranks are in one group, all processes with odd ranks are in the other group, and rank order is reversed within each group:

color = (rank % 2 == 0) ? 0 : 1;
key = size - rank;
MPI_Comm_split(comm, color, key, &newcomm);
comm: ranks 0 1 2 3 4 5 6 7
newcomm (even ranks, reversed): new ranks 0 1 2 3 = old ranks 6 4 2 0
newcomm (odd ranks, reversed):  new ranks 0 1 2 3 = old ranks 7 5 3 1
      program main
      include 'mpif.h'

      integer ierr, row_comm, col_comm
      integer myrank, size, P, Q, myrow, mycol

      P = 4
      Q = 3
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)

C     Determine row and column position
      myrow = myrank/Q
      mycol = mod(myrank,Q)

C     Split comm into row and column comms
      call MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, row_comm, ierr)
      call MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, col_comm, ierr)

      print*, "My coordinates are[", myrank, "] ", myrow, mycol

      call MPI_Finalize(ierr)
      stop
      end
MPI_COMM_WORLD ranks arranged as a 4 x 3 grid, shown as rank(row,col):
 0(0,0)  1(0,1)  2(0,2)
 3(1,0)  4(1,1)  5(1,2)
 6(2,0)  7(2,1)  8(2,2)
 9(3,0) 10(3,1) 11(3,2)
Each row of the grid forms a row_comm; each column forms a col_comm.
Debugging
An ounce of prevention…
Defensive programming: check function return codes; verify send and receive sizes.
Incremental programming. Modular programming: test modules and keep the test code in place.
Identify all shared data and think carefully about how it is accessed.
Correctness first, then speed.
Debugging
Characterize the bug:
Run the code serially.
Run in parallel on one core (2-4 processes).
Run in parallel (2-4 processes on 2-4 cores).
Play around with inputs, other data, and data sizes.
Find the smallest data size that exposes the bug.
Remove as much non-determinism as you can.
Print statements: use stderr (unbuffered).
Print before and after communication or shared variable access.
Print all information: source, sizes, data, tag, etc.
Identify the process number first thing in the print (helps sorting).
Leave the prints in your code (#ifdef).
Debugging
Learn about the C constructs __FILE__, __LINE__, and __FUNCTION__.
Make one logical change at a time and then test.
Learn how to attach debuggers. You will probably need some sort of stall code, i.e. wait for input on the master and then do a barrier; all others just do the barrier (see the sketch below).
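A minimal sketch of such stall code, placed right after MPI_Init; myid is assumed to hold the process rank:

if (myid == 0) {
    fprintf(stderr, "Attach debuggers now, then press Enter...\n");
    getchar();                   /* rank 0 waits for input */
}
MPI_Barrier(MPI_COMM_WORLD);     /* everyone else waits here */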
Common problems
Not all processes call a collective call: be very careful about putting collective calls inside conditionals, and be sure the communicator is correct.
Deadlock (everybody on recv): use non-blocking calls, or use MPI_Sendrecv (see the sketch after this list).
Process waiting for data that is never sent: use collective calls where you can, and use simple communication patterns.
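As a sketch, MPI_Sendrecv does the matching send and receive in one call, so paired exchanges cannot deadlock; the neighbor ranks left/right, COUNT, and the buffers are illustrative:

MPI_Sendrecv(sendbuf, COUNT, MPI_DOUBLE, right, 0,   /* send to right neighbor      */
             recvbuf, COUNT, MPI_DOUBLE, left,  0,   /* receive from left neighbor  */
             MPI_COMM_WORLD, &status);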
Best Advice
Program incrementally and modularly.
Characterize the bug, and leave yourself time to walk away from it and think about it.
Never underestimate the value of a second set of eyes; sometimes just explaining your code to someone else helps you help yourself.