CS 484
Message Passing
Based on the multiprocessor model: a set of independent processors connected via some communication network.
All communication between processes is done via messages sent from one process to another.
MPI
Message Passing Interface. A computation is made up of one or more processes that communicate by calling library routines.
MIMD programming model; SPMD is most common.
MPI
Processes use point-to-point communication operations. Collective communication operations are also available. Communication can be modularized by the use of communicators; MPI_COMM_WORLD is the base communicator, and other communicators are used to identify subsets of processors.
MPI
Complex, but most problems can be solved using the 6 basic functions:
MPI_Init
MPI_Finalize
MPI_Comm_size
MPI_Comm_rank
MPI_Send
MPI_Recv
MPI Basics
Almost all calls require a communicator handle as an argument, e.g. MPI_COMM_WORLD.
MPI_Init and MPI_Finalize don't require a communicator handle. They are used to begin and end an MPI program and MUST be called.
MPI Basics
MPI_Comm_size determines the number of processors in the communicator group.
MPI_Comm_rank determines the integer identifier assigned to the current process (zero-based).
MPI Basics
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int iproc, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &iproc);
    printf("I am processor %d of %d\n", iproc, nproc);
    MPI_Finalize();
    return 0;
}
MPI Communication
MPI_Send sends an array of a given type. It requires a destination rank, count, and type.
MPI_Recv receives an array of a given type. Same requirements as MPI_Send, plus an extra parameter: an MPI_Status variable.
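A minimal sketch of the two calls, assuming MPI has already been initialized and myrank holds this process's rank; the buffer, count, and tag values are illustrative:

int buf[10];          /* illustrative buffer */
MPI_Status status;

if (myrank == 0) {
    /* MPI_Send(buffer, count, datatype, dest, tag, communicator) */
    MPI_Send(buf, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (myrank == 1) {
    /* MPI_Recv(buffer, count, datatype, source, tag, communicator, &status) */
    MPI_Recv(buf, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
}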
MPI Basics
Made for both FORTRAN and C.
Standards for C:
MPI_ prefix on all calls
First letter of the function name is capitalized
Returns MPI_SUCCESS or an error code
MPI_Status structure
MPI data types for each C type
OUT parameters passed using the & operator
Using MPI
Based on rsh or ssh: requires a .rhosts file (lines of the form "hostname login") or an ssh key setup.
Path to the compiler (CS open labs):
MPI_HOME  /users/faculty/snell/mpich
MPI_CC    MPI_HOME/bin/mpicc
On Marylou5, use mpicc:
mpicc hello.c -o hello
Using MPI
Write the program.
Compile using mpicc.
Write a process file (Linux cluster) with lines of the form "host nprocs full_path_to_prog"; nprocs is 0 on the first line and 1 on all others (an illustrative process file is sketched below).
Run the program (Linux cluster):
prog -p4pg process_file args
mpirun -np #procs -machinefile machines prog
Run the program (scheduled on Marylou5 using PBS):
mpirun -np #procs -machinefile $PBS_NODEFILE prog
mpiexec prog
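For illustration only, a process file following that format might look like the sketch below; the host names and program path are hypothetical:

node01 0 /users/grad/me/MPI/hello
node02 1 /users/grad/me/MPI/hello
node03 1 /users/grad/me/MPI/hello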
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#define MAXSIZE 1000

int main(int argc, char *argv[])
{
    int myid, numprocs;
    int data[MAXSIZE], i, x, low, high, myresult = 0, result;
    char fn[255];
    FILE *fp;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) {  /* Open input file and initialize data */
        strcpy(fn, getenv("HOME"));
        strcat(fn, "/MPI/rand_data.txt");
        if ((fp = fopen(fn, "r")) == NULL) {
            printf("Can't open the input file: %s\n\n", fn);
            exit(1);
        }
        for (i = 0; i < MAXSIZE; i++)
            fscanf(fp, "%d", &data[i]);
    }

    /* broadcast data */
    MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);

    /* Add my portion of the data */
    x = MAXSIZE / numprocs;
    low = myid * x;
    high = low + x;
    for (i = low; i < high; i++)
        myresult += data[i];
    printf("I got %d from %d\n", myresult, myid);

    /* Compute global sum */
    MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) printf("The sum is %d.\n", result);

    MPI_Finalize();
    return 0;
}
MPI
Message passing programs are non-deterministic because of concurrency. Consider two processes sending messages to a third.
MPI only guarantees that two messages sent from a single process to another will arrive in order. It is the programmer's responsibility to ensure computation determinism.
MPI & Determinism
MPI: a process may specify the source of a message, and a process may specify the tag (type) of a message.
Non-determinism: MPI_ANY_SOURCE or MPI_ANY_TAG
Example
for (n = 0; n < nproc/2; n++) {
    MPI_Send(buff, BSIZE, MPI_FLOAT, rnbor, 1, MPI_COMM_WORLD);
    MPI_Recv(buff, BSIZE, MPI_FLOAT, MPI_ANY_SOURCE, 1,
             MPI_COMM_WORLD, &status);
    /* Process the data */
}
Global Operations
Coordinated communication involving multiple processes.
Can be implemented by the programmer using sends and receives.
For convenience, MPI provides a suite of collective communication functions.
All participating processes must call the same function.
Collective Communication
Barrier: synchronize all processes
Broadcast: send data from one process to all processes
Gather: gather data from all processes to one process
Scatter: distribute data from one process to all processes
Reduction: global sums, products, etc.
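As a sketch of how a few of these collective calls look in C; the buffer names, counts, and the choice of rank 0 as root are illustrative:

MPI_Barrier(MPI_COMM_WORLD);                        /* Barrier: everyone waits here          */
MPI_Bcast(data, 100, MPI_INT, 0, MPI_COMM_WORLD);   /* Broadcast: rank 0's data goes to all  */
MPI_Gather(data, 100, MPI_INT,                      /* Gather: each rank's 100 ints are      */
           all,  100, MPI_INT, 0, MPI_COMM_WORLD);  /* collected into 'all' on rank 0        */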
Collective Communication
Typical uses in an iterative solver: distribute problem size, distribute input data, exchange boundary values, find max error, collect results.
MPI_Reduce
MPI_Reduce(inbuf, outbuf, count, type, op, root, comm)

MPI_Allreduce
MPI_Allreduce(inbuf, outbuf, count, type, op, comm)
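A minimal sketch of the difference (variable names are illustrative): MPI_Reduce leaves the result only on the root, while MPI_Allreduce leaves it on every process.

double local_max, global_max;
/* Result lands only on rank 0 */
MPI_Reduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
/* Result lands on every rank */
MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);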
MPI Collective Routines
Several routines: MPI_ALLGATHER, MPI_ALLGATHERV, MPI_ALLREDUCE, MPI_ALLTOALL, MPI_ALLTOALLV, MPI_BCAST, MPI_GATHER, MPI_GATHERV, MPI_REDUCE, MPI_REDUCE_SCATTER, MPI_SCAN, MPI_SCATTER, MPI_SCATTERV
"All" versions deliver results to all participating processes.
"V" versions allow the chunks to have different sizes.
MPI_ALLREDUCE, MPI_REDUCE, MPI_REDUCE_SCATTER, and MPI_SCAN take both built-in and user-defined combination functions.
Built-In Collective Computation Operations
MPI Name Operation
MPI_MAX Maximum
MPI_MIN Minimum
MPI_PROD Product
MPI_SUM Sum
MPI_LAND Logical and
MPI_LOR Logical or
MPI_LXOR Logical exclusive or (xor)
MPI_BAND Bitwise and
MPI_BOR Bitwise or
MPI_BXOR Bitwise xor
MPI_MAXLOC Maximum value and location
MPI_MINLOC Minimum value and location
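MPI_MAXLOC and MPI_MINLOC combine a value with the rank (or index) that holds it. A minimal sketch using the predefined MPI_DOUBLE_INT pair type; my_error is a hypothetical per-process quantity:

struct { double value; int rank; } local, global;
local.value = my_error;   /* hypothetical per-process value */
local.rank  = myid;
MPI_Reduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
/* On rank 0: global.value is the maximum and global.rank is the rank that held it */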
Example: PI in C - 1

#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    int done = 0, n, myid, numprocs, i, rc;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x, a;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    while (!done) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) break;
Example: PI in C - 2

        h = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n",
                   pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
    return 0;
}
Some other things
MPI Datatypes
Data in messages are described by: address, count, datatype.
MPI predefines many datatypes: MPI_INT, MPI_FLOAT, MPI_DOUBLE, etc. There is an analog for each primitive type.
Can also construct custom datatypes for structured data.
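As one sketch of a custom datatype, MPI_Type_contiguous builds a type from consecutive elements of an existing type; the name row_type, the count 4, and the send arguments are illustrative (mixed-field structs would use MPI_Type_create_struct):

MPI_Datatype row_type;
MPI_Type_contiguous(4, MPI_DOUBLE, &row_type);   /* 4 consecutive doubles */
MPI_Type_commit(&row_type);
/* row_type can now be used anywhere a datatype argument is expected, e.g.: */
MPI_Send(rowbuf, 1, row_type, dest, tag, MPI_COMM_WORLD);  /* rowbuf, dest, tag illustrative */
MPI_Type_free(&row_type);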
MPI_Recv
Blocks until a message is received. The message is matched based on source & tag.
The MPI_Status argument gets filled with information about the message: source & tag.
Receiving fewer elements than specified is OK; receiving more elements is an error. Use MPI_Get_count to get the number of elements received.
MPI_Recv
int recvd_tag, recvd_from, recvd_count;
MPI_Status status;
MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status )
recvd_tag = status.MPI_TAG;
recvd_from = status.MPI_SOURCE;
MPI_Get_count( &status, datatype, &recvd_count );
Non-blocking communication
MPI_Send and MPI_Recv are blocking: MPI_Send does not complete until the buffer is available to be modified, and MPI_Recv does not complete until the buffer is filled.
Blocking communication can lead to deadlocks:
for (int p = 0; p < nproc; p++) {
    MPI_Send(… p …);
    MPI_Recv(… p …);
}
Non-blocking communication
MPI_Isend & MPI_Irecv return immediately (non-blocking).
MPI_Request request;
MPI_Status status;
MPI_Isend( start, count, datatype, dest, tag, comm, &request )
MPI_Irecv( start, count, datatype, src, tag, comm, &request )
MPI_Wait( &request, &status )
Used to overlap communication with computation.
Anywhere you use MPI_Send or MPI_Recv, you can use the pair MPI_Isend/MPI_Wait or MPI_Irecv/MPI_Wait.
Also can use MPI_Waitall, MPI_Waitany, MPI_Waitsome.
Can also check to see if you have any messages without actually receiving them: MPI_Probe & MPI_Iprobe. MPI_Probe blocks until there is a message; MPI_Iprobe sets a flag.
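A sketch of overlapping communication with computation; the buffers, COUNT, neighbor ranks left/right, and do_local_work are illustrative placeholders:

MPI_Request send_req, recv_req;
MPI_Status  status;

MPI_Irecv(inbuf,  COUNT, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &recv_req);
MPI_Isend(outbuf, COUNT, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &send_req);

do_local_work();                /* computation that touches neither buffer */

MPI_Wait(&recv_req, &status);   /* inbuf is now safe to read   */
MPI_Wait(&send_req, &status);   /* outbuf is now safe to reuse */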
Communicators
All MPI communication is based on a communicator, which contains a context and a group.
Contexts define a safe communication space for message passing.
Contexts can be viewed as system-managed tags.
Contexts allow different libraries to co-exist.
The group is just a set of processes.
Processes are always referred to by their unique rank within the group.
Uses of MPI_COMM_WORLD
Contains all processes available at the time the program was started.
Provides the initial safe communication space.
Simple programs communicate with MPI_COMM_WORLD; even complex programs will use MPI_COMM_WORLD for most communications.
Complex programs duplicate and subdivide copies of MPI_COMM_WORLD.
Provides a global communicator for forming smaller groups or subsets of processors for specific tasks.
Example: MPI_COMM_WORLD containing processes with ranks 0 1 2 3 4 5 6 7.
int MPI_Comm_split( MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
MPI_COMM_SPLIT( COMM, COLOR, KEY, NEWCOMM, IERR )
INTEGER COMM, COLOR, KEY, NEWCOMM, IERR
Subdividing a Communicator with MPI_COMM_SPLIT
MPI_COMM_SPLIT partitions the group associated with the given communicator into disjoint subgroups.
Each subgroup contains all processes having the same value for the argument color.
Within each subgroup, processes are ranked in the order defined by the value of the argument key, with ties broken according to their rank in the old communicator.
Subdividing a Communicator
To divide a communicator into two non-overlapping groups:

color = (rank < size/2) ? 0 : 1;
MPI_Comm_split(comm, color, 0, &newcomm);
comm:    ranks 0 1 2 3 4 5 6 7
newcomm: ranks 0 1 2 3 (old ranks 0-3)    newcomm: ranks 0 1 2 3 (old ranks 4-7)
Subdividing a Communicator
To divide a communicator such that all processes with even ranks are in one group, all processes with odd ranks are in the other group, and rank order is reversed within each group:

color = (rank % 2 == 0) ? 0 : 1;
key = size - rank;
MPI_Comm_split(comm, color, key, &newcomm);
comm: ranks 0 1 2 3 4 5 6 7
newcomm (even ranks, reversed): new ranks 0 1 2 3 = old ranks 6 4 2 0
newcomm (odd ranks, reversed):  new ranks 0 1 2 3 = old ranks 7 5 3 1
      program main
      include 'mpif.h'

      integer ierr, row_comm, col_comm
      integer myrank, size, P, Q, myrow, mycol

      P = 4
      Q = 3
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)

C     Determine row and column position
      myrow = myrank/Q
      mycol = mod(myrank,Q)

C     Split comm into row and column comms
      call MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, row_comm, ierr)
      call MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, col_comm, ierr)

      print*, "My coordinates are[", myrank, "] ", myrow, mycol

      call MPI_Finalize(ierr)
      stop
      end
MPI_COMM_WORLD ranks arranged as a 4 x 3 grid, shown as rank(row,col):
 0(0,0)  1(0,1)  2(0,2)
 3(1,0)  4(1,1)  5(1,2)
 6(2,0)  7(2,1)  8(2,2)
 9(3,0) 10(3,1) 11(3,2)
Each row of the grid forms a row_comm; each column forms a col_comm.
Debugging
An ounce of prevention…
Defensive programming: check function return codes; verify send and receive sizes.
Incremental programming. Modular programming: test modules and keep the test code in place.
Identify all shared data and think carefully about how it is accessed.
Correctness first, then speed.
Debugging
Characterize the bug:
Run the code serially.
Run in parallel on one core (2-4 processes).
Run in parallel (2-4 processes on 2-4 cores).
Play around with inputs, other data, and data sizes.
Find the smallest data size that exposes the bug.
Remove as much non-determinism as you can.
Print statements: use stderr (unbuffered).
Print before and after communication or shared variable access.
Print all information: source, sizes, data, tag, etc.
Identify the process number first thing in the print (helps sorting).
Leave the prints in your code (#ifdef).
Debugging
Learn about the C constructs __FILE__, __LINE__, and __FUNCTION__.
Make one logical change at a time and then test.
Learn how to attach debuggers. You will probably need some sort of stall code, i.e. wait for input on the master and then do a barrier; all others just do the barrier (see the sketch below).
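A minimal sketch of such stall code, placed right after MPI_Init; myid is assumed to hold the process rank:

if (myid == 0) {
    fprintf(stderr, "Attach debuggers now, then press Enter...\n");
    getchar();                   /* rank 0 waits for input */
}
MPI_Barrier(MPI_COMM_WORLD);     /* everyone else waits here */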
Common problems
Not all processes call a collective call: be very careful about putting collective calls inside conditionals, and be sure the communicator is correct.
Deadlock (everybody on recv): use non-blocking calls, or use MPI_Sendrecv (see the sketch after this list).
Process waiting for data that is never sent: use collective calls where you can, and use simple communication patterns.
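As a sketch, MPI_Sendrecv does the matching send and receive in one call, so paired exchanges cannot deadlock; the neighbor ranks left/right, COUNT, and the buffers are illustrative:

MPI_Sendrecv(sendbuf, COUNT, MPI_DOUBLE, right, 0,   /* send to right neighbor      */
             recvbuf, COUNT, MPI_DOUBLE, left,  0,   /* receive from left neighbor  */
             MPI_COMM_WORLD, &status);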
Best Advice
Program incrementally and modularly.
Characterize the bug, and leave yourself time to walk away from it and think about it.
Never underestimate the value of a second set of eyes; sometimes just explaining your code to someone else helps you help yourself.