Basics of Message-passing

• Mechanics of message-passing
  – A means of creating separate processes on different computers
  – A way to send and receive messages
• Single program multiple data (SPMD) model
  – Logic for multiple processes merged into one program
  – Control statements separate processor blocks of logic
  – A compiled program is stored on each processor
  – All executables are started together statically
  – Example: MPI
• Multiple program multiple data (MPMD) model
  – Each processor has a separate master program
  – Master program spawns child processes dynamically
  – Example: PVM
Point-to-point Communication

• Generic syntax (actual formats later)
  Send(data, destination, message tag)
  Receive(data, source, message tag)
• Synchronous
  – Completes after the data is safely transferred
  – No copying between message buffers
• Asynchronous
  – Completes when transmission begins
  – Local buffers are free for application use

[Figure: Process 1 executes send(&x, 2) and Process 2 executes recv(&y, 1); the data moves from buffer x to buffer y.]
Synchronized sends and receives

[Figure (a), send() occurs before recv(): Process 1 issues a request to send and suspends; when Process 2 reaches its recv(), it returns an acknowledgment, the message transfers, and both processes continue.]

[Figure (b), recv() occurs before send(): Process 2 suspends in recv(); when Process 1 reaches its send(), it issues a request to send, receives an acknowledgment, the message transfers, and both processes continue.]
Point-to-point MPI Calls

• Synchronous send – completes when the data is successfully received
• Buffered send – completes after the data is copied to a user-supplied buffer
  – Becomes synchronous if no buffers are available
• Ready send – completes when the remote processor receives the data
  – A matching receive must precede the send
• Receive – completes when the data becomes available
• Standard send
  – If a receive is posted, completes once the data is on its way (asynchronous)
  – If no receive is posted, completes when the data is buffered by MPI
    • Becomes synchronous if no buffers are available
• Blocking – the call returns when the operation completes
• Non-blocking – the call returns immediately
  – The application is responsible for properly polling or waiting for completion
  – Allows more parallel processing

A sketch contrasting two of these send modes appears below.
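As a minimal sketch (not from the original notes), the fragment below contrasts the standard send MPI_Send() with the synchronous send MPI_Ssend(); the tag values and payload are arbitrary, and it assumes exactly two processes (mpirun -np 2):

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, x = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* Standard send: may return as soon as MPI buffers the data */
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        /* Synchronous send: returns only after the matching receive has started */
        MPI_Ssend(&x, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}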
Buffered Send Example

Applications supply a data buffer area using MPI_Buffer_attach() to hold the data during transmission.
[Figure: Process 1's send() copies the message into the attached message buffer and the process continues; Process 2's recv() later reads the message from the buffer.]
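A hedged sketch of the call sequence, assuming two processes; the buffer size is illustrative:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, x = 7;
    int size = sizeof(int) + MPI_BSEND_OVERHEAD;   /* one int plus MPI's bookkeeping */
    char *buf = malloc(size);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Buffer_attach(buf, size);                     /* supply the user buffer */
        MPI_Bsend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* returns once x is copied out */
        MPI_Buffer_detach(&buf, &size);                   /* blocks until the buffered data is sent */
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    free(buf);
    return 0;
}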
Message Tags

• Differentiate between types of messages
• The message tag is carried within the message
• Wild-card receive operations
  – MPI_ANY_TAG: matches any message type
  – MPI_ANY_SOURCE: matches messages from any sender
[Figure: send(&x, 2, 5) in Process 1 sends message type 5 from buffer x to Process 2, whose recv(&y, 1, 5) waits for a message from Process 1 with a tag of 5 and stores it in buffer y.]
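As an illustrative fragment (assuming MPI is initialized and <stdio.h> is included), a wild-card receive accepts any sender and tag; the MPI_Status structure then reports who actually sent the message:

int y;
MPI_Status status;
MPI_Recv(&y, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
/* status.MPI_SOURCE and status.MPI_TAG identify the actual sender and tag */
printf("got %d from rank %d with tag %d\n", y, status.MPI_SOURCE, status.MPI_TAG);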
Collective Communication

Route a message to a communicator (a group of processors).

• Broadcast (MPI_Bcast()): broadcast or multicast data to the processors in a group
• Scatter (MPI_Scatter()): send part of an array to separate processes
• Gather (MPI_Gather()): collect array elements from separate processes
• AlltoAll (MPI_Alltoall()): a combination of gather and scatter
• Reduce (MPI_Reduce()): combine values from all processes into a single value
• Reduce-scatter (MPI_Reduce_scatter()): reduce and then scatter the result
• Scan (MPI_Scan()): perform a prefix reduction on data across all processors
• Barrier (MPI_Barrier()): pause until all processors reach the barrier call

Advantages:
• MPI can use the processor hierarchy to improve efficiency
• Less programming is needed for collective operations
Broadcast

Broadcast – sending the same message to all processes.
Multicast – sending the same message to a defined group of processes.

[Figure: Process 0 calls bcast() with the data in buf; processes 1 through p-1 each call bcast() and receive a copy of the data.]
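A minimal fragment (assuming MPI is initialized and rank holds the process rank); the value 99 is arbitrary:

int data = 0;
if (rank == 0) data = 99;                     /* root initializes the value */
MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* after the call, data == 99 on every process in the communicator */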
Scatter

Distributing each element of an array to separate processes.
The contents of the ith location of the array transmit to process i.

[Figure: Process 0 calls scatter() with the array in buf; process i receives the ith element of the array.]
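An illustrative fragment (assuming MPI is initialized, <stdlib.h> is included, and the root is rank 0); the i*i values are arbitrary:

int p, rank, recvval;
int *sendbuf = NULL;
MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {                          /* only the root supplies the send buffer */
    sendbuf = malloc(p * sizeof(int));
    for (int i = 0; i < p; i++) sendbuf[i] = i * i;
}
MPI_Scatter(sendbuf, 1, MPI_INT, &recvval, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* process i now holds i*i in recvval */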
Gather

One process collects individual values from a set of processes (a complete MPI_Gather() example appears later in this section).

[Figure: each process calls gather() on its data; Process 0 collects the values into buf.]
Reduce

Perform a distributed calculation.
Example: perform addition over a distributed array.

[Figure: each process calls reduce() on its data; the values are combined with + and the result arrives in Process 0's buf.]
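An illustrative fragment (assuming MPI is initialized); each process contributes rank+1, so the root receives the sum 1 + 2 + ... + p:

int rank, local, sum;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
local = rank + 1;                          /* each process's contribution */
MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
/* sum is valid only on the root (rank 0) */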
PVM (Parallel Virtual Machine)

• Oak Ridge National Laboratory; free distribution
• A host process controls the environment
• The parent process spawns other processes
• Daemon processes control message passing
• Get buffer, pack, send, unpack (see the sketch below)
• Non-blocking send; blocking or non-blocking receive
• Wild-card tags and process source
• Sample program: page 52
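A hedged sketch of PVM's get-buffer/pack/send and receive/unpack sequence; the tid would come from pvm_spawn() or pvm_parent(), and the tag 99 is arbitrary:

#include <pvm3.h>

void send_value(int tid, int x)
{
    pvm_initsend(PvmDataDefault);   /* get and clear the send buffer */
    pvm_pkint(&x, 1, 1);            /* pack one int with stride 1 */
    pvm_send(tid, 99);              /* non-blocking send, tag 99 */
}

int recv_value(void)
{
    int x;
    pvm_recv(-1, 99);               /* blocking receive; -1 matches any source */
    pvm_upkint(&x, 1, 1);           /* unpack the int */
    return x;
}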
mpij and mpiJava

• Overview
  – mpiJava is a wrapper sitting on MPICH or LAM/MPI
  – mpij is a native Java implementation of MPI
• Documentation
  – mpiJava (http://www.hpjava.org/mpiJava.html)
  – mpij (uses the same API as mpiJava)
• Java Grande consortium (http://www.javagrande.org)
  – Sponsors conferences and encourages Java for parallel programming
  – Maintains Java-based paradigms (mpiJava, HPJava, and mpiJ)
• Other Java-based implementations
  – JavaMPI is another, less popular, MPI Java wrapper
SPMD Computation

main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    .
    .
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0)
        master();
    else
        slave();
    .
    .
    MPI_Finalize();
}

The master process executes master(); the slave processes execute slave().
Unsafe Message Passing

[Figure (a), intended behavior: Process 0's send(..., 1, ...) matches Process 1's recv(..., 0, ...), while the library routines lib() on each process exchange their own messages separately.]

[Figure (b), possible behavior: an application send() is instead matched by a recv() inside lib(), or vice versa, so the wrong pairs of calls communicate.]

Delivery order is a function of timing among processors; communicators (next) give libraries a private communication space that prevents this interference.
Communicators

A communicator identifies a collection of processes.

• Communicators:
  – Allow collective communication to groups of processors
  – Give a mechanism to identify processors for point-to-point transfers
• The default communicator is MPI_COMM_WORLD
  – A unique rank corresponds to each executing process
  – The rank is an integer from 0 to p-1, where p is the number of executing processors
• Applications can create subset communicators (see the sketch below)
  – Each processor has a unique rank in each sub-communicator
  – The rank is an integer from 0 to g-1, where g is the number of processors in the group
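A hedged sketch of creating subset communicators with MPI_Comm_split(), dividing MPI_COMM_WORLD into even-rank and odd-rank groups:

int world_rank, sub_rank;
MPI_Comm subcomm;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);
MPI_Comm_rank(subcomm, &sub_rank);   /* sub_rank runs from 0 to g-1 within the group */
MPI_Comm_free(&subcomm);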
Point-to-point Message Transfer

MPI_Send(buf, count, datatype, dest, tag, comm)
  buf      – address of the send buffer
  count    – number of items to send
  datatype – datatype of each item
  dest     – rank of the destination process
  tag      – message tag
  comm     – communicator

MPI_Recv(buf, count, datatype, src, tag, comm, status)
  buf      – address of the receive buffer
  count    – maximum number of items to receive
  datatype – datatype of each item
  src      – rank of the source process
  tag      – message tag
  comm     – communicator
  status   – status after the operation
int x;
MPI_Status stat;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
} else if (myrank == 1) {
    MPI_Recv(&x, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &stat);
}
Non-blocking Point-to-point Transfer

int x;
MPI_Request io;
MPI_Status stat;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Isend(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD, &io);
    doSomeProcessing();
    MPI_Wait(&io, &stat);
} else if (myrank == 1) {
    MPI_Recv(&x, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &stat);
}
• MPI_Isend() and MPI_Irecv() return immediately
• MPI_Wait() returns after the operation completes
• MPI_Test() sets its flag argument to non-zero if the operation is complete
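Reusing the declarations above, a hedged sketch of polling with MPI_Test() instead of blocking in MPI_Wait():

int done = 0;
MPI_Isend(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD, &io);
while (!done) {
    doSomeProcessing();            /* overlap computation with communication */
    MPI_Test(&io, &done, &stat);   /* done becomes non-zero once the send completes */
}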
Collective Communication Example

• Processor 0 gathers items from a group of processes
• The master processor allocates memory to hold the data
• The remote processors initialize the data array
• All processors execute the MPI_Gather() function

int data[10];   /* data to gather from each process */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Comm_size(MPI_COMM_WORLD, &grp_size);
    buf = (int *)malloc(grp_size * 10 * sizeof(int));
} else {
    for (i = 0; i < 10; i++)
        data[i] = myrank;
}
MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT, 0, MPI_COMM_WORLD);
Calculate Parallel Run Time

• Sequential execution time: ts
  – ts = number of compute steps of the best sequential algorithm (Big-Oh)
• Parallel execution time: tp = tcomp + tcomm (a worked example follows)
  – Communication overhead: tcomm = m(tstartup + n*tdata), where
    tstartup is the message latency (the time to send a message with no data),
    tdata is the transmission time for one data element,
    n is the number of data elements, and m is the number of messages
  – Computation overhead: tcomp = f(n, p)
• Assumptions
  – All processors are homogeneous and run at the same speed
  – tp is the worst-case execution time over all processors
  – tstartup, tdata, and tcomp are measured in computational step units so they can be added together
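As a worked example with made-up constants: to sum n numbers on p processors, each processor adds its n/p values (tcomp = n/p) and the p-1 slaves each send one partial sum to the master (m = p-1 messages of n = 1 element), giving tp = n/p + (p-1)(tstartup + tdata), ignoring the master's final p-1 additions. With n = 1000, p = 10, tstartup = 100, and tdata = 1 step units: tp = 100 + 9(100 + 1) = 1009 steps.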
Estimating Scalability

Notes:
• p and n respectively indicate the number of processors and data elements
• Run-time formulae such as those above help estimate scalability with respect to p and n
Parallel Visualization Tools

Observe program behavior using a space-time diagram (or process-time diagram).

[Figure: a space-time diagram for three processes; bars along the time axis distinguish computing, waiting, and message-passing system routines, with arrows marking messages between processes.]
Parallel Program Development

• Cautions
  – Parallel programming is harder than sequential programming
  – Some algorithms don't lend themselves to running in parallel
• Advised steps of development
  – Step 1: Program and test as much as possible sequentially
  – Step 2: Code the parallel version
  – Step 3: Run in parallel on one processor with a few threads
  – Step 4: Add more threads as confidence grows
  – Step 5: Run in parallel with a small number of processors
  – Step 6: Add more processes as confidence grows
• Tools
  – Parallel debuggers can help
  – Insert assertion error checks within the code
  – Instrument the code (add print statements)
  – Timing: time(), gettimeofday(), clock(), MPI_Wtime() (see the sketch below)
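A minimal timing sketch with MPI_Wtime(), which returns wall-clock seconds as a double; compute() is a hypothetical function being measured, and myrank is assumed set by MPI_Comm_rank():

double start = MPI_Wtime();
compute();                                 /* hypothetical work being timed */
double elapsed = MPI_Wtime() - start;
printf("rank %d took %f seconds\n", myrank, elapsed);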