Basics of Message-passing

• Mechanics of message-passing
  – A means of creating separate processes on different computers
  – A way to send and receive messages
• Single program multiple data (SPMD) model
  – Logic for multiple processes merged into one program
  – Control statements separate processor blocks of logic
  – A compiled program is stored on each processor
  – All executables are started together statically
  – Example: MPI
• Multiple program multiple data (MPMD) model
  – Each processor has a separate master program
  – Master program spawns child processes dynamically
  – Example: PVM
Point-to-point Communication

• Generic syntax (actual formats later)
  Send(data, destination, message tag)
  Receive(data, source, message tag)
• Synchronous
  – Completes after the data is safely transferred
  – No copying between message buffers
• Asynchronous
  – Completes when transmission begins
  – Local buffers are free for application use

[Figure: Process 1 executes send(&x, 2) and Process 2 executes recv(&y, 1); the data moves from buffer x to buffer y.]
Synchronized sends and receives

[Figure (a), send() occurs before recv(): Process 1 issues a request to send and suspends; when Process 2 reaches its recv(), it returns an acknowledgment, the message transfers, and both processes continue.]

[Figure (b), recv() occurs before send(): Process 2 suspends in recv(); when Process 1 reaches its send(), it issues a request to send, receives an acknowledgment, the message transfers, and both processes continue.]
Point-to-point MPI Calls

• Synchronous send – completes when the data is successfully received
• Buffered send – completes after the data is copied to a user-supplied buffer
  – Becomes synchronous if no buffers are available
• Ready send – completes when the remote processor receives the data
  – A matching receive must precede the send
• Receive – completes when the data becomes available
• Standard send
  – If a receive is posted, completes once the data is on its way (asynchronous)
  – If no receive is posted, completes when the data is buffered by MPI
    • Becomes synchronous if no buffers are available
• Blocking – the call returns when the operation completes
• Non-blocking – the call returns immediately
  – The application is responsible for properly polling or waiting for completion
  – Allows more parallel processing

A sketch contrasting two of these send modes appears below.
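As a minimal sketch (not from the original notes), the fragment below contrasts the standard send MPI_Send() with the synchronous send MPI_Ssend(); the tag values and payload are arbitrary, and it assumes exactly two processes (mpirun -np 2):

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, x = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* Standard send: may return as soon as MPI buffers the data */
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        /* Synchronous send: returns only after the matching receive has started */
        MPI_Ssend(&x, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}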
Buffered Send Example

Applications supply a data buffer area using MPI_Buffer_attach() to hold the data during transmission.
[Figure: Process 1's send() copies the message into the attached message buffer and the process continues; Process 2's recv() later reads the message from the buffer.]
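A hedged sketch of the call sequence, assuming two processes; the buffer size is illustrative:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, x = 7;
    int size = sizeof(int) + MPI_BSEND_OVERHEAD;   /* one int plus MPI's bookkeeping */
    char *buf = malloc(size);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Buffer_attach(buf, size);                     /* supply the user buffer */
        MPI_Bsend(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* returns once x is copied out */
        MPI_Buffer_detach(&buf, &size);                   /* blocks until the buffered data is sent */
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    free(buf);
    return 0;
}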
Message Tags

• Differentiate between types of messages
• The message tag is carried within the message
• Wild-card receive operations
  – MPI_ANY_TAG: matches any message type
  – MPI_ANY_SOURCE: matches messages from any sender
[Figure: send(&x, 2, 5) in Process 1 sends message type 5 from buffer x to Process 2, whose recv(&y, 1, 5) waits for a message from Process 1 with a tag of 5 and stores it in buffer y.]
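As an illustrative fragment (assuming MPI is initialized and <stdio.h> is included), a wild-card receive accepts any sender and tag; the MPI_Status structure then reports who actually sent the message:

int y;
MPI_Status status;
MPI_Recv(&y, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
/* status.MPI_SOURCE and status.MPI_TAG identify the actual sender and tag */
printf("got %d from rank %d with tag %d\n", y, status.MPI_SOURCE, status.MPI_TAG);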
Collective Communication

Route a message to a communicator (a group of processors).

• Broadcast (MPI_Bcast()): broadcast or multicast data to the processors in a group
• Scatter (MPI_Scatter()): send part of an array to separate processes
• Gather (MPI_Gather()): collect array elements from separate processes
• AlltoAll (MPI_Alltoall()): a combination of gather and scatter
• Reduce (MPI_Reduce()): combine values from all processes into a single value
• Reduce-scatter (MPI_Reduce_scatter()): reduce and then scatter the result
• Scan (MPI_Scan()): perform a prefix reduction on data across all processors
• Barrier (MPI_Barrier()): pause until all processors reach the barrier call

Advantages:
• MPI can use the processor hierarchy to improve efficiency
• Less programming is needed for collective operations
Broadcast

Broadcast – sending the same message to all processes.
Multicast – sending the same message to a defined group of processes.

[Figure: Process 0 calls bcast() with the data in buf; processes 1 through p-1 each call bcast() and receive a copy of the data.]
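A minimal fragment (assuming MPI is initialized and rank holds the process rank); the value 99 is arbitrary:

int data = 0;
if (rank == 0) data = 99;                     /* root initializes the value */
MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* after the call, data == 99 on every process in the communicator */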
Scatter

Distributing each element of an array to separate processes.
The contents of the ith location of the array transmit to process i.

[Figure: Process 0 calls scatter() with the array in buf; process i receives the ith element of the array.]
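An illustrative fragment (assuming MPI is initialized, <stdlib.h> is included, and the root is rank 0); the i*i values are arbitrary:

int p, rank, recvval;
int *sendbuf = NULL;
MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {                          /* only the root supplies the send buffer */
    sendbuf = malloc(p * sizeof(int));
    for (int i = 0; i < p; i++) sendbuf[i] = i * i;
}
MPI_Scatter(sendbuf, 1, MPI_INT, &recvval, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* process i now holds i*i in recvval */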
Gather

One process collects individual values from a set of processes (a complete MPI_Gather() example appears later in this section).

[Figure: each process calls gather() on its data; Process 0 collects the values into buf.]
Reduce

Perform a distributed calculation.
Example: perform addition over a distributed array.

[Figure: each process calls reduce() on its data; the values are combined with + and the result arrives in Process 0's buf.]
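An illustrative fragment (assuming MPI is initialized); each process contributes rank+1, so the root receives the sum 1 + 2 + ... + p:

int rank, local, sum;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
local = rank + 1;                          /* each process's contribution */
MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
/* sum is valid only on the root (rank 0) */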
PVM (Parallel Virtual Machine)

• Oak Ridge National Laboratory; free distribution
• A host process controls the environment
• The parent process spawns other processes
• Daemon processes control message passing
• Get buffer, pack, send, unpack (see the sketch below)
• Non-blocking send; blocking or non-blocking receive
• Wild-card tags and process source
• Sample program: page 52
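A hedged sketch of PVM's get-buffer/pack/send and receive/unpack sequence; the tid would come from pvm_spawn() or pvm_parent(), and the tag 99 is arbitrary:

#include <pvm3.h>

void send_value(int tid, int x)
{
    pvm_initsend(PvmDataDefault);   /* get and clear the send buffer */
    pvm_pkint(&x, 1, 1);            /* pack one int with stride 1 */
    pvm_send(tid, 99);              /* non-blocking send, tag 99 */
}

int recv_value(void)
{
    int x;
    pvm_recv(-1, 99);               /* blocking receive; -1 matches any source */
    pvm_upkint(&x, 1, 1);           /* unpack the int */
    return x;
}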
mpij and mpiJava

• Overview
  – mpiJava is a wrapper sitting on MPICH or LAM/MPI
  – mpij is a native Java implementation of MPI
• Documentation
  – mpiJava (http://www.hpjava.org/mpiJava.html)
  – mpij (uses the same API as mpiJava)
• Java Grande consortium (http://www.javagrande.org)
  – Sponsors conferences and encourages Java for parallel programming
  – Maintains Java-based paradigms (mpiJava, HPJava, and mpiJ)
• Other Java-based implementations
  – JavaMPI is another, less popular, MPI Java wrapper
SPMD Computation

main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    .
    .
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0)
        master();
    else
        slave();
    .
    .
    MPI_Finalize();
}

The master process executes master(); the slave processes execute slave().
Unsafe Message Passing

[Figure (a), intended behavior: Process 0's send(..., 1, ...) matches Process 1's recv(..., 0, ...), while the library routines lib() on each process exchange their own messages separately.]

[Figure (b), possible behavior: an application send() is instead matched by a recv() inside lib(), or vice versa, so the wrong pairs of calls communicate.]

Delivery order is a function of timing among processors; communicators (next) give libraries a private communication space that prevents this interference.
Communicators

A communicator identifies a collection of processes.

• Communicators:
  – Allow collective communication to groups of processors
  – Give a mechanism to identify processors for point-to-point transfers
• The default communicator is MPI_COMM_WORLD
  – A unique rank corresponds to each executing process
  – The rank is an integer from 0 to p-1, where p is the number of executing processors
• Applications can create subset communicators (see the sketch below)
  – Each processor has a unique rank in each sub-communicator
  – The rank is an integer from 0 to g-1, where g is the number of processors in the group
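A hedged sketch of creating subset communicators with MPI_Comm_split(), dividing MPI_COMM_WORLD into even-rank and odd-rank groups:

int world_rank, sub_rank;
MPI_Comm subcomm;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);
MPI_Comm_rank(subcomm, &sub_rank);   /* sub_rank runs from 0 to g-1 within the group */
MPI_Comm_free(&subcomm);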
Point-to-point Message Transfer

MPI_Send(buf, count, datatype, dest, tag, comm)
  buf      – address of the send buffer
  count    – number of items to send
  datatype – datatype of each item
  dest     – rank of the destination process
  tag      – message tag
  comm     – communicator

MPI_Recv(buf, count, datatype, src, tag, comm, status)
  buf      – address of the receive buffer
  count    – maximum number of items to receive
  datatype – datatype of each item
  src      – rank of the source process
  tag      – message tag
  comm     – communicator
  status   – status after the operation
int x;
MPI_Status stat;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
} else if (myrank == 1) {
    MPI_Recv(&x, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &stat);
}
Non-blocking Point-to-point Transfer

int x;
MPI_Request io;
MPI_Status stat;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Isend(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD, &io);
    doSomeProcessing();
    MPI_Wait(&io, &stat);
} else if (myrank == 1) {
    MPI_Recv(&x, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &stat);
}
• MPI_Isend() and MPI_Irecv() return immediately
• MPI_Wait() returns after the operation completes
• MPI_Test() sets its flag argument to non-zero if the operation is complete
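Reusing the declarations above, a hedged sketch of polling with MPI_Test() instead of blocking in MPI_Wait():

int done = 0;
MPI_Isend(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD, &io);
while (!done) {
    doSomeProcessing();            /* overlap computation with communication */
    MPI_Test(&io, &done, &stat);   /* done becomes non-zero once the send completes */
}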
Collective Communication Example

• Processor 0 gathers items from a group of processes
• The master processor allocates memory to hold the data
• The remote processors initialize the data array
• All processors execute the MPI_Gather() function

int data[10];   /* data to gather from each process */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Comm_size(MPI_COMM_WORLD, &grp_size);
    buf = (int *)malloc(grp_size * 10 * sizeof(int));
} else {
    for (i = 0; i < 10; i++)
        data[i] = myrank;
}
MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT, 0, MPI_COMM_WORLD);
Calculate Parallel Run Time

• Sequential execution time: ts
  – ts = number of compute steps of the best sequential algorithm (Big-Oh)
• Parallel execution time: tp = tcomp + tcomm (a worked example follows)
  – Communication overhead: tcomm = m(tstartup + n*tdata), where
    tstartup is the message latency (the time to send a message with no data),
    tdata is the transmission time for one data element,
    n is the number of data elements, and m is the number of messages
  – Computation overhead: tcomp = f(n, p)
• Assumptions
  – All processors are homogeneous and run at the same speed
  – tp is the worst-case execution time over all processors
  – tstartup, tdata, and tcomp are measured in computational step units so they can be added together
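As a worked example with made-up constants: to sum n numbers on p processors, each processor adds its n/p values (tcomp = n/p) and the p-1 slaves each send one partial sum to the master (m = p-1 messages of n = 1 element), giving tp = n/p + (p-1)(tstartup + tdata), ignoring the master's final p-1 additions. With n = 1000, p = 10, tstartup = 100, and tdata = 1 step units: tp = 100 + 9(100 + 1) = 1009 steps.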
Estimating Scalability

Notes:
• p and n respectively indicate the number of processors and data elements
• Run-time formulae such as those above help estimate scalability with respect to p and n
Parallel Visualization Tools

Observe program behavior using a space-time diagram (or process-time diagram).

[Figure: a space-time diagram for three processes; bars along the time axis distinguish computing, waiting, and message-passing system routines, with arrows marking messages between processes.]
Parallel Program Development

• Cautions
  – Parallel programming is harder than sequential programming
  – Some algorithms don't lend themselves to running in parallel
• Advised steps of development
  – Step 1: Program and test as much as possible sequentially
  – Step 2: Code the parallel version
  – Step 3: Run in parallel on one processor with a few threads
  – Step 4: Add more threads as confidence grows
  – Step 5: Run in parallel with a small number of processors
  – Step 6: Add more processes as confidence grows
• Tools
  – Parallel debuggers can help
  – Insert assertion error checks within the code
  – Instrument the code (add print statements)
  – Timing: time(), gettimeofday(), clock(), MPI_Wtime() (see the sketch below)
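A minimal timing sketch with MPI_Wtime(), which returns wall-clock seconds as a double; compute() is a hypothetical function being measured, and myrank is assumed set by MPI_Comm_rank():

double start = MPI_Wtime();
compute();                                 /* hypothetical work being timed */
double elapsed = MPI_Wtime() - start;
printf("rank %d took %f seconds\n", myrank, elapsed);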