
Introduction to MPI Continued

Remainder of the Course

1. Why bother with HPC

2. What is MPI

3. Point to point communication (today)

4. User-defined datatypes / Writing parallel code / How to use a super-computer (today)

5. Collective communication

6. Communicators

7. Process topologies

8. File/IO and Parallel profiling

9. Hadoop/Spark

10. More Hadoop / Spark

11. Alternatives to MPI

12. The largest computations in history / general interest / exam prep

Last Time

• Introduced distributed parallelism

• Introduced MPI

• General setup for MPI-based programming

Very brief, I hope you have enjoyed the not-break-break

Today

• Deriving MPI

• Blocking / Non-blocking send and receive

• Datatypes in MPI

• How to use a super-computer

Get the basics right and everything else should fall into place

Deriving MPI

Advantages of Message Passing

• Why would we want to use a message passing model?
  • And sometimes you don’t (Map/Reduce, ZeroMQ etc.)

• Universality

• Expressivity

• Ease of Debugging

• Performance

Advantages of Message Passing

• Why would we want to use a message passing model?
  • And sometimes you don’t (Map/Reduce, ZeroMQ etc.)

• Universality
  • No special hardware requirements
  • Works with special hardware

• Expressivity
  • Message passing is a complete model for parallel algorithms
  • Useful for automatic load balancing

• Ease of Debugging
  • Still difficult
  • Thinking w.r.t. messages is fairly intuitive

• Performance
  • Associates data with a processor
  • Allows compiler and cache manager to work well together

Introduction to MPI (for real)

• Simple goals
  • Portability
  • Efficiency
  • Functionality

• So we want a message passing model
  • Each process has a separate address space

• A message is one process copying part of its address space to another process
  • Send
  • Receive

Minimal MPI

• What does the sender send?
  • Data: starting address + length (bytes)
  • Destination: a destination address (an int will do)

• What does the receiver receive?
  • Data: starting address + length (bytes)
  • Source: the source address (filled in when the message is received)

Minimal MPI

• So we can send and receive messages

• Might be enough for some applications but there’s something missing

Message selection

• Currently all processes receive all messages

• If we add a tag field processes will be able to ignore messages not intended for them

Minimal MPI

• Our model now becomes

• Send(address, length, destination, tag)

• Receive (address, length, source, tag, actual length)

• We can make the source and tag arguments wildcards to go back to our original model

• This is a complete model for Message-Passing HPC

• Most exotic MPI functions are built by combining these two

Minimal MPI – Problems

• There are still some issues that MPI solves

1. Describing message buffers

2. Separating families of messages

3. Naming processes

4. Communicators

MPI – Describing Buffers

• (address, length) is not sufficient for two main reasons

• Assumes data is contiguous
  • Often not the case
  • E.g. sending a row of a matrix stored column-wise

• Assumes the data representation is always known
  • Does not handle heterogeneous clusters
  • E.g. CPU + GPU machines

• MPI’s solution
  • MPI datatypes → abstract the layout up one level → allow users to specify their own
  • (address, length, datatype)

MPI – Separating families of messages

• Consider using a library written with MPI
  • It will have its own naming of tags and such
  • Your code may interact with this

• MPI’s solution
  • Contexts → think of these as super-tags
  • Provides one more layer of separation between codes running in one application

MPI – Naming Processes

• Processes belong to groups

• A rank is associated with each process within a group

• Using an int is actually sufficient in this case

MPI – Communicators

• Combines contexts and groups into a single structure

• Destination and source ranks are specified relative to a communicator

MPI_Send(start, count, datatype, dest, tag, comm)

• The message buffer is described by
  • Start
  • Count
  • Datatype

• The target process is given by
  • Dest
  • Comm

• Tag can be used to create different ‘types’ of messages

MPI_Recv(start, count, datatype, source, tag, comm, status)

• Waits until a matching (source, tag) message is available

• Reads into the buffer described by
  • Start
  • Count
  • Datatype

• The source process is specified by
  • Source
  • Comm

• Status contains more information

• Receiving fewer than count occurrences of datatype is okay, more is an error

MPI – Other Interesting Features

• Collective communication
  • Get all of your friends involved – light up the group chat

• Two flavours
  • Data movement – e.g. broadcast
  • Collective computation – min, max, average, logical OR etc.

MPI – Other Interesting Features

• Virtual topologies
  • Allow graph and grid connection patterns to be imposed on processes
  • ‘Send to my neighbours’

• Debugging and profiling help
  • MPI requires ‘hooks’ to be available for debugging the implementation

• Communication modes
  • Blocking vs. non-blocking

• Support for libraries
  • Communicators allow libraries to exist in their own space

• Support for heterogeneous networks
  • MPI_Send/Recv are implementation independent

MPI – Other Interesting Features

• Processes vs. processors
  • A process is a software concept
  • A processor is a rock we tricked into thinking
  • Some implementations limit you to one process per processor

Is MPI Large?

• There are many functions available and many strange idiosyncrasies

• The core is rather tight however

• A fully functional MPI program (fundamentally) requires only six functions (a minimal example follows below)
  • MPI_Init
  • MPI_Comm_rank
  • MPI_Comm_size
  • MPI_Send
  • MPI_Recv
  • MPI_Finalize
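As a concrete illustration, here is a minimal sketch of a complete program built from just those six calls (the message contents are illustrative, not from the lecture):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);                  /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* who am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many of us are there? */
    if (rank == 0) {
        int greeting = 42;
        /* send one int to every other process */
        for (int dest = 1; dest < size; ++dest)
            MPI_Send(&greeting, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
    } else {
        int greeting;
        MPI_Status status;
        MPI_Recv(&greeting, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Process %d of %d received %d\n", rank, size, greeting);
    }
    MPI_Finalize();                          /* shut down MPI */
    return 0;
}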

Point-to-Point Communication

What’s the point?

• The fundamental mechanism in MPI is the transmission of data between a pair of processes
  • One sender
  • One receiver

• Almost all other MPI constructs are short-hand versions of tasks you could achieve with point-to-point methods

• We will linger on this topic a little longer than may first seem necessary
  • Many idiosyncrasies in MPI come from how point-to-point communication is achieved in code

What’s the point?

• Remember
  • Rank → ID of each process in a communicator
  • Communicator → collection of processes
  • MPI_COMM_WORLD → the communicator containing all processes

Example: Knock-Knock

What we want to do:

• Find our rank

• If process 0
  • Send “Knock, knock”

• If process 1
  • Receive a string from process 0

• Otherwise
  • Do nothing

Example: Knock-Knock

char msg[20];
int myrank, tag = 99;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    strcpy(msg, "Knock knock");
    MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
} else if (myrank == 1) {
    MPI_Recv(msg, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
}

This code sends a single string from process 0 to process 1.
Some things we’ll discuss in detail:
• MPI_Send
• MPI_Recv
• Tags

MPI_Send

1. msg (in)
   • The buffer (location in memory) to send from

2. strlen(msg)+1 (in)
   • The number of items to send
   • +1 to include the null byte ‘\0’ → only relevant when sending strings

3. MPI_CHAR (in)
   • The MPI datatype (more on this later); indicates the size of each element in your buffer

if (myrank == 0) {
    strcpy(msg, "Knock knock");
    MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

MPI_Send

4. 1 (in)
   • The rank of the destination process

5. tag (in)
   • The ‘topic’ of the message (it will only be received if process 1 calls MPI_Recv with tag 99)

6. MPI_COMM_WORLD (in)
   • The communicator through which we are sending
   • Each communicator (with two processes) has a rank 0 process and a rank 1 process

if (myrank == 0) {
    strcpy(msg, "Knock knock");
    MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);

MPI_Recv

1. msg (out)
   • The buffer to receive into

2. 20 (in)
   • The maximum number of elements we want

3. MPI_CHAR (in)
   • The size of each element

} else if (myrank == 1) {
    MPI_Recv(msg, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
}

MPI_Recv

4. 0 (in)
   • The process we want to receive from

5. tag (in)
   • The ‘topic’ we want to receive on (more on this later)

6. MPI_COMM_WORLD (in)
   • The communicator we are communicating on

7. &status (out)
   • A status object filled in by the call; it records the actual source, tag and (via MPI_Get_count) how many elements were received. The function’s return value is the error code.

} else if (myrank == 1) {
    MPI_Recv(msg, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
}
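Because the sender may transmit fewer than 20 characters, the receiving branch can ask the status object how many actually arrived. MPI_Get_count is the standard call for this; the snippet below continues the example above (the variable name received is illustrative):

int received;
MPI_Get_count(&status, MPI_CHAR, &received);   /* number of MPI_CHARs actually received */
printf("Got %d characters: %s\n", received, msg);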

Tags

• A rather good idea at the time

• Rarely used in practice

• Allow processes to provide ‘topics’ for communication
  • E.g. ‘42’ refers to all communication for a particular sub-task etc.

• Receiving with MPI_ANY_TAG effectively ignores tags altogether

Generally, we write our own code

and we assume we know what we’re doing.

MPI Datatypes

• Back to the MPI_CHAR

• Generally, data in a message (sent or received) is described as a triple
  • Address
  • Count
  • Datatype

MPI Datatypes

• Back to the MPI_CHAR
  • Generally, data in a message (sent or received) is described as a triple
    • Address
    • Count
    • Datatype

• An MPI datatype can be
  • Predefined, corresponding to a language primitive (MPI_INT, MPI_CHAR)
  • A contiguous array of MPI datatypes
  • A strided block of datatypes
  • An indexed array of blocks of datatypes
  • An arbitrary structure of datatypes

• There are MPI functions to construct custom datatypes – (int, float) tuples, for example

MPI Datatypes

• Using MPI datatypes specifies messages as data-points, not bytes
  • Machine independent
  • Implementation independent
  • Portable between machines
  • Portable between languages

We have some more information

Some really, really important points

• Communication requires cooperation; you need to know
  • Who you are sending to / receiving from
  • What you are sending/receiving
  • When you want to send/receive

• Very specific – requires careful reasoning about algorithms

• All nodes (in general) will run the same executable

Some really, really important points

• Communication requires cooperation; you need to know
  • Who you are sending to / receiving from
  • What you are sending/receiving
  • When you want to send/receive

• Very specific – requires careful reasoning about algorithms

• All nodes (in general) will run the same executable
  • A very different style of programming
  • The ‘root’ (usually rank 0) may have very different tasks to all other nodes
  • Rank becomes very important to dividing the bounds of a problem

Example 2: Knock-knock, who’s there

What we want to do

• Find our rank

• If process 0
  • Send ‘knock knock’
  • Receive from process 1

• If process 1
  • Send ‘who’s there’
  • Receive from process 0

• Else
  • Do nothing

Example 2: Knock-knock, who’s there

char msg[20];
int myrank, tag = 99;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    strcpy(msg, "Knock knock");
    MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    MPI_Recv(msg, 20, MPI_CHAR, 1, tag, MPI_COMM_WORLD, &status);
} else if (myrank == 1) {
    strcpy(msg, "Who's there?");
    MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
    MPI_Recv(msg, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
}


There may be a problem here

Blocking vs. Non-blocking

• Depending on the implementation you use, this may cause a deadlock
  • If there is enough buffer space it might be okay (but don’t rely on this)

• We’ve been using the blocking send/receive functions
  • They halt execution until the operation completes

• There exist non-blocking versions of send/recv
  • MPI_Isend – same arguments as MPI_Send, plus an MPI_Request handle
  • MPI_Irecv – same arguments as MPI_Recv, but the MPI_Status is replaced with an MPI_Request
  • They return immediately and let you continue with computation

When to use Non-blocking

• Should only be used where it improves performance
  • E.g. sending a large amount of data when a large amount of compute is also available

• Using non-blocking communication lets communication overlap with computation, so a little more happens in parallel

• To check whether a communication has completed, use
  • MPI_Wait()
  • MPI_Test()

• An alternate interpretation
  • MPI_Send/Recv is just MPI_Isend/Irecv + MPI_Wait()

Example 3: Knock-Who’s Knock-There

int main(int argc, char *argv[]) {
    int myrank, numprocs, tag = 99;
    char msg[20], msg2[20];
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    // ... Next slide
}

Example 3: Knock-Who’s Knock-There

if (myrank == 0) {
    strcpy(msg, "Knock knock");
    MPI_Irecv(msg2, 20, MPI_CHAR, 1, MPI_ANY_TAG, MPI_COMM_WORLD, &request);
    MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    MPI_Wait(&request, &status);
} else if (myrank == 1) {
    strcpy(msg, "Who's there?");
    MPI_Irecv(msg2, 20, MPI_CHAR, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &request);
    MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
    MPI_Wait(&request, &status);
}
MPI_Finalize();
return 0;

On Multi-threading + MPI

• MPI does not specify the interaction of blocking communication calls and a thread scheduler

• A good implementation would block only the sending thread

• It is the user’s (your) responsibility to ensure other threads do not edit the communicating (possibly shared) buffer

On Message Ordering

• Messages are non-overtaking

• Messages sent from one process to the same destination are received in the order they were sent

• There is no ordering guarantee between messages from different senders

Example: P0 sends two messages to P2 (M0_1 first, then M0_2), and P1 sends one message to P2 (M1_1).

P2 can receive them as:
• M1_1, M0_1, M0_2
• M0_1, M1_1, M0_2
• M0_1, M0_2, M1_1

But not as (M0_2 can never overtake M0_1):
• M1_1, M0_2, M0_1
• M0_2, M1_1, M0_1
• M0_2, M0_1, M1_1

On Message Ordering

• Another important note: ordering is not transitive
  • Sounds goofy, but it’s easy to make this mistake

• Be careful when using MPI_ANY_SOURCE

Example:
  P0: Send(P1); Send(P2)
  P1: Recv(P0); Send(P2)
  P2: Recv(any); Recv(any)

P2’s two wildcard receives may be matched in either order: the message from P1 may arrive before the message from P0, even though P1 only sent after hearing from P0. Ordering is only guaranteed between a single sender/receiver pair.

On Message Ordering

• One goal of MPI is to encourage deterministic communication patterns

• Using exact addresses, exact buffer sizes, enforced ordering etc.

• Makes code predictable

• Sources of non-determinism
  • MPI_ANY_SOURCE as source argument

• MPI_CANCEL()

• MPI_WAITANY()

• Threading

Extended Example : Computing Pi (*groans)

• Your favourite example is back again

• This example is ‘perfect’ computationally
  • Automatic load balancing → all processes do as much work as possible

• Verifiable answer

• Minimal communication

• This time, we use numerical integration

$$\int_0^1 \frac{1}{1+x^2}\,dx \;=\; \big[\arctan x\big]_0^1 \;=\; \arctan 1 - \arctan 0 \;=\; \frac{\pi}{4}$$

Extended Example – Computing PI

• So we integrate the function f(x) = 4 / (1 + x²)

• Our approach is very simple
  • Divide [0, 1] into n sub-intervals
  • Each sub-interval forms a rectangle of height f(x) (the function evaluated in that sub-interval) and width 1/n
  • Add up all the rectangles to get a (not very good) approximation of the integral

• This gives us an approximation to π

Extended Example – Computing PI

• Our parallelism will also be quite simple
  • One process (the root, rank 0) will obtain n from the user and broadcast this value to all others
  • All other processes will determine how many points they each compute
  • All other processes will compute their sub-approximations
  • All other processes will send back their approximations
  • The root will display the final result

Code Time
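For reference, a minimal sketch of the scheme just described, assuming the midpoint rule and using MPI_Bcast and MPI_Reduce for the broadcast and the final sum; the in-lecture version may be structured differently:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int n, rank, size;
    double h, local_sum = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) n = 100000;               /* the root would normally read n from the user */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    h = 1.0 / n;                             /* width of each rectangle */
    for (int i = rank; i < n; i += size) {   /* each process takes every size-th rectangle */
        double x = h * (i + 0.5);            /* midpoint of rectangle i */
        local_sum += 4.0 / (1.0 + x * x);
    }
    local_sum *= h;

    /* sum the partial results onto the root */
    MPI_Reduce(&local_sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}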

Extended Example 2 – Matrix-Vector Multiplication

• The previous example does not have any ‘actual’ message passing

• We introduce one of the most common structures for a parallel program: self-scheduling
  • Formerly ‘master–slave’
  • I will use ‘master–node’

• Matrix-vector multiplication is a good example, but there are many scenarios where this approach is applicable
  • The nodes do not need to communicate with each other
  • The amount of work for each node is difficult to predict

Matrix-Vector Multiplication

(Diagram: c = A b – the matrix A times the vector b gives the result vector c)

Matrix-Vector Multiplication

• All processes receive a copy of the vector b

• Each unit of work is a dot-product of one row of matrix A

• The root sends rows to each of the nodes

• When the result is sent back, another row is sent if available

• After all rows are processed, termination messages are sent

Code Time II

There’s a git repo with all in-lecture code examples (a sketch of the pattern follows below):

https://github.com/pritchardn/teachingCITS3402
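A compact sketch of the self-scheduling pattern described above, assuming rows are identified by the message tag (row index + 1) and tag 0 is the termination signal; this is an illustration of the idea, not the repo’s version:

#include <mpi.h>
#include <stdio.h>

#define N 8   /* rows and columns of A (illustrative size); run with at least 2 processes */

int main(int argc, char *argv[]) {
    double A[N][N], b[N], c[N], row[N], answer;
    int rank, size, sent = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* master: initialise A and b (here with dummy values) */
        for (int i = 0; i < N; ++i) {
            b[i] = 1.0;
            for (int j = 0; j < N; ++j) A[i][j] = i + j;
        }
    }
    /* every process gets a copy of the vector b */
    MPI_Bcast(b, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* hand one row to each worker to start; tag = row index + 1 */
        for (int w = 1; w < size && sent < N; ++w, ++sent)
            MPI_Send(A[sent], N, MPI_DOUBLE, w, sent + 1, MPI_COMM_WORLD);

        for (int received = 0; received < sent; ++received) {
            /* collect a result; the tag tells us which row it belongs to */
            MPI_Recv(&answer, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            c[status.MPI_TAG - 1] = answer;
            if (sent < N) {   /* more work: send the next row to this worker */
                MPI_Send(A[sent], N, MPI_DOUBLE, status.MPI_SOURCE, sent + 1,
                         MPI_COMM_WORLD);
                ++sent;
            }
        }
        /* no more rows: send every worker a termination message (tag 0) */
        for (int w = 1; w < size; ++w)
            MPI_Send(b, 0, MPI_DOUBLE, w, 0, MPI_COMM_WORLD);
        for (int i = 0; i < N; ++i) printf("c[%d] = %f\n", i, c[i]);
    } else {
        while (1) {
            MPI_Recv(row, N, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == 0) break;   /* termination signal */
            answer = 0.0;
            for (int j = 0; j < N; ++j) answer += row[j] * b[j];
            /* send the dot product back, tagged with the row it came from */
            MPI_Send(&answer, 1, MPI_DOUBLE, 0, status.MPI_TAG, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}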

Studying Parallel Performance

Studying Parallel Performance

• We can do a little better than timing programs

• The goal is to estimate
  • Computation
  • Communication
  • Scaling w.r.t. problem size

Studying Parallel Performance

• We can do a little better than timing programs

• The goal is to estimate
  • Computation
  • Communication
  • Scaling w.r.t. problem size

• Consider matrix-vector multiplication
  • Square, dense matrix, n × n
  • Each element of c requires n multiplications and n − 1 additions
  • There are n elements in c, so our FLOP requirement is

    $n\,(n + (n - 1)) = 2n^2 - n$

Studying Parallel Performance

• We also consider communication costs

• We assume all processes have the original vector already

• We need to move n + 1 values per row (n to send the row out, 1 to get the result back)

• This happens n times (once for each row)

  $n(n + 1) = n^2 + n$

Studying Parallel Performance

• The ratio of communication to computation is

  $\frac{n^2 + n}{2n^2 - n} \times \frac{T_{\mathrm{comm}}}{T_{\mathrm{calc}}}$

• Communication is usually more expensive than computation, so we try to minimise this ratio

• Often, making the problem larger makes the communication overhead insignificant
  • Here, this is not the case
  • For large n the ratio approaches 1/2

Studying Parallel Performance

• We could easily adapt our approach for matrix-matrix multiplication

• Instead of a vector b we have another square matrix B

• Each round sees a vector sent back instead of a single value

Studying Parallel Performance

• Computation requirements
  • The operations for each element of C are n multiplications and n − 1 additions
  • Now there are n² elements to compute

  $n^2(2n - 1) = 2n^3 - n^2$

• Communication requirements
  • n values to send each row of A, n values to send a row of C back, and there are n rows

  $n \times 2n = 2n^2$

• Comm/calc ratio

  $\frac{2n^2}{2n^3 - n^2} \times \frac{T_{\mathrm{comm}}}{T_{\mathrm{calc}}}$

• This scales as 1/n for large n, so here the communication overhead does become insignificant as the problem grows (see the worked limits below)
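For completeness, the two limits written out explicitly (these follow directly from the ratios above):

$$\frac{n^2 + n}{2n^2 - n} = \frac{n + 1}{2n - 1} \;\longrightarrow\; \frac{1}{2}, \qquad \frac{2n^2}{2n^3 - n^2} = \frac{2}{2n - 1} \;\longrightarrow\; 0 \;\Big(\sim \tfrac{1}{n}\Big) \qquad \text{as } n \to \infty$$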

User Defined Datatypes
Writing Parallel Code

How to use a super-computer

Today

• Introduction to user-defined datatypes
  • Datatype constructors
  • Use of derived datatypes
  • Addressing
  • Packing and unpacking

• Writing parallel code
  • Some guidance on getting started
  • Reasoning with parallel algorithms

User Defined Datatypes

• We’ve seen MPI communicate a sequence of identical elements, contiguous in memory

• We sometimes want to send other things
  • Structures containing setup parameters
  • Non-contiguous array subsections

• We’d like to send this data with a single communication call rather than several smaller ones
  • Communication speed is network bound

• MPI provides two mechanisms to help

We’ll look at how this works (relatively) briefly; most applications can be handled with the primitive communication calls.

User Defined Datatypes

• Derived datatypes
  • Specify your own data layouts
  • Useful for sending structs, for instance

• Data packing
  • An extra routine before/after sending/receiving to compress non-contiguous data into a dense block

• Often, we can achieve the same data transfer with either method
  • Obviously, there are pros and cons to each

Derived Datatypes

• All MPI communication functions take a datatype argument

• MPI provides functions to construct ‘types’ of our own (only relevant to MPI of course) to provide to these functions

• They describe layouts of memory of complex-data structures or non-contiguous memory

Derived Datatypes

• Constructed from basic datatypes

• The writers of MPI recursively defined datatypes – allowing us to use them for our own nefarious tasks

• A derived datatype is an opaque object (we can’t edit it after construction) specifying two things
  • A sequence of primitive datatypes – the type signature
  • A sequence of integer (byte) displacements – the type map

Type Signature

• A list of datatypes

• E.g. Typesig = {MPI_CHAR, MPI_INT}

• This describes a datatype consisting of an MPI_CHAR followed by an MPI_INT

Type map

• A list of pairs

• The first elements are your type signature

• The second elements are the displacement in memory (in bytes) from the first location

• E.g. MPI_INT has the type map {(int, 0)} – a single integer beginning at displacement 0

• Type maps define the size of the buffer you are sending

This will make more sense once we introduce datatype-constructors

Some utility functions (for reference)

• MPI_TYPE_EXTENT(datatype, extent)
  • datatype IN the datatype (e.g. MPI_CHAR, MPI_MYTHINGY)
  • extent OUT the extent of the datatype, i.e. the span in bytes from its first to its last byte, including any alignment padding

• MPI_TYPE_SIZE(datatype, size)
  • datatype IN the datatype
  • size OUT the total size (in bytes) of the entries in the type signature (gaps are not counted)

Some utility functions (for reference)

• MPI_Type_commit(datatype)
  • ‘Compiles’ or flattens the datatype into an internal representation for MPI to use
  • Must be done before using a derived datatype in a communication call

Importantly, all processes in a parallel program must have the same datatypes committed.

MPI_TYPE_CONTIGUOUS(count, oldtype, newtype)

• count IN replication count

• oldtype IN the old datatype

• newtype OUT the new datatype

• Constructs a typemap of count copies of oldtype in contiguous memory

• Example, oldtype = {(double, 0), (char, 8)}

• MPI_TYPE_CONTIGUOUS(3, oldtype, newtype) yields the datatype
  • {(double, 0), (char, 8), (double, 16), (char, 24), (double, 32), (char, 40)}

Constructors – Contiguous

• MPI_TYPE_CONTIGUOUS(3, oldtype, newtype) yields the datatype
  • {(double, 0), (char, 8), (double, 16), (char, 24), (double, 32), (char, 40)}

(Diagram: three copies of oldtype laid end-to-end – count = 3 – form newtype.)
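A minimal usage sketch in C; to keep it self-contained the old type here is simply MPI_DOUBLE rather than the (double, char) pair above, and dest and tag are assumed to be set up as in the earlier examples:

double triple[3] = {1.0, 2.0, 3.0};
MPI_Datatype three_doubles;

/* build a type describing 3 contiguous doubles, then commit it */
MPI_Type_contiguous(3, MPI_DOUBLE, &three_doubles);
MPI_Type_commit(&three_doubles);

/* send one instance of the new type (equivalent to sending 3 MPI_DOUBLEs) */
MPI_Send(triple, 1, three_doubles, dest, tag, MPI_COMM_WORLD);

/* free the type when it is no longer needed */
MPI_Type_free(&three_doubles);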

MPI_TYPE_INDEXED(count, array_of_blocklengths, array_of_displacements, oldtype, newtype)

• Allows one to specify a non-contiguous data layout

• The displacements of each block can differ (and can be repeated)

• count IN the number of blocks

• array_of_blocklengths IN # of elements for each block

• array_of_displacements IN displacement for each block (measured in number of elements)

• oldtype IN the old datatype

• newtype OUT the new datatype

MPI_TYPE_INDEXED(count, array_of_blocklengths, array_of_displacements, oldtype, newtype)

• E.g. oldtype = {(double, 0), (char, 8)}
  • Blocklengths B = (3, 1)
  • Displacements D = (4, 0)

• MPI_TYPE_INDEXED(2, B, D, oldtype, newtype) yields the typemap
  • {(double, 64), (char, 72), (double, 80), (char, 88), (double, 96), (char, 104), (double, 0), (char, 8)}

(Diagram: another indexed example – count = 3, blocklengths = (2, 3, 1), displacements = (0, 3, 8) – showing how blocks of oldtype are scattered to form newtype.)

Example – Upper Triangle of a Matrix

[0][0] [0][1] [0][2] …

[1][1] [1][2] …

[2][2] [2][3] …

[3][3] [3][4] …

Example – Upper Triangle of a Matrix

double values[100][100];
int    disp[100];
int    blocklen[100];
MPI_Datatype upper;

/* find the start and length of each row's upper-triangular part */
for (int i = 0; i < 100; ++i) {
    disp[i]     = 100 * i + i;   /* row i starts at element [i][i] */
    blocklen[i] = 100 - i;       /* row i contributes 100 - i elements */
}

/* create and commit the datatype */
MPI_Type_indexed(100, blocklen, disp, MPI_DOUBLE, &upper);
MPI_Type_commit(&upper);

/* send it */
MPI_Send(values, 1, upper, dest, tag, MPI_COMM_WORLD);

Packing and Unpacking

• Based on how previous message-passing libraries achieved this

• In most cases one can avoid packing and unpacking in favour of derived datatypes
  • More portable
  • More descriptive
  • Generally simpler

• Sometimes, for simple use cases, packing can be desirable

• This is where we make use of the MPI_PACKED datatype mentioned last time

MPI_PACK(inbuf, incount, datatype, outbuf, outsize, position, comm)

• inbuf IN input buffer

• incount IN number of elements in the buffer

• datatype IN the type of each elements

• outbuf OUT output buffer

• outsize IN output buffer size (bytes)

• position INOUT current position in buffer (bytes)

• comm IN communicator for packed message

Used by calling MPI_PACK repeatedly, each call packing a new inbuf into the same outbuf; position is advanced automatically.

MPI_UNPACK(inbuf, insize, position, outbuf, outcount, datatype, comm)

• inbuf IN input buffer

• insize IN size of input buffer (bytes)

• position INOUT current position (bytes)

• outbuf OUT output buffer

• outcount IN number of components to be unpacked

• datatype IN datatype of each output component

• comm IN communicator

The exact inverse of MPI_PACK. Used by repeatedly calling unpack, extracting each subsequent element
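A small sketch of the round trip; the int-plus-double payload and the 64-byte buffer are illustrative, and dest, source and tag are assumed to be set up as in the earlier examples:

/* sender: pack an int and a double into one contiguous buffer */
char buffer[64];
int  position = 0;
int  n = 25;
double x = 3.14;

MPI_Pack(&n, 1, MPI_INT,    buffer, 64, &position, MPI_COMM_WORLD);
MPI_Pack(&x, 1, MPI_DOUBLE, buffer, 64, &position, MPI_COMM_WORLD);
/* 'position' now holds the number of bytes actually packed */
MPI_Send(buffer, position, MPI_PACKED, dest, tag, MPI_COMM_WORLD);

/* receiver: unpack in the same order */
MPI_Status status;
MPI_Recv(buffer, 64, MPI_PACKED, source, tag, MPI_COMM_WORLD, &status);
position = 0;
MPI_Unpack(buffer, 64, &position, &n, 1, MPI_INT,    MPI_COMM_WORLD);
MPI_Unpack(buffer, 64, &position, &x, 1, MPI_DOUBLE, MPI_COMM_WORLD);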

Finally

There are many other functions that are helpful.

Please have a look at the documentation and MPI spec

when you come across a bizarre communication requirement.

How to write Parallel Code

Why bother with this section

• Reasoning with parallel algorithms is bizarre relative to serial algorithms. Consequently, writing and reading parallel code can be painful if you are not prepared

• This is not helped by running code on a remote machine through a job system.

This section is simply some guidance on where to start; everyone finds some things more useful than others.

Writing parallel code – Quick tips

• Install MPI locally
  • This cannot be stressed enough
  • You can simulate running multiple nodes on a single machine with mpiexec/mpirun calls

• Write a serial version of your solution first
  • Tight and clean serial code is vastly easier to parallelise
  • Especially if you don’t know how to parallelise the code in the first place
  • I would not advise trying to do both simultaneously

Writing parallel code – Quick tips

• Write a ‘dummy’ parallelised version of your code first (see the sketch below)
  • E.g. decomposing a matrix
  • Have each process compute its bounds first and print them out
  • Test your scheme for many different configurations and check you’re correct
  • Then start adding actual computation

• Read documentation
  • Most of the time, there will be a function to make your life easier

• Test your parallel code in serial
  • A parallelised piece of code should still work if only one node is available
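A sketch of what that ‘dummy’ version might look like for a 1-D row decomposition of an n-row matrix (the block-splitting formula used here is one common choice, not the only one):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, n = 1000;        /* n rows to divide between the processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* block decomposition: the first (n % size) ranks get one extra row */
    int base  = n / size;
    int extra = n % size;
    int lo    = rank * base + (rank < extra ? rank : extra);
    int count = base + (rank < extra ? 1 : 0);

    /* no real computation yet - just print the bounds and check them by eye */
    printf("rank %d of %d: rows [%d, %d)\n", rank, size, lo, lo + count);

    MPI_Finalize();
    return 0;
}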

Writing parallel code – Quick tips

• Use the sysadmins
  • If you end up using a commercial HPC system (e.g. the Pawsey Supercomputing Centre), use the helpdesk – they want to help

• Invest in some tests
  • Writing a few small examples and making them easy to run will make testing changes easier

• Good printouts are invaluable
  • They help you quickly find out where your code may be bugged and what values are changing

Writing parallel code – Quick tips

• Make writing code easy for you
  • Many IDEs / editors (CLion, VSCode etc.) allow a full remote mode – write code directly on the supercomputer using your own editor
  • Otherwise:
    • Write code locally
    • Test locally
    • Commit with git
    • Pull the edits on the supercomputer and run

Write and run code – you’ll only improve with practice

Common Errors Writing Parallel Code

• Expecting argc and argv to be passed to all processes

• Doing things before MPI_Init or after MPI_Finalize

• Matching MPI_Bcast with MPI_Recv (see the note below)

• Assuming your MPI implementation is thread-safe
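On the third point: a broadcast is a collective operation, so every rank in the communicator must call MPI_Bcast itself – the root’s broadcast cannot be matched by MPI_Recv. A minimal sketch, assuming rank has already been obtained with MPI_Comm_rank and 128 is an arbitrary example value:

int n = 0;
if (rank == 0) n = 128;                      /* only the root knows the value beforehand */

/* every process, root included, makes the same MPI_Bcast call */
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

/* wrong: non-root ranks calling MPI_Recv(&n, 1, MPI_INT, 0, ...) here
   would never match the broadcast and the program would hang */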

Summary

• Looked at point to point communication

• MPI_Send

• MPI_Recv

• MPI_Isend

• MPI_Irecv

• MPI_Wait

Bonus material: Tour-de-force of most useful point-to-point function descriptions (might be helpful later)

Quick reminder: https://www.mpi-forum.org/docs/ → explanation of all function calls

MPI Datatypes

MPI Datatypes

• Since all data is given an MPI type, an MPI implementation can communicate between very different machines

• Specifying an application-oriented data layout
  • Reduces memory-to-memory copies in the implementation
  • Allows the use of special hardware where available

A list of MPI Datatypes (C)

MPI Datatype          C Datatype

MPI_CHAR signed char

MPI_SHORT signed short int

MPI_INT signed int

MPI_LONG signed long int

MPI_UNSIGNED_CHAR unsigned char

MPI_UNSIGNED_SHORT unsigned short int

MPI_UNSIGNED unsigned int

MPI_UNSIGNED_LONG unsigned long int

MPI_FLOAT float

MPI_DOUBLE double

MPI_LONG_DOUBLE long double

MPI_BYTE n/a

MPI_PACKED n/a

MPI_BYTE / MPI_PACKED

• MPI_BYTE is precisely a byte (eight bits)

• Un-interpreted, and may be different from a character
  • Some machines may use two bytes for a character, for instance

• MPI_PACKED is a much more complicated beastie (we’ll get to it later)
  • For now, it’s just ‘any data’
  • Used to send structs through MPI

Point-to-Point-Pointers

MPI_Send(start, count, datatype, dest, tag, comm)

• start (IN) initial address of send buffer

• count (IN) number of elements in buffer

• datatype (IN) datatype of each entry

• dest (IN) rank of destination

• tag (IN) message tag

• comm (IN) communicator

Performs a blocking send; returns an error code (int)

MPI_Recv(start, count, datatype, source, tag, comm, status)

• start

• count

• datatype

• source

• tag

• comm

• status

Performs a blocking receive; returns an error code. Status can be inspected for more information

MPI_Sendrecv(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status)

• sendbuf (IN) start of send buffer

• sendcount (IN) number of entries to send

• sendtype (IN) datatype of each entry

• dest (IN) rank of destination

• sendtag (IN) message tag

• recvbuf (OUT) start of receive buffer (must be a different buffer)

• recvcount (IN) number of entries to receive

• recvtype (IN) datatype of receive entries

• source (IN) rank of source

• recvtag(IN) message tag

• comm (IN) communicator

• status (OUT) return status

Performs a standard send and receive as if executed on two separate threads

MPI_Isend(buf, count, datatype, dest, tag, comm, request)

• buf (IN) initial address of send buffer

• count (IN) number of entries to send

• datatype (IN) datatype of each entry

• dest (IN) rank of destination in comm

• tag (IN) message tag

• comm (IN) communicator

• request (OUT) request handle (MPI_REQUEST)

Posts a standard nonblocking send

MPI_Irecv(buf, count, datatype, source, tag, comm, request)

• buf (OUT) start of receive buffer

• count (IN) number of entries to receive

• datatype (IN) type of each entry

• source (IN) rank of source

• tag (IN) message tag

• comm (IN) communicator

• request (OUT) request handle

Posts a nonblocking receive

MPI_Wait(request, status)

• request (INOUT) request handle

• status (OUT) status object

Returns when the operation in request is complete

MPI_Test(request, flag, status)

• request (INOUT) request handle

• flag (OUT) true if operation completed

• status (OUT) status object

Sets flag to true if the operation defined by request is complete. This function allows single-threaded applications to schedule alternate tasks while waiting for communication to complete.

MPI_Waitall(count, requests, statuses)

• count (IN) list length

• requests (INOUT) array of request handles

• statuses (OUT) array of status objects

Blocks until all communications associated with the array of requests have resolved.

MPI_Testall(count, requests, flag, statuses)

• count (IN) list length

• requests (INOUT) array of request handles

• flag (OUT) true if all operations have completed

• statuses (OUT) array of statuses

Tests for completion of all communications specified in the array of requests; flag is set to true if all have completed, false otherwise.

MPI Constants

Communicators

• MPI_ANY_TAG Wildcard tag: a receive with this tag matches a message with any tag

• MPI_ANY_SOURCE Wildcard source: a receive with this source matches a message from any process

• MPI_COMM_NULL A return value occurring when a process is not in a given communicator

• MPI_COMM_SELF A communicator containing only the calling process

• MPI_COMM_WORLD Communicator containing all processes

Error codes

• MPI_SUCCESS Status code of a successful call

• MPI_ERR Can be used to check if any error has occurred

• MPI_ERR_ARG Indicates an error with a passed argument

• MPI_ERR_COMM Indicates an invalid communicator

• MPI_ERR_IN_STATUS Indicates that the actual error code should be looked up in the status object

• MPI_ERR_ROOT Indicates an invalid root node argument

• MPI_ERR_TAG Indicates an invalid argument for tag

• MPI_ERR_UNKNOWN An error code not matching anything else.