Page 1: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 1

Parallel Processing (CS 676)

Lecture 8: Grouping Data and Communicators in MPI

Jeremy R. Johnson

*Parts of this lecture were derived from Chapters 6 and 7 in Pacheco

Page 2: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 2

Introduction

• Objective: To introduce MPI commands for creating types and communicators. To discuss performance models and considerations in MPI and ways of reducing communication.

• Topics
  – MPI Datatypes and packing
    • Revised version of Get_Data
    • Matrix transposition
  – Creating communicators
    • Topologies
    • Grids
  – Matrix Multiplication
    • Fox’s algorithm
  – Performance Model

Page 3: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 3

Derived Types

• Due to the latency of communication, it is usually a good idea to package several elements into a single message.

• MPI_Send and MPI_Recv allow a message to be specified by a start address, a basic type, and a count.
  – This allows multiple data elements to be sent in one message
  – Requires elements to be of the same type
  – Elements must be contiguous

• A generalized type
  – {(t_0,d_0),…,(t_{n-1},d_{n-1})}
  – t_i is an existing type
  – d_i is a displacement

Page 4: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 4

Functions for Creating MPI Types

• int MPI_Type_struct(int count, int block_lengths[], MPI_Aint displacements[], MPI_Datatype typelist[], MPI_Datatype* new_mpi_t)

• int MPI_Address(void* location, MPI_Aint* address)

• int MPI_Type_commit(MPI_Datatype* new_mpi_t)
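A minimal sketch of how these three calls fit together, modeled on the Get_Data example mentioned in the topics (broadcasting float a, float b, int n in one message). The function name Build_derived_type and the (a, b, n) layout are illustrative assumptions; MPI_Type_struct and MPI_Address are the older names used on this slide (newer code would call MPI_Type_create_struct and MPI_Get_address instead).

#include <mpi.h>

/* Build a derived type describing the triple (a, b, n) so that all
   three values can travel in one message (e.g., one MPI_Bcast). */
void Build_derived_type(float* a_ptr, float* b_ptr, int* n_ptr,
                        MPI_Datatype* mesg_mpi_t /* out */) {
    int          block_lengths[3] = {1, 1, 1};
    MPI_Datatype typelist[3]      = {MPI_FLOAT, MPI_FLOAT, MPI_INT};
    MPI_Aint     displacements[3];
    MPI_Aint     start_address, address;

    /* displacements are byte offsets relative to the first element */
    MPI_Address(a_ptr, &start_address);
    displacements[0] = 0;
    MPI_Address(b_ptr, &address);
    displacements[1] = address - start_address;
    MPI_Address(n_ptr, &address);
    displacements[2] = address - start_address;

    MPI_Type_struct(3, block_lengths, displacements, typelist, mesg_mpi_t);
    MPI_Type_commit(mesg_mpi_t);    /* must commit before use */
}

Once committed, all three values can be distributed with a single call such as MPI_Bcast(a_ptr, 1, mesg_mpi_t, 0, MPI_COMM_WORLD), since the displacements are relative to a_ptr.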

Page 5: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 5

Other Derived Datatype Constructors

• int MPI_Type_vector(int count, int block_length, int stride, MPI_Datatype element_type, MPI_Datatype* new_mpi_t)

• int MPI_Type_contiguous(int count, MPI_Datatype old_type, MPI_Datatype* new_mpi_t)

• int MPI_Type_indexed(int count, int block_lengths[], int displacements[], MPI_Datatype old_type, MPI_Datatype* new_mpi_t)
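As a small, hypothetical illustration of the simplest constructor: MPI_Type_contiguous just fuses a fixed number of consecutive elements into one unit (coords, point_mpi_t, and the ranks below are made up for the example; my_rank and status are assumed to be set up as on the other slides).

MPI_Datatype point_mpi_t;
float coords[30];    /* 10 "points", 3 consecutive floats each */

MPI_Type_contiguous(3, MPI_FLOAT, &point_mpi_t);
MPI_Type_commit(&point_mpi_t);

if (my_rank == 0)
    MPI_Send(coords, 10, point_mpi_t, 1, 0, MPI_COMM_WORLD);   /* 10 points = 30 floats */
else
    MPI_Recv(coords, 10, point_mpi_t, 0, 0, MPI_COMM_WORLD, &status);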

Page 6: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 6

Transpose

float A[10][10]; /* stored in row-major order. */

/* Send 3rd row of A from process 0 to process 1. */

if (my_rank == 0) {

MPI_Send(&(A[2][0]), 10, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);

} else {

MPI_Recv(&(A[2][0]), 10, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);

}

/* Doesn’t work for columns, since not contiguous. */

Page 7: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 7

Transpose

float A[10][10]; /* stored in row-major order. */

/* Send 3rd column of A from process 0 to process 1. */

MPI_Datatype column_mpi_t;

MPI_Type_vector(10, 1, 10, MPI_FLOAT, &column_mpi_t);

MPI_Type_commit(&column_mpi_t);

if (my_rank == 0) {

MPI_Send(&(A[0][2]), 1, column_mpi_t, 1, 0, MPI_COMM_WORLD);

} else {

MPI_Recv(&(A[0][2]), 1, column_mpi_t, 0, 0, MPI_COMM_WORLD, &status);

}

Page 8: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 8

Upper Triangular Matrix

float A[n][n];            /* Complete matrix  */
float T[n][n];            /* Upper triangle   */
int displacements[n];
int block_lengths[n];
MPI_Datatype index_mpi_t;

for (i = 0; i < n; i++) {
    block_lengths[i] = n - i;
    displacements[i] = (n + 1) * i;
}
MPI_Type_indexed(n, block_lengths, displacements, MPI_FLOAT, &index_mpi_t);
MPI_Type_commit(&index_mpi_t);
if (my_rank == 0)
    MPI_Send(A, 1, index_mpi_t, 1, 0, MPI_COMM_WORLD);
else
    MPI_Recv(T, 1, index_mpi_t, 0, 0, MPI_COMM_WORLD, &status);

Page 9: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 9

Type Matching

• When can a receiving process match the data sent by a sending process?

– MPI_Send(message, send_count, send_mpi_t, 1, 0, MPI_COMM_WORLD)

– MPI_Recv(message, recv_count, recv_mpi_t, 0, 0,MPI_COMM_WORLD,&status)

• Given a derived type {(t_0,d_0),…,(t_{n-1},d_{n-1})}
  – Displacements do not matter
  – Type signatures {t_0,…,t_{n-1}} and {u_0,…,u_{m-1}} must be compatible:
  – n ≤ m and t_i = u_i for i = 0,…,n-1

– For collective communications sending and receiving types must be identical

Page 10: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 10

Type Matching Example

• For type column_mpi_t (column of a 10 × 10 array of floats)
  – {(MPI_FLOAT,0), (MPI_FLOAT,10*sizeof(float)), (MPI_FLOAT,20*sizeof(float)),…,(MPI_FLOAT,90*sizeof(float))}
  – Signature is {MPI_FLOAT,…,MPI_FLOAT}, MPI_FLOAT 10 times

– Example: Can send column to row

float A[10][10]; /* stored in row-major order. */

if (my_rank == 0)

MPI_Send(&(A[0][0]), 1, column_mpi_t, 1, 0, MPI_COMM_WORLD);

else if (my_rank == 1)

MPI_Recv(&(A[0][0]),10, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);

Page 11: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 11

Pack and Unpack

• MPI_Pack and MPI_Unpack allow a user to copy non-contiguous memory locations into a contiguous buffer and to copy a contiguous buffer into non-contiguous memory locations

• int MPI_Pack(void* pack_data, int in_count, MPI_Datatype datatype, void* buffer, int buffer_size, int* position, MPI_Comm comm)

– The data in pack_data is copied into buffer starting at location buffer + *position
– On return, *position points to the first location in buffer after the packed data

• int MPI_Unpack(void* buffer, int size, int* position, void* unpack_data, int count, MPI_Datatype datatype, MPI_Comm comm)

– Data starting at location buffer + *position is copied into the memory referenced by unpack_data

– count data elements of type datatype are copied into unpack_data
– position is updated to point to the location in buffer after the data just copied

• Messages constructed with MPI_Pack should be communicated with datatype argument MPI_PACKED

Page 12: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 12

Deciding which Method to Use

• Creating a derived type has overhead associated with it.
• Whether it pays off depends on the number of times the type will be used.
• Can avoid system buffering with Pack/Unpack
• Can use variable length messages with Pack/Unpack

Page 13: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 13

Variable Length Messages

float* entries;
int* column_subscripts;
int nonzeros;
int position;
int row_number;
char buffer[HUGE];

if (my_rank == 0) {
    position = 0;
    MPI_Pack(&nonzeros, 1, MPI_INT, buffer, HUGE, &position, MPI_COMM_WORLD);
    MPI_Pack(&row_number, 1, MPI_INT, buffer, HUGE, &position, MPI_COMM_WORLD);
    MPI_Pack(entries, nonzeros, MPI_FLOAT, buffer, HUGE, &position, MPI_COMM_WORLD);
    MPI_Pack(column_subscripts, nonzeros, MPI_INT, buffer, HUGE, &position, MPI_COMM_WORLD);
    MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
}

Page 14: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 14

Variable Length Messages (cont)

else {
    MPI_Recv(buffer, HUGE, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
    position = 0;
    MPI_Unpack(buffer, HUGE, &position, &nonzeros, 1, MPI_INT, MPI_COMM_WORLD);
    MPI_Unpack(buffer, HUGE, &position, &row_number, 1, MPI_INT, MPI_COMM_WORLD);
    entries = (float *) malloc(nonzeros*sizeof(float));
    column_subscripts = (int *) malloc(nonzeros*sizeof(int));
    MPI_Unpack(buffer, HUGE, &position, entries, nonzeros, MPI_FLOAT, MPI_COMM_WORLD);
    MPI_Unpack(buffer, HUGE, &position, column_subscripts, nonzeros, MPI_INT, MPI_COMM_WORLD);
}

Page 15: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 15

Communicators

• A mechanism to treat a subset of processes as a universe for communication (both point-to-point and collective)

• Types
  – intra-communicator
  – inter-communicator

• Components
  – group (ordered collection of processes)
  – context (unique identifier)
  – optional additional information, such as a topology

• Create new communicators from existing communicators

Page 16: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 16

Working with Groups, Contexts, and Communicators

• MPI_Comm_group(MPI_Comm comm, MPI_Group* group)

• MPI_Group_incl(MPI_Group old_group, int new_group_size, int ranks_in_old_group[], MPI_Group* new_group)

• MPI_Comm_create(MPI_Comm old_comm, MPI_Group group, MPI_Comm* new_comm)

Page 17: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 17

Creating a Communicator

/* Create communicator out of first row of q^2 processes organized in a q × q grid in row-major order. */

MPI_Group group_world;

MPI_Group first_row_group;

MPI_Comm first_row_comm;

int* process_ranks;

process_ranks = (int*) malloc(q*sizeof(int));

for (proc = 0; proc < q; proc++)

process_ranks[proc] = proc;

MPI_Comm_group(MPI_COMM_WORLD,&group_world);

MPI_Group_incl(group_world,q,process_ranks,&first_row_group);

MPI_Comm_create(MPI_COMM_WORLD,first_row_group,&first_row_comm);

Page 18: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 18

Using a Communicator

/* Broadcast first block to all processes in the same row. */

int my_rank_in_first_row;

float* A_00;

if (my_rank < q) {

MPI_Comm_rank(first_row_comm,&my_rank_in_first_row);

A_00 = (float *) malloc(n_bar*n_bar*sizeof(float));

if (my_rank_in_first_row == 0)

{ /* initialize A_00 */}

MPI_Bcast(A_00,n_bar*n_bar,MPI_FLOAT,0,first_row_comm);

}

Page 19: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 19

MPI_Comm_split

• int MPI_Comm_split(MPI_Comm old_comm, int split_key, int rank_key, MPI_Comm* new_comm)

MPI_Comm my_row_comm;
int my_row;

/* my_rank is in MPI_COMM_WORLD, q*q = p */

my_row = my_rank/q;

MPI_Comm_split(MPI_COMM_WORLD,my_row,my_rank,&my_row_comm);

/* Creates q new communicators. Processes with the same value of split_key form a new group. The rank in the new communicator is determined by rank_key. Order is preserved. If the same rank_key is used, then the choice is arbitrary. */

Page 20: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 20

Topologies

• Communicators can have attributes. One such attribute is a topology.

• A topology is a mechanism for associating a different addressing scheme with processes belonging to a group.

• Provides a virtual interconnection organization of processes that may be convenient for a particular algorithm.

• Types
  – Cartesian (grid)
  – Graph

Page 21: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 21

Working with Cartesian Topologies

• MPI_Cart_create(MPI_Comm old_comm, int number_of_dims, int dim_sizes[], int wrap_around[], int reorder, MPI_Comm* cart_comm)

• MPI_Cart_rank(MPI_Comm comm, int coordinates[], int* rank)

• MPI_Cart_coords(MPI_Comm comm, int rank, int number_of_dims, int coordinates[])

• MPI_Cart_sub(MPI_Comm cart_comm, int free_coords[], MPI_Comm* new_comm)

Page 22: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 22

Creating a Cartesian Topology

/* create communicator with 2D grid topology. */

MPI_Comm grid_comm;

int dim_sizes[2]; int wrap_around[2]; int reorder = 1;

dim_sizes[0] = dim_sizes[1] = q;

wrap_around[0] = wrap_around[1] = 1;

MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes,wrap_around,reorder,

&grid_comm);

Page 23: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 23

Creating a Sub-Cartesian Topology

int free_coords[2];
MPI_Comm row_comm;
MPI_Comm col_comm;

/* create communicator for each row of grid_comm */
free_coords[0] = 0;
free_coords[1] = 1;

MPI_Cart_sub(grid_comm, free_coords, &row_comm);

/* create communicator for each column of grid_comm */
free_coords[0] = 1;
free_coords[1] = 0;

MPI_Cart_sub(grid_comm, free_coords, &col_comm);

Page 24: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 24

Cartesian Addressing

int coordinates[2];

int my_grid_rank;

MPI_Comm_rank(grid_comm, &my_grid_rank);

MPI_Cart_coords(grid_comm, my_grid_rank,2, coordinates);

/* inverse operation */

MPI_Cart_rank(grid_comm, coordinates, &my_grid_rank);

Page 25: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 25

Matrix Multiplication

• Let A, B be n × n matrices, and C = A*B, i.e., C_ij = Σ_{k=0}^{n-1} A_ik B_kj

void Serial_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n)

{

int i,j,k;

for (i=0; i< n; i++)

for (j=0; j< n;j++) {

C[i][j] = 0.0;

for (k=0; k < n;k++)

C[i][j] = C[i][j] + A[i][k]*B[k][j];

}

}

Page 26: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 26

Parallel Matrix Multiplication

/* distribute matrices by rows. */

void Parallel_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n)

{

for each column of B {

Allgather(column);

Compute dot product of my row of A with column;

}

/* can distribute matrices by blocks of rows. Also B could be distributed by

* columns

*/
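One possible MPI realization of the pseudocode above is sketched below (my construction, not the slide's code). It assumes n is divisible by p, that each process stores its n/p rows of A, B, and C as flat row-major arrays, and it gathers one column of B per iteration with MPI_Allgather.

#include <mpi.h>
#include <stdlib.h>

void Allgather_matrix_mult(float* A_local, float* B_local, float* C_local,
                           int n, MPI_Comm comm) {
    int p, my_rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &my_rank);
    int n_bar = n / p;                                /* rows per process */

    float* col_piece = (float*) malloc(n_bar * sizeof(float));
    float* col       = (float*) malloc(n * sizeof(float));

    for (int j = 0; j < n; j++) {                     /* for each column of B */
        for (int r = 0; r < n_bar; r++)               /* my piece of column j */
            col_piece[r] = B_local[r*n + j];
        MPI_Allgather(col_piece, n_bar, MPI_FLOAT,    /* assemble the full column */
                      col,       n_bar, MPI_FLOAT, comm);
        for (int r = 0; r < n_bar; r++) {             /* dot products with my rows of A */
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A_local[r*n + k] * col[k];
            C_local[r*n + j] = sum;
        }
    }
    free(col_piece);
    free(col);
}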

Page 27: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 27

Cyclic Matrix Multiplication

/* Arrange processors in a circle, storing rows of A and B in each process.
   C_i,* = A_i,0 * B_0,* + … + A_i,n-1 * B_n-1,* */

void Parallel_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n)

{

i = rank;

Blocal = ith row of B; Alocal = ith row of A;

Clocal = 0; /* ith row of C */

dest = (i+1) % n; src = (i-1+n) % n;

for (k=0;k<n;k++) {

/* C_i,* = C_i,* + A_i,j * B_j,*, where j = (i-k) mod n is the row of B currently held */

Clocal = Clocal + Alocal * Blocal;

send_recv(Blocal,dest,src);

}
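A minimal sketch of the cyclic scheme (my construction), under the simplifying assumption p = n so that each process owns exactly one row of A, B, and C; MPI_Sendrecv_replace plays the role of send_recv above. Run on exactly n processes.

#include <mpi.h>

void Cyclic_matrix_mult(float* A_row, float* B_row, float* C_row,
                        int n, MPI_Comm comm) {
    int i;
    MPI_Comm_rank(comm, &i);                 /* my row index                */
    int dest = (i + 1) % n;
    int src  = (i - 1 + n) % n;              /* +n keeps C's % non-negative */

    for (int c = 0; c < n; c++) C_row[c] = 0.0f;

    for (int stage = 0; stage < n; stage++) {
        int j = (i - stage + n) % n;         /* row of B I hold now         */
        for (int c = 0; c < n; c++)          /* C[i][*] += A[i][j]*B[j][*]  */
            C_row[c] += A_row[j] * B_row[c];
        MPI_Sendrecv_replace(B_row, n, MPI_FLOAT,   /* pass B row around the ring */
                             dest, 0, src, 0, comm, MPI_STATUS_IGNORE);
    }
}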

Page 28: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 28

Fox’s Matrix Multiplication

• Let A, B be q × q matrices, and C = A*B

• Organize processors into a sqrt(p) × sqrt(p) grid
• Store the (i,j) block on processor (i,j)

• Broadcast elements of A as k = 0,…,q-1
• Cyclically rotate elements of B.

C_ij = Σ_{k=0}^{q-1} A_ik B_kj = Σ_{k=0}^{q-1} A_i,(i+k) mod q · B_(i+k) mod q, j

Page 29: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 29

Example (q = 3): the A and B blocks multiplied by process (i,j) at each stage of Fox's algorithm.

Stage 0:    A00 B00   A00 B01   A00 B02
            A11 B10   A11 B11   A11 B12
            A22 B20   A22 B21   A22 B22

Stage 1:    A01 B10   A01 B11   A01 B12
            A12 B20   A12 B21   A12 B22
            A20 B00   A20 B01   A20 B02

Stage 2:    A02 B20   A02 B21   A02 B22
            A10 B00   A10 B01   A10 B02
            A21 B10   A21 B11   A21 B12

Page 30: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 30

Fox’s Matrix Multiplication

/* Uses a block matrix allocation. Group processors in a q × q grid, where q = sqrt(p). Processor (i,j) stores Aij and initially Bij

*/

void Parallel_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n)

{

i = my process row; j = my process column;

dest = ((i-1+q) % q, j);

src = ((i+1) % q, j);

for (stage=0;stage < q; stage++) {

k_bar = (i + stage) mod q;

Broadcast A[i,k_bar] across process row i;

C[i,j] = C[i,j] + A[i,k_bar]*B[k_bar,j];

Send B[k_bar,j] to dest;

Receive B[(k_bar+1) mod q,j] from source;

}
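A sketch of one way to express the loop above in MPI (my construction): it assumes p = q², n_bar = n/q, flat row-major n_bar × n_bar blocks, and row/column communicators built with MPI_Cart_sub as on the earlier slides (so a process's rank in row_comm is its column my_col, and its rank in col_comm is its row my_row). The block-multiply helper is included as an assumed triple loop.

#include <mpi.h>
#include <stdlib.h>

/* assumed helper: C += A*B for n_bar x n_bar row-major blocks */
static void Local_matrix_mult(const float* A, const float* B, float* C, int nb) {
    for (int i = 0; i < nb; i++)
        for (int k = 0; k < nb; k++)
            for (int j = 0; j < nb; j++)
                C[i*nb + j] += A[i*nb + k] * B[k*nb + j];
}

void Fox(int q, int n_bar, int my_row, int my_col,
         float* A_local, float* B_local, float* C_local,
         MPI_Comm row_comm, MPI_Comm col_comm) {
    int dest = (my_row - 1 + q) % q;          /* send B block up the column    */
    int src  = (my_row + 1) % q;              /* receive from the row below    */
    float* A_temp = (float*) malloc(n_bar * n_bar * sizeof(float));

    for (int c = 0; c < n_bar * n_bar; c++) C_local[c] = 0.0f;

    for (int stage = 0; stage < q; stage++) {
        int k_bar = (my_row + stage) % q;     /* which block column of A to use */
        /* the owner (column k_bar of my process row) broadcasts its A block   */
        float* A_use = (k_bar == my_col) ? A_local : A_temp;
        MPI_Bcast(A_use, n_bar * n_bar, MPI_FLOAT, k_bar, row_comm);
        Local_matrix_mult(A_use, B_local, C_local, n_bar);
        /* rotate B blocks one step up the process column */
        MPI_Sendrecv_replace(B_local, n_bar * n_bar, MPI_FLOAT,
                             dest, 0, src, 0, col_comm, MPI_STATUS_IGNORE);
    }
    free(A_temp);
}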

Page 31: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 31

Variant of Fox’s Matrix Multiplication

• Let A, B be q × q matrices, and C = A*B

• Organize processors into a sqrt(p) × sqrt(p) grid
• Store the (i, (i+j) mod q) block of A and the ((i+j) mod q, j) block of B on processor (i,j)

• Cyclically rotate rows of A to the left.
• Cyclically rotate columns of B upward.

C_ij = Σ_{k=0}^{q-1} A_ik B_kj = Σ_{k=0}^{q-1} A_i,(i+j+k) mod q · B_(i+j+k) mod q, j

Page 32: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 32

Example (q = 3): the A and B blocks multiplied by process (i,j) at each stage of the variant.

Stage 0:    A00 B00   A01 B11   A02 B22
            A11 B10   A12 B21   A10 B02
            A22 B20   A20 B01   A21 B12

Stage 1:    A01 B10   A02 B21   A00 B02
            A12 B20   A10 B01   A11 B12
            A20 B00   A21 B11   A22 B22

Stage 2:    A02 B20   A00 B01   A01 B12
            A10 B00   A11 B11   A12 B22
            A21 B10   A22 B21   A20 B02

Page 33: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 33

Variant of Fox’s Matrix Multiplication

/* Uses a block matrix allocation. Group processors in a q × q grid, where q = sqrt(p). Processor (i,j) stores A_i,(i+j) mod q and initially B_(i+j) mod q,j

*/

void Parallel_matrix_mult(MATRIX_T A, MATRIX_T B, MATRIX_T C, int n)

{

i = my process row; j = my process column;

coldest = ((i-1+q) % q, j); colsrc = ((i+1) % q, j);

rowdest = (i, (j-1+q) % q); rowsrc = (i, (j+1) % q);

for (stage=0;stage < q; stage++) {

k_bar = (i +j + stage) mod q;

C[i,j] = C[i,j] + A[i,k_bar]*B[k_bar,j];

Send_Recv A[i,k_bar] to/from rowdest,rowsrc;

Send_Recv B[k_bar,j] to/from coldest, colsrc;

}

Page 34: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 34

Performance Model

• Communication cost: C(n) = α + βn, where α = latency and 1/β = bandwidth

• Empirically determine α and β by measuring the time to send/recv messages of different lengths

– Least squares fit
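A sketch of one way to do this (my construction, not from the slide): ranks 0 and 1 ping-pong messages of several sizes, half the round-trip time is taken as the one-way cost, and T(n) = α + βn is fitted by ordinary least squares. Run with at least 2 processes.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NSIZES 8
#define REPS   100

int main(int argc, char* argv[]) {
    int sizes[NSIZES] = {1, 128, 1024, 8192, 65536, 262144, 524288, 1048576};
    double t[NSIZES];
    int rank;
    char* buf = malloc(sizes[NSIZES-1]);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int s = 0; s < NSIZES; s++) {
        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (int r = 0; r < REPS; r++) {
            if (rank == 0) {
                MPI_Send(buf, sizes[s], MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizes[s], MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, sizes[s], MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, sizes[s], MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t[s] = (MPI_Wtime() - start) / (2.0 * REPS);   /* one-way time per message */
    }

    if (rank == 0) {   /* least-squares fit of t = alpha + beta * nbytes */
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int s = 0; s < NSIZES; s++) {
            sx  += sizes[s];            sy  += t[s];
            sxx += (double)sizes[s] * sizes[s];
            sxy += sizes[s] * t[s];
        }
        double beta  = (NSIZES*sxy - sx*sy) / (NSIZES*sxx - sx*sx);
        double alpha = (sy - beta*sx) / NSIZES;
        printf("alpha = %g s (latency), beta = %g s/byte, bandwidth = %g bytes/s\n",
               alpha, beta, 1.0/beta);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}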

Page 35: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 35

Analysis of Matrix Multiplication

• Let A, B be n × n matrices, and C = A*B

• Sequential cost

– T(n) = an³ + bn² + cn + d = Θ(n³)
– Least squares fit

– T(n) ≈ an³

C_ij = Σ_k A_ik B_kj

Page 36: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 36

Analysis of Parallel Matrix Multiplication (Allgather)

• Let A, B be n × n matrices, and C = A*B

• Let p = number of processors
• Store the ith block of n/p rows of A, B, and C on process i

• Parallel computing time: Θ(n³/p + p log(p) + n² log(p))

• Computation time: p(n/p × n × n/p) = n³/p
• Communication time: p log(p)(α + βn²/p)  [Allgather]

C_ij = Σ_k A_ik B_kj

Page 37: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 37

Analysis of Parallel Matrix Multiplication (Cyclic)

• Let A, B be n × n matrices, and C = A*B

• Let p = number of processors
• Store the ith block of n/p rows of A, B, and C on process i

• Parallel computing time: Θ(n³/p + p + n²)

• Computation time: p(n/p × n/p × n) = n³/p
• Communication time: p(α + βn²/p)

C_ij = Σ_k A_ik B_kj

Page 38: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 38

Analysis of Parallel Matrix Multiplication (Fox)

• Let A, B be n × n matrices, and C = A*B

• Let p = q² be the number of processors, organized in a q × q grid
• Store the (i,j)th n/q × n/q block of A, B, and C on process (i,j)

• Parallel computing time: Θ(n³/p + q log(q) + log(q) n²/q)

• Computation time: q(n/q × n/q × n/q) = n³/q² = n³/p
• Communication time: q log(q)(α + β(n/q)²)

C_ij = Σ_k A_ik B_kj

Page 39: Parallel Processing  (CS 676) Lecture 8:  Grouping Data and Communicators in MPI

Parallel Processing 39

Analysis of Parallel Matrix Multiplication (Fox Variant)

• Let A, B be n × n matrices, and C = A*B

• Let p = q² be the number of processors, organized in a q × q grid
• Store the (i,j)th n/q × n/q block of A, B, and C on process (i,j)

• Parallel computing time: Θ(n³/p + q + n²/q)

• Computation time: q(n/q × n/q × n/q) = n³/q² = n³/p
• Communication time: q(α + β(n/q)²)

C_ij = Σ_k A_ik B_kj