Lecture 9: MPI continued
David Bindel, 27 Sep 2011


Page 1:

Lecture 9: MPI continued

David Bindel

27 Sep 2011

Page 2:

Logistics

- Matrix multiply is done! Still have to run.
- Small HW 2 will be up before lecture on Thursday, due next Tuesday.
- Project 2 will be posted next Tuesday.
- Email me if interested in Sandia recruiting.
- Also email me if interested in MEng projects.

Page 3:

Previously on Parallel Programming

Can write a lot of MPI code with the 6 operations we've seen:

- MPI_Init
- MPI_Finalize
- MPI_Comm_size
- MPI_Comm_rank
- MPI_Send
- MPI_Recv

... but there are sometimes better ways. Decide on communication style using simple performance models.
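As a reminder of how these fit together, here is a minimal sketch (mine, not from the slides) that uses only these six calls: every rank sends one number to rank 0, which prints them.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int nproc, myid;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        /* Rank 0 collects one number from every other rank */
        for (int src = 1; src < nproc; ++src) {
            double x;
            MPI_Recv(&x, 1, MPI_DOUBLE, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Got %g from rank %d\n", x, src);
        }
    } else {
        double x = myid;  /* something process-specific to send */
        MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}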

Page 4:

Communication performance

- Basic info: latency and bandwidth
- Simplest model: t_comm = α + βM
- More realistic: distinguish CPU overhead from "gap" (~ inverse bandwidth)
- Different networks have different parameters
- Can tell a lot via a simple ping-pong experiment (see the sketch below)
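For concreteness, a ping-pong timer might look like the sketch below (my own illustration, not the code used for the measurements that follow). Rank 0 sends nbytes to rank 1, rank 1 echoes it back, and the one-way time is half the averaged round trip; fitting a line to these times versus nbytes gives α (intercept) and β (slope).

#include <mpi.h>

/* Time one-way latency for nbytes-sized messages between ranks 0 and 1.
   Assumes MPI_Init has been called and buf has room for nbytes. */
double ping_pong(int nbytes, int ntrials, char* buf) {
    int myid;
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ntrials; ++i) {
        if (myid == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (myid == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    return (t1 - t0) / (2*ntrials);  /* one-way time per message */
}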

Page 5:

OpenMPI on crocus

- Two quad-core chips per node, five nodes
- Heterogeneous network:
  - Crossbar switch between cores (?)
  - Bus between chips
  - Gigabit Ethernet between nodes
- Default process layout (16-process example):
  - Processes 0-3 on first chip, first node
  - Processes 4-7 on second chip, first node
  - Processes 8-11 on first chip, second node
  - Processes 12-15 on second chip, second node
- Test ping-pong from 0 to 1, 7, and 8.

Page 6:

Approximate α-β parameters (on node)

[Plot: time/msg (microsec) vs. message size (kilobytes); measured and modeled curves for ping-pong from rank 0 to rank 1 and to rank 7.]

α1 ≈ 1.0 × 10^-6, β1 ≈ 5.7 × 10^-10

α2 ≈ 8.4 × 10^-7, β2 ≈ 6.8 × 10^-10

Page 7:

Approximate α-β parameters (cross-node)

[Plot: time/msg (microsec) vs. message size (kilobytes); measured and modeled curves for ping-pong from rank 0 to ranks 1, 7, and 8.]

α3 ≈ 7.1 × 10^-5, β3 ≈ 9.7 × 10^-9
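As a rough sanity check on these numbers: plugging a 16 KB message into t_comm = α + βM gives about 1.0 × 10^-6 + 5.7 × 10^-10 · 16384 ≈ 10 μs on-node (the 0-to-1 link) versus 7.1 × 10^-5 + 9.7 × 10^-9 · 16384 ≈ 230 μs across nodes (the 0-to-8 link): roughly a 20× gap, coming from both the higher latency and the lower bandwidth of the Ethernet link.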

Page 8:

Moral

Not all links are created equal!

- Might handle with mixed paradigm
  - OpenMP on node, MPI across
  - Have to worry about thread-safety of MPI calls
- Can handle purely within MPI
- Can ignore the issue completely?

For today, we’ll take the last approach.

Page 9:

Reminder: basic send and recv

MPI_Send(buf, count, datatype,
         dest, tag, comm);

MPI_Recv(buf, count, datatype,
         source, tag, comm, status);

MPI_Send and MPI_Recv are blocking:

- Send does not return until data is in system
- Recv does not return until data is ready

Page 10:

Blocking and buffering

[Diagram: data flows from P0 through an OS buffer, across the network, into an OS buffer on P1, and then into P1's data.]

Block until data “in system” — maybe in a buffer?

Page 11:

Blocking and buffering

[Diagram: data flows from P0 directly across the network into P1's data, with no intermediate buffer.]

Alternative: don’t copy, block until done.

Page 12:

Problem 1: Potential deadlock

[Diagram: both processes call Send and block.]

Both processors wait to finish send before they can receive! May not happen if lots of buffering on both sides.
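In code, the dangerous pattern is a symmetric exchange like the following sketch (hypothetical helper; other is the partner rank): if the messages are too large for MPI to buffer internally, both MPI_Send calls block forever.

/* Both ranks send first, then receive: may deadlock for large n */
void exchange_unsafe(double* sendbuf, double* recvbuf, int n, int other) {
    MPI_Send(sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}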

Page 13:

Solution 1: Alternating order

[Diagram: one process sends then receives; the other receives then sends.]

Could alternate who sends and who receives.
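One hedged sketch of the idea (assuming exchange partners have opposite parity, e.g. even/odd neighbors in a 1D layout): even ranks send first, odd ranks receive first.

void exchange_alternate(double* sendbuf, double* recvbuf, int n,
                        int myid, int other) {
    if (myid % 2 == 0) {   /* even ranks: send, then receive */
        MPI_Send(sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {               /* odd ranks: receive, then send */
        MPI_Recv(recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
    }
}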

Page 14:

Solution 2: Combined send/recv

[Diagram: both processes call Sendrecv.]

Common operations deserve explicit support!

Page 15:

Combined sendrecv

MPI_Sendrecv(sendbuf, sendcount, sendtype,
             dest, sendtag,
             recvbuf, recvcount, recvtype,
             source, recvtag,
             comm, status);

Blocking operation, combines send and recv to avoid deadlock.
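For example, a ring shift might look like this sketch (names mine): each process sends to its right neighbor and receives from its left neighbor in a single call, with no deadlock risk.

void ring_shift(double* sendbuf, double* recvbuf, int n,
                int myid, int nproc) {
    int right = (myid + 1) % nproc;
    int left  = (myid + nproc - 1) % nproc;
    /* Send to the right, receive from the left, in one blocking call */
    MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, right, 0,
                 recvbuf, n, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}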

Page 16:

Problem 2: Communication overhead

[Diagram: both processes alternate Sendrecv calls, with idle waiting between them.]

Partial solution: nonblocking communication

Page 17:

Blocking vs non-blocking communication

- MPI_Send and MPI_Recv are blocking
  - Send does not return until data is in system
  - Recv does not return until data is ready
  - Cons: possible deadlock, time wasted waiting
- Why blocking?
  - Overwrite buffer during send ⇒ evil!
  - Read buffer before data ready ⇒ evil!
- Alternative: nonblocking communication
  - Split into distinct initiation/completion phases
  - Initiate send/recv and promise not to touch buffer
  - Check later for operation completion

Page 18:

Overlap communication and computation

[Diagram: each process starts its recv and send, computes without touching the buffers, then ends the send and recv.]

Page 19:

Nonblocking operations

Initiate message:

MPI_Isend(start, count, datatype, dest,
          tag, comm, request);
MPI_Irecv(start, count, datatype, source,
          tag, comm, request);

Wait for message completion:

MPI_Wait(request, status);

Test for message completion:

MPI_Test(request, flag, status);
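Put together, a nonblocking exchange looks roughly like this sketch (hypothetical helper; sendbuf and recvbuf must not be touched between the I-calls and the waits):

void exchange_overlap(double* sendbuf, double* recvbuf, int n, int other) {
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[1]);
    /* ... compute on data that does not involve sendbuf or recvbuf ... */
    MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);  /* recvbuf now safe to read  */
    MPI_Wait(&reqs[1], MPI_STATUS_IGNORE);  /* sendbuf now safe to reuse */
}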

Page 20:

Multiple outstanding requests

Sometimes useful to have multiple outstanding messages:

MPI_Waitall(count, requests, statuses);
MPI_Waitany(count, requests, index, status);
MPI_Waitsome(count, requests, indices, statuses);

Multiple versions of test as well.
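A typical use is a halo exchange with both neighbors at once; in the hypothetical sketch below, u holds n interior values with ghost cells at u[0] and u[n+1], and MPI_Waitall blocks until all four transfers have completed.

void halo_exchange(double* u, int n, int left, int right) {
    MPI_Request reqs[4];
    MPI_Irecv(&u[0],   1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&u[n+1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&u[1],   1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&u[n],   1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
}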

Page 21:

Other send/recv variants

Other variants of MPI_Send:

- MPI_Ssend (synchronous): does not complete until receive has begun
- MPI_Bsend (buffered): user provides buffer (via MPI_Buffer_attach)
- MPI_Rsend (ready): user guarantees receive has already been posted
- Can combine modes (e.g. MPI_Issend)

MPI_Recv receives anything.
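For instance, a buffered send might look like the sketch below (my illustration, not a recommendation): attach a buffer big enough for the message plus MPI_BSEND_OVERHEAD, call Bsend, then detach (which waits for buffered data to drain).

#include <stdlib.h>

void buffered_send(double* data, int n, int dest) {
    int bufsize = n*sizeof(double) + MPI_BSEND_OVERHEAD;
    char* buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);
    /* Returns once the message is copied into the attached buffer */
    MPI_Bsend(data, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    /* Detach blocks until buffered messages have been delivered */
    MPI_Buffer_detach(&buf, &bufsize);
    free(buf);
}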

Page 22:

Another approach

- Send/recv is one-to-one communication
- An alternative is one-to-many (and vice versa):
  - Broadcast to distribute data from one process
  - Reduce to combine data from all processors
  - Operations are called by all processes in communicator

Page 23:

Broadcast and reduce

MPI_Bcast(buffer, count, datatype,
          root, comm);

MPI_Reduce(sendbuf, recvbuf, count, datatype,
           op, root, comm);

- buffer is copied from root to others
- recvbuf receives result only at root
- op ∈ { MPI_MAX, MPI_SUM, ... }

Page 24:

Example: basic Monte Carlo

#include <stdio.h>
#include <mpi.h>

void run_trials(int myid, int nproc, int ntrials);  /* defined elsewhere */

int main(int argc, char** argv) {
    int nproc, myid, ntrials;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) {
        printf("Trials per CPU:\n");
        scanf("%d", &ntrials);
    }
    MPI_Bcast(&ntrials, 1, MPI_INT,
              0, MPI_COMM_WORLD);
    run_trials(myid, nproc, ntrials);
    MPI_Finalize();
    return 0;
}

Page 25:

Example: basic Monte Carlo

Let sum[0] = Σ_i X_i and sum[1] = Σ_i X_i².

#include <math.h>  /* for sqrt */

void run_mc(int myid, int nproc, int ntrials) {
    double sums[2] = {0,0};
    double my_sums[2] = {0,0};
    /* ... run ntrials local experiments, accumulating into my_sums ... */
    MPI_Reduce(my_sums, sums, 2, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) {
        int N = nproc*ntrials;
        double EX  = sums[0]/N;
        double EX2 = sums[1]/N;
        printf("Mean: %g; err: %g\n",
               EX, sqrt((EX2 - EX*EX)/N));  /* standard error of the mean */
    }
}

Page 26:

Collective operations

- Involve all processes in communicator
- Basic classes:
  - Synchronization (e.g. barrier)
  - Data movement (e.g. broadcast)
  - Computation (e.g. reduce)

Page 27:

Barrier

MPI_Barrier(comm);

Not much more to say. Not needed that often.

Page 28:

Broadcast

[Diagram: Broadcast copies A from P0 to all of P0-P3.]

Page 29:

Scatter/gather

[Diagram: Scatter distributes blocks A, B, C, D from P0 to P0-P3; Gather collects them back to P0.]
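These have a symmetric interface; a hedged sketch (array names mine) where the root hands each process one block, everyone works on its block, and the root collects the results:

void scatter_work_gather(double* all, double* block, int blocksize, int root) {
    /* all is only significant at root; block holds blocksize entries everywhere */
    MPI_Scatter(all,   blocksize, MPI_DOUBLE,
                block, blocksize, MPI_DOUBLE, root, MPI_COMM_WORLD);
    /* ... process the local block ... */
    MPI_Gather(block, blocksize, MPI_DOUBLE,
               all,   blocksize, MPI_DOUBLE, root, MPI_COMM_WORLD);
}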

Page 30:

Allgather

[Diagram: Allgather leaves every process P0-P3 with all of A, B, C, D.]

Page 31:

Alltoall

[Diagram: Alltoall. P0 starts with A0-A3, P1 with B0-B3, P2 with C0-C3, P3 with D0-D3; afterwards P0 holds A0, B0, C0, D0, P1 holds A1, B1, C1, D1, and so on (a block transpose).]

Page 32:

Reduce

[Diagram: Reduce combines A, B, C, D from P0-P3 into the single result ABCD at root P0.]

Page 33:

Scan

[Diagram: Scan (prefix reduction) leaves P0 with A, P1 with AB, P2 with ABC, P3 with ABCD.]
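For example, an inclusive prefix sum across ranks might look like this sketch (names mine); it is handy for computing offsets when each process owns a different-sized chunk.

double prefix_sum(double myvalue) {
    double result;
    /* Inclusive scan: result on rank i is the sum over ranks 0..i */
    MPI_Scan(&myvalue, &result, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return result;
}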

Page 34:

The kitchen sink

- In addition to the above, there are vector variants (v suffix), more All variants (Allreduce), Reduce_scatter, ...
- MPI-2 adds one-sided communication (put/get)
- MPI is not a small library!
- But a small number of calls goes a long way:
  - Init/Finalize
  - Comm_rank, Comm_size
  - Send/Recv variants and Wait
  - Allreduce, Allgather, Bcast
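As one closing illustration (mine, not from the slides), Allreduce is the workhorse for things like global dot products, where every process needs the combined result:

double parallel_dot(const double* x, const double* y, int nlocal) {
    double local = 0, global = 0;
    for (int i = 0; i < nlocal; ++i)
        local += x[i]*y[i];
    /* Sum the partial dot products; every rank gets the total */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}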