Basic Communication Operations Alexandre David B2-206

Source: people.cs.aau.dk/~adavid/teaching/MTP-06/06-MVP06-slides.pdf · 2014. 2. 4.


Basic Communication Operations

Alexandre David
B2-206

28-02-2006 Alexandre David, MVP'06 2

Today
• One-to-all broadcast & all-to-one reduction (4.1).
• All-to-all broadcast and reduction (4.2).
• All-reduce and prefix-sum operations (4.3).

Collective Communication Operations
• Represent regular communication patterns.
• Used extensively in most data-parallel algorithms.
• Critical for efficiency.
• Available in most parallel libraries.
• Very useful to “get started” in parallel processing.

Reminder
• Result from previous analysis:
  • Data transfer time is roughly the same between all pairs of nodes.
  • Homogeneity true on modern hardware (randomized routing, cut-through routing…).
  • $t_s + m t_w$
  • Adjust $t_w$ for congestion: effective $t_w$.
• Model: bidirectional links, single port.
• Communication with point-to-point primitives.

Broadcast/Reduction
• One-to-all broadcast:
  • Single process sends identical data to all (or a subset of) processes.
• All-to-one reduction:
  • Dual operation.
  • p processes have m words to send to one destination.
  • Parts of the message need to be combined.

Broadcast/Reduction

[Figure: one-to-all broadcast and all-to-one reduction, labeled "Broadcast" and "Reduce".]

One-to-All Broadcast – Ring/Linear Array
• Naïve approach: send sequentially.
  • Bottleneck.
  • Poor utilization of the network.
• Recursive doubling:
  • Broadcast in $\log p$ steps (instead of p).
  • Divide-and-conquer type of algorithm.
  • Reduction is similar.
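The recursive-doubling idea can be illustrated with a small simulation. This is only a sketch under stated assumptions: p is a power of two, nodes are numbered 0 to p-1 around the ring, and the function name `recursive_doubling_broadcast` is hypothetical (it is not from the slides). In every step, each node that already holds the message sends it to the node half as far away as in the previous step.

```python
def recursive_doubling_broadcast(p, source=0):
    """Return one list of (sender, receiver) pairs per communication step.

    Assumption: p is a power of two; distances are taken modulo p.
    """
    assert p > 0 and p & (p - 1) == 0, "p must be a power of two"
    has_msg = {source}          # nodes that already hold the message
    steps = []
    distance = p // 2
    while distance >= 1:
        # Every current holder sends to the node `distance` positions away.
        transfers = [(s, (s + distance) % p) for s in sorted(has_msg)]
        has_msg.update(receiver for _, receiver in transfers)
        steps.append(transfers)
        distance //= 2          # the distance halves, the number of holders doubles
    return steps


steps = recursive_doubling_broadcast(8)
for number, transfers in enumerate(steps, start=1):
    print(f"step {number}: {transfers}")
```

With p = 8 the simulation needs 3 steps, against 7 sequential sends for the naïve approach.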

Recursive Doubling

[Figure: recursive doubling on an 8-node ring (nodes 0 to 7); the transfers are labeled with their step number, 1 to 3.]

Example: Matrix*Vector
1) 1->all
2) Compute
3) All->1
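The three phases can be sketched sequentially as follows. The data layout is an assumption, since the slide only shows a picture: an n×n matrix with one element per node of an n×n mesh, the vector element x[j] broadcast down column j, and one reduction per row. The function name `matvec_on_mesh` is hypothetical.

```python
def matvec_on_mesh(A, x):
    """Simulate y = A*x with one matrix element per node (assumed layout)."""
    n = len(A)
    # 1) One-to-all broadcast: every node in column j gets a copy of x[j].
    local_x = [[x[j] for j in range(n)] for _ in range(n)]
    # 2) Compute: node (i, j) multiplies its matrix element by its copy of x[j].
    partial = [[A[i][j] * local_x[i][j] for j in range(n)] for i in range(n)]
    # 3) All-to-one reduction: row i sums its partial products into y[i].
    return [sum(partial[i]) for i in range(n)]


y = matvec_on_mesh([[1, 2], [3, 4]], [5, 6])
print(y)
```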

One-to-All Broadcast – Mesh
• Extensions of the linear array algorithm.
  • Rows & columns = arrays.
  • Broadcast on a row, broadcast on columns.
  • Similar for reductions.
  • Generalize for higher dimensions (cubes…).

Broadcast on a Mesh

One-to-All Broadcast – Hypercube
• Hypercube with $2^d$ nodes = d-dimensional mesh with 2 nodes in each direction.
• Similar algorithm in d steps.
• Also in $\log p$ steps.
• Reduction follows the same pattern.

Broadcast on a Hypercube

One-to-All Broadcast – Balanced Binary Tree
• Processing nodes = leaves.
• Hypercube algorithm maps well.
• Similarly good w.r.t. congestion.

Broadcast on a Balanced Binary Tree

Algorithms
• So far we saw pictures.
  • Not enough to implement.
• Precise description
  • to implement.
  • to analyze.
• Description for hypercube.
• Execute the following procedure on all the nodes.

Broadcast Algorithm

[Figure: broadcast on a 3-dimensional hypercube (nodes 000 to 111), shown step by step with the current dimension highlighted.]

Algorithm For Any Source
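A minimal simulation of a hypercube broadcast from an arbitrary source is sketched below. It is not the procedure shown on the slide, which did not survive extraction; it follows the usual textbook scheme, assumed here, of relabeling every node with `my_id XOR source` so that the source behaves like node 0, and then walking the dimensions from highest to lowest. The function name `hypercube_broadcast` and the returned transfer log are hypothetical.

```python
def hypercube_broadcast(d, source, message):
    """Broadcast `message` from `source` on a hypercube with 2**d nodes.

    Returns the final data held by each node and the per-step transfer log.
    """
    p = 1 << d
    data = [None] * p
    data[source] = message
    log = []
    mask = p - 1
    for i in range(d - 1, -1, -1):       # current dimension, highest first
        mask ^= 1 << i                   # clear bit i of the mask
        transfers = []
        for my_id in range(p):
            virtual_id = my_id ^ source  # relabel so that the source is node 0
            # Only nodes whose lower i bits are zero take part in this step;
            # those with bit i equal to 0 already hold the message and send it.
            if (virtual_id & mask) == 0 and (virtual_id & (1 << i)) == 0:
                dest = (virtual_id ^ (1 << i)) ^ source
                transfers.append((my_id, dest))
        for sender, dest in transfers:
            data[dest] = data[sender]
        log.append(transfers)
    return data, log


data, log = hypercube_broadcast(3, source=5, message="X")
for step, transfers in enumerate(log, start=1):
    print(f"step {step}: {transfers}")
```

Setting `source=0` gives the special case in which the virtual labels coincide with the real ones.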

Reduce Algorithm

In a nutshell: reverse the previous one.
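As an illustration of "reversing" the broadcast, here is a sketch of an all-to-one reduction on a hypercube. It assumes p is a power of two and that the combining operator is associative (addition by default); the function name `hypercube_reduce` and the `dest` parameter are hypothetical. The dimensions are visited from lowest to highest, the opposite of the broadcast order.

```python
import operator


def hypercube_reduce(values, dest=0, op=operator.add):
    """Combine one value per node into node `dest` in log2(p) steps."""
    p = len(values)
    d = p.bit_length() - 1
    assert 1 << d == p, "p must be a power of two"
    acc = list(values)                  # acc[k] is the partial result on node k
    mask = 0
    for i in range(d):                  # lowest dimension first
        for my_id in range(p):
            virtual_id = my_id ^ dest   # relabel so that dest is node 0
            # Active nodes have their lower i bits equal to zero; those with
            # bit i set send their partial result across dimension i.
            if (virtual_id & mask) == 0 and (virtual_id & (1 << i)) != 0:
                partner = (virtual_id ^ (1 << i)) ^ dest
                acc[partner] = op(acc[partner], acc[my_id])
        mask ^= 1 << i                  # senders drop out of later steps
    return acc[dest]


total = hypercube_reduce([1, 2, 3, 4, 5, 6, 7, 8])
print(total)
```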

Cost Analysis
• p processes → $\log p$ steps (point-to-point transfers in parallel).
• Each transfer has a time cost of $t_s + t_w m$.
• Total time: $T = (t_s + t_w m)\log p$.
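To get a feel for the formula, the sketch below evaluates it next to a naïve sequential broadcast. The naïve cost of p-1 back-to-back transfers is an assumption made for comparison, and the parameter values are made up for illustration; neither comes from the slides.

```python
import math


def broadcast_time(p, m, ts, tw):
    """T = (ts + tw*m) * log2(p), as on the slide."""
    return (ts + tw * m) * math.log2(p)


def naive_broadcast_time(p, m, ts, tw):
    """Assumed cost of the naive scheme: the source sends p-1 messages in turn."""
    return (ts + tw * m) * (p - 1)


# Hypothetical parameters: startup time 10, per-word time 1, 100 words, 64 nodes.
fast = broadcast_time(p=64, m=100, ts=10, tw=1)
slow = naive_broadcast_time(p=64, m=100, ts=10, tw=1)
print(fast, slow)
```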

All-to-All Broadcast and Reduction
• Generalization of broadcast:
  • Each processor is a source and destination.
  • Several processes broadcast different messages.
• Used in matrix multiplication (and matrix-vector multiplication).
• Dual: all-to-all reduction.

All-to-All Broadcast and Reduction

All-to-All Broadcast – Rings

[Figure: all-to-all broadcast on an 8-node ring (nodes 0 to 7), shown step by step, etc…]

All-to-All Broadcast Algorithm
• Ring: mod p.
• Receive & send – point-to-point.
• Initialize the loop.
• Forward msg.
• Accumulate result.
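The annotations above can be tied together in a short simulation. This is a sketch rather than the slide's own procedure, which was an image: it assumes each node sends to its right neighbour and receives from its left one (indices taken mod p), and the name `ring_all_to_all_broadcast` is hypothetical.

```python
def ring_all_to_all_broadcast(messages):
    """messages[k] is the message of node k; returns what each node has collected."""
    p = len(messages)
    result = [[msg] for msg in messages]  # initialize: own message first
    outgoing = list(messages)             # what each node sends in the next step
    for _ in range(p - 1):
        # All nodes send right and receive from the left at the same time.
        incoming = [outgoing[(k - 1) % p] for k in range(p)]
        for k in range(p):
            result[k].append(incoming[k])  # accumulate result
        outgoing = incoming                # forward the message just received
    return result


collected = ring_all_to_all_broadcast(["a", "b", "c", "d"])
print(collected[0])
```

After p-1 steps every node holds all p messages, each in the order in which they arrived.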

All-to-All Reduce Algorithm

Accumulate and forward.

Last message for my_id.
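A sketch of the ring reduction follows, under the assumptions that the combining operation is addition and that each node sends to its left neighbour and receives from its right one. The layout `msgs[k][j]` (the contribution of node k to the result destined for node j) and the function name are hypothetical.

```python
def ring_all_to_all_reduce(msgs):
    """msgs[k][j]: value that node k contributes to the result of node j."""
    p = len(msgs)
    recv = [0] * p
    for i in range(1, p):
        # Accumulate and forward: node k adds its own contribution for
        # destination (k + i) mod p to the partial sum it received last.
        sent = [msgs[k][(k + i) % p] + recv[k] for k in range(p)]
        # Node k sends to its left neighbour, so it receives from its right one.
        recv = [sent[(k + 1) % p] for k in range(p)]
    # The last message received is the partial result for my_id.
    return [msgs[k][k] + recv[k] for k in range(p)]


contributions = [[10 * k + j for j in range(3)] for k in range(3)]
print(ring_all_to_all_reduce(contributions))
```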

All-to-All Reduce – Rings

[Figure: all-to-all reduce on an 8-node ring (nodes 0 to 7); every node starts with a buffer holding parts 0 to 7, and partial results are passed around the ring step by step.]

All-to-All Broadcast – Meshes
• Two phases:
  • All-to-all on rows – messages of size m.
    • Collect $\sqrt{p}$ messages.
  • All-to-all on columns – messages of size $\sqrt{p} \cdot m$.
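The two phases can be sketched by reusing a ring all-to-all broadcast on every row and then on every column. This is a simulation under assumptions: a square mesh given as a list of rows, rows and columns treated as rings, and hypothetical helper names `ring_allgather` and `mesh_all_to_all_broadcast`.

```python
def ring_allgather(items):
    """All-to-all broadcast on a ring; each node ends with all items in ring order."""
    p = len(items)
    collected = [{k: items[k]} for k in range(p)]
    outgoing = [(k, items[k]) for k in range(p)]   # (origin, item) pairs
    for _ in range(p - 1):
        incoming = [outgoing[(k - 1) % p] for k in range(p)]
        for k, (origin, item) in enumerate(incoming):
            collected[k][origin] = item
        outgoing = incoming
    return [[node[o] for o in range(p)] for node in collected]


def mesh_all_to_all_broadcast(grid):
    """grid[r][c] is the message of node (r, c) on a square mesh."""
    side = len(grid)
    # Phase 1: all-to-all on rows; each node collects the messages of its row.
    after_rows = [ring_allgather(row) for row in grid]
    # Phase 2: all-to-all on columns; each message is now a bundle of `side` items.
    result = [[None] * side for _ in range(side)]
    for c in range(side):
        column = [after_rows[r][c] for r in range(side)]
        gathered = ring_allgather(column)
        for r in range(side):
            result[r][c] = [msg for bundle in gathered[r] for msg in bundle]
    return result


mesh = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
everything = mesh_all_to_all_broadcast(mesh)
print(everything[1][2])
```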

All-to-All Broadcast – Meshes

Algorithm

All-to-All Broadcast – Hypercubes
• Generalization of the mesh algorithm to $\log p$ dimensions.
• Message size doubles at every step.
• Number of steps: $\log p$.

All-to-All Broadcast – Hypercubes

Algorithm

Loop on the dimensions

Exchange messages

Forward (double size)
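The loop described by these annotations can be simulated as below. It is a sketch assuming p is a power of two and that partners across dimension i differ in bit i of their label; the function name and the `sizes` bookkeeping are hypothetical additions used to show the doubling.

```python
def hypercube_all_to_all_broadcast(messages):
    """messages[k] is the message of node k; returns buffers and message sizes."""
    p = len(messages)
    d = p.bit_length() - 1
    assert 1 << d == p, "p must be a power of two"
    result = [[msg] for msg in messages]
    sizes = []
    for i in range(d):                        # loop on the dimensions
        sizes.append(len(result[0]))          # size of the message sent in this step
        # Exchange messages: node k swaps its whole buffer with partner k XOR 2^i.
        received = [result[k ^ (1 << i)] for k in range(p)]
        # Concatenate, so the next message to forward is twice as large.
        result = [result[k] + received[k] for k in range(p)]
    return result, sizes


buffers, sizes = hypercube_all_to_all_broadcast(list(range(8)))
print(sizes, buffers[5])
```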

All-to-All Reduction – Hypercubes

Similar pattern in reverse order.

Combine results

Cost Analysis (Time)
• Ring: $T = (t_s + t_w m)(p-1)$.
• Mesh: $T = (t_s + t_w m)(\sqrt{p}-1) + (t_s + t_w m\sqrt{p})(\sqrt{p}-1) = 2t_s(\sqrt{p}-1) + t_w m(p-1)$.
• Hypercube: $\log p$ steps, message of size $2^{i-1}m$ in step i.
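The slide stops at the per-step message size for the hypercube. Summing the per-step cost over the $\log p$ steps gives the total below; this derivation is added here as a worked step and uses only the symbols already defined ($t_s$, $t_w$, m, p), with $\log$ taken base 2.

```latex
T = \sum_{i=1}^{\log p} \left( t_s + 2^{i-1} t_w m \right)
  = t_s \log p + t_w m \sum_{i=1}^{\log p} 2^{i-1}
  = t_s \log p + t_w m (p - 1)
```

The bandwidth term $t_w m(p-1)$ is the same as for the ring and the mesh; only the startup term shrinks, from $t_s(p-1)$ to $t_s \log p$.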

Dense to Sparser: Congestion

All-Reduce
• Each node starts with a buffer of size m.
• The final result is the same combination of all buffers on every node.
• Same as all-to-one reduce + one-to-all broadcast.
• Different from all-to-all reduce.

[Figure: four nodes holding 1, 2, 3, 4 → every node holds the combination 1234.]

All-Reduce Algorithm
• Use all-to-all broadcast but
  • Combine messages instead of concatenating them.
  • The size of the messages does not grow.
  • Cost (in $\log p$ steps): $T = (t_s + t_w m)\log p$.
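A sketch of this idea on a hypercube: the exchange pattern is the one used for all-to-all broadcast, but each node combines what it receives into a buffer of constant size. It assumes p is a power of two and an operator that is associative and commutative (addition by default); the function name is hypothetical.

```python
import operator


def hypercube_all_reduce(values, op=operator.add):
    """Every node ends up with the combination of all input values."""
    p = len(values)
    d = p.bit_length() - 1
    assert 1 << d == p, "p must be a power of two"
    buf = list(values)
    for i in range(d):
        # Exchange with the partner across dimension i.
        received = [buf[k ^ (1 << i)] for k in range(p)]
        # Combine instead of concatenating: the buffer size stays the same.
        buf = [op(buf[k], received[k]) for k in range(p)]
    return buf


print(hypercube_all_reduce([1, 2, 3, 4]))
```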

Prefix-Sum
• Given p numbers $n_0, n_1, \ldots, n_{p-1}$ (one on each node), the problem is to compute the sums $s_k = \sum_{i=0}^{k} n_i$ for all k between 0 and p-1.
• Initially, $n_k$ is on the node labeled k, and at the end, the same node holds $s_k$.

Prefix-Sum Algorithm

All-reduce

Prefix-sum
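The two annotations suggest that the procedure keeps an all-reduce buffer next to the prefix-sum result. A sketch of that scheme on a hypercube is given below, assuming p is a power of two; the function and variable names are hypothetical. Each node adds every received value to its buffer, but adds it to its result only when it comes from a partner with a smaller label.

```python
def hypercube_prefix_sum(numbers):
    """numbers[k] is held by node k; returns s_k = n_0 + ... + n_k for every k."""
    p = len(numbers)
    d = p.bit_length() - 1
    assert 1 << d == p, "p must be a power of two"
    result = list(numbers)   # prefix-sum part
    msg = list(numbers)      # buffer: running all-reduce sum
    for i in range(d):
        received = [msg[k ^ (1 << i)] for k in range(p)]
        for k in range(p):
            partner = k ^ (1 << i)
            msg[k] += received[k]          # all-reduce: always accumulate
            if partner < k:
                result[k] += received[k]   # prefix-sum: only from lower labels
    return result


print(hypercube_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```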

Prefix-Sum

[Figure: prefix-sum computation on an 8-node hypercube (nodes 0 to 7), shown step by step.]

Buffer = all-reduce sum