Basic Communication Operations
Alexandre David, B2-206
28-02-2006 Alexandre David, MVP'06 2
Today
- One-to-all broadcast & all-to-one reduction (4.1).
- All-to-all broadcast and reduction (4.2).
- All-reduce and prefix-sum operations (4.3).

Collective Communication Operations
- Represent regular communication patterns.
- Used extensively in most data-parallel algorithms.
- Critical for efficiency.
- Available in most parallel libraries.
- Very useful to “get started” in parallel processing.
Reminder
Results from previous analysis:
- Data transfer time is roughly the same between all pairs of nodes. This homogeneity holds on modern hardware (randomized routing, cut-through routing…).
- Cost of one transfer: ts + m·tw. Adjust tw for congestion: use an effective tw.
- Model: bidirectional links, single port.
- Communication built from point-to-point primitives.
Broadcast/Reduction
One-to-all broadcast:
- A single process sends identical data to all (or a subset of) processes.
All-to-one reduction:
- Dual operation.
- p processes each have m words to send to one destination.
- Parts of the messages need to be combined.
Broadcast/Reduction
[Figure: one-to-all broadcast and all-to-one reduce over a set of nodes.]
One-to-All Broadcast – Ring/Linear Array
Naïve approach: send sequentially.
- Bottleneck.
- Poor utilization of the network.
Recursive doubling:
- Broadcast in log p steps (instead of p).
- Divide-and-conquer type of algorithm.
- Reduction is similar.
Recursive Doubling
[Figure: recursive doubling on an 8-node ring (nodes 0–7): step 1 sends across distance 4, step 2 across distance 2, step 3 to the immediate neighbors, reaching all nodes in 3 steps.]
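As a concrete sketch of the idea (not the slides' pseudocode; the function name is ours), recursive doubling can be simulated by pairing nodes whose labels differ in one bit, halving the send distance at each step:

```python
# Sketch: one-to-all broadcast by recursive doubling on p = 2^d nodes.
# Each step, every node holding the message sends it to the node whose
# label differs in the current bit (a point-to-point transfer).
def recursive_doubling_broadcast(p, source=0):
    assert p & (p - 1) == 0, "p must be a power of two"
    has_msg = [False] * p
    has_msg[source] = True
    steps = 0
    d = p.bit_length() - 1
    for i in reversed(range(d)):          # distances p/2, p/4, ..., 1
        for node in range(p):
            partner = node ^ (1 << i)     # flip one address bit
            if has_msg[node] and not has_msg[partner]:
                has_msg[partner] = True   # emulated point-to-point send
        steps += 1
    return has_msg, steps
```

After log p steps every node holds the message, matching the figure's 3 steps for p = 8.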
Example: Matrix*Vector
1) One-to-all broadcast (of the vector).
2) Compute (locally).
3) All-to-one reduction (of the results).
One-to-All Broadcast – Mesh
Extension of the linear array algorithm:
- Rows & columns = linear arrays.
- Broadcast on a row, then broadcast on the columns.
- Similar for reductions.
- Generalizes to higher dimensions (cubes…).
Broadcast on a Mesh
[Figure: one-to-all broadcast on a mesh, first along the source's row, then down all columns.]
One-to-All Broadcast – Hypercube
- A hypercube with 2^d nodes is a d-dimensional mesh with 2 nodes in each dimension.
- Similar algorithm, in d steps, i.e. log p steps.
- Reduction follows the same pattern.
Broadcast on a Hypercube
[Figure: one-to-all broadcast on a 3-dimensional hypercube, one dimension per step.]
One-to-All Broadcast – Balanced Binary Tree
- Processing nodes = leaves.
- The hypercube algorithm maps well onto this topology.
- Similarly good w.r.t. congestion.
Broadcast on a Balanced Binary Tree
[Figure: one-to-all broadcast on a balanced binary tree with processing nodes at the leaves.]
Algorithms
- So far we saw pictures; not enough to implement.
- A precise description is needed:
  - to implement,
  - to analyze.
- Description for the hypercube: execute the following procedure on all the nodes.
Broadcast Algorithm
[Figure, three slides: the broadcast procedure traced on a 3-dimensional hypercube (nodes 000–111), one step per dimension; the “current dimension” mask is reduced at each step until every node holds the message.]
Algorithm For Any Source
- Relabel the nodes: each node uses the virtual id my_id XOR source, which reduces the problem to a broadcast from virtual node 0.
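The mask-based procedure, run symmetrically by all nodes with the XOR relabeling, can be sketched as follows (a simulation, not MPI code; the shared dict stands in for point-to-point sends):

```python
# Sketch: hypercube one-to-all broadcast from an arbitrary source on a
# d-cube. Each node relabels itself with virt = node XOR source; in
# step i only nodes whose lower virtual bits are zero participate, and
# those with virtual bit i clear send across dimension i.
def one_to_all_bc(d, source, message):
    p = 1 << d
    msg_at = {source: message}            # which nodes hold the message
    mask = p - 1                          # all bits set
    for i in reversed(range(d)):          # dimensions d-1 down to 0
        mask ^= 1 << i                    # clear bit i of the mask
        for node in range(p):
            virt = node ^ source          # virtual id (source maps to 0)
            if virt & mask == 0 and virt & (1 << i) == 0 and node in msg_at:
                msg_at[node ^ (1 << i)] = msg_at[node]   # emulated send
    return msg_at
```

After the d steps every node holds the message, regardless of the source label.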
Reduce Algorithm
In a nutshell: reverse the previous one.
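"Reversing" the broadcast can be sketched like this (our simulation, assuming sum as the combining operation): the dimensions are traversed in the opposite order, senders and receivers swap roles, and messages are combined on arrival.

```python
# Sketch: all-to-one sum-reduction on a d-cube, the broadcast run in
# reverse. In step i, each still-active node whose virtual bit i is set
# sends its partial sum to its dimension-i partner, which combines it.
def all_to_one_reduce(d, dest, values):
    p = 1 << d
    acc = list(values)                    # acc[node]: partial sum held
    active = [True] * p
    for i in range(d):                    # dimensions 0 up to d-1
        for node in range(p):
            virt = node ^ dest            # relabel so dest maps to 0
            if active[node] and virt & (1 << i):      # sender this step
                partner = node ^ (1 << i)
                acc[partner] += acc[node]             # combine at receiver
                active[node] = False                  # sender is done
    return acc[dest]
```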
Cost Analysis
- p processes → log p steps (point-to-point transfers in parallel).
- Each transfer has a time cost of ts + tw·m.
- Total time: T = (ts + tw·m)·log p.
All-to-All Broadcast and Reduction
Generalization of broadcast:
- Every process is both a source and a destination.
- Several processes broadcast different messages.
- Used in matrix multiplication (and matrix-vector multiplication).
- Dual operation: all-to-all reduction.
All-to-All Broadcast and Reduction
[Figure: all-to-all broadcast and all-to-all reduction over p nodes.]
All-to-All Broadcast – Rings
[Figure, animated over several slides: all-to-all broadcast on an 8-node ring; in each of the p−1 steps, every node forwards to its successor the message it received in the previous step, accumulating all p messages.]
All-to-All Broadcast Algorithm
- Ring: indices mod p.
- Receive & send are point-to-point operations.
- Initialize the loop with the node's own message; at each step, forward the message received and accumulate the result.
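A minimal sketch of this loop (our simulation; in each step every node receives from its left neighbor and forwards what it received in the previous step):

```python
# Sketch: all-to-all broadcast on a p-node ring. Each node starts with
# its own message; for p-1 steps it receives from node (i-1) mod p,
# accumulates the incoming message, and forwards it next step.
def all_to_all_bc_ring(p):
    result = [[i] for i in range(p)]   # accumulated messages per node
    last = list(range(p))              # message each node forwards next
    for _ in range(p - 1):
        incoming = [last[(i - 1) % p] for i in range(p)]  # recv from left
        for i in range(p):
            result[i].append(incoming[i])                 # accumulate
        last = incoming                                   # forward next step
    return result
```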
All-to-All Reduce Algorithm
- Accumulate and forward at each step.
- The last message received is the final result for my_id.
All-to-All Reduce – Rings
[Figure, two slides: all-to-all reduction on an 8-node ring; at each step a node combines the incoming partial result with its own contribution and forwards it, so that after p−1 steps node k holds the fully reduced buffer k.]
All-to-All Broadcast – Meshes
Two phases:
- All-to-all on the rows – messages of size m. Each node collects √p messages.
- All-to-all on the columns – messages of size √p·m.
All-to-All Broadcast – Meshes
[Figure: the row phase followed by the column phase on a mesh.]
Algorithm
[Pseudocode: the two-phase mesh all-to-all broadcast.]
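The two phases can be sketched as follows (our simulation; for brevity each phase's all-to-all along a line is emulated as a gather over that row or column rather than step by step):

```python
# Sketch: all-to-all broadcast on a side x side mesh (p = side^2).
# Phase 1: all-to-all along each row (messages of size m).
# Phase 2: all-to-all along each column with sqrt(p)-times-larger bundles.
def all_to_all_bc_mesh(side):
    msgs = {(r, c): [r * side + c] for r in range(side) for c in range(side)}
    for r in range(side):                      # phase 1: rows
        row = sorted(m for c in range(side) for m in msgs[(r, c)])
        for c in range(side):
            msgs[(r, c)] = list(row)
    for c in range(side):                      # phase 2: columns
        col = sorted(m for r in range(side) for m in msgs[(r, c)])
        for r in range(side):
            msgs[(r, c)] = list(col)
    return msgs
```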
All-to-All Broadcast – Hypercubes
- Generalization of the mesh algorithm to log p dimensions.
- Message size doubles at every step.
- Number of steps: log p.
All-to-All Broadcast – Hypercubes
[Figure: all-to-all broadcast on a 3-dimensional hypercube; the exchanged messages double in size at each step.]
Algorithm
- Loop over the dimensions.
- Exchange messages with the partner across the current dimension.
- Forward the accumulated message (its size doubles).
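These three bullets can be sketched as a short simulation (function name is ours):

```python
# Sketch: all-to-all broadcast on a d-dimensional hypercube. In step i
# every node exchanges its accumulated buffer with the partner across
# dimension i and appends what it receives, doubling the buffer.
def all_to_all_bc_hypercube(d):
    p = 1 << d
    buf = {node: [node] for node in range(p)}
    for i in range(d):
        new_buf = {}
        for node in range(p):
            partner = node ^ (1 << i)
            new_buf[node] = buf[node] + buf[partner]  # exchange + forward
        buf = new_buf
    return buf
```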
All-to-All Reduction – Hypercubes
- Similar pattern, in reverse order.
- Combine results at each step.
Cost Analysis (Time)
- Ring: T = (ts + tw·m)(p − 1).
- Mesh: T = (ts + tw·m)(√p − 1) + (ts + tw·m·√p)(√p − 1) = 2·ts·(√p − 1) + tw·m·(p − 1).
- Hypercube: log p steps, with a message of size 2^(i−1)·m at step i, so T = Σi=1..log p (ts + 2^(i−1)·tw·m) = ts·log p + tw·m·(p − 1).
Dense to Sparser: Congestion
[Figure: running a denser network's communication pattern on a sparser network creates congestion on the shared links.]
All-Reduce
- Each node starts with a buffer of size m.
- The final result is the same combination of all buffers, available on every node.
- Same as an all-to-one reduce followed by a one-to-all broadcast.
- Different from all-to-all reduce.
[Figure: four nodes holding buffers 1, 2, 3, 4 all end up with the combined buffer (1,2,3,4).]
All-Reduce Algorithm
Use all-to-all broadcast, but:
- Combine messages instead of concatenating them.
- The size of the messages does not grow.
- Cost (in log p steps): T = (ts + tw·m)·log p.
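A minimal sketch, assuming sum as the combining operation (the exchange pattern is the hypercube all-to-all broadcast; only the combine step differs):

```python
# Sketch: all-reduce (sum) on a hypercube with p = 2^d nodes. Same
# pairwise exchanges as all-to-all broadcast, but incoming values are
# added rather than concatenated, so message size stays constant.
def all_reduce_sum_hypercube(values):
    p = len(values)
    assert p & (p - 1) == 0, "p must be a power of two"
    vals = list(values)
    d = p.bit_length() - 1
    for i in range(d):
        vals = [vals[node] + vals[node ^ (1 << i)] for node in range(p)]
    return vals                      # every node holds the full sum
```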
Prefix-Sum
Given p numbers n0, n1, …, n(p−1) (one on each node), the problem is to compute the sums sk = n0 + n1 + … + nk for all k between 0 and p−1. Initially, nk is on the node labeled k, and at the end, the same node holds sk.
Prefix-Sum Algorithm
- Same exchange pattern as all-reduce.
- Each node keeps two buffers: the running all-reduce sum (always exchanged and forwarded) and the prefix-sum result (updated only with contributions from lower-labeled nodes).
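The two-buffer scheme can be sketched as follows (our simulation; `msg` is the forwarded all-reduce buffer, `result` the prefix sum):

```python
# Sketch: prefix sums on a hypercube with p = 2^d nodes. Each node
# exchanges `msg` across one dimension per step and always accumulates
# it; `result` only absorbs contributions from lower-labeled partners.
def prefix_sum_hypercube(values):
    p = len(values)
    assert p & (p - 1) == 0, "p must be a power of two"
    msg = list(values)      # running all-reduce sum (forwarded)
    result = list(values)   # prefix sum s_k accumulated at node k
    d = p.bit_length() - 1
    for i in range(d):
        incoming = [msg[node ^ (1 << i)] for node in range(p)]  # exchange
        for node in range(p):
            partner = node ^ (1 << i)
            msg[node] += incoming[node]
            if partner < node:               # contribution from lower label
                result[node] += incoming[node]
    return result
```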
Prefix-Sum
[Figure: prefix-sum traced on a 3-dimensional hypercube (nodes 0–7), one exchange per dimension; each node's buffer accumulates the all-reduce sum, while the local result only adds contributions from lower-labeled partners.]