On Optimizing Collective Communication
Avi Purkayastha (UT / Texas Advanced Computing Center)
Ernie Chan, Marcel Heinrich, Robert van de Geijn (UT / Computer Science)

ScicomP 10, August 9-13, Austin, TX
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future Work
Model of Parallel Computation
• Target Architectures
  – distributed-memory parallel architectures
• Indexing
  – p nodes
  – indexed 0 … p – 1
  – each node has one computational processor
• often logically viewed as a linear array

[Figure: nodes arranged as a logical linear array]
Model of Parallel Computation
• Logically Fully Connected
  – a node can send directly to any other node
• Communicating Between Nodes
  – a node can simultaneously receive and send
• Network Conflicts
  – sending over a path between two nodes that is completely occupied
Model of Parallel Computation
• Cost of Communication
  – sending a message of length n between any two nodes costs α + nβ
    • α is the startup cost (latency)
    • β is the per-item transmission cost (bandwidth)
• Cost of Computation
  – cost to perform one arithmetic operation is γ
  – reduction operations: sum, prod, min, max
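As an illustration, the α + nβ cost model above can be written as small helper functions. This is a sketch, not from the talk; the function names and example constants are made up:

```python
# Illustrative helpers for the alpha-beta-gamma cost model described above.
from math import ceil, log2

def send_cost(n, alpha, beta):
    """Cost of sending a message of n items between any two nodes: alpha + n*beta."""
    return alpha + n * beta

def mst_broadcast_cost(p, n, alpha, beta):
    """Minimum-spanning-tree broadcast: ceil(log2 p) steps, each a full-length send."""
    return ceil(log2(p)) * send_cost(n, alpha, beta)
```

These helpers are reused informally throughout the cost derivations that follow.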
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future Work
Collective Communications
• Broadcast
• Reduce(-to-one)
• Scatter
• Gather
• Allgather
• Reduce-scatter
• Allreduce
Lower Bounds (Latency)

• Broadcast: ⌈log(p)⌉ α
• Reduce(-to-one): ⌈log(p)⌉ α
• Scatter/Gather: ⌈log(p)⌉ α
• Allgather: ⌈log(p)⌉ α
• Reduce-scatter: ⌈log(p)⌉ α
• Allreduce: ⌈log(p)⌉ α
Lower Bounds (Bandwidth)

• Broadcast: nβ
• Reduce(-to-one): nβ + ((p−1)/p) nγ
• Scatter/Gather: ((p−1)/p) nβ
• Allgather: ((p−1)/p) nβ
• Reduce-scatter: ((p−1)/p) nβ + ((p−1)/p) nγ
• Allreduce: 2 ((p−1)/p) nβ + ((p−1)/p) nγ
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future Work
Motivating Example
• We will illustrate the different types of algorithms and implementations using the Reduce-scatter operation.
A building block approach to library implementation
• Short-vector case
• Long-vector case
• Hybrid algorithms
Short-vector case
• Primary concern:
  – algorithms must have low latency cost
• Secondary concerns:
  – algorithms must work for an arbitrary number of nodes
    • in particular, not just for power-of-two numbers of nodes
  – algorithms should avoid network conflicts
    • not absolutely necessary, but nice if possible
Minimum-Spanning Tree based algorithms
• We will show how the following building blocks:
  – broadcast/reduce
  – scatter/gather
• can be implemented using minimum spanning trees embedded in the logical linear array while attaining:
  – minimal latency
  – implementation for arbitrary numbers of nodes
  – no network conflicts
General principles
• message starts on one processor
General principles
• divide logical linear array in half
General principles
• send message to the half of the network that does not contain the current node (root) that holds the message
General principles
• continue recursively in each of the two halves
General principles
• The demonstrated technique directly applies to:
  – broadcast
  – scatter
• The technique can be applied in reverse to:
  – reduce
  – gather
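The recursive halving described above can be sketched as follows. This is an illustrative simulation, not the library's code: `send` is a hypothetical callback recording one message, and the choice of destination node within the other half is arbitrary.

```python
def mst_broadcast(lo, hi, root, send):
    """Minimum-spanning-tree broadcast over nodes lo..hi (inclusive).
    `root` holds the message; send(src, dst) records one point-to-point send.
    Halve the array, send across the split, recurse into both halves."""
    if lo == hi:
        return
    mid = (lo + hi) // 2
    if root <= mid:
        dest = hi            # pick any node in the upper half as the new root there
        send(root, dest)
        mst_broadcast(lo, mid, root, send)
        mst_broadcast(mid + 1, hi, dest, send)
    else:
        dest = lo            # pick any node in the lower half as the new root there
        send(root, dest)
        mst_broadcast(lo, mid, dest, send)
        mst_broadcast(mid + 1, hi, root, send)
```

For p = 8 nodes this generates p − 1 = 7 messages in ⌈log(p)⌉ = 3 concurrent rounds, matching the cost derivations below.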
General principles
• This technique can be used to implement the following building blocks:
  – broadcast/reduce
  – scatter/gather
• using a minimum spanning tree embedded in the logical linear array while attaining:
  – minimal latency
  – implementation for arbitrary numbers of nodes
  – no network conflicts? Yes, on linear arrays
Reduce-scatter (short vector)

• implemented as a Reduce (to the root) followed by a Scatter
[Figure: Reduce — before, each node holds a full vector of partial contributions; after, the root holds the element-wise sum]
Cost of Minimum-Spanning Tree Reduce

⌈log(p)⌉ (α + nβ + nγ)

(number of steps × cost per step)

Notice: attains the lower bound for the latency component
Scatter

[Figure: Scatter — before, the root holds the full vector; after, each node holds its own block]
Cost of Minimum-Spanning Tree Scatter

Σ_{k=1}^{log(p)} (α + (n/2^k) β) = log(p) α + ((p−1)/p) nβ

• Assumption: power-of-two number of nodes

Notice: attains the lower bound for both the latency and bandwidth components
Cost of Reduce/Scatter Reduce-scatter

reduce:  log(p) (α + nβ + nγ)
scatter: log(p) α + ((p−1)/p) nβ

total: 2 log(p) α + ((p−1)/p + log(p)) nβ + log(p) nγ

• Assumption: power-of-two number of nodes

Notice: does not attain the lower bound for the latency or bandwidth components
Recap (short-vector algorithms)

• Reduce: log(p) (α + nβ + nγ)
• Scatter: log(p) α + ((p−1)/p) nβ
• Broadcast: log(p) (α + nβ)
• Gather: log(p) α + ((p−1)/p) nβ
• Allreduce: 2 log(p) α + log(p) n (2β + γ)
• Reduce-scatter: 2 log(p) α + log(p) n (β + γ) + ((p−1)/p) nβ
• Allgather: 2 log(p) α + log(p) nβ + ((p−1)/p) nβ
A building block approach to library implementation
• Short-vector case
• Long-vector case
• Hybrid algorithms
Long-vector case
• Primary concerns:
  – algorithms must have low cost due to vector length
  – algorithms must avoid network conflicts
• Secondary concern:
  – algorithms must work for an arbitrary number of nodes
    • in particular, not just for power-of-two numbers of nodes
Long-vector building blocks
• We will show how the following building blocks:
  – allgather/reduce-scatter
• can be implemented using “bucket” algorithms while attaining:
  – minimal cost due to length of vectors
  – implementation for arbitrary numbers of nodes
  – no network conflicts
• A logical ring can be embedded in a physical linear array
General principles
• This technique can be used to implement the following building blocks:
  – allgather/reduce-scatter
• Subvectors of data travel around the ring, like buckets, one step at a time until all data is collected, attaining:
  – minimal cost due to length of vectors
  – implementation for arbitrary numbers of nodes
  – no network conflicts
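The bucket reduce-scatter can be sketched as a simulation over plain Python lists. This is illustrative only (no real communication; the indexing convention is one common choice): each node's vector is split into p blocks, and at each of the p − 1 steps every node forwards one partially reduced block to its ring successor.

```python
def bucket_reduce_scatter(vectors):
    """Ring ('bucket') reduce-scatter simulated on p nodes.
    vectors[i] is node i's input, a list of p equal blocks; after p - 1
    steps node i holds the fully reduced block (i + 1) % p."""
    p = len(vectors)
    # acc[i][b] is node i's current partial sum for block b
    acc = [[blk[:] for blk in v] for v in vectors]
    for step in range(p - 1):
        # snapshot all sends first so one step's sends use pre-step values
        sent = [(i, (i - step) % p, acc[i][(i - step) % p]) for i in range(p)]
        for i, b, blk in sent:
            dst = (i + 1) % p
            # receiver adds the incoming partial block into its accumulator
            acc[dst][b] = [x + y for x, y in zip(acc[dst][b], blk)]
    # in the real algorithm each node keeps only its own reduced block
    return acc
```

After p − 1 steps each block has visited every node once, which is why the bandwidth and computation terms, (p−1)/p nβ and (p−1)/p nγ, match the lower bounds derived below.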
Reduce-scatter
[Figure: bucket Reduce-scatter — before, each node holds a full vector of partial contributions; after, each node holds one fully reduced block]
Cost of Bucket Reduce-scatter

(p−1) (α + (n/p) β + (n/p) γ) = (p−1) α + ((p−1)/p) nβ + ((p−1)/p) nγ

(number of steps × cost per step)

Notice: attains the lower bound for the bandwidth and computation components
Recap (long-vector algorithms)

• Reduce-scatter: (p−1) α + ((p−1)/p) n (β + γ)
• Scatter: log(p) α + ((p−1)/p) nβ
• Allgather: (p−1) α + ((p−1)/p) nβ
• Gather: log(p) α + ((p−1)/p) nβ
• Allreduce: 2 (p−1) α + ((p−1)/p) n (2β + γ)
• Reduce: (p−1 + log(p)) α + ((p−1)/p) n (2β + γ)
• Broadcast: (log(p) + p−1) α + 2 ((p−1)/p) nβ
A building block approach to library implementation
• Short-vector case
• Long-vector case
• Hybrid algorithms
Hybrid algorithms(Intermediate length case)
• Algorithms must balance latency, cost due to vector length, and network conflicts
General principles
• View p nodes as a 2-dimensional mesh
  – p = r × c, row and column dimensions
• Perform the operation in each dimension
  – e.g., reduce-scatter within columns, followed by reduce-scatter within rows
General principles
• Many different combinations of short- and long-vector algorithms are possible
• Generally, try to reduce the vector length passed to the next dimension
• Can use more than 2 dimensions
Example: 2D Scatter/Allgather Broadcast

• Scatter in columns
• Scatter in rows
• Allgather in rows
• Allgather in columns
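The four steps above can be sketched as a toy simulation over plain lists (no real communication; block layout and helper names are assumptions, with the root at mesh position (0, 0)):

```python
def broadcast_2d(data, r, c):
    """2D scatter/allgather broadcast on an r x c mesh, simulated.
    `data` starts on node (0, 0), split into r*c equal blocks; returns the
    full vector as reassembled by every node after the allgather phases."""
    p = r * c
    size = len(data) // p
    blocks = [data[k * size:(k + 1) * size] for k in range(p)]
    # scatter in the root's column: node (i, 0) receives the blocks for row i
    col0 = [blocks[i * c:(i + 1) * c] for i in range(r)]
    # scatter in rows: node (i, j) receives block i*c + j
    mesh = [[col0[i][j] for j in range(c)] for i in range(r)]
    # allgather in rows: every node in row i collects all of row i's blocks
    rows = [[x for node in mesh[i] for x in node] for i in range(r)]
    # allgather in columns: every node collects all rows' blocks
    return [x for i in range(r) for x in rows[i]]
```

Each node ends up holding the complete vector, while each individual message carries only a fraction of it, which is what buys the 2 ((p−1)/p) nβ bandwidth term in the cost below.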
Cost of 2D Scatter/Allgather Broadcast

(log(p) + r + c − 2) α + 2 ((p−1)/p) nβ
Cost comparison

• Option 1: MST Broadcast in columns, MST Broadcast in rows
  log(p) α + log(p) nβ
• Option 2: Scatter in columns, MST Broadcast in rows, Allgather in columns
  (log(p) + c − 1) α + ((2(c − 1) + log(r)) / c) nβ
• Option 3: Scatter in columns, Scatter in rows, Allgather in rows, Allgather in columns
  (log(p) + r + c − 2) α + 2 ((p−1)/p) nβ
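To see how the trade-off plays out, the three option costs can be evaluated numerically. The α, β, and mesh-shape values below are made-up examples, not measurements from the talk:

```python
from math import log2

def option1(p, n, alpha, beta):
    """MST broadcast in columns, then MST broadcast in rows."""
    return log2(p) * alpha + log2(p) * n * beta

def option2(p, r, c, n, alpha, beta):
    """Scatter in columns, MST broadcast in rows, allgather in columns."""
    return (log2(p) + c - 1) * alpha + (2 * (c - 1) + log2(r)) / c * n * beta

def option3(p, r, c, n, alpha, beta):
    """Scatter/scatter then allgather/allgather (the 2D broadcast above)."""
    return (log2(p) + r + c - 2) * alpha + 2 * (p - 1) / p * n * beta

p, r, c = 64, 8, 8
alpha, beta = 10.0, 0.01   # illustrative latency and per-item costs
for n in (1, 1_000, 1_000_000):
    print(n, option1(p, n, alpha, beta),
             option2(p, r, c, n, alpha, beta),
             option3(p, r, c, n, alpha, beta))
```

With these example parameters, Option 1 wins for short vectors (fewest latency terms) and Option 3 wins for long vectors (lowest bandwidth term), which is exactly the short/long/hybrid trade-off the hybrid algorithms navigate.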
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future Work
Testbed Architecture

• Cray-Dell PowerEdge Linux Cluster (Lonestar)
  – Texas Advanced Computing Center, J. J. Pickle Research Campus, The University of Texas at Austin
  – 856 3.06 GHz Xeon processors
  – 410 Dell dual-processor PowerEdge 1750 compute nodes
  – 2 GB of memory per compute node
  – Myrinet-2000 switch fabric
  – mpich-gm library 1.2.5..12a
Performance Results

• Log-log graphs
  – show performance better than linear-linear graphs
• Used one processor per node
• Graphs
  – ‘send-min’ is the minimum time to send between nodes
  – ‘MPI’ is the MPICH implementation
  – ‘short’ is the minimum-spanning-tree algorithm
  – ‘long’ is the bucket algorithm
  – ‘hybrid’ is a 3-dimensional hybrid algorithm
    • the two higher dimensions use the ‘long’ algorithm
    • the lowest dimension uses the ‘short’ algorithm
Conclusions and Future Work

• We have a working prototype of an optimal collective communication library, using optimal algorithms for short, intermediate, and long vector lengths.
• Need to obtain heuristics for cross-over points between the short, hybrid, and long vector algorithms, independent of architecture.
• Need to complete this approach for ALL MPI data types and operations.
• Need to generalize the approach, or develop heuristics, for large SMP nodes or small n-way node clusters.