Pipelined Broadcast on Ethernet Switched Clusters
Pitch Patarasuk, Ahmad Faraj, Xin Yuan
Department of Computer Science
Florida State University
Tallahassee, FL 32306
Broadcast communication (MPI_Bcast)
Before: n0 holds A B C D
After: n0, n1, n2, n3 each hold A B C D
Let T(msize) = time to send a message of size msize
Broadcast(msize) >= T(msize)
Ethernet Switched Cluster
[Figure: Ethernet switched cluster with multiple interconnected switches]
Problem statement: How to efficiently realize the broadcast operation with large message sizes on Ethernet switched clusters.
Pipelined broadcast can achieve near-optimal results (about T(msize) time for broadcasting a message of size msize).
Finding a contention-free broadcast tree
Finding a good segment size
Traditional Broadcast algorithms
• Linear tree: 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7
Time = (P-1) x T(msize)
• Flat tree: 0 sends directly to 1, 2, 3, 4, 5, 6, 7
Time = (P-1) x T(msize)
• Binary tree: 0 → {1, 2}; 1 → {3, 4}; 2 → {5, 6}; 3 → {7}
Time = 2 x (log2(P+1) - 1) x T(msize)
• k-ary tree (k = 3 shown): root 0; second level 1, 2, 3; third level 4, 5, 6, 7
• Binomial tree
[Figure: binomial tree rooted at node 0 over nodes 0 to 7]
Time = log2(P) x T(msize)
• Scatter/Allgather
            n0        n1        n2        n3
Before      A B C D
Scatter     A         B         C         D
Allgather   A B C D   A B C D   A B C D   A B C D
Time = 2 x T(msize)
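The scatter/allgather scheme above can be illustrated with a small simulation. This is a minimal sketch, not MPICH's implementation: the function name `scatter_allgather` is hypothetical, and the ring-based allgather is an assumption, since the slide does not say which allgather algorithm is used.

```python
def scatter_allgather(message, p):
    """Simulate broadcasting `message` (a list) from node 0 to p nodes.

    Assumption: the allgather phase uses a ring; the slides do not specify it.
    """
    if len(message) % p != 0:
        raise ValueError("message length must be divisible by p in this sketch")
    block = len(message) // p
    blocks = [message[i * block:(i + 1) * block] for i in range(p)]

    # Scatter: node 0 sends block i to node i, so node i holds only block i.
    held = [{i: blocks[i]} for i in range(p)]

    # Ring allgather: in step s, node i forwards block (i - s) mod p
    # to its right neighbour (i + 1) mod p.
    for step in range(p - 1):
        # Read every outgoing block first so all sends in a step are simultaneous.
        transfers = [
            ((i + 1) % p, (i - step) % p, held[i][(i - step) % p])
            for i in range(p)
        ]
        for dest, index, data in transfers:
            held[dest][index] = data

    # Reassemble each node's blocks in order.
    return [[item for k in range(p) for item in held[n][k]] for n in range(p)]


result = scatter_allgather(["A", "B", "C", "D"], 4)
for node, data in enumerate(result):
    print(f"n{node}: {' '.join(data)}")
```

Under the ring assumption, each phase moves roughly one message's worth of data over a node's link, which is the intuition behind the 2 x T(msize) estimate.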
Time Complexity for large messages
Linear tree          (P-1) x T(msize)
Flat tree            (P-1) x T(msize)
Binary tree          2 x (log2(P+1) - 1) x T(msize), approx. 2 x log2(P) x T(msize)
Binomial tree        log2(P) x T(msize)
Scatter/allgather    2 x T(msize)
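To compare the entries in the table above for a concrete cluster size, the formulas can be evaluated directly. The sketch below returns each cost as a multiple of T(msize); the function name `broadcast_cost_factor` and the algorithm labels are hypothetical names chosen for illustration.

```python
import math


def broadcast_cost_factor(algorithm, p):
    """Return the large-message broadcast time as a multiple of T(msize)."""
    if algorithm in ("linear", "flat"):
        return p - 1
    if algorithm == "binary":
        return 2 * (math.log2(p + 1) - 1)
    if algorithm == "binomial":
        return math.log2(p)
    if algorithm == "scatter_allgather":
        return 2
    raise ValueError(f"unknown algorithm: {algorithm}")


for name in ("linear", "flat", "binary", "binomial", "scatter_allgather"):
    print(f"{name:18s} {broadcast_cost_factor(name, 32):.2f} x T(msize)")
```

Only scatter/allgather has a cost that does not grow with P, which is why it is the baseline the pipelined algorithms are measured against for large messages.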
Pipelined Broadcast Algorithm
Linear pipeline: 0 → 1 → 2 → 3
Performance of pipelined broadcast:
Assume no network contention.
A message of size msize is broken into X segments of size msize/X.
H: tree height, D: number of children per node
Size of a pipeline stage: D x T(msize/X)
Total time T: (X + H - 1) x (D x T(msize/X))
When X is much larger than H, T approaches D x T(msize):
Linear tree: H = P, D = 1, T ≈ T(msize)
Binary tree: H = log2(P), D = 2, T ≈ 2 x T(msize)
k-ary tree: H = log_k(P), D = k, T ≈ k x T(msize); in general not as efficient as a binary tree
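The total-time formula above is easy to explore numerically. The sketch below is illustrative only: `pipelined_time` and `best_segment_count` are hypothetical names, and the caller supplies the point-to-point cost function `send_time`. Any latency-plus-bandwidth cost function passed in is an assumed model, not something the slides state.

```python
def pipelined_time(msize, segments, height, children, send_time):
    """Total time (X + H - 1) * (D * T(msize / X)) from the formula above.

    segments = X, height = H, children = D, send_time = T (caller-supplied).
    """
    stage = children * send_time(msize / segments)
    return (segments + height - 1) * stage


def best_segment_count(msize, height, children, send_time, max_segments):
    """Brute-force search for the segment count X that minimizes the model."""
    return min(
        range(1, max_segments + 1),
        key=lambda x: pipelined_time(msize, x, height, children, send_time),
    )


# Pure-bandwidth model T(m) = m: a linear pipeline over P = 8 machines with
# 1000 segments costs 1007 units, close to T(msize) = 1000.
print(pipelined_time(1000, 1000, 8, 1, lambda m: m))  # 1007.0

# Assumed model with a fixed per-segment cost, T(m) = 1 + m: too many
# segments now hurts, so the best segment count is finite.
print(best_segment_count(100, 5, 1, lambda m: 1 + m, 200))
```

With a per-segment overhead in the cost function, the model shows the trade-off behind segment-size selection: more segments shorten each pipeline stage but add overhead per stage.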
Time Complexity for large messages
Pipelined (linear)   T(msize)
Pipelined (binary)   2 x T(msize)
k-ary pipeline       k x T(msize)
Binomial tree        log2(P) x T(msize)
Scatter/allgather    2 x T(msize)
Pipelined broadcast
How to find a contention-free broadcast tree?
How to select the best segment size?
Example of network contention
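• Binary tree: 0 → {1, 2}; 1 → {3, 4}; 2 → {5, 6}; 3 → {7}
Two connected switches: one with n0, n1, n2, n3 and one with n4, n5, n6, n7
There is link contention caused by communications (1→4), (2→5), (2→6), and (3→7), which all cross the link between the two switches.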
• Linear tree
Two connected switches: one with n0, n1, n4, n5 and one with n2, n3, n6, n7
The linear tree 0→1→2→3→…→7 has contention caused by (1→2) and (5→6), which cross the inter-switch link in the same direction.
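The two contention examples above can be checked mechanically. The sketch below is a hypothetical checker, not the authors' tool: it assumes the switches form a tree, that links are full duplex (so only traffic in the same direction on a link contends), and that only switch-to-switch links matter. The switch names "A" and "B" and the function names are made up for illustration.

```python
from collections import defaultdict, deque


def switch_path(adjacency, src, dst):
    """Return the switches on the unique path from src to dst in a switch tree."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    path = [dst]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]


def contended_links(adjacency, switch_of, tree_edges):
    """Map each directed inter-switch link to the tree edges sharing it.

    Only links used by more than one broadcast-tree edge are returned.
    """
    users = defaultdict(list)
    for sender, receiver in tree_edges:
        path = switch_path(adjacency, switch_of[sender], switch_of[receiver])
        for link in zip(path, path[1:]):
            users[link].append((sender, receiver))
    return {link: edges for link, edges in users.items() if len(edges) > 1}


two_switches = {"A": ["B"], "B": ["A"]}

# Binary tree example: n0..n3 on one switch, n4..n7 on the other.
binary_edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6), (3, 7)]
binary_layout = {n: "A" if n < 4 else "B" for n in range(8)}
print(contended_links(two_switches, binary_layout, binary_edges))

# Linear tree example: n0, n1, n4, n5 on one switch, the rest on the other.
linear_edges = [(i, i + 1) for i in range(7)]
linear_layout = {n: "A" if n in (0, 1, 4, 5) else "B" for n in range(8)}
print(contended_links(two_switches, linear_layout, linear_edges))
```

In the linear case, the edge (3, 4) also crosses between the switches but in the opposite direction, so under the full-duplex assumption it is not reported.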
Algorithm for constructing contention free linear tree
Step 1: Traverse all switches using depth-first search (DFS), and number the switches S0, S1, S2, ... in the order in which DFS reaches them.
Step 2: The linear tree consists of all machines on switch S0, followed by all machines on S1, then S2, and so on.
Example of contention free linear tree
Switch S0: n0, n1, n4, n5
Switch S1: n2, n3, n6, n7
Switch S2: n8, n9, n10, n11
Switch S3: n12, n13, n14, n15
Linear tree: n0→n1→n4→n5→n2→n3→n6→n7→n8→n9→…→n15
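The two-step construction can be sketched in a few lines. This is a minimal illustration, not the authors' routine generator; the function names are hypothetical. The switch-to-switch links in the example are also an assumption (S0 linked to S1, and S1 linked to S2 and S3), because the figure's wiring did not survive in this transcript.

```python
def dfs_switch_order(adjacency, root_switch):
    """Step 1: return the switches in the order a DFS from root_switch reaches them."""
    order, seen = [], set()

    def visit(switch):
        seen.add(switch)
        order.append(switch)
        for neighbor in adjacency[switch]:
            if neighbor not in seen:
                visit(neighbor)

    visit(root_switch)
    return order


def contention_free_linear_tree(adjacency, machines_on, root_switch):
    """Step 2: chain all machines switch by switch, following the DFS order."""
    chain = []
    for switch in dfs_switch_order(adjacency, root_switch):
        chain.extend(machines_on.get(switch, []))
    return chain


# Assumed wiring for the example: S0-S1, S1-S2, S1-S3.
links = {"S0": ["S1"], "S1": ["S0", "S2", "S3"], "S2": ["S1"], "S3": ["S1"]}
machines = {
    "S0": [0, 1, 4, 5],
    "S1": [2, 3, 6, 7],
    "S2": [8, 9, 10, 11],
    "S3": [12, 13, 14, 15],
}
print(contention_free_linear_tree(links, machines, "S0"))
```

Because DFS enters and leaves each subtree of switches once, consecutive machines in the chain cross each inter-switch link at most once per direction.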
Algorithm for constructing contention free binary tree
Start with a contention-free linear tree
Recursively divide the tree into 2 sub-trees
Make sure that there cannot be contention
The sub-trees are chosen such that the height of the whole tree is minimal
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Binary tree height
Performance of binary pipeline broadcast depends on the height of the binary tree
Even though a contention-free binary tree may not be a complete binary tree, its height is not much greater than that of a complete binary tree
Average tree heights for 20 randomly generated topologies
Evaluation
Contention-free pipelined algorithms:
Routines are generated from topology information
The generated routines are based on MPICH point-to-point primitives
Linear tree
Binary tree
3-ary tree
Targets for comparison:
MPICH: binomial tree, scatter/allgather
LAM: flat tree, binomial tree
Topology-unaware pipelined linear and binary algorithms
Evaluation
Performance of different pipelined trees (topology 1)
Comparing pipelined broadcast with other schemes
Topology unaware and contention-free pipelined broadcast
Segment size for pipelined broadcast
Conclusions
Pipelined broadcast is faster than the current broadcast algorithms for medium and large messages
Linear pipeline has a completion time roughly equal to T(msize)
Binary pipeline broadcast is best for medium messages
A contention-free broadcast tree is necessary for pipelined algorithms
A good segment size for pipelined broadcast is not difficult to find
Questions?