Pipelined Broadcast on Ethernet Switched Clusters Pitch Patarasuk, Ahmad Faraj, Xin Yuan Department of Computer Science Florida State University Tallahassee, FL 32306

Broadcast communication (MPI_Bcast)

Before: only the root node holds the message A B C D; the other nodes hold nothing.
After: every node (n0, n1, n2, n3) holds A B C D.

Let T(msize) = time to send a message of size msize.
Broadcast(msize) >= T(msize)

Ethernet Switched Cluster

[Figure: example cluster whose machines are connected through four Ethernet switches]

Problem statement: how to efficiently realize the broadcast operation for large messages on Ethernet switched clusters.

Using pipelined broadcast can achieve near-optimal results (about T(msize) time for broadcasting a message of size msize). Key issues:

• Finding a contention-free broadcast tree
• Finding a good segment size

Traditional Broadcast algorithms

• Linear tree: 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7
  Time = (P-1) x T(msize)

• Flat tree: 0 sends directly to each of 1, 2, 3, 4, 5, 6, 7
  Time = (P-1) x T(msize)

• Binary tree (nodes by level): 0 | 1 2 | 3 4 5 6 | 7
  Time = 2 x (log2(P+1) - 1) x T(msize)

• k-ary tree (nodes by level, k = 3 shown): 0 | 1 2 3 | 4 5 6 7

• Binomial tree
  [Figure: binomial tree rooted at node 0 over nodes 0 to 7]
  Time = log2(P) x T(msize)

• Scatter/Allgather

  Before:          only the root holds A B C D
  After scatter:   n0 holds A, n1 holds B, n2 holds C, n3 holds D
  After allgather: every node holds A B C D

  Time = 2 x T(msize)

Time complexity for large messages

Algorithm            Time
Linear tree          (P-1) x T(msize)
Flat tree            (P-1) x T(msize)
Binary tree          2 x (log2(P+1) - 1) x T(msize), approx. 2 x log2(P) x T(msize)
Binomial tree        log2(P) x T(msize)
Scatter/allgather    2 x T(msize)
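
The formulas above can be compared numerically. The following Python sketch is only an illustration: the function name `traditional_broadcast_costs` is hypothetical, and each cost is returned as a multiple of T(msize) exactly as the formulas are written here, without rounding the logarithms.

```python
import math


def traditional_broadcast_costs(p):
    """Large-message broadcast time for p processes, in units of T(msize).

    The expressions are copied from the time-complexity table; the
    function name and dictionary keys are illustrative only.
    """
    return {
        "linear tree": p - 1,
        "flat tree": p - 1,
        "binary tree": 2 * (math.log2(p + 1) - 1),
        "binomial tree": math.log2(p),
        "scatter/allgather": 2,
    }


if __name__ == "__main__":
    for name, cost in traditional_broadcast_costs(16).items():
        print(f"{name:18s} {cost:6.2f} x T(msize)")
```

Only scatter/allgather stays constant as P grows; the tree-based algorithms grow with P or log2(P).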

Pipelined Broadcast Algorithm

Linear pipeline: 0 -> 1 -> 2 -> 3

Performance of pipelined broadcast

Assume there is no network contention and that a message of size msize is broken into X segments of size msize/X.
H: tree height, D: the number of children per node

Time of one pipeline stage: D x T(msize/X)
Total time T: (X + H - 1) x (D x T(msize/X))

When X is much larger than H and T(msize/X) is close to T(msize)/X, the total approaches D x T(msize):

• Linear tree: H = P, D = 1, T approx. T(msize)
• Binary tree: H = log2(P), D = 2, T approx. 2 x T(msize)
• k-ary tree: H = log_k(P), D = k, T approx. k x T(msize); in general not as efficient as the binary tree
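
The total-time formula can be evaluated directly. The Python sketch below is a minimal illustration, not the authors' code: `send_time` assumes a simple linear cost model (startup latency `alpha` plus `msize / beta`), which the slides do not specify, and all names and default values are hypothetical.

```python
def send_time(msize, alpha=1e-4, beta=1e8):
    """Assumed cost model: alpha seconds of startup plus msize bytes at beta bytes/s."""
    return alpha + msize / beta


def pipelined_time(msize, segments, height, children, t=send_time):
    """Total time (X + H - 1) * D * T(msize / X).

    segments = X, height = H, children = D; t is the point-to-point
    cost function T.
    """
    stage = children * t(msize / segments)
    return (segments + height - 1) * stage


if __name__ == "__main__":
    msize, p = 2**20, 16
    for x in (1, 16, 256, 4096):
        linear = pipelined_time(msize, x, height=p, children=1)
        print(f"X={x:5d}  linear pipeline = {linear / send_time(msize):.2f} x T(msize)")
```

With a pure-bandwidth model (`t=lambda m: m`), a linear pipeline with P = 16 and X = 1000 costs 1.015 x T(msize), which shows how the (H - 1) fill-up stages are amortized over many segments. With a nonzero `alpha`, very large X values stop helping because each segment pays the startup cost.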

Time complexity for large messages

Algorithm            Time
Pipelined (linear)   T(msize)
Pipelined (binary)   2 x T(msize)
k-ary pipeline       k x T(msize)
Binomial tree        log2(P) x T(msize)
Scatter/allgather    2 x T(msize)

Pipelined broadcast

• How to find a contention-free broadcast tree?
• How to select the best segment size?

Example of network contention

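• Binary tree (nodes by level): 0 | 1 2 | 3 4 5 6 | 7

Two connected switches: n0, n1, n2, n3 are attached to one switch and n4, n5, n6, n7 to the other.

Link contention is caused by the communications (1 -> 4), (2 -> 5), (2 -> 6), and (3 -> 7), which all cross the link between the two switches in the same direction.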

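• Linear tree

Two connected switches: n0, n1, n4, n5 are attached to one switch and n2, n3, n6, n7 to the other.

The linear tree 0 -> 1 -> 2 -> 3 -> ... -> 7 has contention caused by (1 -> 2) and (5 -> 6), which both cross the link between the two switches in the same direction.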
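
Both contention examples can be checked mechanically. The Python sketch below is an illustration under stated assumptions: the switches form a tree, only switch-to-switch links are examined (a node's sends to its own children are treated as serialized), the switch names `S0`/`S1` and the function names are hypothetical, and the binary-tree edges other than the four named above are inferred from the level layout of the figure.

```python
from collections import defaultdict, deque


def switch_path(links, src, dst):
    """Switch sequence from src to dst in a connected tree of switches (BFS)."""
    adj = defaultdict(list)
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    parent = {src: None}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            break
        for nxt in adj[cur]:
            if nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    path = [dst]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]


def find_contention(links, home, comms):
    """Directed inter-switch links used by more than one communication.

    links: undirected switch-to-switch links; home: machine -> switch;
    comms: (sender, receiver) pairs that are active at the same time.
    """
    usage = defaultdict(list)
    for src, dst in comms:
        path = switch_path(links, home[src], home[dst])
        for hop in zip(path, path[1:]):
            usage[hop].append((src, dst))
    return {hop: users for hop, users in usage.items() if len(users) > 1}


links = [("S0", "S1")]

# Binary-tree example: n0..n3 on one switch, n4..n7 on the other.
home_binary = {n: ("S0" if n < 4 else "S1") for n in range(8)}
binary_tree = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6), (3, 7)]

# Linear-tree example: n0, n1, n4, n5 on one switch, the rest on the other.
home_linear = {0: "S0", 1: "S0", 4: "S0", 5: "S0",
               2: "S1", 3: "S1", 6: "S1", 7: "S1"}
linear_tree = [(i, i + 1) for i in range(7)]

if __name__ == "__main__":
    print(find_contention(links, home_binary, binary_tree))
    print(find_contention(links, home_linear, linear_tree))
```

An empty result means that no inter-switch link is shared in the same direction, which is the property a contention-free tree needs.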

Algorithm for constructing contention free linear tree

Step 1: Traverse all switches using depth-first search (DFS) and name the switches S0, S1, S2, ... in the order in which the DFS first reaches them.
Step 2: The linear tree consists of all machines attached to switch S0, followed by all machines attached to S1, then S2, and so on.

Example of contention free linear tree

Switch S0: n0, n1, n4, n5
Switch S1: n2, n3, n6, n7
Switch S2: n8, n9, n10, n11
Switch S3: n12, n13, n14, n15

Linear tree: n0 -> n1 -> n4 -> n5 -> n2 -> n3 -> n6 -> n7 -> n8 -> n9 -> ... -> n15
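
The two-step construction can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and because the links between the four switches are not recoverable from the figure, the chain S0 - S1 - S2 - S3 used here is an assumption.

```python
def dfs_switch_order(adjacency, root):
    """Step 1: list switches in the order a depth-first search first reaches them."""
    order, seen = [], set()

    def visit(switch):
        seen.add(switch)
        order.append(switch)
        for neighbor in adjacency[switch]:
            if neighbor not in seen:
                visit(neighbor)

    visit(root)
    return order


def linear_tree(adjacency, machines, root):
    """Step 2: all machines of the first switch, then of the second, and so on."""
    chain = []
    for switch in dfs_switch_order(adjacency, root):
        chain.extend(machines[switch])
    return chain


# Assumed switch links (a chain); the machine placement is from the example.
adjacency = {"S0": ["S1"], "S1": ["S0", "S2"], "S2": ["S1", "S3"], "S3": ["S2"]}
machines = {
    "S0": [0, 1, 4, 5],
    "S1": [2, 3, 6, 7],
    "S2": [8, 9, 10, 11],
    "S3": [12, 13, 14, 15],
}

if __name__ == "__main__":
    print(" -> ".join(f"n{m}" for m in linear_tree(adjacency, machines, "S0")))
```

Because a DFS crosses each switch-to-switch link once in each direction, consecutive machines in the resulting chain never share an inter-switch link in the same direction.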

Algorithm for constructing contention free binary tree

• Start with a contention-free linear tree.
• Recursively divide the tree into two sub-trees.
• Make sure that there cannot be contention.
• Choose the sub-trees so that the height of the whole tree is minimal.

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Binary tree height

• The performance of binary pipelined broadcast depends on the height of the binary tree.
• Even though a contention-free binary tree may not be a complete binary tree, its height is only slightly greater than that of a complete binary tree.

Average tree heights for 20 randomly generated topologies

Evaluation

Contention-free pipelined algorithms:
• Routine generators produce the routines from topology information.
• The generated routines are based on MPICH point-to-point primitives.
• Trees evaluated: linear tree, binary tree, 3-ary tree

Targets for comparison:
• MPICH: binomial tree, scatter/allgather
• LAM: flat tree, binomial tree
• Topology-unaware pipelined linear and binary algorithms

Evaluation

Performance of different pipelined trees (topology 1)

Comparing pipelined broadcast with other schemes

Topology unaware and contention-free pipelined broadcast

Segment size for pipelined broadcast
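
One simple way to pick a segment size is to scan a few candidates against a cost model. The Python sketch below is only an illustration under an assumed linear model, T(s) = alpha + s / beta, which is not taken from the slides; the function name `best_segment_size` and all parameter values are hypothetical, and measured timings could replace the model.

```python
import math


def best_segment_size(msize, height, children, candidates, alpha, beta):
    """Return the candidate segment size with the lowest modeled time.

    Modeled time is (X + H - 1) * D * T(seg) with X = ceil(msize / seg)
    and the assumed cost T(seg) = alpha + seg / beta.
    """
    def total(seg):
        x = math.ceil(msize / seg)
        return (x + height - 1) * children * (alpha + seg / beta)

    return min(candidates, key=total)


if __name__ == "__main__":
    sizes = [256, 512, 1024, 2048, 4096, 8192, 16384]
    choice = best_segment_size(2**20, height=16, children=1,
                               candidates=sizes, alpha=1e-4, beta=1e8)
    print("modeled best segment size:", choice, "bytes")
```

Small segments shorten the pipeline fill-up time but pay the per-message startup cost more often, so the best choice depends on how large `alpha` is relative to the transfer time of one segment.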

Conclusions

• Pipelined broadcast is faster than the current broadcast algorithms for medium and large messages.
• Linear pipeline has a completion time roughly equal to T(msize).
• Binary pipeline broadcast is best for medium messages.
• A contention-free broadcast tree is necessary for pipelined algorithms.
• A good segment size for pipelined broadcast is not difficult to find.

Questions?