98
On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de Geijn ScicomP 10, August 9- 13 Austin, TX

On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Embed Size (px)

DESCRIPTION

Model of Parallel Computation Target Architectures –distributed memory parallel architectures Indexing –p nodes –indexed 0 … p – 1 –each node has one computational processor

Citation preview

Page 1: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

On Optimizing Collective Communication

UT/Texas AdvancedComputing Center

UT/Computer Science

Avi Purkayastha

Ernie Chan, Marcel HeinrichRobert van de Geijn

ScicomP 10, August 9-13 Austin, TX

Page 2: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Outline

• Model of Parallel Computation• Collective Communications• Algorithms• Performance Results• Conclusions and Future work

Page 3: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Model of Parallel Computation

• Target Architectures– distributed memory parallel architectures

• Indexing– p nodes– indexed 0 … p – 1– each node has one computational processor

Page 4: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

10 2 3 4

• often logically viewed as a linear array

5 6 7 8

Page 5: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Model of Parallel Computation

• Logically Fully Connected– a node can send directly to any other node

• Communicating Between Nodes– a node can simultaneously receive and send

• Network Conflicts– sending over a path between two nodes that is

completely occupied

Page 6: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Model of Parallel Computation

• Cost of Communication– sending a message of length n between any two nodes

is the startup cost (latency) is the transmission cost (bandwidth)

• Cost of Computation– cost to perform arithmetic operation is

• reduction operations– sum– prod– min– max

+n

Page 7: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Outline

• Model of Parallel Computation• Collective Communications• Algorithms• Performance Results• Conclusions and Future work

Page 8: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Collective Communications

• Broadcast• Reduce(-to-one)• Scatter • Gather• Allgather• Reduce-scatter• Allreduce

Page 9: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Lower Bounds (Latency)

• Broadcast

• Reduce(-to-one)

• Scatter/Gather

• Allgather

• Reduce-scatter

• Allreduce

log(p)⎡ ⎤

log(p)⎡ ⎤

log(p)⎡ ⎤

log(p)⎡ ⎤

log(p)⎡ ⎤

log(p)⎡ ⎤

Page 10: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Lower Bounds (Bandwidth)

• Broadcast

• Reduce(-to-one)

• Scatter/Gather

• Allgather

• Reduce-scatter

• Allreduce

n

n +p−1p

n

p−1p

n

p−1p

n

p−1p

n +p−1p

n

2p−1p

n +p−1p

n

Page 11: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Outline

• Model of Parallel Computation• Collective Communications• Algorithms• Performance Results• Conclusions and Future work

Page 12: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Motivating Example

• We will illustrate the different types of algorithms and implementations using the Reduce-scatter operation.

Page 13: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

A building block approach to library implementation

• Short-vector case

• Long-vector case

• Hybrid algorithms

Page 14: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Short-vector case

• Primary concern: – algorithms must have low latency cost

• Secondary concerns:– algorithms must work for arbitrary number of

nodes• in particular, not just for power-of-two numbers of nodes

– algorithms should avoid network conflicts• not absolutely necessary, but nice if possible

Page 15: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Minimum-Spanning Tree based algorithms

• We will show how the following building blocks:– broadcast/reduce– scatter/gather

• Using minimum spanning trees embedded in the logical linear array while attaining– minimal latency– implementation for arbitrary numbers of nodes– no network conflicts

Page 16: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

General principles

• message starts on one processor

Page 17: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

General principles

• divide logical linear array in half

Page 18: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

General principles

• send message to the half of the network that does not contain the current node (root) that holds the message

Page 19: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

General principles

• send message to the half of the network that does not contain the current node (root) that holds the message

Page 20: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

General principles

• continue recursively in each of the two halves

Page 21: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

General principles

• The demonstrated technique directly applies to – broadcast– scatter

• The technique can be applied in reverse to– reduce– gather

Page 22: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

General principles

• This technique can be used to implement the following building blocks:– broadcast/reduce– scatter/gather

• Using a minimum spanning tree embedded in the logical linear array while attaining– minimal latency– implementation for arbitrary numbers of nodes

– no network conflicts?• Yes, on linear arrays

Page 23: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Reduce-scatter (short vector)

Page 24: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Reduce

Reduce-scatter (short vector)

Page 25: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Scatter

Reduce-scatter (short vector)

Page 26: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Reduce

Before Reduce

+ + + + + + + +

Page 27: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

+ + + + + + +

Reduce

Reduce After

+

Page 28: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 29: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 30: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 31: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 32: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 33: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 34: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Cost of Minimum-Spanning Tree Reduce

log( p)⎡ ⎤ + n +n( )

number of steps cost per steps

Page 35: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Cost of Minimum-Spanning Tree Reduce

log( p)⎡ ⎤ + n +n( )

number of steps cost per steps

Notice: attains lower bound for latency component

Page 36: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Scatter

Before After

Page 37: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 38: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 39: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 40: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 41: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 42: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 43: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Cost of Minimum-Spanning Tree Scatter

+ n2kβ

⎛ ⎝ ⎜

⎞ ⎠ ⎟

k=1

log(p)∑

=log( p) α + p − 1

pnβ

• Assumption: power of two number of nodes

Page 44: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Cost of Minimum-Spanning Tree Scatter

+ n2kβ

⎛ ⎝ ⎜

⎞ ⎠ ⎟

k=1

log(p)∑

=log( p) α + p − 1

pnβ

• Assumption: power of two number of nodes

Notice: attains lower bound for latency and bandwidth components

Page 45: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Cost of Reduce/Scatter Reduce-scatter

log( p)( +n +n)

lo(p) + p−1p

n

2lo(p) + p−1p

+lo(p)⎛⎝⎜

⎞⎠⎟n +lo(p)n

• Assumption: power of two number of nodes

reducescatter

Page 46: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Cost of Reduce/Scatter Reduce-scatter

log( p)( +n +n)

lo(p) + p−1p

n

2lo(p) + p−1p

+lo(p)⎛⎝⎜

⎞⎠⎟n +lo(p)n

• Assumption: power of two number of nodes

reducescatter

Notice: does not attain lower bound for latency or bandwidth components

Page 47: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

RecapReduce

log( p )( + n + n)

Scatterlog(p)+

p−1p

n

Broadcastlog( p )( + n)

Gatherlog(p)+

p−1p

n

Allreduce2log( p) + lo(p)n(2 + )

Reduce-scatter2log( p) + lo(p)n( + ) +

p−1p

n

Allgather2log( p) + lo(p)n +

p−1p

n

Page 48: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

A building block approach to library implementation

• Short-vector case

• Long-vector case

• Hybrid algorithms

Page 49: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Long-vector case

• Primary concern: – algorithms must have low cost due to vector length– algorithms must avoid network conflicts

• Secondary concerns:– algorithms must work for arbitrary number of

nodes• in particular, not just for power-of-two numbers of nodes

Page 50: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Long-vector building blocks

• We will show how the following building blocks:– allgather/reduce-scatter

• Can be implemented using “bucket” algorithms while attaining– minimal cost due to length of vectors– implementation for arbitrary numbers of nodes– no network conflicts

Page 51: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

• A logical ring can be embedded in a physical linear array

Page 52: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

General principles

• This technique can be used to implement the following building blocks:– allgather/reduce-scatter

• Send subvectors of data around the ring at each step until all data is collected, like a “bucket”– minimal cost due to length of vectors– implementation for arbitrary numbers of nodes– no network conflicts

Page 53: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Reduce-scatter

Before Reduce

+ + + + + + + +

Page 54: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

+ + + + + + +

Reduce-scatter

Reduce After

+

Page 55: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 56: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 57: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 58: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 59: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 60: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 61: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 62: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 63: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 64: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 65: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Cost of Bucket Reduce-scatter

( p−1) +np +

np

⎛⎝⎜

⎞⎠⎟

=(p−1) +

p−1p

n +p−1p

nnumber of steps cost per steps

Page 66: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Cost of Bucket Reduce-scatter

( p−1) +np +

np

⎛⎝⎜

⎞⎠⎟

=(p−1) +

p−1p

n +p−1p

nnumber of steps cost per steps

Notice: attains lower bound for bandwidth andcomputation component

Page 67: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

RecapReduce-scatter( p−1)+p−1

pn(+)

Scatterlog(p)+

p−1p

n

Allgather( p−1)+p−1

pn

Gatherlog(p)+

p−1p

n

Allreduce2( p −1) +

p−1p

n(2 + )

Reduce( p −1 +lo(p)) +

p−1p

n(2 + )

Broadcast(log( p) + p −1) + 2

p−1p

n

Page 68: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

A building block approach to library implementation

• Short-vector case

• Long-vector case

• Hybrid algorithms

Page 69: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Hybrid algorithms(Intermediate length case)

• Algorithms must balance latency, cost due to vector length, and network conflicts

Page 70: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

General principles

• View p nodes as a 2-dimensional mesh– p = r x c, row and column dimensions

• Perform operation in each dimension– reduce-scatter within columns, followed by reduce-

scatter within rows

Page 71: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

General principles

• Many different combinations of short- and long- algorithms

• Generally try to reduce vector length to use at next dimension

• Can use more than 2-dimensions

Page 72: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Example: 2D Scatter/Allgather Broadcast

Page 73: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Example: 2D Scatter/Allgather Broadcast

Scatter in columns

Page 74: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Example: 2D Scatter/Allgather Broadcast

Scatter in rows

Page 75: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Example: 2D Scatter/Allgather Broadcast

Allgather in rows

Page 76: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Example: 2D Scatter/Allgather Broadcast

Allgather in columns

Page 77: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 78: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 79: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 80: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 81: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 82: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 83: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 84: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 85: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 86: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 87: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 88: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 89: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Cost of 2D Scatter/Allgather Broadcast

(log( p)+r+c−2) + 2 p−1p

n

Page 90: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Cost comparison

• Option 1:– MST Broadcast in column– MST Broadcast in rows

• Option 2:– Scatter in column– MST Broadcast in rows– Allgather in columns

• Option 3:– Scatter in column– Scatter in rows– Allgather in rows– Allgather in columns

log( p) +lo(p)n

log( p)+c−1( ) + 2c−1+ lo(r)

c⎛⎝⎜

⎞⎠⎟n

(log( p)+r+c−2) + 2p−1p

n

Page 91: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Outline

• Model of Parallel Computation• Collective Communications• Algorithms• Performance Results• Conclusions and Future work

Page 92: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Testbed Architecture• Cray-Dell PowerEdge Linux Cluster

– Lonestar• Texas Advanced Computing Center

– J. J. Pickle Research Campus» The University of Texas at Austin

– 856 3.06 GHz Xeon processors and– 410 Dell dual-processor PowerEdge 1750 compute nodes– 2 GB of memory per node in the compute nodes– Myrinet-2000 switch fabric– mpich-gm library 1.2.5..12a

Page 93: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Performance Results• Log-log graphs

– Shows performance better than linear-linear graphs• Used one processor per node

• Graphs– ‘send-min’ is the minimum time to send between nodes– ’MPI’ is the MPICH implementation– ‘short’ is minimum spanning tree algorithm– ‘long’ is the bucket algorithm– ‘hybrid’ is a 3-dimensional hybrid algorithm

• two higher dimensions with ‘long’ algorithm• Lowest dimension with ‘short’ algorithm

Page 94: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 95: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 96: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 97: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de
Page 98: On Optimizing Collective Communication UT/Texas Advanced Computing Center UT/Computer Science Avi Purkayastha Ernie Chan, Marcel Heinrich Robert van de

Conclusions and Future work• Have a working prototype for an optimal collective

communication library, using optimal algorithms for short, intermediate and long vector lengths.

• Need to obtain heuristics for cross-over points between short, hybrid and long vector algorithms, independent of architecture.

• Need to complete this approach for ALL MPI data types and operations.

• Need to generalize approach or have heuristics for large SMP nodes or small n-way node clusters.