On Optimizing Collective Communication
Avi Purkayastha (UT / Texas Advanced Computing Center)
Ernie Chan, Marcel Heinrich, Robert van de Geijn (UT / Computer Science)

ScicomP 10, August 9-13, Austin, TX
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future Work
Model of Parallel Computation
• Target Architectures
  – distributed-memory parallel architectures
• Indexing
  – p nodes
  – indexed 0 … p – 1
  – each node has one computational processor
• often logically viewed as a linear array

[Figure: nodes arranged as a logical linear array]
Model of Parallel Computation
• Logically Fully Connected
  – a node can send directly to any other node
• Communicating Between Nodes
  – a node can simultaneously receive and send
• Network Conflicts
  – sending over a path between two nodes that is completely occupied
Model of Parallel Computation
• Cost of Communication
  – sending a message of length n between any two nodes costs α + nβ
    • α is the startup cost (latency)
    • β is the per-item transmission cost (bandwidth)
• Cost of Computation
  – cost to perform one arithmetic operation is γ
  – reduction operations: sum, prod, min, max
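As an illustration, the α + nβ cost model above can be written as small helper functions. This is a sketch, not from the talk; the function names and example constants are made up:

```python
# Illustrative helpers for the alpha-beta-gamma cost model described above.
from math import ceil, log2

def send_cost(n, alpha, beta):
    """Cost of sending a message of n items between any two nodes: alpha + n*beta."""
    return alpha + n * beta

def mst_broadcast_cost(p, n, alpha, beta):
    """Minimum-spanning-tree broadcast: ceil(log2 p) steps, each a full-length send."""
    return ceil(log2(p)) * send_cost(n, alpha, beta)
```

These helpers are reused informally throughout the cost derivations that follow.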
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future Work
Collective Communications
• Broadcast
• Reduce(-to-one)
• Scatter
• Gather
• Allgather
• Reduce-scatter
• Allreduce
Lower Bounds (Latency)

• Broadcast: ⌈log(p)⌉ α
• Reduce(-to-one): ⌈log(p)⌉ α
• Scatter/Gather: ⌈log(p)⌉ α
• Allgather: ⌈log(p)⌉ α
• Reduce-scatter: ⌈log(p)⌉ α
• Allreduce: ⌈log(p)⌉ α
Lower Bounds (Bandwidth)

• Broadcast: nβ
• Reduce(-to-one): nβ + ((p−1)/p) nγ
• Scatter/Gather: ((p−1)/p) nβ
• Allgather: ((p−1)/p) nβ
• Reduce-scatter: ((p−1)/p) nβ + ((p−1)/p) nγ
• Allreduce: 2 ((p−1)/p) nβ + ((p−1)/p) nγ
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future Work
Motivating Example
• We will illustrate the different types of algorithms and implementations using the Reduce-scatter operation.
A building block approach to library implementation
• Short-vector case
• Long-vector case
• Hybrid algorithms
Short-vector case
• Primary concern:
  – algorithms must have low latency cost
• Secondary concerns:
  – algorithms must work for an arbitrary number of nodes
    • in particular, not just for power-of-two numbers of nodes
  – algorithms should avoid network conflicts
    • not absolutely necessary, but nice if possible
Minimum-Spanning Tree based algorithms
• We will show how the following building blocks:
  – broadcast/reduce
  – scatter/gather
• can be implemented using minimum spanning trees embedded in the logical linear array while attaining:
  – minimal latency
  – implementation for arbitrary numbers of nodes
  – no network conflicts
General principles
• message starts on one processor
General principles
• divide logical linear array in half
General principles
• send message to the half of the network that does not contain the current node (root) that holds the message
General principles
• continue recursively in each of the two halves
General principles
• The demonstrated technique directly applies to:
  – broadcast
  – scatter
• The technique can be applied in reverse to:
  – reduce
  – gather
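The recursive halving described above can be sketched as follows. This is an illustrative simulation, not the library's code: `send` is a hypothetical callback recording one message, and the choice of destination node within the other half is arbitrary.

```python
def mst_broadcast(lo, hi, root, send):
    """Minimum-spanning-tree broadcast over nodes lo..hi (inclusive).
    `root` holds the message; send(src, dst) records one point-to-point send.
    Halve the array, send across the split, recurse into both halves."""
    if lo == hi:
        return
    mid = (lo + hi) // 2
    if root <= mid:
        dest = hi            # pick any node in the upper half as the new root there
        send(root, dest)
        mst_broadcast(lo, mid, root, send)
        mst_broadcast(mid + 1, hi, dest, send)
    else:
        dest = lo            # pick any node in the lower half as the new root there
        send(root, dest)
        mst_broadcast(lo, mid, dest, send)
        mst_broadcast(mid + 1, hi, root, send)
```

For p = 8 nodes this generates p − 1 = 7 messages in ⌈log(p)⌉ = 3 concurrent rounds, matching the cost derivations below.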
General principles
• This technique can be used to implement the following building blocks:
  – broadcast/reduce
  – scatter/gather
• using a minimum spanning tree embedded in the logical linear array while attaining:
  – minimal latency
  – implementation for arbitrary numbers of nodes
  – no network conflicts? Yes, on linear arrays
Reduce-scatter (short vector)

• implemented as a Reduce (to the root) followed by a Scatter
[Figure: Reduce — before, each node holds a full vector of partial contributions; after, the root holds the element-wise sum]
Cost of Minimum-Spanning Tree Reduce

⌈log(p)⌉ (α + nβ + nγ)

(number of steps × cost per step)

Notice: attains the lower bound for the latency component
Scatter

[Figure: Scatter — before, the root holds the full vector; after, each node holds its own block]
Cost of Minimum-Spanning Tree Scatter

Σ_{k=1}^{log(p)} (α + (n/2^k) β) = log(p) α + ((p−1)/p) nβ

• Assumption: power-of-two number of nodes

Notice: attains the lower bound for both the latency and bandwidth components
Cost of Reduce/Scatter Reduce-scatter

reduce:  log(p) (α + nβ + nγ)
scatter: log(p) α + ((p−1)/p) nβ

total: 2 log(p) α + ((p−1)/p + log(p)) nβ + log(p) nγ

• Assumption: power-of-two number of nodes

Notice: does not attain the lower bound for the latency or bandwidth components
Recap (short-vector algorithms)

• Reduce: log(p) (α + nβ + nγ)
• Scatter: log(p) α + ((p−1)/p) nβ
• Broadcast: log(p) (α + nβ)
• Gather: log(p) α + ((p−1)/p) nβ
• Allreduce: 2 log(p) α + log(p) n (2β + γ)
• Reduce-scatter: 2 log(p) α + log(p) n (β + γ) + ((p−1)/p) nβ
• Allgather: 2 log(p) α + log(p) nβ + ((p−1)/p) nβ
A building block approach to library implementation
• Short-vector case
• Long-vector case
• Hybrid algorithms
Long-vector case
• Primary concerns:
  – algorithms must have low cost due to vector length
  – algorithms must avoid network conflicts
• Secondary concern:
  – algorithms must work for an arbitrary number of nodes
    • in particular, not just for power-of-two numbers of nodes
Long-vector building blocks
• We will show how the following building blocks:
  – allgather/reduce-scatter
• can be implemented using “bucket” algorithms while attaining:
  – minimal cost due to length of vectors
  – implementation for arbitrary numbers of nodes
  – no network conflicts
• A logical ring can be embedded in a physical linear array
General principles
• This technique can be used to implement the following building blocks:
  – allgather/reduce-scatter
• Subvectors of data travel around the ring, like buckets, one step at a time until all data is collected, attaining:
  – minimal cost due to length of vectors
  – implementation for arbitrary numbers of nodes
  – no network conflicts
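The bucket reduce-scatter can be sketched as a simulation over plain Python lists. This is illustrative only (no real communication; the indexing convention is one common choice): each node's vector is split into p blocks, and at each of the p − 1 steps every node forwards one partially reduced block to its ring successor.

```python
def bucket_reduce_scatter(vectors):
    """Ring ('bucket') reduce-scatter simulated on p nodes.
    vectors[i] is node i's input, a list of p equal blocks; after p - 1
    steps node i holds the fully reduced block (i + 1) % p."""
    p = len(vectors)
    # acc[i][b] is node i's current partial sum for block b
    acc = [[blk[:] for blk in v] for v in vectors]
    for step in range(p - 1):
        # snapshot all sends first so one step's sends use pre-step values
        sent = [(i, (i - step) % p, acc[i][(i - step) % p]) for i in range(p)]
        for i, b, blk in sent:
            dst = (i + 1) % p
            # receiver adds the incoming partial block into its accumulator
            acc[dst][b] = [x + y for x, y in zip(acc[dst][b], blk)]
    # in the real algorithm each node keeps only its own reduced block
    return acc
```

After p − 1 steps each block has visited every node once, which is why the bandwidth and computation terms, (p−1)/p nβ and (p−1)/p nγ, match the lower bounds derived below.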
Reduce-scatter
[Figure: bucket Reduce-scatter — before, each node holds a full vector of partial contributions; after, each node holds one fully reduced block]
Cost of Bucket Reduce-scatter

(p−1) (α + (n/p) β + (n/p) γ) = (p−1) α + ((p−1)/p) nβ + ((p−1)/p) nγ

(number of steps × cost per step)

Notice: attains the lower bound for the bandwidth and computation components
Recap (long-vector algorithms)

• Reduce-scatter: (p−1) α + ((p−1)/p) n (β + γ)
• Scatter: log(p) α + ((p−1)/p) nβ
• Allgather: (p−1) α + ((p−1)/p) nβ
• Gather: log(p) α + ((p−1)/p) nβ
• Allreduce: 2 (p−1) α + ((p−1)/p) n (2β + γ)
• Reduce: (p−1 + log(p)) α + ((p−1)/p) n (2β + γ)
• Broadcast: (log(p) + p−1) α + 2 ((p−1)/p) nβ
A building block approach to library implementation
• Short-vector case
• Long-vector case
• Hybrid algorithms
Hybrid algorithms(Intermediate length case)
• Algorithms must balance latency, cost due to vector length, and network conflicts
General principles
• View p nodes as a 2-dimensional mesh
  – p = r × c, row and column dimensions
• Perform the operation in each dimension
  – e.g., reduce-scatter within columns, followed by reduce-scatter within rows
General principles
• Many different combinations of short- and long-vector algorithms are possible
• Generally, try to reduce the vector length passed to the next dimension
• Can use more than 2 dimensions
Example: 2D Scatter/Allgather Broadcast

• Scatter in columns
• Scatter in rows
• Allgather in rows
• Allgather in columns
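The four steps above can be sketched as a toy simulation over plain lists (no real communication; block layout and helper names are assumptions, with the root at mesh position (0, 0)):

```python
def broadcast_2d(data, r, c):
    """2D scatter/allgather broadcast on an r x c mesh, simulated.
    `data` starts on node (0, 0), split into r*c equal blocks; returns the
    full vector as reassembled by every node after the allgather phases."""
    p = r * c
    size = len(data) // p
    blocks = [data[k * size:(k + 1) * size] for k in range(p)]
    # scatter in the root's column: node (i, 0) receives the blocks for row i
    col0 = [blocks[i * c:(i + 1) * c] for i in range(r)]
    # scatter in rows: node (i, j) receives block i*c + j
    mesh = [[col0[i][j] for j in range(c)] for i in range(r)]
    # allgather in rows: every node in row i collects all of row i's blocks
    rows = [[x for node in mesh[i] for x in node] for i in range(r)]
    # allgather in columns: every node collects all rows' blocks
    return [x for i in range(r) for x in rows[i]]
```

Each node ends up holding the complete vector, while each individual message carries only a fraction of it, which is what buys the 2 ((p−1)/p) nβ bandwidth term in the cost below.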
Cost of 2D Scatter/Allgather Broadcast

(log(p) + r + c − 2) α + 2 ((p−1)/p) nβ
Cost comparison

• Option 1: MST Broadcast in columns, MST Broadcast in rows
  log(p) α + log(p) nβ
• Option 2: Scatter in columns, MST Broadcast in rows, Allgather in columns
  (log(p) + c − 1) α + ((2(c − 1) + log(r)) / c) nβ
• Option 3: Scatter in columns, Scatter in rows, Allgather in rows, Allgather in columns
  (log(p) + r + c − 2) α + 2 ((p−1)/p) nβ
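To see how the trade-off plays out, the three option costs can be evaluated numerically. The α, β, and mesh-shape values below are made-up examples, not measurements from the talk:

```python
from math import log2

def option1(p, n, alpha, beta):
    """MST broadcast in columns, then MST broadcast in rows."""
    return log2(p) * alpha + log2(p) * n * beta

def option2(p, r, c, n, alpha, beta):
    """Scatter in columns, MST broadcast in rows, allgather in columns."""
    return (log2(p) + c - 1) * alpha + (2 * (c - 1) + log2(r)) / c * n * beta

def option3(p, r, c, n, alpha, beta):
    """Scatter/scatter then allgather/allgather (the 2D broadcast above)."""
    return (log2(p) + r + c - 2) * alpha + 2 * (p - 1) / p * n * beta

p, r, c = 64, 8, 8
alpha, beta = 10.0, 0.01   # illustrative latency and per-item costs
for n in (1, 1_000, 1_000_000):
    print(n, option1(p, n, alpha, beta),
             option2(p, r, c, n, alpha, beta),
             option3(p, r, c, n, alpha, beta))
```

With these example parameters, Option 1 wins for short vectors (fewest latency terms) and Option 3 wins for long vectors (lowest bandwidth term), which is exactly the short/long/hybrid trade-off the hybrid algorithms navigate.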
Outline
• Model of Parallel Computation
• Collective Communications
• Algorithms
• Performance Results
• Conclusions and Future Work
Testbed Architecture

• Cray-Dell PowerEdge Linux Cluster (Lonestar)
  – Texas Advanced Computing Center, J. J. Pickle Research Campus, The University of Texas at Austin
  – 856 3.06 GHz Xeon processors
  – 410 Dell dual-processor PowerEdge 1750 compute nodes
  – 2 GB of memory per compute node
  – Myrinet-2000 switch fabric
  – mpich-gm library 1.2.5..12a
Performance Results

• Log-log graphs
  – show performance better than linear-linear graphs
• Used one processor per node
• Graphs
  – ‘send-min’ is the minimum time to send between nodes
  – ‘MPI’ is the MPICH implementation
  – ‘short’ is the minimum-spanning-tree algorithm
  – ‘long’ is the bucket algorithm
  – ‘hybrid’ is a 3-dimensional hybrid algorithm
    • the two higher dimensions use the ‘long’ algorithm
    • the lowest dimension uses the ‘short’ algorithm
Conclusions and Future Work

• We have a working prototype of an optimal collective communication library, using optimal algorithms for short, intermediate, and long vector lengths.
• Need to obtain heuristics for cross-over points between the short, hybrid, and long vector algorithms, independent of architecture.
• Need to complete this approach for ALL MPI data types and operations.
• Need to generalize the approach, or develop heuristics, for large SMP nodes or small n-way node clusters.