A Communication-Optimal N-Body Algorithm for Direct Interactions
Michael Driscoll, Evangelos Georganas, Penporn Koanantakool, Edgar Solomonik, Katherine Yelick*
UC Berkeley, *Lawrence Berkeley National Laboratory
Overview
• Intro to N-Body problem.
• Communication bounds.
• Communication-optimal algorithm.
• Performance results.
• Conclusion.
Direct N-Body
• n particles
– molecules, galaxies, database tuples, etc.
– O(n^2) interactions
for i = 1 to n:
    for j = 1 to n:
        force[i] += interact(particles[i], particles[j])
p processors
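As a minimal runnable sketch of the double loop above: the `interact` function here is a hypothetical 1D inverse-square repulsion chosen only for illustration (the slides do not define it), and `direct_forces` is an illustrative name.

```python
def interact(xi, xj):
    """Hypothetical 1D inverse-square repulsion acting on xi due to xj."""
    d = xi - xj
    if d == 0.0:
        return 0.0  # skip self-interaction (and coincident particles)
    return d / abs(d) ** 3  # equals sign(d) / d**2


def direct_forces(particles):
    """Evaluate all n^2 pairwise interactions on a single processor."""
    n = len(particles)
    force = [0.0] * n
    for i in range(n):
        for j in range(n):
            force[i] += interact(particles[i], particles[j])
    return force
```

With two particles at 0.0 and 1.0, the forces push them apart: the first is pushed in the negative direction and the second in the positive direction.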
Communication Model
• Communication cost along critical path.
• Alpha-beta model: T = αS + βW, where S = # messages, α = latency, W = # words, β = 1/bandwidth.
• Can we find lower bounds on S or W?
• Do current algorithms meet those bounds?
• If not, can we find ones that do, or better bounds?
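The alpha-beta model charges a latency cost per message and an inverse-bandwidth cost per word. A small sketch of that cost function follows; the function name and the numeric values in the usage note are illustrative assumptions, not measurements from the talk.

```python
def comm_time(num_messages, num_words, alpha, beta):
    """Alpha-beta communication cost along the critical path.

    alpha: latency per message; beta: 1/bandwidth (time per word).
    """
    return alpha * num_messages + beta * num_words
```

For example, `comm_time(10, 1000, alpha=2.0, beta=0.5)` charges 20 time units for latency and 500 for bandwidth. Lowering either S or W lowers the total, which is why the bounds are stated for each separately.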
Communication Lower Bounds
From Minimizing Communication in Numerical Linear Algebra [Ballard et al. 2011]:
– F = # flops
– M = size of fast memory
– H = max flops per M words
– S = # messages
– W = # words
Generalized in: Communication Lower Bounds and Optimal Algorithms for Programs That Reference Arrays [Christ et al. 2013].
Lower Bounds for N-Body
• Flops:
• Memory:
• Max flops per M words:
Plug into latency and bandwidth lower bounds:
Do current algorithms meet these bounds?
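The substitution can be sketched as follows. This is a reconstruction under stated assumptions, not the slide's own formulas: it assumes the general bounds take the form S = Ω(F/H) and W = Ω((F/H)·M), that each of the p processors performs F = n²/p interactions, and that M words in fast memory support at most H = O(M²) interactions.

```latex
\begin{align*}
S &= \Omega\!\left(\frac{F}{H}\right), &
W &= \Omega\!\left(\frac{F}{H}\,M\right) \\[4pt]
F &= \frac{n^2}{p}, &
H &= O(M^2) \\[4pt]
S &= \Omega\!\left(\frac{n^2}{p\,M^2}\right), &
W &= \Omega\!\left(\frac{n^2}{p\,M}\right)
\end{align*}
```

Under these assumptions, a processor that stores only its own share of the particles, M = n/p, gets S = Ω(p) and W = Ω(n).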
A Naïve N-Body Algorithm
• For p steps, send n/p particles.
– # messages: p
– # words: p · (n/p) = n
• Recall the lower bounds on # messages and # words: the naïve algorithm meets both. ✔ ✔
[Figure: processors Proc. 0, Proc. 1, …, Proc. P, each with a row of particles and a row of replicas.]
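The ring schedule can be checked with a sequential simulation. This is a sketch under assumptions: the p processors are simulated in one Python process, `interact` is a hypothetical 1D inverse-square repulsion, and the counters track one processor's sends along the critical path.

```python
def interact(xi, xj):
    """Hypothetical 1D inverse-square repulsion acting on xi due to xj."""
    d = xi - xj
    return 0.0 if d == 0.0 else d / abs(d) ** 3


def naive_ring(particles, p):
    """Simulate p processors that pass blocks of n/p particles around a ring."""
    n = len(particles)
    assert n % p == 0, "this sketch assumes p divides n"
    b = n // p
    home = [particles[r * b:(r + 1) * b] for r in range(p)]
    visiting = [list(block) for block in home]  # replicas that travel
    force = [[0.0] * b for _ in range(p)]
    messages = words = 0  # per-processor counts
    for _ in range(p):
        for r in range(p):  # all processors work concurrently in reality
            for i, xi in enumerate(home[r]):
                for xj in visiting[r]:
                    force[r][i] += interact(xi, xj)
        # every processor sends its visiting block to its right neighbour
        visiting = [visiting[(r - 1) % p] for r in range(p)]
        messages += 1
        words += b
    return [f for block in force for f in block], messages, words
```

For 8 particles on 4 simulated processors, this returns the same forces as the direct double loop (up to floating-point summation order) with 4 messages and 8 words per processor.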
The naïve algorithm is optimal…
• Recall the lower bounds:
• Notice M in denominator.
• Increase M => decrease communication.
• Realize a “lower” lower bound.
Communication-Optimal N-Body
• Replication factor: c copies of each particle
• Communication cost (messages and words) per phase:
– Broadcast
– Shifts
– Reduction
– Total
[Figure: processors arranged as p/c teams by c layers; particles are distributed across Team 0, Team 1, …, Team p/c.]
• Reduce # messages by a factor of c^2.
• Reduce # words by a factor of c.
• c = p^(1/2) => force decomposition [Plimpton 1995]
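One way to organize the replicated schedule is sketched below as a sequential simulation. The details are assumptions made for illustration: each layer is assigned a contiguous range of shift offsets, c² is assumed to divide p, `interact` is a hypothetical 1D inverse-square repulsion, and only the shift phase is counted (broadcast and reduction within a team are not).

```python
def interact(xi, xj):
    """Hypothetical 1D inverse-square repulsion acting on xi due to xj."""
    d = xi - xj
    return 0.0 if d == 0.0 else d / abs(d) ** 3


def ca_nbody(particles, p, c):
    """Simulate p processors arranged as p/c teams by c layers."""
    n = len(particles)
    assert p % (c * c) == 0, "this sketch assumes c*c divides p"
    teams = p // c
    assert n % teams == 0, "this sketch assumes p/c divides n"
    b = n // teams          # particles per team = c*n/p
    steps = teams // c      # shifts per layer = p/c^2
    block = [particles[t * b:(t + 1) * b] for t in range(teams)]
    # partial[k][t][i]: force on particle i of team t accumulated by layer k
    partial = [[[0.0] * b for _ in range(teams)] for _ in range(c)]
    for k in range(c):              # layers run concurrently in reality
        for s in range(steps):
            offset = k * steps + s  # layer k covers its own offset range
            for t in range(teams):
                visiting = block[(t - offset) % teams]
                for i, xi in enumerate(block[t]):
                    for xj in visiting:
                        partial[k][t][i] += interact(xi, xj)
    # sum-reduction across the c layers of each team
    force = [sum(partial[k][t][i] for k in range(c))
             for t in range(teams) for i in range(b)]
    return force, steps, steps * b  # forces, shift messages, shift words
```

With p = 8 and 8 particles, c = 1 reproduces the ring schedule (8 shifts, 8 words), while c = 2 needs 2 shifts and 4 words per processor: a factor of c² fewer messages and c fewer words.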
Experiments
• Developed particle code
– Flat MPI
– 52-byte particles
– Repulsive force drops off with square of distance
– Reflective boundary conditions
• Platforms
– Hopper: Cray XE6 at NERSC, 24 cores/node
– Intrepid: IBM BlueGene/P at ALCF, 4 cores/node
– Both have 3D torus interconnect.
Performance on Hopper
24K particles, 6K cores
[Plot: down is good.]
95.6% reduction
Performance on Intrepid
262K particles, 32K cores
[Plot: down is good.]
99.3% reduction
Strong Scaling on Intrepid
262K particles
[Plot: up is good.]
Perfect strong scaling
4.5x speedup
CA N-Body with Cutoff Distance
• No interactions beyond cutoff radius r
• Assuming:
– uniform particle distribution
– spatial processor decomposition
• Simple extension to support a cutoff:
– still communication-optimal
– works in space of any dimensions
– speedups from 1D and 2D experiments
[Figure: processor layout with c layers.]
N-Body with Cutoff
• Shifts occur modulo the cutoff distance.
• Optimality holds
– same counting argument
– see paper for details
[Figure: particles distributed across p/c teams of processors (Team 0, Team 1, …, Team p/c), with the cutoff diameter marked.]
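A 1D version of the cutoff idea can be sketched as a sequential simulation. Everything specific here is an assumption for illustration: teams own equal-width cells of a non-periodic domain, the cutoff radius is converted to a window of neighbouring teams, the c layers split that window's offsets, and `interact` is a hypothetical inverse-square repulsion that vanishes beyond r. Accumulating into one `force` list stands in for the reduction across layers.

```python
import math


def interact(xi, xj, r):
    """Hypothetical 1D repulsion on xi due to xj; zero beyond cutoff r."""
    d = xi - xj
    if d == 0.0 or abs(d) > r:
        return 0.0
    return d / abs(d) ** 3


def cutoff_nbody_1d(particles, length, teams, c, r):
    """particles: positions in [0, length); teams: number of spatial cells."""
    width = length / teams
    m = math.ceil(r / width)  # cutoff radius measured in team widths
    cell = [[] for _ in range(teams)]
    for idx, x in enumerate(particles):
        cell[min(int(x / width), teams - 1)].append(idx)
    offsets = list(range(-m, m + 1))  # only these neighbour teams matter
    per_layer = math.ceil(len(offsets) / c)  # shifts handled by each layer
    force = [0.0] * len(particles)
    for k in range(c):  # layers run concurrently in reality
        for o in offsets[k * per_layer:(k + 1) * per_layer]:
            for t in range(teams):
                u = t + o
                if not 0 <= u < teams:
                    continue  # no wrap-around in a non-periodic domain
                for i in cell[t]:
                    for j in cell[u]:
                        force[i] += interact(particles[i], particles[j], r)
    return force, per_layer
```

The number of shifts per layer now depends on the cutoff window (2m + 1 teams) rather than on the total number of teams, so a shorter cutoff means fewer shifts.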
1D Simulation on Intrepid
262K particles, 32K cores
[Plot: down is good.]
84.6% reduction
2D Simulation on Hopper
196K particles, 24K cores
[Plot: down is good.]
74.8% reduction
Strong Scaling on Hopper
2D space, 24K cores, 196K particles
[Plot: up is good.]
Good strong scaling
Conclusions
• By using c times more memory, we reduce:
– Words sent along critical path: by a factor of c.
– Messages sent along critical path: by a factor of c^2.
• Theory: maximize c.
• Practice: tune for best c.
– Saw 99.5% reduction in communication (11.8x speedup).
• Applications beyond direct n-body:
– collision detection algorithms
– database joins
– bottom solvers in hierarchical n-body codes
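Choosing c can be explored with a small cost model. This sketch rests on assumptions: it counts only shift-phase traffic, scales the naïve costs (p messages, n words) by c² and c as stated in the conclusions, requires c² to divide p, and takes a hypothetical per-processor memory limit `mem_words` measured in particles.

```python
def candidate_costs(n, p, mem_words):
    """List (c, messages, words) for replication factors that fit in memory."""
    rows = []
    c = 1
    while c * c <= p:  # c = sqrt(p) is the largest replication factor
        fits = c * n / p <= mem_words  # each processor stores c*n/p particles
        if p % (c * c) == 0 and fits:
            rows.append((c, p // (c * c), n / c))
        c += 1
    return rows
```

For n = 1024 and p = 64 with ample memory, the candidates are c = 1, 2, 4, 8. With room for only 40 particles per processor, only c = 1 and c = 2 remain, which is where tuning on a real machine would start.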