IBM Research
© 2005 IBM Corporation
Achieving Strong Scaling On Blue Gene/L: Case Study with NAMD
Sameer Kumar, Blue Gene Software Group, IBM T J Watson Research Center, Yorktown Heights, [email protected]
Outline
Motivation
NAMD and Charm++
BGL Techniques
– Problem mapping
– Overlap of communication with computation
– Grain size
– Load-balancing
– Communication optimizations
Summary
Blue Gene/L
The Blue Gene/L packaging hierarchy:

– Chip: 2 processors; 2.8/5.6 GF/s, 4 MB
– Compute Card: 2 chips (1x2x1); 5.6/11.2 GF/s, 1.0 GB
– Node Card: 16 compute cards, 0-2 I/O cards (32 chips, 4x4x2); 90/180 GF/s, 16 GB
– Rack: 32 node cards; 2.8/5.6 TF/s, 512 GB
– System: 64 racks (64x32x32); 180/360 TF/s, 32 TB
Application Scaling
Weak
– Problem size increases with processors
Strong
– Constant problem size
– Linear to sub-linear decrease in computation time with processors
– Cache performance
– Communication overhead
  • Communication-to-computation ratio
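The strong-scaling overheads above can be made concrete with a toy step-time model (an illustration with assumed numbers, not NAMD data): compute time shrinks as 1/P while per-step communication overhead stays roughly constant, so efficiency decays as P grows.

```cpp
#include <cassert>

// Toy model: per-step time under strong scaling is the serial work divided
// across P processors, plus a roughly constant communication overhead.
double stepTime(double serialWork, double commOverhead, int procs) {
    return serialWork / procs + commOverhead;
}

// Parallel efficiency relative to the ideal serialWork / procs.
double efficiency(double serialWork, double commOverhead, int procs) {
    return (serialWork / procs) / stepTime(serialWork, commOverhead, procs);
}
```

With 1000 units of work and 1 unit of overhead, efficiency is about 0.99 on 10 processors but only 0.5 on 1000, where communication time equals compute time.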
Scaling on Blue Gene/L
Several applications have demonstrated weak scaling
Strong scaling on a large number of benchmarks still needs to be achieved
NAMD and Charm++
NAMD: A Production MD program
NAMD
Fully featured program
NIH-funded development
Distributed free of charge (thousands of downloads so far)
Binaries and source code
Installed at NSF centers
User training and support
Large published simulations (e.g., aquaporin simulation featured in keynote)
Aquaporin Simulation

NAMD, CHARMM27, PME; NpT ensemble at 310 or 298 K; 1 ns equilibration, 4 ns production

Protein: ~15,000 atoms
Lipids (POPE): ~40,000 atoms
Water: ~51,000 atoms
Total: ~106,000 atoms

3.5 days/ns on 128 O2000 CPUs
11 days/ns on 32 Linux CPUs
0.35 days/ns on 512 LeMieux CPUs

F. Zhu, E.T., K. Schulten, FEBS Lett. 504, 212 (2001)
M. Jensen, E.T., K. Schulten, Structure 9, 1083 (2001)
Molecular Dynamics in NAMD
Collection of [charged] atoms, with bonds
– Newtonian mechanics
– Thousands of atoms (10,000 - 500,000)
At each time-step
– Calculate forces on each atom
• Bonded
• Non-bonded: electrostatic and van der Waals
  – Short-range: every timestep
  – Long-range: using PME (3D FFT)
  – Multiple time stepping: PME every 4 timesteps
– Calculate velocities and advance positions
Challenge: femtosecond time-step, millions needed!
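The per-timestep work described above can be sketched as a velocity-Verlet-style loop (a 1-D illustration only; NAMD's actual integrator, force field, and data layout are far richer):

```cpp
#include <cassert>
#include <vector>

// Hypothetical 1-D atom; NAMD stores far more state per atom.
struct Atom { double x = 0, v = 0, f = 0, m = 1; };

// One timestep: half-kick, drift, recompute forces, half-kick.
template <typename ForceFn>
void timestep(std::vector<Atom>& atoms, double dt, ForceFn computeForces) {
    for (auto& a : atoms) { a.v += 0.5 * dt * a.f / a.m; a.x += dt * a.v; }
    computeForces(atoms);  // bonded + short-range non-bonded every step;
                           // long-range PME only every few steps (MTS)
    for (auto& a : atoms) { a.v += 0.5 * dt * a.f / a.m; }
}
```

At a femtosecond per step, simulating even a nanosecond means a million calls to a loop like this, which is why the per-step communication cost dominates strong scaling.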
NAMD Benchmarks
BPTI: 3K atoms
Estrogen Receptor: 36K atoms (1996)
ATP Synthase: 327K atoms (2001)
Parallel MD: Easy or Hard?
Easy
– Tiny working data
– Spatial locality
– Uniform atom density
– Persistent repetition
– Multiple time-stepping
Hard
– Sequential timesteps
– Very short iteration time
– Full electrostatics
– Fixed problem size
– Dynamic variations
NAMD Computation
Application data divided into data objects called patches
– Sub-grids determined by cutoff
Computation performed by migratable computes
– 13 computes per patch pair and hence much more parallelism
– Computes can be further split to increase parallelism
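The count of 13 follows from the cutoff-grid geometry: each patch has 26 neighbors in a 3x3x3 stencil, and each neighbor pair is shared between two patches, leaving 13 unique pair computes per patch plus a self compute. A quick check of that count (illustrative, not NAMD code):

```cpp
#include <cassert>

// Count unique neighbor directions in a 3x3x3 stencil: of the 26
// neighbors, each +/- pair is counted once, giving 26 / 2 = 13.
int uniquePairDirections() {
    int count = 0;
    for (int x = -1; x <= 1; ++x)
        for (int y = -1; y <= 1; ++y)
            for (int z = -1; z <= 1; ++z) {
                if (x == 0 && y == 0 && z == 0) continue;  // self compute
                // Keep the lexicographically positive member of each pair.
                if (x > 0 || (x == 0 && (y > 0 || (y == 0 && z > 0))))
                    ++count;
            }
    return count;
}
```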
NAMD
Scalable molecular dynamics simulation
2 types of objects: patches and computes, to expose more parallelism
Requires more careful load balancing
Communication to Computation Ratio
Scalable
– Constant with number of processors
– In practice grows at a very small rate
Charm++ and Converse
Charm++: object-based asynchronous message-driven parallel programming paradigm
Converse: communication layer for Charm++
– Send, recv, progress, on node level
(Diagram: the user view of message-driven objects alongside the system implementation: objects enqueue to a send message queue, a scheduler drains the receive message queue, and an interface connects both to the network.)
Optimizing NAMD on Blue Gene/L
Single Processor Performance
Worked with IBM Toronto for 3 weeks
– Inner loops slightly altered to enable software pipelining
– Aliasing issues resolved through the use of
#pragma disjoint (*ptr1, *ptr2)
– 40% serial speedup
– Current best performance is with 440
Continued efforts with Toronto to get good 440d performance
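As a hedged sketch of the pragma mentioned above (the function and array names are illustrative; #pragma disjoint is specific to the IBM XL compilers and is ignored as an unknown pragma elsewhere), promising the compiler that two pointers never alias frees it to software-pipeline the loop:

```cpp
#include <cassert>

// Hypothetical inner loop: 'forces' and 'positions' never overlap, which
// the XL-specific pragma asserts so the loop body can be pipelined.
void scaleAdd(double* forces, const double* positions, double s, int n) {
#pragma disjoint(*forces, *positions)
    for (int i = 0; i < n; ++i)
        forces[i] += s * positions[i];
}
```

Without the pragma, the compiler must assume a store through `forces` could change `positions[i+1]` and serialize the iterations.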
NAMD on BGL
Advantages
– Both application and hardware are 3D grids
– Large 4MB L3 cache
• On large number of processors NAMD will run from L3
– Higher bandwidth for short messages
• Midpoint of peak bandwidth achieved quickly
– Six outgoing links from each node
– No OS Daemons
NAMD on BGL
Disadvantages
– Slow embedded CPU
– Small memory per node
– Low bisection bandwidth
– Hard to scale full electrostatics
– Limited support for overlap of computation and communication
  • No cache coherence
BGL Parallelization
Topology driven problem mapping
Load-balancing schemes
Overlap of computation and communication
Communication optimizations
Problem Mapping
(Diagram: the application data space (X, Y, Z) is mapped onto the 3-D processor grid (X, Y, Z).)
Problem Mapping
(Diagram: data objects laid out on the processor grid (X, Y, Z), with cutoff-driven compute objects placed between them.)
Two Away Computation
Each data object (patch) is split along a dimension
– Patches now interact with neighbors of neighbors
– Makes application more fine grained
– Improves load balancing
– Messages of smaller size sent to more processors
– Improves torus bandwidth
Two Away X
Load Balancing Steps
Regular Timesteps
Instrumented Timesteps
Detailed, aggressive Load Balancing
Refinement Load Balancing
Load-balancing Metrics
Balancing load
Minimizing communication hop-bytes
– Place computes close to patches
– Biased through placement of proxies on near neighbors
Minimizing number of proxies
– Affects the connectivity of each data object
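The hop-bytes metric above can be made concrete as follows (a sketch with assumed names, not NAMD's internal API): hops are wrap-around Manhattan distance on the torus, and each message is weighted by the distance it travels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>

// Hops between two coordinates on a ring: take the shorter way around.
int ringHops(int a, int b, int size) {
    int d = std::abs(a - b);
    return std::min(d, size - d);
}

// Hop-bytes for one message on a 3-D torus: total hops times message size.
// Minimizing the sum of this metric over all messages places computes
// close to the patches they communicate with.
long hopBytes(const int src[3], const int dst[3], const int dims[3],
              long bytes) {
    int hops = 0;
    for (int i = 0; i < 3; ++i)
        hops += ringHops(src[i], dst[i], dims[i]);
    return static_cast<long>(hops) * bytes;
}
```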
Overlap of Computation and Communication
Each FIFO has 4 packet buffers
Progress engine should be called every 4400 cycles
Overhead of about 200 cycles per call
– Roughly a 5% increase in computation time
Remaining time can be used for computation
Network Progress Calls
NAMD makes progress engine calls from the compute loops
– Typical frequency is 10,000 cycles, dynamically tunable
for (i = 0; i < (i_upper SELF(- 1)); ++i) {
    CmiNetworkProgress();
    const CompAtom &p_i = p_0[i];
    // ... compute pairlists ...
    for (k = 0; k < npairi; ++k) {
        // ... compute forces ...
    }
}

void CmiNetworkProgress() {
    new_time = rts_get_timebase();
    if (new_time < lastProgress + PERIOD)
        return;  // called again too soon; skip the advance
    lastProgress = new_time;
    AdvanceCommunication();
}
MPI Scalability
Charm++ MPI Driver
– Iprobe based implementation
– Higher progress overhead of MPI_Test
– Statically pinned FIFOs for point to point communication
Charm++ Native Driver
BGX Message Layer (developed by George Almasi)
– Lower progress overhead
– Active messages
  • Make it easy to design complex communication protocols
– Dynamic FIFO mapping
– Low overhead remote memory access
– Interrupts
– Charm++ BGX driver was developed by Chao Huang over this summer
BG/L Msglayer
(Diagram of the message layer: the advance loop polls the scratchpad, torus, and collective message queues; messages (TorusMessage, TreeMessage, SpadMessage) are broken into packets (deterministically or dynamically routed torus packets, tree packets) and injected into the torus and collective-network FIFOs via FIFO pinning; received packets are dispatched through the torus packet registry and the collective packet dispatcher.)
( This slide is taken from G. Almási’s talk on the “new” msglayer. )
Optimized Multicast
pinFifo algorithms
– Decide which of the 6 FIFOs to use when sending a message to {x,y,z,t}
– Cones, chessboard
Dynamic FIFO mapping
– A special send queue from which a message can be injected into whichever FIFO is not full
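The dynamic-mapping idea above can be sketched as picking any injection FIFO with free packet slots when the statically pinned choice is full (all names here are illustrative, not the real message-layer API):

```cpp
#include <array>
#include <cassert>

// Choose an injection FIFO: prefer the statically pinned choice, but fall
// back to any FIFO with free slots so injection does not stall needlessly.
int pickFifo(const std::array<int, 6>& freeSlots, int pinned) {
    if (freeSlots[pinned] > 0) return pinned;
    for (int f = 0; f < 6; ++f)
        if (freeSlots[f] > 0) return f;
    return -1;  // every FIFO full: caller must run the progress engine
}
```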
Communication Pattern in PME
(Plot: 108 x 108 processor communication matrix showing the communication pattern of PME.)
PME
Plane decomposition for 3D-FFT
PME objects placed close to patch objects on the torus
PME optimized through an asynchronous all-to-all with dynamic FIFO mapping
Performance Results
BGX Message layer vs MPI
NAMD co-processor mode performance (ms/step), APoA1 benchmark:

            Cutoff              With PME
# Nodes     Msglayer   MPI*     Msglayer   MPI*
4           2250       2250     -          -
32          314        316      356        371
128         85         91.6     103        -
512         22.7       23.8     26.7       27.8
1024        13.2       13.9     14.4       17.3
2048        7.9        8.1      9.7        10.2
4096        4.8        4.9      6.8        7.3

The message layer results here use sender-side blocking communication

A fully non-blocking version performed below par on MPI
– Polling overhead is high for a list of posted receives

The BGX message layer works well with asynchronous communication
Blocking vs Overlap
APoA1 benchmark in co-processor mode (ms/step):

            Cutoff                           With PME
# Nodes     Blocking   Sender Non-Blocking   Blocking   Sender Non-Blocking
32          314        313                   356        347
128         85         82                    103        97.2
512         22.7       21.7                  26.7       23.7
1024        13.2       11.9                  14.4       13.8
2048        7.9        7.3                   9.7        8.6
4096        4.8        4.3                   6.8        6.2
8192        -          3.7                   -          -
Effect of Network Progress
(Projections timeline of a 1024-node run without aggressive network progress)
Network progress not aggressive enough: communication gaps eat up utilization
Effect of Network Progress (2)
(Projections timeline of a 1024-node run with aggressive network progress)
More frequent advance closes gaps
Virtual Node Mode
(Plot: APoA1 step time with PME, in ms, on 128 to 4096 processors, comparing co-processor (CP) and virtual node (VN) modes.)
Spring vs Now

(Plot: APoA1 step time with PME, in ms on a log scale from 1 to 1000, on 32 to 4096 processors, comparing performance in spring with now.)
Summary
Demonstrated good scaling for the APoA1 benchmark to 4K processors, with a speedup of 2100
– Still working on 8K results
ATPase scales well to 8K processors, with a speedup of 4000+
Lessons Learnt
Eager messages lead to contention
Rendezvous messages don't perform well with mid-size messages
Topology optimizations are a big winner
Overlap of computation and communication is possible
– Overlap, however, makes compute load less predictable
The absence of operating system daemons enables massive scaling
Future Plans
Experiment with new communication protocols
– Remote memory access
– Adaptive eager
– Fast asynchronous collectives
Improve load-balancing
– Newer distributed strategies
– Heavy processors dynamically unload to neighbors
Pencil decomposition for PME
Using the double hummer (the 440d dual FPU)