IBM Research
© 2005 IBM Corporation
Achieving Strong Scaling On Blue Gene/L: Case Study with NAMD
Sameer Kumar, Blue Gene Software Group, IBM T J Watson Research Center, Yorktown Heights, [email protected]
Outline
Motivation
NAMD and Charm++
BGL Techniques
– Problem mapping
– Overlap of communication with computation
– Grain size
– Load-balancing
– Communication optimizations
Summary
Blue Gene/L
The Blue Gene/L packaging hierarchy:

– Chip: 2 processors; 2.8/5.6 GF/s, 4 MB
– Compute Card: 2 chips (1x2x1); 5.6/11.2 GF/s, 1.0 GB
– Node Card: 16 compute cards, 0-2 I/O cards (32 chips, 4x4x2); 90/180 GF/s, 16 GB
– Rack: 32 node cards; 2.8/5.6 TF/s, 512 GB
– System: 64 racks (64x32x32); 180/360 TF/s, 32 TB
Application Scaling
Weak
– Problem size increases with processors
Strong
– Constant problem size
– Linear to sub-linear decrease in computation time with processors
– Cache performance
– Communication overhead
  • Communication-to-computation ratio
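The strong-scaling overheads above can be made concrete with a toy step-time model (an illustration with assumed numbers, not NAMD data): compute time shrinks as 1/P while per-step communication overhead stays roughly constant, so efficiency decays as P grows.

```cpp
#include <cassert>

// Toy model: per-step time under strong scaling is the serial work divided
// across P processors, plus a roughly constant communication overhead.
double stepTime(double serialWork, double commOverhead, int procs) {
    return serialWork / procs + commOverhead;
}

// Parallel efficiency relative to the ideal serialWork / procs.
double efficiency(double serialWork, double commOverhead, int procs) {
    return (serialWork / procs) / stepTime(serialWork, commOverhead, procs);
}
```

With 1000 units of work and 1 unit of overhead, efficiency is about 0.99 on 10 processors but only 0.5 on 1000, where communication time equals compute time.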
Scaling on Blue Gene/L
Several applications have demonstrated weak scaling
Strong scaling on a large number of benchmarks still needs to be achieved
NAMD and Charm++
NAMD: A Production MD program
NAMD
Fully featured program
NIH-funded development
Distributed free of charge (thousands of downloads so far)
Binaries and source code
Installed at NSF centers
User training and support
Large published simulations (e.g., aquaporin simulation featured in keynote)
Aquaporin Simulation

NAMD, CHARMM27, PME; NpT ensemble at 310 or 298 K; 1 ns equilibration, 4 ns production

Protein: ~15,000 atoms
Lipids (POPE): ~40,000 atoms
Water: ~51,000 atoms
Total: ~106,000 atoms

3.5 days/ns on 128 O2000 CPUs
11 days/ns on 32 Linux CPUs
0.35 days/ns on 512 LeMieux CPUs

F. Zhu, E.T., K. Schulten, FEBS Lett. 504, 212 (2001)
M. Jensen, E.T., K. Schulten, Structure 9, 1083 (2001)
Molecular Dynamics in NAMD
Collection of [charged] atoms, with bonds
– Newtonian mechanics
– Thousands of atoms (10,000 - 500,000)
At each time-step
– Calculate forces on each atom
• Bonded
• Non-bonded: electrostatic and van der Waals
  – Short-range: every timestep
  – Long-range: using PME (3D FFT)
  – Multiple time stepping: PME every 4 timesteps
– Calculate velocities and advance positions
Challenge: femtosecond time-step, millions needed!
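The per-timestep work described above can be sketched as a velocity-Verlet-style loop (a 1-D illustration only; NAMD's actual integrator, force field, and data layout are far richer):

```cpp
#include <cassert>
#include <vector>

// Hypothetical 1-D atom; NAMD stores far more state per atom.
struct Atom { double x = 0, v = 0, f = 0, m = 1; };

// One timestep: half-kick, drift, recompute forces, half-kick.
template <typename ForceFn>
void timestep(std::vector<Atom>& atoms, double dt, ForceFn computeForces) {
    for (auto& a : atoms) { a.v += 0.5 * dt * a.f / a.m; a.x += dt * a.v; }
    computeForces(atoms);  // bonded + short-range non-bonded every step;
                           // long-range PME only every few steps (MTS)
    for (auto& a : atoms) { a.v += 0.5 * dt * a.f / a.m; }
}
```

At a femtosecond per step, simulating even a nanosecond means a million calls to a loop like this, which is why the per-step communication cost dominates strong scaling.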
NAMD Benchmarks
BPTI: 3K atoms
Estrogen Receptor: 36K atoms (1996)
ATP Synthase: 327K atoms (2001)
Parallel MD: Easy or Hard?
Easy
– Tiny working data
– Spatial locality
– Uniform atom density
– Persistent repetition
– Multiple time-stepping
Hard
– Sequential timesteps
– Very short iteration time
– Full electrostatics
– Fixed problem size
– Dynamic variations
NAMD Computation
Application data divided into data objects called patches
– Sub-grids determined by cutoff
Computation performed by migratable computes
– 13 computes per patch pair and hence much more parallelism
– Computes can be further split to increase parallelism
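The count of 13 follows from the cutoff-grid geometry: each patch has 26 neighbors in a 3x3x3 stencil, and each neighbor pair is shared between two patches, leaving 13 unique pair computes per patch plus a self compute. A quick check of that count (illustrative, not NAMD code):

```cpp
#include <cassert>

// Count unique neighbor directions in a 3x3x3 stencil: of the 26
// neighbors, each +/- pair is counted once, giving 26 / 2 = 13.
int uniquePairDirections() {
    int count = 0;
    for (int x = -1; x <= 1; ++x)
        for (int y = -1; y <= 1; ++y)
            for (int z = -1; z <= 1; ++z) {
                if (x == 0 && y == 0 && z == 0) continue;  // self compute
                // Keep the lexicographically positive member of each pair.
                if (x > 0 || (x == 0 && (y > 0 || (y == 0 && z > 0))))
                    ++count;
            }
    return count;
}
```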
NAMD
Scalable molecular dynamics simulation
2 types of objects: patches and computes, to expose more parallelism
Requires more careful load balancing
Communication to Computation Ratio
Scalable
– Constant with number of processors
– In practice grows at a very small rate
Charm++ and Converse
Charm++: object-based asynchronous message-driven parallel programming paradigm
Converse: communication layer for Charm++
– Send, recv, progress, on node level
(Diagram: the user view of message-driven objects alongside the system implementation: objects enqueue to a send message queue, a scheduler drains the receive message queue, and an interface connects both to the network.)
Optimizing NAMD on Blue Gene/L
Single Processor Performance
Worked with IBM Toronto for 3 weeks
– Inner loops slightly altered to enable software pipelining
– Aliasing issues resolved through the use of
#pragma disjoint (*ptr1, *ptr2)
– 40% serial speedup
– Current best performance is with 440
Continued efforts with Toronto to get good 440d performance
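As a hedged sketch of the pragma mentioned above (the function and array names are illustrative; #pragma disjoint is specific to the IBM XL compilers and is ignored as an unknown pragma elsewhere), promising the compiler that two pointers never alias frees it to software-pipeline the loop:

```cpp
#include <cassert>

// Hypothetical inner loop: 'forces' and 'positions' never overlap, which
// the XL-specific pragma asserts so the loop body can be pipelined.
void scaleAdd(double* forces, const double* positions, double s, int n) {
#pragma disjoint(*forces, *positions)
    for (int i = 0; i < n; ++i)
        forces[i] += s * positions[i];
}
```

Without the pragma, the compiler must assume a store through `forces` could change `positions[i+1]` and serialize the iterations.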
NAMD on BGL
Advantages
– Both application and hardware are 3D grids
– Large 4MB L3 cache
• On large number of processors NAMD will run from L3
– Higher bandwidth for short messages
• Midpoint of peak bandwidth achieved quickly
– Six outgoing links from each node
– No OS Daemons
NAMD on BGL
Disadvantages
– Slow embedded CPU
– Small memory per node
– Low bisection bandwidth
– Hard to scale full electrostatics
– Limited support for overlap of computation and communication
  • No cache coherence
BGL Parallelization
Topology driven problem mapping
Load-balancing schemes
Overlap of computation and communication
Communication optimizations
Problem Mapping
(Diagram: the application data space (X, Y, Z) is mapped onto the 3-D processor grid (X, Y, Z).)
Problem Mapping
(Diagram: data objects laid out on the processor grid (X, Y, Z), with cutoff-driven compute objects placed between them.)
Two Away Computation
Each data object (patch) is split along a dimension
– Patches now interact with neighbors of neighbors
– Makes application more fine grained
– Improves load balancing
– Messages of smaller size sent to more processors
– Improves torus bandwidth
Two Away X
Load Balancing Steps
Regular Timesteps
Instrumented Timesteps
Detailed, aggressive Load Balancing
Refinement Load Balancing
Load-balancing Metrics
Balancing load
Minimizing communication hop-bytes
– Place computes close to patches
– Biased through placement of proxies on near neighbors
Minimizing number of proxies
– Affects the connectivity of each data object
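The hop-bytes metric above can be made concrete as follows (a sketch with assumed names, not NAMD's internal API): hops are wrap-around Manhattan distance on the torus, and each message is weighted by the distance it travels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>

// Hops between two coordinates on a ring: take the shorter way around.
int ringHops(int a, int b, int size) {
    int d = std::abs(a - b);
    return std::min(d, size - d);
}

// Hop-bytes for one message on a 3-D torus: total hops times message size.
// Minimizing the sum of this metric over all messages places computes
// close to the patches they communicate with.
long hopBytes(const int src[3], const int dst[3], const int dims[3],
              long bytes) {
    int hops = 0;
    for (int i = 0; i < 3; ++i)
        hops += ringHops(src[i], dst[i], dims[i]);
    return static_cast<long>(hops) * bytes;
}
```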
Overlap of Computation and Communication
Each FIFO has 4 packet buffers
Progress engine should be called every 4400 cycles
Overhead of about 200 cycles per call
– Roughly a 5% increase in computation time
Remaining time can be used for computation
Network Progress Calls
NAMD makes progress engine calls from the compute loops
– Typical frequency is 10,000 cycles, dynamically tunable
for (i = 0; i < (i_upper SELF(- 1)); ++i) {
    CmiNetworkProgress();
    const CompAtom &p_i = p_0[i];
    // ... compute pairlists ...
    for (k = 0; k < npairi; ++k) {
        // ... compute forces ...
    }
}

void CmiNetworkProgress() {
    new_time = rts_get_timebase();
    if (new_time < lastProgress + PERIOD)
        return;  // called again too soon; skip the advance
    lastProgress = new_time;
    AdvanceCommunication();
}
MPI Scalability
Charm++ MPI Driver
– Iprobe based implementation
– Higher progress overhead of MPI_Test
– Statically pinned FIFOs for point to point communication
Charm++ Native Driver
BGX Message Layer (developed by George Almasi)
– Lower progress overhead
– Active messages
  • Make it easy to design complex communication protocols
– Dynamic FIFO mapping
– Low overhead remote memory access
– Interrupts
– Charm++ BGX driver was developed by Chao Huang over this summer
BG/L Msglayer
(Diagram of the message layer: the advance loop polls the scratchpad, torus, and collective message queues; messages (TorusMessage, TreeMessage, SpadMessage) are broken into packets (deterministically or dynamically routed torus packets, tree packets) and injected into the torus and collective-network FIFOs via FIFO pinning; received packets are dispatched through the torus packet registry and the collective packet dispatcher.)
( This slide is taken from G. Almási’s talk on the “new” msglayer. )
Optimized Multicast
pinFifo algorithms
– Decide which of the 6 FIFOs to use when sending a message to {x,y,z,t}
– Cones, chessboard
Dynamic FIFO mapping
– A special send queue from which a message can be injected into whichever FIFO is not full
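The dynamic-mapping idea above can be sketched as picking any injection FIFO with free packet slots when the statically pinned choice is full (all names here are illustrative, not the real message-layer API):

```cpp
#include <array>
#include <cassert>

// Choose an injection FIFO: prefer the statically pinned choice, but fall
// back to any FIFO with free slots so injection does not stall needlessly.
int pickFifo(const std::array<int, 6>& freeSlots, int pinned) {
    if (freeSlots[pinned] > 0) return pinned;
    for (int f = 0; f < 6; ++f)
        if (freeSlots[f] > 0) return f;
    return -1;  // every FIFO full: caller must run the progress engine
}
```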
Communication Pattern in PME
(Plot: 108 x 108 processor communication matrix showing the communication pattern of PME.)
PME
Plane decomposition for 3D-FFT
PME objects placed close to patch objects on the torus
PME optimized through an asynchronous all-to-all with dynamic FIFO mapping
Performance Results
BGX Message layer vs MPI
NAMD co-processor mode performance (ms/step), APoA1 benchmark:

            Cutoff              With PME
# Nodes     Msglayer   MPI*     Msglayer   MPI*
4           2250       2250     -          -
32          314        316      356        371
128         85         91.6     103        -
512         22.7       23.8     26.7       27.8
1024        13.2       13.9     14.4       17.3
2048        7.9        8.1      9.7        10.2
4096        4.8        4.9      6.8        7.3

The message layer results here use sender-side blocking communication

A fully non-blocking version performed below par on MPI
– Polling overhead is high for a list of posted receives

The BGX message layer works well with asynchronous communication
Blocking vs Overlap
APoA1 benchmark in co-processor mode (ms/step):

            Cutoff                           With PME
# Nodes     Blocking   Sender Non-Blocking   Blocking   Sender Non-Blocking
32          314        313                   356        347
128         85         82                    103        97.2
512         22.7       21.7                  26.7       23.7
1024        13.2       11.9                  14.4       13.8
2048        7.9        7.3                   9.7        8.6
4096        4.8        4.3                   6.8        6.2
8192        -          3.7                   -          -
Effect of Network Progress
(Projections timeline of a 1024-node run without aggressive network progress)
Network progress not aggressive enough: communication gaps eat up utilization
Effect of Network Progress (2)
(Projections timeline of a 1024-node run with aggressive network progress)
More frequent advance closes gaps
Virtual Node Mode
(Plot: APoA1 step time with PME, in ms, on 128 to 4096 processors, comparing co-processor (CP) and virtual node (VN) modes.)
Spring vs Now

(Plot: APoA1 step time with PME, in ms on a log scale from 1 to 1000, on 32 to 4096 processors, comparing performance in spring with now.)
Summary
Demonstrated good scaling for the APoA1 benchmark to 4K processors, with a speedup of 2100
– Still working on 8K results
ATPase scales well to 8K processors, with a speedup of 4000+
Lessons Learnt
Eager messages lead to contention
Rendezvous messages don't perform well with mid-size messages
Topology optimizations are a big winner
Overlap of computation and communication is possible
– Overlap, however, makes compute load less predictable
The absence of operating system daemons enables massive scaling
Future Plans
Experiment with new communication protocols
– Remote memory access
– Adaptive eager
– Fast asynchronous collectives
Improve load-balancing
– Newer distributed strategies
– Heavy processors dynamically unload to neighbors
Pencil decomposition for PME
Using the double hummer (the 440d dual FPU)