NAMD and BG/L

Chee Wai [email protected]

Parallel Programming Laboratory
Computer Science Department

University of Illinois at Urbana-Champaign
http://charm.cs.uiuc.edu

Outline

● BG/L Platform overview
● Optimization Efforts: Context
● Optimization Efforts: Approaches
    – Topology Awareness
    – Load Balancing
    – Parallelism
    – Computation/Communication Overlap
● Results

Blue Gene/L Platform Review

● Hardware characteristics:
    – PowerPC 440 700 MHz 32-bit processors
    – 2 processors per node, no cache coherence
    – 4 MB L3 cache
    – 512 MB memory per node
    – 6 outgoing FIFO links per node
    – 3D torus interconnect
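To make the 3D torus and its six links per node concrete, here is a minimal Python sketch of hop distance and nearest neighbors on a torus with wraparound links. The function names and the 8x8x8 partition shape are illustrative assumptions and are not taken from the BG/L system software.

```python
def torus_hops(a, b, dims):
    """Minimum hop count between nodes a and b on a torus of shape dims.

    Each axis can be traversed in either direction, so the distance along
    one axis is the shorter of the direct and the wraparound path.
    """
    hops = 0
    for ai, bi, n in zip(a, b, dims):
        d = abs(ai - bi) % n
        hops += min(d, n - d)
    return hops


def torus_neighbors(node, dims):
    """Nearest neighbors of a node: one step in each direction on each axis."""
    result = []
    for axis in range(3):
        for step in (-1, 1):
            nb = list(node)
            nb[axis] = (nb[axis] + step) % dims[axis]
            result.append(tuple(nb))
    return result


if __name__ == "__main__":
    dims = (8, 8, 8)  # hypothetical partition shape
    print(torus_hops((0, 0, 0), (7, 0, 0), dims))
    print(len(torus_neighbors((0, 0, 0), dims)))
```

Hop counts of this kind are what a topology-aware placement tries to keep small for objects that communicate often.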

Blue Gene/L Platform Review (2)

● Other characteristics:
    – Microkernel on compute nodes, minimal OS interference.


Objectives

● Scale the 92,000-atom apoa1 benchmark as far as possible.

● Understand the scaling issues involved on the BG/L machine.


Topology Awareness

● Distribute Patches according to the topology.
    – Logically align the NAMD 3D patch grid to BG/L's processor grid.
    – Patch Grid divided by Orthogonal Recursive Bisection (ORB) scheme.
    – Processor Grid is divided in similar proportions and assigned to corresponding Patch subgrids.

● Topology aware spanning tree for multicasts.
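The ORB mapping described above can be sketched as follows. This is a simplified illustration rather than NAMD's implementation: it uses patch count as a stand-in for load, always cuts the longest axis, and the helper names (`split_box`, `volume`, `orb_map`) and grid sizes are hypothetical.

```python
def volume(box):
    """Number of cells in a box given as ((x0, x1), (y0, y1), (z0, z1)), half-open."""
    v = 1
    for lo, hi in box:
        v *= hi - lo
    return v


def split_box(box, left_fraction):
    """Cut a box along its longest axis.

    The cut puts roughly left_fraction of the extent on the left side,
    while keeping at least one cell on each side.
    """
    extents = [hi - lo for lo, hi in box]
    axis = extents.index(max(extents))
    lo, hi = box[axis]
    cut = lo + max(1, min(hi - lo - 1, round((hi - lo) * left_fraction)))
    left, right = list(box), list(box)
    left[axis] = (lo, cut)
    right[axis] = (cut, hi)
    return tuple(left), tuple(right)


def orb_map(patch_box, proc_box, mapping=None):
    """Bisect the patch grid and the processor grid together.

    Returns a list of (patch_subgrid, processor_subgrid) pairs. Recursion
    stops when either box is down to a single cell.
    """
    if mapping is None:
        mapping = []
    if volume(patch_box) == 1 or volume(proc_box) == 1:
        mapping.append((patch_box, proc_box))
        return mapping
    p_left, p_right = split_box(patch_box, 0.5)
    # Divide the processors in the same proportion as the patches.
    frac = volume(p_left) / volume(patch_box)
    q_left, q_right = split_box(proc_box, frac)
    orb_map(p_left, q_left, mapping)
    orb_map(p_right, q_right, mapping)
    return mapping


if __name__ == "__main__":
    patches = ((0, 6), (0, 6), (0, 4))  # hypothetical patch grid
    procs = ((0, 8), (0, 8), (0, 8))    # hypothetical processor grid
    for patch_sub, proc_sub in orb_map(patches, procs)[:4]:
        print(patch_sub, "->", proc_sub)
```

A production scheme would weight the cuts by measured load (for example atom counts) instead of cell counts.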

Load Balancing

● Framework optimizations
    – Memory footprint had to be reduced to accommodate the desired number of processors.
    – Spanning Tree implemented to handle large numbers of incoming messages to PE 0.

● Spread non-migratable work better
    – Bonded computations (e.g. Dihedrals) allocated off processors with Patch work where possible.
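As an illustration of why a spanning tree relieves PE 0, the sketch below combines one value per processor up a k-ary tree, so that no processor receives more than `fanout` messages. The implicit rank numbering and the fanout of 4 are assumptions made for the example; the slides do not describe the tree shape actually used.

```python
def tree_parent(pe, fanout):
    """Parent of pe in a k-ary tree rooted at PE 0 (None for the root)."""
    return None if pe == 0 else (pe - 1) // fanout


def tree_children(pe, fanout, num_pes):
    """Children of pe in the same implicit k-ary numbering."""
    first = pe * fanout + 1
    return [c for c in range(first, first + fanout) if c < num_pes]


def reduce_to_root(values, fanout):
    """Sum one value per PE up the tree.

    Returns (total at PE 0, largest number of messages any PE received).
    """
    num_pes = len(values)
    partial = list(values)
    received = [0] * num_pes
    # Children always have larger ranks than their parents, so walking the
    # ranks in reverse order handles every child before its parent.
    for pe in range(num_pes - 1, 0, -1):
        parent = tree_parent(pe, fanout)
        partial[parent] += partial[pe]
        received[parent] += 1
    return partial[0], max(received)


if __name__ == "__main__":
    total, worst = reduce_to_root([1] * 1024, fanout=4)
    print(total, worst)
```

Without the tree, PE 0 would receive one message from every other processor in each collection step.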

More Parallelism

● 2-away computation. Patches interact with neighbors of neighbors.
    – User-tunable configuration option.

● Break up compute objects.
    – Another user-tunable configuration option.
    – Balance tradeoffs in grainsize vs overheads.

● PME pencil decomposition efforts.
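A small sketch of how 2-away computation increases the number of compute objects available for load balancing. It counts the neighbors of an interior patch and assumes the 2-away option is applied along all three axes; the function names are hypothetical.

```python
from itertools import product


def neighbor_offsets(away):
    """Grid offsets of all patches within `away` steps on each axis, excluding self."""
    r = range(-away, away + 1)
    return [o for o in product(r, r, r) if o != (0, 0, 0)]


def pair_computes_per_patch(away):
    """Pairwise compute objects per patch, counting each unordered pair once."""
    return len(neighbor_offsets(away)) // 2


if __name__ == "__main__":
    for away in (1, 2):
        print(away, len(neighbor_offsets(away)), pair_computes_per_patch(away))
```

More, smaller compute objects give the load balancer finer granularity, at the price of more messages and per-object overhead, which is the tradeoff noted above.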

Overlap of Computation and Communication

● Hurt by lack of cache-coherence.

● One processor can serve as a communication co-processor if the L1 caches are flushed for large messages, but the flushing hurts performance too much.

● Make use of FIFO link buffers. Every so often in NAMD's outer loop, we make AdvanceCommunication() calls.
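The periodic polling idea can be illustrated with a toy model. The slides name AdvanceCommunication() but do not describe its signature or behavior, so `advance_communication`, `FifoLink`, the buffer capacity, and the polling interval below are all hypothetical stand-ins.

```python
from collections import deque


class FifoLink:
    """Toy model of a fixed-capacity outgoing link buffer (sizes are made up)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = deque()
        self.delivered = []

    def drain(self):
        # Model the network emptying the buffer while computation proceeds.
        while self.in_flight:
            self.delivered.append(self.in_flight.popleft())


def advance_communication(pending, link):
    """Hypothetical stand-in: move queued messages into free buffer slots."""
    link.drain()
    while pending and len(link.in_flight) < link.capacity:
        link.in_flight.append(pending.popleft())


def run_step(work_items, messages, link, poll_interval):
    """Run the work items, polling communication every poll_interval items."""
    pending = deque(messages)
    polls = 0
    for i, work in enumerate(work_items):
        work()  # one unit of computation
        if (i + 1) % poll_interval == 0:
            advance_communication(pending, link)
            polls += 1
    # Flush whatever is still queued once the computation is done.
    while pending or link.in_flight:
        advance_communication(pending, link)
    return polls


if __name__ == "__main__":
    link = FifoLink(capacity=4)
    polls = run_step([lambda: None] * 20, list(range(10)), link, poll_interval=5)
    print(polls, link.delivered)
```

The polling interval is a tuning knob: polling too rarely leaves the link buffers idle, polling too often adds overhead to the compute loop.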


Results

Nodes   Processors   Mode   Time (watson)
32      32           co     347 ms
128     128          co     97.2 ms
512     512          co     23.7 ms
1024    1024         co     13.8 ms
2048    2048         co     8.6 ms
4096    4096         co     6.2 ms

Scaling to 8192 processors was achieved at 5.2 ms per step.
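The step times above can be turned into speedup and parallel efficiency figures with a short calculation. The sketch below uses the 32-processor run as the baseline, which is a choice made here for illustration, and treats the 8192-processor figure as one more data point.

```python
# Step times in milliseconds, copied from the results above.
STEP_TIME_MS = {
    32: 347.0,
    128: 97.2,
    512: 23.7,
    1024: 13.8,
    2048: 8.6,
    4096: 6.2,
    8192: 5.2,
}


def scaling_table(times, base=None):
    """Return (processors, speedup, efficiency) rows relative to the base run.

    speedup    = t(base) / t(p)
    efficiency = speedup / (p / base)
    """
    base = min(times) if base is None else base
    rows = []
    for procs in sorted(times):
        speedup = times[base] / times[procs]
        efficiency = speedup / (procs / base)
        rows.append((procs, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for procs, speedup, efficiency in scaling_table(STEP_TIME_MS):
        print(f"{procs:5d}  speedup {speedup:6.2f}  efficiency {efficiency:5.2f}")
```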
