
Page 1: Application Performance Analysis on Blue Gene/L

Application Performance Analysis on Blue Gene/L

Jim Pool, P.I.
Maciej Brodowicz, Sharon Brunett, Tom Gottschalk, Dan Meiron, Paul Springer, Thomas Sterling, Ed Upchurch

Page 2: Application Performance Analysis on Blue Gene/L

Caltech’s Role in Blue Gene/L Project

• Understand implications of the BG/L network architecture and drive results from real-world ASCI applications
• Develop statistical models of applications, of processors as message generators, and of the network
• Focus on
  – Application communications distribution
  – Network contention as a function of load, size, and adaptive routing
• Represent all 64K nodes explicitly in the statistical model
• Create trace analysis tools to characterize applications
  – Extensible Trace Facility (ETF)

Page 3: Application Performance Analysis on Blue Gene/L

Blue Gene/L Node

Page 4: Application Performance Analysis on Blue Gene/L

Blue Gene/L Network

Page 5: Application Performance Analysis on Blue Gene/L

ETF Built-in Trace Options

• MPI events
  – All point-to-point communications (MPI-1)
  – All collective communications (MPI-1)
  – Non-blocking request tracking
  – Communicator creation and destruction
  – MPI datatype decoding (requires MPI-2)
  – Languages: C, Fortran
  – Easy instrumentation of applications (see the sketch below)
• Memory reference and program execution tracing
  – Tracking of statically and dynamically allocated arrays (identifiers, element sizes, dimensions)
  – Tracking of scalar variables
  – Read and write accesses to individual scalars and array elements as well as contiguous vectors of elements
  – Function calls
  – Program execution phases
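
The slides do not show ETF's instrumentation interface, so the following is only a sketch of how MPI point-to-point events can be captured generically through the standard MPI profiling layer (PMPI); the trace record format and the choice of events are assumptions, not ETF's actual output.

```cpp
// Illustration only: intercepting MPI point-to-point sends through the
// standard PMPI profiling interface (not ETF's actual implementation).
// Note: MPI-2 era headers declare the buffer as `void *`; newer MPI
// versions use `const void *`, so match the local mpi.h prototype.
#include <mpi.h>
#include <cstdio>

extern "C" int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                        int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);  // real send
    double t1 = MPI_Wtime();

    int rank = 0, type_size = 0;
    PMPI_Comm_rank(comm, &rank);
    PMPI_Type_size(datatype, &type_size);
    // Hypothetical trace record: rank, destination, tag, bytes, elapsed time.
    std::fprintf(stderr, "TRACE p2p-send rank=%d dest=%d tag=%d bytes=%d dt=%.6fs\n",
                 rank, dest, tag, count * type_size, t1 - t0);
    return rc;
}
```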

Page 6: Application Performance Analysis on Blue Gene/L

ETF Tracing Example for a Magnetohydrodynamic (MHD) Code with Adaptive Mesh Refinement (AMR)

• Parallel MHD fluid code solves the equations of hydrodynamics and the resistive Maxwell equations
  – Part of a larger application that computes dynamic responses to strong shock waves impinging on target materials
  – Fortran 90 + MPI
  – MPI Cartesian communicators
  – Nearest-neighbor communication uses non-blocking send/recv (see the sketch below)
  – MPI_Allreduce for calculating stable time steps
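
The MHD code itself is Fortran 90 and is not reproduced in the slides; purely as an illustration of the communication pattern listed above (Cartesian communicator, non-blocking neighbor exchange, Allreduce for the stable time step), here is a minimal C++/MPI sketch with placeholder sizes and values.

```cpp
// Illustration only: the communication pattern described above, not the
// actual MHD code (which is Fortran 90). Buffer sizes and the local time
// step are placeholders.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    // 2-D Cartesian communicator with periodic boundaries.
    int nprocs, dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    // Non-blocking ghost-cell exchange with the two neighbors along dimension 0.
    const int n = 128;  // ghost strip length (placeholder)
    std::vector<double> send_lo(n), send_hi(n), recv_lo(n), recv_hi(n);
    int lo, hi;
    MPI_Cart_shift(cart, 0, 1, &lo, &hi);

    MPI_Request req[4];
    MPI_Irecv(recv_lo.data(), n, MPI_DOUBLE, lo, 0, cart, &req[0]);
    MPI_Irecv(recv_hi.data(), n, MPI_DOUBLE, hi, 1, cart, &req[1]);
    MPI_Isend(send_hi.data(), n, MPI_DOUBLE, hi, 0, cart, &req[2]);
    MPI_Isend(send_lo.data(), n, MPI_DOUBLE, lo, 1, cart, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    // Globally stable time step: the minimum of the per-process CFL limits.
    double dt_local = 1.0e-3, dt;  // placeholder local limit
    MPI_Allreduce(&dt_local, &dt, 1, MPI_DOUBLE, MPI_MIN, cart);

    MPI_Finalize();
    return 0;
}
```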

Page 7: Application Performance Analysis on Blue Gene/L

AMR MHD: Communication Profile
20 time steps on 32 processors, 128x128 cells

[Two charts of message count (point-to-point and collective) vs. CPU cycles: top, maximum refinement level 1 (counts up to 80,000 over roughly 2E+11 cycles); bottom, maximum refinement level 2 (counts up to 2,000,000 over roughly 2E+12 cycles).]

Page 8: Application Performance Analysis on Blue Gene/L

Lennard-Jones Molecular Dynamics

• Short-range molecular dynamics application simulating Newtonian interactions in large groups of atoms
  – Production code from Sandia National Laboratories
• Simulations are large in two dimensions
  – Number of atoms and number of time steps
• Spatial decomposition case selected
  – Each processing node keeps track of the positions and movement of the atoms in a 3-D box
• Computations carried out in a single time step correspond to femtoseconds of real time
  – A meaningful simulation of the evolution of the system’s state typically requires thousands of time steps
• Point-to-point MPI messages are exchanged across each of the 6 sides of the box every time step (see the sketch below)
• Code is written in Fortran and MPI
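
LJS itself is Fortran and not shown in the slides; the sketch below only illustrates the "6 sides of the box per time step" exchange described above, on a 3-D Cartesian process grid with placeholder face buffers and hypothetical names.

```cpp
// Illustration only: one halo exchange per time step across the 6 faces of
// each node's 3-D box (two neighbors in each of three dimensions). Face
// contents and sizes are placeholders; the real LJS code is Fortran.
#include <mpi.h>
#include <vector>

void exchange_faces(MPI_Comm cart3d, int face_count)
{
    std::vector<double> send(face_count), recv(face_count);
    for (int dim = 0; dim < 3; ++dim) {
        int minus, plus;
        MPI_Cart_shift(cart3d, dim, 1, &minus, &plus);

        // Send toward the "plus" neighbor, receive from the "minus" neighbor...
        MPI_Sendrecv(send.data(), face_count, MPI_DOUBLE, plus, 2 * dim,
                     recv.data(), face_count, MPI_DOUBLE, minus, 2 * dim,
                     cart3d, MPI_STATUS_IGNORE);
        // ...then the opposite direction: 6 messages per step in total.
        MPI_Sendrecv(send.data(), face_count, MPI_DOUBLE, minus, 2 * dim + 1,
                     recv.data(), face_count, MPI_DOUBLE, plus, 2 * dim + 1,
                     cart3d, MPI_STATUS_IGNORE);
    }
}
```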

Page 9: Application Performance Analysis on Blue Gene/L

Lennard-Jones Molecular Dynamics

[Figures: (1) a typical grid cell with the cutoff radius rc and the communication steps; (2) the computational cycle model, showing communications, computation, and wait time within each time step relative to the synchronization time tSYNCH.]
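
The slide gives only the labels above; one hedged reading of the cycle model is that the wall-clock time of an iteration splits into communication, computation, and idle (wait) time, with the wait term covering load imbalance up to the synchronization point tSYNCH:

    T_{\mathrm{iter}} \approx T_{\mathrm{comms}} + T_{\mathrm{comp}} + T_{\mathrm{wait}}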

Page 10: Application Performance Analysis on Blue Gene/L

LJS Single Processor BG/L Performance
Original Code vs. Tuned for BG/L

[Chart: improvement (%) of the tuned code over the original, on a 0-12% scale, vs. number of atoms per BG/L CPU (15,625 to 500,000); the annotation attributes the gain to good cache reuse.]

Page 11: Application Performance Analysis on Blue Gene/L

LJS Molecular Dynamics Performance
Fixed Problem Size of 1 Billion Atoms

[Chart: time per single iteration (ms), broken into compute time and communications time, vs. number of BG/L CPUs (2k, 4k, 8k, 16k, 32k, 64k).]

Page 12: Application Performance Analysis on Blue Gene/L

LJS Speedup: BG/L vs. ASCI Red (3200 Nodes)
1 Billion Atom Problem

[Chart: speedup over the 3200-node ASCI Red (scale 0-80) vs. number of BlueGene/L nodes (2k, 4k, 8k, 16k, 32k, 64k).]

Page 13: Application Performance Analysis on Blue Gene/L

LJS Communications Time
500,000 Atoms per BG/L Node

[Chart: communications time per iteration (msecs, scale 0-60) for physical nearest-neighbor mapping vs. random mapping, on BG/L configurations 4x4x4 (64 nodes), 8x8x8 (512 nodes), and 16x16x16 (4096 nodes).]

Page 14: Application Performance Analysis on Blue Gene/L

What is QMC and Why is it a Good Fit for BG/L?

• QMC is a finite all-electron Quantum Monte Carlo code used to determine quantum properties of materials with extremely high accuracy
• Developed at Caltech by Bill Goddard’s ASCI Material Properties group
• Interesting characteristics
  – Low memory requirements
  – After initialization, highly parallel and scalable
  – Minimal set of MPI calls required
    • Non-blocking point-to-point, reduction, probe, communicator, and collective calls
  – No communications during QMC working steps
  – Communicating convergence statistics takes 7200 bytes regardless of problem size and node count
  – Code already ported to many platforms (Linux, AIX, IRIX, etc.)
• C++ and MPI sources

Page 15: Application Performance Analysis on Blue Gene/L

Iterative QMC Algorithm

For each processor do:
    steps = total steps / number of processors
    generate walkers
    equilibrate walkers
    for each step:
        generate QMC statistics
        send QMC statistics to master node
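
The QMC sources are not included in the slides; as an illustration only, here is a minimal C++/MPI rendering of the loop above, assuming the convergence statistics are 900 doubles (the 7200 bytes mentioned earlier) and that "send to master node" is a reduction to rank 0 (the following chart is labeled "Reduce Using the Torus"); the reduction operator and all helper names are assumptions.

```cpp
// Illustration only: the iterative QMC loop above, not the actual QMC code.
// The statistics buffer (900 doubles = 7200 bytes), the use of MPI_SUM, and
// the commented-out helper calls are assumptions for this sketch.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long total_steps = 100000L * nprocs;   // example total
    const long steps = total_steps / nprocs;     // steps per processor

    // generate_walkers();      // hypothetical: create the walker population
    // equilibrate_walkers();   // hypothetical: equilibration phase

    std::vector<double> stats(900), global(900); // 7200 bytes of statistics
    for (long s = 0; s < steps; ++s) {
        // generate_qmc_statistics(stats);       // hypothetical per-step work

        // Send the convergence statistics to the master node (rank 0).
        MPI_Reduce(stats.data(), global.data(), 900, MPI_DOUBLE,
                   MPI_SUM, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```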

Page 16: Application Performance Analysis on Blue Gene/L

QMC Communications Time
For 100,000 Steps Per Node (Reduce Using the Torus)

[Chart: communications time in seconds (log scale, roughly 0.001 to 1) vs. BG/L configuration: 8x8x8 (512), 16x16x16 (4K), 32x16x16 (8K), 32x32x16 (16K), 32x32x32 (32K), 64x32x32 (64K) nodes.]

Page 17: Application Performance Analysis on Blue Gene/L

Future Application Porting and Analysis for BG/L

• ASCI solid dynamics code simulating the mechanical response of polycrystalline materials, such as tantalum

• Address memory constraints, grain load imbalance, and MPI_Waitall() efficiency as we port/tune to BG/L
  – Good stress test for BG/L robustness
• Scalable simulation of polycrystalline response with an assumed grain shape: the space-filling polyhedron given by the Wigner-Seitz cell of a BCC crystal. The 390-grain example shown here was run on LLNL’s IBM SP3, frost.