
Techniques for Developing Efficient Petascale Applications


Page 1: Techniques for Developing Efficient Petascale Applications

Techniques for Developing Efficient Petascale Applications

Laxmikant (Sanjay) Kale

http://charm.cs.uiuc.edu

Parallel Programming Laboratory

Department of Computer Science

University of Illinois at Urbana-Champaign

Page 2: Techniques for Developing Efficient Petascale Applications

Outline

• Basic techniques for attaining good performance

• Scalability analysis of algorithms

• Measurements and tools

• Communication optimizations:
  – Communication basics
  – Overlapping communication and computation
  – Alpha and beta optimizations

• Combining and pipelining
  – (Topology awareness)

• Sequential optimizations

• (Load balancing)

Page 3: Techniques for Developing Efficient Petascale Applications

Parallel Objects, Adaptive Runtime System, Libraries and Tools

Examples based on multiple applications:

Molecular Dynamics

Crack Propagation

Space-time meshes

Computational Cosmology

Rocket Simulation

Protein Folding

Dendritic Growth

Quantum Chemistry (QM/MM)


Page 4: Techniques for Developing Efficient Petascale Applications

Analyze Performance with Both Simple and Sophisticated Tools

Page 5: Techniques for Developing Efficient Petascale Applications

Simple techniques

• Timers: wall timer (time.h)

• Counters: use the PAPI library's raw counters
  – Especially useful:
    • Number of floating point operations
    • Cache misses (L2, L1, ...)
    • Memory accesses

• Output method:
  – "printf" (or cout) can be expensive
  – Store timer values into an array or buffer, and print at the end (see the sketch below)
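To make the buffered-timing advice concrete, here is a minimal sketch (not from the deck): wall time is read with clock_gettime, per-step times are stored in a vector, and printf runs only after the timed loop. The 1000-step loop and the placeholder "computation" are illustrative.

  // Minimal sketch: buffer timestamps during the run, print only at the end,
  // so that I/O cost does not perturb the measurement.
  #include <time.h>
  #include <stdio.h>
  #include <vector>

  static double wallTime() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);          // wall-clock timer
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
  }

  int main() {
    std::vector<double> stepTimes;                // buffer; printed after the timed region
    stepTimes.reserve(1000);
    for (int step = 0; step < 1000; step++) {
      double t0 = wallTime();
      // ... one timestep of computation goes here ...
      stepTimes.push_back(wallTime() - t0);
    }
    for (size_t i = 0; i < stepTimes.size(); i++) // printf only after the loop
      printf("step %zu: %.6f s\n", i, stepTimes[i]);
    return 0;
  }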

Page 6: Techniques for Developing Efficient Petascale Applications

Sophisticated Tools

• Many tools exist
• They have some learning curve, but can be beneficial
• Example tools:
  – Jumpshot
  – TAU
  – Scalasca
  – Projections
  – Vampir ($$)

• PMPI interface:
  – Allows you to intercept MPI calls
    • So you can write your own tools (see the sketch below)
  – PMPI interface for Projections:
    • git://charm.cs.uiuc.edu/PMPI_Projections.git
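As a hedged illustration of the PMPI idea (this is not the Projections PMPI module from the git URL above), the sketch below intercepts MPI_Send and MPI_Finalize: the wrappers are linked ahead of the MPI library's own symbols and forward to the PMPI_ entry points, accumulating simple statistics. The printed statistics are illustrative.

  // Minimal PMPI sketch: intercept MPI_Send, time it, forward to PMPI_Send.
  #include <mpi.h>
  #include <stdio.h>

  static double totalSendTime = 0.0;
  static long   sendCount     = 0;

  int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
               int dest, int tag, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);  // the real send
    totalSendTime += MPI_Wtime() - t0;
    sendCount++;
    return rc;
  }

  int MPI_Finalize(void) {
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld sends, %.3f s spent in MPI_Send\n",
           rank, sendCount, totalSendTime);
    return PMPI_Finalize();
  }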

Page 7: Techniques for Developing Efficient Petascale Applications

Example: Projections Performance Analysis Tool

• Automatic instrumentation via the runtime
• Graphical visualizations
• More insightful feedback
  – because the runtime understands application events better

Page 8: Techniques for Developing Efficient Petascale Applications

Exploit Sophisticated Performance Analysis Tools

• We use a tool called "Projections"
• Many other tools exist
• Need for scalable analysis
• A not-so-recent example:
  – Trying to identify the next performance obstacle for NAMD
    • Running on 8,192 processors, with a 92,000-atom simulation
    • Test scenario: without PME
    • Time was 3 ms per step, but the lower bound is about 1.6 ms

Page 9: Techniques for Developing Efficient Petascale Applications


Page 10: Techniques for Developing Efficient Petascale Applications


Page 11: Techniques for Developing Efficient Petascale Applications


Page 12: Techniques for Developing Efficient Petascale Applications

Performance Tuning with Patience and Perseverance

Page 13: Techniques for Developing Efficient Petascale Applications

Performance Tuning with Perseverance

• Recognize the multi-dimensional nature of the performance space
• Don't stop optimizing until you know for sure why it cannot be sped up further
  – Measure, measure, measure ...

Page 14: Techniques for Developing Efficient Petascale Applications

Shallow valleys, high peaks, nicely overlapped PME

Charm++'s "Projections" analysis tool: time intervals on the x-axis, activity summed across processors on the y-axis

Legend: green = communication; red = integration; blue/purple = electrostatics; turquoise = angle/dihedral; orange = PME

94% efficiency

ApoA1, on Blue Gene/L, 1024 processors

Page 15: Techniques for Developing Efficient Petascale Applications

Cray XT3, 512 processors: initial runs

Clearly, it needed further tuning, especially PME. But it had more potential (much faster processors).

76% efficiency

Page 16: Techniques for Developing Efficient Petascale Applications


On Cray XT3, 512 processors: after optimizations

96% efficiency


Page 17: Techniques for Developing Efficient Petascale Applications


Communication Issues


Page 18: Techniques for Developing Efficient Petascale Applications

Recap: Communication Basics: Point-to-point

Sending processor → sending co-processor → network → receiving co-processor → receiving processor

Each component has a per-message cost and a per-byte cost: for an n-byte message, the cost at each component is α + n·β.

Important metrics:
  – Overhead at the processor and co-processor
  – Network latency
  – Network bandwidth consumed
  – Number of hops traversed

Page 19: Techniques for Developing Efficient Petascale Applications

Communication Basics


• Message Latency: time between the application sending the message and receiving it on the other processor

• Send overhead: time for which the sending processor was “occupied” with the message

• Receive overhead: the time for which the receiving processor was “occupied” with the message

• Network latency

Page 20: Techniques for Developing Efficient Petascale Applications

Communication: Diagnostic Techniques

• A simple technique: find the "grainsize"
  – Count the number of messages per second of computation per processor (max, average)
  – Count the number of bytes
  – Calculate: computation per message (and per byte)

• Use profiling tools:
  – Identify time spent in different communication operations
  – Classified by modules

• Examine idle time using time-line displays
  – On important processors
  – Determine the causes

• Be careful with "synchronization overhead"
  – It may be load imbalance masquerading as synchronization overhead (a common mistake)

Page 21: Techniques for Developing Efficient Petascale Applications

Communication: Problems and Issues

• Too small a grainsize
  – Total computation time / total number of messages
  – Separated by phases, modules, etc.

• Too many, but short, messages
  – α vs. β tradeoff

• Processors wait too long

• Later:
  – Locality of communication
    • Local vs. non-local
    • How far is non-local? (Does that matter?)
  – Synchronization
  – Global (collective) operations
    • All-to-all operations, gather, scatter

• We will focus on communication cost (grainsize)

Page 22: Techniques for Developing Efficient Petascale Applications

Communication: Solution Techniques

• Overview:
  – Overlap with computation
    • Manual
    • Automatic and adaptive, using virtualization
  – Increasing grainsize
  – α-reducing optimizations
    • Message combining
    • Communication patterns
  – Controlled pipelining
  – Locality enhancement: decomposition control
    • Local-vs-remote and bandwidth reduction
  – Asynchronous reductions
  – Improved collective-operation implementations

Page 23: Techniques for Developing Efficient Petascale Applications

Overlapping Communication and Computation

• Problem:
  – Processors wait too long at "receive" statements

• Idea:
  – Instead of waiting for data, do useful work
  – Issue: how to create such work?
    • It can't depend on the data yet to be received

• Routine communication optimizations in MPI:
  – Move sends up and receives down
    • Keep data dependencies in mind
    • Moving a receive down has a cost: the system needs to buffer the message
  – Use irecvs, but be careful
    • irecv allows you to post a buffer for a receive without waiting for it (see the sketch below)
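A minimal MPI sketch of the pattern described above: post the receive early with MPI_Irecv (and the send with MPI_Isend), compute on data that does not depend on the incoming message, then wait. The function name, buffer sizes, and the placeholder "useful work" are illustrative.

  // Minimal sketch of overlapping communication with computation in MPI.
  #include <mpi.h>

  void exchangeAndCompute(double *sendBuf, double *recvBuf, int n,
                          int left, int right, MPI_Comm comm) {
    MPI_Request reqs[2];
    // Post the receive first, so the incoming message lands directly in recvBuf
    MPI_Irecv(recvBuf, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Isend(sendBuf, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    // ... useful work that does not depend on recvBuf goes here,
    //     e.g. updating the interior points of the local domain ...

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    // ... now it is safe to use recvBuf (e.g. update the boundary points) ...
  }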

Page 24: Techniques for Developing Efficient Petascale Applications

Major Analytical/Theoretical Techniques

• Typically involve simple algebraic formulas and ratios
  – Typical variables are:
    • data size (N), number of processors (P), machine constants
  – Model the performance of individual operations, components, and algorithms in terms of the above

• Be careful to characterize variations across processors, and model them with (typically) max operators
  – E.g., max_i {Load_i}
  – Remember that constants are important in practical parallel computing
    • Be wary of asymptotic analysis: use it, but carefully

• Scalability analysis:
  – Isoefficiency

Page 25: Techniques for Developing Efficient Petascale Applications

Analyze Scalability of the Algorithm (say, via the isoefficiency metric)

Page 26: Techniques for Developing Efficient Petascale Applications

Scalability

• The program should scale up to use a large number of processors
  – But what does that mean?

• An individual simulation isn't truly scalable

• Better definition of scalability:
  – If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size

Page 27: Techniques for Developing Efficient Petascale Applications


Isoefficiency Analysis

• An algorithm (*) is scalable if:
  – When you double the number of processors available to it, it can retain the same parallel efficiency by increasing the size of the problem by some amount
  – Not all algorithms are scalable in this sense
  – Isoefficiency is the rate at which the problem size must be increased, as a function of the number of processors, to keep the same efficiency
  – Use η(P, N) = η(x·P, y·N) to derive this relation

Parallel efficiency: η = T1 / (P · Tp)
  T1: time on one processor
  Tp: time on P processors

[Figure: equal-efficiency curves in the plane of number of processors (x-axis) vs. problem size (y-axis).]
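Restating the definitions above in LaTeX (nothing new, just the slide's formulas in one place):

  % Parallel efficiency and the isoefficiency condition
  \eta(P,N) \;=\; \frac{T_1(N)}{P\,T_P(N)}, \qquad
  \text{isoefficiency: choose } N(P) \text{ so that } \eta\bigl(P, N(P)\bigr)
  \text{ stays constant, i.e.\ } \eta(P,N) = \eta(xP,\, yN).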

Page 28: Techniques for Developing Efficient Petascale Applications


Gauss-Jacobi Relaxation

Sequential pseudocode:

  while (maxError > Threshold) {
    Re-apply boundary conditions
    maxError = 0;
    for i = 0 to N-1 {
      for j = 0 to N-1 {
        B[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i,j+1] + A[i+1,j] + A[i-1,j]);
        if (|B[i,j] - A[i,j]| > maxError)
          maxError = |B[i,j] - A[i,j]|;
      }
    }
    swap B and A
  }

Decomposition by: rows, columns, or blocks

Page 29: Techniques for Developing Efficient Petascale Applications

Isoefficiency of Jacobi Relaxation: for each decomposition (row and block), work out the computation per processor, the communication volume, the comm-to-comp ratio, the efficiency, and the isoefficiency (filled in on the next slide).

Page 30: Techniques for Developing Efficient Petascale Applications

Isoefficiency of Jacobi Relaxation

Row decomposition• Computation per PE:

– A * N * (N/P)

• Communication– 16 * N

• Comm-to-comp Ratio:– (16 * P) / (A * N) = γ

• Efficiency:– 1 / (1 + γ)

• Isoefficiency: – N4

– problem-size = N2

– = (problem-size)^2

Block decomposition• Computation per PE:

– A * N * (N/P)

• Communication:– 32 * N / P1/2

• Comm-to-comp Ratio– (32 * P1/2) / (A * N)

• Efficiency

• Isoefficiency– N2

– Linear in problem size

04/21/23 30Performance Techniques
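A sketch of where these entries come from, assuming 8-byte values, A as the per-point compute cost, and one boundary exchange per iteration:

  % Row decomposition: each PE holds N/P rows of the N x N grid,
  % and exchanges two boundary rows of N values each.
  T_{\mathrm{comp}} = A\,\frac{N^2}{P}, \quad
  T_{\mathrm{comm}} = 2 \cdot 8N = 16N, \quad
  \gamma = \frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}} = \frac{16P}{A\,N}, \quad
  \eta = \frac{1}{1+\gamma}.
  % Constant efficiency requires gamma fixed, i.e. N proportional to P,
  % so the problem size N^2 must grow as P^2.

  % Block decomposition: four boundaries of N/sqrt(P) values each.
  T_{\mathrm{comm}} = 4 \cdot 8\,\frac{N}{\sqrt{P}} = \frac{32N}{\sqrt{P}}, \quad
  \gamma = \frac{32\sqrt{P}}{A\,N}.
  % Here N proportional to sqrt(P) suffices, so the problem size N^2
  % grows only linearly with P.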

Page 31: Techniques for Developing Efficient Petascale Applications

NAMD: A Production MD Program

• Fully featured program

• NIH-funded development

• Distributed free of charge (~20,000 registered users)

• Binaries and source code

• Installed at NSF centers

• User training and support

• Large published simulations


Page 32: Techniques for Developing Efficient Petascale Applications

Molecular Dynamics in NAMD

• Collection of [charged] atoms, with bonds
  – Newtonian mechanics
  – Thousands of atoms (10,000 – 5,000,000)

• At each time-step:
  – Calculate forces on each atom
    • Bonds
    • Non-bonded: electrostatic and van der Waals
      – Short-distance: every timestep
      – Long-distance: using PME (3D FFT)
      – Multiple time stepping: PME every 4 timesteps
  – Calculate velocities and advance positions

• Challenge: femtosecond time-step, millions needed!

Collaboration with K. Schulten, R. Skeel, and coworkers

Page 33: Techniques for Developing Efficient Petascale Applications

Traditional Approaches: Not Isoefficient

• Replicated data:
  – All atom coordinates stored on each processor
  – Communication-to-computation ratio: O(P log P)

• Partition the atoms array across processors
  – Nearby atoms may not be on the same processor
  – C/C ratio: O(P)

• Distribute the force matrix to processors
  – Matrix is sparse and non-uniform
  – C/C ratio: O(√P)

Page 34: Techniques for Developing Efficient Petascale Applications

Spatial Decomposition via Charm++

• Atoms distributed to cubes based on their location
• Size of each cube: just a bit larger than the cut-off radius
• Communicate only with neighbors
• Work: for each pair of neighboring objects
• C/C ratio: O(1)
• However:
  – Load imbalance
  – Limited parallelism

Cells, cubes, or "patches"

Charm++ is useful to handle this

Page 35: Techniques for Developing Efficient Petascale Applications

Object-Based Parallelization for MD: Force Decomposition + Spatial Decomposition

• Now we have many objects to load balance:
  – Each diamond can be assigned to any processor
• Number of diamonds (3D): 14 × number of patches
• 2-away variation:
  – Half-size cubes
  – 5×5×5 interactions
• 3-away variation: 7×7×7 interactions

Page 36: Techniques for Developing Efficient Petascale Applications

Strong Scaling on JaguarPF

[Chart: NAMD strong scaling on JaguarPF at 6,720, 53,760, 107,520, and 224,076 cores.]

Page 37: Techniques for Developing Efficient Petascale Applications

Gauss-Seidel Relaxation

Sequential pseudocode (no separate old/new arrays):

  while (maxError > Threshold) {
    Re-apply boundary conditions
    maxError = 0;
    for i = 0 to N-1 {
      for j = 0 to N-1 {
        old = A[i,j];
        A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i,j+1] + A[i+1,j] + A[i-1,j]);
        if (|A[i,j] - old| > maxError)
          maxError = |A[i,j] - old|;
      }
    }
  }

Sequentially, how well does this work? It works much better!

How to parallelize this?

Page 38: Techniques for Developing Efficient Petascale Applications

How do we parallelize Gauss-Seidel?

• Visualize the flow of values
• Not the control flow:
  – That goes row by row
• Flow of dependences: which values depend on which values
• Does that give us a clue on how to parallelize?

Page 39: Techniques for Developing Efficient Petascale Applications

Parallelizing Gauss-Seidel

• Some ideas:
  – Row decomposition, with pipelining
  – Square over-decomposition
    • Assign many squares to a processor (essentially the same?)

[Diagram: rows of the grid assigned to PE 0, PE 1, PE 2, ...]

Page 40: Techniques for Developing Efficient Petascale Applications

Row decomposition, with pipelining

[Diagram: the N×N grid is split into P row blocks (one per processor), and each row block is processed in chunks of width W, giving N/W columns of work (# columns = N/W, # rows = P). Each processor can start a chunk only after the processor above it has finished the chunk in the same column, so the computation proceeds as a pipelined wavefront with N/W + P - 1 phases in total.]

Page 41: Techniques for Developing Efficient Petascale Applications

[Plot: number of processors in use vs. time. Utilization ramps up from 0 to P over the first P phases, stays at P until phase N/W, then drains; the last phase is N/W + P - 1.]

Page 42: Techniques for Developing Efficient Petascale Applications

Red-Black Squares Method

• Red squares calculate values based on the black squares
  – Then black squares use values from the red squares
  – Now all red squares can be updated in parallel, and then all black squares can be updated in parallel
• Each square can locally do a Gauss-Seidel computation (see the sketch below)
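As a hedged illustration, here is the same coloring idea applied at the level of individual grid points rather than squares (a simplification of the scheme above, using the deck's five-point stencil). Points of one color depend only on points of the other color, so each color can be updated in parallel before switching.

  // Minimal point-wise red-black sweep (sequential sketch; boundaries untouched).
  void redBlackSweep(double **A, int N) {
    for (int color = 0; color < 2; color++) {       // 0 = "red", 1 = "black"
      // Points where (i + j) % 2 == color depend only on the other color
      // (and their own old value), so this loop nest is parallel per color.
      for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
          if ((i + j) % 2 == color)
            A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i][j+1] +
                             A[i-1][j] + A[i+1][j]);
      // In a parallel version: exchange boundary values / synchronize here,
      // before starting the other color.
    }
  }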

Page 43: Techniques for Developing Efficient Petascale Applications

Communication: α-Reducing Optimizations

• When you are sending too many tiny messages:
  – The per-message (α) cost is high (a microsecond per message, for example)
  – How do you reduce it?

• Simple combining:
  – Combine messages going to the same destination (see the sketch below)
  – Cost: delay (less pipelining)

• More complex scenario:
  – All-to-all: everyone wants to send a short message to everyone else
  – Direct method: α·(P-1) + β·(P-1)·m
  – For small m, the α cost dominates
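A minimal sketch of simple message combining (the Combiner struct is hypothetical, not part of any library): small updates are buffered per destination and sent as one message each, trading a little delay for far fewer α costs. The receiver side (matching MPI_Recv or MPI_Probe calls) is omitted.

  // Buffer small updates per destination; send one combined message per flush.
  #include <mpi.h>
  #include <vector>

  struct Combiner {
    std::vector<std::vector<double>> buf;      // one buffer per destination rank
    explicit Combiner(int nranks) : buf(nranks) {}

    void add(int dest, double value) { buf[dest].push_back(value); }

    void flush(MPI_Comm comm) {                // one message per non-empty destination
      for (int dest = 0; dest < (int)buf.size(); dest++) {
        if (buf[dest].empty()) continue;
        MPI_Send(buf[dest].data(), (int)buf[dest].size(), MPI_DOUBLE,
                 dest, 0, comm);
        buf[dest].clear();
      }
    }
  };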

Page 44: Techniques for Developing Efficient Petascale Applications

All to all via Mesh

Organize processors in a 2D (virtual) grid

Phase 1:

Each processor sends messages within its row

Phase 2:

Each processor sends messages within its column

Message from (x1,y1) to (x2,y2) goes via (x1,y2)

2·(√P - 1) messages instead of P - 1

For us: 26 messages instead of 192
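A hedged sketch of the two-phase mesh all-to-all described above (illustrative only; not the implementation measured in the charts that follow). It assumes P is a perfect square and that every rank sends m doubles to every other rank; each phase exchanges data only within a row or a column, so each rank addresses 2·(√P - 1) peers instead of P - 1. For brevity, each phase uses MPI_Alltoall on the row/column sub-communicator.

  #include <mpi.h>
  #include <algorithm>
  #include <cmath>
  #include <vector>

  // sendBuf: P blocks of m doubles, ordered by destination rank.
  // recvBuf: on return, P blocks of m doubles, ordered by source rank.
  void meshAllToAll(const std::vector<double> &sendBuf,
                    std::vector<double> &recvBuf, int m, MPI_Comm comm) {
    int P, rank;
    MPI_Comm_size(comm, &P);
    MPI_Comm_rank(comm, &rank);
    int q = (int)std::lround(std::sqrt((double)P));   // virtual q x q grid
    int myRow = rank / q, myCol = rank % q;
    recvBuf.resize((size_t)P * m);

    MPI_Comm rowComm, colComm;                 // rank within rowComm = column index
    MPI_Comm_split(comm, myRow, myCol, &rowComm);
    MPI_Comm_split(comm, myCol, myRow, &colComm);   // rank within colComm = row index

    // Phase 1: to column c, send the q blocks destined for ranks (r2, c).
    std::vector<double> pack1((size_t)P * m), recv1((size_t)P * m);
    for (int c = 0; c < q; c++)
      for (int r2 = 0; r2 < q; r2++)
        std::copy(&sendBuf[(size_t)(r2 * q + c) * m],
                  &sendBuf[(size_t)(r2 * q + c) * m] + m,
                  &pack1[(size_t)(c * q + r2) * m]);
    MPI_Alltoall(pack1.data(), q * m, MPI_DOUBLE,
                 recv1.data(), q * m, MPI_DOUBLE, rowComm);
    // recv1 layout: [source column c1][destination row r2], all for my column.

    // Phase 2: to row r2, forward the q blocks (one per source column) for (r2, myCol).
    std::vector<double> pack2((size_t)P * m);
    for (int r2 = 0; r2 < q; r2++)
      for (int c1 = 0; c1 < q; c1++)
        std::copy(&recv1[(size_t)(c1 * q + r2) * m],
                  &recv1[(size_t)(c1 * q + r2) * m] + m,
                  &pack2[(size_t)(r2 * q + c1) * m]);
    MPI_Alltoall(pack2.data(), q * m, MPI_DOUBLE,
                 recvBuf.data(), q * m, MPI_DOUBLE, colComm);
    // recvBuf[(r1*q + c1)*m ...] now holds the block from source rank r1*q + c1.

    MPI_Comm_free(&rowComm);
    MPI_Comm_free(&colComm);
  }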

Page 45: Techniques for Developing Efficient Petascale Applications

[Chart: all-to-all completion time (ms) vs. number of processors (16 to 2048) on Lemieux, for 76-byte messages, comparing native MPI with the mesh, hypercube, and 3D-grid strategies.]

Page 46: Techniques for Developing Efficient Petascale Applications

All-to-all on Lemieux, 1024 processors

[Chart: time (ms) vs. message size (76 to 8076 bytes), showing the total mesh time and the portion for which the CPU is busy ("mesh compute").]

Bigger benefit: the CPU is occupied for a much shorter time!

Page 47: Techniques for Developing Efficient Petascale Applications

NAMD Performance on Lemieux

Page 48: Techniques for Developing Efficient Petascale Applications

Impact on Application Performance

[Chart: NAMD step time on Lemieux at 256, 512, and 1024 processors, with the transpose step implemented using different all-to-all algorithms (mesh, direct, MPI).]

Page 49: Techniques for Developing Efficient Petascale Applications


Sequential Performance Issues


Page 50: Techniques for Developing Efficient Petascale Applications

Example program

• Imagine a sequential program using a large array, A
• For each i, A[i] = A[i] + A[some other index]
• How long should the program take, if each addition takes a nanosecond?
• What performance difference do you expect, depending on how the other index is chosen?

  for (i=0, index2=0; i<size; i++) {
    index2 += SOME_NUMBER;      // smaller than size
    if (index2 > size)
      index2 -= size;
    A[i] += A[index2];
  }

  for (i=0; i<size-1; i++) {
    A[i] += A[i+1];
  }

Page 51: Techniques for Developing Efficient Petascale Applications

Caches and Cache Performance

• Remember the von Neumann model

[Diagram: CPU and memory; then the CPU with registers and a cache between it and memory.]

Page 52: Techniques for Developing Efficient Petascale Applications

Why and how does a cache help?

• Only because of the principle of locality
  – Programs tend to access the same and/or nearby data repeatedly
  – Temporal and spatial locality

• Size of cache

• Multiple levels of cache

• Performance impact of caches
  – Designing programs for good sequential performance

Page 53: Techniques for Developing Efficient Petascale Applications

Reality today: multi-level caches

• Remember the von Neumann model

[Diagram: a CPU with a single cache in front of memory; today, a CPU with multiple levels of cache in front of memory.]

Page 54: Techniques for Developing Efficient Petascale Applications

Example: Intel’s Nehalem

• Nehalem architecture, Core i7 chip:
  – 32 KB L1 instruction and 32 KB L1 data cache per core
  – 256 KB L2 cache (combined instruction and data) per core
  – 8 MB L3 cache (combined instruction and data), "inclusive", shared by all cores

• Still, even the L1 cache takes several cycles
  – (reportedly 4 cycles, increased from 3 before)
  – L2: 10 cycles

Page 55: Techniques for Developing Efficient Petascale Applications

A little bit about microprocessor architecture

• Architecture over the last 2-3 decades was driven by the need to make the clock cycle faster and faster
  – Pipelining developed as an essential technique early on
  – Each instruction's execution is pipelined:
    • Fetch, decode, and execute stages, at least
  – In addition, floating point operations, which take longer to compute, have their own separate pipeline

• L1 cache accesses in Nehalem are pipelined
  – So even though it takes 4 cycles to get the result, you can keep issuing a new load every cycle, and you would (almost) not notice a difference if they all hit in the L1 cache

Page 56: Techniques for Developing Efficient Petascale Applications

Another issue: SIMD vectors

• Hardware is capable of executing multiple floating point instructions per cycle
  – You need to enable that, by using SIMD vector instructions
  – Examples: Intel's SSE and IBM's AltiVec

• Compilers try to automate it,
  – but are not always successful

• Learn manual vectorization
  – Or use libraries that help

Example from Wikipedia (SSE assembly):

  movaps xmm0, [v1]        ; xmm0 = v1.w | v1.z | v1.y | v1.x
  addps  xmm0, [v2]        ; xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
  movaps [vec_res], xmm0
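The same 4-wide single-precision addition written with compiler intrinsics, as a hedged sketch (the function name and the unaligned-load choice are illustrative; the deck itself only shows the assembly above):

  // Minimal SSE intrinsics sketch: add two float arrays four elements at a time.
  #include <xmmintrin.h>   // SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

  void addVectors(const float *v1, const float *v2, float *res, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {                 // 4 floats per SSE register
      __m128 a = _mm_loadu_ps(v1 + i);           // unaligned load of v1[i..i+3]
      __m128 b = _mm_loadu_ps(v2 + i);
      _mm_storeu_ps(res + i, _mm_add_ps(a, b));  // res[i..i+3] = a + b
    }
    for (; i < n; i++)                           // scalar remainder
      res[i] = v1[i] + v2[i];
  }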

Page 57: Techniques for Developing Efficient Petascale Applications

Broad Approach to Performance Tuning

• Understand (for a given application):
  – Fraction of peak performance
    • (10% is good for many apps!)
  – Parallel efficiency:
    • Speedup plots
    • Strong scaling: keep the problem size fixed
    • Weak scaling: increase the problem size with the number of processors
  – These help you decide where to focus

• Sequential optimizations => fraction of peak
  – Use the right compiler flags (basic: -O3)

• Parallel inefficiency:
  – Grainsize, communication costs, idle time, critical paths, load imbalances
  – One way to recognize load imbalance: wait time at barriers!

Page 58: Techniques for Developing Efficient Petascale Applications


Decouple decomposition from Physical Processors


Page 59: Techniques for Developing Efficient Petascale Applications

Migratable Objects (aka Processor Virtualization)

Programmer: [over]decomposition into virtual processors
Runtime: assigns VPs to processors
This enables adaptive runtime strategies
Implementations: Charm++, AMPI

[Figure: user view (a collection of communicating virtual processors) vs. the system implementation (VPs mapped onto physical processors).]

Benefits:
• Software engineering
  – The number of virtual processors can be independently controlled
  – Separate VPs for different modules
• Message-driven execution
  – Adaptive overlap of communication
  – Predictability:
    • Automatic out-of-core
  – Asynchronous reductions
• Dynamic mapping
  – Heterogeneous clusters
    • Vacate, adjust to speed, share
  – Automatic checkpointing
  – Change the set of processors used
  – Automatic dynamic load balancing
  – Communication optimization

Page 60: Techniques for Developing Efficient Petascale Applications

Migratable Objects (continued)

[Figure: the same user view vs. system view, labeling the real processors, the MPI processes, and the virtual processors (user-level migratable threads); the benefits list is repeated from the previous slide.]

Page 61: Techniques for Developing Efficient Petascale Applications


Parallel Decomposition and Processors

• MPI-style encourages:
  – Decomposition into P pieces, where P is the number of physical processors available
  – If your natural decomposition is a cube, then the number of processors must be a cube
  – ...

• Charm++/AMPI style "virtual processors":
  – Decompose into the natural objects of the application
  – Let the runtime map them to processors
  – Decouple decomposition from load balancing
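A minimal Charm++ sketch of the over-decomposition idea, in the style of a Charm++ hello-world. It is hedged: the file names, the Block chare array, and the 4× over-decomposition factor are illustrative and not from the deck; the usual .ci interface file is shown as a comment.

  // jacobi.ci (interface file), shown here as a comment:
  //   mainmodule jacobi {
  //     readonly CProxy_Main mainProxy;
  //     mainchare Main   { entry Main(CkArgMsg *m); entry void done(); };
  //     array [1D] Block { entry Block(); entry void iterate(); };
  //   };

  // jacobi.C
  #include "jacobi.decl.h"

  /*readonly*/ CProxy_Main mainProxy;
  static const int OVERDECOMP = 4;           // several chares per processor

  class Main : public CBase_Main {
    int remaining;
  public:
    Main(CkArgMsg *m) {
      delete m;
      remaining = OVERDECOMP * CkNumPes();   // many more objects than PEs
      mainProxy = thisProxy;
      CProxy_Block blocks = CProxy_Block::ckNew(remaining);
      blocks.iterate();                      // broadcast to every array element
    }
    void done() { if (--remaining == 0) CkExit(); }
  };

  class Block : public CBase_Block {
  public:
    Block() {}
    void iterate() {
      // ... relax this block's sub-domain; the runtime decides which PE runs it
      //     and can migrate the object later for load balance ...
      mainProxy.done();
    }
  };

  #include "jacobi.def.h"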

Page 62: Techniques for Developing Efficient Petascale Applications

Decomposition independent of numCores

• Rocket simulation example under traditional MPI vs. the Charm++/AMPI framework
  – Benefit: load balance, communication optimizations, modularity

[Diagram: under MPI, each of the P processors holds one Solid and one Fluid partition; under Charm++/AMPI, the solid domain is decomposed into Solid1..Solidn and the fluid domain into Fluid1..Fluidm, independently of the number of processors.]

Page 63: Techniques for Developing Efficient Petascale Applications

OpenAtom: Car-Parrinello ab initio MD

Collaborative IT project with: R. Car, M. Klein, M. Tuckerman, Glenn Martyna, N. Nystrom, ...

Specific software project (leanCP): Glenn Martyna, Mark Tuckerman, L.V. Kale and co-workers (E. Bohm, Yan Shi, Ramkumar Vadali)

Funding: NSF-CHE, NSF-CS, NSF-ITR, IBM

Page 64: Techniques for Developing Efficient Petascale Applications

New OpenAtom Collaboration

• Principal Investigators:
  – M. Tuckerman (NYU)
  – L.V. Kale (UIUC)
  – G. Martyna (IBM TJ Watson)
  – K. Schulten (UIUC)
  – J. Dongarra (UTK/ORNL)

• Current effort is focused on:
  – QM/MM via integration with NAMD2
  – ORNL Cray XT4 Jaguar (31,328 cores)
  – ALCF IBM Blue Gene/P (163,840 cores)

Page 65: Techniques for Developing Efficient Petascale Applications

Ab initio molecular dynamics (electronic structure simulation) enables the study of many important systems

[Images: molecular clusters, nanowires, semiconductor surfaces, 3D solids/liquids.]

Page 66: Techniques for Developing Efficient Petascale Applications

Quantum Chemistry

• Car-Parrinello Molecular Dynamics
  – High precision: AIMD uses quantum-mechanical descriptions of electronic structure to determine the forces between atoms, thereby permitting accurate atomistic descriptions of chemical reactions.
  – PPL's OpenAtom project features a unique parallel decomposition of the Car-Parrinello method. Using Charm++ virtualization, we can efficiently scale small (32-molecule) systems to thousands of processors.

Page 67: Techniques for Developing Efficient Petascale Applications

Computation Flow


Page 68: Techniques for Developing Efficient Petascale Applications

Torus Aware Object Mapping


Page 69: Techniques for Developing Efficient Petascale Applications

OpenAtom Performance


Page 70: Techniques for Developing Efficient Petascale Applications

Benefits of Topology Mapping

[Charts: Watson Blue Gene/L (CO mode) and PSC BigBen Cray XT3 (SN and VN modes).]

Page 71: Techniques for Developing Efficient Petascale Applications

Use Dynamic Load Balancing Based on the Principle of Persistence

Principle of persistence: computational loads and communication patterns tend to persist, even in dynamic computations. So the recent past is a good predictor of the near future.

Page 72: Techniques for Developing Efficient Petascale Applications


Load Balancing Steps

[Timeline: regular timesteps; instrumented timesteps; a detailed, aggressive load balancing step; then periodic refinement load balancing steps.]

Page 73: Techniques for Developing Efficient Petascale Applications

Processor utilization against time, on 128 and 1024 processors

On 128 processors, a single load balancing step suffices, but on 1024 processors we need a "refinement" step.

[Plots: utilization vs. time, with the aggressive load balancing and refinement load balancing steps marked.]

Page 74: Techniques for Developing Efficient Petascale Applications

ChaNGa: Parallel Gravity

• Collaborative project (NSF ITR)
  – With Prof. Tom Quinn, Univ. of Washington
• Components: gravity, gas dynamics
• Barnes-Hut tree codes
  – An oct-tree is the natural decomposition:
    • The geometry has better aspect ratios, so you "open" fewer nodes up
    • But it is often not used because it leads to bad load balance
    • Assumption: a one-to-one map between sub-trees and processors
    • Binary trees are considered better load balanced
  – With Charm++: use the oct-tree, and let Charm++ map subtrees to processors

Page 75: Techniques for Developing Efficient Petascale Applications


Load balancing with GreedyLB

dwarf 5M on 1,024 BlueGene/L processors

5.6 s → 6.1 s

[Charts: number of messages (×1000) and bytes transferred (MB), before and after load balancing.]

Page 76: Techniques for Developing Efficient Petascale Applications


Load balancing with OrbRefineLB

dwarf 5M on 1,024 BlueGene/L processors

5.6 s → 5.0 s

Page 77: Techniques for Developing Efficient Petascale Applications

ChaNGa: Parallel Gravity Code

Developed in collaboration with Tom Quinn (Univ. of Washington) using Charm++

ChaNGa preliminary performance

Page 78: Techniques for Developing Efficient Petascale Applications


Summary

• Exciting times ahead
• Petascale computers
  – Unprecedented opportunities for progress in science and engineering
  – Petascale applications will require a large toolbox, with:
    • Algorithms, adaptive runtime systems, performance tools, ...
    • Object-based decomposition
    • Dynamic load balancing
    • Scalable performance analysis
• Early performance development via BigSim

My research: http://charm.cs.uiuc.edu

Blue Waters: http://www.ncsa.uiuc.edu/BlueWaters/