Techniques for Developing Efficient Petascale Applications
Laxmikant (Sanjay) Kale
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana Champaign
Outline
• Basic techniques for attaining good performance
• Scalability analysis of algorithms
• Measurements and tools
• Communication optimizations:
  – Communication basics
  – Overlapping communication and computation
  – Alpha-beta optimizations
• Combining and pipelining
  – (Topology awareness)
• Sequential optimizations
• (Load balancing)
Parallel Objects,
Adaptive Runtime System
Libraries and Tools
Examples based on multiple applications:
Molecular Dynamics
Crack Propagation
Space-time meshes
Computational Cosmology
Rocket Simulation
Protein Folding
Dendritic Growth
Quantum Chemistry (QM/MM)
Analyze Performance with both Simple and Sophisticated Tools
Simple techniques
• Timers: wall-clock timer (time.h)
• Counters: use the PAPI library's raw counters
  – Especially useful:
    • number of floating point operations
    • cache misses (L2, L1, ...)
    • memory accesses
• Output method:
  – "printf" (or cout) can be expensive
  – Store timer values into an array or buffer, and print at the end
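A minimal sketch of the buffered-timing idea in C (my illustration; the step count and the loop body are placeholders, not from the slides): timestamps are stored in a preallocated array inside the timed loop and printed only after the loop finishes, so I/O cost does not distort the measurement.

#include <stdio.h>
#include <time.h>

#define NSTEPS 1000                 /* hypothetical number of timesteps to instrument */

static double elapsed[NSTEPS];      /* per-step times; printed only at the end */

static double wall_time(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    for (int step = 0; step < NSTEPS; step++) {
        double t0 = wall_time();
        /* ... computation for this timestep goes here ... */
        elapsed[step] = wall_time() - t0;   /* store, do not print, inside the loop */
    }
    /* Print once, after the timed region */
    for (int step = 0; step < NSTEPS; step++)
        printf("step %d: %.6f s\n", step, elapsed[step]);
    return 0;
}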
Sophisticated tools
• Many tools exist
• Have some learning curve, but can be beneficial
• Example tools:
  – Jumpshot
  – TAU
  – Scalasca
  – Projections
  – Vampir ($$)
• PMPI interface:
  – Allows you to intercept MPI calls
    • So you can write your own tools (see the sketch below)
  – PMPI interface for Projections:
    • git://charm.cs.uiuc.edu/PMPI_Projections.git
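As a hedged illustration of the PMPI mechanism (my sketch, not the Projections wrapper itself): a tool-defined MPI_Send can time the call and forward it to the real implementation through the name-shifted PMPI_Send entry point, and a wrapped MPI_Finalize reports the totals.

#include <mpi.h>
#include <stdio.h>

static double send_time = 0.0;   /* accumulated time spent in MPI_Send */
static long   send_count = 0;

/* Our wrapper overrides MPI_Send; the MPI library still provides PMPI_Send.
   MPI-3 signature shown; drop the const for older MPI versions. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    send_time += MPI_Wtime() - t0;
    send_count++;
    return rc;
}

int MPI_Finalize(void) {
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld sends, %.3f s in MPI_Send\n", rank, send_count, send_time);
    return PMPI_Finalize();
}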
Example: Projections Performance Analysis Tool
• Automatic instrumentation via the runtime
• Graphical visualizations
• More insightful feedback
  – because the runtime understands application events better
Exploit sophisticated performance analysis tools
• We use a tool called "Projections"
• Many other tools exist
• Need for scalable analysis
• A not-so-recent example: trying to identify the next performance obstacle for NAMD
  – Running on 8192 processors, with a 92,000-atom simulation
  – Test scenario: without PME
  – Time is 3 ms per step, but the lower bound is 1.6 ms or so
Performance Tuning with Patience and Perseverance
Performance Tuning with Perseverance
• Recognize the multi-dimensional nature of the performance space
• Don't stop optimizing until you know for sure why it cannot be sped up further
  – Measure, measure, measure ...
Charm++'s "Projections" analysis tool: time intervals on the x axis, activity summed across processors on the y axis
Apo-A1 on BlueGene/L, 1024 processors: 94% efficiency
Shallow valleys, high peaks, nicely overlapped PME
Green: communication; red: integration; blue/purple: electrostatics; turquoise: angle/dihedral; orange: PME
Cray XT3, 512 processors: initial runs, 76% efficiency
Clearly, needed further tuning, especially PME.
But had more potential (much faster processors).
On Cray XT3, 512 processors, after optimizations: 96% efficiency
Communication Issues
Recap: Communication Basics: Point-to-point
A point-to-point message passes through: sending processor, sending co-processor, network, receiving co-processor, receiving processor.
Each component has a per-message cost and a per-byte cost: the cost of an n-byte message at each component is α + nβ.
Important metrics:
  • Overhead at processor and co-processor
  • Network latency
  • Network bandwidth consumed
  • Number of hops traversed
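A hedged worked example of the α + nβ model (the constants here are illustrative, not measured numbers from the talk): with a per-message cost α = 10 µs and a per-byte cost β corresponding to 1 GB/s,

\[
T(n) = \alpha + n\beta, \qquad
T(100\ \text{B}) \approx 10\ \mu\text{s} + 0.1\ \mu\text{s} \approx 10.1\ \mu\text{s}, \qquad
T(1\ \text{MB}) \approx 10\ \mu\text{s} + 1000\ \mu\text{s} \approx 1\ \text{ms}.
\]

For small messages the α term dominates (which motivates the message-combining optimizations later); for large messages the nβ term dominates.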
Communication Basics
• Message latency: time between the application sending the message and receiving it on the other processor
• Send overhead: time for which the sending processor was "occupied" with the message
• Receive overhead: time for which the receiving processor was "occupied" with the message
• Network latency
Communication: Diagnostic Techniques
• A simple technique: find the "grainsize"
  – Count the number of messages per second of computation per processor (max, average)
  – Count the number of bytes
  – Calculate: computation per message (and per byte)
• Use profiling tools:
  – Identify time spent in different communication operations
  – Classified by modules
• Examine idle time using time-line displays
  – On important processors
  – Determine the causes
• Be careful with "synchronization overhead"
  – May be load imbalance masquerading as sync overhead
  – A common mistake
Communication: Problems and Issues
• Too small a grainsize
  – Total computation time / total number of messages
  – Separated by phases, modules, etc.
• Too many, but short, messages
  – α vs. β tradeoff
• Processors wait too long
• Later:
  – Locality of communication
    • Local vs. non-local
    • How far is non-local? (Does that matter?)
  – Synchronization
  – Global (collective) operations
    • All-to-all operations, gather, scatter
• We will focus on communication cost (grainsize)
Communication: Solution Techniques
• Overview:
  – Overlap with computation
    • Manual
    • Automatic and adaptive, using virtualization
  – Increasing grainsize
  – α-reducing optimizations
    • Message combining
    • Communication patterns
  – Controlled pipelining
  – Locality enhancement: decomposition control
    • Local-remote and bandwidth reduction
  – Asynchronous reductions
  – Improved collective-operation implementations
Overlapping Communication and Computation
• Problem:
  – Processors wait for too long at "receive" statements
• Idea:
  – Instead of waiting for data, do useful work
  – Issue: how to create such work?
    • It can't depend on the data about to be received
• Routine communication optimizations in MPI:
  – Move sends up and receives down
    • Keep data dependencies in mind
    • Moving a receive down has a cost: the system needs to buffer the message
  – Use irecvs, but be careful
    • An irecv allows you to post a buffer for a receive without waiting for it (see the sketch below)
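A minimal sketch of the overlap idea with nonblocking MPI, assuming a hypothetical halo exchange with left/right neighbors; the interior- and boundary-work functions are placeholder stubs, not application code from the talk.

#include <mpi.h>

/* Placeholder stubs for real application work */
static void compute_interior(void) { /* work that does not need the incoming halo */ }
static void compute_boundary(double *halo) { (void)halo; /* work that does need it */ }

void exchange_and_overlap(double *send_buf, double *recv_buf, int n,
                          int left, int right, MPI_Comm comm) {
    MPI_Request reqs[2];

    /* Post the receive early, so the incoming message has a buffer waiting */
    MPI_Irecv(recv_buf, n, MPI_DOUBLE, left, 0, comm, &reqs[0]);
    MPI_Isend(send_buf, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    /* Do useful work that does not depend on the data being received */
    compute_interior();

    /* Only now wait for the communication to complete */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* Work that needed the received halo runs after the wait */
    compute_boundary(recv_buf);
}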
Major analytical/theoretical techniques
• Typically involve simple algebraic formulas and ratios
  – Typical variables are:
    • data size (N), number of processors (P), machine constants
  – Model the performance of individual operations, components, and algorithms in terms of the above
• Be careful to characterize variations across processors, and model them with (typically) max operators
  – E.g., max_i {Load_i}
• Remember that constants are important in practical parallel computing
  – Be wary of asymptotic analysis: use it, but carefully
• Scalability analysis:
  – Isoefficiency
Analyze Scalability of the Algorithm (say, via the isoefficiency metric)
Scalability
• The program should scale up to use a large number of processors
  – But what does that mean?
• An individual simulation isn't truly scalable
• Better definition of scalability:
  – If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size
Isoefficiency Analysis
• An algorithm (*) is scalable if:
  – when you double the number of processors available to it, it can retain the same parallel efficiency by increasing the size of the problem by some amount
  – Not all algorithms are scalable in this sense
  – Isoefficiency is the rate at which the problem size must be increased, as a function of the number of processors, to keep the same efficiency
  – Use η(P, N) = η(x·P, y·N) to get this relation
• Parallel efficiency = T1 / (Tp · P)
  – T1: time on one processor
  – Tp: time on P processors
[Figure: equal-efficiency curves in the plane of processors vs. problem size]
Gauss-Jacobi Relaxation
Sequential pseudocode:

while (maxError > Threshold) {
  Re-apply boundary conditions
  maxError = 0;
  for i = 0 to N-1 {
    for j = 0 to N-1 {
      B[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i,j+1] + A[i+1,j] + A[i-1,j]);
      if (|B[i,j] - A[i,j]| > maxError)
        maxError = |B[i,j] - A[i,j]|;
    }
  }
  swap B and A
}

Decomposition by: row, column, or blocks
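A hedged sketch (not from the slides) of what the row decomposition looks like with MPI ghost-row exchange. Boundary columns are skipped for brevity, and both arrays, including boundary values, are assumed to be initialized by the caller; ghost rows at the physical boundaries (MPI_PROC_NULL neighbors) simply keep their initial values.

#include <mpi.h>

/* Each rank owns 'rows' interior rows of an N x N grid, plus 2 ghost rows.
   a and b are (rows+2) x N arrays; row 0 and row rows+1 are ghosts. */
void jacobi_row_decomposed(double *a, double *b, int rows, int N,
                           double threshold, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int up   = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

    double maxerr = threshold + 1.0;
    while (maxerr > threshold) {
        /* Exchange ghost rows with neighbors: 2 messages of N doubles each */
        MPI_Sendrecv(&a[1*N], N, MPI_DOUBLE, up, 0,
                     &a[(rows+1)*N], N, MPI_DOUBLE, down, 0, comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&a[rows*N], N, MPI_DOUBLE, down, 1,
                     &a[0*N], N, MPI_DOUBLE, up, 1, comm, MPI_STATUS_IGNORE);

        double localerr = 0.0;
        for (int i = 1; i <= rows; i++) {
            for (int j = 1; j < N - 1; j++) {      /* column boundaries skipped */
                double v = 0.2 * (a[i*N+j] + a[i*N+j-1] + a[i*N+j+1]
                                  + a[(i+1)*N+j] + a[(i-1)*N+j]);
                double d = v - a[i*N+j];
                if (d < 0) d = -d;
                if (d > localerr) localerr = d;
                b[i*N+j] = v;
            }
        }
        /* Global convergence test: everyone needs the maximum error */
        MPI_Allreduce(&localerr, &maxerr, 1, MPI_DOUBLE, MPI_MAX, comm);

        double *tmp = a; a = b; b = tmp;           /* swap A and B */
    }
}

Note that the per-step communication is two boundary rows of N doubles per processor (about 16N bytes), which is the count used in the isoefficiency analysis below.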
Isoefficiency of Jacobi Relaxation

Row decomposition:
• Computation per PE: A · N · (N/P)
• Communication: 16 · N
• Comm-to-comp ratio: γ = (16 · P) / (A · N)
• Efficiency: 1 / (1 + γ)
• Isoefficiency: to keep γ constant, N must grow linearly with P, so the problem size (N²) must grow as P² (quadratic)

Block decomposition:
• Computation per PE: A · N · (N/P)
• Communication: 32 · N / √P
• Comm-to-comp ratio: γ = (32 · √P) / (A · N)
• Efficiency: 1 / (1 + γ)
• Isoefficiency: N² need only grow linearly with P, i.e., linear in problem size
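Filling in the algebra from the ratios above to the isoefficiency conclusion (my derivation, consistent with the quantities on this slide):

\[
\text{Row: } \gamma = \frac{16N}{A N^2 / P} = \frac{16P}{AN}, \qquad
E = \frac{1}{1+\gamma}\ \text{constant} \iff \frac{P}{N}\ \text{constant}
\iff N \propto P \iff N^2 \propto P^2 .
\]
\[
\text{Block: } \gamma = \frac{32N/\sqrt{P}}{A N^2 / P} = \frac{32\sqrt{P}}{AN}, \qquad
E\ \text{constant} \iff \frac{\sqrt{P}}{N}\ \text{constant} \iff N^2 \propto P .
\]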
NAMD: A Production MD Program
• Fully featured program
• NIH-funded development
• Distributed free of charge (~20,000 registered users)
• Binaries and source code
• Installed at NSF centers
• User training and support
• Large published simulations
Molecular Dynamics in NAMD
• Collection of [charged] atoms, with bonds
  – Newtonian mechanics
  – Thousands of atoms (10,000 – 5,000,000)
• At each time-step:
  – Calculate forces on each atom
    • Bonds
    • Non-bonded: electrostatic and van der Waals
      – Short-range: every timestep
      – Long-range: using PME (3D FFT)
      – Multiple time stepping: PME every 4 timesteps
  – Calculate velocities and advance positions
• Challenge: femtosecond time-step, millions needed!
Collaboration with K. Schulten, R. Skeel, and coworkers
Traditional Approaches: not isoefficient
• Replicated data:
  – All atom coordinates stored on each processor
  – Communication/computation ratio: O(P log P)
• Partition the atoms array across processors:
  – Nearby atoms may not be on the same processor
  – C/C ratio: O(P)
• Distribute the force matrix to processors:
  – Matrix is sparse, non-uniform
  – C/C ratio: O(sqrt(P))
Spatial Decomposition via Charm
• Atoms distributed to cubes ("cells" or "patches") based on their location
• Size of each cube: just a bit larger than the cut-off radius
• Communicate only with neighbors
• Work: for each pair of neighboring objects
• C/C ratio: O(1)
• However:
  – Load imbalance
  – Limited parallelism
Charm++ is useful to handle this
Object-Based Parallelization for MD: Force Decomposition + Spatial Decomposition
• Now we have many objects to load balance:
  – Each diamond (pairwise-force object) can be assigned to any processor
• Number of diamonds (3D): 14 × number of patches
• 2-away variation:
  – Half-size cubes
  – 5×5×5 interactions
• 3-away variation: 7×7×7 interactions
Strong Scaling on JaguarPF
[Figure: strong-scaling results at 6,720; 53,760; 107,520; and 224,076 cores]
Gauss-Seidel Relaxation
Sequential pseudocode (no old/new arrays):

while (maxError > Threshold) {
  Re-apply boundary conditions
  maxError = 0;
  for i = 0 to N-1 {
    for j = 0 to N-1 {
      old = A[i,j];
      A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i,j+1] + A[i+1,j] + A[i-1,j]);
      if (|A[i,j] - old| > maxError)
        maxError = |A[i,j] - old|;
    }
  }
}

Sequentially, how well does this work? It works much better (it converges in fewer iterations)!
How do we parallelize this?
How do we parallelize Gauss-Seidel?
• Visualize the flow of values
• Not the control flow:
  – That goes row by row
• The flow of dependences: which values depend on which values
• Does that give us a clue on how to parallelize?
Parallelizing Gauss-Seidel
• Some ideas:
  – Row decomposition, with pipelining
  – Square over-decomposition
    • Assign many squares to a processor (essentially the same?)
[Figure: row decomposition with pipelining. P processors each own N/P rows; work proceeds in column blocks of width W (N/W blocks per row). Processor i can start a block only after processor i-1 has finished it, so the computation takes N/W + P - 1 phases, and all P processors are busy only in the middle of the run (pipeline fill and drain).]
Red-Black Squares Method
• Color the squares red and black in a checkerboard pattern
• Red squares calculate values based on the black squares
  – Then black squares use values from the red squares
  – Now all red squares can be done in parallel, and then all black squares can be done in parallel
• Each square locally can do a Gauss-Seidel computation (see the sketch below)
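A hedged sequential sketch of the red/black sweep structure (my illustration, not code from the talk): each color can be updated in parallel because its points read only values of the other color, plus their own old value.

/* One red-black iteration over an N x N grid A (boundary rows/columns untouched).
   Points with (i + j) even are "red" (color 0), odd are "black" (color 1).
   All points of one color depend only on the other color, so each color
   sweep can be done in parallel. */
void red_black_sweep(double *A, int N) {
    for (int color = 0; color < 2; color++) {
        for (int i = 1; i < N - 1; i++) {
            int jstart = 1 + ((i + color + 1) % 2);  /* first interior j with (i+j)%2 == color */
            for (int j = jstart; j < N - 1; j += 2) {
                A[i*N + j] = 0.2 * (A[i*N + j] + A[i*N + j - 1] + A[i*N + j + 1]
                                    + A[(i+1)*N + j] + A[(i-1)*N + j]);
            }
        }
    }
}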
Communication: α-reducing optimizations
• When you are sending too many tiny messages:
  – The α cost is high (a microsecond per message, for example)
  – How to reduce it?
• Simple combining:
  – Combine messages going to the same destination
  – Cost: delay (less pipelining)
• More complex scenario:
  – All-to-all: everyone wants to send a short message to everyone else
  – Direct method: α·(P−1) + β·(P−1)·m
  – For small m, the α cost dominates
All-to-all via Mesh
Organize processors in a 2D (virtual) grid.
Phase 1: each processor sends messages within its row.
Phase 2: each processor sends messages within its column.
A message from (x1, y1) to (x2, y2) goes via (x1, y2).
2·(√P − 1) messages per processor instead of P − 1.
For us: 26 messages instead of 192.
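To make the α–β tradeoff explicit (my arithmetic, assuming each mesh-phase message carries the ≈ √P combined blocks destined for one row or column):

\[
T_{\text{direct}} = (P-1)\,\alpha + (P-1)\,m\,\beta,
\qquad
T_{\text{mesh}} \approx 2(\sqrt{P}-1)\,\alpha + 2(\sqrt{P}-1)\sqrt{P}\,m\,\beta
\approx 2(\sqrt{P}-1)\,\alpha + 2P\,m\,\beta .
\]

The per-message (α) term drops from P − 1 to about 2√P while the bytes sent roughly double, which is why the mesh wins for small m.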
[Figure: all-to-all completion time (ms) vs. number of processors (16 to 2048) on Lemieux for 76-byte messages, comparing MPI with mesh, hypercube, and 3D-grid implementations]
[Figure: all-to-all time (ms) vs. message size (76 to 8076 bytes) on 1024 processors of Lemieux, showing total mesh time and the mesh compute time]
Bigger benefit: the CPU is occupied for a much shorter time!
Impact on Application Performance
[Figure: NAMD step time on Lemieux at 256, 512, and 1024 processors, with the transpose step implemented using different all-to-all algorithms (Mesh, Direct, MPI)]
Sequential Performance Issues
Example program
• Imagine a sequential program operating on a large array, A
• For each i: A[i] = A[i] + A[some other index]
• How long should the program take, if each addition takes a nanosecond?
• What performance difference do you expect, depending on how the other index is chosen?

for (i=0, index2=0; i<size; i++) {
  index2 += SOME_NUMBER;   // smaller than size
  if (index2 > size) index2 -= size;
  A[i] += A[index2];
}

for (i=0; i<size-1; i++) {
  A[i] += A[i+1];
}
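A hedged, self-contained harness for timing the two access patterns above; the array size and stride are made-up values chosen so the array far exceeds the caches, and the wrap test uses >= so the index stays in bounds.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE (64 * 1024 * 1024)   /* 64M doubles = 512 MB, far larger than any cache */
#define SOME_NUMBER 10007          /* arbitrary stride, smaller than SIZE */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    double *A = malloc(SIZE * sizeof *A);
    if (!A) return 1;
    for (long i = 0; i < SIZE; i++) A[i] = 1.0;

    /* Strided access: most accesses miss in cache */
    double t0 = now();
    for (long i = 0, index2 = 0; i < SIZE; i++) {
        index2 += SOME_NUMBER;
        if (index2 >= SIZE) index2 -= SIZE;   /* wrap (>= to stay in bounds) */
        A[i] += A[index2];
    }
    double strided = now() - t0;

    /* Sequential access: spatial locality and hardware prefetching help */
    t0 = now();
    for (long i = 0; i < SIZE - 1; i++) A[i] += A[i + 1];
    double sequential = now() - t0;

    printf("strided: %.3f s, sequential: %.3f s, ratio %.1fx\n",
           strided, sequential, strided / sequential);
    free(A);
    return 0;
}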
Caches and Cache Performance
• Remember the von Neumann model
[Diagram: CPU with registers, a cache, and main memory]
Why and how does a cache help?
• Only because of the principle of locality
  – Programs tend to access the same and/or nearby data repeatedly
  – Temporal and spatial locality
• Size of cache
• Multiple levels of cache
• Performance impact of caches
  – Designing programs for good sequential performance
Reality today: multi-level caches
• Remember the von Neumann model
[Diagram: CPU with multiple levels of cache between the CPU and memory]
Example: Intel's Nehalem
• Nehalem architecture, Core i7 chip:
  – 32 KB L1 instruction and 32 KB L1 data cache per core
  – 256 KB L2 cache (combined instruction and data) per core
  – 8 MB L3 (combined instruction and data), "inclusive", shared by all cores
• Still, even an L1 hit takes several cycles
  – (reportedly 4 cycles, increased from 3 before)
  – L2: about 10 cycles
A little bit about microprocessor architecture
• Architecture over the last 2-3 decades was driven by the need to make the clock cycle faster and faster
  – Pipelining developed as an essential technique early on
  – Each instruction's execution is pipelined:
    • fetch, decode, and execute stages, at least
  – In addition, floating point operations, which take longer to compute, have their own separate pipeline
• L1 cache accesses in Nehalem are pipelined
  – so even though it takes 4 cycles to get the result, you can keep issuing a new load every cycle, and you would (almost) not notice a difference if they are all L1 "hits"
Another issue: SIMD vectors
• Hardware is capable of executing multiple floating point operations per cycle
  – Need to enable that, by using SIMD vector instructions
  – Examples: Intel's SSE and IBM's AltiVec
• Compilers try to automate it,
  – but are not always successful
• Learn manual vectorization
  – Or use libraries that help

Example from Wikipedia (SSE assembly):
movaps xmm0, [v1]       ; xmm0 = v1.w | v1.z | v1.y | v1.x
addps  xmm0, [v2]       ; xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
movaps [vec_res], xmm0
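The same 4-wide single-precision add written with SSE compiler intrinsics, which is the more common way to vectorize manually from C; this is a sketch, with alignment and remainder handling ignored and n assumed to be a multiple of 4.

#include <xmmintrin.h>   /* SSE intrinsics */

/* c[i] = a[i] + b[i], four floats per instruction */
void add_vectors_sse(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* add and store 4 results */
    }
}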
Broad Approach to Performance Tuning
• Understand (for a given application):
  – Fraction of peak performance
    • (10% is good for many apps!)
  – Parallel efficiency:
    • Speedup plots
    • Strong scaling: keep problem size fixed
    • Weak scaling: increase problem size with processors
  – These help you decide where to focus
• Sequential optimizations => fraction of peak
  – Use the right compiler flags (basic: -O3)
• Parallel inefficiency:
  – Grainsize, communication costs, idle time, critical paths, load imbalances
  – One way to recognize it: wait time at barriers!
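A hedged example of the fraction-of-peak arithmetic (the machine numbers here are made up for illustration):

\[
\text{peak} = \text{cores} \times \text{clock} \times \frac{\text{flops}}{\text{cycle}}
= 8 \times 2.5\ \text{GHz} \times 4 = 80\ \text{GFLOP/s},
\qquad
\text{fraction of peak} = \frac{6\ \text{GFLOP/s measured}}{80\ \text{GFLOP/s}} = 7.5\% .
\]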
Decouple decomposition from Physical Processors
Migratable Objects (aka Processor Virtualization)
Programmer: [over]decomposition into virtual processors
Runtime: assigns VPs to processors
This enables adaptive runtime strategies
Implementations: Charm++, AMPI
[Diagram: user view of many virtual processors; system implementation maps them onto physical processors]
Benefits
• Software engineering
  – Number of virtual processors can be controlled independently
  – Separate VPs for different modules
• Message-driven execution
  – Adaptive overlap of communication
  – Predictability:
    • Automatic out-of-core
  – Asynchronous reductions
• Dynamic mapping
  – Heterogeneous clusters
    • Vacate, adjust to speed, share
  – Automatic checkpointing
  – Change the set of processors used
  – Automatic dynamic load balancing
  – Communication optimization
[Same slide, with the system view labeled: virtual processors (user-level migratable threads) grouped into MPI processes and mapped onto real processors]
Parallel Decomposition and Processors
• MPI style encourages:
  – Decomposition into P pieces, where P is the number of physical processors available
  – If your natural decomposition is a cube, then the number of processors must be a cube
  – …
• Charm++/AMPI style "virtual processors":
  – Decompose into natural objects of the application
  – Let the runtime map them to processors
  – Decouple decomposition from load balancing
Decomposition independent of numCores
• Rocket simulation example under traditional MPI vs. the Charm++/AMPI framework
  – Benefit: load balance, communication optimizations, modularity
[Diagram: under MPI, one Solid and one Fluid piece per processor (1, 2, ..., P); under Charm++/AMPI, many Solid (1..n) and Fluid (1..m) virtual processors, mapped by the runtime]
OpenAtom: Car-Parrinello ab initio MD
Collaborative IT project with: R. Car, M. Klein, M. Tuckerman, Glenn Martyna, N. Nystrom, ...
Specific software project (leanCP): Glenn Martyna, Mark Tuckerman, L.V. Kale and co-workers (E. Bohm, Yan Shi, Ramkumar Vadali)
Funding: NSF-CHE, NSF-CS, NSF-ITR, IBM
New OpenAtom Collaboration
• Principal Investigators:
  – M. Tuckerman (NYU)
  – L.V. Kale (UIUC)
  – G. Martyna (IBM TJ Watson)
  – K. Schulten (UIUC)
  – J. Dongarra (UTK/ORNL)
• Current effort is focused on:
  – QM/MM via integration with NAMD2
  – ORNL Cray XT4 Jaguar (31,328 cores)
  – ALCF IBM Blue Gene/P (163,840 cores)
Ab initio molecular dynamics (electronic structure simulation) enables the study of many important systems
[Images: molecular clusters, nanowires, semiconductor surfaces, 3D solids/liquids]
Quantum Chemistry
• Car-Parrinello molecular dynamics
  – High precision: AIMD uses quantum mechanical descriptions of electronic structure to determine the forces between atoms, thereby permitting accurate atomistic descriptions of chemical reactions
  – PPL's OpenAtom project features a unique parallel decomposition of the Car-Parrinello method. Using Charm++ virtualization we can efficiently scale small (32-molecule) systems to thousands of processors
Computation Flow
Torus Aware Object Mapping
OpenAtom Performance
Benefits of Topology Mapping
[Figures: OpenAtom performance on Watson Blue Gene/L (CO mode) and PSC BigBen XT3 (SN and VN modes)]
Use Dynamic Load Balancing Based on the Principle of Persistence
Principle of persistence: computational loads and communication patterns tend to persist, even in dynamic computations.
So the recent past is a good predictor of the near future.
Load Balancing Steps
[Timeline: regular timesteps, then instrumented timesteps, then a detailed/aggressive load balancing step, followed later by refinement load balancing steps]
Processor utilization against time, on 128 and 1024 processors
On 128 processors, a single (aggressive) load balancing step suffices, but
on 1024 processors, we need a "refinement" step as well.
ChaNGa: Parallel Gravity
• Collaborative project (NSF ITR)
  – with Prof. Tom Quinn, Univ. of Washington
• Components: gravity, gas dynamics
• Barnes-Hut tree codes
  – An oct-tree is the natural decomposition:
    • the geometry has better aspect ratios, so you "open" fewer nodes
    • but it is often not used because it leads to bad load balance
    • assumption: a one-to-one map between sub-trees and processors
    • binary trees are considered better load balanced
  – With Charm++: use the oct-tree, and let Charm++ map subtrees to processors
Load balancing with GreedyLB
dwarf 5M on 1,024 BlueGene/L processors: step time 5.6 s before, 6.1 s after
[Chart: messages (x1000) and bytes transferred (MB), before and after load balancing]
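A hedged sketch of what a greedy, measurement-based rebalancer in the spirit of GreedyLB does (my simplified illustration, not Charm++ code): assign objects, heaviest first, to the currently least-loaded processor, using the loads measured during the instrumented timesteps.

#include <stdlib.h>

typedef struct { double load; int obj; } Item;

static int by_load_desc(const void *a, const void *b) {
    double x = ((const Item *)a)->load, y = ((const Item *)b)->load;
    return (x < y) - (x > y);      /* sort descending by load */
}

/* obj_load[i]: load of object i measured during instrumented timesteps
   (principle of persistence: it predicts the near future).
   assignment[i]: output, the processor chosen for object i. */
void greedy_assign(const double *obj_load, int nobjs, int nprocs, int *assignment) {
    Item *items = malloc(nobjs * sizeof *items);
    double *proc_load = calloc(nprocs, sizeof *proc_load);

    for (int i = 0; i < nobjs; i++) { items[i].load = obj_load[i]; items[i].obj = i; }
    qsort(items, nobjs, sizeof *items, by_load_desc);

    for (int i = 0; i < nobjs; i++) {
        int best = 0;                               /* find the least-loaded processor */
        for (int p = 1; p < nprocs; p++)
            if (proc_load[p] < proc_load[best]) best = p;
        assignment[items[i].obj] = best;
        proc_load[best] += items[i].load;           /* heaviest objects are placed first */
    }
    free(items);
    free(proc_load);
}

Because each object simply goes to whichever processor is lightest at that moment, communication locality is ignored, which is presumably why the message and byte counts rise after GreedyLB; refinement balancers, shown next, move far fewer objects.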
Load balancing with OrbRefineLB
dwarf 5M on 1,024 BlueGene/L processors: step time 5.6 s before, 5.0 s after
ChaNGa: Parallel Gravity Code
Developed in collaboration with Tom Quinn (Univ. of Washington) using Charm++
ChaNGa preliminary performance
Summary
• Exciting times ahead
• Petascale computers
  – unprecedented opportunities for progress in science and engineering
  – Petascale applications will require a large toolbox, with:
    • Algorithms, adaptive runtime system, performance tools, ...
    • Object-based decomposition
    • Dynamic load balancing
    • Scalable performance analysis
  – Early performance development via BigSim
My research: http://charm.cs.uiuc.edu
Blue Waters: http://www.ncsa.uiuc.edu/BlueWaters/