Techniques for Developing Efficient Petascale Applications
Laxmikant (Sanjay) Kale
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana Champaign
Outline
• Basic techniques for attaining good performance
• Scalability analysis of algorithms
• Measurements and tools
• Communication optimizations:
  – Communication basics
  – Overlapping communication and computation
  – Alpha-beta optimizations
• Combining and pipelining
  – (Topology awareness)
• Sequential optimizations
• (Load balancing)
Parallel Objects,
Adaptive Runtime System
Libraries and Tools
Examples based on multiple applications:
Molecular Dynamics
Crack Propagation
Space-time meshes
Computational Cosmology
Rocket Simulation
Protein Folding
Dendritic Growth
Quantum Chemistry (QM/MM)
Analyze Performance with both Simple and Sophisticated Tools
Simple techniques
• Timers: wall-clock timer (time.h)
• Counters: use the PAPI library's raw counters
  – Especially useful:
    • number of floating point operations
    • cache misses (L2, L1, ...)
    • memory accesses
• Output method:
  – "printf" (or cout) can be expensive
  – Store timer values into an array or buffer, and print at the end
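A minimal sketch of the buffered-timing idea in C (my illustration; the step count and the loop body are placeholders, not from the slides): timestamps are stored in a preallocated array inside the timed loop and printed only after the loop finishes, so I/O cost does not distort the measurement.

#include <stdio.h>
#include <time.h>

#define NSTEPS 1000                 /* hypothetical number of timesteps to instrument */

static double elapsed[NSTEPS];      /* per-step times; printed only at the end */

static double wall_time(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    for (int step = 0; step < NSTEPS; step++) {
        double t0 = wall_time();
        /* ... computation for this timestep goes here ... */
        elapsed[step] = wall_time() - t0;   /* store, do not print, inside the loop */
    }
    /* Print once, after the timed region */
    for (int step = 0; step < NSTEPS; step++)
        printf("step %d: %.6f s\n", step, elapsed[step]);
    return 0;
}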
Sophisticated tools
• Many tools exist
• Have some learning curve, but can be beneficial
• Example tools:
  – Jumpshot
  – TAU
  – Scalasca
  – Projections
  – Vampir ($$)
• PMPI interface:
  – Allows you to intercept MPI calls
    • So you can write your own tools (see the sketch below)
  – PMPI interface for Projections:
    • git://charm.cs.uiuc.edu/PMPI_Projections.git
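As a hedged illustration of the PMPI mechanism (my sketch, not the Projections wrapper itself): a tool-defined MPI_Send can time the call and forward it to the real implementation through the name-shifted PMPI_Send entry point, and a wrapped MPI_Finalize reports the totals.

#include <mpi.h>
#include <stdio.h>

static double send_time = 0.0;   /* accumulated time spent in MPI_Send */
static long   send_count = 0;

/* Our wrapper overrides MPI_Send; the MPI library still provides PMPI_Send.
   MPI-3 signature shown; drop the const for older MPI versions. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    send_time += MPI_Wtime() - t0;
    send_count++;
    return rc;
}

int MPI_Finalize(void) {
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld sends, %.3f s in MPI_Send\n", rank, send_count, send_time);
    return PMPI_Finalize();
}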
Example: Projections Performance Analysis Tool
• Automatic instrumentation via the runtime
• Graphical visualizations
• More insightful feedback
  – because the runtime understands application events better
Exploit sophisticated performance analysis tools
• We use a tool called "Projections"
• Many other tools exist
• Need for scalable analysis
• A not-so-recent example: trying to identify the next performance obstacle for NAMD
  – Running on 8192 processors, with a 92,000-atom simulation
  – Test scenario: without PME
  – Time is 3 ms per step, but the lower bound is 1.6 ms or so
Performance Tuning with Patience and Perseverance
Performance Tuning with Perseverance
• Recognize the multi-dimensional nature of the performance space
• Don't stop optimizing until you know for sure why it cannot be sped up further
  – Measure, measure, measure ...
Charm++'s "Projections" analysis tool: time intervals on the x axis, activity summed across processors on the y axis
Apo-A1 on BlueGene/L, 1024 processors: 94% efficiency
Shallow valleys, high peaks, nicely overlapped PME
Green: communication; red: integration; blue/purple: electrostatics; turquoise: angle/dihedral; orange: PME
Cray XT3, 512 processors: initial runs, 76% efficiency
Clearly, needed further tuning, especially PME.
But had more potential (much faster processors).
On Cray XT3, 512 processors, after optimizations: 96% efficiency
Communication Issues
Recap: Communication Basics: Point-to-point
A point-to-point message passes through: sending processor, sending co-processor, network, receiving co-processor, receiving processor.
Each component has a per-message cost and a per-byte cost: the cost of an n-byte message at each component is α + nβ.
Important metrics:
  • Overhead at processor and co-processor
  • Network latency
  • Network bandwidth consumed
  • Number of hops traversed
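A hedged worked example of the α + nβ model (the constants here are illustrative, not measured numbers from the talk): with a per-message cost α = 10 µs and a per-byte cost β corresponding to 1 GB/s,

\[
T(n) = \alpha + n\beta, \qquad
T(100\ \text{B}) \approx 10\ \mu\text{s} + 0.1\ \mu\text{s} \approx 10.1\ \mu\text{s}, \qquad
T(1\ \text{MB}) \approx 10\ \mu\text{s} + 1000\ \mu\text{s} \approx 1\ \text{ms}.
\]

For small messages the α term dominates (which motivates the message-combining optimizations later); for large messages the nβ term dominates.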
Communication Basics
• Message latency: time between the application sending the message and receiving it on the other processor
• Send overhead: time for which the sending processor was "occupied" with the message
• Receive overhead: time for which the receiving processor was "occupied" with the message
• Network latency
Communication: Diagnostic Techniques
• A simple technique: find the "grainsize"
  – Count the number of messages per second of computation per processor (max, average)
  – Count the number of bytes
  – Calculate: computation per message (and per byte)
• Use profiling tools:
  – Identify time spent in different communication operations
  – Classified by modules
• Examine idle time using time-line displays
  – On important processors
  – Determine the causes
• Be careful with "synchronization overhead"
  – May be load imbalance masquerading as sync overhead
  – A common mistake
Communication: Problems and Issues
• Too small a grainsize
  – Total computation time / total number of messages
  – Separated by phases, modules, etc.
• Too many, but short, messages
  – α vs. β tradeoff
• Processors wait too long
• Later:
  – Locality of communication
    • Local vs. non-local
    • How far is non-local? (Does that matter?)
  – Synchronization
  – Global (collective) operations
    • All-to-all operations, gather, scatter
• We will focus on communication cost (grainsize)
Communication: Solution Techniques
• Overview:
  – Overlap with computation
    • Manual
    • Automatic and adaptive, using virtualization
  – Increasing grainsize
  – α-reducing optimizations
    • Message combining
    • Communication patterns
  – Controlled pipelining
  – Locality enhancement: decomposition control
    • Local-remote and bandwidth reduction
  – Asynchronous reductions
  – Improved collective-operation implementations
Overlapping Communication and Computation
• Problem:
  – Processors wait for too long at "receive" statements
• Idea:
  – Instead of waiting for data, do useful work
  – Issue: how to create such work?
    • It can't depend on the data about to be received
• Routine communication optimizations in MPI:
  – Move sends up and receives down
    • Keep data dependencies in mind
    • Moving a receive down has a cost: the system needs to buffer the message
  – Use irecvs, but be careful
    • An irecv allows you to post a buffer for a receive without waiting for it (see the sketch below)
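A minimal sketch of the overlap idea with nonblocking MPI, assuming a hypothetical halo exchange with left/right neighbors; the interior- and boundary-work functions are placeholder stubs, not application code from the talk.

#include <mpi.h>

/* Placeholder stubs for real application work */
static void compute_interior(void) { /* work that does not need the incoming halo */ }
static void compute_boundary(double *halo) { (void)halo; /* work that does need it */ }

void exchange_and_overlap(double *send_buf, double *recv_buf, int n,
                          int left, int right, MPI_Comm comm) {
    MPI_Request reqs[2];

    /* Post the receive early, so the incoming message has a buffer waiting */
    MPI_Irecv(recv_buf, n, MPI_DOUBLE, left, 0, comm, &reqs[0]);
    MPI_Isend(send_buf, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    /* Do useful work that does not depend on the data being received */
    compute_interior();

    /* Only now wait for the communication to complete */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* Work that needed the received halo runs after the wait */
    compute_boundary(recv_buf);
}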
Major analytical/theoretical techniques
• Typically involve simple algebraic formulas and ratios
  – Typical variables are:
    • data size (N), number of processors (P), machine constants
  – Model the performance of individual operations, components, and algorithms in terms of the above
• Be careful to characterize variations across processors, and model them with (typically) max operators
  – E.g., max_i {Load_i}
• Remember that constants are important in practical parallel computing
  – Be wary of asymptotic analysis: use it, but carefully
• Scalability analysis:
  – Isoefficiency
Analyze Scalability of the Algorithm (say, via the isoefficiency metric)
Scalability
• The program should scale up to use a large number of processors
  – But what does that mean?
• An individual simulation isn't truly scalable
• Better definition of scalability:
  – If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size
Isoefficiency Analysis
• An algorithm (*) is scalable if:
  – when you double the number of processors available to it, it can retain the same parallel efficiency by increasing the size of the problem by some amount
  – Not all algorithms are scalable in this sense
  – Isoefficiency is the rate at which the problem size must be increased, as a function of the number of processors, to keep the same efficiency
  – Use η(P, N) = η(x·P, y·N) to get this relation
• Parallel efficiency = T1 / (Tp · P)
  – T1: time on one processor
  – Tp: time on P processors
[Figure: equal-efficiency curves in the plane of processors vs. problem size]
Gauss-Jacobi Relaxation
Sequential pseudocode:

while (maxError > Threshold) {
  Re-apply boundary conditions
  maxError = 0;
  for i = 0 to N-1 {
    for j = 0 to N-1 {
      B[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i,j+1] + A[i+1,j] + A[i-1,j]);
      if (|B[i,j] - A[i,j]| > maxError)
        maxError = |B[i,j] - A[i,j]|;
    }
  }
  swap B and A
}

Decomposition by: row, column, or blocks
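A hedged sketch (not from the slides) of what the row decomposition looks like with MPI ghost-row exchange. Boundary columns are skipped for brevity, and both arrays, including boundary values, are assumed to be initialized by the caller; ghost rows at the physical boundaries (MPI_PROC_NULL neighbors) simply keep their initial values.

#include <mpi.h>

/* Each rank owns 'rows' interior rows of an N x N grid, plus 2 ghost rows.
   a and b are (rows+2) x N arrays; row 0 and row rows+1 are ghosts. */
void jacobi_row_decomposed(double *a, double *b, int rows, int N,
                           double threshold, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int up   = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

    double maxerr = threshold + 1.0;
    while (maxerr > threshold) {
        /* Exchange ghost rows with neighbors: 2 messages of N doubles each */
        MPI_Sendrecv(&a[1*N], N, MPI_DOUBLE, up, 0,
                     &a[(rows+1)*N], N, MPI_DOUBLE, down, 0, comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&a[rows*N], N, MPI_DOUBLE, down, 1,
                     &a[0*N], N, MPI_DOUBLE, up, 1, comm, MPI_STATUS_IGNORE);

        double localerr = 0.0;
        for (int i = 1; i <= rows; i++) {
            for (int j = 1; j < N - 1; j++) {      /* column boundaries skipped */
                double v = 0.2 * (a[i*N+j] + a[i*N+j-1] + a[i*N+j+1]
                                  + a[(i+1)*N+j] + a[(i-1)*N+j]);
                double d = v - a[i*N+j];
                if (d < 0) d = -d;
                if (d > localerr) localerr = d;
                b[i*N+j] = v;
            }
        }
        /* Global convergence test: everyone needs the maximum error */
        MPI_Allreduce(&localerr, &maxerr, 1, MPI_DOUBLE, MPI_MAX, comm);

        double *tmp = a; a = b; b = tmp;           /* swap A and B */
    }
}

Note that the per-step communication is two boundary rows of N doubles per processor (about 16N bytes), which is the count used in the isoefficiency analysis below.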
Isoefficiency of Jacobi Relaxation

Row decomposition:
• Computation per PE: A · N · (N/P)
• Communication: 16 · N
• Comm-to-comp ratio: γ = (16 · P) / (A · N)
• Efficiency: 1 / (1 + γ)
• Isoefficiency: to keep γ constant, N must grow linearly with P, so the problem size (N²) must grow as P² (quadratic)

Block decomposition:
• Computation per PE: A · N · (N/P)
• Communication: 32 · N / √P
• Comm-to-comp ratio: γ = (32 · √P) / (A · N)
• Efficiency: 1 / (1 + γ)
• Isoefficiency: N² need only grow linearly with P, i.e., linear in problem size
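Filling in the algebra from the ratios above to the isoefficiency conclusion (my derivation, consistent with the quantities on this slide):

\[
\text{Row: } \gamma = \frac{16N}{A N^2 / P} = \frac{16P}{AN}, \qquad
E = \frac{1}{1+\gamma}\ \text{constant} \iff \frac{P}{N}\ \text{constant}
\iff N \propto P \iff N^2 \propto P^2 .
\]
\[
\text{Block: } \gamma = \frac{32N/\sqrt{P}}{A N^2 / P} = \frac{32\sqrt{P}}{AN}, \qquad
E\ \text{constant} \iff \frac{\sqrt{P}}{N}\ \text{constant} \iff N^2 \propto P .
\]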
NAMD: A Production MD Program
• Fully featured program
• NIH-funded development
• Distributed free of charge (~20,000 registered users)
• Binaries and source code
• Installed at NSF centers
• User training and support
• Large published simulations
Molecular Dynamics in NAMD
• Collection of [charged] atoms, with bonds
  – Newtonian mechanics
  – Thousands of atoms (10,000 – 5,000,000)
• At each time-step:
  – Calculate forces on each atom
    • Bonds
    • Non-bonded: electrostatic and van der Waals
      – Short-range: every timestep
      – Long-range: using PME (3D FFT)
      – Multiple time stepping: PME every 4 timesteps
  – Calculate velocities and advance positions
• Challenge: femtosecond time-step, millions needed!
Collaboration with K. Schulten, R. Skeel, and coworkers
Traditional Approaches: not isoefficient
• Replicated data:
  – All atom coordinates stored on each processor
  – Communication/computation ratio: O(P log P)
• Partition the atoms array across processors:
  – Nearby atoms may not be on the same processor
  – C/C ratio: O(P)
• Distribute the force matrix to processors:
  – Matrix is sparse, non-uniform
  – C/C ratio: O(sqrt(P))
Spatial Decomposition via Charm
• Atoms distributed to cubes ("cells" or "patches") based on their location
• Size of each cube: just a bit larger than the cut-off radius
• Communicate only with neighbors
• Work: for each pair of neighboring objects
• C/C ratio: O(1)
• However:
  – Load imbalance
  – Limited parallelism
Charm++ is useful to handle this
Object-Based Parallelization for MD: Force Decomposition + Spatial Decomposition
• Now we have many objects to load balance:
  – Each diamond (pairwise-force object) can be assigned to any processor
• Number of diamonds (3D): 14 × number of patches
• 2-away variation:
  – Half-size cubes
  – 5×5×5 interactions
• 3-away variation: 7×7×7 interactions
Strong Scaling on JaguarPF
[Figure: strong-scaling results at 6,720; 53,760; 107,520; and 224,076 cores]
Gauss-Seidel Relaxation
Sequential pseudocode (no old/new arrays):

while (maxError > Threshold) {
  Re-apply boundary conditions
  maxError = 0;
  for i = 0 to N-1 {
    for j = 0 to N-1 {
      old = A[i,j];
      A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i,j+1] + A[i+1,j] + A[i-1,j]);
      if (|A[i,j] - old| > maxError)
        maxError = |A[i,j] - old|;
    }
  }
}

Sequentially, how well does this work? It works much better (it converges in fewer iterations)!
How do we parallelize this?
How do we parallelize Gauss-Seidel?
• Visualize the flow of values
• Not the control flow:
  – That goes row by row
• The flow of dependences: which values depend on which values
• Does that give us a clue on how to parallelize?
Parallelizing Gauss-Seidel
• Some ideas:
  – Row decomposition, with pipelining
  – Square over-decomposition
    • Assign many squares to a processor (essentially the same?)
[Figure: row decomposition with pipelining. P processors each own N/P rows; work proceeds in column blocks of width W (N/W blocks per row). Processor i can start a block only after processor i-1 has finished it, so the computation takes N/W + P - 1 phases, and all P processors are busy only in the middle of the run (pipeline fill and drain).]
Red-Black Squares Method
• Color the squares red and black in a checkerboard pattern
• Red squares calculate values based on the black squares
  – Then black squares use values from the red squares
  – Now all red squares can be done in parallel, and then all black squares can be done in parallel
• Each square locally can do a Gauss-Seidel computation (see the sketch below)
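A hedged sequential sketch of the red/black sweep structure (my illustration, not code from the talk): each color can be updated in parallel because its points read only values of the other color, plus their own old value.

/* One red-black iteration over an N x N grid A (boundary rows/columns untouched).
   Points with (i + j) even are "red" (color 0), odd are "black" (color 1).
   All points of one color depend only on the other color, so each color
   sweep can be done in parallel. */
void red_black_sweep(double *A, int N) {
    for (int color = 0; color < 2; color++) {
        for (int i = 1; i < N - 1; i++) {
            int jstart = 1 + ((i + color + 1) % 2);  /* first interior j with (i+j)%2 == color */
            for (int j = jstart; j < N - 1; j += 2) {
                A[i*N + j] = 0.2 * (A[i*N + j] + A[i*N + j - 1] + A[i*N + j + 1]
                                    + A[(i+1)*N + j] + A[(i-1)*N + j]);
            }
        }
    }
}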
Communication: α-reducing optimizations
• When you are sending too many tiny messages:
  – The α cost is high (a microsecond per message, for example)
  – How to reduce it?
• Simple combining:
  – Combine messages going to the same destination
  – Cost: delay (less pipelining)
• More complex scenario:
  – All-to-all: everyone wants to send a short message to everyone else
  – Direct method: α·(P−1) + β·(P−1)·m
  – For small m, the α cost dominates
All-to-all via Mesh
Organize processors in a 2D (virtual) grid.
Phase 1: each processor sends messages within its row.
Phase 2: each processor sends messages within its column.
A message from (x1, y1) to (x2, y2) goes via (x1, y2).
2·(√P − 1) messages per processor instead of P − 1.
For us: 26 messages instead of 192.
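To make the α–β tradeoff explicit (my arithmetic, assuming each mesh-phase message carries the ≈ √P combined blocks destined for one row or column):

\[
T_{\text{direct}} = (P-1)\,\alpha + (P-1)\,m\,\beta,
\qquad
T_{\text{mesh}} \approx 2(\sqrt{P}-1)\,\alpha + 2(\sqrt{P}-1)\sqrt{P}\,m\,\beta
\approx 2(\sqrt{P}-1)\,\alpha + 2P\,m\,\beta .
\]

The per-message (α) term drops from P − 1 to about 2√P while the bytes sent roughly double, which is why the mesh wins for small m.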
[Figure: all-to-all completion time (ms) vs. number of processors (16 to 2048) on Lemieux for 76-byte messages, comparing MPI with mesh, hypercube, and 3D-grid implementations]
[Figure: all-to-all time (ms) vs. message size (76 to 8076 bytes) on 1024 processors of Lemieux, showing total mesh time and the mesh compute time]
Bigger benefit: the CPU is occupied for a much shorter time!
Impact on Application Performance
[Figure: NAMD step time on Lemieux at 256, 512, and 1024 processors, with the transpose step implemented using different all-to-all algorithms (Mesh, Direct, MPI)]
Sequential Performance Issues
Example program
• Imagine a sequential program operating on a large array, A
• For each i: A[i] = A[i] + A[some other index]
• How long should the program take, if each addition takes a nanosecond?
• What performance difference do you expect, depending on how the other index is chosen?

for (i=0, index2=0; i<size; i++) {
  index2 += SOME_NUMBER;   // smaller than size
  if (index2 > size) index2 -= size;
  A[i] += A[index2];
}

for (i=0; i<size-1; i++) {
  A[i] += A[i+1];
}
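A hedged, self-contained harness for timing the two access patterns above; the array size and stride are made-up values chosen so the array far exceeds the caches, and the wrap test uses >= so the index stays in bounds.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE (64 * 1024 * 1024)   /* 64M doubles = 512 MB, far larger than any cache */
#define SOME_NUMBER 10007          /* arbitrary stride, smaller than SIZE */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    double *A = malloc(SIZE * sizeof *A);
    if (!A) return 1;
    for (long i = 0; i < SIZE; i++) A[i] = 1.0;

    /* Strided access: most accesses miss in cache */
    double t0 = now();
    for (long i = 0, index2 = 0; i < SIZE; i++) {
        index2 += SOME_NUMBER;
        if (index2 >= SIZE) index2 -= SIZE;   /* wrap (>= to stay in bounds) */
        A[i] += A[index2];
    }
    double strided = now() - t0;

    /* Sequential access: spatial locality and hardware prefetching help */
    t0 = now();
    for (long i = 0; i < SIZE - 1; i++) A[i] += A[i + 1];
    double sequential = now() - t0;

    printf("strided: %.3f s, sequential: %.3f s, ratio %.1fx\n",
           strided, sequential, strided / sequential);
    free(A);
    return 0;
}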
Caches and Cache Performance
• Remember the von Neumann model
[Diagram: CPU with registers, a cache, and main memory]
Why and how does a cache help?
• Only because of the principle of locality
  – Programs tend to access the same and/or nearby data repeatedly
  – Temporal and spatial locality
• Size of cache
• Multiple levels of cache
• Performance impact of caches
  – Designing programs for good sequential performance
Reality today: multi-level caches
• Remember the von Neumann model
[Diagram: CPU with multiple levels of cache between the CPU and memory]
Example: Intel's Nehalem
• Nehalem architecture, Core i7 chip:
  – 32 KB L1 instruction and 32 KB L1 data cache per core
  – 256 KB L2 cache (combined instruction and data) per core
  – 8 MB L3 (combined instruction and data), "inclusive", shared by all cores
• Still, even an L1 hit takes several cycles
  – (reportedly 4 cycles, increased from 3 before)
  – L2: about 10 cycles
A little bit about microprocessor architecture
• Architecture over the last 2-3 decades was driven by the need to make the clock cycle faster and faster
  – Pipelining developed as an essential technique early on
  – Each instruction's execution is pipelined:
    • fetch, decode, and execute stages, at least
  – In addition, floating point operations, which take longer to compute, have their own separate pipeline
• L1 cache accesses in Nehalem are pipelined
  – so even though it takes 4 cycles to get the result, you can keep issuing a new load every cycle, and you would (almost) not notice a difference if they are all L1 "hits"
Another issue: SIMD vectors
• Hardware is capable of executing multiple floating point operations per cycle
  – Need to enable that, by using SIMD vector instructions
  – Examples: Intel's SSE and IBM's AltiVec
• Compilers try to automate it,
  – but are not always successful
• Learn manual vectorization
  – Or use libraries that help

Example from Wikipedia (SSE assembly):
movaps xmm0, [v1]       ; xmm0 = v1.w | v1.z | v1.y | v1.x
addps  xmm0, [v2]       ; xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
movaps [vec_res], xmm0
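The same 4-wide single-precision add written with SSE compiler intrinsics, which is the more common way to vectorize manually from C; this is a sketch, with alignment and remainder handling ignored and n assumed to be a multiple of 4.

#include <xmmintrin.h>   /* SSE intrinsics */

/* c[i] = a[i] + b[i], four floats per instruction */
void add_vectors_sse(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* add and store 4 results */
    }
}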
Broad Approach to Performance Tuning
• Understand (for a given application):
  – Fraction of peak performance
    • (10% is good for many apps!)
  – Parallel efficiency:
    • Speedup plots
    • Strong scaling: keep problem size fixed
    • Weak scaling: increase problem size with processors
  – These help you decide where to focus
• Sequential optimizations => fraction of peak
  – Use the right compiler flags (basic: -O3)
• Parallel inefficiency:
  – Grainsize, communication costs, idle time, critical paths, load imbalances
  – One way to recognize it: wait time at barriers!
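A hedged example of the fraction-of-peak arithmetic (the machine numbers here are made up for illustration):

\[
\text{peak} = \text{cores} \times \text{clock} \times \frac{\text{flops}}{\text{cycle}}
= 8 \times 2.5\ \text{GHz} \times 4 = 80\ \text{GFLOP/s},
\qquad
\text{fraction of peak} = \frac{6\ \text{GFLOP/s measured}}{80\ \text{GFLOP/s}} = 7.5\% .
\]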
Decouple decomposition from Physical Processors
Migratable Objects (aka Processor Virtualization)
Programmer: [over]decomposition into virtual processors
Runtime: assigns VPs to processors
This enables adaptive runtime strategies
Implementations: Charm++, AMPI
[Diagram: user view of many virtual processors; system implementation maps them onto physical processors]
Benefits
• Software engineering
  – Number of virtual processors can be controlled independently
  – Separate VPs for different modules
• Message-driven execution
  – Adaptive overlap of communication
  – Predictability:
    • Automatic out-of-core
  – Asynchronous reductions
• Dynamic mapping
  – Heterogeneous clusters
    • Vacate, adjust to speed, share
  – Automatic checkpointing
  – Change the set of processors used
  – Automatic dynamic load balancing
  – Communication optimization
[Same slide, with the system view labeled: virtual processors (user-level migratable threads) grouped into MPI processes and mapped onto real processors]
Parallel Decomposition and Processors
• MPI style encourages:
  – Decomposition into P pieces, where P is the number of physical processors available
  – If your natural decomposition is a cube, then the number of processors must be a cube
  – …
• Charm++/AMPI style "virtual processors":
  – Decompose into natural objects of the application
  – Let the runtime map them to processors
  – Decouple decomposition from load balancing
Decomposition independent of numCores
• Rocket simulation example under traditional MPI vs. the Charm++/AMPI framework
  – Benefit: load balance, communication optimizations, modularity
[Diagram: under MPI, one Solid and one Fluid piece per processor (1, 2, ..., P); under Charm++/AMPI, many Solid (1..n) and Fluid (1..m) virtual processors, mapped by the runtime]
OpenAtom: Car-Parrinello ab initio MD
Collaborative IT project with: R. Car, M. Klein, M. Tuckerman, Glenn Martyna, N. Nystrom, ...
Specific software project (leanCP): Glenn Martyna, Mark Tuckerman, L.V. Kale and co-workers (E. Bohm, Yan Shi, Ramkumar Vadali)
Funding: NSF-CHE, NSF-CS, NSF-ITR, IBM
New OpenAtom Collaboration
• Principal Investigators:
  – M. Tuckerman (NYU)
  – L.V. Kale (UIUC)
  – G. Martyna (IBM TJ Watson)
  – K. Schulten (UIUC)
  – J. Dongarra (UTK/ORNL)
• Current effort is focused on:
  – QM/MM via integration with NAMD2
  – ORNL Cray XT4 Jaguar (31,328 cores)
  – ALCF IBM Blue Gene/P (163,840 cores)
Ab initio molecular dynamics (electronic structure simulation) enables the study of many important systems
[Images: molecular clusters, nanowires, semiconductor surfaces, 3D solids/liquids]
Quantum Chemistry
• Car-Parrinello molecular dynamics
  – High precision: AIMD uses quantum mechanical descriptions of electronic structure to determine the forces between atoms, thereby permitting accurate atomistic descriptions of chemical reactions
  – PPL's OpenAtom project features a unique parallel decomposition of the Car-Parrinello method. Using Charm++ virtualization we can efficiently scale small (32-molecule) systems to thousands of processors
Computation Flow
Torus Aware Object Mapping
OpenAtom Performance
Benefits of Topology Mapping
[Figures: OpenAtom performance on Watson Blue Gene/L (CO mode) and PSC BigBen XT3 (SN and VN modes)]
Use Dynamic Load Balancing Based on the Principle of Persistence
Principle of persistence: computational loads and communication patterns tend to persist, even in dynamic computations.
So the recent past is a good predictor of the near future.
Load Balancing Steps
[Timeline: regular timesteps, then instrumented timesteps, then a detailed/aggressive load balancing step, followed later by refinement load balancing steps]
Processor utilization against time, on 128 and 1024 processors
On 128 processors, a single (aggressive) load balancing step suffices, but
on 1024 processors, we need a "refinement" step as well.
ChaNGa: Parallel Gravity
• Collaborative project (NSF ITR)
  – with Prof. Tom Quinn, Univ. of Washington
• Components: gravity, gas dynamics
• Barnes-Hut tree codes
  – An oct-tree is the natural decomposition:
    • the geometry has better aspect ratios, so you "open" fewer nodes
    • but it is often not used because it leads to bad load balance
    • assumption: a one-to-one map between sub-trees and processors
    • binary trees are considered better load balanced
  – With Charm++: use the oct-tree, and let Charm++ map subtrees to processors
Load balancing with GreedyLB
dwarf 5M on 1,024 BlueGene/L processors: step time 5.6 s before, 6.1 s after
[Chart: messages (x1000) and bytes transferred (MB), before and after load balancing]
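A hedged sketch of what a greedy, measurement-based rebalancer in the spirit of GreedyLB does (my simplified illustration, not Charm++ code): assign objects, heaviest first, to the currently least-loaded processor, using the loads measured during the instrumented timesteps.

#include <stdlib.h>

typedef struct { double load; int obj; } Item;

static int by_load_desc(const void *a, const void *b) {
    double x = ((const Item *)a)->load, y = ((const Item *)b)->load;
    return (x < y) - (x > y);      /* sort descending by load */
}

/* obj_load[i]: load of object i measured during instrumented timesteps
   (principle of persistence: it predicts the near future).
   assignment[i]: output, the processor chosen for object i. */
void greedy_assign(const double *obj_load, int nobjs, int nprocs, int *assignment) {
    Item *items = malloc(nobjs * sizeof *items);
    double *proc_load = calloc(nprocs, sizeof *proc_load);

    for (int i = 0; i < nobjs; i++) { items[i].load = obj_load[i]; items[i].obj = i; }
    qsort(items, nobjs, sizeof *items, by_load_desc);

    for (int i = 0; i < nobjs; i++) {
        int best = 0;                               /* find the least-loaded processor */
        for (int p = 1; p < nprocs; p++)
            if (proc_load[p] < proc_load[best]) best = p;
        assignment[items[i].obj] = best;
        proc_load[best] += items[i].load;           /* heaviest objects are placed first */
    }
    free(items);
    free(proc_load);
}

Because each object simply goes to whichever processor is lightest at that moment, communication locality is ignored, which is presumably why the message and byte counts rise after GreedyLB; refinement balancers, shown next, move far fewer objects.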
Load balancing with OrbRefineLB
dwarf 5M on 1,024 BlueGene/L processors: step time 5.6 s before, 5.0 s after
ChaNGa: Parallel Gravity Code
Developed in collaboration with Tom Quinn (Univ. of Washington) using Charm++
ChaNGa preliminary performance
Summary
• Exciting times ahead
• Petascale computers
  – unprecedented opportunities for progress in science and engineering
  – Petascale applications will require a large toolbox, with:
    • Algorithms, adaptive runtime system, performance tools, ...
    • Object-based decomposition
    • Dynamic load balancing
    • Scalable performance analysis
  – Early performance development via BigSim
My research: http://charm.cs.uiuc.edu
Blue Waters: http://www.ncsa.uiuc.edu/BlueWaters/