Detail at scale in performance analysis
Jesus Labarta
Director, Computer Sciences Department
BSC
Outline
• On the title
• Performance analysis
• Scale
• Detail
• Some examples
• Visualizing variability
• Relevant information
• Instrumentation and sampling
Performance analysis tools objective
Who can I blame?
Generate nice color plots
Performance analysis tools objective
Fly with instruments
Understand our systems
How is my application performing?
Can I describe it in a simple way? Quantitatively?
Is there anything I can do to improve its performance?
What? Preferably with minimum effort/cost
Scale and Detail: typical perception
• Scalability: It is all about size
• Space: #cores
• Time
• Detail: Granularity / #metrics
• Routines, loops, lines
• Metrics: time, message sizes, hardware counters,…
• Size x Detail unmanageable. Scalability problem !!!
• drop detail
• Main practices:
• Data handling mechanisms (e.g. parallelize the tool)
• Profiles, aggregates,…
Performance analysis tools objective
Fly with instruments
Understand our systems
How is my application performing?
Can I describe it in a simple way? Quantitatively?
Is there anything I can do to improve its performance?
What? Preferably with minimum effort/cost
Information, not data
This talk
• Scalability is more an issue of dynamic range than absolute size
• Details ARE important
• To understand
• variability in space and time
• Microscopic causes of macroscopic effects
• We need to be able to handle/measure/analyze different levels of detail
• Some example techniques
Scalability
• Scalability is more an issue of dynamic range than absolute size
• It is more a matter of intelligence (data processing) than force (data handling)
• First decide what functionality is useful, then how far it can go in size
• Many performance issues already give signs at small sizes (others suddenly appear at a given size)
CEPBA – tools framework
[Framework diagram: Extrae (with Dyninst, PAPI, Valgrind, MRNet and XML control) produces .prv/.pcf traces; Paraver performs time analysis and filtering driven by .cfg files; a Stats Gen step guided by how2gen.xml derives .viz/.txt/.cube/.xls outputs for data display tools such as PeekPerf; .trf traces plus machine descriptions feed the Dimemas and Venus (IBM-ZRL) simulators and instruction-level simulators.]
Open source (Linux and Windows)
http://www.bsc.es/paraver
The butterfly effect
• Sensitivity to initial conditions
• Huge impacts of small causes
• High non linearities with accumulative effects
“Does the flap of a butterfly’s wings in Brazil set off a tornado in Texas?”
Common in computer systems behavior
Interconnects … a valley of butterflies
[Figures: external and internal contention on 64 nodes (G=8, 4 MB); propagation of internal contention and bubble propagation on 512 nodes (4 MB); the effect depends on the application phase (communication pattern).]
All2all on 32 tasks: a 1 μs delay in arrival becomes a 1.5 ms longer call duration, due to the interaction of protocol and data messages in the adapter.
Examples
• Analyzing variability
• Histograms
• Scatter plots
• Hardware counters: all in one
• Can be done at scale: Selective data emission
• Communication, Load balance, micro load imbalance, OS noise
• Sampling + instrumentation
Visualizing variability
Visualizing variability: Histograms
• Variability is out there, often more than we are aware of (e.g. load balance)
• Histograms of any metric
[Histograms for SPECFEM3D: useful duration, instructions, IPC, and L2 miss ratio. Courtesy of Dimitri Komatitsch.]
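Such histogram views are easy to prototype outside the tool as well. A minimal sketch (our own illustration, not Paraver code), assuming per-burst records of (process rank, useful duration in µs):

```python
# Sketch: a Paraver-style 2-D histogram view of a per-burst metric.
# Rows = process ranks, columns = duration bins; hist[r, b] counts the
# bursts of rank r whose duration falls in bin b.
import numpy as np

def burst_histogram(bursts, n_bins=64):
    """bursts: iterable of (rank, duration_us) pairs, one per burst."""
    ranks = np.array([r for r, _ in bursts])
    durations = np.array([d for _, d in bursts])
    hist, _, dur_edges = np.histogram2d(
        ranks, durations,
        bins=[int(ranks.max()) + 1, n_bins],
        range=[[-0.5, ranks.max() + 0.5], [0.0, durations.max()]])
    return hist, dur_edges

# Tight vertical bands mean balanced, repeatable bursts; horizontal
# spread reveals load imbalance or run-time variability.
```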
Visualizing variability: Histograms
• Six months later ….
[The same histograms, useful duration, instructions, IPC, and L2 miss ratio, six months later.]
Visualizing variability: scatter plots
• Burst = continuous computation region
• e.g. between the exit of one MPI call and the entry to the next, between instrumented routines, …
• Scatter plot on some relevant metrics
• Instructions: idea of computational complexity, computational load imbalance,…
• IPC: Idea of absolute performance and performance imbalance
• Automatically identify clusters (see the sketch below)
WRF, GROMACS, SPECFEM3D
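The clustering step can be approximated with off-the-shelf density-based clustering. A minimal stand-in using scikit-learn's DBSCAN on synthetic burst data (the eps/min_samples values and the two synthetic populations are illustrative assumptions):

```python
# Sketch: density-based clustering of computation bursts in the
# (instructions, IPC) plane; each dense region is one computation
# structure whose instances can then be located back in time/space.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
instructions = np.concatenate([rng.normal(1e9, 5e7, 200),   # heavy bursts
                               rng.normal(4e8, 2e7, 200)])  # light bursts
ipc = np.concatenate([rng.normal(1.8, 0.05, 200),
                      rng.normal(0.9, 0.05, 200)])
X = StandardScaler().fit_transform(np.column_stack([instructions, ipc]))

labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
# labels == -1 marks noise/outlier bursts; every other label is one
# cluster of similar computation bursts.
```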
Visualizing variability: scatter plots
• Time/space distribution
[WRF @ 128 cores]
Detail as completeness of metrics
• Limited set of hardware counters (only a few can be read at once)
• How can we obtain a complete/precise/accurate characterization of hardware counters for the different regions of a program?
• From a single run? One possible approach is sketched below.
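One possible answer (our sketch, with hypothetical helper names; the slides only pose the question): since the PMU exposes only a few counters at once, rotate counter sets across repeated instances of the same clustered region and merge the averages. The PAPI preset names are real; read_counters is a stub.

```python
# Sketch: characterize one region with more hardware counters than the
# PMU exposes at once, by rotating counter sets across its instances.
from itertools import cycle

COUNTER_SETS = [("PAPI_TOT_INS", "PAPI_TOT_CYC"),
                ("PAPI_L2_DCM", "PAPI_L1_DCM"),
                ("PAPI_BR_MSP", "PAPI_BR_INS")]

def read_counters(run_instance, counter_set):
    """Stub: run one instance with the PMU programmed for counter_set
    (in a real tool, via PAPI) and return its readings."""
    run_instance()
    return {name: 0 for name in counter_set}  # placeholder readings

def characterize(instances):
    """instances: callables, each executing the region once."""
    sums, counts = {}, {}
    for run_instance, cset in zip(instances, cycle(COUNTER_SETS)):
        for name, value in read_counters(run_instance, cset).items():
            sums[name] = sums.get(name, 0) + value
            counts[name] = counts.get(name, 0) + 1
    # Averaging across instances is valid only if they behave alike,
    # which is exactly what the burst clustering is meant to guarantee.
    return {name: sums[name] / counts[name] for name in sums}
```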
Emitting “relevant” information
Emitting “relevant” data
• Detail for what is important, software counters(*) for what is not that important
• What is important?
• First order approach: Computation !!!
• MPI: a gas. Fills whatever space you give it. Very often not the major cause of problems
• Major computation bursts (e.g. > X ms)
• Entry and exit timestamps and hardware counters
• Communication phases.
• Software counters:
• # MPI calls, aggregated bytes, %time in MPI, …
(*) Jesús Labarta, Judit Giménez, Eloy Martínez, Pedro González, Harald Servat, Germán Llort, Xavier Aguilar: Scalability of tracing and visualization tools, PARCO 2005
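A minimal sketch of this selective-emission policy (the threshold value and record formats are illustrative assumptions; the actual implementation lives in the tracing package):

```python
# Sketch: full detail only for computation bursts above a threshold;
# everything in between is summarized as software counters.
THRESHOLD_US = 1000  # e.g. only bursts longer than 1 ms get full detail

def on_burst_end(burst, trace, sw):
    """burst: dict with entry/exit timestamps, hw counters, MPI stats."""
    if burst["exit_ts"] - burst["entry_ts"] > THRESHOLD_US:
        # Important: emit timestamps and hardware counters in full...
        trace.append(("burst", burst["entry_ts"], burst["exit_ts"],
                      burst["hw_counters"]))
        # ...and flush the aggregate of the preceding communication phase.
        trace.append(("sw_counters", burst["entry_ts"], dict(sw)))
        sw.clear()
    else:
        # Not that important: accumulate software counters only.
        for key in ("n_mpi_calls", "mpi_bytes", "mpi_time_us"):
            sw[key] = sw.get(key, 0) + burst[key]
```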
GADGET Case A @ BGP, 1024 processes
[Software-counter timelines: useful duration, % MPI time, # collectives, collective bytes, # p2p, p2p bytes, p2p bandwidth, per code region (167 gravtree.c, 188 density.c, 246 hydra.c, 385 pm_periodic.c, 0 transpose_mpi.c); plus measured vs. modeled speedup S(P) up to 10,000 processors.]
GADGET Case A @ BGP, 2048 processes
[Same set of software-counter timelines and measured vs. modeled speedup chart, at 2048 processes.]
GADGET Case A @ BGP, 4096 processes
[Same set of software-counter timelines and speedup chart, at 4096 processes.]
PFLOTRAN @ jugene, 1 iteration
[Timelines of one iteration at 8K, 12K, and 16K cores.]
PFLOTRAN @ jugene – network traffic
[Timelines: bytes on the X, Y, and Z torus dimensions, and bytes on all 3 dimensions (up to 400K).]
Imbalance in link/direction utilization will limit communication performance.
PFLOTRAN @ jugene - Detailed network traffic
• Zoomed region in previous slide
[Zoomed timelines: collective send bytes (110K), bytes out of node (400K), bandwidth < 15 MB/s.]
How much network bandwidth do we need?
Can we improve the way we manage and use networks?
PFLOTRAN @ Jaguar
Color indicates cluster ID; length indicates computation burst length
[Clusters shown include Jacobian and KSPSolve]
K. Huck et al., “Analysis of PFLOTRAN on Jaguar,” CScADS Workshop on Performance Tools for Petascale Computing, August 2-5, 2010
Outliers as small as ~0 seconds!
PFLOTRAN @ Jaguar: OS noise impact
• Default (pin to core) – 488 seconds
• Explicit pin to core (“fastest”) – 463 seconds
• Pin to CPU (NUMA) – 455 seconds
• No pinning (slowest) – 620 seconds
Color indicates cycles per microsecond (timelines not to scale)
PFLOTRAN @ Jaguar: OS noise impact – zoomed view
• Default
• Pin to core (“fastest”)
Color indicates cycles per microsecond (timelines not to scale)
Pre-emptions have a significant effect in the FLOW stage, but not in the TRAN stage.
PFLOTRAN @ Jaguar: “Spare” core results – no improvement
• 682 nodes, 7502 total cores – 538 seconds
• 744 nodes, 8184 total cores – 448 seconds
• 682 nodes, 6820 total cores – 566 seconds
• 819 nodes, 8184 total cores (last 6 unused) – 536 seconds
150 seconds!
(timelines not to scale)
Example
• PEPC, 16384 tasks on Jaguar
[Timelines: duration of the computation bursts; # of MPI collective operations.]
PEPC @ jugene: 8K cores
[Timelines: MPI calls; useful duration.]
Microscopic load imbalance!!!!
Variability in microscopic behavior
• GROMACS: Only computation phases parallelized with SMPSs
SMPSs tasks and MPI calls ( ~ multispectral)
Variability in microscopic behavior
Four loops/routines, in sequential order
Instrumentation + sampling
Instrumentation
• Events correlated to specific program activity
• Start/exit of iterations, functions, loops,…
• Different intervals:
• May be very large, may be very short
• Variable precision
• Captured data: hardware counters, call arguments, call path,…
• Accurate statistics: profiles, …
[Timeline sketch: instrumentation events at iteration starts, at fA/fB entries and exits, and at MPI calls; MFLOPS measured between consecutive events.]
Sampling
• Events uncorrelated to program activity (at least not to specific activity)
• Time (or counter) overflow
• Controlled granularity:
• Sufficiently large to minimize overhead
• Guaranteed acquisition interval/precision
• Statistical projection:
• %time (or metric) = f( %counts )
• Assuming no correlation and a sufficiently large #samples
[Timeline sketch: samples fall at regular intervals across fA/fB and MPI calls; MFLOPS measured between consecutive samples.]
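The projection itself is elementary; a minimal sketch with illustrative names:

```python
# Sketch: the statistical projection behind sampled profiles,
# %time(region) ~ %samples(region), assuming uncorrelated samples
# and a sufficiently large sample count.
from collections import Counter

def project_profile(sampled_regions, total_time_s):
    """sampled_regions: the region observed at each sample."""
    counts = Counter(sampled_regions)
    n = sum(counts.values())
    return {region: total_time_s * c / n for region, c in counts.items()}

# 6 of 10 samples in fA -> fA is charged 60% of the total time.
print(project_profile(["fA"] * 6 + ["fB"] * 3 + ["MPI"] * 1, 100.0))
```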
Instrumentation + sampling
• Both
• Guaranteed interval
• Captured data:
• Hardware counters (since the previous probe)
• Call path
• Call arguments in some probes
[Timeline sketch: instrumentation events (iteration starts, fA/fB, MPI calls) interleaved with samples; MFLOPS between consecutive probes of either kind.]
Instrumentation + Sampling
• High sampling frequency (>> Nyquist)
• Guaranteed detail. Probably useful for many analyses.
• Large data size
[Timeline detail: safe sampled functions, MFLOPS at each interval, instrumented MPI calls.]
Sampling frequency
• Trade-off:
• Too low → no detail
• Too high → too much overhead
• Challenge: can we get
• a lot of detail, very fine-grained information:
• e.g. “instantaneous” performance metric rates
• with very little overhead:
• e.g. sampling only a few times per second?
• Work by Harald Servat
New roles
• Instrumentation role: reference
• Identify the different instances of a region for which to obtain the detailed time evolution of metrics
• Stationary behaviour assumed
• Target region:
• Iteration
• Routine
• Routine excluding MPI calls
• …
[Timeline sketch: instrumented fA/fB and MPI calls delimit the region instances.]
H. Servat et al., “Detailed performance analysis using coarse grain sampling,” PROPER 2009
H. Servat, “Folding: providing detailed performance metrics using coarse grain sampling,” UPC-DAC-RR-2010-37
New roles
• Sampling role: relative data
• Guarantee granularity
• Provide data to increase granularity
[Timeline sketch: samples within each instance of fA, expressed relative to the instance start.]
Folding counters: Projecting
• Cumulative count since reference
• Variance in duration:
– Eliminate outliers
– Scale
[Figure: samples from many instances of fA projected onto one normalized instance.]
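A minimal sketch of the projection step, with an illustrative data layout and outlier threshold: each sample carries the time since its instance's instrumented start and the cumulative counter count since that start; both axes are normalized so all instances overlap.

```python
# Sketch: fold samples from many instances of a region onto one
# normalized instance (time in [0,1], cumulative count in [0,1]).
import numpy as np

def fold(samples, durations, totals, tol=0.2):
    """samples[i]: list of (t_since_start, cum_count) for instance i;
    durations[i], totals[i]: instance duration and final count."""
    med = np.median(durations)
    folded = []
    for inst, (dur, tot) in enumerate(zip(durations, totals)):
        if abs(dur - med) > tol * med:   # eliminate outlier instances
            continue
        for t, c in samples[inst]:       # scale both axes
            folded.append((t / dur, c / tot))
    return sorted(folded)
```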
Folding counters: Fitting
• Eliminate outliers
• Kriging interpolation
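A stand-in for the fitting step, using scikit-learn's Gaussian-process regression (the usual statistical framing of Kriging); the kernel and grid size are illustrative. The derivative of the fitted cumulative curve gives the “instantaneous” counter rate along the region.

```python
# Sketch: fit the folded points and recover an instantaneous rate.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def fit_folded(folded_points):
    """folded_points: (normalized_time, normalized_cum_count) pairs,
    as produced by fold() above, outliers already removed."""
    t = np.array([p[0] for p in folded_points]).reshape(-1, 1)
    c = np.array([p[1] for p in folded_points])
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1),
                                  alpha=1e-3, normalize_y=True).fit(t, c)
    grid = np.linspace(0.0, 1.0, 200)
    fitted = gp.predict(grid.reshape(-1, 1))
    rate = np.gradient(fitted, grid)   # d(count)/d(normalized time)
    return grid, fitted, rate
```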
Impact of the number of folded instances
• The more samples being folded, the more detailed the results
• Longer executions
• Increase frequency
• Reach stability?
• Example:
• NAS BT class B copy_faces
• showing from 10 to 200 iterations
• 20 samples per second @ SGI Altix
Impact of the number of folded instances
• Experiments comparing a few samples per second against a 1000× higher sampling frequency
• It is not necessary to fold a very large number of instances → potential application even in slowly time-varying programs
Emitted data
• Timelines
• Performance counters:
• Re-sample the fitted function and inject synthetic events into the trace
• Call stack
• Truncated by specifying routines of interest
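A sketch of the injection step (building on fit_folded above; the event format is an illustrative assumption):

```python
# Sketch: re-sample the fitted rate curve and inject synthetic counter
# events into the timeline of one concrete instance, so the detailed
# evolution can be browsed as if it had been measured directly.
def synthetic_events(instance_start, instance_dur, grid, rate, stride=10):
    """grid, rate: output of fit_folded(); emit every `stride`-th point."""
    events = []
    for g, r in list(zip(grid, rate))[::stride]:
        ts = instance_start + g * instance_dur   # denormalize time
        events.append((ts, "counter_rate", float(r)))
    return events
```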
Emitted data
• Plots, statistics
• Time, IPC,…
• Could think of emitting an analytical expression
• Scalability impact !!!!
• Even if generating traces
• Example (Gadget2 using 128 tasks)
• 100 iterations, 5 samples/s during 90 minutes ≈ 236 MB
• Folding on 1 iteration @ 200 samples/s ≈ 64 MB
[Folded MIPS and MFLOPS curves for NAS BT, ALYA, and SIESTA.]
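As a rough consistency check on those sizes (our arithmetic, not from the slides): 5 samples/s over 90 minutes is 27,000 samples per task, about 3.5 million samples across 128 tasks, so 236 MB corresponds to roughly 70 bytes per emitted record on average, ignoring other event types.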
PFLOTRAN (data obtained with 5 samples/s)
PEPC (data obtained with 5 samples/s)
Summary
• Performance tools are more and more needed !!!
• To tune our applications, to design our system software
• To understand what really happens, how our systems really behave, …
• Great progress is taking place
• Functionality
• Scalability: dynamic range
• Detail IS important and can be obtained/handled
• A lot of open research
“I have seen things you people wouldn’t believe...”
Roy Batty – Blade Runner

Seeing is believing ... measuring is better
Free adaptation of a Spanish saying