Detail at scale in performance analysis

Jesus Labarta, Director, Computer Sciences Dept., BSC

Detail@scale, EuroMPI, September 2010

Outline


• On the title

• Performance analysis

• Scale

• Detail

• Some examples

• Visualizing variability

• Relevant information

• Instrumentation and sampling



Performance analysis tools objective

Who can I blame?

Generate nice color plots


Performance analysis tools objective

Fly with instruments

Understand our systems

How is my application performing?

Can I describe it in a simple way? Quantitatively?

Is there anything I can do to improve its performance?

What? Preferably with minimum effort/cost


Scale and Detail: typical perception

• Scalability: It is all about size

• Space: #cores

• Time

• Detail: Granularity / #metrics

• Routine, loop, lines

• Metrics: time, message sizes, hardware counters,…

• Size x Detail unmanageable. Scalability problem !!!

• drop detail

• Main practices:

• Data handling mechanisms (i.e. parallelize the tool)

• Profiles, aggregates,…


This talk

• Scalability is more an issue of dynamic range than absolute size

• Details ARE important

• To understand

• variability in space and time

• Microscopic causes of macroscopic effects

• We need to be able to handle/measure/analyze different levels of detail

• Some example techniques



• Scalability is more an issue of dynamic range than absolute size

• It is more a matter of intelligence (data processing) than force (data handling)

• First what functionality is useful, then how far can I go in size

• Many performance issues do give signs at small sizes (others suddenly appear at a given




CEPBA – tools framework


[Figure: tools framework diagram. Recoverable labels: PeekPerf, data display tools, machine description, time analysis and filters, stats generation, Dyninst, PAPI, instrumentation level, simulators, trace handling and display]

Open Source (Linux and Windows)



The butterfly effect

• Sensitivity to initial conditions

• Huge impacts of small causes

• Strong nonlinearities with cumulative effects

“Does the flap of a butterfly’s wings in Brazil set off a tornado in Texas?”

Common in computer systems behavior


Interconnects … a valley of butterflies

[Figures: traces for 64 nodes (G=8, 4MB) and 512 nodes (4MB); All2all - 32]

Propagation of internal contention

Bubble propagation

Dependence on application phase (communication pattern)

1 μs delay in arrival → 1.5 ms longer call duration

Protocol/data message interaction in the adapter



• Analyzing variability

• Histograms

• Scatter plots

• hardware counts: all in one

• Can be done at scale: Selective data emission

• Communication, Load balance, micro load imbalance, OS noise

• Sampling + instrumentation


Visualizing variability


Visualizing variability: Histograms

• Variability is out there, often more than we are aware of (e.g. load balance)

• Histograms of any metric

Useful Duration



L2 miss ratio

Courtesy Dimitri Komatitsch
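As a rough illustration of the histogram idea on this slide, the sketch below bins one metric across processes. The durations, the bin width, and the function name `histogram` are assumptions made for the example, not data or code from the talk.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count how many values fall into each bin of width bin_width."""
    counts = Counter(int(v // bin_width) for v in values)
    return {b * bin_width: counts[b] for b in sorted(counts)}

# Hypothetical useful durations (ms) of one computation phase on 8 processes.
useful_ms = [10.1, 10.3, 10.2, 14.8, 10.0, 15.1, 10.4, 14.9]
hist = histogram(useful_ms, bin_width=1.0)

# Text rendering: one row per bin, one '#' per process in that bin.
for start, n in hist.items():
    print(f"[{start:5.1f}, {start + 1.0:5.1f}) {'#' * n}")
```

A single average would hide the second group of slower processes; the histogram makes the two modes visible.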



Visualizing variability: Histograms

• Six months later ….

Useful Duration



L2 miss ratio


Visualizing variability: scatter plots

• Burst = continuous computation region

• between exit of an MPI call and entry to the next, instrumented routine, …

• Scatter plot on some relevant metrics

• Instructions: idea of computational complexity, computational load imbalance,…

• IPC: Idea of absolute performance and performance imbalance

• Automatically Identify clusters
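A minimal sketch of the burst scatter plot and cluster identification follows. The slides do not say which clustering algorithm is used, so plain k-means stands in here purely as an assumption; the burst values and the helper names `ipc` and `kmeans` are hypothetical.

```python
import math

def ipc(instructions, cycles):
    """Instructions per cycle for one computation burst."""
    return instructions / cycles

def kmeans(points, centers, iterations=20):
    """Plain k-means on 2-D points, starting from the given centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    labels = [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
              for p in points]
    return labels, centers

# Hypothetical bursts: (instructions, cycles) read from hardware counters.
bursts = [(2.0e9, 2.0e9), (2.1e9, 2.0e9), (1.9e9, 2.1e9),
          (0.5e9, 1.0e9), (0.6e9, 1.1e9), (0.5e9, 0.9e9)]

# Scatter-plot coordinates: instructions (in billions) against IPC.
points = [(ins / 1e9, ipc(ins, cyc)) for ins, cyc in bursts]
labels, centers = kmeans(points, centers=[points[0], points[-1]])
print(labels)
```

Scaling instructions to billions keeps both axes in a comparable range, which matters for any distance-based clustering.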



Visualizing variability: scatter plots

• Time/space Distribution

WRF@128 cores


Detail as completeness of metrics

• Limited set of hardware counters

• How can we have a complete/precise/accurate characterization of hardware counters for the different regions of a program?

• From a single run?


Emitting “relevant” information


Emitting “relevant” data

• Detail for what is important, software counters(*) for what is not that important

• What is important?

• First order approach: Computation !!!

• MPI: a gas. Fills whatever space you give it. Very often not the major cause of problems

• Major computation bursts (i.e. > X ms)

• Entry and exit timestamps and hardware counters

• Communication phases.

• Software counters:

• # MPI calls, aggregated bytes, %time in MPI, …

(*) Jesús Labarta, Judit Giménez, Eloy Martínez, Pedro González, Harald Servat, Germán Llort, Xavier Aguilar: Scalability of tracing and visualization tools, PARCO 2005
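The selective emission described above can be sketched roughly as follows: major computation bursts are kept with timestamps and a hardware counter value, while the communication phase between them is collapsed into software counters. The event format, field names, and threshold are assumptions for illustration only, not the format used by the tools in the talk.

```python
def emit(events, threshold_us):
    """Keep major computation bursts in detail; summarize the communication
    phases between them as software counters."""
    trace = []
    counters = {"mpi_calls": 0, "bytes": 0, "mpi_us": 0}

    def flush():
        # Emit one aggregated record for the phase that just ended.
        if counters["mpi_calls"]:
            trace.append(("counters", dict(counters)))
            counters.update(mpi_calls=0, bytes=0, mpi_us=0)

    for ev in events:
        duration = ev["end"] - ev["start"]
        if ev["kind"] == "mpi":
            counters["mpi_calls"] += 1
            counters["bytes"] += ev["bytes"]
            counters["mpi_us"] += duration
        elif duration > threshold_us:
            flush()
            # Entry/exit timestamps plus a hardware counter value.
            trace.append(("burst", ev["start"], ev["end"], ev["instructions"]))
        # Short computation bursts are not emitted individually.
    flush()
    return trace

# Hypothetical event stream of one process, times in microseconds.
events = [
    {"kind": "compute", "start": 0, "end": 12000, "instructions": 3_000_000},
    {"kind": "mpi", "start": 12000, "end": 12500, "bytes": 1024},
    {"kind": "compute", "start": 12500, "end": 12600, "instructions": 20_000},
    {"kind": "mpi", "start": 12600, "end": 13600, "bytes": 2048},
    {"kind": "compute", "start": 13600, "end": 30000, "instructions": 5_000_000},
]
trace = emit(events, threshold_us=1000)
for record in trace:
    print(record)
```

Five input events become three records here; the saving grows with the number of MPI calls per communication phase.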


GADGET Case A @ BGP, 1024 processes

[Figure: timelines of useful duration, % MPI time, # collectives, collective bytes, # p2p, p2p bytes, p2p BW; speedup plot with labels S(P), Model, Speedup, x axis 0 to 10000]

GADGET Case A @ BGP, 2048 processes

[Figure: timelines of useful duration, % MPI time, # collectives, collective bytes, # p2p, p2p bytes, p2p BW; speedup plot with labels S(P), Model, Speedup, x axis 0 to 10000]

GADGET Case A @ BGP, 4096 processes

[Figure: timelines of useful duration, % MPI time, # collectives, collective bytes, # p2p, p2p bytes, p2p BW; speedup plot with labels S(P), Model, Speedup, x axis 0 to 10000]

PFLOTRAN @ jugene: 1 iteration

8K cores

12K cores

16K cores


PFLOTRAN @ jugene – network traffic

Bytes on X dimension

Bytes on Y dimension

Bytes on Z dimension

Bytes on 3 dimensions 400K

Imbalance on link/direction utilization will limit communication performance


PFLOTRAN @ jugene - Detailed network traffic

• Zoomed region in previous slide

Collective send bytes 110K

Bytes out of node 400K


How much network bandwidth do we need?

Can we improve the way we manage and use networks?


PFLOTRAN @ jaguar

Color indicates cluster ID

Length indicates computation burst length

Jacobian KSPSolve

K. Huck et al. “Analysis of PFLOTRAN on Jaguar” CScADS – Workshop on Performance Tools for Petascale Computing August 2-5, 2010

Outliers as small as ~0 seconds!


PFLOTRAN @ Jaguar: OS noise impact

Default (pin to core) – 488 seconds

Explicit Pin to Core (“fastest”) – 463 seconds

Pin to CPU (NUMA) – 455 seconds

No pinning (slowest) – 620 seconds

Color indicates Cycles per microsecond

(timelines not to scale)
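To put the four run times on this slide side by side, the short sketch below expresses each as a percentage over the shortest run. The run times are copied from the slide; the dictionary layout and variable names are just for illustration.

```python
# Run times in seconds, as listed on the slide.
runs = {
    "Default (pin to core)": 488,
    "Explicit pin to core": 463,
    "Pin to CPU (NUMA)": 455,
    "No pinning": 620,
}

best = min(runs.values())
# Extra time relative to the shortest run, in percent.
slowdown = {name: 100.0 * (secs - best) / best for name, secs in runs.items()}

for name, pct in slowdown.items():
    print(f"{name:24s} {runs[name]:4d} s  +{pct:4.1f}% vs. shortest run")
```

With these numbers the unpinned run takes roughly 36% longer than the shortest one.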


PFLOTRAN @ Jaguar: OS noise impact – zoomed view

Default

Pin to Core (“fastest”)

Color indicates Cycles per microsecond

Pre-emptions have a significant effect in the FLOW stage

...but not in the TRAN stage

(timelines not to scale)


PFLOTRAN @ Jaguar: “Spare” core results – no improvement

682 nodes, 7502 total cores – 538 seconds

744 nodes, 8184 total cores – 448 seconds

682 nodes, 6820 total cores – 566 seconds

819 nodes, 8184 total cores (last 6 unused) – 536 seconds

150 Seconds!

(timelines not to scale)



• PEPC 16384 tasks on Jaguar

Duration of the computation bursts

# of MPI collective operations


PEPC @ jugene: 8K cores

MPI calls

Useful duration

Microscopic load imbalance!!!!


Variability in microscopic behavior

• GROMACS: Only computation phases parallelized with SMPSs

SMPSs tasks and MPI calls ( ~ multispectral)


Variability in microscopic behavior

Four loops/routines

Sequential order


Instrumentation + sampling


• Events correlated to specific program activity

• Start/exit iterations, functions, loops,…

• Different intervals:

• May be very large, may be very short

• Variable precision

• Captured data: hardware counters, call arguments, call path, …

• Accurate statistics: profiles, …


[Figure: timeline with instrumented events; labels: Start Iter, fA, fB, MPI call]

• Events uncorrelated to program activity (at least not specific)

• Time (or counter) overflow

• Controlled granularity:

• Sufficiently large to minimize overhead

• Guaranteed acquisition interval/precision

• Statistical projection

• %time (or metric) = f( %counts )

• Assuming no correlation, sufficiently large #samples
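The statistical projection can be sketched as below, taking the simplest choice for f: the share of time in a region is estimated as its share of samples. The binomial standard error is added to show why a sufficiently large number of samples is needed. The sample counts, run length, and the function name `project` are assumptions for the example.

```python
import math
from collections import Counter

def project(samples, total_time):
    """Estimate time per region from the fraction of samples that hit it.

    Returns {region: (estimated_seconds, standard_error_seconds)}.
    """
    n = len(samples)
    result = {}
    for region, hits in Counter(samples).items():
        p = hits / n
        stderr = math.sqrt(p * (1 - p) / n)  # binomial standard error of p
        result[region] = (p * total_time, stderr * total_time)
    return result

# Hypothetical: 1000 samples collected over a 50 s run.
samples = ["fA"] * 600 + ["fB"] * 300 + ["MPI"] * 100
est = project(samples, total_time=50.0)
for region, (secs, err) in est.items():
    print(f"{region:4s} {secs:5.1f} s  (+/- {err:.2f} s)")
```

The error term shrinks with the square root of the number of samples, and the estimate is only unbiased if sampling is uncorrelated with program activity, as the slide assumes.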


[Figure: timeline with sampled events; labels: fA, fB, MPI call]

Instrumentation + sampling

• Both

• Guaranteed interval

• Captured data:

• Hardware counters (since previous probe)

• Call path

• Call arguments in some probes

[Figure: timeline combining instrumented and sampled events; labels: Start Iter, fA, fB, MPI call]

Instrumentation + Sampling

• High sampling frequency (>> Nyquist)

• Guaranteed detail. Probably useful for many analyses.

• Large data size

[Figure: trace view; labels: safe sampled functions, MFLOPS at each interval, instrumented MPI calls]

Sampling frequency

• Trade-off:

• Too low: no detail

• Too high: too much overhead

• Challenge: Can we get

• A lot of detail, very fine-grain information:

• i.e. “instantaneous” performance metric rates

• With very little overhead:

• i.e. sampling a few times per second

• Work by Harald Servat
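The trade-off can be made concrete with a back-of-the-envelope sketch: overhead grows linearly with the sampling rate, while the time between samples shrinks. The per-sample cost of 10 microseconds is an assumed placeholder, not a measurement from the talk; it would need to be measured on the target system.

```python
def overhead_fraction(samples_per_second, cost_per_sample_s):
    """Fraction of run time spent taking samples."""
    return samples_per_second * cost_per_sample_s

COST = 10e-6  # assumed cost of one sample, in seconds (hypothetical)

for rate in (5, 50, 500, 5000):
    pct = 100 * overhead_fraction(rate, COST)
    print(f"{rate:5d} samples/s -> {pct:.3f}% overhead, "
          f"one sample every {1000 / rate:.1f} ms")
```

Under this assumption, a few samples per second cost almost nothing but leave hundreds of milliseconds between samples, which is the gap the following slides set out to close.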


New roles

• Instrumentation: reference

• Identify different instances of a region for which to obtain detailed time evolution of metrics

• Stationary behaviour assumed

• Target region:

• Iteration

• Routine

• Routine excluding MPI calls

• …

[Figure: timeline; labels: fA, fB, MPI call]

Harald Servat et al. “Detailed performance analysis using coarse grain sampling” PROPER, 2009

H. Servat “Folding: providing detailed performance metrics using coarse grain sampling” UPC-DAC-RR-2010-37

New roles

• Sampling role: relative data

• Guarantee granularity

• Provide data to increase granularity

[Figure: timeline; labels: fA, fB, MPI call]

Folding counters: Projecting

• Cumulative count since reference

• Variance in duration–Eliminate outliers
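The projection step can be sketched as follows, under stated assumptions: each instrumented instance supplies its start time, end time, and the counter value at its start; each sample supplies a time and a cumulative counter value. Instances whose duration deviates too much from the median are discarded. The data layout, the 20% tolerance, and the function name `fold` are hypothetical choices for this example, not the published implementation.

```python
from statistics import median

def fold(instances, samples, tolerance=0.2):
    """Project samples from many instances onto one normalized instance.

    instances: list of (start, end, count_at_start) for the reference region.
    samples:   list of (time, cumulative_count).
    Returns sorted (relative_time, count_since_reference) pairs.
    """
    typical = median(end - start for start, end, _ in instances)
    folded = []
    for start, end, count_at_start in instances:
        duration = end - start
        if abs(duration - typical) > tolerance * typical:
            continue  # outlier instance: its samples would blur the folded curve
        for t, count in samples:
            if start <= t < end:
                folded.append(((t - start) / duration, count - count_at_start))
    return sorted(folded)

# Hypothetical data: four instances of fA, one of them twice as long.
instances = [(0, 100, 0), (100, 200, 1000), (200, 400, 2000), (400, 500, 3000)]
# One coarse sample per instance, each landing at a different phase.
samples = [(30, 300), (160, 1600), (250, 2250), (480, 3800)]

folded = fold(instances, samples)
print(folded)
```

Each instance contributes only one point, but because the samples land at different relative positions, the folded set covers the instance more densely than any single instance does.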


[Figure: projection of samples; labels: fA]

Folding counters: Fitting

• Eliminate outliers

• Kriging interpolation
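A rough sketch of the fitting step follows. The slide names Kriging interpolation; to stay short and dependency-free, this example substitutes plain piecewise-linear interpolation of the folded cumulative counts, which is a deliberate simplification and not the method from the talk. The differences of the fitted curve then give a rate per slice of the instance. All names and data are hypothetical.

```python
def rate_curve(folded, total_count, duration_s, bins=4):
    """Average counter rate in each of `bins` equal slices of the instance.

    folded:      (relative_time, count_since_reference) pairs, 0 < x < 1.
    total_count: counter increment over the whole instance (from the probes).
    """
    # Anchor the cumulative curve at both ends of the instance.
    points = [(0.0, 0.0)] + sorted(folded) + [(1.0, total_count)]

    def cumulative(x):
        # Linear interpolation between neighbouring points.
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if x0 <= x <= x1 and x1 > x0:
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        return points[-1][1]

    rates = []
    for i in range(bins):
        a, b = i / bins, (i + 1) / bins
        rates.append((cumulative(b) - cumulative(a)) / ((b - a) * duration_s))
    return rates

# Hypothetical folded samples for a region lasting 0.5 s.
folded = [(0.25, 100.0), (0.5, 400.0), (0.75, 900.0)]
rates = rate_curve(folded, total_count=1000.0, duration_s=0.5)
print(rates)
```

The resulting rates vary by a factor of five within the region, detail that a single per-instance average would not show.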


Impact of the number of folded instances

• The more samples are folded, the more detailed the results

• Longer executions

• Increase frequency

• Reach stability?

• Example:

• NAS BT class B copy_faces

• showing from 10 to 200 iterations

• 20 samples per second @ SGI Altix


Impact of the number of folded instances

• Experiments comparing few samples per second to 1000 times higher sampling frequency.

• It is not necessary to fold a very large number of instances: potential application even in slowly time-varying programs.


Emitted data

• Timelines

• Performance counters:

• Resample the fitted function and inject synthetic events into the trace

• Call stack

• Truncated by specifying routines of interest


Emitted data

• Plots, statistics

• Time, IPC,…

• Could think of emitting an analytical expression

• Scalability impact !!!!

• Even if generating traces

• Example (Gadget2 using 128 tasks)

• 100 iterations, 5 samples/s during 90 minutes: ~236 MB

• Folding on 1 iteration @ 200 samples/s: ~64 MB






PFLOTRAN (data obtained with 5 samples/s)


PEPC (data obtained with 5 samples/s)



• Performance tools are more and more needed !!!!!

• To tune our applications, to design our system software.

• To understand what really happens, how our systems really behave, …

• Great progress is taking place

• Functionality

• Scalability: Dynamic range

• Detail IS important and can be obtained/handled

• A lot of open research

I have seen things you people wouldn't believe...

Roy Batty – Blade Runner

Seeing is believing ... measuring is better

Free adaptation of a Spanish saying