Detail at scale in performance analysis
Jesus Labarta
Director, Computer Sciences Department
BSC
Outline
• On the title
• Performance analysis
• Scale
• Detail
• Some examples
• Visualizing variability
• Relevant information
• Instrumentation and sampling
Performance analysis tools objective
Who can I blame?
Generate nice color plots
Performance analysis tools objective
Fly with instruments
Understand our systems
How is my application performing?
Can I describe it in a simple way? Quantitatively?
Is there anything I can do to improve its performance?
What? Preferably with minimum effort/cost
Scale and Detail: typical perception
• Scalability: It is all about size
• Space: #cores
• Time
• Detail: Granularity / #metrics
• Routines, loops, lines
• Metrics: time, message sizes, hardware counters,…
• Size x Detail unmanageable. Scalability problem !!!
• drop detail
• Main practices:
• Data handling mechanisms (e.g. parallelize the tool)
• Profiles, aggregates,…
Performance analysis tools objective
Fly with instruments
Understand our systems
How is my application performing?
Can I describe it in a simple way? Quantitatively?
Is there anything I can do to improve its performance?
What? Preferably with minimum effort/cost
Information, not data
This talk
• Scalability is more an issue of dynamic range than absolute size
• Details ARE important
• To understand
• variability in space and time
• Microscopic causes of macroscopic effects
• We need to be able to handle/measure/analyze different levels of detail
• Some example techniques
Scalability
• Scalability is more an issue of dynamic range than absolute size
• It is more a matter of intelligence (data processing) than force (data handling)
• First decide what functionality is useful, then how far it can go in size
• Many performance issues already give signs at small sizes (others suddenly appear at a given size)
CEPBA – tools framework
[Framework diagram: Extrae (with Dyninst, PAPI, Valgrind, MRNet and XML control) produces .prv/.pcf traces; Paraver performs time analysis and filtering driven by .cfg files; a Stats Gen step guided by how2gen.xml derives .viz/.txt/.cube/.xls outputs for data display tools such as PeekPerf; .trf traces plus machine descriptions feed the Dimemas and Venus (IBM-ZRL) simulators and instruction-level simulators.]
Open source (Linux and Windows)
http://www.bsc.es/paraver
The butterfly effect
• Sensitivity to initial conditions
• Huge impacts of small causes
• High non linearities with accumulative effects
“Does the flap of a butterfly’s wings in Brazil set off a tornado in Texas?”
Common in computer systems behavior
Interconnects … a valley of butterflies
[Figures: external and internal contention on 64 nodes (G=8, 4 MB); propagation of internal contention and bubble propagation on 512 nodes (4 MB); the effect depends on the application phase (communication pattern).]
All2all on 32 tasks: a 1 μs delay in arrival becomes a 1.5 ms longer call duration, due to the interaction of protocol and data messages in the adapter.
Examples
• Analyzing variability
• Histograms
• Scatter plots
• Hardware counters: all in one
• Can be done at scale: Selective data emission
• Communication, Load balance, micro load imbalance, OS noise
• Sampling + instrumentation
Visualizing variability
Visualizing variability: Histograms
• Variability is out there, often more than we are aware of (e.g. load balance)
• Histograms of any metric
[Histograms for SPECFEM3D: useful duration, instructions, IPC, and L2 miss ratio. Courtesy of Dimitri Komatitsch.]
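Such histogram views are easy to prototype outside the tool as well. A minimal sketch (our own illustration, not Paraver code), assuming per-burst records of (process rank, useful duration in µs):

```python
# Sketch: a Paraver-style 2-D histogram view of a per-burst metric.
# Rows = process ranks, columns = duration bins; hist[r, b] counts the
# bursts of rank r whose duration falls in bin b.
import numpy as np

def burst_histogram(bursts, n_bins=64):
    """bursts: iterable of (rank, duration_us) pairs, one per burst."""
    ranks = np.array([r for r, _ in bursts])
    durations = np.array([d for _, d in bursts])
    hist, _, dur_edges = np.histogram2d(
        ranks, durations,
        bins=[int(ranks.max()) + 1, n_bins],
        range=[[-0.5, ranks.max() + 0.5], [0.0, durations.max()]])
    return hist, dur_edges

# Tight vertical bands mean balanced, repeatable bursts; horizontal
# spread reveals load imbalance or run-time variability.
```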
Visualizing variability: Histograms
• Six months later ….
[The same histograms, useful duration, instructions, IPC, and L2 miss ratio, six months later.]
Visualizing variability: scatter plots
• Burst = continuous computation region
• e.g. between the exit of one MPI call and the entry to the next, between instrumented routines, …
• Scatter plot on some relevant metrics
• Instructions: idea of computational complexity, computational load imbalance,…
• IPC: Idea of absolute performance and performance imbalance
• Automatically identify clusters (see the sketch below)
WRF, GROMACS, SPECFEM3D
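The clustering step can be approximated with off-the-shelf density-based clustering. A minimal stand-in using scikit-learn's DBSCAN on synthetic burst data (the eps/min_samples values and the two synthetic populations are illustrative assumptions):

```python
# Sketch: density-based clustering of computation bursts in the
# (instructions, IPC) plane; each dense region is one computation
# structure whose instances can then be located back in time/space.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
instructions = np.concatenate([rng.normal(1e9, 5e7, 200),   # heavy bursts
                               rng.normal(4e8, 2e7, 200)])  # light bursts
ipc = np.concatenate([rng.normal(1.8, 0.05, 200),
                      rng.normal(0.9, 0.05, 200)])
X = StandardScaler().fit_transform(np.column_stack([instructions, ipc]))

labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
# labels == -1 marks noise/outlier bursts; every other label is one
# cluster of similar computation bursts.
```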
Visualizing variability: scatter plots
• Time/space distribution
[WRF @ 128 cores]
Detail as completeness of metrics
• Limited set of hardware counters (only a few can be read at once)
• How can we obtain a complete/precise/accurate characterization of hardware counters for the different regions of a program?
• From a single run? One possible approach is sketched below.
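One possible answer (our sketch, with hypothetical helper names; the slides only pose the question): since the PMU exposes only a few counters at once, rotate counter sets across repeated instances of the same clustered region and merge the averages. The PAPI preset names are real; read_counters is a stub.

```python
# Sketch: characterize one region with more hardware counters than the
# PMU exposes at once, by rotating counter sets across its instances.
from itertools import cycle

COUNTER_SETS = [("PAPI_TOT_INS", "PAPI_TOT_CYC"),
                ("PAPI_L2_DCM", "PAPI_L1_DCM"),
                ("PAPI_BR_MSP", "PAPI_BR_INS")]

def read_counters(run_instance, counter_set):
    """Stub: run one instance with the PMU programmed for counter_set
    (in a real tool, via PAPI) and return its readings."""
    run_instance()
    return {name: 0 for name in counter_set}  # placeholder readings

def characterize(instances):
    """instances: callables, each executing the region once."""
    sums, counts = {}, {}
    for run_instance, cset in zip(instances, cycle(COUNTER_SETS)):
        for name, value in read_counters(run_instance, cset).items():
            sums[name] = sums.get(name, 0) + value
            counts[name] = counts.get(name, 0) + 1
    # Averaging across instances is valid only if they behave alike,
    # which is exactly what the burst clustering is meant to guarantee.
    return {name: sums[name] / counts[name] for name in sums}
```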
Emitting “relevant” information
Emitting “relevant” data
• Detail for what is important, software counters(*) for what is not that important
• What is important?
• First order approach: Computation !!!
• MPI: a gas. Fills whatever space you give it. Very often not the major cause of problems
• Major computation bursts (e.g. > X ms)
• Entry and exit timestamps and hardware counters
• Communication phases.
• Software counters:
• # MPI calls, aggregated bytes, %time in MPI, …
(*) Jesús Labarta, Judit Giménez, Eloy Martínez, Pedro González, Harald Servat, Germán Llort, Xavier Aguilar: Scalability of tracing and visualization tools, PARCO 2005
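A minimal sketch of this selective-emission policy (the threshold value and record formats are illustrative assumptions; the actual implementation lives in the tracing package):

```python
# Sketch: full detail only for computation bursts above a threshold;
# everything in between is summarized as software counters.
THRESHOLD_US = 1000  # e.g. only bursts longer than 1 ms get full detail

def on_burst_end(burst, trace, sw):
    """burst: dict with entry/exit timestamps, hw counters, MPI stats."""
    if burst["exit_ts"] - burst["entry_ts"] > THRESHOLD_US:
        # Important: emit timestamps and hardware counters in full...
        trace.append(("burst", burst["entry_ts"], burst["exit_ts"],
                      burst["hw_counters"]))
        # ...and flush the aggregate of the preceding communication phase.
        trace.append(("sw_counters", burst["entry_ts"], dict(sw)))
        sw.clear()
    else:
        # Not that important: accumulate software counters only.
        for key in ("n_mpi_calls", "mpi_bytes", "mpi_time_us"):
            sw[key] = sw.get(key, 0) + burst[key]
```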
GADGET Case A @ BGP, 1024 processes
[Software-counter timelines: useful duration, % MPI time, # collectives, collective bytes, # p2p, p2p bytes, p2p bandwidth, per code region (167 gravtree.c, 188 density.c, 246 hydra.c, 385 pm_periodic.c, 0 transpose_mpi.c); plus measured vs. modeled speedup S(P) up to 10,000 processors.]
GADGET Case A @ BGP, 2048 processes
[Same set of software-counter timelines and measured vs. modeled speedup chart, at 2048 processes.]
GADGET Case A @ BGP, 4096 processes
[Same set of software-counter timelines and speedup chart, at 4096 processes.]
PFLOTRAN @ jugene, 1 iteration
[Timelines of one iteration at 8K, 12K, and 16K cores.]
PFLOTRAN @ jugene – network traffic
[Timelines: bytes on the X, Y, and Z torus dimensions, and bytes on all 3 dimensions (up to 400K).]
Imbalance in link/direction utilization will limit communication performance.
PFLOTRAN @ jugene - Detailed network traffic
• Zoomed region in previous slide
[Zoomed timelines: collective send bytes (110K), bytes out of node (400K), bandwidth < 15 MB/s.]
How much network bandwidth do we need?
Can we improve the way we manage and use networks?
PFLOTRAN @ Jaguar
Color indicates cluster ID; length indicates computation burst length
[Clusters shown include Jacobian and KSPSolve]
K. Huck et al., “Analysis of PFLOTRAN on Jaguar,” CScADS Workshop on Performance Tools for Petascale Computing, August 2-5, 2010
Outliers as small as ~0 seconds!
PFLOTRAN @ Jaguar: OS noise impact
• Default (pin to core) – 488 seconds
• Explicit pin to core (“fastest”) – 463 seconds
• Pin to CPU (NUMA) – 455 seconds
• No pinning (slowest) – 620 seconds
Color indicates cycles per microsecond (timelines not to scale)
PFLOTRAN @ Jaguar: OS noise impact – zoomed view
• Default
• Pin to core (“fastest”)
Color indicates cycles per microsecond (timelines not to scale)
Pre-emptions have a significant effect in the FLOW stage, but not in the TRAN stage.
PFLOTRAN @ Jaguar: “Spare” core results – no improvement
• 682 nodes, 7502 total cores – 538 seconds
• 744 nodes, 8184 total cores – 448 seconds
• 682 nodes, 6820 total cores – 566 seconds
• 819 nodes, 8184 total cores (last 6 unused) – 536 seconds
150 seconds!
(timelines not to scale)
Example
• PEPC, 16384 tasks on Jaguar
[Timelines: duration of the computation bursts; # of MPI collective operations.]
PEPC @ jugene: 8K cores
[Timelines: MPI calls; useful duration.]
Microscopic load imbalance!!!!
Variability in microscopic behavior
• GROMACS: Only computation phases parallelized with SMPSs
SMPSs tasks and MPI calls ( ~ multispectral)
Variability in microscopic behavior
Four loops/routines, in sequential order
Instrumentation + sampling
Instrumentation
• Events correlated to specific program activity
• Start/exit of iterations, functions, loops,…
• Different intervals:
• May be very large, may be very short
• Variable precision
• Captured data: hardware counters, call arguments, call path,…
• Accurate statistics: profiles, …
[Timeline sketch: instrumentation events at iteration starts, at fA/fB entries and exits, and at MPI calls; MFLOPS measured between consecutive events.]
Sampling
• Events uncorrelated to program activity (at least not to specific activity)
• Time (or counter) overflow
• Controlled granularity:
• Sufficiently large to minimize overhead
• Guaranteed acquisition interval/precision
• Statistical projection:
• %time (or metric) = f( %counts )
• Assuming no correlation and a sufficiently large #samples
[Timeline sketch: samples fall at regular intervals across fA/fB and MPI calls; MFLOPS measured between consecutive samples.]
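The projection itself is elementary; a minimal sketch with illustrative names:

```python
# Sketch: the statistical projection behind sampled profiles,
# %time(region) ~ %samples(region), assuming uncorrelated samples
# and a sufficiently large sample count.
from collections import Counter

def project_profile(sampled_regions, total_time_s):
    """sampled_regions: the region observed at each sample."""
    counts = Counter(sampled_regions)
    n = sum(counts.values())
    return {region: total_time_s * c / n for region, c in counts.items()}

# 6 of 10 samples in fA -> fA is charged 60% of the total time.
print(project_profile(["fA"] * 6 + ["fB"] * 3 + ["MPI"] * 1, 100.0))
```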
Instrumentation + sampling
• Both
• Guaranteed interval
• Captured data:
• Hardware counters (since the previous probe)
• Call path
• Call arguments in some probes
[Timeline sketch: instrumentation events (iteration starts, fA/fB, MPI calls) interleaved with samples; MFLOPS between consecutive probes of either kind.]
Instrumentation + Sampling
• High sampling frequency (>> Nyquist)
• Guaranteed detail. Probably useful for many analyses.
• Large data size
[Timeline detail: safe sampled functions, MFLOPS at each interval, instrumented MPI calls.]
Sampling frequency
• Trade-off:
• Too low → no detail
• Too high → too much overhead
• Challenge: can we get
• a lot of detail, very fine-grained information:
• e.g. “instantaneous” performance metric rates
• with very little overhead:
• e.g. sampling only a few times per second?
• Work by Harald Servat
New roles
• Instrumentation role: reference
• Identify the different instances of a region for which to obtain the detailed time evolution of metrics
• Stationary behaviour assumed
• Target region:
• Iteration
• Routine
• Routine excluding MPI calls
• …
[Timeline sketch: instrumented fA/fB and MPI calls delimit the region instances.]
H. Servat et al., “Detailed performance analysis using coarse grain sampling,” PROPER 2009
H. Servat, “Folding: providing detailed performance metrics using coarse grain sampling,” UPC-DAC-RR-2010-37
New roles
• Sampling role: relative data
• Guarantee granularity
• Provide data to increase granularity
[Timeline sketch: samples within each instance of fA, expressed relative to the instance start.]
Folding counters: Projecting
• Cumulative count since reference
• Variance in duration:
– Eliminate outliers
– Scale
[Figure: samples from many instances of fA projected onto one normalized instance.]
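A minimal sketch of the projection step, with an illustrative data layout and outlier threshold: each sample carries the time since its instance's instrumented start and the cumulative counter count since that start; both axes are normalized so all instances overlap.

```python
# Sketch: fold samples from many instances of a region onto one
# normalized instance (time in [0,1], cumulative count in [0,1]).
import numpy as np

def fold(samples, durations, totals, tol=0.2):
    """samples[i]: list of (t_since_start, cum_count) for instance i;
    durations[i], totals[i]: instance duration and final count."""
    med = np.median(durations)
    folded = []
    for inst, (dur, tot) in enumerate(zip(durations, totals)):
        if abs(dur - med) > tol * med:   # eliminate outlier instances
            continue
        for t, c in samples[inst]:       # scale both axes
            folded.append((t / dur, c / tot))
    return sorted(folded)
```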
Folding counters: Fitting
• Eliminate outliers
• Kriging interpolation
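A stand-in for the fitting step, using scikit-learn's Gaussian-process regression (the usual statistical framing of Kriging); the kernel and grid size are illustrative. The derivative of the fitted cumulative curve gives the “instantaneous” counter rate along the region.

```python
# Sketch: fit the folded points and recover an instantaneous rate.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def fit_folded(folded_points):
    """folded_points: (normalized_time, normalized_cum_count) pairs,
    as produced by fold() above, outliers already removed."""
    t = np.array([p[0] for p in folded_points]).reshape(-1, 1)
    c = np.array([p[1] for p in folded_points])
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1),
                                  alpha=1e-3, normalize_y=True).fit(t, c)
    grid = np.linspace(0.0, 1.0, 200)
    fitted = gp.predict(grid.reshape(-1, 1))
    rate = np.gradient(fitted, grid)   # d(count)/d(normalized time)
    return grid, fitted, rate
```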
Impact of the number of folded instances
• The more samples being folded, the more detailed the results
• Longer executions
• Increase frequency
• Reach stability?
• Example:
• NAS BT class B copy_faces
• showing from 10 to 200 iterations
• 20 samples per second @ SGI Altix
Impact of the number of folded instances
• Experiments comparing a few samples per second against a 1000× higher sampling frequency
• It is not necessary to fold a very large number of instances → potential application even in slowly time-varying programs
Emitted data
• Timelines
• Performance counters:
• Re-sample the fitted function and inject synthetic events into the trace
• Call stack
• Truncated by specifying routines of interest
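A sketch of the injection step (building on fit_folded above; the event format is an illustrative assumption):

```python
# Sketch: re-sample the fitted rate curve and inject synthetic counter
# events into the timeline of one concrete instance, so the detailed
# evolution can be browsed as if it had been measured directly.
def synthetic_events(instance_start, instance_dur, grid, rate, stride=10):
    """grid, rate: output of fit_folded(); emit every `stride`-th point."""
    events = []
    for g, r in list(zip(grid, rate))[::stride]:
        ts = instance_start + g * instance_dur   # denormalize time
        events.append((ts, "counter_rate", float(r)))
    return events
```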
Emitted data
• Plots, statistics
• Time, IPC,…
• Could think of emitting an analytical expression
• Scalability impact !!!!
• Even if generating traces
• Example (Gadget2 using 128 tasks)
• 100 iterations, 5 samples/s during 90 minutes ≈ 236 MB
• Folding on 1 iteration @ 200 samples/s ≈ 64 MB
[Folded MIPS and MFLOPS curves for NAS BT, ALYA, and SIESTA.]
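As a rough consistency check on those sizes (our arithmetic, not from the slides): 5 samples/s over 90 minutes is 27,000 samples per task, about 3.5 million samples across 128 tasks, so 236 MB corresponds to roughly 70 bytes per emitted record on average, ignoring other event types.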
PFLOTRAN (data obtained with 5 samples/s)
PEPC (data obtained with 5 samples/s)
Summary
• Performance tools are more and more needed !!!
• To tune our applications, to design our system software
• To understand what really happens, how our systems really behave, …
• Great progress is taking place
• Functionality
• Scalability: dynamic range
• Detail IS important and can be obtained/handled
• A lot of open research
“I have seen things you people wouldn’t believe...”
Roy Batty – Blade Runner

Seeing is believing ... measuring is better
Free adaptation of a Spanish saying