Performance Tools (Paraver/Dimemas)

www.bsc.es

ENES workshop on exascale technologies. Hamburg, March 18th 2014

Jesús Labarta, Judit Gimenez (BSC)

Our Tools

•   Since 1991
•   Based on traces
•   Open Source
    –  http://www.bsc.es/paraver
•   Core tools:
    –  Paraver (paramedir): offline trace analysis
    –  Dimemas: message passing simulator
    –  Extrae: instrumentation
•   Focus
    –  Detail, flexibility, intelligence

[Figure: timeline, 0 to 3.5 s]

A “different” view point

•   Look at structure …
    –  Of behavior, not syntax
    –  Differentiated or repetitive patterns in time and space
    –  Focus on computation regions (bursts)

LB Ser Trf Eff 0.83 0.97 0.80 0.87 0.90 0.78 0.88 0.82 0.73 0.88 0.72 0.63

A “different” view point

•   … and fundamental metrics

adv2 (gather–fft-scatter)* mono

Useful user function @ NMMB

M. Casas et al., “Automatic analysis of speedup of MPI applications”. ICS 2008.

LB Ser Trf Eff 0.83 0.97 0.80 0.87 0.90 0.78 0.88 0.97 0.84 0.73 0.88 0.96 0.75 0.61
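The LB, Ser, Trf and Eff columns can be read as a multiplicative efficiency model. A minimal sketch follows, assuming the commonly used definitions (load balance, serialization efficiency, transfer efficiency, with Eff as their product). The slide does not spell the formulas out, so the function name, its arguments and the sample timings are illustrative assumptions; `ideal_runtime` stands for the elapsed time of a replay on an ideal network.

```python
from statistics import mean


def efficiency_factors(useful, runtime, ideal_runtime):
    """Split parallel efficiency into multiplicative factors.

    useful        -- per-process useful computation time, in seconds
    runtime       -- measured elapsed time of the run, in seconds
    ideal_runtime -- elapsed time of the same run replayed on an ideal
                     network (assumed to come from a simulator)
    """
    lb = mean(useful) / max(useful)    # load balance
    ser = max(useful) / ideal_runtime  # serialization (dependences)
    trf = ideal_runtime / runtime      # data transfer
    return {"LB": lb, "Ser": ser, "Trf": trf, "Eff": lb * ser * trf}


# Hypothetical timings for a 4-process run (not taken from the slides).
factors = efficiency_factors([8.0, 9.0, 10.0, 9.0],
                             runtime=12.5, ideal_runtime=11.0)
for name, value in factors.items():
    print(f"{name}: {value:.2f}")
```

With these definitions the product collapses to average useful time divided by elapsed time, so the three factors attribute the lost efficiency to imbalance, dependences and data transfer respectively.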

More on structure and concurrency

Scalability tradeoffs between processes at different phases

How to find out:
    –  Discussion with developer
    –  Automatic? V. Subotic et al., “Automatic exploration of potential parallelism in sequential applications”. ISC 2014.

Huge potential for concurrency and overlap to:
    –  tolerate latencies
    –  spread load across resources (cores and network) !!

You may even want to constrain potential concurrency !!!

More on structure and concurrency and syntax

Taskify with OmpSs

OpenMP 4.0 accelerator features in OmpSs

Performance analytics

Using Clustering to identify structure

Completed Instructions

J. Gonzalez et al., “Automatic Detection of Parallel Applications Computation Phases”. IPDPS 2009.
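As an illustration of clustering computation bursts, the sketch below groups bursts by two metrics with a small density-based (DBSCAN-style) algorithm. The choice of algorithm, the metric pair (completed instructions and IPC), the min-max normalization, the parameters and the burst values are all assumptions made for this example. It is not the ClusteringSuite implementation.

```python
import math


def normalize(points):
    """Min-max scale every dimension to [0, 1] so metrics are comparable."""
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    return [tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
            for p in points]


def dbscan(points, eps, min_pts):
    """Return one label per point: cluster id 0, 1, ... or -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster    # border point, not expanded
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: keep growing
    return labels


# Hypothetical bursts: (completed instructions in millions, IPC).
bursts = [(100, 1.00), (102, 1.05), (98, 0.95), (101, 1.00),
          (400, 0.50), (405, 0.52), (395, 0.48), (402, 0.50),
          (250, 2.00)]
labels = dbscan(normalize(bursts), eps=0.1, min_pts=3)
print(labels)
```

Bursts that fall in the same cluster behave alike and can be analysed as one region; the isolated burst is reported as noise.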

•   Full per-region hardware counter (HWC) characterization from a single run

Projecting hardware counters based on clustering

Miss ratios Instruction mix Stalls

•   Frame sequence: clustered scatterplot as core count increases

Tracking structural evolution

[Figure: two rows of scatterplot frames at 64, 128, 192, 256, 384 and 512 cores]

G. Llort et al., “On the Usefulness of Object Tracking Techniques in Performance Analysis”. SC 2013.

OpenMX Strong scaling

•   … to get extreme detail with minimal overhead
•   Different roles
    –  Instrumentation delimits regions
    –  Sampling reports progress within region

Mixing instrumentation and sampling …

Iteration #1 Iteration #2 Iteration #3

Synthetic Iteration

Harald Servat et al. “Unveiling Internal Evolution of Parallel Application Computation Phases” ICPP 2011

Harald Servat et al. “Detailed performance analysis using coarse grain sampling” PROPER@EUROPAR, 2009

  Instructions evolution for routine copy_faces of NAS MPI BT.B

  Red crosses represent the folded samples and show the completed instructions from the start of the routine

  Green line is the curve fitting of the folded samples and is used to reintroduce the values into the tracefile

  Blue line is the derivative of the curve fitting over time (counter rate)
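The folding steps described above can be sketched as follows: sparse samples from several instances of the same region are mapped onto one normalized synthetic instance, a curve is fitted through them, and its derivative gives the counter rate. This is a simplified sketch. Per-bin averaging stands in for the curve fitting used by the actual tool, and all names, sample times and counter values are hypothetical.

```python
def make_iteration(start, duration, rate, offsets):
    """Hypothetical helper: one region instance with a few sparse samples.

    Each sample is (absolute time, counter value since region start).
    """
    samples = [(start + off, rate * off) for off in offsets]
    return (start, start + duration, rate * duration, samples)


def fold(iterations):
    """Map every sample to (fraction of time, fraction of counter)."""
    folded = []
    for start, end, total, samples in iterations:
        for t, count in samples:
            folded.append(((t - start) / (end - start), count / total))
    return sorted(folded)


def fit_bins(folded, nbins):
    """Stand-in for curve fitting: mean point of each equal-width time bin."""
    bins = [[] for _ in range(nbins)]
    for x, y in folded:
        bins[min(int(x * nbins), nbins - 1)].append((x, y))
    return [(sum(x for x, _ in b) / len(b), sum(y for _, y in b) / len(b))
            for b in bins if b]


def counter_rate(curve, duration, total):
    """Finite-difference derivative, scaled back to counts per second."""
    return [(y1 - y0) * total / ((x1 - x0) * duration)
            for (x0, y0), (x1, y1) in zip(curve, curve[1:])]


RATE = 1.0e9        # assumed steady rate: 1000 M instructions per second
DURATION = 1.0e-3   # assumed region duration: 1 ms
iterations = [
    make_iteration(0.000, DURATION, RATE, [0.10e-3, 0.60e-3]),
    make_iteration(0.005, DURATION, RATE, [0.30e-3, 0.80e-3]),
    make_iteration(0.010, DURATION, RATE, [0.45e-3, 0.95e-3]),
]
folded = fold(iterations)
curve = fit_bins(folded, nbins=3)
rates = counter_rate(curve, DURATION, RATE * DURATION)
print([f"{r / 1e6:.0f} MIPS" for r in rates])
```

Each instance contributes only two samples, yet the folded set covers the whole region, which is how coarse sampling can still yield fine-grain detail.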

Folding hardware counters

17.20 M instructions ~ 1000 MIPS

MPI call

Combined clustering + folding

•   Instantaneous values
•   All metrics
•   From a single run
•   “No” overhead

CGPOP -1D

CESM v18 – v19 trace

•   User functions not instrumented

ATM: 384, LND: 16, ICE: 32, OCN: 10, CPL: 128

2.54 GB

5 200 ms

2.55 GB 4.5 MB

11.5 MB

CESM CAM v18

Convect_shallow_tend

Microp_driver_tend

aer_rad_props_sw

aer_rads_prop_lw

rrtmg_sg

rad_rrtmg_lw

CESM CAM v19

Convect_shallow_tend

Svp_water

M_list_mp_init_

Vertical_diffusion

rrtmg_sw

rad_rrtmg_lw

Microp_driver_tend

aer_rad_props_sw

Aerosol_dryed_intr_

Dimemas

Dimemas: Coarse grain, Trace driven simulation

•   Simulation: highly non-linear model
    –  Linear components
        •  Point to point communication
        •  Sequential processor performance
            –  Global CPU speed
            –  Per block/subroutine
    –  Non-linear components
        •  Synchronization semantics
            –  Blocking receives
            –  Rendezvous
        •  Resource contention
            –  CPU
            –  Communication subsystem
                »  links (half/full duplex), busses

[Diagram: nodes, each with a CPU and local memory]

Ideal machine

•   The impossible machine: BW = ∞, L = 0
•   Actually describes/characterizes intrinsic application behavior
    –  Load balance problems?
    –  Dependence problems?
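To make the ideal-machine idea concrete, here is a toy trace-driven replay. It assumes a linear point-to-point cost of L + size / BW, eager sends and blocking receives. The record format, the function name and the two-rank trace are invented for this sketch; rendezvous and resource contention are left out, so it is far simpler than Dimemas itself.

```python
import math
from collections import deque


def replay(trace, latency, bandwidth):
    """Replay a per-rank list of records and return each rank's finish time.

    Records: ("compute", seconds), ("send", dest, nbytes), ("recv", source).
    """
    n = len(trace)
    clock = [0.0] * n
    pc = [0] * n                      # next record to replay on each rank
    in_flight = {}                    # (source, dest) -> arrival times
    progressed = True
    while progressed:
        progressed = False
        for rank in range(n):
            while pc[rank] < len(trace[rank]):
                record = trace[rank][pc[rank]]
                if record[0] == "compute":
                    clock[rank] += record[1]
                elif record[0] == "send":
                    _, dest, nbytes = record
                    arrival = clock[rank] + latency + nbytes / bandwidth
                    in_flight.setdefault((rank, dest), deque()).append(arrival)
                else:                 # blocking receive
                    queue = in_flight.get((record[1], rank))
                    if not queue:
                        break         # matching send not replayed yet
                    clock[rank] = max(clock[rank], queue.popleft())
                pc[rank] += 1
                progressed = True
    if any(pc[r] < len(trace[r]) for r in range(n)):
        raise ValueError("trace cannot complete (unmatched receive)")
    return clock


# Hypothetical two-rank exchange with unbalanced computation.
trace = [
    [("compute", 1.0), ("send", 1, 1_000_000), ("recv", 1)],
    [("compute", 2.0), ("send", 0, 1_000_000), ("recv", 0)],
]
real = replay(trace, latency=1e-3, bandwidth=1e8)
ideal = replay(trace, latency=0.0, bandwidth=math.inf)
print("real :", real)
print("ideal:", ideal)
```

Even with BW = ∞ and L = 0, rank 0 still waits for rank 1 to finish computing. That remaining wait is intrinsic to the application (load balance or dependences) rather than a network effect.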

waitall

sendrec

alltoall

Real run

Ideal network

Allgather + sendrecv

allreduce

GADGET @ Nehalem cluster, 256 processes

Impact on practical machines?

The potential of hybrid/accelerator parallelization

•   Hybrid parallelization
    –  Speed up SELECTED regions by the CPUratio factor
•   We do need to overcome the hybrid Amdahl’s law
    –  asynchrony + load balancing mechanisms !!!

93.67% 97.49% 99.11%

Code region

GADGET, 128 procs
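The "hybrid Amdahl's law" can be illustrated with the standard Amdahl formula. The sketch assumes the percentages above are the fractions of computation time covered by the accelerated regions, which is an interpretation and not something the slide states. The function name and the CPUratio values are likewise hypothetical.

```python
def hybrid_speedup(covered_fraction, cpu_ratio):
    """Amdahl's law: only the covered fraction runs cpu_ratio times faster."""
    return 1.0 / ((1.0 - covered_fraction) + covered_fraction / cpu_ratio)


# Assumed meaning: share of computation time inside the selected regions.
coverages = [0.9367, 0.9749, 0.9911]
cpu_ratios = [5, 20, 100]             # hypothetical acceleration factors

for coverage in coverages:
    gains = ", ".join(f"x{ratio}: {hybrid_speedup(coverage, ratio):.1f}"
                      for ratio in cpu_ratios)
    limit = 1.0 / (1.0 - coverage)    # bound as cpu_ratio grows without limit
    print(f"coverage {coverage:.2%} -> {gains}, upper bound {limit:.1f}")
```

Under this reading, even 99.11% coverage caps the gain at roughly 112x, which is why the remaining serial part has to be hidden through asynchrony and load balancing.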

Conclusion

•   BSC tools
    –  Extremely powerful visualization and analysis capabilities
    –  Performance Analytics
        •  Performance data is big data
            –  Management
            –  Analytics
    –  Capturing knowledge and methodologies in algorithmic workflows
•   Useful insight for informed decisions on code refactoring

http://www.bsc.es/paraver tools@bsc.es

THANKS

Insight

•   Observations / highly probable speculations / good questions
    –  About fundamental behavior
    –  Suggesting possibilities for optimization
•   Identification of specific poorly performing sequential code
•   Bimodal behavior in alternating “iterations”?
•   Bimodal behavior in space:
    –  Day-night imbalance
    –  Moving load imbalance
        •  Separate cause and potential solution
•   Repetitive fine grain structure within phase
    –  2 / 3 sub iterations? Parallelizable? Potential source for overlap of communication/computation?

A call for Performance analytics

•   Data acquisition
    –  A lot of data is captured
•   Presentation
    –  Profile: a few (or not so few) precomputed first order statistics
        •  Far too summarized
    –  Trace visualization
        •  No summarization at all

Need for intelligent data processing to derive actual insight

CESM CLM v18

CESM POP v18

Measuring Parallel efficiency
