Performance Tools (Paraver/Dimemas)

www.bsc.es

ENES workshop on exascale technologies. Hamburg, March 18th, 2014

Jesús Labarta, Judit Gimenez BSC

2

Our Tools

•  Since 1991
•  Based on traces
•  Open Source
   –  http://www.bsc.es/paraver
•  Core tools:
   –  Paraver (paramedir): offline trace analysis
   –  Dimemas: message passing simulator
   –  Extrae: instrumentation
•  Focus
   –  Detail, flexibility, intelligence

3

[Figure: trace timeline, 0 to 3.5 s]

A “different” view point

•  Look at structure …
   –  Of behavior, not syntax
   –  Differentiated or repetitive patterns in time and space
   –  Focus on computation regions (bursts)

4

LB Ser Trf Eff 0.83 0.97 0.80 0.87 0.90 0.78 0.88 0.82 0.73 0.88 0.72 0.63

A “different” view point

•  … and fundamental metrics

adv2 (gather–fft-scatter)* mono

Useful user function @ NMMB

M. Casas et al., “Automatic analysis of speedup of MPI applications”, ICS 2008.

LB Ser Trf Eff 0.83 0.97 0.80 0.87 0.90 0.78 0.88 0.97 0.84 0.73 0.88 0.96 0.75 0.61
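The four columns above can be related with a short sketch. It assumes the usual multiplicative reading of the model in the cited paper: LB is average over maximum useful computation time, Ser is maximum useful time over the elapsed time on an ideal network, Trf is ideal over real elapsed time, and Eff is their product. The function name and the input values are hypothetical and only illustrate the arithmetic.

```python
def efficiency_factors(useful, t_real, t_ideal):
    """Split parallel efficiency into LB * Ser * Trf.

    useful  : per-process useful computation time (seconds)
    t_real  : elapsed time of the real run
    t_ideal : elapsed time of the same trace on an ideal network
              (assumed to come from a simulator such as Dimemas)
    """
    avg_useful = sum(useful) / len(useful)
    max_useful = max(useful)
    lb = avg_useful / max_useful   # load balance
    ser = max_useful / t_ideal     # serialization (dependences)
    trf = t_ideal / t_real         # data transfer
    return {"LB": lb, "Ser": ser, "Trf": trf, "Eff": lb * ser * trf}


if __name__ == "__main__":
    # Hypothetical 4-process run, not taken from the slides.
    factors = efficiency_factors([8.0, 6.0, 7.0, 7.0], t_real=12.5, t_ideal=10.0)
    for name, value in factors.items():
        print(f"{name}: {value:.2f}")
```

By construction Eff equals the average useful time divided by the real elapsed time, so the three factors only say where the loss comes from.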

5

More on structure and concurrency

Scalability tradeoffs between processes at different phases

?

6

More on structure and concurrency

How to find out:

–  Discussion with developer
–  Automatic? V. Subotic et al., “Automatic exploration of potential parallelism in sequential applications”, ISC 2014.

7

More on structure and concurrency

8

More on structure and concurrency

Huge potential for concurrency and overlap to:

–  tolerate latencies
–  spread load across resources (cores and network) !!

9

More on structure and concurrency

You may even want to constrain potential concurrency !!!

10

More on structure and concurrency and syntax

WIP:

Taskify with OmpSs

OpenMP 4.0 accelerator features in OmpSs

11

Performance analytics

12

Using Clustering to identify structure

[Figure: clustered scatterplot of computation bursts; axes: IPC and completed instructions]

J. Gonzalez et al., “Automatic Detection of Parallel Applications Computation Phases”, IPDPS 2009
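A minimal sketch of the idea: each computation burst becomes a point (IPC, completed instructions), and a density-based clustering groups bursts that behave alike. Using DBSCAN here is an assumption of this sketch, as are the function names, the min-max normalisation, and the `eps`/`min_pts` values; the burst data is made up.

```python
def normalise(bursts):
    """Min-max scale every metric to [0, 1] so that IPC and instruction
    counts weigh equally in the distance."""
    cols = list(zip(*bursts))
    lo = [min(c) for c in cols]
    span = [(max(c) - min(c)) or 1.0 for c in cols]
    return [tuple((v - l) / s for v, l, s in zip(b, lo, span)) for b in bursts]


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN on 2-D points; returns one label per point (-1 = noise)."""
    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # noise, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_pts:   # j is a core point: keep expanding
                queue.extend(reach)
    return labels


if __name__ == "__main__":
    # Hypothetical bursts: (IPC, completed instructions)
    bursts = [(1.00, 2.0e8), (1.02, 2.1e8), (0.98, 1.9e8),
              (0.40, 9.0e8), (0.42, 9.1e8), (0.41, 8.9e8),
              (2.00, 5.0e8)]
    print(dbscan(normalise(bursts), eps=0.1, min_pts=3))
```

The resulting labels can be mapped back onto the timeline, which is what turns a scatterplot into a view of the application's structure.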

13

Projecting hardware counters based on clustering

•  Full per-region HWC characterization from a single run

[Figure panels: miss ratios, instruction mix, stalls]

14

Tracking structural evolution

•  Frame sequence: clustered scatterplot as core count increases

[Figure: OpenMX strong scaling; clustered scatterplot frames at 64, 128, 192, 256, 384 and 512 cores]

G. Llort et al., “On the Usefulness of Object Tracking Techniques in Performance Analysis”, SC 2013

15

Mixing instrumentation and sampling …

•  … to get extreme detail with minimal overhead
•  Different roles
   –  Instrumentation delimits regions
   –  Sampling reports progress within a region

[Figure: Iteration #1, Iteration #2 and Iteration #3 combined into a synthetic iteration]

Harald Servat et al., “Unveiling Internal Evolution of Parallel Application Computation Phases”, ICPP 2011

Harald Servat et al., “Detailed performance analysis using coarse grain sampling”, PROPER@EUROPAR, 2009

16

Folding hardware counters

•  Instructions evolution for routine copy_faces of NAS MPI BT.B
•  Red crosses represent the folded samples and show the completed instructions from the start of the routine
•  Green line is the curve fitting of the folded samples and is used to reintroduce the values into the tracefile
•  Blue line is the derivative of the curve fitting over time (counter rate)
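The folding steps can be sketched as follows. Samples from many instances of the same region are mapped onto a normalised time axis (0 = region start, 1 = region end), a curve is fitted through them, and the slope of that curve gives the counter rate. This sketch swaps in a piecewise-linear fit through per-bin means as a simple stand-in for the curve fitting; all function names and the data are hypothetical.

```python
def fold_samples(instances):
    """instances: list of (start, end, total_instr, samples), where samples is
    a list of (timestamp, instructions completed since region start).
    Returns the folded samples as sorted (x, y) pairs, both in [0, 1]."""
    points = []
    for start, end, total_instr, samples in instances:
        duration = end - start
        for t, instr in samples:
            points.append(((t - start) / duration, instr / total_instr))
    return sorted(points)


def fit_piecewise(points, nbins):
    """Stand-in for the curve fitting: one knot per non-empty bin at the mean
    (x, y) of its samples, anchored at (0, 0) and (1, 1)."""
    knots = [(0.0, 0.0)]
    for b in range(nbins):
        lo, hi = b / nbins, (b + 1) / nbins
        in_bin = [(x, y) for x, y in points if lo <= x < hi]
        if in_bin:
            knots.append((sum(x for x, _ in in_bin) / len(in_bin),
                          sum(y for _, y in in_bin) / len(in_bin)))
    knots.append((1.0, 1.0))
    return knots


def counter_rate(knots, mean_instr, mean_duration):
    """Derivative of the fitted curve, rescaled to instructions per second.
    Returns (segment midpoint in normalised time, rate) pairs."""
    scale = mean_instr / mean_duration
    rates = []
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if x1 > x0:  # skip degenerate segments
            rates.append(((x0 + x1) / 2, scale * (y1 - y0) / (x1 - x0)))
    return rates


if __name__ == "__main__":
    # Two hypothetical instances of one routine, two samples each.
    instances = [
        (0.0, 2.0, 2000.0, [(0.5, 500.0), (1.5, 1500.0)]),
        (10.0, 14.0, 4000.0, [(11.0, 1000.0), (13.0, 3000.0)]),
    ]
    knots = fit_piecewise(fold_samples(instances), nbins=2)
    for x, rate in counter_rate(knots, mean_instr=3000.0, mean_duration=3.0):
        print(f"x={x:.3f}  rate={rate:.1f} instr/s")
```

Because every instance contributes only a few samples, the sampling frequency can stay low while the folded cloud is still dense enough to fit.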

17

Combined clustering + folding

•  Instantaneous values
•  All metrics
•  From a single run
•  “No” overhead

[Figure: CGPOP-1D timeline with MPI calls marked; annotations: 17.20 M instructions at ~1000 MIPS, 24.92 M instructions at ~1100 MIPS, 32.53 M instructions at ~1200 MIPS]

18

CESM v18 – v19 trace

•  User functions not instrumented

ATM: 384, LND: 16, ICE: 32, OCN: 10, CPL: 128

[Figure annotations: 2.54 GB, 160 s, 5 200 ms, 2.55 GB, 4.5 MB, 11.5 MB, 570]

19

CESM CAM v18

Convect_shallow_tend

Microp_driver_tend

aer_rad_props_sw

aer_rads_prop_lw

rrtmg_sg

rad_rrtmg_lw

20

CESM CAM v19

Convect_shallow_tend

Svp_water

M_list_mp_init_

Vertical_diffusion

rrtmg_sw

rad_rrtmg_lw

Microp_driver_tend

aer_rad_props_sw

Aerosol_dryed_intr_

21

Dimemas

22

Dimemas: Coarse grain, Trace driven simulation

•  Simulation: highly non-linear model
   –  Linear components
      •  Point-to-point communication
      •  Sequential processor performance
         –  Global CPU speed
         –  Per block/subroutine
   –  Non-linear components
      •  Synchronization semantics
         –  Blocking receives
         –  Rendezvous
      •  Resource contention
         –  CPU
         –  Communication subsystem
            »  Links (half/full duplex), buses

[Figure: machine model; nodes with several CPUs and local memory, connected through links (L) to buses (B)]
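A toy trace-driven replay shows how the linear and non-linear parts combine. It assumes the common linear point-to-point model (transfer time = latency + bytes / bandwidth), eager sends, blocking receives, and no resource contention. It is not the Dimemas implementation; the record format and function name are hypothetical.

```python
from collections import defaultdict, deque


def replay(traces, latency, bandwidth, cpu_ratio=1.0):
    """Replay per-rank traces and return the finish time of every rank.

    Each trace is a list of records:
      ("compute", seconds) | ("send", dst, nbytes) | ("recv", src)
    """
    clock = [0.0] * len(traces)
    pc = [0] * len(traces)               # next record to replay per rank
    in_flight = defaultdict(deque)       # (src, dst) -> arrival times, FIFO
    progressed = True
    while progressed:
        progressed = False
        for rank, trace in enumerate(traces):
            while pc[rank] < len(trace):
                rec = trace[pc[rank]]
                if rec[0] == "compute":
                    # Linear component: scale the burst by the CPU speed ratio.
                    clock[rank] += rec[1] / cpu_ratio
                elif rec[0] == "send":
                    _, dst, nbytes = rec
                    # Linear component: latency + size / bandwidth.
                    arrival = clock[rank] + latency + nbytes / bandwidth
                    in_flight[(rank, dst)].append(arrival)
                else:
                    # Non-linear component: a blocking receive waits for arrival.
                    queue = in_flight[(rec[1], rank)]
                    if not queue:
                        break            # sender not replayed yet, retry later
                    clock[rank] = max(clock[rank], queue.popleft())
                pc[rank] += 1
                progressed = True
    if any(pc[r] < len(t) for r, t in enumerate(traces)):
        raise RuntimeError("deadlock: a receive has no matching send")
    return clock


if __name__ == "__main__":
    traces = [
        [("compute", 1.0), ("send", 1, 1_000_000), ("recv", 1)],
        [("compute", 2.0), ("recv", 0), ("compute", 0.5), ("send", 0, 1_000_000)],
    ]
    # Hypothetical network: 0.5 s latency, 1 MB/s bandwidth.
    print(replay(traces, latency=0.5, bandwidth=1e6))            # [4.5, 3.0]
    # Zero latency and infinite bandwidth: only dependences and imbalance remain.
    print(replay(traces, latency=0.0, bandwidth=float("inf")))   # [2.5, 2.5]
```

Changing only `latency`, `bandwidth` or `cpu_ratio` and replaying the same trace is what makes this kind of simulator useful for what-if questions.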

23

Ideal machine

•  The impossible machine: BW = ∞, L = 0
•  Actually describes/characterizes intrinsic application behavior
   –  Load balance problems?
   –  Dependence problems?

[Figure: GADGET @ Nehalem cluster, 256 processes; real run vs. ideal network timelines, with phases labelled waitall, sendrec, alltoall, allgather + sendrecv, allreduce]

Impact on practical machines?

24

The potential of hybrid/accelerator parallelization

•  Hybrid parallelization
   –  Speed up SELECTED regions by the CPUratio factor
•  We do need to overcome the hybrid Amdahl’s law
   –  Asynchrony + load balancing mechanisms !!!

[Figure: GADGET, 128 procs; % elapsed time per code region, with 93.67%, 97.49% and 99.11% marked]
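The Amdahl argument can be made concrete with a few lines. The sketch assumes the three percentages in the figure are the fraction of elapsed time covered by the selected regions, which is a reading of the plot rather than something the slide states; the function name and the CPUratio values are hypothetical.

```python
def hybrid_speedup(covered_fraction, cpu_ratio):
    """Amdahl's law: only `covered_fraction` of the elapsed time is sped up,
    by a factor `cpu_ratio`; the rest runs at the original speed."""
    return 1.0 / ((1.0 - covered_fraction) + covered_fraction / cpu_ratio)


if __name__ == "__main__":
    for covered in (0.9367, 0.9749, 0.9911):
        for ratio in (5.0, 20.0, float("inf")):
            print(f"coverage {covered:.2%}, CPUratio {ratio}: "
                  f"speedup {hybrid_speedup(covered, ratio):.1f}x")
```

Even with an infinitely fast accelerator the speedup is capped at 1 / (1 - coverage), which is why asynchrony and load balancing matter for the part that stays on the CPU.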

25

Conclusion

•  BSC tools
   –  Extremely powerful visualization and analysis capabilities
   –  Performance analytics
      •  Performance data is big data
         –  Management
         –  Analytics
   –  Capturing knowledge and methodologies in algorithmic workflows
•  Useful insight for informed decisions on code refactoring

http://www.bsc.es/paraver tools@bsc.es

THANKS

27

Insight

•  Observations / highly probable speculations / good questions
   –  About fundamental behavior
   –  Suggesting possibilities for optimization
•  Identification of specific poorly performing sequential code
•  Bimodal behavior in alternating “iterations”?
•  Bimodal behavior in space:
   –  Day-night imbalance
   –  Moving load imbalance
      •  Separate cause and potential solution
•  Repetitive fine-grain structure within phase
   –  2 / 3 sub-iterations? Parallelizable? Potential source for overlap of communication/computation?

28

A call for Performance analytics

•  Data acquisition
   –  A lot of data is captured
•  Presentation
   –  Profile: a few (or not so few) precomputed first-order statistics
      •  Far too summarized
   –  Trace visualization
      •  No summarization at all

Need for intelligent data processing to derive actual insight

29

CESM CLM v18

30

CESM POP v18

31

NMMB

32

Measuring Parallel efficiency
