Page 1

Performance Analysis with CEPBA-Tools

Judit Gimenez – Performance Tools – judit@bsc.es

Page 2

CEPBA-Tools environment
• Paraver
• Dimemas

Research areas
• Time analysis
• Clustering
• On-line analysis
• Sampling

Models

Tools integration

How to? GROMACS analysis

Page 3

Traces!

[Diagram: application instrumentation produces Paraver trace files (.prv, .pcf, .row) and Dimemas trace files (.trf)]

Since 1991

Page 4

• Variance is important
  • Along time, across processors
• Information is in the details
• Highly non linear systems
  • Microscopic effects may have large macroscopic impact

Page 5

• PPC / x86 clusters
• Power 4/5/6 AIX
• Altix (Supported by NASA AMES)
• BGL / BGP
• Cell BE
• PERUSE, by Rainer Keller (H…) & … (UTK)
• MPICH collective internals
• Dimemas
• Some translators: TAU, KOJAK, OTF, HPC Toolkit....
• …y easy to add new programming models

Page 6

Raw data

tunable

Seeing is believing

Measuring is better

Timelines
• Identifier of function
• Hardware counts
• Miss ratios
• Performance (IPC, Mflops, …)
• Routine duration
• …

Statistics, Profiles
• Average miss ratio per routine
• Histogram of routine duration
• Number of messages
• …

Page 7

• From raw events
  • Piece-wise constant functions of time
  • Plots / colors, color encoding
• Basic metrics
  • Instructions
  • MPI calls
  • Useful duration
• Derived metrics
  • $\mathrm{IPC} = \frac{\mathrm{instr}}{\mathrm{cycles}}$ (useful)
  • $\text{MPI call cost} = \frac{\text{MPI call duration}}{\mathrm{bytes}}$
  • $\text{preempted time} = \mathrm{elapsed} - \frac{\mathrm{cycles}}{\mathrm{clock}}$
• Models
  • $\mathrm{L2} = \mathrm{cycles} - \frac{\mathrm{instr}}{\mathrm{idealIPC}}$
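As an illustration of how the derived metrics on this slide combine basic ones, the sketch below computes them for a single trace record. The `Burst` record, its field names and the `clock_hz` parameter are assumptions made for this example; they are not Paraver names, and Paraver itself builds such metrics from timelines rather than from Python objects.

```python
from dataclasses import dataclass


@dataclass
class Burst:
    """Hypothetical record for one computation burst or MPI call."""
    elapsed_s: float     # wall-clock duration in seconds
    cycles: int          # hardware counter: cycles
    instructions: int    # hardware counter: completed instructions
    bytes_sent: int = 0  # only meaningful for MPI calls


def ipc(b: Burst) -> float:
    """Instructions per cycle."""
    return b.instructions / b.cycles


def mpi_call_cost(b: Burst) -> float:
    """MPI call duration divided by bytes (seconds per byte)."""
    return b.elapsed_s / b.bytes_sent


def preempted_time(b: Burst, clock_hz: float) -> float:
    """Elapsed time not explained by the counted cycles."""
    return b.elapsed_s - b.cycles / clock_hz


def l2_model(b: Burst, ideal_ipc: float) -> float:
    """Cycles in excess of what the instructions would need at ideal IPC."""
    return b.cycles - b.instructions / ideal_ipc


if __name__ == "__main__":
    burst = Burst(elapsed_s=1.0, cycles=2_000_000_000, instructions=3_000_000_000)
    print(ipc(burst), preempted_time(burst, 2.5e9), l2_model(burst, 2.0))
```

Each function is a plain ratio or difference, which is why they compose well as piece-wise constant functions of time.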

Page 8

• 2D Tables
  • Accumulate over the different values of a control window
  • Profile: Categorical timeline (user function, mpi call...)
  • Histogram: Numeric timeline (duration, instructions, IPC...)
• Different data window – Correlations
  • Average IPC @ each user function
  • Number of instructions for each range of durations
• Communication Patterns
• 3D extensions – 2 control windows

[Figures: Duration - IPC; MPI calls profile; While computing; Communication pattern]
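The two kinds of 2D table can be sketched as simple accumulations. This is a minimal illustration under assumed inputs (plain tuples instead of Paraver windows); the function names are hypothetical and the duration weighting in the profile is a choice made for the example.

```python
from collections import defaultdict


def profile_average(records):
    """Profile: average of a data value per category of the control window.

    records: iterable of (category, duration, value), e.g.
    (user function, burst duration, IPC). Averages are duration-weighted.
    """
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for category, duration, value in records:
        weighted_sum[category] += value * duration
        total_weight[category] += duration
    return {c: weighted_sum[c] / total_weight[c] for c in weighted_sum}


def histogram_sum(records, bin_width):
    """Histogram: accumulate a data value per range of a numeric control value.

    records: iterable of (control_value, data_value), e.g.
    (burst duration, instructions). Returns {bin index: accumulated value}.
    """
    table = defaultdict(float)
    for control_value, data_value in records:
        table[int(control_value // bin_width)] += data_value
    return dict(table)


if __name__ == "__main__":
    print(profile_average([("f", 1.0, 1.0), ("f", 3.0, 2.0), ("g", 2.0, 0.5)]))
    print(histogram_sum([(0.5, 10), (1.5, 20), (1.7, 5)], bin_width=1.0))
```

Using a different data window from the control window is what turns the table into a correlation view, such as average IPC at each user function.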

Page 9

6 tasks trace

Page 10

• Simulation: Highly non linear model
• Linear components
  • Point to point communication
  • Sequential processor performance
    • Global CPU speed
    • Per block/subroutine
• Non linear components
  • Synchronization semantics
    • Blocking receives
    • Rendezvous
  • Resource contention
    • CPU
    • Communication subsystem
      • links (half/full duplex), busses
• GRID extensions
  • Dedicated connections
  • External network (traffic)

[Diagram: nodes with several CPUs and a local memory, attached through links (L) to a bus (B)]

Page 11

• Point to point model
  • Simulated contention for machine resources (links & buses)
  • $\mathrm{Time} = \mathrm{Latency} + \frac{\mathrm{Size}}{\mathrm{BW}}$

[Diagram: timeline from MPI_send to MPI_recv. Logical transfer; latency (uses CPU, independent of size); physical transfer (Size/BW); process blocked]

• Collectives model
  • Barrier / Fan-in / Fan-out
  • Cost: Generic / Per call
  • $\mathrm{Time} = \left(\mathrm{Latency} + \frac{\mathrm{Size}}{\mathrm{Bandwidth}}\right) \times \mathrm{MODEL}$

[Diagram: Machine 1 and Machine 2; fan in internal, fan in external, fan out external, fan out internal; processor time, block time, comm. time]
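The linear part of the communication model can be tried out with a few lines of arithmetic. This is a sketch only: it ignores contention and synchronization, which are the non linear parts of the simulation, and the function names, units and the numeric `model_factor` argument are assumptions made for the example.

```python
def point_to_point_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Latency plus transfer time for one message, without contention."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def collective_time(size_bytes, latency_s, bandwidth_bytes_per_s, model_factor):
    """(Latency + Size/Bandwidth) scaled by a model factor.

    model_factor is a plain number here (assumption); how it depends on
    the number of processes is left to the caller.
    """
    return (latency_s + size_bytes / bandwidth_bytes_per_s) * model_factor


if __name__ == "__main__":
    # 1 MB message with the latency/bandwidth pairs used later in the deck.
    for latency, bandwidth in [(25e-6, 100e6), (25e-6, 10e6), (1000e-6, 100e6)]:
        print(latency, bandwidth, point_to_point_time(1_000_000, latency, bandwidth))
```

For a 1 MB message the bandwidth term dominates: cutting bandwidth from 100 MB/s to 10 MB/s costs far more than raising latency from 25 us to 1000 us.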

Page 12

• Simulations with 4 processes per node
• NMM Iberia 4 Km
  • Not sensitive to latency
  • 512 sensitive to contention?
  • 256 MB/s OK
• ARW Iberia 4 Km
  • Not sensitive to latency
  • Sensitive to contention
  • Need 1 GB/s

[Plot: Contention Impact (L=8; BW=256). Y axis: speedup vs. full connectivity. X axis: 4 to 36]

[Plot: Impact of latency (BW=256; B=0). Y axis: speedup vs. nominal latency. X axis: 0 to 32]

[Plot: Impact of BW (L=8; B=0). Y axis: efficiency. X axis: 1 to 1024. Series: NMM 512, ARW 512, NMM 256, ARW 256, NMM 128, ARW 128]

Page 13

• MPIRE, 32 tasks, no network contention

[Figures: L=25us, BW=100MB/s; L=25us, BW=10MB/s; L=1000us, BW=100MB/s]

All windows same scale

Page 14

CEPBA-Tools environment
• Paraver
• Dimemas

Research areas
• Time analysis
• Clustering
• On-line analysis
• Sampling

Models

Tools integration

How to? GROMACS analysis

Page 15

• …l for data reduction
  • Periodic structure: Reference focus for detailed analysis
  • Automatically generate a representative subtrace of few iterations
• Data handling/summarization capability
  • Software counters, filtering and cutting
• Signals:
  • To “clean” data: Flushing processes, % preempted time, #msgs/BW (clogged system)
  • To identify structure: #Tasks computing, Sum burst duration, #processes in MPI, average IPC, …
• Automatizable through signal processing techniques:
  • Mathematical morphology to clean up perturbed regions
  • Wavelet transform to identify coarse regions
  • Spectral analysis for detailed periodic pattern

[Figure: WRF-NMM, Peninsula 4km, MPI, HWC, 128 procs. Traces: 570 s, 2.2 GB; 570 s, 5 MB; 4.6 s, 36.5 MB]
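The spectral analysis step can be illustrated on a synthetic signal. The sketch below takes a signal sampled at regular intervals (for instance the number of tasks computing) and returns the period of its strongest frequency component, using a naive discrete Fourier transform from the standard library. The function name and the synthetic signal are assumptions for the example; this is not the deck's implementation, which also relies on morphology and wavelets.

```python
import cmath
import math


def dominant_period(signal, dt):
    """Period (in the units of dt) of the strongest non-constant component.

    Naive O(n^2) DFT on the mean-removed signal; adequate for short signals.
    Returns None if the signal is constant.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    best_k, best_power = None, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    if best_k is None:
        return None
    return n * dt / best_k


if __name__ == "__main__":
    # Synthetic "tasks computing" signal: 6 samples busy, 2 samples in MPI.
    signal = ([8] * 6 + [0] * 2) * 16
    print(dominant_period(signal, dt=0.5))
```

Once the period is known, a subtrace covering a few periods can serve as the representative region for detailed analysis.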

Page 16

• Useful for identifying and highlighting structure
  • Cluster information injected in trace
  • Phases within routines
  • Different routines may have similar behavior
• Compact trace encoding
• Input to time analysis

[Figures: CPMD; NAS BT; WRF. Plot title: DBSCAN (Eps=0.01, MinPoints=20) clustering of trace WRF-128-PI.chop2.trf. Axis: Instructions Completed]
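For readers unfamiliar with DBSCAN, the following is a compact pure-Python version of the textbook algorithm applied to points in a normalized metric space. It is only a sketch: the choice of two metrics per burst (for example instructions completed and IPC) is an assumption, since the second plot axis is not readable in the source, and the deck's clustering tool is not reproduced here.

```python
import math

NOISE = -1


def dbscan(points, eps, min_points):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    A point is a core point if at least min_points points (itself included)
    lie within distance eps. Clusters grow from core points.
    """
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_points:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_points:
                queue.extend(reachable)
        cluster += 1
    return labels


if __name__ == "__main__":
    bursts = [(0.10, 0.10), (0.11, 0.10), (0.10, 0.11), (0.11, 0.11),
              (0.80, 0.50), (0.81, 0.50), (0.80, 0.51), (0.81, 0.51),
              (0.50, 0.90)]
    print(dbscan(bursts, eps=0.02, min_points=3))
```

Because the labels depend only on the metrics, bursts from different routines can land in the same cluster, which matches the remark that different routines may have similar behavior.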

Page 17

Focusing analysis

• Projection of HWC within a cluster
• Apply analysis to cluster level
  • Statistics
  • CPI stack model
  • My favorite metrics

Page 18

• Based on MRNet
• 1st experiment: Collective duration threshold
  • 245MB, >15500 col…
• Current development
  • Periodic frequency analysis
  • Periodic …

[Figure labels: Collective internals; <1MB; 25MB]

Page 19

[Figures: Analysis at minute 10; Analysis at minute 20; Analysis at minute 30. Clusters distribution. CPI STACK model (generic…)]

Page 20

• Useful
  • To control granularity (not driven by MPI calls, selected user functions)
  • To identify timeline profile, application structure
• Adding sampling information on the tracefile
  • based on the overflow mechanism offered by PAPI
  • period = f (cycles / instructions / cache misses...)
  • sampled information:
    • call stack – as reference to source
    • other hardware counters (not sampled)
  • interval between samples:
    • High frequency sampling (> Nyquist)
    • Low frequency sampling

Page 21

How to increase precision?
• Folding based on known periodic structure of application (tagged iterations)
• Applies to stationary applications
• Result: trace for one iteration with synthetic paraver events
• Refer counts/timestamps to start of iteration
• Call stack
  • Search for consecutive sequences of folded samples within same function and generate synthetic events
• Hardware counters
  • Noise reduction
  • Fit folded samples
  • Sample fitting curve to generate synthetic events.
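The core of folding, referring each sample to the start of its iteration, can be sketched as follows. The input layout (tuples of timestamp and accumulated counter value, plus the time and counter value at each tagged iteration start) and the function name are assumptions for this example; noise reduction and curve fitting are left out.

```python
from bisect import bisect_right


def fold(samples, iterations):
    """Fold sparse samples from many iterations into one synthetic iteration.

    samples:    list of (timestamp, accumulated counter value)
    iterations: sorted list of (start timestamp, counter value at start)
    Returns (time since iteration start, count since iteration start),
    sorted by relative time. Samples before the first start are dropped.
    """
    starts = [t for t, _ in iterations]
    folded = []
    for timestamp, count in samples:
        k = bisect_right(starts, timestamp) - 1
        if k < 0:
            continue
        start_time, start_count = iterations[k]
        folded.append((timestamp - start_time, count - start_count))
    folded.sort()
    return folded


if __name__ == "__main__":
    iterations = [(0.0, 0), (10.0, 1000), (20.0, 2000)]
    samples = [(3.0, 300), (14.0, 1400), (21.0, 2100)]
    print(fold(samples, iterations))
```

Three samples taken in three different iterations become three points along a single iteration, which is how low frequency sampling can still give fine detail for stationary applications.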

Page 22

• Useful to identify
  • Internal behavior
  • Density of L1 misses within MPI in an SMP: when data actually arrives.

[Figure label: Data arrives]

Page 23

CEPBA-Tools environment
• Paraver
• Dimemas

Research areas
• Time analysis
• Clustering
• On-line analysis
• Sampling

Models

Tools integration

How to? GROMACS analysis

Page 24

$\mathrm{eff}_i = \frac{T_i}{T}$

$\mathrm{LB} = \frac{\sum_{i=1}^{P} \mathrm{eff}_i}{P \cdot \max(\mathrm{eff}_i)}$

$\mathrm{CommEff} = \max(\mathrm{eff}_i)$

$\mathrm{Sup} = \frac{P}{P_0} \cdot \frac{\mathrm{LB}}{\mathrm{LB}_0} \cdot \frac{\mathrm{CommEff}}{\mathrm{CommEff}_0} \cdot \frac{\mathrm{IPC}}{\mathrm{IPC}_0} \cdot \frac{\mathrm{instr}_0}{\mathrm{instr}}$

Directly from real execution metrics

$\mathrm{microLB} = \frac{\max(T_i)}{T_{\mathrm{ideal}}}$ (Migrating/local load imbalance, Serialization)

$\mathrm{CommEff} = \frac{T_{\mathrm{ideal}}}{T}$

$\mathrm{Sup} = \frac{P}{P_0} \cdot \frac{\mathrm{macroLB}}{\mathrm{macroLB}_0} \cdot \frac{\mathrm{microLB}}{\mathrm{microLB}_0} \cdot \frac{\mathrm{CommEff}}{\mathrm{CommEff}_0} \cdot \frac{\mathrm{IPC}}{\mathrm{IPC}_0} \cdot \frac{\mathrm{instr}_0}{\mathrm{instr}}$

Requires Dimemas simulation
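A small numeric sketch of the factors that come directly from real execution metrics. It assumes that T_i is the useful computation time of process i and T the elapsed time of the run; the function names and the dictionary layout are hypothetical. The variant with macroLB and microLB needs a Dimemas simulation for T_ideal and is not covered.

```python
def efficiency_factors(useful_times, elapsed):
    """Load balance and communication efficiency from per-process times.

    useful_times: T_i for each of the P processes
    elapsed:      T, elapsed time of the run
    """
    p = len(useful_times)
    eff = [t / elapsed for t in useful_times]
    load_balance = sum(eff) / (p * max(eff))
    comm_eff = max(eff)
    return {
        "LB": load_balance,
        "CommEff": comm_eff,
        # Average efficiency, which equals LB * CommEff by construction.
        "ParallelEff": sum(eff) / p,
    }


def speedup(run, base):
    """Product of ratios relative to a reference run.

    run and base are dicts with keys P, LB, CommEff, IPC and instr.
    """
    return (
        (run["P"] / base["P"])
        * (run["LB"] / base["LB"])
        * (run["CommEff"] / base["CommEff"])
        * (run["IPC"] / base["IPC"])
        * (base["instr"] / run["instr"])
    )


if __name__ == "__main__":
    print(efficiency_factors([8.0, 6.0, 4.0, 6.0], elapsed=10.0))
```

Writing the speedup as a product of ratios makes it possible to see which single factor degrades when moving from one process count to another.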

Page 25

[Figure annotations: Replication of computation; Poor communication efficiency; Improved macro/micro load balance. Legend: Communc…, Micro Lo…, Macro Lo…, IPC, Computat…]

Page 26

CEPBA-Tools environment
• Paraver
• Dimemas

Research areas
• Time analysis
• Clustering
• On-line analysis
• Sampling

Models

Tools integration

How to? GROMACS analysis

Page 27

[Diagram: tools integration. Components: Pattern Analysis, MPITrace, DIMEMAS, VENUS (IBM-ZRL), MPSim, Clustering, bursts selection, Valgrind]

[Chart: BT:C.64, task 20, cluster 1. Bars: real, ideal_lFU, ideal_nFU, ideal_ROB, ideal_L1_caches, ideal_L1_instr_cache, ideal, real_lFU, real_nFU, real_ROB, real_L1_caches, real_L1_data_cache. X axis: 0 to 3]

Page 28

• Profile presentation tools
  • Reduction/aggregation of the performance dimensions
  • Time dimension disappeared / Space dimension sometimes
• Traces
  • “All” data is there

[Figure labels: Profile, stats; Variance; Tim…, spa…]

Page 29

Translator of Vampir and TAU traces

Page 30

[Diagram: tools integration. Components: HPCT, Pattern Analysis, Stats, Valgrind, …race, Filters, PARAVER, MPSim, Clustering, KOJAK, TAU, Vampir, otf2prv, DIMEMAS, VENUS (IBM-ZRL). File types: .prv, .pcf, .row, .trf, .cfg]

Page 31

CEPBA-Tools environment
• Paraver
• Dimemas

Research areas
• Time analysis
• Clustering
• On-line analysis
• Sampling

Models

Tools integration

How to? GROMACS analysis

Page 32

Test case: NUCLEOSOME

[Chart: X axis 2 to 1024. Series: GMX4_mn_work_nucleo_nsday, GMX4_mn_stop_nucleo_nsday, GMX4_cluster_work_nsday. Label: NAMD 2.7]

[Chart: Cluster / MN perf. Y axis: 0 to 3,5. X axis: 1 to 8]

Raw processor performance vs Communication efficiency?

Page 33

64 tasks
• whole run: 14GB
• filter only “large enough” computations: 25MB
• chop 1 iteration: 13MB

Page 34

[Tables: 64 tasks; 256 tasks. Rows: parallel eff., comm. eff., load balance, micro load balance, comp. balance, computation, IPC balance, IPC]

Main scalability problems
• Load balance
• Computation balance
  – 4% of code replication
• Communication problem
  – 64 tasks already poor efficiency

Page 35

64 tasks

Page 36

256 tasks

Page 37

Instructions histogram

64 tasks

256 tasks

Page 38

64 tasks

Page 39

64 tasks: globally balanced but some local imbalance

[Figure labels: Instructions imbalance; IPC imbalance]

Page 40

MPI calls

64 tasks

256 tasks

Duration of the Computation

64 tasks

256 tasks

Page 41

Load balance!!!

64 procs
– Trapezoidal
– IPC ∝ Instr

256 procs
– 2x instr
– 20% IPC
– IPC ∝ 1/Instr

Long waits for FFTs

Page 42

Tasks  phase      parallel eff.  load balance  comp. balance
64     all        0,55           0,93          0,81
64     FFTs       0,55           0,91          0,95
64     particles  0,54           0,94          0,87
256    all        0,31           0,61          0,58
256    FFTs       0,35           0,55          0,69
256    particles  0,29           0,74          0,72

[Chart: parallel eff., load balance and comp. balance for 64 all, 64 FFTs, 64 particles, 256 all, 256 FFTs, 256 particles. Y axis: 0 to 1]

Page 43

64 tasks

256 tasks

Page 44

64 tasks

Page 45

64 tasks
• Real
• Prediction nominal
• Prediction ideal
  – Not much gain
  – Significant serialization
• Dependences?

Page 46

Particles

• The imbalanced computation region where OpenMP would help for load balance is between the sendrecv calls in line 1533 of domdec.c

[Figure labels: MPI_Allreduce: 286@pme_pp.c; MPI_Recv: 427@pme_pp.c; MPI_Sendrecv: 1533@domdec.c; MPI_Waitall: 126@domdec_network.c. Color represents MPI caller line]

Page 47

FFTs

• Imbalanced computation between Alltoallvs in line [email protected]
  – Strange?
  – Better selection of # fft procs?
  – OpenMP?

[Figure labels: MPI_Sendrecv: 482@pme.c; MPI_Sendrecv: 7…; MPI_Sendrecv: 344@gmx_parallel_3dfft.c; MPI_Alltoallv: 241@fftgrid.c; MPI_Recv: 204@pme_pp.c. Color represents MPI caller line]

Page 48

Main scalability problem is load imbalance
• Mixing MPI+OpenMP with dynamic run time load balance

Mixing of Particles and FFTs tasks makes things difficult
• Two different granularities

Even if it was initially considered to have good performance, the 64 tasks run has poor communication efficiency

Page 49

Performance analysis: fun and colorful

Simple tools to analyze complex problems

Scalability requires tools to drive/support the analyst where to look at

Importance of the details to understand

Benefits from the tools interoperability

Tools freely available ([email protected])

Best effort support and training