31
TM Program Development Program Development Tools Tools SGI Performance Tools

TM Program Development Tools SGI Performance Tools

  • View
    219

  • Download
    4

Embed Size (px)

Citation preview

TM

Program Development ToolsProgram Development Tools

SGI Performance Tools

TM

Timing MechanismsTiming Mechanisms

•External timers provided by IRIXIRIX– timex– time– ssusage

• Internal Timers

TM

Timing MechanismsTiming Mechanisms

TM

The Event CountersThe Event Counters• The MIPS R1x000 cpu has on-chip event counters that are

used to monitor processor activity– there are 2 counters in R10000 processor– each counter has a selector for the events to be counted:

• events 0 - 15• events 16 - 31

– see man r10k_counters(5) for a complete description

• the R1x000 software counters are available through:– the IRIX perfex command– the profiling environment– a procedural interface (Fortran and C)

TM

The The perfexperfex Command Command perfex [-e e0][-e e1][-a][-mp][-y][-s][-x] cmd [args] where

– e0 and e1 are the event identifiers; should belong to non-overlapping groups (i.e. 0-15 one and 16-31 another)

– -a multiplex all events– -mp report counts for each thread, as well as totals– -y give timing estimates for each count and general statistics– -s start/stop counting on the USR1/USR2 signal– -x count events on exception as well

examples: perfex -e 2 -e 25 ./swp ………………. output from swp execution Issued loads ……………………………………………………. 13847619 Primary data cache misses …………………. 2310349

perfex -e 3 -e 26 ./swp ………………. output from swp execution Issued stores ……………………………………………………. 9176152 Secondary data cache misses ………………. 9029

Interpretation of event counters requires a model!

TM

perfex -tperfex -t: the Model Cost Table: the Model Cost Table Costs for IP27 processor MIPS R12000 CPU Cost Typical Minimum Maximum Event Counter Name ============================================================================================ 0 Cycles............................................. 1.00 clks 1.00 clks 1.00 clks 1 Decoded instructions............................... 0.00 clks 0.00 clks 1.00 clks 2 Decoded loads...................................... 1.00 clks 1.00 clks 1.00 clks 3 Decoded stores..................................... 1.00 clks 1.00 clks 1.00 clks 4 Miss handling table occupancy...................... 0.00 clks 0.00 clks 0.00 clks 5 Failed store conditionals.......................... 1.00 clks 1.00 clks 1.00 clks 6 Resolved conditional branches...................... 1.00 clks 1.00 clks 1.00 clks 7 Quadwords written back from scache........... 8.49 clks 5.90 clks 8.77 clks 8 Correctable scache data array ECC errors........... 0.00 clks 0.00 clks 1.00 clks 9 Primary instruction cache misses..............17.01 clks 4.34 clks 17.01 clks 10 Secondary instruction cache misses............99.89 clks 63.03 clks 99.89 clks 11 Instr misprediction from scache way prediction.. .. 0.00 clks 0.00 clks 1.00 clks 12 External interventions............................. 0.00 clks 0.00 clks 0.00 clks 13 External invalidations............................. 0.00 clks 0.00 clks 0.00 clks 14 ALU/FPU progress cycles............................ 1.00 clks 1.00 clks 1.00 clks 15 Graduated instructions............................. 0.00 clks 0.00 clks 1.00 clks 16 Executed prefetch instructions..................... 0.00 clks 0.00 clks 0.00 clks 17 Prefetch primary data cache misses................. 0.00 clks 0.00 clks 1.00 clks 18 Graduated loads.................................... 1.00 clks 1.00 clks 1.00 clks 19 Graduated stores................................... 1.00 clks 1.00 clks 1.00 clks 20 Graduated store conditionals....................... 1.00 clks 1.00 clks 1.00 clks 21 Graduated floating point instructions........ 1.00 clks 0.50 clks 52.00 clks 22 Quadwords written back from primary data cache..... 3.98 clks 3.14 clks 3.98 clks 23 TLB misses........................................77.78 clks 77.78 clks 77.78 clks 24 Mispredicted branches.............................. 7.28 clks 6.00 clks 8.81 clks 25 Primary data cache misses...................... 8.50 clks 2.17 clks 8.50 clks 26 Secondary data cache misses....................99.89 clks 63.03 clks 99.89 clks 27 Data misprediction from scache way prediction table 0.00 clks 0.00 clks 1.00 clks 28 State of intervention hits in scache.........……………. 0.00 clks 0.00 clks 0.00 clks 29 State of invalidation hits in scache............... 0.00 clks 0.00 clks 0.00 clks 30 Store/prefetch exclusive to clean block in scache.. 1.00 clks 1.00 clks 1.00 clks 31 Store/prefetch exclusive to shared block in scache. 1.00 clks 1.00 clks 1.00 clks

TM

perfex -a -yperfex -a -y: Multiplexing Events: Multiplexing Events Based on 300 MHz IP27 MIPS R12000 CPU Typical Minimum Maximum Event Counter Name Counter Value Time (sec) Time (sec) Time (sec) ========================================================================================================= 0 Cycles........................................... 5019528970688 16731.763236 16731.763236 16731.763236 16 Executed prefetch instructions................………... 8606259264 0.000000 0.000000 0.000000 4 Miss handling table occupancy.................….. 5443332412048 18144.441373 18144.441373 18144.441373 18 Graduated loads...............................…... 938555860624 3128.519535 3128.519535 3128.519535 2 Decoded loads..................................... 935411271824 3118.037573 3118.037573 3118.037573 25 Primary data cache misses......................... 83185778832 2356.930400 601.710467 2356.930400 21 Graduated floating point instructions............. 519085469744 1730.284899 865.142450 89974.814756 3 Decoded stores.................................... 413277338336 1377.591128 1377.591128 1377.591128 19 Graduated stores.................................. 411214746144 1370.715820 1370.715820 1370.715820 22 Quadwords written back from primary data cache.... 63439898464 841.635986 664.004271 841.635986 6 Resolved conditional branches..................... 231879145424 772.930485 772.930485 772.930485 26 Secondary data cache misses....................... 1951064336 649.639388 409.918617 649.639388 7 Quadwords written back from scache................ 13451750864 380.684549 264.551100 393.239517 23 TLB misses........................................ 1238242032 321.034884 321.034884 321.034884 9 Primary instruction cache misses.................. 1719354640 97.487408 24.873330 97.487408 24 Mispredicted branches............................. 3744555216 90.867873 74.891104 109.965105 10 Secondary instruction cache misses................ 26286992 8.752692 5.522897 8.752692 31 Store/prefetch exclusive to shared block in scache 1435428608 4.784762 4.784762 4.784762 20 Graduated store conditionals...................... 639716576 2.132389 2.132389 2.132389 5 Failed store conditionals......................... 54079168 0.180264 0.180264 0.180264 30 Store/prefetch exclusive to clean block in scache. 20579264 0.068598 0.068598 0.068598 1 Decoded instructions............................. 3312748123920 0.000000 0.000000 11042.493746 8 Correctable scache data array ECC errors......… 0 0.000000 0.000000 0.000000 11 Instruction mispred from scache way prediction table..166448576 0.000000 0.000000 0.554829 12 External interventions............................ 1584014720 0.000000 0.000000 0.000000 13 External invalidations............................ 3304023728 0.000000 0.000000 0.000000 14 ALU/FPU progress cycles........................... 0 0.000000 0.000000 0.000000 15 Graduated instructions........................... 3268746182224 0.000000 0.000000 10895.820607

TM

perfex -yperfex -y Statistics (R12000) Statistics (R12000) Statistics

========================================================================================= Graduated instructions/cycle................................................ 0.651206 Graduated floating point instructions/cycle................................. 0.103413 Graduated loads & stores/cycle.............................................. 0.268904 Graduated loads & stores/floating point instruction......................... 2.600286 Mispredicted branches/Resolved conditional branches......................... 0.016149 Graduated loads /Decoded loads ( and prefetches )........................... 0.994214 Graduated stores/Decoded stores............................................. 0.995009 Data mispredict/Data scache hits............................................ 0.036721 Instruction mispredict/Instruction scache hits.............................. 0.098312 L1 Cache Line Reuse......................................................... 15.225978 L2 Cache Line Reuse......................................................... 41.636102 L1 Data Cache Hit Rate...................................................... 0.938370 L2 Data Cache Hit Rate...................................................... 0.976546 Time accessing memory/Total time............................................ 0.467783 L1--L2 bandwidth used (MB/s, average per process)........................... 219.760658 Memory bandwidth used (MB/s, average per process)........................... 27.789316 MFLOPS (average per process)................................................ 31.023955 Cache misses in flight per cycle (average).................................. 1.084431 Prefetch cache miss rate.................................................... 0.606133

TM

Explanation of Statistics/1Explanation of Statistics/1• L1 cache line reuse: should be large >>1

– it is the number of times, on average that a primary data cache line is used after it has been moved into the cache

• L2 cache line reuse: should be large >>1– it is the average number of times the cache line was used after it moved into

the cache

• L1 data cache hit rate: should be ~1– the fraction of data accesses that finds data already resident in L1

• L2 data cache hit rate: should be ~1– the fraction of data accesses that finds data already resident in L2

loads + stores - missesmisses

L1 cache line reuse =

L1 cache misses - L2 cache missesL2 cache misses

L2 cache line reuse =

L1 cache missesloads + stores

L1 data cache hit rate = 1-

L2 cache missesL1 cache misses

L2 data cache hit rate = 1-

TM

Explanation of Statistics/2Explanation of Statistics/2• Instruction or data mispredictions/hits: should be small (~0)

• L1-L2 bandwidth used ([MB/s] average per process)– total number of bytes moved between L1 and L2 divided by total run time

• Memory bandwidth used ([MB/s] average per process)– total number of bytes moved between L2 an memory divided by total time

• Performance Mflop/s: larger is better– it is computed from the #fp instructions divided by total run time– the madd instruction counts as 1, so this statistic usually greatly

underestimates the performance in Mflop/s

• Time accessing memory/Total time: should be small (~0)

#bytes moved = (L1 cache miss * L1 cache line size + Quadwords written back *16 bytes)

#bytes moved = (L2 cache miss * L2 cache line size + Quadwords written back *16 bytes)

T(loads+stores+L1 miss+L2 miss+TLB miss)total run time

TM

Explanation of Statistics/3Explanation of Statistics/3 The R12000 processor gives additional information:• Miss handling table occupancy:

– average number of entries in MHT. A higher value indicates more outstanding cache misses. An interesting observation is to get

– it should be >1, since if it is close to 1 there might be dependency in the data that prevents the processor to issue multiple outstanding cache misses

– it could be <1 for code that fits into the cache– high values for Mh while the memory bandwidth is low would indicate a lot

of L1 misses into the same L2 cache line

• Prefetch miss rate:– rate of prefetch instructions missing in the first level cache

– should be close to 1: Every prefetch to be effective should miss in the cache.

MHTcyclesMh= = 0 .. 5

L1 miss prefetch(17)#prefetch instructions(16)Rp= = 0 .. 1

TM

R1x000 Important Events R1x000 Important Events

R10000 R12000Event Description Event Desctiption0 Cycles

4 Miss handling table occupancy

7 Quadwords written back from L29 L1 instruction cache misses10 L2 instruction cache misses12 External interventions13 External invalidations14 ALU/FPU forward progress cycles15 Graduated instructions

16 Prefetch instructions18 Graduated loads19 Graduated stores21 Graduated fp instructions22 Quadwords written back from L123 TLB misses25 L1 data cache misses26 L2 data cache misses28 External intervention hits in L229 External invalidation hits in L230 Store/prefetch exclusive to clean block in L231 Store/prefetch exclusive to shared block in L2

TM

Cache Coherency CountersCache Coherency Counters• The directory cache coherency protocol sends signals to nodes that

requested data from memory in the past

• Invalidations are transactions that convert cache lines from the “shared” to invalid state

• Interventions are transactions that convert cache line from the exclusive to “shared” or “invalid” (depending on whether the requestor asked for “shared” or “exclusive” access)

• Counters 12 and 13 tell how many interventions and invalidations have been received by the processor; counters 28 and 29 tell how many of them actually hit in the cache

• Invalidations and/or interventions are sent to a node because the cache line owner thinks that node has the cache line. However, if that processor has dropped the cache line earlier, the signal will not find the data in the cache and counters 12 or 13 will be updated, while 28 or 29 will not.

TM

Explanation: Useful CountersExplanation: Useful Counters• Cycles are proportional to time spend on the cpu

– spin times are added, but wait times are not

• Graduated instructions divided by cycles will give the IPC measure for the program. The R1x000 processor can process up to 4 IPC, a value of IPC >1.5 should be considered good.

• ALU/FPU progress cycles are the actual computation on the processor

• TLB misses is counting how many times the virtual translation entry has not been found in the Translation Look-aside Table. If that is a large contribution, larger size pages might be used in the application

TMEvent CountersEvent CountersProcedural InterfaceProcedural Interface C synopsis

Fortran synopsis

• The event specifiers e0 and e1 can also be set (or overruled) at run time by the environment variables – T5_EVENT0 and T5_EVENT1

• link with -lperfex

• the counter values c0 and c1 are 64 bit integers

int start_counters(int e0, int e1);int read_counters( int e0, long long *c0,

int e1, long long *c1);int print_counters(int e0, long long c0,

int e1, long long c1);

INTEGER*8 C0, C1INTEGER*4 E0, E1INTEGER*4 FUNCTION START_COUNTERS(E0, E1)INTEGER*4 FUNCTION READ_COUNTERS(E0, C0, E1, C1)INTEGER*4 FUNCTION PRINT_COUNTERS(E0, C0, E1, C1)

TM

perfexperfex Measurements of matmul Measurements of matmul

Statistics ========residence ======== L1 L2 MEM

Graduated instructions/cycle................ 1.85 1.82 1.71Graduated floating point instructions/cycle.… 0.80 0.94 0.88Graduated loads & stores/cycle............… 0.64 0.54 0.50Graduated loads & stores/floating point instr 0.81 0.57 0.57L1 Cache Line Reuse......................... 205758.87 4.91 4.1L2 Cache Line Reuse.......................... 64.92 13343.20 202.52

L1 Data Cache Hit Rate....................... 0.99 0.83 0.80L2 Data Cache Hit Rate....................... 0.98 0.99 0.99Time accessing memory/Total time............. 0.64 1.32 1.4L1--L2 bandwidth used (MB/s, average per proc) 0.05 907.08 977.7Memory bandwidth used (MB/s, average per proc) 0.05 0.32 30.4MFLOPS (average per process)................. 239.15 281.70 266.0Cache misses in flight per cycle (average)... 0.00 1.05 1.34

(N=20) (N=500) (N=2000)

}

The C matrix multiply program has been compiled with

-mips4 -O3 -OPT:alias=restrict -mp To measure MEM resident performance, use large pages: setenv PAGESIZE_STACK 1024Ksetenv PAGESIZE_DATA 1024K

Enabled when-mp is used

TM

The SpeedShop ProfilerThe SpeedShop Profiler• No recompilation or relinking is necessary

• Multiple performance experiments and views

• The following performance experiments are supported:– Program Counter (PC) sampling -pcsamp

• to get exclusive time per subroutine and per source line• no call stack information

– User Time -usertime• to get exclusive & inclusive times per subroutine (call stack)• no source line information

– Hardware counter profiling -”event”_hwc– Communication overhead of MPI -mpi– Floating point exception tracing -fpe– Tracing of I/O system call -io– malloc/free tracing -heap– basic-block counting experiment -ideal

see man speedshop for explanation of the experiments

• usage:– ssrun -exp [options] program [args]– prof “ssrun output file”

TM

SpeedShop ExperimentsSpeedShop Experiments

histogram

main

sub1

sub2

sub3

subn

instructions

0x000

0x100

0x1f0

0xcf00xdf00xe00

0xf30

PC Sampling Experiment:Record current instruction addressin a histogram (Overhead <1%)16 bit wide bins (e.g. -pcsamp)32 bit wide bins (e.g. -pcsampx)

Usertime Sampling Experiment:Record all addresses on call stack (Overhead <5%)

Program Counter (PC)

Time interrupt

Event interrupt

10 ms normal experiments (e.g. -pcsamp) 1 ms “fast” experiments (e.g. -fpcsamp)

Event dependent thresholds“fast” experiments exist

0x000 0xf30address

Usage counts

16 bit numberor

32 bit number

TM

ssrunssrun: Experiment Recording: Experiment Recording ssrun writes out performance measurements into a binary file which is named after the program:

ssrun -expname options progname prog-arg here the output will be binary files with the name progname.expname.ID#pid where ID is a letter code as follows:• m the master process created by ssrun• p slave processes in parallel execution (call to sproc())• f slave processes created by a call to fork()• e process created by a call to exec()• s a system call process

examples: > mpirun -np 2 ssrun -mpi cgmd

-rw-r--r-- 1 zacharov user 237672 Sep 10 13:50 cgmd.mpi.f318839 -rw-r--r-- 1 zacharov user 4733496 Sep 10 13:50 cgmd.mpi.f318854 -rw-r--r-- 1 zacharov user 11896 Sep 10 13:50 cgmd.mpi.m318787 > ssrun -usertime cgmd -rw-r--r-- 1 zacharov user 1613562 Sep 10 14:28 cgmd.usertime.m320140 -rw-r--r-- 1 zacharov user 1601776 Sep 10 14:28 cgmd.usertime.p320140

TM

profprof: Performance Interpretation: Performance Interpretation• prof is the command-line interpreter for ssrun experiments• Feedback file generation for compiler and linker• Use cvperf(1)(not prof) for heap and MPI trace analysis

prof options:• -heavy report the performance per source code line;

order in descending performance measure• -lines report the performance per source code line;

order in program sequence• -usage summary of the system resources used by the

program• -gprof show inclusive timing of callers and callees of each

function• -q n truncate after the first n routines or lines• -q n%truncate after the routine that takes >n% of total time• -q ncum% truncate after n% of cumulative time

warning: executable should not contain strip-ed parts

TM

PC Sampling Experiment/1PC Sampling Experiment/1 % ssrun -fpcsampx ./program.exe % prof -lines program.exe.fpcsampx.m1329 ------------------------------------------------------------------------- SpeedShop profile listing generated Tue Feb 10 17:53:44 1998 prof -lines program.exe.fpcsampx.m1329 program.exe (n32): Target program fpcsampx: Experiment name pc,4,1000,0:cu: Marching orders R10000 / R10010: CPU / FPU 4: Number of CPUs 195: Clock frequency (MHz.) Experiment notes-- From file program.exe.fpcsampx.m1329: Caliper point 0 at target begin, PID 1329 /usr/users/Examples/speedshop/./program.exe Caliper point 1 at exit(0) ------------------------------------------------------------------------- Summary of statistical PC sampling data (fpcsampx)-- 26758: Total samples 26.758: Accumulated time (secs.) 1.0: Time per sample (msecs.) 4: Sample bin width (bytes) ------------------------------------------------------------------------- Function list, in descending order by time ------------------------------------------------------------------------- [index] secs % cum.% samples function (dso: file, line)

[1] 9.180 34.3% 34.3% 9180 sub1 (prog.exe: program.f, 39) [2] 7.070 26.4% 60.7% 7070 update_sum (prog.exe: program.f,62) [3] 6.182 23.1% 83.8% 6182 sub3 (prog.exe: program.f, 112) [4] 1.968 7.4% 91.2% 1968 irand_ (libftn.so: rand_.c, 62) [5] 1.317 4.9% 96.1% 1317 rand_ (libftn.so: rand_.c, 67) [6] 1.040 3.9% 100.0% 1040 ssdemo (prog.exe: program.f, 1) 0.001 0.0% 100.0% 1 **OTHER** (excluded DSOs, rld,etc.)

26.758 100.0% 100.0% 26758 TOTAL

TM

PC Sampling Experiment/2PC Sampling Experiment/2 ------------------------------------------------------------------------- Line list, in descending order by function-time and then line number ------------------------------------------------------------------------- secs % cum.% samples function (dso: file, line)

0.002 0.0 0.0 2 sub1 (program.exe: program.f, 49) 0.007 0.0 0.0 7 sub1 (program.exe: program.f, 50) 0.930 3.5 3.5 930 sub1 (program.exe: program.f, 54) 8.241 30.8 34.3 8241 sub1 (program.exe: program.f, 55)

0.858 3.2 37.5 858 update_sum (program.exe: program.f, 72) 1.263 4.7 42.2 1263 update_sum (program.exe: program.f, 73) 4.949 18.5 60.7 4949 update_sum (program.exe: program.f, 74)

0.001 0.0 60.7 1 sub3 (program.exe: program.f, 122) 1.029 3.8 64.6 1029 sub3 (program.exe: program.f, 123) 5.152 19.3 83.8 5152 sub3 (program.exe: program.f, 124)

0.785 2.9 86.8 785 irand_ (libftn.so: rand_.c, 62) 1.183 4.4 91.2 1183 irand_ (libftn.so: rand_.c, 63)

0.076 0.3 91.5 76 rand_ (libftn.so: rand_.c, 67) 1.241 4.6 96.1 1241 rand_ (libftn.so: rand_.c, 69)

0.174 0.7 96.8 174 ssdemo (program.exe: program.f, 20) 0.866 3.2 100.0 866 ssdemo (program.exe: program.f, 21)

TMssrunssrun: Performance Experiments : Performance Experiments (hardware counter)(hardware counter) The R1x000 event counters can be used for performance analysis. All experiments are described in man ssrun(1) Useful experiments:• -[f]dsc_hwc

– Counts overflows of the 2nd level cache miss counter (counter 26)– the address is recorded if counter reaches 131 counts (29 for “fast”)

• -[f]gfp_hwc– overflows of the graduated floating point instruction counter (21)– the address is recorded if counter reaches 32771 counts (6553 for

“fast” experiments)• -prof_hwctime

– profiles the counter specified by the environment variable _SPEEDSHOP_HWC_COUNTER_PROF_NUMBER using call-stack sampling

– addresses are recorded based on the overflow of the env variable _SPEEDSHOP_HWC_COUNTER_OVERFLOW

• -prof_hwc– same as prof_hwctime for PC sampling with event counter overflow

TM

Hardware Counters Experiment/1Hardware Counters Experiment/1 % ssrun -ftlb_hwc program.exe ; prof program.exe.ftlb_hwc.m1335 ------------------------------------------------------------------------- SpeedShop profile listing generated Tue Feb 10 17:54:23 1998 prof -lines program.exe.ftlb_hwc.m1335 program.exe (n32): Target program ftlb_hwc: Experiment name hwc,23,53:cu: Marching orders R10000 / R10010: CPU / FPU 4: Number of CPUs 195: Clock frequency (MHz.) Experiment notes-- From file program.exe.ftlb_hwc.m1335: Caliper point 0 at target begin, PID 1335 /usr/users/Examples/speedshop/./program.exe Caliper point 1 at exit(0) ------------------------------------------------------------------------- Summary of R10K perf. counter overflow PC sampling data (ftlb_hwc)-- 412712: Total samples TLB misses (23): Counter name (number) 53: Counter overflow value 21873736: Total counts ------------------------------------------------------------------------- Function list, in descending order by counts ------------------------------------------------------------------------- [index] counts % cum.% samples function (dso: file, line)

[1] 10224919 46.7% 46.7% 192923 sub1 (program.exe: program.f, 39) [2] 5252035 24.0% 70.8% 99095 sub3 (program.exe: program.f, 112) [3] 5111691 23.4% 94.1% 96447 update_sum (program.exe: program.f, 62) [4] 1043729 4.8% 98.9% 19693 ssdemo (program.exe: program.f, 1) [5] 241097 1.1% 100.0% 4549 rand_ (libftn.so: rand_.c, 67) [6] 53 0.0% 100.0% 1 __checktraps (libc.so.1: stubfpeexit.c, 3) [7] 53 0.0% 100.0% 1 exceed_length (libftn.so: wrtfmt.c, 1421) [8] 53 0.0% 100.0% 1 wrt_I (libftn.so: wrtfmt.c, 102) [9] 53 0.0% 100.0% 1 do_Lio_com (libftn.so: lio.c, 803) 53 0.0% 100.0% 1 **OTHER** (includes excluded DSOs, rld, etc.)

21873736 100.0% 100.0% 412712 TOTAL

TM

Hardware Counters Experiment/2Hardware Counters Experiment/2 Use the environment variables (see man speedshop) : % setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 6

% setenv _SPEEDSHOP_HWC_COUNTER_OVERFLOW 100

% ssrun -prof_hwc ./program.exe

% prof -lines program.exe.prof_hwc.m1336 ------------------------------------------------------------------------- SpeedShop profile listing generated Tue Feb 10 17:54:56 1998 prof -lines program.exe.prof_hwc.m1336 program.exe (n32): Target program prof_hwc: Experiment name hwc:cu: Marching orders R10000 / R10010: CPU / FPU 4: Number of CPUs 195: Clock frequency (MHz.) Experiment notes-- From file program.exe.prof_hwc.m1336: Caliper point 0 at target begin, PID 1336 /usr/users/Examples/speedshop/./program.exe Caliper point 1 at exit(0) ------------------------------------------------------------------------- Summary of R10K perf. counter overflow PC sampling data (prof_hwc)-- 2153328: Total samples Decoded branches (6): Counter name (number) 100: Counter overflow value 215332800: Total counts etc…

TM

User Time ExperimentUser Time Experiment SpeedShop profile listing generated Tue Feb 10 17:51:59 1998 prof -gprof program.exe.usertime.m1332 program.exe (n32): Target program usertime: Experiment name ut:cu: Marching orders R10000 / R10010: CPU / FPU 4: Number of CPUs 195: Clock frequency (MHz.) Experiment notes-- From file program.exe.usertime.m1332: Caliper point 0 at target begin, PID 1332 /usr/users/ruud/Examples/speedshop/./program.exe Caliper point 1 at exit(0) ------------------------------------------------------------------------- Summary of statistical callstack sampling data (usertime)-- 301: Total Samples 0: Samples with incomplete traceback 9.030: Accumulated Time (secs.) 30.0: Sample interval (msecs.) ------------------------------------------------------------------------- Function list, in descending order by exclusive time ------------------------------------------------------------------------- [index] excl.secs excl.% cum.% incl.secs incl.% samples procedure (dso: file, line) [4] 2.850 31.6% 31.6% 5.370 59.5% 179 sub1 (program.exe: program.f, 39) [6] 2.520 27.9% 59.5% 2.520 27.9% 84 update_sum (program.exe: program.f, 62) [5] 1.980 21.9% 81.4% 3.000 33.2% 100 sub3 (program.exe: program.f, 112) [8] 0.660 7.3% 88.7% 0.660 7.3% 22 irand_ (libftn.so: rand_.c, 62) [7] 0.480 5.3% 94.0% 1.140 12.6% 38 rand_ (libftn.so: rand_.c, 67) [1] 0.390 4.3% 98.3% 9.030 100.0% 301 ssdemo (program.exe: program.f, 1) [10] 0.090 1.0% 99.3% 0.090 1.0% 3 _write (libc.so.1: write.s, 15) [16] 0.030 0.3% 99.7% 0.030 0.3% 1 s_wsle64 (libftn.so: lio.c, 132) [17] 0.030 0.3% 100.0% 0.030 0.3% 1 s_stop (libftn.so: s_stop.c, 54) [2] 0.000 0.0% 100.0% 9.030 100.0% 301 __start (program.exe: crt1text.s, 103) [3] 0.000 0.0% 100.0% 9.030 100.0% 301 main (libftn.so: main.c, 76) [9] 0.000 0.0% 100.0% 0.120 1.3% 4 sub2 (program.exe: program.f, 81) [11] 0.000 0.0% 100.0% 0.090 1.0% 3 e_wsle (libftn.so: lio.c, 205)

TM

-usertime-usertime experiment experiment -gprof-gprof ------------------------------------------------------------------------- Butterfly function list, in descending order by inclusive time ------------------------------------------------------------------------- attrib.% attrib.time incl.time caller (callsite) [index] [index] incl.% incl.time self% self-time procedure [index] attrib.% attrib.time incl.time callee (callsite) [index] ------------------------------------------------------------------------- 100.0% 9.030 9.030 main (@0x0b35c004; libftn.so: main.c, 97) [3] [1] 100.0% 9.030 4.3% 0.390 ssdemo [1] 59.5% 5.370 5.370 sub1 (@0x10000ef0; prog.exe: program.f, 27) [4] 33.2% 3.000 3.000 sub3 (@0x10000f30; program.exe: program.f,33) [5] 1.3% 0.120 1.140 rand_ (@0x10000e20; program.exe: program.f, 21) [7] 1.3% 0.120 0.120 sub2 (@0x10000f14; program.exe: program.f, 31) [9] 0.3% 0.030 0.030 sstop (@0x10000f68; program.exe: program.f, 36)[17] ------------------------------------------------------------------------- [2] 100.0% 9.030 0.0% 0.000 __start [2] 100.0% 9.030 9.030 main (@0x10000d68; program.exe: crt1text.s, 166)[3] ------------------------------------------------------------------------- 100.0% 9.030 9.030 start (@0x10000d68; program.exe: crt1text.s,166)[2] [3] 100.0% 9.030 0.0% 0.000 main [3] 100.0% 9.030 9.030 ssdemo (@0x0b35c004; libftn.so: main.c, 97) [1] ------------------------------------------------------------------------- 59.5% 5.370 9.030 ssdemo (@0x10000ef0; program.exe: program.f, 27)[1] [4] 59.5% 5.370 31.6% 2.850 sub1 [4] 27.9% 2.520 2.520 update_sum(@0x10001024;program.exe:program.f,51)[6] ------------------------------------------------------------------------- 33.2% 3.000 9.030 ssdemo (@0x10000f30; program.exe: program.f, 33)[1] [5] 33.2% 3.000 21.9% 1.980 sub3 [5] 11.3% 1.020 1.140 rand_ (@0x10001470; program.exe: program.f, 124)[7] ------------------------------------------------------------------------- 27.9% 2.520 5.370 sub1 (@0x10001024; program.exe: program.f, 51) [4] [6] 27.9% 2.520 27.9% 2.520 update_sum [6] ------------------------------------------------------------------------- 1.3% 0.120 9.030 ssdemo (@0x10000e20; program.exe: program.f, 21)[1] 11.3% 1.020 3.000 sub3 (@0x10001470; program.exe: program.f, 124) [5] [7] 12.6% 1.140 5.3% 0.480 rand_ [7] 7.3% 0.660 0.660 irand_ (@0x0b340f6c; libftn.so: rand_.c, 69) [8] ------------------------------------------------------------------------- 7.3% 0.660 1.140 rand_ (@0x0b340f6c; libftn.so: rand_.c, 69) [7] [8] 7.3% 0.660 7.3% 0.660 irand_ [8] -------------------------------------------------------------------------

TM

SpeedShop Basic ToolsSpeedShop Basic Tools• ssrun

– basic performance data generation tool– outputs binary file with the performance experiment results per program or execution

thread• prof

– human readable interpretation of the ssrun binary files• sscord/ssorder and sswsextr

– used to extract the working set information from a program to re-order the program subroutines to pack frequently executed instructions on same memory pages (reduces TLB miss for instr.)

• ssusage: ssusage program prog-args– similar to time, but gives resident size and I/O characteristics

• squeeze and thrash– change amount of memory available to the user program

• ssdump– dump the binary performance data files in ASCII format

• fbdump– writes compiler feedback files from prof

• ssaggregate– combines multiple SpeedShop binary performance data files into one

TM

SpeedShop Performance ViewsSpeedShop Performance Views

• Function list, program graph, butterfly

• Source code with performance annotation

• Disassembly and basic blocks

• Architectural information

• Usage data (resource consumption on the machine)

• Heap analysis and memory leaks

• Working sets

• MP overhead view (Pthreads, MPI, OpenMP)

TM

WorkShop cvperf ViewsWorkShop cvperf Views

TM

SummarySummary There is a large set of performance measurement tools:

• to collect the running status of the machine

• to collect the resource usage of individual programs

There is a large set of performance analysis tools:• perfex, based on the processor event counters, valid for

evaluation of complete programs or parts of programs

• SpeedShop tools for command-line evaluation of the performance data