Profiling Tools on the NERSC Crays and IBM/SP
NERSC User Services
NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER
Outline
• Profiling Tools on NERSC platforms
– Cray PVP (killeen, seymour)
– Cray T3E (mcurie)
– IBM/SP (gseaborg)
• UNIX profiling/performance analysis tools
• References
Why Profile?
• Characterise application :
– Is code cpu bound?
– Is code I/O bound?
– Is code memory bound?
– Analyse communication patterns (distributed-memory codes)
• Focus optimisation effort and, ultimately,
• Improve performance and resource utilisation
Cray PVP/T3E - Application Characterization
• Job accounting (ja)
– ja
– ./a.out
– ja -st -n a.out (see next slide for sample output)
• Look out for:
– Maximum Memory Used > available memory
– Total I/O wait time (locked + unlocked) > 50% of User CPU time
– Multitasking breakdown for parallel codes
Job accounting: summary report
Elapsed Time              : 8 Seconds
User CPU Time             : 35.5939 Seconds
Multitasking/Multistreaming Breakdown
(Concurrent CPUs * Connect seconds = CPU seconds)
      1 * 0.0100 =  0.0100
      2 * 0.0100 =  0.0200
      3 * 0.0600 =  0.1800
      4 * 8.8500 = 35.4000
 (Avg.)  (total)   (total)
   3.99 * 8.9300 = 35.6100
System CPU Time           : 0.1226 Seconds
I/O Wait Time (Locked)    : 0.0000
I/O Wait Time (Unlocked)  : 0.0000
CPU Time Memory Integral  : 5.3854 Mword-seconds
Data Transferred          : 0.0001 MWords
Maximum memory used       : 0.4746 MWords
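The "look out for" checks on a ja summary can be applied mechanically. The Python sketch below is illustrative only: it hard-codes the figures from the sample report above, and the helper names `io_bound` and `average_concurrency` are hypothetical, not part of ja.

```python
# Values copied from the sample ja summary report.
user_cpu = 35.5939          # User CPU Time (seconds)
io_wait_locked = 0.0000     # I/O Wait Time (Locked)
io_wait_unlocked = 0.0000   # I/O Wait Time (Unlocked)
# Multitasking breakdown: (concurrent CPUs, connect seconds)
breakdown = [(1, 0.0100), (2, 0.0100), (3, 0.0600), (4, 8.8500)]

def io_bound(user_cpu, locked, unlocked, threshold=0.5):
    """True if total I/O wait exceeds `threshold` times the user CPU time."""
    return (locked + unlocked) > threshold * user_cpu

def average_concurrency(breakdown):
    """Total CPU seconds divided by total connect seconds."""
    cpu_seconds = sum(n * t for n, t in breakdown)
    connect_seconds = sum(t for _, t in breakdown)
    return cpu_seconds / connect_seconds

print(io_bound(user_cpu, io_wait_locked, io_wait_unlocked))
print(round(average_concurrency(breakdown), 2))
```

For this sample the I/O check does not fire, and the average concurrency reproduces the 3.99 shown in the "(Avg.)" row, i.e. the job kept close to 4 CPUs busy.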
HPM - Hardware Performance Monitor
• Helps locate CPU-related code bottlenecks
– reports use of vector registers, instruction buffers, memory ports
• hpm {options} ./a.out {prog_arguments}
– options = -g2 -> memory access information
– options = -g3 -> vector register information
• Look for:
– Ratio of Floating Ops/CPU second to CPU mem. references per sec should reflect the FpOps in the code
Sample hpm output: (hpm -g0 ./a.out)
Million inst/sec (MIPS)  :  7.67    Instructions      : 274017290
Avg. clock periods/inst  : 26.06
% CP holding issue       : 94.02    CP holding issue  : 6714667737
Inst.buffer fetches/sec  :  0.04M   Inst.buf. fetches : 1420802
Floating adds/sec        : 15.40M   F.P. adds         : 550002417
Floating multiplies/sec  : 24.36M   F.P. multiplies   : 870004996
Floating reciprocal/sec  :  0.28M   F.P. reciprocals  : 10000042
Cache hits/sec           :  0.00M   Cache hits        : 45893
CPU mem. references/sec  : 34.64M   CPU references    : 1236978495
Floating ops/CPU second  : 40.5M
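The ratio of floating-point operations to memory references can be worked out directly from the sample figures. This Python sketch is illustrative only: the numbers are copied from the sample hpm output, and the variable names are hypothetical.

```python
# Rates copied from the sample hpm output (millions per CPU second).
flops_per_sec = 40.5       # Floating ops/CPU second
mem_refs_per_sec = 34.64   # CPU mem. references/sec

# Raw event counts from the same report.
fp_adds = 550002417
fp_multiplies = 870004996
fp_reciprocals = 10000042
cpu_references = 1236978495

# Floating-point operations per memory reference, computed two ways.
ratio_from_rates = flops_per_sec / mem_refs_per_sec
ratio_from_counts = (fp_adds + fp_multiplies + fp_reciprocals) / cpu_references

print(f"{ratio_from_rates:.2f}")
print(f"{ratio_from_counts:.2f}")
```

Both forms give a little under 1.2 floating-point operations per memory reference for this run. Whether that is acceptable depends on how many operations per operand the source code is expected to perform.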
Cray PVP : CPU Bound Codes: prof/profview
• Instruments code to provide % cpu time in function calls
• f90 -lprof prog.f90
• ./a.out -> generates prof.data
• prof -st ./a.out > prof.report
• Chart (over) indicates relative distribution of CPU execution time by function call
– prof -x a.out > pgm.prof
– profview pgm.prof
Profview - Sample Output
I/O and Memory Bound Codes : procstat/procview
• procstat -m -i -R a.raw a.out
• procview a.raw
– I/O Analysis :
• Reports -> Files -> All User Files (Long Report)
• Bytes Processed or I/O Wait Time
– Memory Analysis :
• Reports -> Processes -> Maximum Memory Used (Long Format)
I/O Bound Codes : procview
• procview indicates which files consume most real time for I/O processing
Memory Bound Codes : procview
– “High” (> 10% of Elapsed Time) Time to complete Memory requests may indicate memory bound code
– Use Graphs option to produce plot of Memory use over elapsed time of application
ATExpert - Autotasking Prediction
• Analysis of source code to predict autotasking performance on dedicated Cray PVP
• f90 -eX -O3 -r4 -o {prog_name} prog.f90
– ./a.out
– atexpert -> shows predicted speed-up
ATExpert Sample Output
Indicates predicted speed-up of 4.3 on dedicated 8 processor PVP when source code is autotasked
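One way to read a predicted speed-up is through Amdahl's law, S = 1 / ((1 - p) + p / N), where p is the parallel fraction and N the processor count. This is an assumption made here for illustration: the slides do not say which model ATExpert uses, and the function names below are hypothetical.

```python
def parallel_fraction(speedup, nprocs):
    """Invert Amdahl's law S = 1 / ((1 - p) + p / N) to solve for p."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / nprocs)

def amdahl_speedup(p, nprocs):
    """Speed-up predicted by Amdahl's law for parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / nprocs)

# Figures from the sample: speed-up of 4.3 on 8 processors.
p = parallel_fraction(4.3, 8)
print(f"{p:.3f}")
print(f"{amdahl_speedup(p, 8):.1f}")
```

Under that assumption, a speed-up of 4.3 on 8 processors corresponds to roughly 88% of the run time being parallelisable.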
Also available on Cray PVP
• flowtrace/flowview
– times subroutines and functions during program execution (using operating system timers)
• jumptrace/jumpview
– provides exact timing in function/subroutine by analysis of machine instructions in program
• perftrace/perfview
– times subroutines/functions based on statistics gathered from HPM tool
Cray T3E - Apprentice
• Locate performance problems/inefficiencies
– MPI and shared memory performance, load balance and communication, memory use
• Provides hardware performance information and tuning recommendations (Displays -> Observations)
• Compile/link
– f90 -o {prog} -eA {prog_name.f90} -lapp
– cc -o {prog} -happrentice {prog_name.c} -lapp
• Run code to generate app.rif
Output from :
apprentice app.rif
Cray T3E - PAT
• Generates profile of CPU time in functions; load balance across PEs; h/w counter info.
• Compile and link with PAT library
– f90 -o exe -lpat {source.f} pat.cld
• Run program as normal
– mpprun -n {procs} {exe} -> generates exe.pif
• pat executable exe.pif
Profile based on relative CPU time in function calls
Load Balance Histogram for routine “COLL”
Cray T3E - ACTS/TAU
• Performance analysis of distributed/shared memory applications (C++ in particular)
– module load tau
– instrument programs with TAU macros
– add $(TAU_DEFS), $(TAULIBS) to compile/link
– run application; view tracefile with pprof, VAMPIR
• Reference
– http://acts.nersc.gov/tau
– http://hpcf.nersc.gov/training/classes/Teleconf/1999july/Wu
Cray T3E - Vampir
• Analysis of message passing characteristics - generates display of MPI activity over instrumented time period (e.g. sender, receiver, message size, elapsed time)
– module load VAMPIR; module load vampirtrace
– Facility to instrument with VAMPIRtrace calls
– Generate trace file using TAU or VAMPIRtrace
• Reference:
– http://hpcf.nersc.gov/software/tools/vampir.html
IBM/SP - Xprofiler
• Graphical interface for gprof profiles of parallel applications
– Compile and link code with “-g -pg”
– poe ./a.out -procs {n}
• generates gmon.out.{n} file for each process
• may introduce significant (up to a factor of 2) overhead
– (In $TMPDIR) xprofiler ./a.out gmon.out.*
• Report menu provides (gprof) text profile
• Source statement profiling shown
Statement-level profile is available by clicking on the relevant function in the graphical output - use the Show Source Code option
IBM/SP - Visualization Tool (VT)
• Message passing trace visualization
• Realtime system activity monitor (limited)
• MPI load balance overview:
– poe ./a.out -procs {n} -tlevel=3
– copy a.out.trc to $TMPDIR
– (In $TMPDIR) Invoke vt
– In trace visualization mode, “Play” a.out.trc
• see next slide for sample of Interprocessor Communication during program execution
IBM/SP : system_stats
• IBM Internal Tool
– module load sptools
– instrument code with system_stats() call
– Link with $(SPTOOLS), run code as normal
• Sample output
Summary of the utilization of system resources:
node  hostname  wall(s)  user(s)  sys(s)  size(KB)  pswitches
   0  gs01015     16.80    13.18    0.04      2748       2138
   1  gs01015     16.80    16.07    0.04      2744       1868
   2  gs01003     16.80    16.62    0.04      2740       1870
   3  gs01003     16.80    16.56    0.03      2732       1841
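The per-node user times in the sample can be used as a rough load-balance indicator. The Python sketch below is illustrative only: it hard-codes the user(s) column from the sample output, and the max-over-mean imbalance measure is a choice made here, not something system_stats reports.

```python
# user(s) column from the sample system_stats output, keyed by node number.
user_seconds = {0: 13.18, 1: 16.07, 2: 16.62, 3: 16.56}

mean_user = sum(user_seconds.values()) / len(user_seconds)
busiest = max(user_seconds, key=user_seconds.get)
least_busy = min(user_seconds, key=user_seconds.get)

# Ratio of the busiest node to the mean; 1.0 would be perfectly balanced.
imbalance = user_seconds[busiest] / mean_user

print(busiest, least_busy, f"{imbalance:.3f}")
```

For this sample, node 0 accumulates noticeably less user CPU time than the other three, which is the kind of asymmetry worth investigating further.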
IBM/SP - trace-mpi
• IBM Internal tool - Quantitative information on MPI calls
– module load USG ; module load trace-mpi
– Fortran - add $(TRACE_MPIF) to build
– C - add $(TRACE_MPI) to build
– poe ./a.out -procs {n} - generates mpi.trace_file for each process (executable must call MPI_Finalize)
– summary mpi.trace_file.{n} (see over)
• Useful check for load balance:
– grep "Total Communication" mpi.trace_file.*
MPI message-passing summary for mpi.trace_file.3
MPI Function      #calls   Avg Bytes   Time (sec)
-------------------------------------------------------------
MPI_Allreduce:      9355         8.0        3.596
MPI_Barrier:           3         0.0        0.017
MPI_Bcast:            66         5.8        0.013
MPI_Scatter:          31      1008.0        0.088
MPI_Comm_rank:         1         0.0        0.000
MPI_Comm_size:         1         0.0        0.000
MPI_Isend:         43023      2003.7        0.893
MPI_Recv:          43023      2003.7        7.481
MPI_Wait:          43023      2003.7        3.739
Total Communication Information: WALL = 15.8277, CPU = 15.53, MBYTES = 258.72
The total amount of wall time = 26.229613
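Two quick readings of such a summary are which MPI call dominates and what share of the run is spent communicating. The Python sketch below is illustrative only: it hard-codes the Time (sec) column and the two totals from the sample summary, and the variable names are hypothetical.

```python
# Time (sec) column from the sample summary for mpi.trace_file.3.
mpi_time = {
    "MPI_Allreduce": 3.596, "MPI_Barrier": 0.017, "MPI_Bcast": 0.013,
    "MPI_Scatter": 0.088, "MPI_Comm_rank": 0.000, "MPI_Comm_size": 0.000,
    "MPI_Isend": 0.893, "MPI_Recv": 7.481, "MPI_Wait": 3.739,
}
comm_wall = 15.8277      # WALL from "Total Communication Information"
total_wall = 26.229613   # "The total amount of wall time"

# Call with the largest accumulated time, and its share of communication time.
dominant = max(mpi_time, key=mpi_time.get)
dominant_share = mpi_time[dominant] / comm_wall

# Share of the whole run spent inside MPI calls.
comm_fraction = comm_wall / total_wall

print(dominant, f"{dominant_share:.0%}", f"{comm_fraction:.0%}")
```

For this process, MPI_Recv accounts for close to half of the communication time, and communication as a whole takes about 60% of the wall time. Comparing the same figures across all trace files gives the load-balance picture.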
Upcoming on the SP
• ACTS/TAU (C/C++)
– currently being ported to the IBM/SP
• VAMPIR
– has been ordered, awaiting delivery
• Performance Monitor Toolkit (HPM)
– should be available with Phase II system (requires AIX 4.3.4)
• Also, see Performance API project:
– http://icl.cs.utk.edu/projects/papi
General/UNIX Profiling Tools
• Command line profilers and system analysis
– prof/gprof (enabled for MPI on IBM/SP)
– csh time command : time ./a.out
– vmstat -> look for high paging over extended time period (application may require more memory)
• Fortran/C function timers
– getrusage
– rtc, irtc
– etime, dtime, mclock
– MPI_Wtime
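The timer functions listed are Fortran and C interfaces. The same pattern of bracketing a code section with a wall-clock timer and a CPU-time query can be sketched in Python, whose standard `resource` module wraps getrusage on Unix systems. This is an analogue for illustration, not the Fortran/C usage itself; `timed` and `work` are hypothetical names.

```python
import resource
import time

def timed(func, *args):
    """Run func(*args); return (result, wall seconds, user CPU seconds)."""
    wall_start = time.perf_counter()
    cpu_start = resource.getrusage(resource.RUSAGE_SELF).ru_utime
    result = func(*args)
    cpu_end = resource.getrusage(resource.RUSAGE_SELF).ru_utime
    wall_end = time.perf_counter()
    return result, wall_end - wall_start, cpu_end - cpu_start

def work(n):
    # Hypothetical stand-in for the code section being timed.
    return sum(i * i for i in range(n))

result, wall, cpu = timed(work, 200_000)
print(f"wall={wall:.4f}s user_cpu={cpu:.4f}s")
```

A wall time much larger than the user CPU time for a section suggests the code is waiting (on I/O, memory or communication) rather than computing.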
Reference Material
• NERSC web pages
– http://hpcf.nersc.gov/software/tools
• Cray PVP/Cray T3E
– http://www.cray.com/swpubs
– Optimizing Code on Cray PVP Systems
– Cray T3E C, Fortran Optimization Guides
• IBM/SP
– LLNL Workshop on Performance Tools