PAPI: The Performance Application Programming Interface
Kevin London [email protected]
Nathan Garner [email protected]
2
Purpose
The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
3
Motivation
• To leverage existing and future performance tool development
• To increase application and system performance
• To characterize application and system workload
• To stimulate run-time optimization research
4
Goals
• Provide a solid foundation for cross platform performance analysis tools.
• Loose standardization between vendors, academics and users.
• Provide a number of implementations for HPC architectures.
• Well documented, easy to use.
5
Why PAPI is needed
• No common performance tools except prof and gprof.
• Most commercial tools are based on time.
• HPC workloads are memory- and floating-point-intensive and require good instruction scheduling (pipelining).
6
Implementation
• Supports native events and 103 “preset” events: commonly available metrics, some of which are derived.
• Query to see if a preset exists
• Fully programmable, thread safe, low level interface directed towards the tool developer and the sophisticated user
• The EventSet is the underlying abstraction
• Hardware events are used in conjunction with one another to provide meaningful information.
7
PAPI Presets
Test case 8: Available events and hardware information.
-------------------------------------------------------------------------
Vendor string and code   : GenuineIntel (-1)
Model string and code    : Celeron (Mendocino) (6)
CPU revision             : 10.000000
CPU Megahertz            : 366.504944
-------------------------------------------------------------------------
Name         Code        Avail  Deriv  Description (Note)
PAPI_L1_DCM  0x80000000  Yes    No     Level 1 data cache misses
PAPI_L1_ICM  0x80000001  Yes    No     Level 1 instruction cache misses
PAPI_L2_DCM  0x80000002  No     No     Level 2 data cache misses
PAPI_L2_ICM  0x80000003  No     No     Level 2 instruction cache misses
PAPI_L3_DCM  0x80000004  No     No     Level 3 data cache misses
PAPI_L3_ICM  0x80000005  No     No     Level 3 instruction cache misses
PAPI_L1_TCM  0x80000006  Yes    Yes    Level 1 cache misses
PAPI_L2_TCM  0x80000007  Yes    No     Level 2 cache misses
PAPI_L3_TCM  0x80000008  No     No     Level 3 cache misses
PAPI_CA_SNP  0x80000009  No     No     Requests for a snoop
PAPI_CA_SHR  0x8000000a  No     No     Requests for shared cache line
PAPI_CA_CLN  0x8000000b  No     No     Requests for clean cache line
PAPI_CA_INV  0x8000000c  No     No     Requests for cache line inv.
...
8
PAPI High Level API
• The PAPI high-level API is meant for application programmers who want coarse-grained measurements.
• Not tuned for efficiency
• Calls the lower level API.
• Not thread safe. (may change)
• Only allows PAPI Presets. (may change)
9
PAPI High Level Functions
PAPI_num_counters()
PAPI_start_counters()
PAPI_stop_counters()
PAPI_read_counters()
10
Implementation
PAPI contains functions to:
• Obtain accurate time.
• Obtain information about the executable and the hardware.
• Register callbacks on counter overflow of a user threshold.
• Provide an SVR4-compatible profil() call that uses hardware counters.
11
Implementation
12
49 PAPI Functions
PAPI_accum, PAPI_add_event, PAPI_add_events, PAPI_add_pevent,
PAPI_cleanup_eventset, PAPI_create_eventset, PAPI_create_eventset_r,
PAPI_destroy_eventset, PAPI_get_executable_info, PAPI_get_hardware_info,
PAPI_get_opt, PAPI_get_overflow_address, PAPI_get_real_cyc,
PAPI_get_real_usec, PAPI_get_virt_cyc, PAPI_get_virt_usec,
PAPI_library_init, PAPI_thread_init, PAPI_list_events, PAPI_lock,
PAPI_num_counters, PAPI_overflow, PAPI_perror, PAPI_profil,
PAPI_query_all_events_verbose, PAPI_query_event, PAPI_query_event_verbose,
PAPI_read, PAPI_read_counters, PAPI_rem_event, PAPI_rem_events,
PAPI_reset, PAPI_restore, PAPI_save, PAPI_set_debug,
PAPI_set_domain, PAPI_set_granularity, PAPI_set_opt, PAPI_shutdown,
PAPI_start, PAPI_start_counters, PAPI_state, PAPI_stop,
PAPI_stop_counters, PAPI_unlock, PAPI_write
13
#include "fpapi.h"
      program fmatrixlowpapi
** USER DECLARATIONS **
      call PAPIf_library_init( check )
      call PAPIf_thread_init( handle, handle, check )
      call PAPIf_num_counters( numevents )
      print *, 'number of hardware counters supported: ', numevents
      call PAPIf_add_event(EventSet,PAPI_FLOPS,check)
      call PAPIf_add_event(EventSet,PAPI_L1_TCM,check)
      call PAPIf_add_event(EventSet,PAPI_L2_TCM,check)
      call PAPIf_get_hardware_info( ncpu, nnodes, totalcpus, vendor,
     .     vstring, model, mstring, revision, mhz )
      print *, 'A', totalcpus, ' CPU ', mstring, ' at', mhz, 'Mhz.'
      print *, ncpu, nnodes, totalcpus, vendor, vstring, model,
     .     mstring, revision, mhz
      call PAPIf_get_real_usec( starttime )
      call PAPIf_start( EventSet, check )
** USER CODE **
14
      call PAPIf_stop(EventSet,values,check)
      call PAPIf_get_real_usec( stoptime )
      finaltime = (stoptime/1000000.0) - (starttime/1000000.0)
      print *, 'Time: ', finaltime
      print *, 'FLOPS: ', values(1)
      print *, 'Total Level 1 Data cache misses: ', values(2)
      print *, 'Total Level 2 Data cache misses: ', values(3)
      return
      end
15
number of hardware counters supported: 32
A 2 CPU R12000 at 270.0000 Mhz.
MIPS 30 R12000 2.300000 270.0000
Time: 1.547424316406250
FLOPS: 4258753
Total Level 1 Data cache misses: 1539918
Total Level 2 Data cache misses: 6936
16
Threads and PAPI
• PAPI must be able to support both explicit (library) and implicit (compiler) threading models.
• However, this can only happen if the threads are ‘bound’.
• A ‘bound’ thread is one that has a scheduling entity known and handled by the OS kernel.
17
The 1.0 Release
• Platforms
  • Linux/x86
  • Solaris/Ultra
  • AIX/Power
  • Tru64/Alpha
  • IRIX/MIPS
• Fortran wrappers
• Thread support
• Remote CVS access
• Updated Web Site
• Documentation
• Tool integration
18
UTK Tools
• Perfometer
  • Real time trace based visualization of metrics at the subroutine level. (Java/Swing)
• Profometer (planned)
  • Real time sample based visualization at the line level. (Java/Swing)
• Hwprof (planned)
  • Back end to generate performance data to be fed into the above tools. Possible integration with DynInst.
19
Perfometer Features
• Platform independent visualization of PAPI metrics
• Graphical display may run remotely, freeing the compute node of the drawing overhead
• Flexible interface (internal drawing classes are reused for other tools)
• Quick interpretation of complex results
• Color coding to highlight selected procedures
20
Perfometer Screenshot
21
Perfometer Usage
• Application is instrumented with a single call to perfometer()
• Sections of code that are of interest can be distinguished in the graph with specific colors using a call to mark_perfometer(COLOR)
• #include "papicolorcodes.h"
• call perfometer
• call mark_perfometer(RED)
22
Perfometer Future Development
• Allow runtime selection of multiple PAPI metrics for simultaneous display
• Integration with Dyninst to eliminate need for recompiling user codes
• Dump trace data to file for post-mortem study
• Additional graph display types
23
Profometer Features
• Visual representation of the quantity of a given metric spent in a particular code segment
• Color coding of user selected code segments
• Zoom in and out to emphasize sections of interest
• Reuse of the Perfometer engine
24
Profometer Screenshot
Profometer – Histogram of a given metric per code segment
25
Profometer Future Development
• Run time modification of metric being monitored
• Hooks into debugging interface to allow GDB style interaction with source code
26
UTK hwprof Screenshot
rusage                      child   rusage                        child
==========================  =====   ============================  =====
user time sec               1.000   num of swap operations        0
sys time sec                0.010   block input operations        0
real time sec               1.010   block output operations       0
maximum resident set size   0       messages sent                 0
(ru_ixrss) currently null   0       messages received             0
integral resident set size  0       signals received              0
(ru_ixrss) currently null   0       voluntary context switches    0
page faults without I/O     29      involuntary context switches  0
page faults with I/O        78

local platform
============================
num hw counters: 3
clock tick: 100 Hz
PAPI clock rate: 199.00 MHz
PAPI cycle time: 0.00502513 usec/cycle
CPU name for this node: redwood.cs.utk.edu

PAPI counts
======================
PAPI_TOT_CYC: 4419
PAPI_INT_INS: 4451
PAPI_TOT_INS: 102034
Other Tools using PAPI
28
U. Illinois: SvPablo
• Source code instrumentation based profiling of F77, F90, C and C++.
• Color coded key next to source code indicating severity of metric.
• MPI aware.
• Statistics at the function, loop and line level.
29
U. Illinois: SvPablo
30
U. Oregon: TAU
• Source code based instrumentation of C, C++, F77, F90, HPF and pC++.
• Maintains a program database in which to store and localize performance data.
• Multiple lightweight tools and a launcher
  • Including call graph/control flow browser, a class browser, a remote debugger, MPI trace analysis and a profiler.
• Integrated with PAPI.
31
TAU: Racy/PAPI
32
TAU: Racy
33
Visual Profiler: vprof
• Developed by Curtis Janssen at Sandia Livermore
• Creates and visualizes line level execution profiles obtained with PC-sampling.
• Data usually generated with the profil()/monitor() library/system call or done by hand with interval timers and signal information.
• Ported to use PAPI_profil() in a day.
34
Sandia Livermore: vprof
35
Pacific Sierra Research DEEP/MPI
• Source code instrumentation based profiling at the basic block level (regions of code with one entry and one exit, on the order of tens of instructions).
• Comprehensive visualization and analysis.
• Integrated source code browser with highlighting.
• Works now with MPI, soon with OpenMP.
• Integrated with PAPI.
36
Pacific Sierra Research DEEP/MPI
Web Resources
• Mailing list
  • send “subscribe ptools-perfapi” to [email protected]
  • [email protected] is the reflector
• Web page
  • http://icl.cs.utk.edu/projects/papi
• Post RISC paper by Richard Enbody et al.
  • http://www.cps.msu.edu/~crs/cps920/
38
Web Resources 2
• PCL: http://www.fz-juelich.de/zam/PCL/
• Vprof: http://aros.ca.sandia.gov/~cljanss/perf/vprof/
• Paradyn: http://www.cs.wisc.edu/paradyn/libhrtime/
• DynInst: http://www.cs.umd.edu/projects/dyninstAPI/
• Libhrtime: http://www.cs.wisc.edu/paradyn/libhrtime/
• TAU: http://www.cs.uoregon.edu/research/paracomp/tau/
• SvPablo: http://www-pablo.cs.uiuc.edu/Project/SVPablo/SvPabloOverview.htm
39
The Future
• x86/Alpha Linux kernel
  • Implementation under /proc
  • Merge with libhrtime patch from U. Wisc
• Support for signal dispatch on hardware counter overflow
• Support for 21064, HP PA 8000, Cray Inc. SV, IBM P2SC
40
Source Code Access
• Every 24 hours, snapshot of source tree at: http://icl.cs.utk.edu/projects/papi/snapshot.cgi
• Remote read-only access to the CVS source tree:
  (csh) setenv CVSROOT [email protected]:/cvs/homes/papi
  (sh)  export CVSROOT=[email protected]:/cvs/homes/papi
  cvs login
  password: <cr>
  cvs checkout papi or cvs update
  cd papi/src
  make -f Makefile.<arch>
  cvs logout
41
The Future
• Dynamic Instrumentation of Running Applications via Dyninst
• Support of gathering performance data of Applications using MPI
• Support for 21064, HP PA 8000, Cray Inc. SV, IBM P2SC