40
Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3 Philip Mucci mucci @ cs . utk . edu Shirley Moore shirley@ cs . utk . edu Daniel Terpstra [email protected]

Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

  • Upload
    cosmo

  • View
    34

  • Download
    4

Embed Size (px)

DESCRIPTION

Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3. Philip Mucci m [email protected] Shirley Moore [email protected] Daniel Terpstra [email protected]. Hardware Counters. - PowerPoint PPT Presentation

Citation preview

Page 2: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN2

Hardware Counters

• Small set of registers that count events, which are occurrences of specific signals related to the processor’s function

• Monitoring these events facilitates correlation between the structure of the source/object code and the efficiency of the mapping of that code to the underlying architecture.

Page 3: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN3

Goals of PAPI

• Solid foundation for cross platform performance analysis tools

• Free tool developers from re-implementing counter access

• Standardization between vendors, academics and users

• Encourage vendors to provide hardware and OS support for counter access

• Reference implementations for a number of HPC architectures

• Well documented and easy to use

Page 4: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN4

Overview of PAPI

• Performance Application Programming Interface

• The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.

• Parallel Tools Consortium project http://www.ptools.org/

Page 5: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN5

PAPI Counter Interfaces

• PAPI provides three interfaces to the underlying counter hardware: 1. The low level interface manages hardware

events in user defined groups called EventSets.

2. The high level interface simply provides the ability to start, stop and read the counters for a specified list of events.

3. Graphical tools to visualize information.

Page 6: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN6

PAPI Implementation

Tools!!!

PAPI Low LevelPAPI High Level

Hardware Performance Counter

Operating System

Kernel Extension

PAPI Machine Dependent SubstrateMachine

SpecificLayer

PortableLayer

Page 7: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN7

PAPI Preset Events

• Proposed standard set of events deemed most relevant for application performance tuning

• Defined in papiStdEventDefs.h• Mapped to native events on a

given platform– Run tests/avail to see list of PAPI

preset events available on a platform

Page 8: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN8

High-level Interface

• Meant for application programmers wanting coarse-grained measurements

• Not thread safe• Calls the lower level API• Allows only PAPI preset events• Easier to use and less setup

(additional code) than low-level

Page 9: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN9

High-level API

• C interfacePAPI_start_countersPAPI_read_countersPAPI_stop_countersPAPI_accum_countersPAPI_num_countersPAPI_flops

• Fortran interfacePAPIF_start_countersPAPIF_read_countersPAPIF_stop_countersPAPIF_accum_countersPAPIF_num_countersPAPIF_flops

Page 10: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN10

Using the High-level API

• Int PAPI_num_counters(void)

– Initializes PAPI (if needed)

– Returns number of hardware counters

• int PAPI_start_counters(int *events, int len)

– Initializes PAPI (if needed)

– Sets up an event set with the given counters

– Starts counting in the event set

• int PAPI_library_init(int version)

– Low-level routine implicitly called by above

Page 11: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN11

Controlling the Counters

• PAPI_stop_counters(long_long *vals, int alen)– Stop counters and put counter values in array

• PAPI_accum_counters(long_long *vals, int alen)

– Accumulate counters into array and reset

• PAPI_read_counters(long_long *vals, int alen)– Copy counter values into array and reset counters

• PAPI_flops(float *rtime, float *ptime, long_long *flpins, float *mflops)– Wallclock time, process time, FP ins since start,– Mflop/s since last call

Page 12: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN12

PAPI_flops

• int PAPI_flops(float *real_time, float *proc_time, long_long *flpins, float *mflops)– Only two calls needed, PAPI_flops before and after

the code you want to monitor– real_time is the wall-clocktime between the two calls– proc_time is the “virtual” time or time the process

was actually executing between the two calls (not as fine grained as real_time but better for longer measurements)

– flpins is the total floating point instructions executed between the two calls

– mflops is the Mflop/s rating between the two calls

Page 13: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN13

PAPI High-level Example

long long values[NUM_EVENTS]; unsigned int

Events[NUM_EVENTS]={PAPI_TOT_INS,PAPI_TOT_CYC}; /* Start the counters */ PAPI_start_counters((int*)Events,NUM_EVENTS); /* What we are monitoring? */ do_work(); /* Stop the counters and store the results in values */ retval = PAPI_stop_counters(values,NUM_EVENTS);

Page 14: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN14

Name Description PAPI_OK No error PAPI_EINVAL Invalid argument PAPI_ENOMEM Insufficient memory PAPI_ESYS A system/C library call failed. Check errno variable PAPI_ESBSTR Substrate returned an error. E.g. unimplemented feature PAPI_ECLOST Access to the counters was lost or interrupted PAPI_EBUG Internal error PAPI_ENOEVNT Hardware event does not exist PAPI_ECNFLCT Hardware event exists, but resources are exhausted PAPI_ENOTRUN Event or envent set is currently counting PAPI_EISRUN Events or event set is currently running PAPI_ENOEVST No event set available PAPI_ENOTPRESET Argument is not a preset PAPI_ENOCNTR Hardware does not support counters PAPI_EMISC Any other error occured

Return Codes

Page 15: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN15

Low-level Interface

• Increased efficiency and functionality over the high level PAPI interface

• About 40 functions• Obtain information about the

executable and the hardware• Thread-safe• Fully programmable• Callbacks on counter overflow

Page 16: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN16

Low-level Functionality

• Library initializationPAPI_library_init, PAPI_thread_init,

PAPI_shutdown• Timing functions

PAPI_get_real_usec, PAPI_get_virt_usecPAPI_get_real_cyc, PAPI_get_virt_cyc

• Inquiry functions• Management functions• Simple lock

PAPI_lock/PAPI_unlock

Page 17: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN17

Event sets

• The event set contains key information– What low-level hardware counters to use

– Most recently read counter values

– The state of the event set (running/not running)

– Option settings (e.g., domain, granularity, overflow, profiling)

• Event sets can overlap if they map to the same hardware counter set-up.– Allows inclusive/exclusive measurements

Page 18: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN18

Event set operations

• Event set managementPAPI_create_eventset, PAPI_add_event[s], PAPI_rem_event[s], PAPI_destroy_eventset

• Event set controlPAPI_start, PAPI_stop, PAPI_read, PAPI_accum

• Event set inquiryPAPI_query_event, PAPI_list_events,...

Page 19: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN19

Simple example

#include "papi.h“#define NUM_EVENTS 2int Events[NUM_EVENTS]={PAPI_FP_INS,PAPI_TOT_CYC}, EventSet;

long_long values[NUM_EVENTS];/* Initialize the Library */retval = PAPI_library_init(PAPI_VER_CURRENT);/* Allocate space for the new eventset and do setup */retval = PAPI_create_eventset(&EventSet);/* Add Flops and total cycles to the eventset */retval = PAPI_add_events(&EventSet,Events,NUM_EVENTS);/* Start the counters */retval = PAPI_start(EventSet);

do_work(); /* What we want to monitor*/

/*Stop counters and store results in values */retval = PAPI_stop(EventSet,values);

Page 20: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN20

Overlapping counters

retval = PAPI_start(InclEventSet);retval = PAPI_start(OthersEventSet);

...

retval = PAPI_reset(OthersEventSet);do_flops(NUM_FLOPS); /* Function call */retval = PAPI_accum(OthersEventSet,Othersvalues);

...

retval = PAPI_stop(InclEventSet,Inclvalues);printf("Counts: %12lld %12lld\n",

Inclvalues[0], Inclvalues[0]-Othersvalues[0]);

Page 21: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN21

Counter Domains

• int PAPI_set_domain(int domain);– PAPI_DOM_USER User context counted

– PAPI_DOM_KERNEL Kernel/OS context counted

– PAPI_DOM_OTHER Exception/transient mode

– PAPI_DOM_ALL All contexts counted

– PAPI_DOM_MIN The smallest available context

– PAPI_DOM_MAX The largest available context

• Requires OS support, all domains not available on all platforms

• Default is PAPI_DOM_USER

Page 22: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN22

Using PAPI with Threads

• After PAPI_library_init need to register unique thread identifier function

• For Pthreads

retval=PAPI_thread_init(pthread_self, 0);

• OpenMP

retval=PAPI_thread_init(omp_get_thread_num, 0);

• Each thread responsible for creation, start, stop and read of its own counters

Page 23: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN23

Using PAPI with Multiplexing

• Multiplexing allows simultaneous use of more counters than are supported by the hardware.

• Some platforms support multiplexing in the OS (e.g., SGI IRIX); on those that don’t PAPI implements multiplexing in software.

• The more events you multiplex and the shorter the amount of time, the more likely the representation is not correct.

• See John May’s excellent whitepaper

Page 24: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN24

Issues with Multiplexing

• PAPI_multiplex_init() – should be called after

PAPI_library_init() to initialize multiplexing

• PAPI_set_multiplex( int *EventSet ); – Used after the eventset is created to

turn on multiplexing for that eventset

• Then use PAPI like normal

Page 25: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN25

Native Events

• An event countable by the CPU can be counted even if there is no matching preset PAPI event

• Same interface as when setting up a preset event, but a CPU-specific bit pattern is used instead of the PAPI event definition

• /usr/lpp/pmtoolkit/lib/POWER3-II.evs

Page 26: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN26

Power3 Native Event Example

native = (35 << 8) | 1; /* FPU1CMPL */

PAPI_add_event(&EventSet,native)

See for list of native events.

The encoding for native events is:

Lower 8 bits indicate which counter number: 0 – 7

Bits 8-16 indicate which event number: 0-50

Page 27: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN27

Power 3 General Events

1. INST_CMPL (1)2. FPU1_CMPL (35)3. LD_MISS_L1 (5)4. LD_CMPL (5)5. FPU0_CMPL (5)6. CYC (12)7. FMA (9)8. TLB (0)

Page 28: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN28

Power 3 False Sharing Events

1. L1_M_TO_E_OR_S (25)2. E_TO_S (23)3. L2_E_OR_S_TO_I (21)4. L2_M_TO_E_OR_S (27)5. L2_M_TO_I (10)6. PUSH_INT (8)7. 0INST_DISP (6)8. TLB (0)

Page 29: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN29

Callbacks on Counter Overflow

• PAPI provides the ability to call user-defined handlers when a specified event exceeds a specified threshold.

• For systems that do not support counter overflow at the OS level, PAPI sets up a high resolution interval timer and installs a timer interrupt handler.

Page 30: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN30

PAPI_overflow

• int PAPI_overflow(int EventSet, int EventCode, int threshold, int flags, PAPI_overflow_handler_t handler)

• Sets up an EventSet such that when it is PAPI_start()’d, it begins to register overflows

• The EventSet may contain multiple events, but only one may be an overflow trigger.

Page 31: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN31

Statistical Profiling

• PAPI provides support for execution profiling based on any counter event.

• PAPI_profil() creates a histogram by text address of overflow counts for a specified region of the application code.

• Used in vprof tool from Sandia Livermore

Page 32: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN32

PAPI Release Platforms

• Linux/x86, Windows 2000– Requires patch to Linux kernel, driver for

Windows

• Linux/IA-64• Sun Solaris 2.8/Ultra I/II• IBM AIX 4.3+/Power

– Contact IBM for pmtoolkit

• SGI IRIX/MIPS• Compaq Tru64/Alpha Ev6 & Ev67

• Requires OS device driver from Compaq

• Cray T3E/Unicos

Page 33: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN33

For More Information

• http://icl.cs.utk.edu/projects/papi/– Software and documentation– Reference materials– Papers and presentations– Third-party tools– Mailing lists

Page 34: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN34

PERC: Performance Evaluation Research Center

• Developing a Developing a sciencescience for understanding for understanding

performance of scientific applications on high-end performance of scientific applications on high-end

computer systems. computer systems.

• Developing Developing engineeringengineering strategies for improving strategies for improving

performance on these systems. performance on these systems.

• DOE Labs: ANL, LBNL, LLNL, ORNLDOE Labs: ANL, LBNL, LLNL, ORNL

• Universities: UCSD, UI-UC, UMD, UTKUniversities: UCSD, UI-UC, UMD, UTK

• Funded by SciDAC: Scientific Discovery through Funded by SciDAC: Scientific Discovery through

Advanced ComputingAdvanced Computing

Page 35: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN35

PERC: Real-World Applications

• High Energy and Nuclear PhysicsHigh Energy and Nuclear Physics– Shedding New Light on Exploding Stars: Terascale Simulations of Shedding New Light on Exploding Stars: Terascale Simulations of

Neutrino-Driven SuperNovae and Their NucleoSynthesisNeutrino-Driven SuperNovae and Their NucleoSynthesis– Advanced Computing for 21st Century Accelerator Science and Advanced Computing for 21st Century Accelerator Science and

TechnologyTechnology

• Biology and Environmental ResearchBiology and Environmental Research– Collaborative Design and Development of the Community Climate Collaborative Design and Development of the Community Climate

System Model for Terascale ComputersSystem Model for Terascale Computers

• Fusion Energy SciencesFusion Energy Sciences– Numerical Computation of Wave-Plasma Interactions in Multi-Numerical Computation of Wave-Plasma Interactions in Multi-

dimensional Systemsdimensional Systems

• Advanced Scientific ComputingAdvanced Scientific Computing– Terascale Optimal PDE Solvers (TOPS)Terascale Optimal PDE Solvers (TOPS)– Applied Partial Differential Equations Center (APDEC)Applied Partial Differential Equations Center (APDEC)– Scientific Data Management (SDM)Scientific Data Management (SDM)

• Chemical SciencesChemical Sciences– Accurate Properties for Open-Shell States of Large MoleculesAccurate Properties for Open-Shell States of Large Molecules

• ……and more…and more…

Page 36: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN36

Parallel Climate Transition Model

• Components for Ocean, Atmosphere, Sea Ice, Land Surface and River Transport

• Developed by Warren Washington’s group at NCAR

• POP: Parallel Ocean Program from LANL

• CCM3: Community Climate Model 3.2 from NCAR including LSM: Land Surface Model

• ICE: CICE from LANL and CCSM from NCAR

• RTM: River Transport Module from UT Austin

Page 37: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN37

PCTM: The Code

• 132K lines of code

• Mostly Fortran 90 with some Fortran 77 and C

• Each component runs sequentially

• Each component is parallel

• The overall model must run as pure MPI

• Code is already instrumented with timing library in flux coupler

• Add additional PAPI code to measure 8 hardware metrics for each timer

Page 38: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN38

PCTM: Parallel Climate Transition Model

Flux CouplerLand

SurfaceModel

OceanModel Atmosphere

Model

Sea Ice Model

Sequential Executionof Parallelized Modules

RiverModel

Page 39: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN39

PCTM: Performance Questions

• Are we cache, functional unit or bound?

• How does shared memory MPI affect performance?

• How many tasks on a 4-way node?

Page 40: Performance Analysis using PAPI and Hardware Performance Counters on the IBM Power3

October 11,2001ScicomP4

Knoxville, TN40

PCTM: Answers

• No answers yet… Wait for SC2001– Measure coarse regions of code for

• Instructions per Cycle• Stall Cycles• 0 Instruction Dispatch Cycles

– Measure those regions with multiplexing and inspect code

– Statistically profile candidates for improvement, line by line