22
Perfmon2: a flexible performance monitoring interface for Linux Stéphane Eranian HP Labs [email protected] Abstract Monitoring program execution is becoming more than ever key to achieving world-class performance. A generic, flexible, and yet pow- erful monitoring interface to access the perfor- mance counters of modern processors has been designed. This interface allows performance tools to collect simple counts or profiles on a per kernel thread or system-wide basis. It in- troduces several innovations such as customiz- able sampling buffer formats, time or overflow- based multiplexing of event sets. The cur- rent implementation for the 2.6 kernel supports all the major processor architectures. Several open-source and commercial tools based on in- terface are available. We are currently working on getting the interface accepted into the main- line kernel. This paper presents an overview of the interface. 1 Introduction Performance monitoring is the action of col- lecting information about the execution of a program. The type of information collected de- pends on the level at which it is collected. We distinguish two levels: the program level: the program is instru- mented by adding explicit calls to routines that collect certain metrics. Instrumenta- tion can be inserted by the programmer or the compiler, e.g., the -pg option of GNU cc. Tools such as HP Caliper [5] or Intel PIN [17] can also instrument at runtime. With those tools, it is possible to collect, for instance, the number of times a func- tion is called, the number of time a basic block is entered, a call graph, or a memory access trace. the hardware level: the program is not modified. The information is collected by the CPU hardware and stored in perfor- mance counters. They can be exploited by tools such as OProfile and VTUNE on Linux. The counters measure the micro- architectural behavior of the program, i.e., the number of elapsed cycles, how many data cache stalls, how many TLB misses. When analyzing the performance of a pro- gram, a user must answer two simple ques- tions: where is time spent and why is spent time there? Program-level monitoring can, in many situations and with some high overhead, answer the first, but the second question is best solved with hardware-level monitoring. For instance, gprof can tell you that a program spends 20% of its time in one function. The difficulty is

Perfmon2: a flexible performance monitoring interface for

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Perfmon2: a flexible performance monitoring interface for

Perfmon2: a flexible performance monitoring interfacefor Linux

Stéphane EranianHP Labs

[email protected]

Abstract

Monitoring program execution is becomingmore than ever key to achieving world-classperformance. A generic, flexible, and yet pow-erful monitoring interface to access the perfor-mance counters of modern processors has beendesigned. This interface allows performancetools to collect simple counts or profiles on aper kernel thread or system-wide basis. It in-troduces several innovations such as customiz-able sampling buffer formats, time or overflow-based multiplexing of event sets. The cur-rent implementation for the 2.6 kernel supportsall the major processor architectures. Severalopen-source and commercial tools based on in-terface are available. We are currently workingon getting the interface accepted into the main-line kernel. This paper presents an overview ofthe interface.

1 Introduction

Performance monitoring is the action of col-lecting information about the execution of aprogram. The type of information collected de-pends on the level at which it is collected. Wedistinguish two levels:

• the program level: the program is instru-mented by adding explicit calls to routinesthat collect certain metrics. Instrumenta-tion can be inserted by the programmer orthe compiler, e.g., the -pg option of GNUcc. Tools such as HP Caliper [5] or IntelPIN [17] can also instrument at runtime.With those tools, it is possible to collect,for instance, the number of times a func-tion is called, the number of time a basicblock is entered, a call graph, or a memoryaccess trace.

• the hardware level: the program is notmodified. The information is collected bythe CPU hardware and stored in perfor-mance counters. They can be exploitedby tools such as OProfile and VTUNE onLinux. The counters measure the micro-architectural behavior of the program, i.e.,the number of elapsed cycles, how manydata cache stalls, how many TLB misses.

When analyzing the performance of a pro-gram, a user must answer two simple ques-tions: where is time spent and why is spent timethere? Program-level monitoring can, in manysituations and with some high overhead, answerthe first, but the second question is best solvedwith hardware-level monitoring. For instance,gprof can tell you that a program spends 20%of its time in one function. The difficulty is

Page 2: Perfmon2: a flexible performance monitoring interface for

270 • Perfmon2: a flexible performance monitoring interface for Linux

to know why. Is this because the function iscalled a lot? Is this due to algorithmic prob-lems? Is it because the processor stalls? If so,what is causing the stalls? As this simple ex-ample shows, the two levels of monitoring canbe complementary.

The current CPU hardware trends are increas-ing the need for powerful hardware monitor-ing. New hardware features present the op-portunity to gain considerable performance im-provements through software changes. To ben-efit from a multi-threaded CPU, for instance,a program must become multi-threaded itself.To run well on a NUMA machine, a programmust be aware of the topology of the machineto adjust memory allocations and thread affinityto minimize the number of remote memory ac-cesses. On the Itanium [3] processor architec-ture, the quality of the code produced by com-pilers is a big factor in the overall performanceof a program, i.e, the compiler must extract theparallelism of the program to take advantage ofthe hardware.

Hardware-based performance monitoring canhelp pinpoint problems in how software usesthose new hardware features. An operating sys-tem scheduler can benefit from cache profilesto optimize placement of threads to avoidingcache thrashing in multi-threaded CPUs. Staticcompilers can use performance profiles to im-prove code quality, a technique called Profile-Guided Optimization (PGO). Dynamic compil-ers, in Managed Runtime Environments (MRE)can also apply the same technique. Profile-Guided Optimizations can also be applied di-rectly to a binary by tools such as iSpike [11].In virtualized environments, such as Xen [14],system managers can also use monitoring infor-mation to guide load balancing. Developers canalso use this information to optimize the lay-out of data structures, improve data prefetch-ing, analyze code paths [13]. Performance pro-files can also be used to drive future hardware

requirements such as cache sizes, cache laten-cies, or bus bandwidth.

Hardware performance counters are logicallyimplemented by the Performance MonitoringUnit (PMU) of the CPU. By nature, this is afairly complex piece of hardware distributed allacross the chip to collect information about keycomponents such as the pipeline, the caches,the CPU buses. The PMU is, by nature, veryspecific to each processor implementation, e.g.,the Pentium M and Pentium 4 PMUs [9] havenot much in common. The Itanium proces-sor architecture specifies the framework withinwhich the PMU must be implemented whichhelps develop portable software.

One of the difficulties to standardize on a per-formance monitoring interface is to ensure thatit supports all existing and future PMU mod-els without preventing access to some of theirmodel specific features. Indeed, some models,such as the Itanium 2 PMU [8], go beyond justcounting events, they can also capture branchtraces, where cache misses occur, or filter onopcodes.

In Linux and across all architectures, the wealthof information provided by the PMU is often-times under-exploited because a lack of a flex-ible and standardized interface on which toolscan be developed.

In this paper, we give an overview of perfmon2,an interface designed to solve this problem forall major architectures. We begin by reviewingwhat Linux offers today. Then, we describe thevarious key features of this new interface. Weconclude with the current status and a short de-scription of the existing tools.

2 Existing interfaces

The problem with performance monitoring inLinux is not the lack of interface, but rather the

Page 3: Perfmon2: a flexible performance monitoring interface for

2006 Linux Symposium, Volume One • 271

multitude of interfaces. There are at least threeinterfaces:

• OProfile [16]: it is designed for DCPI-style [15] system-wide profiling. It is sup-ported on all major architectures and is en-abled by major Linux distributions. It cangenerate a flat profile and a call graph perprogram. It comes with its own tool set,such as opcontrol. Prospect [18] is an-other tool using this interface.

• perfctr [12]: it supports per-kernel-threadand system-wide monitoring for most ma-jor processor architectures, except for Ita-nium. It is distributed as a stand-alone ker-nel patch. The interface is mostly used bytools built on top of the PAPI [19] perfor-mance toolkit.

• VTUNE [10]: the Intel VTUNE perfor-mance analyzer comes with its own kernelinterface, implemented by an open-sourcedriver. The interface supports system-wide monitoring only and is very specificto the needs of the tool.

All these interfaces have been designed with aspecific measurement or tool in mind. As such,their design is somewhat limited in scope, i.e.,they typically do one thing very well. For in-stance, it is not possible to use OProfile to countthe number of retired instructions in a thread.The perfctr interface is the closest match towhat we would like to build, yet it has someshortcomings. It is very well designed andtuned for self-monitoring programs but sam-pling support is limited, especially for non self-monitoring configurations.

With the current situation, it is not necessarilyeasy for developers to figure out how to writeor port their tools. There is a question of func-tionalities of each interfaces and then, a ques-tion of distributions, i.e., which interface ships

with which distribution. We believe this situa-tion does not make it attractive for developersto build modern tools on Linux. In fact, Linuxis lagging in this area compared to commercialoperating systems.

3 Design choices

First of all, it is important to understand why akernel interface is needed. A PMU is accessi-ble through a set of registers. Typically thoseregisters are only accessible, at least for writ-ing, at the highest privilege level of execution(pl0 or ring0) which is where only the kernelexecutes. Furthermore, a PMU can trigger in-terrupts which need kernel support before theycan be converted into a notification to a user-level application such as a signal, for instance.For those reasons, the kernel needs to providean interface to access the PMU.

The goal of our work is to solve the hardware-based monitoring interface problem by design-ing a single, generic, and flexible interface thatsupports all major processor architectures. Thenew interface is built from scratch and intro-duces several innovations. At the same time,we recognize the value of certain features ofthe other interfaces and we try to integrate themwherever possible.

The interface is designed to be built into thekernel. This is the key for developers, as itensures that the interface will be available andsupported in all distributions.

To the extent possible, the interface must al-low existing monitoring tools to be ported with-out many difficulties. This is useful to ensureundisrupted availability of popular tools suchas VTUNE or OProfile, for instance.

The interface is designed from the bottom up,first looking at what the various processors pro-

Page 4: Perfmon2: a flexible performance monitoring interface for

272 • Perfmon2: a flexible performance monitoring interface for Linux

vide and building up an operating system in-terface to access the performance counters in auniform fashion. Thus, the interface is not de-signed for a specific measurement or tool.

There is efficient support for per-thread mon-itoring where performance information is col-lected on a kernel thread basis, the PMU state issaved and restored on context switch. There isalso support for system-wide monitoring whereall threads running on a CPU are monitored andthe PMU state persists across context switches.

In either mode, it is possible to collect simplecounts or profiles. Neither applications nor theLinux kernel need special compilation to en-able monitoring. In per-thread mode, it is pos-sible to monitor unmodified programs or multi-threaded programs. A monitoring session canbe dynamically attached and detached from arunning thread. Self-monitoring is supportedfor both counting and profiling.

The interface is available to regular users andnot just system administrators. This is espe-cially important for per-thread measurements.As a consequence, it is not possible to assumethat tools are necessarily well-behaved and theinterface must prevent malicious usage.

The interface provides a uniform set of fea-tures across platforms to maximize code re-usein performance tools. Measurement limitationsare mandated by the PMU hardware not thesoftware interface. For instance, if a PMU doesnot capture where cache misses occur, thereis nothing the interface nor its implementationcan do about it.

The interface must be extensible because wewant to support a variety of tools on very dif-ferent hardware platforms.

4 Core Interface

The interface leverages a common property ofall PMU models which is that the hardware in-terface always consists of a set of configura-tion registers, that we call PMC (PerformanceMonitor Configuration), and a set of data reg-isters, that we call PMD (Performance Moni-tor Data). Thus, the interface provides basicread/write access to the PMC/PMD registers.

Across all architectures, the interface exposesa uniform register-naming scheme using thePMC and PMD terminology inherited from theItanium processor architecture. As such, appli-cations actually operate on a logical PMU. Themapping from the logical to the actual PMU isdescribed in Section 4.3.

The whole PMU machine state is representedby a software abstraction called a perfmon con-text. Each context is identified and manipulatedusing a file descriptor.

4.1 System calls

The interface is implemented with multiple sys-tem calls rather than a device driver. Per-thread monitoring requires that the PMU ma-chine state be saved and restored on contextswitch. Access to such routine is usually pro-hibited for drivers. A system call provides moreflexibility than ioctl for the number, type,and type checking of arguments. Furthermore,system calls reinforce our goal of having the in-terface be an integral part of the kernel, and notjust an optional device driver.

The list of system calls is shown in Table 1.A context is created by the pfm_create_

context call. There are two types of contexts:per-thread or system-wide. The type is deter-mined when the context is created. The sameset of functionalities is available to both types

Page 5: Perfmon2: a flexible performance monitoring interface for

2006 Linux Symposium, Volume One • 273

int pfm_create_context(pfarg_ctx_t *c, void *s, size_t s)int pfm_write_pmcs(int f, pfarg_pmc_t *p, int c)int pfm_write_pmds(int f, pfarg_pmd_t *p, int c)int pfm_read_pmds(int f, pfarg_pmd_t *p, int c)int pfm_load_context(int f, pfarg_load_t *l)int pfm_start(int fd, pfarg_start_t *s)int pfm_stop(int f)int pfm_restart(int f)int pfm_create_evtsets(int f, pfarg_setdesc_t *s, int c)int pfm_getinfo_evtsets(int f, pfarg_setinfo_t *i, int c)int pfm_delete_evtsets(int f, pfarg_setdesc_t *s, int c)int pfm_unload_context(int f)

Table 1: perfmon2 system calls

of context. Upon return from the call, the con-text is identified by a file descriptor which canthen be used with the other system calls.

The write operations on the PMU registers areprovided by the pfm_write_pmcs and pfm_

write_pmds calls. It is possible to accessmore than one register per call by passing avariable-size array of structures. Each structureconsists, at a minimum, of a register index andvalue plus some additional flags and bitmasks.

An array of structures is a good compromisebetween having a call per register, i.e., one reg-ister per structure per call, and passing the en-tire PMU state each time, i.e., one large struc-ture per call for all registers. The cost of a sys-tem call is amortized, if necessary, by the factthat multiple registers are accessed, yet flexibil-ity is not affected because the size of the arrayis variable. Furthermore, the register structuredefinition is generic and is used across all ar-chitectures.

The PMU can be entirely programmed beforethe context is attached to a thread or CPU. Toolscan prepare a pool of contexts and later attachthem on-the-fly to threads or CPUs.

To actually load the PMU state onto the ac-tual hardware, the context must be bound to ei-ther a kernel thread or a CPU with the pfm_

load_context call. Figure 1 shows the ef-fect of the call when attaching to a thread of

file table pfm_context pfm_contextfile table

kerneluser

fd

monitoring tool monitored process

fd

controlling process monitored process

before pfm_load_context()

m

after pfm_load_context(12)

12 14 12 14

Figure 1: attaching to a thread

a dual-threaded process. A context can onlybe bound to one thread or CPU at a time. Itis not possible to bind more than one contextto a thread or CPU. Per-thread monitoring andsystem-wide monitoring are currently mutuallyexclusive. By construction, multiple concurrentper-thread contexts can co-exist. Potential con-flicts are detected when the context is attachedand not when it is created.

An attached context persists across a call toexec. On fork or pthread_create, thecontext is not automatically cloned in the newthread because it does not always make senseto aggregate results or profiles from child pro-cesses or threads. Monitoring tools can lever-age the 2.6 kernel ptrace interface to receivenotifications on the clone system call to de-cide whether or not to monitor a new thread orprocess. Because the context creation and at-tachment are two separate operations, it is pos-sible to batch creations and simply attach andstart on notification.

Once the context is attached, monitoring can bestarted and stopped using the pfm_start andpfm_stop calls. The values of the PMD regis-ters can be extracted with the pfm_read_pmdscall. A context can be detached with pfm_

unload_context. Once detached the contextcan later be re-attached to any thread or CPU ifnecessary.

A context is destroyed using a simple close

Page 6: Perfmon2: a flexible performance monitoring interface for

274 • Perfmon2: a flexible performance monitoring interface for Linux

call. The other system calls listed in Table 1 re-late to sampling or event sets and are discussedin later sections.

Many 64-bit processor architectures providethe ability to run with a narrow 32-bit instruc-tion set. For instance, on Linux for x86_64,it is possible to run unmodified 32-bit i386 bi-naries. Even though, the PMU is very imple-mentation specific, it may be interesting to de-velop/port tools in 32-bit mode. To avoid dataconversions in the kernel, the perfmon2 ABI isdesigned to be portable between 32-bit (ILP32)and 64-bit (LP64) modes. In other words, allthe data structures shared with the kernel usefixed-size data types.

4.2 System-wide monitoring

fd1 fd2

CPU1CPU0

monitoring tool

userkernel

worker processes

Figure 2: monitoring two CPUs

A perfmon context can be bound to only oneCPU at a time. The CPU on which the call topfm_load_context is executed determinesthe monitored CPU. It is necessary to set theaffinity of the calling thread to ensure that itruns on the CPU to monitor. The affinity can

later be modified, but all operations requiringaccess to the actual PMU must be executed onthe monitored CPU, otherwise they will fail. Inthis setup, coverage of a multi-processor sys-tem (SMP), requires that multiple contexts becreated and bound to each CPU to monitor.Figure 2 shows a possible setup for a moni-toring tool on a 2-way system. Multiple non-overlapping system-wide attached context canco-exist.

The alternative design is to have the kernelpropagate the PMU access to all CPUs of inter-est using Inter-Processor-Interrupt (IPI). Suchapproach does make sense if all CPUs are al-ways monitored. This is the approach chosenby OProfile, for instance.

With the perfmon2 approach, it is possible tomeasure subsets of CPUs. This is very inter-esting for large NUMA-style or multi-core ma-chines where all CPUs do not necessarily runthe same workload. And even then, with a uni-form workload, it possible to divide the CPUsinto groups and capture different events in eachgroup, thereby overlapping distinct measure-ments in one run. Aggregation of results canbe done by monitoring tools, if necessary.

It is relatively straightforward to construct auser-level helper library that can simplify mon-itoring multiple CPUs from a single thread ofcontrol. Internally, the library can pin threadson the CPUs of interest. Synchronization be-tween threads can easily be achieved using abarrier built with the POSIX-threads primitives.We have developed and released such a libraryas part of the libpfm [6] package.

Because PMU access requires the controllingthread to run on the monitored CPU, proces-sor and memory affinity are inherently enforcedthereby minimizing overhead which is impor-tant when sampling in NUMA machines. Fur-thermore, this design meshes well with certainPMU features such as the Precise-Event-Based

Page 7: Perfmon2: a flexible performance monitoring interface for

2006 Linux Symposium, Volume One • 275

Sampling (PEBS) support of the Pentium 4 pro-cessor (see Section 5.4 for details).

4.3 Logical PMU

PMU register names and implementations arevery diverse. On the Itanium processor archi-tecture, they are implemented by actual PMCand PMD indirect registers. On the AMDOpteron [1] processors, they are called PERF-SEL and PERFCTR indirect registers but areactually implemented by MSR registers. Aportable tool would have to know about thosenames and the interface would have to changefrom one architecture to another to accommo-date the names and types of the registers for theread and write operations. This would defeatour goal of having a uniform interface on allplatforms.

To mask the diversity without compromisingaccess to all PMU features, the interface ex-poses a logical PMU. This PMU is tailored tothe underlying hardware PMU for propertiessuch as the number of registers it implements.But it also guarantees the following propertiesacross all architectures:

• the configuration registers are called PMCregisters and are managed as 64-bit wideindirect registers

• the data registers are called PMD registersand are managed as 64-bit wide indirectregisters

• counters are 64-bit wide unsigned integers

The mapping of PMC/PMD to actual PMU reg-isters is defined by a PMU description tablewhere each entry provides the default value, abitmask of reserved fields, and the actual nameof the register. The mapping is defined by the

implementation and is accessible via a sysfsinterface.

The routine to access the actual register is partof the architecture specific part of a perfmon2implementation. For instance, on Itanium 2processor, the mapping is defined such that theindex in the table corresponds to the index ofthe actual PMU register, e.g., logical PMD0corresponds to actual PMD0. The read functionconsists of a single mov rXX=pmd[0] instruc-tion. On the Pentium M processor however, themapping is defined as follows:

% cat /sys/kernel/perfmon/pmu_desc/mappings

PMC0:0x100000:0xffcfffff:PERFEVTSEL0

PMC1:0x100000:0xffcfffff:PERFEVTSEL1

PMD0:0x0:0xffffffffffffffff:PERFCTR0

PMD1:0x0:0xffffffffffffffff:PERFCTR1

When a tool writes to register PMD0, it writesto register PERFEVTSEL0. The actual regis-ter is implemented by MSR 0x186. Thereis an architecture specific section of the PMUdescription table that provides the mapping tothe MSR. The read function consist of a singlerdmsr instruction.

On the Itanium 2 processors, we use this map-ping mechanism to export the code (IBR) anddata (DBR) debug registers as PMC registersbecause they can be used to restrict monitoringto a specific range of code or data respectively.There was no need to create an Itanium 2 pro-cessor specific system call in the interface tosupport this useful feature.

To make applications more portable, countersare always exposed as 64-bit wide unsigned in-tegers. This is particularly interesting whensampling, see Section 5 for more details. Usu-ally, PMUs implement narrower counters, e.g.,47 bits on Itanium 2 PMU, 40 bits on AMDOpteron PMU. If necessary, each implemen-tation must emulate 64-bit counters. This canbe accomplished fairly easily by leveraging the

Page 8: Perfmon2: a flexible performance monitoring interface for

276 • Perfmon2: a flexible performance monitoring interface for Linux

counter overflow interrupt capability present onall modern PMUs. Emulation can be turned offby applications on a per-counter basis, if neces-sary.

Oftentimes, it is interesting to associate PMU-based information with non-PMU based infor-mation such as an operating system resourceor other hardware resource. For instance, onemay want to include the time since monitoringhas been started, the number of active networksconnections, or the identification of the currentprocess in a sample. The perfctr interface pro-vides this kind of information, e.g., the virtualcycle counter, through a kernel data structurethat is re-mapped to user level.

With perfmon2, it is possible to leverage themapping table to define Virtual PMD regis-ters, i.e., registers that do not map to actualPMU or PMU-related registers. This mecha-nism provides a uniform and extensible nam-ing and access interface for those resources.Access to new resources can be added with-out breaking the ABI. When a tool invokespfm_read_pmds on a virtual PMD register, aread call-back function, provided by the PMUdescription table, is invoked and returns a 64-bit value for the resource.

4.4 PMU description module

Hardware and software release cycles do not al-ways align correctly. Although Linux kernelpatches are produced daily on the kernel.org web site, most end-users really run pack-aged distributions which have a very differ-ent development cycle. Thus, new hardwaremay become available before there is an ac-tual Linux distribution ready. Similarly, pro-cessors may be revised and new steppings mayfix bugs in the PMU. Although providing up-dates is fairly easy nowadays, end-users tend tobe reluctant to patch and recompile their ownkernels.

It is important to understand that monitoringtool developers are not necessarily kernel de-velopers. As such, it is important to providesimple mechanisms whereby they can enableearly access to new hardware, add virtual PMDregisters and run experimentations without fullkernel patching and recompiling.

There are no technical reasons for having thePMU description tables built into the kernel.With a minimal framework, they can as wellbe implemented by kernel modules where theybecome easier to maintain. The perfmon2 in-terface provides a framework where a PMU de-scription module can be dynamically insertedinto the kernel at runtime. Only one module canbe inserted at a time. When new hardware be-comes available, assuming there is no changesneeded in the architecture specific implementa-tion, a new description module can be providedquickly. Similarly, it becomes easy to experi-ment with virtual PMD registers by modifyingthe description table and not the interface northe core implementation.

5 Sampling Support

Statistical Sampling or profiling is the act ofrecording information about the execution ofa program at some interval. The interval iscommonly expressed in units of time, e.g., ev-ery 20ms. This is called Time-Based sampling(TBS). But the interval can also be expressedin terms of a number of occurrences of a PMUevent, e.g., every 2000 L2 cache misses. This iscalled Event-Based sampling (EBS). TBS caneasily be emulated with EBS by using an eventwith a fixed correlation to time, e.g., the num-ber of elapsed cycles. Such emulation typicallyprovides a much finer granularity than the op-erating system timer which is usually limited tomillisecond at best. The interval, regardless ofits unit, does not have to be constant.

Page 9: Perfmon2: a flexible performance monitoring interface for

2006 Linux Symposium, Volume One • 277

At the end of an interval, the information isstored into a sample which may contain infor-mation as simple as where the thread was, i.e.,the instruction pointer. It may also include val-ues of some PMU registers or other hardwareor software resources.

The quality of a profile depends mostly on theduration of the run and the number of samplescollected. A good profile can provide a lot ofuseful information about the behavior of a pro-gram, in particular it can help identify bottle-necks. The difficulty is to manage to overheadinvolved with sampling. It is important to makesure that sampling does not perturb the execu-tion of the monitored program such that it doesnot exhibit its normal behavior. As the sam-pling interval decreases, overhead increases.

The perfmon2 interface has an extensive set offeatures to support sampling. It is possible tomanage sampling completely at the user level.But there is also kernel-level support to mini-mize the overhead. The interface provides sup-port for EBS.

5.1 Sampling periods

All modern PMUs implement a counter over-flow interrupt mechanism where the proces-sor generates an interrupt whenever a counterwraps around to zero. Using this mechanismand supposing a 64-bit wide counter, it is pos-sible to implement EBS by expressing a sam-pling period p as 264 − p or in two’s comple-ment arithmetics as −p. After p occurrences,the counter overflows, an interrupt is generatedindicating that a sample must be recorded.

Because all counters are 64-bit unsigned in-tegers, tools do not have to worry about theactual width of counters when setting the pe-riod. When 64-bit emulation is needed, theimplementation maintains a 64-bit software

hwPMD fffe7960

ffffffff fffe7960

32 bits 32 bits

ffffffff

32 bits

swPMD 0

value = −100000 =

Figure 3: 64-bit counter emulation

value and loads only the low-order bits ontothe actual register as shown in Figure 3. AnEBS overflow is declared only when the 64-bitsoftware-maintained value overflows.

The interface does not have the notion of a sam-pling period, all it knows about is PMD values.Thus a sampling period p, is programmed intoa PMD by setting its value to −p. The num-ber of sampling periods is only limited by thenumber of counters. Thus, it is possible to over-lap sampling measurements to collect multipleprofiles in one run.

For each counter, the interface provides threevalues which are used as follows:

• value: the value loaded into the PMD reg-ister when the context is attached. This isthe initial value.

• long_reset: the value to reload into thePMD register after an overflow with user-level notification.

• short_reset: the value to reload into thePMD register after an overflow with nouser-level notification.

The three values can be used to try and masksome of the overhead involved with sampling.The initial period would typically be large be-cause it is not always interesting to capturesamples in initialization code. The long andshort reset values can be used to mask the noisegenerated by the PMU interrupt handler. Weexplain how they are used in Section 5.3.

Page 10: Perfmon2: a flexible performance monitoring interface for

278 • Perfmon2: a flexible performance monitoring interface for Linux

5.2 Overflow notifications

To support sampling at the user level, it is nec-essary to inform the tool when a 64-bit over-flow occurs. The notification can be requestedper counter and is sent as a message. There isonly one notification per interrupt even whenmultiple counters overflow at the same time.

Each perfmon context has a fixed-depth mes-sage queue. The fixed-size message containsinformation about the overflow such as whichcounter(s) overflowed, the instruction pointer,the current CPU at the time of the overflow.Each new message is appended to the queuewhich is managed as a FIFO.

Instead of re-inventing yet another notificationmechanism, existing kernel interfaces are lever-aged and messages are extracted using a simpleread call on the file descriptor of the context.The benefit is that common interfaces such asselect or poll can be used to wait on mul-tiple contexts at the same time. Similarly, asyn-chronous notifications via SIGIO are also sup-ported.

Regular file descriptor sharing semantic ap-plies, thus it is possible to delegate notificationprocessing to a specific thread or child process.

During a notification, monitoring is stopped.When monitoring another thread, it is possi-ble to request that this thread be blocked whilethe notification is being processed. A tool maychoose the block on notification option whenthe context is created. Depending on the typeof sampling, it may be interesting to have thethread run just to keep the caches and TLBwarm, for instance.

Once a notification is processed, the pfm_

restart function is invoked. It is used to resetthe overflowed counters using their long resetvalue, to resume monitoring, and potentially tounblock the monitored thread.

5.3 Kernel sampling buffer

It is quite expensive to send a notification touser level for each sample. This is particularlybad when monitoring another thread becausethere could be, at least, two context switchesper overflow and a couple of system calls.

One way to minimize this cost, it is to amor-tize it over a large set of samples. The idea isto have the kernel directly record samples intoa buffer. It is not possible to take page faultsfrom the PMU interrupt handler, usually a highpriority handler. As such the memory wouldhave to be locked, an operation that is typi-cally restricted to privileged users. As indicatedearlier, sampling must be available to regularusers, thus, the buffer is allocated by the kerneland marked as reserved to avoid being pagedout.

When the buffer becomes full, the monitoringtool is notified. A similar approach is used bythe OProfile and VTUNE interfaces. Severalissues must be solved for the buffer to becomeuseable:

• how to make the kernel buffer accessibleto the user?

• how to reset the PMD values after an over-flow when the monitoring tool is not in-volved?

• what format for the buffer?

The buffer can be made available via a readcall. This is how OProfile and VTUNEwork.Perfmon2 uses a different approach to try andminimize overhead. The buffer is re-mappedread-only into the user address space of themonitoring tool with a call to mmap, as shownin Figure 4. The content of the buffer is guaran-teed consistent when a notification is received.

Page 11: Perfmon2: a flexible performance monitoring interface for

2006 Linux Symposium, Volume One • 279

file table file tablepfm_context

monitoring tool

fd

user

kernel

sampling buffer

pfm_context

monitoring tool

fd

userkernel

sampling buffer

after pfm_create_context() after mmap()

Figure 4: re-mapping the sampling buffer

On counter overflow, the kernel needs to knowwhat value to reload into an overflowed PMDregister. This information is passed, per reg-ister, during the pfm_write_pmd call. If thebuffer does not become full, the kernel uses theshort reset value to reload the counter.

When the buffer becomes full, monitoring isstopped and a notification is sent. Reset is de-ferred until the monitoring tool invokes pfm_restart, at which point, the buffer is markedas empty, the overflowed counter is reset withthe long reset value and monitoring resumes.

long period short period

monitoring activeprogram executes

recovery period

short period short period

program executesmonitoring active

stoppedmonitoring

processingoverflow

Figure 5: short vs. long reset values.

The distinction between long and short resetvalues allows tools to specify a different, po-tentially larger value, for the first period after anoverflow notification. It is very likely that theuser-level notification and subsequent process-ing will modify the CPU state, e.g., caches andTLB, such that when monitoring resumes, theexecution will enter a recovery phase where itsbehavior may be different from what it wouldhave been without monitoring. Depending on

the type of sampling, the long vs. short resetvalues can be leveraged to hide that recoveryperiod. This is demonstrated in Figure 5 whichshows where the long reset value is used afteroverflow processing is completed. Of course,the impact and duration of the recovery periodis very specific to each workload and CPU.

It is possible to request, per counter, that bothreset values be randomized. This is very use-ful to avoid biased samples for certain mea-surements. The pseudo-random number gen-erator does not need to be very fancy, simplevariation are good enough. The randomizationis specified by a seed value and a bitmask tolimit the range of variation. For instance, amask of 0xff allows a variation in the inter-val [0-255] from the base value. The exist-ing implementation uses the Carta [2] pseudo-random number generator because it is simpleand very efficient.

A monitoring tool may want to record the val-ues of certain PMD registers in each sample.Similarly, after each sample, a tool may want toreset certain PMD registers. This could be usedto compute event deltas, for instance. EachPMD register has two bitmasks to convey thisinformation to the kernel. Each bit in the bit-mask represents a PMD register, e.g., bit 1 rep-resents PMD1. Let us suppose that on overflowof PMD4, a tool needs to record PMD6 andPMD7 and then reset PMD7. In that case, thetool would initialize the sampling bitmask ofPMD4 to 0xc0 and the reset bitmask to 0x80.

With a kernel-level sampling buffer, the for-mat in which samples are stored and whatgets recorded becomes somehow fixed and it ismore difficult to evolve. Monitoring tools canhave very diverse needs. Some tools may wantto store samples sequentially into the buffer,some may want to aggregate them immediately,others may want to record non PMU-based in-formation, e.g., the kernel call stack.

Page 12: Perfmon2: a flexible performance monitoring interface for

280 • Perfmon2: a flexible performance monitoring interface for Linux

As indicated earlier, it is important to ensurethat existing interfaces such as OProfile orVTUNE, both using their own buffer formats,can be ported without having to modify a lot oftheir code. Similarly, It is important to ensurethe interface can take advantage of advancedPMU support sampling such as the PEBS fea-ture of the Pentium 4 processor.

Preserving a high level of flexibility for thebuffer, while having it fully specified into theinterface did not look very realistic. We re-alized that it would be very difficult to comeup with a universal format that would satisfyall needs. Instead, the interface uses a radi-cally different approach which is described inthe next section.

5.4 Custom Sampling Buffer Formats

kernel

useruser interface

coreperfmon custom sampling

format

user interface

cust

om fo

rmat

inte

rface

sam

plin

g fo

rmat

inte

rface

Figure 6: Custom sampling format architecture

The interface introduces a new flexible mecha-nism called Custom Sampling Buffer Formats,or formats for short. The idea is to removethe buffer format from the interface and insteadprovide a framework for extending the interfacevia specific sampling formats implemented bykernel modules. The architecture is shown inFigure 6.

Each format is uniquely identified by a 128-bitUniversal Unique IDentifier (UUID) which canbe generated by commands such as uuidgen.In order to use a format, a tool must pass thisUUID when the context is created. It is possible

to pass arguments, such as the buffer size, to aformat when a context is created.

When the format module is inserted into thekernel, it registers with the perfmon core viaa dedicated interface. Multiple formats can beregistered. The list of available formats is ac-cessible via a sysfs interface. Formats canalso be dynamically removed like any otherkernel module.

Each format provides a set of call-backs func-tions invoked by the perfmon core during cer-tain operations. To make developing a formatfairly easy, the perfmon core provides certainbasic services such as memory allocation andthe ability to re-map the buffer, if needed. For-mats are not required to use those services.They may, instead, allocate their own bufferand expose it using a different interface, suchas a driver interface.

At a minimum, a format must provide a call-back function invoked on 64-bit counter over-flow, i.e., an interrupt handler. That handlerdoes not bypass the core PMU interrupt han-dler which controls 64-bit counter emulation,overflow detection, notification, and monitor-ing masking. This layering make it very simpleto write a handler. Each format controls:

• how samples are stored

• what gets recorded on overflow

• how the samples are exported to user-level

• when an overflow notification must be sent

• whether or not to reset counters after anoverflow

• whether or not to mask monitoring after anoverflow

Page 13: Perfmon2: a flexible performance monitoring interface for

2006 Linux Symposium, Volume One • 281

The interface specifies a simple and relativelygeneric default sampling format that is built-in on all architectures. It stores samples se-quentially in the buffer. Each sample has afixed-size header containing information suchas the instruction pointer at the time of theoverflow, the process identification. It is fol-lowed by a variable-size body containing 64-bit PMD values stored in increasing index or-der. Those PMD values correspond to the infor-mation provided in the sampling bitmask of theoverflowed PMD register. Buffer space is man-aged such that there can never be a partial sam-ple. If multiple counters overflow at the sametime, multiple contiguous samples are written.

Using the flexibility of formats, it was fairlyeasy to port the OProfile kernel code overto perfmon2. An new format was created toconnect the perfmon2 PMU and OProfileinterrupt handlers. The user-level OProfile,opcontrol tool was migrated over to use theperfmon2 interface to program the PMU. Theresulting format is about 30 lines of C code.The OProfile buffer format and managementkernel code was totally preserved.

Other formats have been developed since then.In particular we have released a format thatimplements n-way buffering. In this format,the buffer space is split into equal-size regions.Samples are stored in one region, when it fillsup, the tool is notified but monitoring remainsactive and samples are stored in the next region.This idea is to limit the number of blind spotsby never stopping monitoring on counter over-flow.

The format mechanism proved particularly use-ful to implement support for the Pentium 4 pro-cessor Precise Event-Based Sampling (PEBS)feature where the CPU is directly writing sam-ples to a designated region of memory. Byhaving the CPU write the samples, the skewobserved on the instruction pointer with typ-ical interrupt-based sampling can be avoided,

thus a much improved precision of the sam-ples. That skew comes from the fact that thePMU interrupt is not generated exactly on theinstruction where the counter overflowed. Thephenomenon is especially important on deeply-pipelined processor implementations, such asPentium 4 processor. With PEBS, there is aPMU interrupt when the memory region givento the CPU fills up.

The problem with PEBS is that the sample for-mat is now fixed by the CPU and it cannot bechanged. Furthermore, the format is differentbetween the 32-bit and 64-bit implementationsof the CPU. By leveraging the format infras-tructure, we created two new formats, one for32-bit and one for 64-bit PEBS with less thanone hundred lines of C code each. Perfmon2is the first to provide support for PEBS and itrequired no changes to the interface.

6 Event sets and multiplexing

On many PMU models, the number of coun-ters is fairly limited yet certain measurementsrequire lots of events. For instance, on the Ita-nium 2 processor, it takes about a dozen eventsto gather a cycle breakdown, showing how eachCPU cycle is spent, yet there are only 4 coun-ters. Thus, it is necessary to run the workloadunder test multiple times. This is not alwaysvery convenient as workloads sometimes can-not be stopped or are long to restart. Further-more, this inevitably introduces fluctuations inthe collected counts which may affect the accu-racy of the results.

Even with a large number of counters, e.g., 18for the Pentium 4 processor, there are still hard-ware constraints which make it difficult to col-lect some measurements in one run. For in-stance, it is fairly common to have constraintssuch as:

Page 14: Perfmon2: a flexible performance monitoring interface for

282 • Perfmon2: a flexible performance monitoring interface for Linux

• event A and B cannot be measured to-gether

• event A can only be measured on counterC.

Those constraints are unlikely to go away inthe future because that could impact the perfor-mance of CPUs. An elegant solution to theseproblems is to introduce the notion of eventssets where each set encapsulates the full PMUmachine state. Multiple sets can be defined andthey are multiplexed on the actual PMU hard-ware such that only one set if active at a time.At the end of the multiplexed run, the countsare scaled to compute an estimate of what theywould have been, had they been collected forthe entire duration of the measurement.

The accuracy of the scaled counts depends alot of the switch frequency and the workload,the goal being to avoid blind spots where cer-tain events are not visible because the set thatmeasures them did not activate at the right time.The key point is to balance to the need for highswitch frequency with higher overhead.

Sets and multiplexing can be implemented to-tally at the user level and this is done by thePAPI toolkit, for instance. However, it is criti-cal to minimize the overhead especially for nonself-monitoring measurements where it is ex-tremely expensive to switch because it could in-cur, at least, two context switches and a bunchof system calls to save the current PMD val-ues, reprogram the new PMC and PMD regis-ters. During that window of time the monitoredthread usually keeps on running opening up alarge blind spot.

The perfmon2 interface supports events setsand multiplexing at the kernel level. Switch-ing overhead is significantly minimized, blindspots are eliminated by the fact that switchingsystematically occurs in the context of the mon-itored thread.

Sets and multiplexing is supported for per-thread and system-wide monitoring and forboth counting and sampling measurements.

6.1 Defining sets

0 5

3 50

0

pfm_context

sets

after pfm_create_evtsets(5)

pfm_context

sets

after pfm_create_evtsets(3)

after pfm_create_context()

pfm_context

sets

Figure 7: creating sets.

Each context is created with a default eventset, called set0. Sets can be dynamically cre-ated, modified, or deleted when the contextis detached using the pfm_create_evtsets

and pfm_delete_evtsets calls. Informa-tion, such as the number of activations of aset, can be retrieved with the pfm_getinfo_

evtsets call. All these functions take arrayarguments and can, therefore, manipulate mul-tiple sets per call.

A set is identified with a 16-bit number. Assuch, there is a theoretical limit of 65k sets.Sets are managed through an ordered list basedon their identification numbers. Figure 7 showsthe effect of adding set5 and set3 on the list.

Tools can program registers in each set by pass-ing the set identification for each element of thearray passed to the read or write calls. In onepfm_write_pmcs call it is possible to pro-gram registers for multiple sets.

Page 15: Perfmon2: a flexible performance monitoring interface for

2006 Linux Symposium, Volume One • 283

6.2 Set switching

Set switching can be triggered by two differentevents: a timeout or a counter overflow. This isanother innovation of the perfmon2 interface,again giving tools maximum flexibility. Thetype of trigger is determined, per set, when itis created.

The timeout is specified in micro-seconds whenthe set is created. The granularity of the time-out depends on the granularity of kernel inter-nal timer tick, usually 1ms or 10ms. If the gran-ularity is 10ms, then it is not possible to switchmore than 100 times per second, i.e., the time-out cannot be smaller than 100µs. Because thegranularity can greatly affect the accuracy of ameasurement, the actual timeout, rounded upthe the closest multiple of the timer tick, is re-turned by the pfm_create_evtsets call.

It is also possible to trigger a switch on counteroverflow. To avoid dedicating a counter as atrigger, there is a trigger threshold value asso-ciated with each counter. At each overflow, thethreshold value is decremented, when it reacheszero, switching occurs. It is possible to havemultiple trigger counters per set, i.e., switch onmultiple conditions.

The next set is determined by the position inthe ordered list of sets. Switching is managedin a round-robin fashion. In the example fromFigure 7, this means that the set following set5is set0.

Using overflow switching, it is possible to im-plement counter cascading where a counterstarts counting only when a certain number ofoccurrences, n, of an event E is reached. In afirst set, a PMC register is programmed to mea-sure event E, the corresponding PMD registeris initialized to −n, and its switch trigger is setto 1. The next set is setup to count the event ofinterest and it will activated only when there isan overflow in the first set.

6.3 Sampling

Sets are fully integrated with sampling. Set in-formation is propagated wherever is necessary.The counter overflow notification carries theidentification of the active set. The default sam-pling format fully supports sets. Samples fromall sets are stored them into the same buffer.The active set at the time of the overflow isidentified in the header of each sample.

7 Security

The interface is designed to be built into thebase kernel, as such, it must follow the samesecurity guidelines.

It is not possible to assume that tools will al-ways be well-behaved. Each implementationmust check arguments to calls. It must not bepossible to use the interface for malicious at-tacks. A user cannot run a monitoring tool toextract information about a process or the sys-tem without proper permission.

All vector arguments have a maximum size tolimit the amount of kernel memory necessaryto perform the copy into kernel space. Bynature, those calls are non-blocking and non-preemptible, ensuring that memory is eventu-ally freed. The default limit is set to a page.

The sampling buffer size is also limited be-cause it consumes kernel memory that cannotbe paged out. There is a system-wide limitand a per-process limit. The latter is usingthe resource limit on locked memory (RLIMIT_MEMLOCK). The two-level protection is requiredto prevent users from launching lots of pro-cesses each allocating a small buffer.

In per-thread mode, the user credentials arechecked against the permission of the thread

Page 16: Perfmon2: a flexible performance monitoring interface for

284 • Perfmon2: a flexible performance monitoring interface for Linux

to monitor when the context is attached. Typi-cally, if a user cannot send a signal to the pro-cess, it is not possible to attach. By default,per-thread monitoring is available to all users,but a system administrator can limit to a usergroup. An identical, but separate, restriction isavailable for system-wide contexts.

On several architectures, such as Itanium, itis possible to read the PMD registers directlyfrom user-level, i.e., with a simple instruction.There is always a provision to turn this fea-ture off. The interface supports this mode ofaccess by default for all self-monitoring per-thread context. It is turned off by default forall other configurations, thereby preventing spyapplications from peeking at values left in PMDregisters by others.

All size and user group limitations can be con-figured by a system administrator via a simplesysfs interface.

As for sampling, we are planning on adding aPMU interrupt throttling mechanism to preventDenial-of-Service (DoS) attacks when applica-tions set very high sampling rates.

8 Fast user-level PMD read

Invoking a system call to read a PMD registercan be quite expensive compared to the cost ofthe actual instruction. On an Itanium 2 1.5GHzprocessor, for instance, it costs about 36 cyclesto read a PMD with a single instruction andabout 750 cycles via pfm_read_pmds whichis not really optimized at this point. As a ref-erence, the simplest system call, i.e., getpid,costs about 210 cycles.

On many PMU models, it is possible to directlyread a PMD register from user level with a sin-gle instruction. This very lightweight mode

of access is allowed by the interface for allself-monitoring threads. Yet, if actual counterswidth is less than 64-bit, only the partial valueis returned. The software-maintained value re-quires a kernel call.

To enable fast 64-bit PMD read accesses fromuser level, the interface supports re-mapping ofthe software-maintained PMD values to userlevel for self-monitoring threads. This mech-anism was introduced by the perfctr interface.This enables fast access on architectures with-out hardware support for direct access. For theothers, this enables a full 64-bit value to bereconstructed by merging the high-order bitsfrom the re-mapped PMD with the low-orderbit obtained from hardware.

Re-mapping has to be requested when the con-text is created. For each event set, the PMD reg-ister values have to be explicitly re-mapped viaa call to mmap on the file descriptor identify-ing the context. When a set is created a specialcookie value is passed back by pfm_create_

evtset. It is used as an offset for mmap andis required to identify the set to map. The map-ping is limited to one page per set. For each set,the re-mapped region contains the 64-bit soft-ware value of each PMD register along with astatus bit indicating whether the set is the activeset or not. For non-active sets, the re-mappedvalue is the up-to-date full 64-bit value.

Given that the merge of the software and hard-ware values is not atomic, there can be a racecondition if, for instance, the thread is pre-empted in the middle of building the 64-bitvalue. There is no way to avoid the race, insteadthe interface provides an atomic sequence num-ber for each set. The number is updated eachtime the state of the set is modified. The num-ber must be read by user-level code before andafter reading the re-mapped PMD value. If thenumber is the same before and after, it meansthat the PMD value is current, otherwise the

Page 17: Perfmon2: a flexible performance monitoring interface for

2006 Linux Symposium, Volume One • 285

operation must be restarted. On the same Ita-nium 2 processor and without conflict, the costis about 55 cycles to read the 64-bit value of aPMD register.

9 Status

A first generation of this interface has been im-plemented for the 2.6 kernel series for the Ita-nium Processor Family (IPF). It uses a singlemultiplexing system call, perfmonctl, and ismissing events sets, PMU description tables,and fast user-level PMD reads. It is currentlyshipping with all major Linux distributions forthis architecture.

The second generation interface, which we de-scribe in this paper, currently exists as a kernelpatch against the latest official 2.6 kernel fromkernel.org. It supports the following pro-cessor architectures and/or models:

• all the Itanium processors

• the AMD Opteron processors in 64-bitmode

• the Intel Pentium M and P6 processors

• the Intel Pentium 4 and Xeon processors.That includes 32-bit and 64-bit (EM64T)processors. Hyper-Threading and PEBSare supported.

• the MIPS 5k and MIPS 20k processors

• preliminary support for IBM Power5 pro-cessor

Certain ports were contributed by other com-panies or developers. As our user communitygrows, we expect other contributions to bothkernel and user-level code. The kernel patch

has been widely distributed and has generateda lot of discussions on various Linux mailinglists.

Our goal is to establish perfmon2 as the stan-dard Linux interface for hardware-based per-formance monitoring. We are in the process ofgetting it reviewed by the Community in prepa-ration for a merge with the mainline kernel.

10 Existing tools

Several tools already exists for the interface.Most of them are only available for Itaniumprocessors at this point, because an implemen-tation exists since several years.

The first open-source tool to use the interface ispfmon [6] from HP Labs. This is a command-line oriented tool initially built to test the in-terface. It can collect counts and profiles on aper-thread or system-wide basis. It supports theItanium, AMD Opteron, and Intel Pentium Mprocessors. It is built on top of a helper library,called libpfm, which handles all the event en-codings and assignment logic.

HP Labs also developed q-tools [7], a re-placement program for gprof. Q-tools usesthe interface to collect a flat profile and a sta-tistical call graph of all processes running in asystem. Unlike gprof, there is no need to re-compile applications or the kernel. The profileand call graph include both user- and kernel-level execution. The tool only works on Ita-nium 2 processors because it leverages certainPMU features, in particular the Branch TraceBuffer. This tool takes advantage of the in-terface by overlapping two sampling measure-ments to collect the flat profile and call graph inone run.

The HP Caliper [5] is an official HP productwhich is free for non-commercial use. This is

Page 18: Perfmon2: a flexible performance monitoring interface for

286 • Perfmon2: a flexible performance monitoring interface for Linux

a professional tool which works with all ma-jor Linux distributions for Itanium processors.It collects counts or profiles on a per-thread orper-CPU basis. It is very simple to use andcomes with a large choice of preset metricssuch as flat profile (fprof), data cache misses(dcache_miss). It exploits all the advancedPMU features, such as the Branch Trace Buffer(BTB) and the Data Event Address Registers(D-EAR). The profiles are correlated to sourceand assembly code.

The PAPI toolkit has long been available on topof the perfmon2 interface for Itanium proces-sors. We expect that PAPI will migrate over toperfmon2 on other architectures as well. Thismigration will likely simplify the code and al-low better support for sampling and set multi-plexing.

The BEA JRockit JVM on Linux/ia64, start-ing with version 1.4.2 is also exploiting the in-terface. The JIT compiler is using a, dynam-ically collected, per-thread profile to improvecode generation. This technique [4], called Dy-namic Profile Guided Optimization (DPGO),takes advantage of the efficient per-thread sam-pling support of the interface and of the abil-ity of the Itanium 2 PMU to sample branchesand locations of cache misses (Data Event Ad-dress Registers). What is particularly interest-ing about this example is that it introduces anew usage model. Monitoring is used each timea program runs and not just during the develop-ment phase. Optimizations are applied in theend-user environment and for the real work-load.

11 Conclusion

We have designed the most advanced perfor-mance monitoring interface for Linux. It pro-vides a uniform set of functionalities across all

architectures making it easier to write portableperformance tools. The feature set was care-fully designed to allow efficient monitoring anda very high degree of flexibility to support adiversity of usage models and hardware archi-tectures. The interface provides several keyfeatures such as custom sampling buffer for-mats, kernel support event sets multiplexing,and PMU description modules.

We have developed a multi-architecture imple-mentation of this interface that support all ma-jor processors. On the Intel Pentium 4 proces-sor, this implementation is the first to offer sup-port for PEBS.

We are in the process of getting it merged intothe mainline kernel. Several open-source andcommercial tools are available on Itanium 2processors at this point and we expect that oth-ers will be released for the other architecturesas well.

Hardware-based performance monitoring is thekey tool to understand how applications and op-erating systems behave. The monitoring infor-mation is used to drive performance improve-ments in applications, operating system ker-nels, compilers, and hardware. As proces-sor implementation enhancements shift frompure clock speed to multi-core, multi-thread,the need for powerful monitoring will increasesignificantly. The perfmon2 interface is wellsuited to address those needs.

References

[1] AMD. AMD64 ArchitectureProgrammer’s Manual: SystemProgramming, 2005.http://www.amd.com/us-en/Processors/DevelopWithAMD.

[2] David F. Carta. Two fast implementationsof the minimal standard random number

Page 19: Perfmon2: a flexible performance monitoring interface for

2006 Linux Symposium, Volume One • 287

generator. Com. of the ACM,33(1):87–88, 1990. http://doi.acm.org/10.1145/76372.76379.

[3] Intel Coporation. The Itanium processorfamily architecture. http://developer.intel.com/design/itanium2/documentation.htm.

[4] Greg Eastman, Shirish Aundhe, RobertKnight, and Robert Kasten. Inteldynamic profile-guided optimization inthe BEA JRockitTM JVM. In 3rdWorkshop on Managed RuntimeEnvironments, MRE’05, 2005.http://www.research.ibm.com/mre05/program.html.

[5] Hewlett-Packard Company. The Caliperperformance analyzer. http://www.hp.com/go/caliper.

[6] Hewlett-Packard Laboratories. Thepfmon tool and the libpfm library.http://perfmon2.sf.net/.

[7] Hewlett-Packard Laboratories. q-tools,and q-prof tools. http://www.hpl.hp.com/research/linux.

[8] Intel. Intel Itanium 2 ProcessorReference Manual for SoftwareDevelopment and Optimization, April2003. http://www.intel.com/design/itanium/documentation.htm.

[9] Intel. IA-32 Intel Architecture SoftwareDevelopers’ Manual: SystemProgramming Guide, 2004.http://developer.intel.com/design/pentium4/manuals/index_new.htm.

[10] Intel Corp. The VTuneTM performanceanalyzer. http://www.intel.com/software/products/vtune/.

[11] Chi-Keung Luk and Robert Muth et al.Ispike: A post-link optimizer for the IntelItanium architecture. In Code Generationand Optimization Conference 2004(CGO 2004), March 2004.http://www.cgo.org/cgo2004/papers/01_82_luk_ck.pdf.

[12] Mikael Pettersson. the Perfctr interface.http://user.it.uu.se/~mikpe/linux/perfctr/.

[13] Alex Shye et al. Analysis of pathprofiling information generated withperformance monitoring hardware. InINTERACT HPCA’04 workshop, 2004.http://rogue.colorado.edu/draco/papers/interact05-pmu_pathprof.pdf.

[14] B.D̃ragovic et al. Xen and the art ofvirtualization. In Proceedings of theACM Symposium on Operating SystemsPrinciples, October 2003.http://www.cl.cam.ac.uk/Research/SRG/netos/xen/architecture.html.

[15] J.Ãnderson et al. Continuous profiling:Where have all the cycles gone?, 1997.http://citeseer.ist.psu.edu/article/anderson97continuous.html.

[16] John Levon et al. Oprofile.http://oprofile.sf.net/.

[17] Robert Cohn et al. The PIN tool. http://rogue.colorado.edu/Pin/.

[18] Alex Tsariounov. The Prospectmonitoring tool.http://prospect.sf.net/.

[19] University of Tenessee, Knoxville.Performance Application ProgrammingInterface (PAPI) project.http://icl.cs.utk.edu/papi.

Page 20: Perfmon2: a flexible performance monitoring interface for

288 • Perfmon2: a flexible performance monitoring interface for Linux

Page 21: Perfmon2: a flexible performance monitoring interface for

Proceedings of theLinux Symposium

Volume One

July 19th–22nd, 2006Ottawa, Ontario

Canada

Page 22: Perfmon2: a flexible performance monitoring interface for

Conference Organizers

Andrew J. Hutton, Steamballoon, Inc.C. Craig Ross, Linux Symposium

Review Committee

Jeff Garzik, Red Hat SoftwareGerrit Huizenga, IBMDave Jones, Red Hat SoftwareBen LaHaise, Intel CorporationMatt Mackall, Selenic ConsultingPatrick Mochel, Intel CorporationC. Craig Ross, Linux SymposiumAndrew Hutton, Steamballoon, Inc.

Proceedings Formatting Team

John W. Lockhart, Red Hat, Inc.David M. Fellows, Fellows and Carr, Inc.Kyle McMartin

Authors retain copyright to all submitted papers, but have granted unlimited redistribution rightsto all as a condition of submission.