[IEEE 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) - Austin, TX, USA (2013.04.21-2013.04.23)] 2013 IEEE International Symposium on Performance

PAPI 5: Measuring Power, Energy, and the Cloud

Vincent M. Weaver∗, Dan Terpstra†, Heike McCraw†, Matt Johnson†, Kiran Kasichayanula†, James Ralph†, John Nelson†, Phil Mucci†, Tushar Mohan§, and Shirley Moore‡

∗Electrical and Computer Engineering, University of Maine
†Innovative Computing Lab, University of Tennessee
‡Computer and Computational Sciences, University of Texas at El Paso
§Minimal Metrics

I. INTRODUCTION

The PAPI library [1] was originally developed to provide portable access to the hardware performance counters found on a diverse collection of modern microprocessors. Rather than learning and writing to a new performance infrastructure each time code is moved to a new machine, developers can write measurement code to the PAPI API, which abstracts away the underlying interface.

Over time, other system components besides the processor have gained performance interfaces (for example, GPUs and network interfaces). PAPI was redesigned to have a component architecture to allow modular access to these new sources of performance data [2].

In addition to incremental changes in processor support, the recent PAPI 5 release adds support for two emerging concerns in the high-performance landscape: energy consumption and cloud computing.

As processor densities climb, the thermal properties and energy usage of high-performance systems are becoming increasingly important. We have extended the PAPI interface to simultaneously monitor processor metrics, thermal sensors, and power meters to provide clues for correlating algorithmic activity with thermal response and energy consumption.

We have also extended PAPI to provide support for running inside of Virtual Machines (VMs). This ongoing work will enable developers to use PAPI to engage in performance analysis in a virtualized cloud environment.

II. NEW FEATURES

The recent PAPI 5.0 and 5.1 releases have added many new features over PAPI 4.4.

A. Improved CPU Support

PAPI 5 provides better support for Intel SandyBridge, IvyBridge, Cedarview Atom, and Xeon Phi architectures. Support has been added for Intel Offcore Response and Uncore Events.

PAPI now supports Blue Gene/Q (BG/Q), the third generation in the IBM Blue Gene line of massively parallel, energy efficient supercomputers. The BG/Q predecessor, Blue Gene/P, suffered from incompletely implemented hardware performance monitoring tools. To address these limitations, an industry/academic collaboration was established to extend PAPI with five new components that allow hardware performance counter monitoring of the 5D-Torus network, the I/O system and the Compute Node Kernel in addition to the processing cores on BG/Q.

B. New Interfaces

The previous limit of 16 components has been lifted, allowing for a much richer collection of measurement options. PAPI can now report results other than unsigned 64-bit integers, such as signed integers, fixed point, and ratios. Support is also included for reporting units, which becomes necessary with events that report power and energy. A document is available that describes all of the interface changes [3].

C. Power and Energy Components

Energy and power have become increasingly important components of overall system behavior in high-performance computing (HPC). Now that HPC machines have hundreds of thousands of cores [4], the ability to reduce consumption by just a few Watts per CPU quickly adds up to major power, cooling, and monetary savings.

There are some limitations when measuring power and energy using PAPI. Typically these readings are system-wide: it is not possible to exactly map the results to the user’s code, especially on multi-core systems.

1) Intel RAPL: Recent Intel SandyBridge chips include the “Running Average Power Limit” (RAPL) interface. Internal circuitry can estimate current energy usage based on a model driven by hardware counters, temperature, and leakage models. The results of this model are available to the user via a model-specific register (MSR), with an update frequency on the order of milliseconds. Linux has no RAPL driver, so we must use the “MSR driver” that exports MSR access to userspace. If the MSR driver is given proper read-only permissions then PAPI can access these registers without needing kernel support.
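To illustrate how a raw energy reading of this kind is typically consumed, the following minimal sketch converts two counter samples into joules and average power. It assumes the layout Intel documents for RAPL: a 32-bit wrapping energy counter, and an energy unit of 1/2^ESU joules where ESU is held in bits 12:8 of the power-unit register. The function names and the sample register values are hypothetical, and this is not PAPI's own implementation; no MSR is actually read here.

```python
ENERGY_COUNTER_BITS = 32  # assumption: the energy status counter is 32 bits wide


def energy_unit_joules(power_unit_msr: int) -> float:
    """Return joules per counter tick from a raw power-unit register value."""
    esu = (power_unit_msr >> 8) & 0x1F  # energy status units live in bits 12:8
    return 1.0 / (1 << esu)


def energy_delta_joules(before: int, after: int, unit_j: float) -> float:
    """Energy consumed between two raw samples, tolerating one wraparound."""
    ticks = (after - before) % (1 << ENERGY_COUNTER_BITS)
    return ticks * unit_j


def average_power_watts(before: int, after: int, unit_j: float, seconds: float) -> float:
    """Average power over the sampling interval."""
    if seconds <= 0:
        raise ValueError("sampling interval must be positive")
    return energy_delta_joules(before, after, unit_j) / seconds


if __name__ == "__main__":
    unit = energy_unit_joules(0x000A1003)  # hypothetical register value, ESU = 16
    # Hypothetical samples in which the counter wrapped once between reads.
    joules = energy_delta_joules(0xFFFFFF00, 0x00000100, unit)
    watts = average_power_watts(0xFFFFFF00, 0x00000100, unit, 0.001)
    print(f"unit = {unit:.3e} J, delta = {joules:.7f} J, avg = {watts:.4f} W")
```

Because the counter wraps, the sampling interval must stay short enough that at most one wraparound can occur between reads; the modulo arithmetic cannot detect more than one.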

2) NVIDIA Management Library: Recent NVIDIA GPUs can report power usage via the NVIDIA Management Library (NVML) [5]. The nvmlDeviceGetPowerUsage() routine exports the current power; on Fermi C2075 GPUs it is updated at 60 Hz with milliwatt precision and ±5 Watt absolute accuracy. PAPI can use this interface to report power for the entire board, including GPU and memory.
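Since this interface reports instantaneous power rather than energy, a caller who wants energy has to integrate the samples over time. The sketch below shows one way to do that with the trapezoidal rule, assuming timestamped readings in milliwatts. The function name and the sample readings are hypothetical; no NVML call is made here, and this is not how PAPI itself post-processes the values.

```python
from typing import Sequence, Tuple


def energy_from_power_samples(samples: Sequence[Tuple[float, int]]) -> float:
    """Integrate (time_seconds, power_milliwatts) samples into joules.

    Uses the trapezoidal rule, so samples need not be evenly spaced.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be ordered by time")
        mean_watts = (p0 + p1) / 2.0 / 1000.0  # milliwatts -> watts
        joules += mean_watts * (t1 - t0)
    return joules


if __name__ == "__main__":
    # Hypothetical readings taken once per second.
    readings = [(0.0, 100_000), (1.0, 120_000), (2.0, 120_000)]
    print(f"{energy_from_power_samples(readings):.1f} J")
```

With a ±5 Watt absolute accuracy on each sample, the integrated energy inherits a proportional uncertainty, so short intervals at low power are the least reliable.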

3) Xeon Phi / MIC: The PAPI MIC power component exposes instantaneous voltage and current data collected by an onboard system management controller (SMC); the measurements are from an analog sensor, not a model. The SMC also provides an averaged total power utilization over two time windows.

978-1-4673-5779-1/13/$31.00 ©2013 IEEE

D. Virtualization

Cloud computing involves use of a hosted computational environment that can provide elastic compute and storage services on demand. Virtualization is a technology that allows multiple virtual machines (VMs) to run on a single physical machine and share its resources. Virtualization is increasingly being used in cloud computing to provide economies of scale, customized environments, fault isolation, and reliability.

PAPI 5 addresses various aspects of measuring performance in the cloud.

1) Timing and Stealtime: PAPI aims to use the best and most accurate timers exposed by each virtual machine monitor (VMM) to implement a uniform timing interface that can be used across VMMs. This timer standardization will allow the same timing code to be used from within an application regardless of which native or virtualized environment it is running on. Some VMMs support the notion of virtualized time, whereby the virtual machine can have its own idea of time; VMware, for example, calls this “apparent time” [6].

Most processors have a built-in hardware clock that allows the operating system to measure real and process time. Real time, also called elapsed or wall clock time, is the time according to an external standard since some fixed point such as the start of the life of a process. Process time is the amount of the CPU time used by a process since it was created. Process time is broken down into user CPU time (also called virtual time), which is the amount of time spent executing in user mode; and system CPU time, which is the amount of time spent executing in kernel mode. Measurement of process time can be useful for evaluating the performance of a program, including on a per-process or per-thread basis.
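The distinction between real, user, and system time can be seen with a small sketch. It uses Python's standard library timers rather than PAPI, purely to illustrate the three quantities defined above; the helper names `measure` and `spin` are hypothetical.

```python
import os
import time


def measure(work):
    """Run work() and return (real, user, system) seconds it consumed."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    work()
    cpu_end = os.times()
    wall_end = time.perf_counter()
    return (
        wall_end - wall_start,              # real (wall clock) time
        cpu_end.user - cpu_start.user,      # user CPU time
        cpu_end.system - cpu_start.system,  # system CPU time
    )


def spin():
    """CPU-bound busy work, which accrues mostly user time."""
    total = 0
    for i in range(200_000):
        total += i * i
    return total


if __name__ == "__main__":
    # Sleeping adds to real time but consumes almost no CPU time.
    real, user, system = measure(lambda: (spin(), time.sleep(0.05)))
    print(f"real={real:.3f}s user={user:.3f}s system={system:.3f}s")
```

On a typical run the real time exceeds the sum of user and system time, because the sleep contributes only to elapsed time.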

In order to improve the accuracy of CPU time accounting on virtual systems, the mechanism must be able to not only distinguish between real and virtual CPU time but also recognize being in involuntary wait states. These wait states are referred to as “steal time”. Linux KVM/Xen supports reporting steal time, and we have written a PAPI component for reporting it.
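One place a Linux guest exposes steal time is the aggregate `cpu` line of `/proc/stat`, where it is the eighth value after the label. The sketch below parses that field from text in the `/proc/stat` format. The function name and the sample text are hypothetical, and the sketch makes no claim about how the PAPI component obtains the value.

```python
def parse_steal_ticks(proc_stat_text: str) -> int:
    """Return the aggregate steal time, in clock ticks, from /proc/stat text."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Values after the label: user nice system idle iowait irq softirq steal ...
            values = fields[1:]
            if len(values) < 8:
                return 0  # older kernels do not report a steal column
            return int(values[7])
    raise ValueError("no aggregate 'cpu' line found")


if __name__ == "__main__":
    # Hypothetical sample in the /proc/stat format.
    sample = (
        "cpu  4705 150 1120 16250 520 30 45 75 0 0\n"
        "cpu0 2350 75 560 8125 260 15 22 40 0 0\n"
    )
    print(parse_steal_ticks(sample))
```

The value is a cumulative count, so a measurement takes the difference between two reads bracketing the region of interest.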

Steal time is only reported system-wide, so it is not possible to get fine-grained per-process results, thus making it hard for PAPI to auto-adjust results returned by PAPI_get_virt_usec(). Steal time is only an issue if a machine is over-subscribed with VMs, and in most HPC situations only one task is running per node, so this might not be a critical limitation.

2) I/O Performance: Variable I/O performance has been found to significantly impact application performance in virtual environments [7], [8], [9]. We have developed a PAPI component (appio) for measuring I/O performance at application level in virtual environments. This component intercepts calls to read, write, fread and fwrite and reports a variety of metrics on the size and amount of I/O taking place.
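The idea of intercepting I/O calls and tallying their sizes can be sketched at a much higher level than the C library calls the component actually intercepts. The wrapper below counts calls and bytes on a Python file-like object; it is only an analogy for the technique, and the class name and counter names are hypothetical rather than appio's event names.

```python
import io


class CountingFile:
    """Wrap a binary file-like object and tally read/write calls and bytes."""

    def __init__(self, raw):
        self._raw = raw
        self.read_calls = 0
        self.read_bytes = 0
        self.write_calls = 0
        self.write_bytes = 0

    def read(self, size=-1):
        data = self._raw.read(size)
        self.read_calls += 1
        self.read_bytes += len(data)
        return data

    def write(self, data):
        written = self._raw.write(data)
        self.write_calls += 1
        self.write_bytes += written
        return written

    def __getattr__(self, name):
        # Delegate everything else (seek, close, ...) to the wrapped object.
        return getattr(self._raw, name)


if __name__ == "__main__":
    f = CountingFile(io.BytesIO())
    f.write(b"hello ")
    f.write(b"world")
    f.seek(0)
    f.read(4)
    f.read()
    print(f.read_calls, f.read_bytes, f.write_calls, f.write_bytes)
```

Wrapping keeps the application code unchanged apart from the point where the file is opened, which mirrors the appeal of interception: the measured code does not need to know it is being counted.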

3) VMware Component: The PAPI VMware component exposes information provided from within VMware to users running inside as a guest. Values are gathered using VMware’s Guest SDK [10], as well as what VMware refers to as “pseudo performance” counters [6]. VMware makes pseudo performance counters available through an rdpmc instruction to obtain fine-grained time from within the virtual space. These pseudo-performance counters are not enabled by default; they must be enabled and an environment variable (PAPI_VMWARE_PSEUDOPERFORMANCE) must be set before use.

4) Virtualized Processor Counters: The performance monitoring unit (PMU) of a processor typically includes a set of performance counter registers that count the frequency or duration of specific processor events, a set of performance event select registers used to specify the events that are tracked by the performance counter registers, a hardware interrupt that can be generated when a counter overflows, and a time stamp counter (TSC) that can be used to count processor clock cycles. On x86 hardware these interfaces are programmed via MSRs; typically these require ring-0 (kernel) levels of permission.

In order to transparently use performance counters (and PAPI) from within a VM, the full MSR interface must be trapped and emulated. Recently support for doing this was added to KVM (in Linux 3.3) and Xen (Linux 3.5). Support is still undergoing beta testing at VMware.

III. CONCLUSION

The PAPI library provides transparent access to new classes of interfaces, including virtualized, power and energy measurements. Existing programs that already support PAPI instrumentation for CPU performance measurements can quickly be adapted to measure these new events with a simple PAPI upgrade.

ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation under Grants No. 0910899 and 1117058. Additional support for this work was provided through a Sponsored Academic Research Award from VMware, Inc.

REFERENCES

[1] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, “A portable programming interface for performance evaluation on modern processors,” International Journal of High Performance Computing Applications, vol. 14, no. 3, pp. 189–204, 2000.

[2] D. Terpstra, H. Jagode, H. You, and J. Dongarra, “Collecting performance data with PAPI-C,” in 3rd Parallel Tools Workshop, 2009, pp. 157–173.

[3] V. Weaver, “New features in the PAPI 5.0 release,” University of Tennessee, Tech. Rep., 2012. [Online]. Available: http://www.eece.maine.edu/~vweaver/papers/papi/papi_v5_changes.pdf

[4] “Top 500 supercomputing sites,” http://www.top500.org/.

[5] NVML Reference Manual, NVIDIA, 2012.

[6] Timekeeping in VMware Virtual Machines, VMware, Inc., 2010.

[7] A. Nanos, G. Goumas, and N. Koziris, “Exploring I/O virtualization data paths for MPI applications in a cluster of VMs: A networking perspective,” in Proc. 5th Workshop on Virtualization in High Performance Cloud Computing, 2011.

[8] H. Kim, H. Lim, J. Jeong, H. Jo, and J. Lee, “Task-aware virtual machine scheduling for I/O performance,” in Proc. of the ACM SIGPLAN/SIGOPS international conference on Virtual execution environments, 2009, pp. 101–110.

[9] D. Ongaro, A. Cox, and S. Rixner, “Scheduling I/O in virtual machine monitors,” in Proc. of the 4th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments, 2008, pp. 1–10.

[10] vSphere Guest SDK Documentation, VMware, Inc., 2011.
