Sapienza University of Rome
School of Information Engineering,
Computer Science, and Statistics
Master’s Thesis in
Engineering in Computer Science
Mining Hot Calling Contexts in Small Space
Advisor: Prof. Camil Demetrescu
Student: Daniele Cono D’Elia
A.Y. 2011/2012
Homines, dum docent, discunt.
(Seneca)
Introduction
A crucial problem in the development of computer software is identifying badly
designed portions of code that can hurt execution performance. An empirical
rule [15] states that 90% of the time is spent executing 10% of the code: hence,
one must pinpoint and optimize that 10% of the code to address performance
bottlenecks, since optimizing the remaining 90% would have little impact on
overall performance.
Dynamic program analysis encompasses the design and development of tools
for analyzing software by gathering information at execution time. Its main
applications include performance profiling, debugging, and program understanding.
Static techniques for performance profiling, based on the inspection of the
source (or sometimes the object) code and widely used in the past, are no longer
sufficient, mostly because of the pervasive use of dynamic libraries and of
polymorphism and late binding in object-oriented languages.
Traditional profilers such as gprof [14] and valgrind [24] provide information
about the execution frequency of single routines (vertex profiling) or of caller-
callee pairs (edge profiling). However, these data are sometimes insufficient to
accurately describe the dynamics of execution [26, 30]. Context-sensitive profiling
instead provides more valuable, finer-grained information for program
understanding, performance analysis, and runtime optimization. Collecting context
information in modern object-oriented software is very challenging: source
code is composed of a large number of small routines, and the high frequency
of function calls and returns may result in considerable profiling overhead and
huge amounts of data to be analyzed by the profiler.
The main goal of this thesis is to show how sophisticated techniques from
the data streaming computational model can be used to handle the huge amount
of profiling data gathered at execution time, thus achieving low overhead in
the implementation of a context-sensitive profiler. In particular, we adopted
and optimized two well-known algorithms for the frequent items problem, Lossy
Counting and Space Saving, and then realized two implementations of our context-
sensitive profiler. The first implementation is based on Intel Pin and makes it
possible to rapidly profile any C/C++ application, since the instrumentation
is performed at runtime. Although Pin is one of the most efficient and widespread
instrumentation frameworks, preliminary experimental results showed that
the overhead introduced for context-sensitive profiling was large.
For this reason, we developed a much more efficient profiler based on gcc.
Although instrumentation is now performed at compile time, the tool is very
flexible: changing a symbolic link or an environment variable is sufficient to
modify even crucial parameters of the profiler. This tool also allows for partial
instrumentation of the source code and outputs only the calling contexts
representative of the hot spots to which optimizations should be directed.
The experiments show that our profiler is able to correctly identify all the
hottest calling contexts, producing accurate performance metrics. The running
time overhead is kept under control using bursting techniques [2, 17, 32], while
the peak memory usage is only 1% of that of standard context-sensitive profilers. This
thesis is largely based on, and extends, the paper Mining Hot Calling Contexts in
Small Space [12], written by the author of this dissertation together with Profes-
sors Camil Demetrescu and Irene Finocchi (Sapienza University of Rome), and
published in Proceedings of the 32nd ACM SIGPLAN Conference on Program-
ming Language Design and Implementation, PLDI 2011, San Jose, CA, USA,
June 4-8, 2011. Source code and documentation related to this work are hosted
on Google Code (http://code.google.com/p/hcct/).
Thesis structure. The remainder of this thesis is structured as follows.
Chapter 1 describes performance profiling methodologies and the most widely used
program instrumentation tools. Chapter 2 describes the issues related to
context-sensitive profiling and our approach to the problem. Chapter 3 focuses on
implementation and engineering aspects, while Chapter 4 illustrates the outcome
of our experimental study. Conclusions are presented in Chapter 5.
Contents
1 Performance profiling methodologies
   1.1 Survey of techniques
      1.1.1 Profiling
      1.1.2 Static and dynamic analysis
      1.1.3 Collection of data
      1.1.4 Data type
   1.2 Data aggregation and representation
      1.2.1 Calling contexts
      1.2.2 Call graph
      1.2.3 Call tree
      1.2.4 Calling context tree
   1.3 Program instrumentation tools
      1.3.1 gprof
      1.3.2 Dyninst
      1.3.3 DynamoRIO
      1.3.4 Valgrind
      1.3.5 Pin
      1.3.6 gcc
   1.4 Conclusions
2 Mining hot calling contexts
   2.1 Previous approaches
   2.2 Frequent items in data streams
      2.2.1 Space Saving
      2.2.2 Lossy Counting
   2.3 HCCT: Hot Calling Context Tree
      2.3.1 Approach
      2.3.2 Update and query
      2.3.3 Discussion
   2.4 Conclusions
3 Implementation
   3.1 Space Saving (BSS and LSS)
   3.2 Lossy Counting (LC)
   3.3 The pintool
   3.4 The trace analyzer
   3.5 The gcc-based profiler
   3.6 Conclusions
4 Experimental evaluation
   4.1 Methodology
      4.1.1 Benchmarks
      4.1.2 Metrics
      4.1.3 Parameter tuning
   4.2 Experimental results
      4.2.1 Accuracy: exact HCCT
      4.2.2 Accuracy: (φ, ε)-HCCT
      4.2.3 Memory usage and cost of analysis
      4.2.4 Performance
   4.3 Conclusions
5 Conclusions
Appendix A: Using the tools
Bibliography
Chapter 1
Performance profiling
methodologies
1.1 Survey of techniques
1.1.1 Profiling
The development of computer software often leads to programs that are quite
large and typically modular, according to well-known principles of software
engineering. These programs usually consist of a large number of small
routines, written by several developers and possibly updated over time to introduce
enhancements and new functionalities, sometimes deviating from the software's
initial goals. In such large programs, it becomes important to pinpoint and
optimize the pieces of code that dominate the execution time. Indeed, according
to the 90-10 empirical rule [15], 90% of the time is spent executing 10% of the
code: in accordance with the well-known Pareto principle, this rule suggests
focusing optimization on that part of the program to achieve remarkable
improvements in efficiency and execution time.
The main goal of a performance profiler is to identify which parts of the
program should be optimized in order to improve the global execution speed.
Profilers have many other applications as well: for instance, optimal memory
allocation, intrusion detection, and the understanding of program behaviour.
CHAPTER 1. PERFORMANCE PROFILING METHODOLOGIES
1.1.2 Static and dynamic analysis
Static analysis techniques are based on the inspection of the source code (and
sometimes of the object code). The analysis is carried out by an automated tool
without actually executing the program. The granularity of analysis tools varies
from those that only consider individual statements to those that include the
complete source code in their analysis. Static analysis techniques were very
successful in past years, especially for procedural programming; nowadays
they are used mainly by compilers to apply optimizations to the code while
preserving its semantic correctness. Indeed, changes in development methodologies
over the last years have created the need for dynamic analysis of program
behaviour: polymorphism, late binding, and dynamically linked libraries are
emblematic of this need. A static analysis would be inaccurate, since the lack
of statically available information forces the profiler to make conservative
assumptions, thus undermining the validity of the analysis in several cases.
The development of dynamic analysis tools has also posed several methodological
challenges in different areas: building efficient instrumentation techniques that
introduce a moderate slowdown on execution time and do not interfere with the
internal logic of the analysed program (i.e., do not introduce heisenbugs);
designing efficient algorithms to process huge amounts of data with limited time
and resources; and dealing with the sometimes obscure optimizations introduced
by the compiler in the executable code.
1.1.3 Collection of data
A statistical profiler operates by sampling: it usually probes the target program's
program counter at regular intervals using operating system interrupts. The
analysis performed by a statistical profiler is the result of a statistical
approximation, with a high error probability for infrequent functions (sometimes
they are not detected at all). In practice, these profilers can often provide a
more accurate analysis than other approaches, since they are not as intrusive to
the analyzed program (near full-speed execution is achievable, so they can detect
issues that would otherwise be hidden) and do not have as many side effects (such
as on memory caches). Moreover, a statistical profiler can show the amount of
time approximately spent in user mode versus kernel mode by every analyzed rou-
tine. Several drawbacks due to the usage of operating system interrupts can be
overcome by using dedicated hardware that captures the program counter without
interfering with the execution of the program under analysis.
An instrumenting profiler modifies the target program with additional instruc-
tions to collect the required information. Instrumenting the program will always
have an impact on the program execution, typically introducing a slowdown and
potentially causing inaccurate results or heisenbugs. However, instrumenting a
program in a careful way can lead to a minimal impact and a very high degree
of accuracy in the results. For this reason, the profiler we implemented relies on
instrumentation rather than on sampling. The impact of instrumentation on a
particular program depends on many factors: the placement of instrumentation
points, the mechanism used to capture information, and the inner structure of
the program to be analyzed. Several instrumentation methodologies are available
nowadays: manual (i.e. performed by the programmer, by adding instructions
to extract profiling information), compiler assisted, binary translation, runtime
instrumentation, and runtime injection.
An event-based profiler is typical of languages such as Java, .NET, Python,
and Ruby. In this case the runtime provides hooks to profilers for trapping
events such as method entry and exit, object creation, and exceptions. These
profilers are usually very flexible: for instance, it is possible to vary both
the profiled events and the sampling frequency.
1.1.4 Data type
Profilers can also be classified according to the type of information they collect;
then a further distinction can be made according to the different representations
of the gathered information. There are two main categories: profilers that output
counters for instructions/routines, and profilers that estimate the fraction
of execution time spent in each routine. Counters are usually presented
in a simple table, while for time measures there could be more than one value
associated with each routine (for instance, time spent in user mode and in kernel
mode).
It is important to stress the fact that counters are not necessarily repre-
sentative of or proportional to the execution time of the functions. Moreover,
the completion time may vary among different invocations of the same function.
When the goal is to estimate accurately the distribution of execution time, a
counter-based profiler is not the right choice. However, counters are very useful
in a variety of scenarios and can be used in several contexts. For instance, the ex-
act number of invocations of a function can be used to verify the correctness of an
algorithm, or to identify algorithms whose complexity is not adequate to the tasks
they have to complete. Moreover, an accurate analysis of counters can provide
a rough estimation of the execution time for parts of code and help to improve
the performance and refactor already implemented algorithms. Finally, counters
can also be used to identify subtle programming errors and which portions of the
code are executed (for instance, to check whether a new implementation of an
abstraction has completely replaced the old one).
Profilers that estimate time measures operate by approximation: measuring
the exact execution time of each instruction or routine would introduce too
large a slowdown in global execution speed. For this reason, these profilers are
based on sampling techniques as described in the previous paragraph.
1.2 Data aggregation and representation
In this thesis we focus on counter-based profiling, although our approach can be
easily extended to other metrics such as execution time or cache misses. Several
data structures have been proposed over the years to maintain information
about interprocedural control flow1.
The design of a data structure to store profiling information is strictly related
to the aggregation level of such information. A vertex profiler stores the number
of invocations of each function and distinguishes neither the caller nor the context.
An edge profiler instead stores the number of invocations for each ordered pair of
caller-callee routines. This terminology derives from the graphical representation
1An intraprocedural data flow analysis operates on a control-flow graph for a single method;
an interprocedural analysis operates across function boundaries.
of the control flow of a program: every function is represented as a
vertex; two vertices A and B are linked by a directed edge A → B if function
A invokes function B; and a path is a sequence of vertices that can be traversed
starting from a node and following existing edges.
Ball and Larus [4] introduced an intraprocedural path profiling technique to
efficiently determine how many times each acyclic path in a routine executes,
extending the more common edge profiling. Melski and Reps [21] have proposed
interprocedural path profiling to capture both inter- and intra-procedural control
flow; however, their approach suffers from scalability issues, since the number of
paths existing across procedure boundaries can be very large.
1.2.1 Calling contexts
A context-sensitive analysis is an interprocedural analysis that takes into account
the calling context when analyzing the target of a function call. A calling context
is a sequence of routine calls that are concurrently active on the run-time stack
and that lead to a program location. The utility of calling context information
was already clear in the 80s; nowadays, however, collecting context information
in modern object-oriented software is very challenging: since applications are
often structured in a large number of small routines, the high frequency of
function calls and returns might result in considerable profiling overhead, huge
amounts of data to be analyzed by the programmer, and heisenbugs. For this
reason, edge profilers are still the most widely used profilers today. On the
other hand, both context-sensitive and path profilers provide more valuable
information than edge profilers, while it is still possible to reconstruct an
edge profile from this information [5].
We will now define three well-known data structures for counter-based profilers.
1.2.2 Call graph
A call graph (CG) is a succinct representation of function invocations: every
function is represented as a vertex, and every directed arc shows a caller-callee
relationship between two functions. The use of call graph profiles was pioneered
in the 80s by gprof [14]. Although this representation is very space- and time-
efficient, it can lead to misleading results [26, 30] and to an incomplete
understanding of the behaviour of the program. For instance, in Figure 1.1 the call
graph is missing the information that routine D is invoked by C only when C is
invoked by B (calling context A→B→C→D occurs twice, while A→C→D
does not appear in the execution). Call graphs are used by edge profilers and are
not suited for the representation of context-sensitive information.
Figure 1.1: Call graph, Call tree, Calling Context Tree (from left to right).
1.2.3 Call tree
In a call tree (CT) each node represents a different routine invocation: a new node
is added for each method invocation and then attached to the parent node (i.e.,
the caller method). This yields very accurate context information, but requires
extremely large (possibly unbounded) space.
1.2.4 Calling context tree
The inaccuracy of call graphs and the huge size of call trees have motivated the
introduction [1] of calling context trees (CCT). The CCT is a succinct summary
of the call tree, built recursively from the root by merging, for each node, the
children representing different invocations of the same method: in other words,
CCTs do not distinguish between different invocations of the same routine within
the same context. The edge weight of a CCT represents the number of nodes merged
into the target node, i.e., the frequency with which the callee is invoked by the
caller (the parent node) within the particular calling context of the caller. This
representation takes advantage of the tree structure by encoding every calling
context as a single node: the context can be easily reconstructed by moving along
the path from the root to the target node. While preserving good accuracy, CCTs
are typically several orders of magnitude smaller than call trees: the presence of
recursion may heavily increase the number of nodes, but this scenario is infrequent
in practice.
However, as noticed in previous works [7, 32], even CCTs may be very large
and difficult to analyze in several applications. The exhaustive approach to the
construction of a CCT is based on the instrumentation of each routine call and
return, incurring considerable slowdown even when using efficient instrumenta-
tion mechanisms [1, 32]: these two issues will be addressed and analyzed in more
detail in Chapter 2.
1.3 Program instrumentation tools
Here follows a survey of the most used instrumentation tools and frameworks.
1.3.1 gprof
This profiler was introduced in 1982 [14] by a group of researchers at the
University of California, Berkeley, as an extension of the older Unix prof tool.
It uses a hybrid of instrumentation and sampling. Instrumentation is done by the
compiler (e.g., the -pg option of gcc) and is used to count how many times each
procedure of the program is invoked; moreover, the call graph edge corresponding
to each invocation is also recorded. The time spent in each routine is estimated
by statistical sampling: the program counter of the monitored software is probed
at regular intervals using operating system interrupts (e.g., programmed via the
profil syscall). A typical sampling period is 0.01 seconds (10 milliseconds). In
2004 the gprof paper appeared on the list of the 50 most influential PLDI papers
of all time: it revolutionized the performance analysis field and became a
milestone and the tool of choice for many developers.
1.3.2 Dyninst
Dyninst [9] is a multi-platform API for runtime code-patching developed as part
of the Paradyn project at the University of Wisconsin-Madison and University
of Maryland. The goal of this API is to provide a machine independent interface
to permit the creation of tools and applications that use runtime code patch-
ing. Several APIs are available, such as the InstructionAPI for decoding and
representing machine instructions in a platform-independent manner, and the
StackwalkerAPI for walking the run-time stack of a process.
1.3.3 DynamoRIO
DynamoRIO [8] is a runtime code manipulation system: it supports code trans-
formations while the target program executes. Unlike many frameworks, it allows
arbitrary modifications to application instructions (a powerful instruction manip-
ulation library for IA-32 and AMD64 is provided). DynamoRIO’s API abstracts
away the details of the underlying infrastructure and allows the tool builder to
implement efficient dynamic tools for a wide variety of uses, such as program
analysis, profiling, and optimization.
1.3.4 Valgrind
This tool [24] was originally designed as a free memory debugging tool for
Linux, but has since evolved into a generic framework for creating dynamic
analysis tools for memory debugging, memory leak detection, and profiling.
Valgrind is essentially a virtual machine using just-in-time (JIT) compilation
techniques: nothing from the original program ever runs directly on the host
processor. After an initial translation into an intermediate processor-neutral
format, a Valgrind-based tool is free to make any transformation on the code
before Valgrind translates it back into machine code and executes it. Although a
considerable amount of performance is lost in these transformations, the
intermediate format is much more suitable for instrumentation than the original.
Several tools are included with Valgrind, such as Memcheck for memory usage
inspection and Callgrind for the construction of the call graph.
1.3.5 Pin
Pin [18] is an instrumentation system developed by Intel and the University of
Colorado. The goal of this project is to provide easy-to-use, portable, transparent,
and efficient instrumentation. Instrumentation tools (called pintools) are written
in C/C++ using the Pin API, which is designed to be as architecture-independent
as possible. Pin uses just-in-time compilation to instrument executables
while they are running; several techniques are used to optimize instrumentation,
such as inlining, register re-allocation, and instruction scheduling. According to
its authors, this fully automated approach delivers significantly better
instrumentation performance than similar tools such as DynamoRIO and Valgrind.
Another interesting feature of Pin is the possibility to attach to a running
process, instrument it, collect profiles, and eventually detach. Pin preserves the
original behaviour of the analyzed application by providing instrumentation
transparency: the application observes the same addresses and values as it would
in a native execution, which makes the information collected by instrumentation
more accurate and relevant. At the highest level of its architecture, Pin consists
of a virtual machine, a code cache, and an instrumentation API; the virtual
machine consists of a JIT compiler, an emulator (for instructions that cannot be
executed directly, such as system calls), and a dispatcher. When an instrumented
program is running, three binary programs are present in memory: the application,
Pin, and the pintool.
1.3.6 gcc
The GNU C Compiler [27] includes an experimental and somewhat obscure feature:
the -finstrument-functions option. This feature was originally implemented
by Cygnus Solutions and makes the compiler insert calls to two special
user-defined functions at the entry and exit of every function; the only
information provided to the user is a pair of parameters containing the addresses
of the routine and of the call site. These two functions can be used to perform
compiler-assisted instrumentation of any C/C++ application on Linux, achieving a
good degree of performance while maintaining full control of program execution.
Profiling tools can be implemented as dynamic libraries to be linked to programs
compiled with the above-mentioned option. Clearly, the developer of a tool has to
deal manually with aspects such as program launch and termination, multithreading,
etc.; the only functionality provided by gcc is the interception of routine enter
and exit events. Although this may seem discouraging at first glance, this thesis
shows that it is possible to write an efficient gcc-based profiler that handles
these aspects in a limited number of lines of code.
1.4 Conclusions
In this chapter we have provided an overview of general performance profiling
techniques, showing the importance of dynamic program analysis tools for
pinpointing and optimizing the hot spots that dominate the execution time.
Calling context trees have been presented as the main data structure for
compactly representing the context-sensitive profiling information gathered
during the execution of a program. Several instrumentation frameworks are
available to developers nowadays: in particular, Pin can be used to quickly
implement analysis tools on several platforms, while the GNU C Compiler offers a
compiler-assisted instrumentation feature for developing highly tuned tools with
low overhead.
Chapter 2
Mining hot calling contexts
In this chapter we will first discuss some state-of-the-art techniques for
context-sensitive profiling. We will then present the frequent items problem in
the data streaming computational model and two well-known space-efficient
algorithms that solve it. Finally, we will introduce our novel space-efficient
approach, discussing the main choices behind its design and its relation to
other techniques.
2.1 Previous approaches
In the previous chapter we highlighted two relevant issues for context-sensitive
profiling: the exhaustive approach to the construction of the CCT introduces
considerable slowdown even when using efficient instrumentation mechanisms, and
CCTs may be very large and difficult to analyze in several applications. In this
section we will discuss some of the most relevant approaches that have been
proposed in the literature over the past years.
Early approaches. The utility of calling context information was already clear
to the authors of gprof: the tool approximates context-sensitive profiles by
associating procedure timings with caller-callee pairs rather than with single
procedures. This level of context sensitivity, however, may yield severe
inaccuracies [26, 30]. Since exhaustive instrumentation can lead to a large slowdown,
Bernat and Miller [6] proposed generating path profiles that include only the
methods of interest in order to reduce overhead. A very popular approach is based
on sampling [3, 13, 16, 31]: the run-time stack is periodically sampled and the
context is reconstructed through stack-walking. For call-intensive programs,
sample-driven
stack-walking can be significantly (i.e., at least one order of magnitude) faster
than exhaustive instrumentation. This approach, however, can lead to a signif-
icant loss of accuracy with respect to the exhaustive construction of the CCT:
results may be highly inconsistent in different executions, and sampling guaran-
tees neither high coverage [7] nor accuracy of performance metrics [32].
Adaptive bursting. Several works explore the combination of sampling with
bursting [2, 17, 32]. Most recently, Zhuang et al. suggested performing
stack-walking followed by a burst during which the profiler traces every routine
enter and exit event: experimental results show that adaptive bursting can yield
much more accurate results than traditional sample-based stack-walking1.
While static bursting collects a highly accurate profile, it can still cause
significant performance degradation: the authors propose an adaptive mechanism
that dynamically disables profile collection for previously observed calling
contexts and periodically re-enables it according to a re-enablement ratio
parameter, which reflects the trade-off between accuracy and overhead. Their
profiler is based on the Java Virtual Machine Profiler Interface (JVMPI), and
their experiments used a sampling rate of 10 ms and a burst length of 0.2 ms.
Reducing space. A few previous works have addressed techniques to reduce
profile data (or at least the amount of data presented to the user) in
context-sensitive profiling. Bernat and Miller [6] suggested letting the user
choose a subset of routines to be analyzed. More recently, Bond and McKinley [7]
introduced a new approach called probabilistic calling context that continuously
maintains a probabilistically unique value representing the current calling
context. The proposed representation is extremely compact (just a 32-bit value
per context) and, according to the authors, the average overhead introduced in
the Java virtual machine is around 3%. This approach is efficient and accurate
enough for tasks such as residual testing, bug detection, and intrusion
detection. However, it
1According to the authors, a low-overhead solution using sampled stack-walking alone is less
than 50% accurate, based on degree of overlap with a complete CCT.
cannot be used for profiling with the purpose of understanding and improving
application performance, where it is crucial to maintain for each context the
sequence of active routine calls along with performance metrics. Moreover, Bond
and McKinley target applications where coverage of both hot and cold contexts
is necessary; this is not the case in performance analysis, where identifying a few
hot contexts is typically sufficient to guide code optimization.
2.2 Frequent items in data streams
In recent years, much effort has been devoted to designing algorithms that analyze massive data streams on a near-real-time basis; in these streams, input data often arrive at a high rate and cannot be stored due to their possibly unbounded size [23]. This line of research has been mainly motivated
by networking and database applications, in which streams may be very long
and stream items may also be drawn from a very large universe. Streaming
algorithms are designed to address these requirements, providing approximate
answers in contexts where it is impossible to obtain an exact solution using only
limited space.
These techniques can be adapted to analyze profiling information gathered during a program's execution. In fact, the stream of routine enter and exit events collected by a profiler is a good example of a massive data stream: items belong to a possibly large universe (consider an application composed of many small routines frequently invoked from different call sites) and the stream itself is very long. Moreover, in order to keep the introduced slowdown under control, the timing requirements for processing each item are strict.
The Frequent Items (a.k.a. heavy hitters) problem [22] has been extensively studied in data streaming computational models. Given a frequency threshold φ ∈ [0, 1] and a data stream of length N, the problem (in its simplest formulation) is to find all items that appear more than ⌊φN⌋ times, i.e., whose frequency exceeds ⌊φN⌋. For instance, for φ = 0.01 the problem is to find all items that appear at least 1% of the time. It can be proved that any algorithm providing an exact solution requires Ω(N) space [23]. Therefore, research has focused on solving an approximate version of the problem [20, 22]:
Definition 1 (ε-Deficient Frequent Items problem). Given two parameters φ, ε ∈ [0, 1], with ε < φ, return all items with frequency ≥ ⌊φN⌋ and no item with frequency ≤ ⌊(φ − ε)N⌋. The absolute difference between estimated and true frequency is at most εN.
In the approximate solution, false negatives cannot exist: all frequent items are correctly reported. Some false positives are allowed instead; however, their real frequency is guaranteed to be at most εN below the threshold ⌊φN⌋. Many
different algorithms for computing (φ, ε)-heavy hitters have been proposed in the
literature in the last ten years. According to extensive experimental studies [19], counter-based algorithms offer better performance and ease of implementation than sketch-based algorithms. Counter-based algorithms track a subset of items
from the input and monitor counts associated with them; sketch-based algorithms
act as general data stream summaries, and can be used for other types of approx-
imate statistical analysis of the data stream, apart from being used to find the
frequent items.
In a preliminary analysis we considered three counter-based algorithms: Space Saving, Sticky Sampling, and Lossy Counting. Space Saving [22] and Lossy Counting [20] are deterministic and use 1/ε and (1/ε)·log(εN) entries, respectively. Sticky Sampling [20] is probabilistic: it fails to produce the correct answer with a small probability, say δ, and uses at most (2/ε)·log(φ⁻¹δ⁻¹) entries in its data structure. In our experiments, Sticky Sampling consistently proved to be less efficient and accurate than its competitors, so we will not mention it any further.
2.2.1 Space Saving
Space Saving [22] monitors a set of ⌈1/ε⌉ = M triples of the form (item, count_i, ε_i), initialized respectively with the first ⌈1/ε⌉ distinct items, their exact counts, and zero. After the initialization phase, when an item q is observed in the stream, the update operation works as follows:
• if q is monitored, the corresponding counter is incremented;

• if q is not monitored, the (item, count_i, ε_i) triple with the smallest count is chosen as a victim: its item is replaced with q and its count is incremented. Moreover, ε_i is set to the count value of the victim triple: it measures the maximum possible estimation error.
The update time is bounded by the cost of the dictionary operation that checks whether an item is monitored. Heavy-hitter queries are answered by returning the monitored entries such that count > ⌈φN⌉. Space Saving has the following properties [22]:
1. The minimum counter value min is no greater than ⌊εN⌋;

2. For any monitored element e_i, 0 ≤ ε_i ≤ min, that is, f_i ≤ (f_i + ε_i) = count_i ≤ f_i + min. In other words, count_i overestimates the true frequency f_i by at most min;

3. Any element whose true frequency is greater than min is monitored;

4. The algorithm uses min(|A|, ⌈1/ε⌉) counters, where |A| is the cardinality of the universe from which the items are drawn. This can be proved to be optimal;

5. Space Saving solves the ε-Deficient Frequent Items problem, reporting all elements with frequency greater than ⌈φN⌉ such that no reported element has true frequency less than ⌈(φ − ε)N⌉;

6. Items whose true frequency is greater than ⌈(φ − ε)N⌉ but less than ⌈φN⌉ could be output: these are the so-called false positives.
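The update rule can be sketched in C as follows. This is a minimal sketch under simplifying assumptions: a tiny fixed M, a linear scan in place of the constant-time dictionary and Stream-Summary structure of [22], and illustrative names such as `ss_update` and `ss_entry_t`.

```c
#include <assert.h>

#define M 3  /* number of monitored triples, i.e. ceil(1/eps); tiny for illustration */

typedef struct {
    long item;           /* monitored item (-1 marks an unused slot) */
    unsigned long count; /* estimated frequency */
    unsigned long eps;   /* maximum possible overestimation */
} ss_entry_t;

/* Process one stream item q with the Space Saving update rule: increment
   q's counter if it is monitored, otherwise replace the minimum entry and
   remember the victim's count as the error bound for q. */
static void ss_update(ss_entry_t *t, long q) {
    int i, min_i = 0;
    for (i = 0; i < M; i++) {
        if (t[i].item == q) { t[i].count++; return; }
        if (t[i].count < t[min_i].count) min_i = i;
    }
    t[min_i].eps = t[min_i].count;  /* victim's count bounds the error */
    t[min_i].item = q;
    t[min_i].count++;
}
```

During the startup phase the empty slots (count 0) play the role of minimum entries, so the first distinct items are simply installed with exact counts and ε_i = 0.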
2.2.2 Lossy Counting
Lossy Counting [20] maintains a set M of (item, count, ∆) triples, where count represents the exact frequency of the item since it was last inserted in M and ∆ is the maximum possible underestimation in count: the algorithm guarantees that at any time the true frequency of a monitored item is at most count + ∆. While Space Saving overestimates counters, Lossy Counting takes the opposite approach.
The incoming data stream is conceptually divided into buckets of width ⌈1/ε⌉. During the i-th bucket, if the observed item already exists in M, the corresponding count is incremented; otherwise, a new entry is inserted with count = 1 and ∆ = i − 1. At the end of each bucket, M is pruned by deleting all triples such that (count + ∆) ≤ i. Heavy-hitter queries are answered by returning the entries in M such that count ≥ ⌊(φ − ε)N⌋. Lossy Counting has the following properties [20]:
1. Lossy Counting solves the ε-Deficient Frequent Items problem, reporting all elements with frequency greater than φN such that no reported element has true frequency less than (φ − ε)N;

2. Estimated frequencies are less than the true frequencies by at most εN, i.e., count ≤ f_e ≤ count + εN;

3. If an element e does not appear in M, then f_e ≤ εN;

4. Lossy Counting computes an ε-deficient synopsis using at most (1/ε)·log(εN) entries. However, if we assume that in real-world datasets elements with very low frequency (at most εN/2) tend to occur more or less uniformly at random, then Lossy Counting requires no more than 7/ε space in the worst case.
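The bucket scheme above can be sketched in C. This is a minimal sketch, not the thesis implementation: an unordered linked list stands in for the real data structure, `W` plays the role of the bucket width ⌈1/ε⌉, and all names are illustrative.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct lc_entry {
    long item;
    unsigned long count;  /* exact count since insertion */
    unsigned long delta;  /* maximum possible undercount */
    struct lc_entry *next;
} lc_entry_t;

static lc_entry_t *monitored = NULL;  /* unordered list of monitored items */
static unsigned long N = 0;           /* stream length so far */
static const unsigned long W = 4;     /* bucket width, ceil(1/eps); tiny here */

/* Process one stream item; prune the list at every bucket boundary. */
static void lc_update(long item) {
    lc_entry_t *e, **p;
    N++;
    for (e = monitored; e; e = e->next)
        if (e->item == item) { e->count++; break; }
    if (!e) {  /* not monitored: insert with delta = current bucket - 1 */
        e = malloc(sizeof *e);
        e->item = item; e->count = 1; e->delta = (N - 1) / W;
        e->next = monitored; monitored = e;
    }
    if (N % W == 0) {  /* end of bucket b = N/W: delete low-count entries */
        unsigned long b = N / W;
        for (p = &monitored; *p; )
            if ((*p)->count + (*p)->delta <= b) {
                lc_entry_t *dead = *p; *p = dead->next; free(dead);
            } else
                p = &(*p)->next;
    }
}
```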
2.3 HCCT: Hot Calling Context Tree
As suggested in Section 2.2, the execution trace of routine invocations and terminations can be naturally regarded as a stream of items. Each item is a triple of the form (routine address, call site, event type). Table 2.1 reports essential information about some well-known Linux applications. It is easy to notice that the number of nodes of the call graph (i.e., the number of distinct routines) is small compared to the stream length (i.e., the number of nodes of the call tree), even for large-scale applications such as gimp, inkscape, and those belonging to the OpenOffice suite. Hence, a non-contextual profiler can maintain a hash table of size proportional to the number of distinct routines, using routine names as hash keys to efficiently update the corresponding metrics.
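As a sketch, such a non-contextual (vertex) profiler might look like this in C: a hash table keyed by routine address with linear probing. All names (`count_call`, `TABLE_SIZE`) and the probing scheme are illustrative assumptions, not the thesis code.

```c
#include <assert.h>

#define TABLE_SIZE 1024  /* illustrative; a real profiler sizes this to the call graph */

typedef struct {
    void *routine;        /* routine address used as hash key */
    unsigned long calls;  /* execution frequency metric */
} slot_t;

static slot_t table[TABLE_SIZE];

/* Record one invocation of a routine: hash its address and
   resolve collisions by linear probing (no resizing). */
static void count_call(void *routine) {
    unsigned long h = ((unsigned long)routine >> 4) % TABLE_SIZE;
    while (table[h].routine && table[h].routine != routine)
        h = (h + 1) % TABLE_SIZE;
    table[h].routine = routine;
    table[h].calls++;
}
```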
In the case of contextual profiling, the number of distinct calling contexts
(i.e., the number of nodes of the CCT) is too large: hashing would be inefficient,
and for some applications it would even be impossible to maintain the tree in
main memory. Consider, for instance, oocalc: the optimistic assumption that
Application     |Call graph|   Call sites        |CCT|   |Call tree|
amarok              13 754      113 362     13 794 470   991 112 563
ark                  9 933       76 547      8 171 612   216 881 324
audacity             6 895       79 656     13 131 115   924 534 168
bluefish             5 211       64 239      7 274 132   248 162 281
dolphin             10 744       84 152     11 667 974   390 134 028
firefox              6 756      145 883     30 294 063   625 133 218
gedit                5 063       57 774      4 183 946   407 906 721
ghex2                3 816       39 714      1 868 555    80 988 952
gimp                 5 146       93 372     26 107 261   805 947 134
gnome-sudoku         5 340       49 885      2 794 177   325 944 813
gwenview            11 436       86 609      9 987 922   494 753 038
inkscape             6 454       89 590     13 896 175   675 915 815
oocalc              30 807      394 913     48 310 585   551 472 065
ooimpress           16 980      256 848     43 068 214   730 115 446
oowriter            17 012      253 713     41 395 182   563 763 684
pidgin               7 195       80 028     10 743 073   404 787 763
quanta              13 263      113 850     27 426 654   602 409 403
vlc                  5 692       47 481      3 295 907   125 436 877

Table 2.1: Number of nodes of the call graph, call tree, and calling context tree, and number of distinct call sites for different applications.
each CCT node requires 20 bytes2 already results in almost 1 GB just to store a CCT with 48 million nodes for a usage session of a few minutes of the spreadsheet component of the OpenOffice suite. As we will report in Chapter 4, the CCTs
associated with some benchmarks of the SPEC CPU2006 suite cannot be kept
in main memory since the 3 GB user space memory limit for data is reached in
a few minutes. Motivated by the fact that execution traces are typically very
long (i.e., hundreds of millions of routine enter and exit events) and their items
(i.e., calling contexts) are taken from a large universe, we cast the problem of
identifying the most frequent contexts into a data streaming setting.
The choice to focus on most frequent contexts is motivated by several reasons.
2Previous works use larger nodes [1, 30].
Scalability is a key issue: even the approach based on an extremely compact representation (32 bits per context) proposed by Bond and McKinley [7] can run out of memory on large traces with tens of millions of contexts. Probabilistic
calling context was not designed for profiling with the purpose of understanding
and improving application performance, where it is crucial to maintain for each
context the sequence of concurrently active routine calls along with performance
metrics. In this scenario, only the most frequent contexts are of interest, since
they represent the hot spots to which optimizations must be directed according
to the previously mentioned 90-10 rule [15]. In fact, the cumulative distribution
of calling context frequencies shown in Figure 2.1 is extremely skewed: about
90% of the total number of calls takes place in the top 10% of calling contexts
ordered by decreasing frequencies (i.e., from the hottest to the coldest). This
skewness suggests that space could be greatly reduced by keeping information
about hot contexts only and discarding on the fly contexts that are likely to be
cold. As observed in [32]: “Accurately collecting information about hot edges
may be more useful than accurately constructing an entire CCT that includes
rarely called paths.”
2.3.1 Approach
In this thesis we define a novel run-time data structure, called Hot Calling Context
Tree (HCCT), that compactly represents all the hot calling contexts encountered
during a program’s execution. The HCCT is a subtree of the CCT that includes
only hot nodes and their ancestors3 and offers an additional intermediate point in
the spectrum of data structures for representing interprocedural control flow (see
Section 1.2). For each hot calling context, the HCCT also maintains an estimate
of its performance metrics (in this thesis we focus on frequency counts, although
our approach can be easily extended to arbitrary metrics such as execution time
and cache misses).
Let N be the number of calling contexts encountered during a program’s
execution. Clearly, N equals the number of routine invocations in the execution
trace, the number of nodes of the call tree, as well as the sum of the frequency
3Ancestors are needed to reconstruct the calling context starting from the root node.
[Figure 2.1 about here: cumulative distribution plot. x-axis: % of hottest calling contexts; y-axis: % of the total number of calls (degree of overlap with full CCT); curves for amarok, audacity, firefox, gedit, oocalc, pidgin, quanta, vlc.]

Figure 2.1: Skewness of calling contexts distribution.
counts of CCT nodes. Given a frequency threshold parameter φ ∈ [0, 1], we classify a calling context as hot if the frequency count of the corresponding CCT node is greater than ⌊φN⌋; otherwise, the context is considered cold. We define the Hot Calling Context Tree as the unique subtree of the CCT obtained by pruning all cold nodes that are not ancestors of a hot node. It is worth noticing that all the leaves of the HCCT are necessarily hot, while the root and intermediate nodes can correspond to either hot or cold contexts. In graph theory, the HCCT corresponds to the Steiner tree of the CCT with the hot nodes used as terminals.
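Under this definition, the HCCT can be obtained from a full CCT by a bottom-up marking pass: a node survives if and only if it is hot or has a hot descendant. A minimal sketch over a first-child, next-sibling tree (illustrative node layout and names, not the thesis data structure):

```c
#include <assert.h>

typedef struct node {
    unsigned long count;        /* frequency count of the calling context */
    struct node *first_child;
    struct node *next_sibling;
    int keep;                   /* set if node is hot or has a hot descendant */
} node_t;

/* Mark the HCCT subtree: a node is kept when its count exceeds the
   hotness threshold or when any descendant is kept (Steiner-tree
   closure under ancestors). Returns the node's keep flag. */
static int mark_hcct(node_t *n, unsigned long threshold) {
    int keep = n->count > threshold;
    for (node_t *c = n->first_child; c; c = c->next_sibling)
        keep |= mark_hcct(c, threshold);
    return n->keep = keep;
}
```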
According to the space lower bound for the heavy hitters problem [23], the HCCT cannot be computed exactly in space smaller than that required to store the entire CCT. Hence, we relax the problem and compute an approximation of the HCCT, which we denote by (φ, ε)-HCCT. The parameters ε and φ control the degree of approximation and have the same meaning as in the ε-Deficient Frequent Items problem discussed in Section 2.2. The (φ, ε)-HCCT contains all
[Figure 2.2 about here: three example trees with per-node frequency counts; the false positives are marked in (c).]

Figure 2.2: (a) CCT; (b) HCCT; and (c) (φ, ε)-HCCT. Hot nodes are darker. The number close to each node is the frequency count of the corresponding calling context. In this example N = 581, φ = 1/10, and ε = 1/30: the approximate HCCT includes all contexts with frequency ≥ ⌊φN⌋ = 58 and no context with frequency ≤ ⌊(φ − ε)N⌋ = 38.
hot nodes (true positives), but may also contain some cold nodes without hot descendants (false positives). The true frequency of these false positives, however, is guaranteed to be at most εN far from ⌊φN⌋: hence, it is still close to the hotness threshold. Differently from the HCCT, a (φ, ε)-HCCT is not uniquely defined as a subtree of the CCT, since the set of spanned (φ, ε)-heavy hitters is not unique: depending on the behaviour of the adopted algorithm, nodes with frequency smaller than ⌊φN⌋ and larger than ⌊(φ − ε)N⌋ may or may not be included in such a set. Figure 2.2 provides a graphical comparison of the three tree structures.
With the help of the Venn diagram in Figure 2.3, some important relations
can be identified:
• H ⊆ A, where H is the set of hot contexts and A is the set of (φ, ε)-heavy hitters. Nodes in A \ H are false positives.
• H⊆HCCT. Nodes in HCCT \H are cold nodes that have a hot descendant
in H, hence must be kept in the HCCT.
Figure 2.3: Tree data structures and calling contexts classification.
• A⊆ (φ, ε)-HCCT. Nodes in (φ, ε)-HCCT \A are cold nodes that have a
descendant in A, hence must be kept in the (φ, ε)-HCCT.
• HCCT⊆ (φ, ε)-HCCT, as implied by the previous inclusions. Both of them
are connected subtrees of the full CCT.
In Figure 2.3 two additional sets are shown: the set M of monitored nodes
and the subtree MCCT spanning all nodes in M. The counter-based streaming
algorithms used to compute the set A of (φ, ε)-heavy hitters monitor a slightly
larger set M⊇A. When a most frequent elements query is issued, these algo-
rithms extract A from M and return it. To keep track of the context-sensitive
information, we maintain the subtree MCCT of the CCT consisting of the nodes
in M and of all their ancestors (hence, MCCT⊇ (φ, ε)-HCCT). At query time,
the MCCT is appropriately pruned and the (φ, ε)-HCCT is returned.
2.3.2 Update and query
The set M of monitored contexts must be updated at each function call; more-
over, when M is changed, the MCCT needs to be brought up to date as well.
To describe how this happens, we introduce two abstract main functions to be
provided by the interface of the streaming algorithm:
update(x, M) → v: given a calling context x, update M to reflect the new occurrence of x in the stream. If x was already monitored in M, its frequency count is increased by one; otherwise, x is added to M and a victim context v may be evicted from M and returned.
query(M)→ A: remove low-frequency items from M and return the subset A of
(φ, ε)-heavy hitters (see Section 2.3.1 and Figure 2.3).
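As a sketch, the two abstract operations can be exposed behind a small plug-in interface so that different streaming algorithms are interchangeable. The names (`stream_algo_t`, `dummy_update`) are hypothetical, and the trivial single-counter "algorithm" below exists only to exercise the interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plug-in interface mirroring update(x,M) -> v and query(M) -> A.
   Contexts are modeled as long ids for simplicity. */
typedef struct {
    long (*update)(long x);                  /* returns evicted victim, or -1 */
    size_t (*query)(long *out, size_t cap);  /* fills out with heavy hitters  */
} stream_algo_t;

/* A trivial "algorithm" that monitors a single item. */
static long the_item = -1;
static unsigned long the_count = 0;

static long dummy_update(long x) {
    long victim = (x == the_item || the_item == -1) ? -1 : the_item;
    if (x != the_item) { the_item = x; the_count = 0; }
    the_count++;
    return victim;
}

static size_t dummy_query(long *out, size_t cap) {
    if (cap && the_item != -1) { out[0] = the_item; return 1; }
    return 0;
}
```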
Details on the implementation of update and query depend on the specific
streaming algorithm. It is worth mentioning that for Lossy Counting the update
operation does not return a victim since the pruning is performed on a periodical
basis, while for Space Saving a victim is evicted from M every time a new item
is added to M after the initial startup phase4.
During the construction of the MCCT, a pointer to the current calling context is maintained, and a new node is created if the current context x is not already in the tree. The MCCT is pruned periodically in the case of Lossy Counting, or according to the victim context returned by the update operation in the case of Space Saving; for both algorithms, the pseudocode of the pruning procedure is the same and is reported in Listing 2.1.
It is important to notice that in order to keep the tree connected, nodes can
be removed from the MCCT only if they are leaves. An internal node with a
hot descendant can be removed instead only from the set of monitored elements.
Note that removing a victim might expose a path of unmonitored ancestors that
no longer have descendants in M: these nodes are pruned as well in a bottom-
up recursive manner. The current context x is never removed from the MCCT:
this guarantees that no node in the path from the root of the tree to x will be
removed, since at least one of these nodes will be needed to represent the next
calling context to be observed in the execution.
v = update(x, M)
while (v != x && v is a leaf in MCCT && v is not in M) do
    remove v from MCCT
    v = parent(v)

Listing 2.1: prune(x, MCCT)
4Startup phase ends when the number of monitored elements reaches d 1εe.
The (φ, ε)-HCCT can be computed from the MCCT in an analogous way.
The query operation invoked on M returns the support A of the (φ, ε)-HCCT:
pruning is performed on all MCCT nodes that have no descendant in A, following bottom-up path traversals as in Listing 2.1.
2.3.3 Discussion
The space required to store the heavy hitters in the data structure M depends on
the specific streaming algorithm in use, and in both cases is roughly proportional
to 1/ε (see Section 2.2 for the exact bounds). An appropriate choice of ε appears
to be crucial for the effectiveness of our approach: smaller values of ε guarantee more accurate results (more precise counters and possibly fewer false positives), but imply a larger memory footprint. The high skewness of the context frequency distribution guarantees the existence of very convenient trade-offs between accuracy and memory usage, as we will see in Chapter 4.
The space required by the MCCT dominates the space required by M, since
some of the ancestors of contexts monitored in M may be cold contexts without a
corresponding entry in M. The number of cold ancestors depends on the structure
of the CCT corresponding to the particular execution of a computer program
and on the properties of the stream of events associated with the execution trace:
for this reason, it cannot be analyzed theoretically. Experimental results (see Chapter 4) show that this amount is negligible with respect to |M|.

Performing updates of the MCCT in a very short time is crucial for the effectiveness of our approach. As discussed in Section 2.3, hashing is inefficient when the number of distinct contexts is large. Our implementation requires constant time for the streaming update operation and hinges upon very simple and cache-efficient data structures. We will discuss these aspects in Chapter 3.
Our approach differs from previously proposed solutions such as, e.g., adaptive
bursting [32], in the capability to automatically adapt to the case where the hot
calling contexts vary over time. This peculiarity is achieved by exploiting the
characteristics of data streaming algorithms for frequent items: these algorithms
are able to remove from the set of monitored elements those contexts that are
losing their popularity over time, while being gradually replaced by contexts
growing more popular. Hence, heavy-hitter queries can be issued at any point in time (online querying), and the algorithms will always be able to return the right set of hot contexts up to that time.
Our solution is orthogonal to previous techniques such as sample-based stack
walking and adaptive bursting (see Section 2.1), which focused on reducing the
overhead introduced by instrumentation without explicitly taking into account
intensive memory usage and its implications (e.g., performance degradation in-
duced by poor access locality). Moreover, as we will see in Chapter 4, our approach enables the profiling of large-scale applications that may be difficult or sometimes impossible5 to analyze with traditional context-sensitive profiling techniques. Finally, it can be extended to support arbitrary performance metrics, exploiting the ability of some streaming algorithms to compute heavy hitters over sets of items with different weights.
2.4 Conclusions
In this chapter we have seen that state-of-the-art techniques for context-sensitive profiling do not explicitly take space-efficiency issues into account, while the Calling Context Trees built for modern applications may be very large and, in several cases, cannot be kept in main memory. Our approach takes advantage of efficient
algorithms from the data streaming computational model in order to focus the
analysis on relevant contexts only, building a subtree of the Calling Context Tree
called HCCT (Hot Calling Context Tree) while discarding on-the-fly contexts that
are likely to be cold. Since our solution is orthogonal to techniques that have
been proposed for reducing instrumentation overhead, it can be easily integrated
with them in order to keep the introduced slowdown under control.
5When the space needed for the exhaustive construction of the CCT exceeds the 3 GB user
space memory limit of the virtual address space on 32-bit Linux machines.
Chapter 3
Implementation
In this chapter we will discuss the implementation of the HCCT construction
algorithm. In particular, we will present two variants, based on Lossy Counting and Space Saving [22] respectively, cast in a common framework into which different streaming algorithms can also be plugged.
We use a first-child, next-sibling representation for trees since it is very space-
efficient and still guarantees that the children of each node can be explored in
time proportional to their number. Each MCCT node also contains a pointer
to its parent (to make the prune operation more efficient - see Listing 2.1), the
routine ID, the call site, and the performance metrics. The parent field is not
required in CCT nodes since no pruning is needed. As routine ID, we use the
routine address. On 32-bit architectures, CCT and MCCT nodes require 20
and 24 bytes, respectively. Moreover, since call sites can be represented with
the 16 LSBs of the address, 2 bytes can be used to store additional information.
Actually, we use this space - without increasing the number of bytes per node - to
encode a Boolean flag that tells if the calling context associated with the node is
currently monitored by the streaming algorithm. According to our experiments,
the average number of scanned children for a node is a small constant around
2-3: hence the first-child, next-sibling representation turns out to be convenient
also for checking whether a (routine ID, call site) pair already appears among
the children of a node. To improve time and space efficiency, we allocate nodes through a custom page-based allocator, which maintains blocks of fixed size. Any additional algorithm-specific field is stored as a trailing field.
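The allocator just mentioned can be sketched as a bump allocator over fixed-size pages. This is a sketch under assumptions: `PAGE_NODES`, `pool_t`, and `pool_alloc` are illustrative names, and deallocation and error handling are omitted.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_NODES 1024  /* nodes per page; illustrative */

typedef struct page { struct page *next; } page_t;

typedef struct {
    page_t *pages;     /* list of allocated pages */
    char   *free_ptr;  /* next free node in the current page */
    size_t  left;      /* nodes left in the current page */
    size_t  node_size; /* fixed node size in bytes */
} pool_t;

/* Fixed-size bump allocation: nodes are carved sequentially from pages,
   so consecutive allocations are contiguous and cache-friendly. */
static void *pool_alloc(pool_t *p) {
    if (p->left == 0) {  /* current page exhausted: grab a new one */
        page_t *pg = malloc(sizeof(page_t) + PAGE_NODES * p->node_size);
        pg->next = p->pages;
        p->pages = pg;
        p->free_ptr = (char *)(pg + 1);
        p->left = PAGE_NODES;
    }
    p->left--;
    void *n = p->free_ptr;
    p->free_ptr += p->node_size;
    return n;
}
```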
3.1 Space Saving (BSS and LSS)
The authors of Space Saving propose [22] a data structure called Stream-Summary that uses an ordered doubly linked list of buckets and a collection of linked lists of counters. All elements with the same counter value are linked together in a list; every node in the list points to the parent bucket in which the count value is stored. Every bucket points to the first element of its linked list of counters, and buckets are kept in a doubly linked list, sorted by their values.
Initially, M empty counters are created and attached to a single parent bucket with value 0. When an element of the stream is processed, the SS lookup routine verifies whether the element is already monitored and returns a pointer to its counter. If the element is not monitored, the SS update routine selects a victim counter for replacement as described in Section 2.2.1. Finally, the increment counter operation links the counter to the proper bucket. In particular, it first checks the neighboring bucket with the next larger value: if its value equals the new value of the element, the element is detached from its current list and inserted as the first element of the neighbor's list; otherwise, a new bucket with the correct value is created and attached to the ordered bucket list, and the element is attached to the new bucket. The old bucket is deleted if its child list is now empty.
In addition to the implementation described above, we devised a more efficient
variant based on a lazy priority queue. We will refer to our implementation
as Lazy Space Saving (LSS) and to the one proposed by the authors of Space
Saving as Bucket Space Saving (BSS). In both implementations, the dictionary
operation needed by the update(x,M) procedure (see Section 2.3.2) is avoided
by maintaining a cursor pointer to the current calling context x and using it to
directly access the monitored flag of the MCCT node associated with x.
Our approach is based on a lazy priority queue that supports two operations, find-min and increment, which return the item with minimum count and increment a counter, respectively. We use an unordered array M of size 1/ε, where each array entry contains a pointer to a node of the MCCT. We also maintain, in a lazy fashion, the value min of the minimum counter and the smallest index min-idx of an array entry that points to a monitored node with counter equal
to min. It is important to notice that more than one monitored node may have
count equal to min.
The increment operation does not change M, since counters are stored di-
rectly inside MCCT nodes. However, min and min-idx may become temporarily
out of date after an increment: this is the reason why we call the approach lazy.
while (M[min-idx].counter != min and min-idx <= |M|) do
    min-idx = min-idx + 1
if (min-idx > |M|) then
    min = minimum counter value in M
    min-idx = smallest index j s.t. M[j].counter = min
return min

Listing 3.1: find-min()
The find-min operation, described in Listing 3.1, restores the invariant property on min and min-idx: it finds the next index in M whose counter equals min. If no such index exists (i.e., min-idx > |M|), it completely rescans M in order to find the new min value and its corresponding min-idx. In practice, several strategies are possible: an elegant implementation that proved to be very efficient is reported in Listing 3.2.
while (queue[++min_idx]->counter > min);
if (min_idx < epsilon) return;

unsigned long i;
min = queue[second_min_idx]->counter;
for (min_idx = 0, i = 0; i < epsilon; ++i)
    if (queue[i]->counter < min) {
        second_min_idx = min_idx;
        min_idx = i;
        min = queue[i]->counter;
    }
queue[epsilon]->counter = min;  /* sentinel */

Listing 3.2: Implementation of find-min() with sentinel and second_min_idx
Here follows an important theoretical result on the performance of LSS:

Lemma 1. After a find-min query, the lazy priority queue correctly returns the minimum counter value in O(1) amortized time.

Proof. Counters are never decremented. Hence, at any time, if a monitored item with counter equal to min exists, it must be found at a position larger than or equal to min-idx. This yields the correctness of the algorithm.

To analyze the running time, let ∆ be the value of min after k find-min and increment operations. Since there are |M| counters ≥ ∆, counters are initialized to 0, and each increment operation adds 1 to the value of a single counter, it must be that k ≥ |M|∆. For each distinct value assumed by min, the array is scanned at most twice. We therefore have at most 2∆ array scans, each of length |M|, so the total cost of find-min operations throughout the whole sequence is upper bounded by 2|M|∆. It follows that the amortized cost is (2|M|∆)/k ≤ 2, hence O(1) amortized time. □
Our highly tuned implementation of LSS is more efficient in practice than the one proposed by the authors in terms of both performance and memory usage, and might be of independent interest. Note that neither of our implementations stores the ε_i field for monitored elements. This field measures the maximum possible estimation error and is useful to check whether there might be false positives among the heavy hitters output by the algorithm [22]. However, since this check does not provide exact answers, the ε_i field is not stored by default1 in order to save space: in fact, 4 bytes per node would be needed to store the maximum possible overestimation error when a node is inserted into M.
3.2 Lossy Counting (LC)
Our implementation of Lossy Counting [20] maintains an unordered linked list
of monitored contexts. Every node conceptually represents an (item, count, ∆) triple, where count represents the exact frequency of the item since it was last inserted and ∆ is the maximum possible underestimation in count.
1This behaviour can be changed by modifying a macro in the source code of the algorithm.
The pruning operation is performed periodically. When the length of the incoming stream becomes a multiple of the bucket width w (i.e., N mod w == 0), the linked list of monitored items is scanned sequentially: an item is removed from the list if count + ∆ ≤ b_current, where b_current is the index of the bucket in use. In other implementations, the ∆ field is not stored and, during each pruning operation, count is decremented by one, removing the node when count reaches zero. However, this approach loses information on the frequencies of monitored items, hence only an unordered list of heavy hitters can be output.
#define MakeParent(p, bit)   ((long)(p) | (bit))
#define GetParent(x)         ((lc_hcct_node_t*)((x->parent) & (~3)))
#define UnsetMonitored(x)    (MakeParent(GetParent(x), 0))
#define IsMonitored(x)       ((x)->parent & 1)

typedef struct lc_hcct_node_s lc_hcct_node_t;
struct lc_hcct_node_s {
    unsigned long   routine_id;
    unsigned long   counter;
    lc_hcct_node_t *first_child;
    lc_hcct_node_t *next_sibling;
    unsigned short  call_site;
    unsigned long   parent;    /* bit stealing field */
    lc_hcct_node_t *lc_next;   /* linked list of counters */
    unsigned long   delta;
};

Listing 3.3: Tree node structure for Lossy Counting
We modified the original formulation of Lossy Counting in order to take advan-
tage of the tree structure and of the presence of cold ancestors. The bit stealing
technique is used to check whether a node in the tree is currently monitored by
the streaming algorithm: when a context is removed from the set of monitored
elements, the corresponding node is nevertheless kept in the tree when it is not a
leaf (i.e., it is an internal node with a hot descendant). For each non-monitored
cold ancestor we keep track of the values for count and ∆ and, when the node is
inserted into M, the old values are restored and count is simply incremented by
one. We also modified the pruning strategy for the query operation: the deletion
rule proposed by Manku and Motwani [20] removes an element from M if count
< (φ− ε)N, while our implementation takes advantage of maintaining deltas by
modifying the deletion condition into count + ∆ < φN. In our experiments these
changes improved the accuracy of the algorithm, reducing the percentage of false
positives by a factor of 2.
3.3 The pintool
The first version of our profiler is based on Intel Pin (see Section 1.3 and [18]).
We developed a common infrastructure called STool in which different tools can
be plugged in: in particular, we implemented a trace recorder and two online
profilers based on Space Saving and Lossy Counting, respectively. A recorded
timestamped execution trace is fundamental to ensure deterministic replay of
the execution of interactive applications: since the behaviour of the application
changes among different executions, comparing the results from different sessions
might lead to misleading conclusions.
A pintool is conceptually divided into two parts: the instrumentation code,
which decides where analysis routines are inserted and which ones, and the
analysis code, that is, the code executed at the insertion points.
Pin’s just in time (JIT) instrumentation occurs immediately before a code
sequence is executed for the first time. This mode of operation is known as trace
instrumentation and lets the pintool inspect and instrument an executable one
trace at a time. Traces usually begin at the target of a taken branch and end
with an unconditional branch, hence including calls and returns. Pin guarantees
that a trace has an unique entrance, but it may contain multiple exits. A trace is
composed of a sequence of Basic Blocks (BBL). A BBL is a single entrace, single
exit sequence of instructions. Pin offers the developer the possibility to insert a
single analysis call for a BBL, instead of one analysis call for every instruction
(instruction instrumentation mode): obviously, reducing the number of analysis
calls makes the instrumentation more efficient. Basic Blocks are picked out at
execution time, so there might be slight differences with respect to the classical
definition of a BBL for a compiler. Pin also offers other instrumentation modes
such as routine instrumentation and image instrumentation. In this thesis we
adopt the trace instrumentation mode in order to construct the [H]CCT in a
reliable way, by tracking all the invocations and returns from methods that occur
during the execution of a program.
STool has been implemented in C++ in a modular structure and provides an
intermediate layer to write profiling tools on top of Pin’s trace instrumentation.
By using a callback mechanism the instrumentation code becomes transparent
for the developer and is separated from the analysis code. Listing 3.4 reports the
activation paradigm of the infrastructure: multi-threading2 and call site identifi-
cation are natively supported, as well as tracing of routine enter/exit events and
optionally also of memory read/write accesses.
typedef struct {
    int     argc;
    char**  argv;
    size_t  activationRecSize;
    size_t  threadRecSize;
    BOOL    callingSite;
    // callbacks for analysis
    STool_TFiniH    appEnd;
    STool_TThreadH  threadStart;
    STool_TThreadH  threadEnd;
    STool_TRtnH     rtnEnter;
    STool_TRtnH     rtnExit;
    STool_TMemH     memRead;
    STool_TMemH     memWrite;
    STool_TBufH     memBuf;
} STool_TSetup;
Listing 3.4: Activation paradigm of STool
For the construction of the [H]CCT the precision in tracing routine invocations
and terminations is crucial. Distinguishing an invocation from a termination is not
trivial [29], since very often the same control transfer instructions are used in both
scenarios.
2 A separate tree is constructed for each thread.
Consider the following example: A transfers the control to B, C, and
D, then the control returns to A skipping the return instructions in B and C. This
scenario is quite common due to the optimizations introduced by compilers, and
tracking only the rtnEnter and rtnExit events is not enough. A stackwalking
strategy based on frame pointers is not practicable, since most applications and
dynamic libraries are compiled without frame pointers for efficiency reasons. An
alternative approach inspects the opcode of a jump instruction to distinguish
between call and return, but in practice several exceptions to this schema are
found. Another relevant problem is related to the use of a proxy code in order
to transfer the control flow without using a call/return schema: this is the case
of SETJMP/LONGJMP in C and of exception handling in Java. To deal with
these issues, we maintain a copy of the stack, called Shadow Stack, handled by
a method capable of consistently reconstructing the call stack.
Pin provides an API to spawn internal threads transparent to the instru-
mented program. This feature is very useful to realize a timer thread: techniques
such as sampling and bursting (see Section 2.1) can be implemented using a
timer to turn off the profiler and speed up the program's execution. In our trace
recorder we implement a timer thread with a 200 µs granularity in order to have
a recorded timestamped execution trace. Trace recording is highly optimized:
we adopt a double-buffering strategy to reduce the frequency and the number of
I/O operations. In particular, when the main buffer becomes full, the secondary
buffer is empty and their roles are swapped, while an internal thread is spawned
to asynchronously write data from the full buffer to disk. Buffers are organized
in pairs (i.e., a pair for each thread) and the pair reserved for the main thread is
an order of magnitude bigger than the others since most of the routine enter/exit
events take place in the main thread.
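The buffer-swapping logic can be sketched as follows. This toy version (names and the tiny buffer size are illustrative) replaces the asynchronous writer thread with a flush counter, keeping only the swap mechanics:

```c
#define BUF_EVENTS 4   // illustrative; real buffers are much larger

typedef struct {
    unsigned long events[2][BUF_EVENTS];
    int active;        // index of the buffer currently being filled
    int used;          // entries stored in the active buffer
    int flushes;       // stands in for the asynchronous disk writes
} event_buffers;

// Append one event; when the active buffer fills up, swap the roles of
// the two buffers and (conceptually) hand the full one to the writer.
static void record_event(event_buffers* b, unsigned long ev) {
    b->events[b->active][b->used++] = ev;
    if (b->used == BUF_EVENTS) {
        b->active = 1 - b->active;  // roles of the buffers are swapped
        b->used = 0;                // start filling the other buffer
        b->flushes++;               // async write of the full buffer
    }
}
```

In the real recorder the flush would be performed by a spawned internal thread, so event recording never blocks on I/O as long as the secondary buffer drains in time.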
3.4 The trace analyzer
Recorded traces are stored in a compact binary format. Every file starts with
a 52-byte preamble containing a magic number (4 bytes), the offsets of the symbol
and library tables (8+8 bytes), and the number of routine enter/exit and memory
read/write events (8+8+8+8 bytes). The header is followed by a stream of events of
variable length. The length of each entry depends on the type of the event: 7
bytes are needed to represent a routine enter event (1 byte to encode the event
type, 4 bytes for the routine address, and 2 bytes for the call site) while 1 byte
is enough to represent routine exit and timer tick events. Memory access events
are not recorded by our profiler since we are focusing on frequency counts as
performance metrics.
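The variable-length encoding of a routine enter event can be sketched as a round trip, following the layout described above (1 byte for the event type, 4 bytes for the routine address, 2 bytes for the call site); the function names and type-tag values are illustrative:

```c
#include <stdint.h>
#include <string.h>

enum { EV_RTN_ENTER = 1, EV_RTN_EXIT = 2, EV_TIMER_TICK = 3 };

// Pack a routine-enter event into 7 bytes:
// 1-byte type + 4-byte routine address + 2-byte call site.
static size_t encode_enter(uint8_t out[7], uint32_t rtn, uint16_t site) {
    out[0] = EV_RTN_ENTER;
    memcpy(out + 1, &rtn, 4);
    memcpy(out + 5, &site, 2);
    return 7;
}

// Unpack the routine address and call site from a 7-byte entry.
static void decode_enter(const uint8_t in[7],
                         uint32_t* rtn, uint16_t* site) {
    memcpy(rtn, in + 1, 4);
    memcpy(site, in + 5, 2);
}
```

Routine exit and timer tick events would be encoded as the single type byte, which is what makes the stream variable-length.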
We designed and implemented in C a sophisticated trace analyzer in order to
construct the CCT and the (φ, ε)-HCCT in a reproducible and reliable manner.
The main component of the trace analyzer is the driver module, which reads a
recorded trace from disk and produces a stream of events as in a live execution
of a program. This module adopts an internal buffering strategy to deal with
large trace files3 and provides a mechanism to measure the time spent in analysis
subtracting the fraction spent in I/O operations.
The burst module can be configured to simulate techniques such as static
and adaptive bursting (see Section 2.1) and relies on the timer ticks recorded in
the trace. This module also provides a Shadow Stack for stack walking in order
to accurately reconstruct the context and realign the tree when a burst starts.
Finally, it provides an optional support for the adaptive bursting with weight
compensation technique described in [32].
The metrics module plays a key role in the evaluation of accuracy of perfor-
mance metrics: it compares the (φ, ε)-HCCT with the exact CCT and writes the
log to disk. The cct module constructs the CCT, while the modules bss-hcct,
lss-hcct, and lc-hcct implement the streaming algorithms BSS, LSS, and LC,
respectively, for the construction of the (φ, ε)-HCCT. Logs produced by the trace
analyzer use a simple column-based representation and are well-suited for being
processed by graphical visualization tools such as gnuplot.
3.5 The gcc-based profiler
Our online profiling tools based on Pin suffered from an important slowdown while
analyzing several applications. For this reason, we have developed a more effi-
cient version based on gcc, the GNU C compiler [27].
3 Very often larger than 4 GB, hence a special descriptor must be used.
With respect to online instrumentation frameworks such as Pin and Valgrind [24], this approach has
essentially one drawback: the program must be recompiled in order to be analyzed.
However, this initial cost is compensated by more efficient analysis and instrumentation
code: the instrumented program is directly executed by the CPU,
without being translated into an intermediate representation for inserting profiling
code and manipulating the execution.
In order to interfere as little as possible with the instrumented program, we
implemented our gcc-based profiler as a dynamic library. This approach has
an important advantage: analysis code can be modified without the need to
recompile the program. Routine enter and exit events are intercepted
using two special user-defined methods executed just after function entry and
before function exit, respectively. These two methods are automatically inserted
by the compiler into the produced executable when the -finstrument-functions
option is specified. This option was originally implemented by Cygnus Solutions
and has recently been introduced also in other non-C/C++ compilers of the gcc
suite such as gfortran. Listing 3.5 shows the function prototypes for the two
instrumentation methods: the only information provided to the developer consists of
two parameters containing the address of the start of the current function and
the call site.
void __cyg_profile_func_enter(void* this_fn, void* call_site);
void __cyg_profile_func_exit(void* this_fn, void* call_site);
Listing 3.5: Function prototypes for gcc-based instrumentation
This compiler-assisted instrumentation mechanism is flexible: a function may
be given the attribute no_instrument_function, in which case instrumentation will
not be done. The list of functions not to be instrumented can also be specified
through the -finstrument-functions-exclude-file-list option of gcc. This can be
used, for example, for high-priority interrupt routines and any functions from
which the profiling functions cannot safely be called (perhaps signal handlers, if
the profiling routines generate output or allocate memory). Instrumentation is
also done for functions expanded inline in other functions: the profiling calls will
indicate where, conceptually, the inline function is entered and exited.
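A minimal pair of hooks might look as follows; this sketch only maintains a shadow call depth (a stand-in for the real profiler's work), and marks the hooks with no_instrument_function so that, when the program is built with -finstrument-functions, the hooks do not instrument themselves:

```c
static long shadow_depth = 0;

// Called by compiler-inserted code just after every function entry
// (here it can also be invoked manually for illustration).
void __attribute__((no_instrument_function))
__cyg_profile_func_enter(void* this_fn, void* call_site) {
    (void)this_fn; (void)call_site;
    shadow_depth++;   // a real profiler would push onto the shadow stack
}

// Called just before every function exit.
void __attribute__((no_instrument_function))
__cyg_profile_func_exit(void* this_fn, void* call_site) {
    (void)this_fn; (void)call_site;
    shadow_depth--;   // ...and pop the shadow stack here
}
```

The this_fn and call_site arguments are exactly the two pointers described above, so a real implementation can map them to routine and call-site identifiers.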
Since the only functionality provided by this instrumentation framework is
intercepting routine enter and exit events, the developer has to deal manually
with aspects such as program launch/termination and multithreading. In GNU
C/C++, it is possible to declare special attributes constructor and destructor
for functions. The constructor attribute causes the function to be called
automatically before execution enters main(). Similarly, the destructor attribute
causes the function to be called automatically after main() has completed or
exit() has been called. Functions with these attributes are useful for initializing
data that will be used implicitly during the execution of the program (e.g., data
structures for streaming algorithms and MCCT construction).
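A minimal sketch of this mechanism (the flag and the function names are illustrative):

```c
static int profiler_ready = 0;

// Runs automatically before main(): in the profiler this is where the
// tree root and the streaming algorithm's structures would be set up.
static void __attribute__((constructor)) profiler_init(void) {
    profiler_ready = 1;
}

// Runs automatically after main() returns or exit() is called: in the
// profiler this is where the collected tree would be dumped to disk.
static void __attribute__((destructor)) profiler_fini(void) {
    profiler_ready = 0;
}
```

Because the library is loaded before the application starts, the constructor fires even though the application itself never calls into the profiler explicitly.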
Multithreading support requires special attention due to the need for hav-
ing distinct data structures (i.e., trees and sets of monitored elements) for each
thread. The POSIX thread API provides the pthread_getspecific() and
pthread_setspecific() functions to access per-thread data by using a key. Several compilers, instead,
offer non-standard C extensions to support Thread Local Storage (TLS) of ar-
bitrary size. TLS provides significantly better performance than pthread-based
storage [25] and can be used in gcc via the __thread storage class specifier:
this means that, in a multi-threaded application, a unique instance of the variable
is created for each thread that uses it, and destroyed when the thread terminates.
We use __thread variables to store pointers to data structures specific to a thread.
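A minimal sketch of per-thread state with __thread (the worker and counter names are illustrative): every thread that touches thread_events operates on its own instance, so the main thread's copy is never affected by a worker.

```c
#include <pthread.h>
#include <stdint.h>

// One independent instance of this counter exists per thread.
static __thread unsigned long thread_events = 0;

static void* worker(void* arg) {
    unsigned long n = (unsigned long)(uintptr_t)arg;
    for (unsigned long i = 0; i < n; i++)
        thread_events++;           // touches only this thread's copy
    return (void*)(uintptr_t)thread_events;
}

// Spawn a worker, wait for it, and return its private event count.
static unsigned long run_worker(unsigned long n) {
    pthread_t t;
    void* ret = NULL;
    pthread_create(&t, NULL, worker, (void*)(uintptr_t)n);
    pthread_join(t, &ret);
    return (unsigned long)(uintptr_t)ret;
}
```

In the profiler the per-thread variable would hold a pointer to the thread's tree and monitored set rather than a plain counter.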
Thread creation and termination events have to be intercepted in order to
respectively create data structures needed for profiling and save the collected
information to disk. The GNU linker offers the --wrap symbol option to wrap
a particular function: any undefined reference to symbol will be resolved to
__wrap_symbol, while any undefined reference to __real_symbol will be resolved
to symbol (useful to call the original function). Listing 3.6 shows the code used
in our profiler to wrap both pthread_create and pthread_exit.
// handles also thread termination made without pthread_exit
void* __attribute__((no_instrument_function))
aux_pthread_create(void* arg) {
    hcct_init();

    // Retrieve original routine address and argument
    void* orig_arg = ((void**)arg)[1];
    void* (*start_routine)(void*) = ((void**)arg)[0];
    free(arg);

    // Run actual application thread's init routine
    void* ret = (*start_routine)(orig_arg);

    hcct_dump(hcct_get_root(), 1);

    return ret;
}

int __attribute__((no_instrument_function))
__wrap_pthread_create(pthread_t* thread, pthread_attr_t* attr,
                      void* (*start_routine)(void*), void* arg) {
    void** params = malloc(2 * sizeof(void*));
    params[0] = start_routine;
    params[1] = arg;
    return __real_pthread_create(thread, attr,
                                 aux_pthread_create, params);
}

void __attribute__((no_instrument_function))
__wrap_pthread_exit(void* value_ptr) {
    hcct_dump(hcct_get_root(), 1);
    __real_pthread_exit(value_ptr);
}

Listing 3.6: Wrapping of pthread_create and pthread_exit
Sampling and bursting (see Section 2.1) are implemented using a timer thread
in order to turn off analysis code when needed. From kernel version 2.6.21 onwards4,
high-resolution timers (HRT) are available under Linux. The timer thread
uses the clock_nanosleep() system call: like nanosleep(), it allows the caller
to sleep for an interval specified with nanosecond precision. The main difference
between the two system calls is that clock_nanosleep() allows the caller to select
the clock against which the sleep interval is to be measured. In particular,
we use the CLOCK_PROCESS_CPUTIME_ID clock, which measures the CPU time
consumed by all threads in the process. This system call first appeared in Linux
2.6, while glibc support has been available since version 2.1; POSIX.1 specifies that
clock_nanosleep() has no effect on signal dispositions or the signal mask.
The instrumentation framework has been realized in a modular structure in order
to facilitate the development of profiling and analysis tools. As in STool (see Sec-
tion 3.3), a callback mechanism is used to separate instrumentation code from
analysis code. Listing 3.7 reports the simple activation paradigm of the infrastruc-
ture. Parameters such as φ, ε, sampling interval, and burst length can be specified
as environment variables and retrieved through the hcct getenv method. Each
profiling tool is implemented as a shared library. The tool in use can be changed
by simply modifying the symbolic link pointing to the default analysis library.
int   hcct_getenv();
int   hcct_init();
void  hcct_enter(UINT32 routine_id, UINT16 call_site);
void  hcct_exit();
void* hcct_get_root();
void  hcct_dump();
#if BURSTING==1
void  hcct_align();
#endif
Listing 3.7: Activation paradigm of the gcc-based profiling infrastructure
At the time of writing, we have implemented a trace recorder with the same output
format as the Pin-based version (see Section 3.4) and an online profiling tool based
on LSS.
4 Man page: http://www.kernel.org/doc/man-pages/online/pages/man7/time.7.html
We have focused on LSS since, as we will see in Chapter 4, it performs
faster than LC. The constructed (φ, ε)-HCCT is stored in a file using a syntax
that resembles the DIMACS format for graphs5, reported in Listing 3.8.
c <tool> [<epsilon> <phi>] [<sampling interval> <burst length> <burst enter events>]
c <command> <process/thread id> <working directory>
v <node id> <parent id> <counter> <routine id> <call site>
[...]
v <node id> <parent id> <counter> <routine id> <call site>
p <num nodes>
Listing 3.8: Syntax of output format for gcc-based online profiler
We have also implemented a simple analysis tool to reconstruct the tree in
main memory starting from a log file, in order to compute arbitrary statistics.
This tool is a starting point for the implementation of more complex analysis
methods, for instance to compare the characteristics of two HCCTs. Routine
addresses are stored in hexadecimal format, in order to be able to resolve them
using the addr2line command. Given an address and an executable, this pro-
gram uses the debugging information in the executable6 to figure out which file
name and line number are associated with a given address.
3.6 Conclusions
In this chapter we have discussed the details of the implementation of Space Saving
and Lossy Counting, both of which have been optimized to take advantage
of the tree-based data structures. We have also presented the three main software
components developed in this thesis: a pintool, useful to analyze any application
without the need for recompilation, and the related STool infrastructure; a trace
analyzer, which can be used to perform offline analysis and accuracy evaluation; and
a gcc-based profiler, designed to achieve better performance in the analysis of
C/C++ Linux applications by using compiler-assisted instrumentation.
5 http://www.dis.uniroma1.it/challenge9/format.shtml#graph
6 For an accurate address translation, the program should be compiled with the -g option.
Chapter 4
Experimental evaluation
In this chapter we present an extensive experimental study of our data-streaming
based profiling mechanism. We analyze the performance and the accuracy of the
produced (φ, ε)-HCCT with respect to several metrics and using many different
parameter settings. Besides the approach to the construction of the tree based on
exhaustive instrumentation, we integrate our implementations with static burst-
ing and adaptive bursting with re-enablement techniques [32] to reduce running
time overhead. The results of the experimental analysis not only confirm but
reinforce the theoretical predictions: the (φ, ε)-HCCT accurately represents the hot
portions of the complete CCT using only an extremely small percentage of the
space required by the CCT. The use of bursting techniques does not affect accu-
racy in a substantial way, and the global slowdown is often comparable to the one
introduced by gprof while providing a much deeper context-sensitive analysis.
4.1 Methodology
4.1.1 Benchmarks
We use two distinct sets of benchmarks to evaluate the accuracy of the (φ, ε)-
HCCT and the performance of our profilers, respectively.
Interactive applications. Accuracy tests are performed on a variety of well-
known large-scale interactive Linux applications. Since the behaviour of an inter-
active application changes among different executions, comparing the accuracy
of results from different sessions might lead to incorrect conclusions. For this rea-
son, we use the tracer based on Pin described in Section 3.4 to record execution
traces and then analyze them to evaluate accuracy metrics. The complete list of
benchmarks and related essential information (e.g. the number of distinct call-
ing contexts) are reported in Table 2.1, while the distribution of calling context
frequencies is shown in Figure 2.1 - we introduced these aspects in Section 2.3.
Benchmarks include graphics programs (inkscape, gimp and gwenview), audio
players/editors (amarok and audacity), an archiver (ark), an Internet browser
(firefox), an HTML editor (quanta), and the Open Office suite for word pro-
cessing (oowriter), spreadsheets (oocalc), and drawing (ooimpress). After the
startup, in each session we interactively used the application for approximately
ten to twenty minutes, in order to simulate a typical usage session: e.g., we
created a 420×300 pixel image with gimp applying a variety of graphic filters
and color effects, we reproduced fifty images of ∼4.5MB each in slideshow mode
with gwenview, we played a ten-minute OGG audio file with amarok, and we
wrote two pages of formatted text, including tables, with oowriter. Statistical
information about test sets reported in Table 2.1 shows that even short sessions
of a few minutes result in CCTs consisting of tens of millions of calling contexts,
whereas the call graph has only a few thousand nodes.
SPEC CPU2006 suite. Performance metrics are evaluated on some of the
benchmarks of the SPEC CPU2006 suite in order to measure the efficiency of our
gcc-based profiler with CPU-intensive applications.
We have chosen a subset of the SPEC CPU2006 benchmarks according to
three exclusion criteria: implementation using Fortran, CCT size too small (e.g.,
bzip2), and incompatibility with gcc-based instrumentation (e.g., gcc and perl
crashed after a few minutes of execution). Six of the selected benchmarks belong to
the integer component of the suite: astar is a pathfinding library for 2D maps
including the well-known A* algorithm; gobmk is a complex artificial intelligence
game; h264ref is a video compression application; mcf involves combinatorial
optimization in vehicle scheduling; omnetpp uses a discrete event simulator to
model a large Ethernet campus network; sjeng is a highly-ranked chess program
that also plays several chess variants. The two remaining selected benchmarks
belong instead to the floating point component of the suite: povray renders a
1280x1024 pixel anti-aliased image of a chessboard with all the pieces in the
starting position using the ray-tracing technique; soplex solves a linear program
using a simplex algorithm and sparse linear algebra.

Application   Suite component   Language       |CCT|
astar         CINT2006          C++              844
gobmk         CINT2006          C          too large
h264ref       CINT2006          C              3 903
mcf           CINT2006          C             30 507
omnetpp       CINT2006          C++          162 742
povray        CFP2006           C++           71 685
sjeng         CINT2006          C          too large
soplex        CFP2006           C++            7 519

Table 4.1: Benchmarks selected from SPEC CPU2006 for performance evaluation.
The size of the CCTs associated with the benchmarks is highly variable. For
two benchmarks, gobmk and sjeng, the 3 GB user space memory limit is reached
in a few minutes since the CCTs are extremely large. On the other hand, for
other benchmarks such as astar and h264ref, the trees are composed of a few
thousand nodes. As we will see in Section 4.2.4, there is no linear correlation
between the size of the tree and the overhead introduced by instrumentation. We
are not surprised by the variability in the size of the CCTs: these benchmarks are
strictly optimized for performance assessment and adopt different internal logic
(consider, for instance, the impact on the size of the CCT of a massive usage
of recursion in order to solve a problem), while large-scale Linux applications
described in the previous paragraph are based on a GUI and use well-known
heavy libraries such as GTK and QT. For each benchmark we will analyze the
impact of both instrumentation code only and instrumentation plus analysis code,
to evaluate also the effectiveness of gcc as an instrumentation framework.
Platform. Our experiments were performed on a 2.53GHz Intel Core2 Duo
T9400 with 128KB of L1 data cache, 6MB of L2 cache, and 4 GB of main memory
DDR3 1066, running Ubuntu 8.04, Linux Kernel 2.6.24, 32 bit.
4.1.2 Metrics
We test the accuracy of the (φ, ε)-HCCT produced by our profilers according to
a variety of metrics:
1. Degree of overlap, considered in [2, 3, 32], is used to measure the completeness
of the (φ, ε)-HCCT with respect to the full CCT and is defined as follows:

   overlap((φ, ε)-HCCT, CCT) = (1/N) · Σ_{arcs e ∈ (φ,ε)-HCCT} w(e)

   where N is the total number of routine activations (corresponding to the
sum of the counters associated to CCT nodes) and w(e) is the true frequency
of the target node of arc e in the CCT.
2. Hot edge coverage, as defined in [32], measures the percentage of hot edges
of the CCT that are covered by the (φ, ε)-HCCT, using an edge-weight
threshold τ ∈ [0, 1] to determine hotness, as follows:

   cover((φ, ε)-HCCT, CCT, τ) = |{e ∈ (φ,ε)-HCCT : w(e) ≥ τH1}| / |{e ∈ CCT : w(e) ≥ τH2}|

   where H1 and H2 are the weights of the hottest arcs in the (φ, ε)-HCCT
and in the CCT, respectively.
3. Maximum frequency of uncovered calling contexts, where a context is uncovered
if it is not included in the (φ, ε)-HCCT:

   maxUncov((φ, ε)-HCCT, CCT) = ( max_{e ∈ CCT \ (φ,ε)-HCCT} w(e) / H1 ) × 100

   The average frequency of uncovered contexts is defined similarly.
4. Number of false positives, i.e., |A \ H|: the smaller this number, the bet-
ter the (φ, ε)-HCCT approximates the exact HCCT obtained from CCT
pruning.
5. Counter accuracy, i.e., maximum error in the frequency counters of (φ, ε)-HCCT
nodes with respect to their true value in the complete CCT:

   maxError((φ, ε)-HCCT) = max_{e ∈ (φ,ε)-HCCT} ( |w(e) − w̃(e)| / w(e) ) × 100

   where w(e) and w̃(e) are the true and the estimated frequency of context
e, respectively. The average counter error is defined similarly.
An accurate solution should maximize degree of overlap and hot edge coverage,
and minimize the remaining metrics. With respect to performance metrics, in-
stead, we compare execution time of the SPEC CPU2006 benchmarks.
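The first and last of these metrics can be computed directly from the true and estimated counters; the following is a minimal, array-based sketch (the function names are illustrative):

```c
// Degree of overlap (%): fraction of the N routine activations that is
// covered by the n contexts kept in the (phi,eps)-HCCT, given their
// true frequencies w[0..n-1].
static double overlap_pct(const unsigned long* w, int n, unsigned long N) {
    unsigned long sum = 0;
    for (int i = 0; i < n; i++)
        sum += w[i];
    return 100.0 * (double)sum / (double)N;
}

// Maximum relative counter error (%): worst-case deviation of the
// estimated frequencies w_est from the true frequencies w.
static double max_error_pct(const unsigned long* w,
                            const unsigned long* w_est, int n) {
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        double diff = (double)w[i] - (double)w_est[i];
        if (diff < 0) diff = -diff;
        double err = 100.0 * diff / (double)w[i];
        if (err > worst) worst = err;
    }
    return worst;
}
```

The metrics module performs these comparisons against the exact CCT produced by the trace analyzer, so the true frequencies are always available as a reference.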
4.1.3 Parameter tuning
According to the theoretical analysis, the choice of φ and ε parameters affects
both the space used by the algorithms and the accuracy of the solution. In our
study we considered many different choices of φ and ε across rather heterogeneous
sets of execution traces, always obtaining similar results that we summarize in
this section.
A rule of thumb validated by previous experimental studies [19] suggests that
it is sufficient to choose them such that φ/ε = 10 in order to obtain a small
number of false positives and high counter accuracy. We found this choice overly
pessimistic in our scenario: the extremely skewed distribution of calling context
frequencies makes it possible to use much smaller values of this ratio (i.e., larger
values of ε for a fixed φ) without sacrificing accuracy. Since space usage is roughly
proportional to 1/ε, this yields substantial space savings. Unless
otherwise stated, in all our experiments we used φ/ε = 5.
Since the length of the stream N is unknown a priori during the execution,
an appropriate choice of the hotness threshold φ may appear to be difficult.
In fact, too small values might result in returning too many contexts without
being able to discriminate which of them are actually hot, also resulting in a
waste of space. On the other hand, too large values might result in returning
too few contexts, failing to provide any relevant information about
hot spots in the execution of the program. Our experiments suggest that an
appropriate choice of φ is mostly independent of the stream length and of the
specific benchmark: as shown in Table 4.2, different benchmarks1 have a number
of HCCT nodes of the same order of magnitude when using the same φ threshold.
This fact is not surprising since it is a consequence of the skewness of context
frequency distribution, and greatly simplifies the choice of φ in practice. Unless
otherwise stated, in our experiments we used φ = 10−4, which corresponds to
mining roughly the hottest 1 000 calling contexts.
Table 4.2: Typical thresholds.
Benchmark    HCCT nodes     HCCT nodes     HCCT nodes
             (φ = 10^-3)    (φ = 10^-5)    (φ = 10^-7)
audacity            112          9 181        233 362
dolphin              97         14 563        978 544
gimp                 96         15 330        963 708
inkscape             80         16 713        830 191
oocalc              136         13 414      1 339 752
quanta               94         13 881        812 098
4.2 Experimental results
This section is organized as follows. Results related to accuracy are divided into
two parts: in Section 4.2.1 we discuss degree of overlap and hot edge coverage
for the exact HCCT, while in Section 4.2.2 we report results about the remaining
metrics for the (φ, ε)-HCCT. This choice aims to measure the quality of our
approach from a theoretical point of view: degree of overlap and hot edge coverage
can be evaluated for the exact HCCT without involving a particular streaming
algorithm. Moreover, since the HCCT is a subtree of the (φ, ε)-HCCT computed
by streaming algorithms, the results described for the exact HCCT apply to the
(φ, ε)-HCCT as well. In Section 4.2.3 we discuss the cost of our approach in terms
of memory usage and time spent for analysis. Finally, in Section 4.2.4 we analyze
the efficiency of our gcc-based profiler using the SPEC CPU2006 test suite.
1 Results for omitted benchmarks are similar.
4.2.1 Accuracy: exact HCCT
Degree of overlap. Not surprisingly, the variation of the degree of overlap
with the full CCT shown in Figure 4.1 for decreasing values of φ is coherent with
the shape of cumulative distributions of calling context frequencies discussed in
previous sections. This figure illustrates the relationship between degree of over-
lap and hotness threshold, plotting the largest value φ̄ of the hotness threshold
for which a given degree of overlap d can be achieved: using any φ ≤ φ̄, the
achieved degree of overlap will be larger than or equal to d.
[Plot "Largest value of φ that guarantees a given degree of overlap": x-axis degree
of overlap (%), y-axis max φ (log scale, 10^-9 to 1); one curve per benchmark:
amarok, audacity, bluefish, firefox, gedit, gwenview, oocalc, pidgin, quanta, vlc.]
Figure 4.1: Relation between φ and degree of overlap between the exact HCCT
and the full CCT on a representative subset of benchmarks.
Since the HCCT is a subtree of the (φ, ε)-HCCT, the degree of overlap eval-
uated for the HCCT is a lower bound to the corresponding value in the (φ, ε)-
HCCT. The value of φ̄ decreases as d increases: if we want to achieve a larger
degree of overlap, we must include in the HCCT a larger number of nodes, which
corresponds to choosing a smaller hotness threshold. However, when computing
the (φ, ε)-HCCT, the value of φ indirectly affects the space used by the algorithm
(the φ/ε ratio is fixed) and in practice cannot be too small (see Section 4.1.3). By analyzing hot
edge coverage and uncovered frequency for the HCCT and the (φ, ε)-HCCT, re-
spectively, we will show that even when the degree of overlap is not particularly
large, the HCCT and the (φ, ε)-HCCT are nevertheless good approximations of
the full CCT: values of φ ∈ [10−4, 10−6] represent on all our benchmarks a good
tradeoff between accuracy and space reduction.
Hot edge coverage. While degree of overlap measures the similarity of two
trees in their entirety, the hot edge coverage metric proposed by Zhuang et al. [32]
focuses on the similarity of the edges in each tree having relatively high edge
weights. In particular, since in our approach edge weights are represented by
the count field of the target node, hot edge coverage measures the similarity of
the hottest calling contexts. We compare the hot edge coverage of the HCCT
with respect to the full CCT to evaluate the percentage of covered hot contexts.
[Plot; y-axis (log scale, 10−6 to 1): min τ; x-axis (log scale, 10−9 to 10−3): φ; one curve per benchmark: firefox, oocalc, amarok, ark, inkscape. Plot title: minimum τ for which hot edge coverage is 100%.]
Figure 4.2: Hot edge coverage of the exact HCCT: relation between φ and edge-
weight threshold τ .
The analysis of hot edge coverage is reported in Figure 4.2. In this figure we
plot, as a function of φ, the smallest value τ̄ of the edge-weight threshold τ for which
hot edge coverage of the HCCT is 100%. Results are shown only for some of the
less skewed, and thus more difficult, benchmarks. Since the HCCT is a subtree
of the (φ, ε)-HCCT, the hot edge coverage evaluated for the HCCT is a lower
bound to the corresponding value in the (φ, ε)-HCCT.
Note that τ̄ is directly proportional to φ and roughly one order of magnitude
larger. This is because the HCCT contains all contexts with frequency
≥ ⌊φN⌋, and always contains the hottest context (which implies H1 = H2 in the
definition of hot edge coverage in Section 4.1.2). Hence, the hot edge coverage
is 100% as long as τH1 ≥ ⌊φN⌋, which yields τ̄ = ⌊φN⌋/H1. The experiment
shows that 100% hot edge coverage is always obtained for τ ≥ 0.01: even for
values of φ for which the degree of overlap may be small, the HCCT represents
the hot portions of the full CCT remarkably well according to hot edge coverage.
As a frame of comparison, notice that the τ thresholds used in [32] to analyze
hot edge coverage are always larger than 0.05, and for those values we always
guarantee total coverage. Clearly, the higher the threshold, the fewer hot edges
included in the range. For instance, a threshold of 0.95 will use all nodes whose
count values are within 5% of the hottest CCT node in computing the coverage.
Note that this is different, and possibly more challenging, than simply choosing
the top 5% hottest contexts, or nodes whose count values represent 5% of the
sum of the frequency counts of the CCT nodes.
4.2.2 Accuracy: (φ, ε)-HCCT
Uncovered frequency. Analysis of maximum and average frequency of calling
contexts not included in the (φ, ε)-HCCT provides important information on
the effectiveness of our approach, especially when the degree of overlap is small.
Consider, as an example, φ = 10−4, which yields degree of overlap as small as
10% on two of the less skewed benchmarks (oocalc and firefox).
In this apparently bad scenario, Figure 4.3 analyzes how the remaining 90%
of the total CCT weight is distributed among uncovered contexts: the average
frequency of uncovered contexts is about 0.01% of the frequency of the hottest
context, and the maximum frequency is typically less than 10%. This suggests
that uncovered contexts are likely to be uninteresting with respect to the hottest
[Histogram; y-axis (log scale, 10−5 to 100): uncovered frequency (% of the hottest context); x-axis: benchmarks (amarok, ark, audacity, bluefish, dolphin, firefox, gedit, ghex2, gimp, sudoku, gwenview, inkscape, oocalc, ooimpress, oowriter, pidgin, quanta, vlc); bars: LSS avg, LSS max. Plot title: avg/max frequency of contexts not included in (φ, ε)-HCCT.]
Figure 4.3: Maximum and average frequency of calling contexts not included in
the (φ, ε)-HCCT generated by LSS. Results for LC are almost identical and are
not shown in the chart.
contexts, and that the distribution of calling context frequencies obeys a “long-
tail, heavy-tail” phenomenon: the CCT contains a huge number of calling con-
texts that rarely get executed, but overall these low-frequency contexts account
for a significant fraction of the total CCT weight. Discarding on-the-fly informa-
tion about these contexts is essential to reduce space usage, and the efficiency of
data streaming algorithms plays a fundamental role in the process.
In our experiments, Space Saving and Lossy Counting showed an almost iden-
tical behaviour from the point of view of uncovered frequencies. Note that in this
section we do not report accuracy results for BSS, since it computes the same
(φ, ε)-HCCT as LSS.
Partition of the nodes. Figure 4.4 shows the percentages of cold nodes, true
hots, and false positives in the (φ, ε)-HCCT using φ = 10−4 and ε = φ/5. Both
algorithms include in the tree only very few false positives (less than 10% of the
total number of tree nodes in the worst case), and Lazy Space Saving consistently
proved to be better than Lossy Counting.
[Stacked histogram; y-axis: cold nodes / hot nodes / false positives (%), 0 to 100; x-axis: benchmarks (amarok, ark, audacity, bluefish, dolphin, firefox, gedit, ghex2, gimp, sudoku, gwenview, inkscape, oocalc, ooimpress, oowriter, pidgin, quanta, vlc). Plot title: classification of (φ, ε)-HCCT nodes.]
Figure 4.4: Partition of (φ, ε)-HCCT nodes into: cold (bottom bar), hot (mid-
dle bar), and false positives (top bar) for LC and LSS. The two bars for each
benchmark are related to LC (left) and LSS (right), respectively.
These results are not surprising: previous experimental studies [19] show that
these algorithms are very accurate in terms of recall (i.e., the fraction of relevant
instances that are retrieved by the algorithm). The percentage of false positives
is influenced by the choice of the φ/ε ratio: as the maximum possible estimation
error decreases, Lossy Counting and Space Saving identify heavy hitters more
accurately. Figure 4.5 shows how the number of false positives considerably
decreases as we choose smaller values of ε for a fixed φ, getting very close to 0%
on most benchmarks when the ratio φ/ε is larger than 10.
Since the average node degree is a small constant around 2-3, cold HCCT
nodes are typically a fraction of the total number of nodes. In our experiments
we observed that this fraction strongly depends on the hotness threshold φ, and
in particular decreases with φ: cold nodes that have a hot descendant are indeed
more likely to become hot when φ is smaller.
[Plot; y-axis: false positives (% of (φ, ε)-HCCT nodes), 0 to 18; x-axis: φ/ε, 2 to 16; curves: LSS and LC on firefox, oocalc, inkscape, ark. Plot title: number of false positives as a function of ε.]
Figure 4.5: False positives in the (φ, ε)-HCCT as a function of ε on a represen-
tative subset of benchmarks. The value of φ is fixed to 10−4.
Accuracy of counters. A very interesting feature of our approach is that
counter estimates are very close to the true frequencies, as shown in Figure 4.6.
Lossy Counting on average achieves almost exact counters for the hot contexts
(0.057% average difference from the true frequency), and even the maximum
error in the worst case is smaller than 8%. When a heavy hitter is inserted in the
set of monitored items, it is unlikely that at the end of the current bucket it is
removed from M: for this reason very few hot contexts are underestimated in a
non-negligible way. Lazy Space Saving instead computes less accurate counters.
This is due to the fact that many heavy hitters may not appear in the initial
part of the stream: when they are inserted in the set of monitored items for the
first time, the overestimation is often non-negligible. We implemented a variant
of LSS that uses the εi field (see Section 2.2.1) to improve the accuracy of the
computed frequencies by subtracting εi from count for each node output at the
end of the query operation. With this trick, the average error becomes comparable to that of LC.
[Histogram; y-axis: benchmarks (amarok, ark, audacity, bluefish, dolphin, firefox, gedit, ghex2, gimp, sudoku, gwenview, inkscape, oocalc, ooimpress, oowriter, pidgin, quanta, vlc; two panels, avg and max); x-axis: avg/max counter error among hot elements (% of the true frequency), 0 to 18; bars: LSS avg error, LC avg error, LSS max error, LC max error.]
Figure 4.6: Accuracy of context frequencies computed by LSS and LC, measured
on hot contexts included in the (φ, ε)-HCCT. Lossy Counting on average achieves
almost exact counters, performing significantly better than Space Saving.
Integration with bursting. The integration with static and adaptive bursting
techniques [32] does not substantially affect the accuracy of the (φ, ε)-HCCT, with
the exception of adaptive bursting with no re-enablement. This is not surprising,
since without re-enablement information about calling contexts observed in a
previous burst is ignored: hence, the corresponding nodes are more likely to
become cold as the number of routine enter events increases. When re-enablement
is active, the RR parameter specifies the probability of updating the tree with
profiling information for contexts observed during a previous burst rather than
discarding it. A higher RR causes more bursts to be re-enabled, adding higher
overhead, whereas a smaller RR misses more bursts and lowers the quality of
the constructed tree.
[Histogram; y-axis: hot edge coverage (%), 0 to 100; x-axis: static, RR 20%, RR 10%, RR 5%, adaptive; bars: LSS, LC. Plot title: comparison of hot edge coverage results using static and adaptive bursting.]
Figure 4.7: Hot edge coverage analysis on the vlc benchmark for LSS/LC com-
bined with static and adaptive bursting for τ = 0.025, with φ = 4 · 10−5,
ε = φ/2, sampling interval 2 msec, and burst length 0.2 msec. RR stands for
re-enablement, adaptive for adaptive bursting without re-enablement.
We take as an example the vlc benchmark and we focus on hot edge coverage
for τ = 0.025. The HCCT computed without bursting has coverage 100%, which
is also achieved using static bursting and adaptive bursting with re-enablement
ratio 20%, as shown in Figure 4.7. The hot edge coverage decreases as sampling
becomes more aggressive, but in this experiment it is always larger than 78%
and 91% for LSS and LC, respectively.
4.2.3 Memory usage and cost of analysis
Memory usage. Our trace analyzer (see Section 3.4) keeps track of the peak
memory usage of the analysis tools. The peak memory usage is proportional to
the number of MCCT nodes, which is typically larger than the actual number of
hot contexts obtained after pruning. In spite of this, a considerable amount of
space is saved by all our algorithms.
Figure 4.8 plots the peak memory usage of our analysis tools as a percentage
of the full CCT. On many benchmarks, our profilers use less than 1% of the
space required by the full CCT, and even in the worst case the space usage is
about 6.5%. Space Saving is always more efficient than Lossy Counting, and the
lazy approach is preferable to the bucket-based implementations suggested by
the authors [22]. It should be noted, however, that the space used in practice by
Lossy Counting is much smaller than the theoretical prediction (see Section 2.2.2)
and considerably smaller than the 7/ε bound discussed by the authors [20] for
real-world datasets.
Quite surprisingly, the integration with static bursting improves space usage
(results for adaptive bursting are similar and are not reported in the chart). This
depends on the fact that sampling reduces the variance of calling context frequen-
cies: MCCT cold nodes that have a hot descendant are more likely to become hot
when sampling is active, and monitoring these nodes in place of others reduces
the total MCCT size. The histogram also shows that static/adaptive bursting
alone (i.e., without streaming) is not sufficient to reduce the space substantially:
along with hot contexts, a large fraction of cold contexts is also sampled and
included in the CCT. On the oocalc benchmark, even adaptive bursting without
re-enablement requires 127 times more space than LSS. We also observed that
the larger the applications, the larger the space reduction of our approach over
bursting alone.
[Histogram; y-axis: benchmarks (amarok, ark, audacity, bluefish, dolphin, firefox, gedit, ghex2, gimp, sudoku, gwenview, inkscape, oocalc, ooimpress, oowriter, pidgin, quanta, vlc); x-axis (log scale, 0.1 to 100): virtual memory peak (% of full CCT); bars: LSS + static burst, LSS, BSS, LC + static burst, LC, static burst. Plot title: space comparison with full CCT.]
Figure 4.8: Space analysis on several benchmarks of BSS/LSS/LC, static bursting,
and LSS/LC combined with static bursting, with sampling interval 2 msec and
burst length 0.2 msec.
Cost of analysis. Since our approach is independent of any specific instru-
mentation mechanisms, and different techniques for tracing routine enter and
exit events might incur rather different overheads in practice, in this paragraph
we will focus on the time required by the analysis routines only. Our trace ana-
lyzer provides a mechanism to measure the time spent in analysis subtracting the
fraction spent in I/O operations (see Section 3.4). Compared to the construc-
tion of the full CCT, streaming algorithms incur a small overhead, which can be
considerably reduced by exploiting bursting techniques.
[Histogram; y-axis (log scale, 0.25 to 64): speedup w.r.t. CCT construction; three groups: no sampling, sampling interval 2 ms, sampling interval 10 ms; within the sampled groups: static, RR 20%, RR 10%, RR 5%, adaptive; bars: full CCT, bursting, LSS, LC. Plot title: comparison of analysis running times using static and adaptive bursting.]
Figure 4.9: Speedup analysis relative to full CCT construction on the vlc bench-
mark for LSS/LC, static and adaptive bursting alone, and LSS/LC combined
with static and adaptive bursting: (1) no sampling (left histogram); (2) sampling
interval 2 msec and burst length 0.2 msec (middle histogram); (3) sampling inter-
val 10 msec and burst length 0.2 msec (right histogram). The CCT construction
baseline is about 1.5 seconds for 2.5 · 108 routine enter/exit events.
In Figure 4.9 we plot the speedup that can be obtained by combining our
approach with static and adaptive bursting. We compared the performance of
six variants (from slowest to fastest) for both LSS and LC: full (φ, ε)-HCCT
construction (on the entire routine enter/exit stream), static bursting, adaptive
bursting with re-enablement ratio equal to 20%, 10%, and 5%, and adaptive
bursting without re-enablement. We omitted sample-driven stack-walking, which
has been shown to be largely inferior to bursting with respect to accuracy [22].
The construction of the HCCT greatly benefits from bursting, achieving speedups
up to 35× in the execution time of analysis routines. LSS is always faster than
LC, but both are slower than bursting without streaming; the gap, however, is
moderate: LSS is typically less than 2× slower.
4.2.4 Performance
Performance of our gcc-based profiling tools is evaluated against running times of
suitable SPEC CPU2006 benchmarks compiled with the base set of optimizations.
According to the SPEC Run and Reporting Rules³, the median of three runs for each
benchmark is picked. In these experiments we used ε = 2 · 10−5.
In Figure 4.10 we plot the slowdown introduced by our profiling tools and by
gprof. We do not report results for LC, since LSS is always faster. We also
omit results for LSS with adaptive bursting, since the improvement over LSS
with static bursting is not as remarkable as in the analysis-only evaluation at
the end of the previous section: the overhead introduced by instrumentation code
alone (the empty analysis tool) often accounts for a large fraction of total
execution time, hence the benefits of adaptive bursting with re-enablement are
less appreciable.
The overhead introduced by our streaming-based approach with respect to
canonical CCT construction is small, while the integration with static bursting
reduces by half the slowdown in many benchmarks. In the worst case, LSS with
static bursting introduces 1.56x slowdown for C benchmarks (sjeng) and 3.46x
for C++ benchmarks (omnetpp). CCT construction times for gobmk and sjeng
are not available since the corresponding CCTs are too large to be maintained
in main memory. For the sake of completeness, absolute execution times are
reported at the end of this section in Table 4.3.
Note that there is no linear correlation between the size of the CCT (reported
in Table 4.1) and the slowdown introduced by our profiling tools. Consider, for
³ http://www.spec.org/cpu2006/Docs/runrules.html
[Histogram; y-axis: slowdown, 0 to 5.5; x-axis: astar, gobmk, h264ref, mcf, omnetpp, povray, sjeng, soplex; bars: CCT, LSS, LSS+static, gprof, empty. Plot title: analysis of slowdown introduced by profiling wrt native execution.]
Figure 4.10: Analysis of slowdown introduced by profiling tools with respect to
a native execution of a set of SPEC CPU2006 benchmarks. The tool empty
instruments routine enter and exit events without performing any analysis and
is used to evaluate the effectiveness of gcc-based instrumentation. Statistics are
reported also for the traditional exhaustive construction of the CCT.
instance, mcf and povray: the sizes of their CCTs are of the same order of
magnitude, yet the slowdown for mcf is much smaller than for povray. Moreover,
the slowdown is not monotone in the size of the CCT: for instance, the overhead
introduced for mcf is remarkably lower than for astar and omnetpp, whose CCTs
are considerably smaller and larger, respectively, than the CCT of mcf.
In Figure 4.11 we compare the performance of our profilers with gprof. The
reasons behind this choice are several: its widespread use in the profiling
community, its compiler-assisted instrumentation (the -pg option of gcc),
and its compatibility with SPEC CPU2006 benchmarks. gprof is a very efficient
tool: the introduced slowdown with respect to a native execution is 2.07x in the
worst case scenario (omnetpp benchmark, see Figure 4.10), while it is very close
to 1.5x for most benchmarks. We believe that aiming at better performance
than gprof would be a very ambitious, and probably unrealistic, goal, at
least without the adoption of sampling and bursting techniques: gprof has been
largely studied and optimized over the years, and computes much less accurate
information in terms of analysis depth.
[Histogram; y-axis: slowdown, 0 to 3; x-axis: astar, gobmk, h264ref, mcf, omnetpp, povray, sjeng, soplex; bars: CCT, LSS, LSS+static, empty. Plot title: comparison of performance of gcc-based profilers with gprof.]
Figure 4.11: Analysis of slowdown introduced by profiling tools with respect to
an execution of SPEC CPU2006 benchmarks under gprof.
A comparison of the performance of the empty tool with gprof is very useful
for evaluating the effectiveness of gcc-based instrumentation. Quite
surprisingly, the empty tool is consistently slower than gprof on C++
benchmarks (two times slower in the worst case, the soplex benchmark): the
-finstrument-functions instrumentation framework provided by g++ does not
seem to be as efficient as gcc's. Since this is still an experimental feature,
we are confident that performance improvements might be realized soon.
In conclusion, we believe that the results of the comparison with gprof are
encouraging: in three benchmarks (i.e., h264ref, mcf, and sjeng) LSS with static
bursting performs slightly faster, and only in two benchmarks the slowdown is
greater than 1.6x (in particular, it is close to 2x for omnetpp and less than 2.4x
for soplex).
Benchmark native CCT LSS LSS+static gprof empty
astar 667 1705 1748 1452 1001 1271
gobmk 688 N/A 1776 1049 984 867
h264ref 1097 2582 2676 1676 1685 1455
mcf 399 567 581 571 576 460
omnetpp 480 2169 2317 1663 797 1290
povray 393 1992 2056 1241 813 956
sjeng 817 N/A 2368 1278 1296 1037
soplex 495 1412 1567 1229 526 1059
Table 4.3: Execution times (in seconds) for SPEC CPU2006 benchmarks.
4.3 Conclusions
In this chapter we have presented and discussed an extensive experimental study
of our profiling mechanism, using two distinct sets of benchmarks for accuracy
and performance metrics evaluation. Data streaming algorithms have proved
themselves to be very effective in this setting: both algorithms include very
few false positives in the (φ, ε)-HCCT, while average counter error is negligible
especially for LC. On the other hand, the overhead introduced with respect to
canonical CCT construction is small, while the integration with static bursting
reduces by half the slowdown in many benchmarks. Moreover, the results of the
comparison with gprof are encouraging, with our tool performing slightly faster
in some benchmarks while providing much more accurate contextual profiling
information.
Chapter 5
Conclusions
Summary
Program analysis tools are extremely important in software development and
engineering. In particular, profilers are largely used to understand program be-
haviour (e.g., identify critical sections of code), aid program optimization (i.e.,
make some aspects of it work more efficiently or use fewer resources), and detect
anomalies of program execution or code injection attacks. The call graph con-
structed by many performance profiling tools such as gprof does not accurately
describe context-sensitive information, since it represents pairs of caller-callee
routines rather than maintaining a separate node for each call stack that a pro-
cedure can be activated with. For this reason, calling context trees have been
proposed in order to offer a compact representation of all calling contexts encoun-
tered during a program’s execution. However, even for short runs of medium-sized
applications, CCTs can be rather large and difficult to analyze. Moreover, their
sheer size might hurt execution time due to poor access locality during construc-
tion and query.
Motivated by the observation that only a very small fraction of calling contexts
are representative of hot spots to which optimizations must be directed, in this
thesis we have presented a novel technique for improving the space efficiency of the
complete calling context tree without sacrificing accuracy. We have formalized
the concept of Hot Calling Context Tree, and we cast the problem of identifying
the most frequent contexts into a data streaming setting: by adapting fast and
space-efficient algorithms for mining frequent items in data streams, we have
devised an algorithm for interprocedural contextual profiling that can discard
cold contexts on the fly, thus reducing memory usage to as little as 1% of the
space required to maintain the complete CCT in main memory. In particular, we have
realized a highly tuned implementation of the Space Saving algorithm that is more
efficient in practice than the one proposed by the authors and that might be of
independent interest. Our approach appears to be extremely practical and can
be easily integrated with several techniques (e.g., static and adaptive bursting)
that have been proposed in recent years to reduce the overhead introduced by
instrumentation.
We have evaluated the effectiveness of our approach in terms of accuracy
and memory usage on several large-scale Linux applications: results not only
confirm, but reinforce the theoretical predictions concerning these two aspects.
We have analyzed the performance overhead introduced by our gcc-based profiling
tools using a set of benchmarks from the SPEC CPU2006 suite. In particular,
for C benchmarks the combination of Lazy Space Saving with static bursting
introduces in the worst case a 1.56x slowdown with respect to a native execution.
Moreover, this version of our profiler performs slightly faster than gprof in several
C benchmarks. We believe that the results of this experimental evaluation are
encouraging, since gprof does not provide additional context-sensitive information
beyond the extent of a call graph.
Outlook
In this thesis the focus is on frequency counts (i.e., on the number of activations
of each calling context throughout the execution) as performance metrics, but
our approach can be extended to arbitrary metrics such as execution time, cache
misses, or instruction stalls. Even though a preliminary analysis suggests that
the data streaming algorithms presented in this thesis might not adapt well to
scenarios in which items in the data stream are weighted, better data streaming
algorithmic solutions are already available. Another interesting direction to inves-
tigate is whether a similar approach can be applied to certain space-demanding
forms of path profiling: e.g., it would be interesting to understand if our tech-
niques could help address the scalability problems encountered when collecting
performance metrics about interprocedural path profiling in order to capture both
inter- and intra-procedural control flow (i.e., acyclic paths that may cross proce-
dure boundaries) [21], or k-iteration paths [28] (i.e., intraprocedural cyclic paths
spanning up to k loop iterations) for large values of k.
A natural evolution of our work on performance profiling and algorithm design
would be to study methodologies and techniques for analyzing the performance
and behaviour of large distributed applications, developing profiling tools for big
data computing platforms and cloud architectures such as MapReduce [11]. In a
recent work [10] it is shown that collected profiling information can be extremely
large (e.g., about 100GB for a TeraSort job): we believe that an interesting
research direction would be to investigate whether streaming algorithms can be
adapted to this scenario in order to mine on-the-fly only data items that are
possibly relevant with respect to the metrics of interest.
Appendix A: Using the tools
In this appendix we describe how to install and use the profiling tools discussed
in this thesis. Source code can be retrieved through Google Code:
$ svn co -N http://hcct.googlecode.com/svn/trunk/ hcct_read_only
$ cd hcct_read_only
$ svn up thesis
$ cd thesis
The only library required for compilation is GLib2¹, available on most Linux
distributions. Source code has been extensively tested on several Ubuntu releases.
Pin tracer
The trace recorder based on Pin can be used to perform preliminary analysis on
the target program without the need to recompile it. Source code is organized
in two directories, rectrace and stool, within the pin_tracer module. The
Pin instrumentation framework should be installed in a folder called pin under
pin_tracer. Once the toolkit has been downloaded², the main folder should be
renamed accordingly:
$ tar xzvf pin-2.11-49306-gcc.3.4.6-ia32_intel64-linux.tar.gz
$ mv pin-2.11-49306-gcc.3.4.6-ia32_intel64-linux pin
Compiling the pintool is very simple. The STool infrastructure is used by the
trace recorder tool as an intermediate layer towards Pin, hence it should be
compiled first. No configuration is needed.
¹ http://www.gtk.org/
² http://www.pintool.org
In order to compile both modules:
$ cd stool
$ make
$ cd ../rectrace
$ make
The compiled pintool is saved as obj-ia32/rectrace.so in the rectrace folder.
A pintool should be invoked through pin in order to be properly executed. How-
ever, on several recent Linux distributions a preliminary operation is needed³:
$ sudo -i
(insert root password)
# echo 0 > /proc/sys/kernel/yama/ptrace_scope
# exit
Recording an execution is very simple. We will use xcalc as an
example:
$ ../pin/pin -t obj-ia32/rectrace.so -- xcalc
[TID=0] Starting new thread with output file xcalc-0.cs.trace
The analyzed application will work as usual, slowed down by instrumentation
and flush operations needed to record the trace on disk. Note that a separate file
is created for each thread.
Trace analyzer
The trace analyzer is independent of the instrumentation framework used for
recording and can be used to perform preliminary analysis or parameter tuning
of data streaming algorithms and sampling/bursting techniques. Source code is
contained in the trace_analyzer module and can be compiled by simply executing
the make command from inside the module directory.
³ Pin uses parent injection as default mode since it is less likely to cause problems.
However, on many recent Linux distributions the usage of the ptrace system call is restricted
for security reasons. For more information, see the Injection section in the Pin User Manual.
A collection of trace analysis tools is generated in the bin directory, one for
each algorithm, profiling methodology, and type of analysis to perform. The prefix
denotes the algorithm in use: cct for the exhaustive CCT construction; empty
for only parsing the stream; bss-hcct, lss-hcct, and lc-hcct for the con-
struction of the (φ, ε)-HCCT through the BSS, LSS, and LC streaming algorithms,
respectively. The infix denotes the bursting technique in use: burst for static
bursting; adaptive-XX for adaptive bursting⁴; adaptive-no-RR for adaptive bursting
without re-enablement. When no infix is present, the stream of events is analyzed
in its entirety (i.e., no sampling is performed). The suffix denotes the type of
analysis to perform (streaming algorithms only): metrics for evaluating accuracy
metrics with respect to the exhaustive CCT construction; perf for evaluating the
time consumed by analysis routines (i.e., the cost of analysis) and peak memory usage.
The syntax of the tools is straightforward. Three parameters are mandatory:
the name of the trace file to analyze, the benchmark considered (it will
appear in the log), and the name of the output log file. Other parameters depend
on both the algorithm and the methodology in use. For instance, consider
lss-hcct-burst-metrics:
$ bin/lss-hcct-burst-metrics input_file.trace benchmark_name \
logs/output.log -phi <PHI> -epsilon <EPSILON> -tau <TAU> \
-sampling <SAMPLING_RATE> -burst <BURST_LENGTH>
PHI and EPSILON are defined as ⌈1/φ⌉ and ⌈1/ε⌉, respectively. For instance,
a choice of parameters φ = 10−4 and ε = 2 · 10−5 entails PHI=10000 and
EPSILON=50000. Similarly, TAU is defined as ⌈1/τ⌉ and is used to evaluate the
hot edge coverage. SAMPLING RATE and BURST LENGTH are expressed in terms of
timer ticks: the former is defined as the number of timer ticks between two
sampling points, while the latter is the duration of a burst expressed in
number of timer ticks. The default granularity for the Pin tracer is 200 µs.
The syntax of the output log file is quite simple. We will briefly discuss
the format for the *-metrics tools. The first row is used as a header, and
subsequent executions of the trace analysis tools with the same output file would
not replicate it. Rows (one for each execution of the tool) are organized in
⁴ XX is defined as 1/RR. For instance, adaptive-5 entails RR=20%.
columns separated by spaces and can be parsed by graphical visualization
tools such as gnuplot or by user-defined scripts. The first columns contain general
information about the benchmark (e.g., the length of the stream, the number of
nodes of the call graph and of the [H]CCT, and the number of hot/cold nodes
and false positives); then several accuracy metrics are reported (e.g., degree of
overlap, hot edge coverage, and max/avg error on counters). When bursting is in
use, related information (e.g., the number of bursts, the resolution of the timer,
and the sampled percentage of the whole stream) is reported as well.
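Since the log is plain space-separated text, a short awk pipeline is often enough for ad-hoc inspection before reaching for gnuplot. The miniature log below is fabricated and its column names are hypothetical; consult the actual header row of your log for the real column meanings:

```shell
# Fabricated *-metrics log: first row is the header, one row per run.
# Column names here are invented for illustration only.
cat > /tmp/output.log <<'EOF'
benchmark stream_length hcct_nodes
bzip2 1000000 4000
gcc 2500000 9000
EOF
# Skip the header row and print benchmark name with its node count:
awk 'NR > 1 { print $1, $3 }' /tmp/output.log
```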
The gcc-based profiler
The gcc-based profiler is composed of a set of dynamic libraries, one for each
profiling methodology, as usual. To compile the source code it is sufficient
to run the make command from inside the gcc profiler module directory.
The compiled libraries must be reachable by the executables to be profiled:
$ ls libs
libhcct_cct_burst.so libhcct_empty_burst.so libhcct_lss.so
libhcct_cct_malloc.so libhcct_empty.so libhcct_tracer.so
libhcct_cct.so libhcct_lss_burst.so
$ sudo -i cp libs/* /usr/lib/
(insert root password)
The user can choose a library as the default one and link applications against it during
recompilation. By changing a symbolic link, it is possible to switch the profiling
library in use transparently. For instance, to choose LSS with static
bursting as the default profiling library:
$ sudo ln -s /usr/lib/libhcct_lss_burst.so /usr/lib/libhcct.so
Recompiling the application to be analyzed does not require modifications to
the source code. It is sufficient to edit the Makefile to specify certain
options for the compiler (the CFLAGS variable for C sources, CXXFLAGS for C++) and
for the linker (the LDFLAGS variable). In particular, the compiler has to be instructed to
generate instrumentation calls on function entry and exit and also to produce
debugging information⁵, while the linker should link the profiling library into the
executable and also wrap the thread creation and termination functions⁶. Hence, the
required options are:
# Makefile
CFLAGS="-finstrument-functions -g"
LDFLAGS="-lhcct -lpthread -Wl,-wrap,pthread_create \
-Wl,-wrap,pthread_exit"
A collection of demo programs is available under the test folder: the same
demo program is compiled and linked against each of the profiling libraries
separately. Parameters for the streaming algorithms and for the bursting mechanism
can be specified as environment variables:
$ DUMPPATH="/tmp" EPSILON=1000 PHI=100 SINTVL=200000 \
BLENGTH=20000 test/lss_burst
The meaning of the environment variables is straightforward. DUMPPATH specifies
the path for storing the output file(s); the values for EPSILON and PHI are defined as
for the trace analyzer; SINTVL and BLENGTH specify (in nanoseconds) the sampling
interval and the burst length, respectively.
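Under static bursting, the expected fraction of the event stream that gets sampled is roughly BLENGTH/SINTVL; with the example values used above (20000 ns bursts every 200000 ns) that is 10%. A quick check:

```shell
# Expected sampled fraction of the stream for the example values above
# (burst length divided by sampling interval, both in nanoseconds).
SINTVL=200000
BLENGTH=20000
awk -v s="$SINTVL" -v b="$BLENGTH" 'BEGIN { printf "%.0f%%\n", 100 * b / s }'
```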
For each thread a separate output file is produced, containing the tree
constructed by the profiling tool. File names follow the syntax
<benchmark>-<PID/TID>.tree, and the tree is dumped with an in-order visit
to ease its reconstruction. A simple tool called analysis, provided under the
tools folder, computes simple statistics about the tree; it is a starting
point for the implementation of more complex analysis methods.
Routine addresses are stored in hexadecimal format so that they can be resolved
using addr2line. Given an address and an executable, this tool uses the debug-
ging information in the executable to reconstruct the source file name and line
number associated with the address. As an example, we executed the testing
program for the CCT construction and extracted a random routine address from
⁵ Debugging information is needed to resolve addresses using the addr2line command.
⁶ The profiling library will invoke the Pthread library after an initialization phase.
the output files; as the reader can see, the information reported by addr2line is very
accurate:
$ addr2line -e test/cct 8048e00
/home/pctips/hcct_read_only/thesis/gcc_profiler/example.c:29
Finally, the libhcct_tracer.so library provides a trace recorder that is fully
equivalent to the Pin tracer; hence, its output files can be analyzed with the
trace analyzer as well.
Note: since this profiler is still experimental, many new features and improvements
will be implemented soon. Please check the project page
http://code.google.com/p/hcct/ for news and releases.
List of Figures
1.1 Call graph, Call tree, and Calling Context Tree . . . . . . . . . . 6
2.1 Skewness of calling contexts distribution . . . . . . . . . . . . . . 19
2.2 CCT, HCCT, and (φ, ε)-HCCT . . . . . . . . . . . . . . . . . . . 20
2.3 Tree data structures and calling contexts classification . . . . . . . 21
4.1 Relation between φ and degree of overlap . . . . . . . . . . . . . . 45
4.2 Hot edge coverage - relation between φ and τ . . . . . . . . . . . 46
4.3 Max and avg frequency of calling contexts not in (φ, ε)-HCCT . . 48
4.4 Partition of (φ, ε)-HCCT nodes into cold, hot, and false positives . 49
4.5 False positives in the (φ, ε)-HCCT as a function of ε . . . . . . . . 50
4.6 Accuracy of context frequencies computed by LSS and LC . . . . 51
4.7 Hot edge coverage analysis with bursting enabled . . . . . . . . . 52
4.8 Space analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.9 Cost of analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.10 Slowdown introduced with respect to a native execution . . . . . . 57
4.11 Slowdown introduced with respect to gprof . . . . . . . . . . . . . 58
List of Tables
2.1 Number of nodes of call graph, call tree, calling context tree, and
number of distinct call sites for different applications. . . . . . . . 17
4.1 Benchmarks selected from SPEC CPU2006. . . . . . . . . . . . . 41
4.2 Typical thresholds. . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Execution times (in seconds) for SPEC CPU2006 benchmarks. . . 59
Listings
2.1 prune(X,MCCT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1 find-min() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Implementation of find-min() with sentinel and second min idx . 27
3.3 Tree node structure for Lossy Counting . . . . . . . . . . . . . . . 29
3.4 Activation paradigm of STool . . . . . . . . . . . . . . . . . . . . 31
3.5 Function prototypes for gcc-based instrumentation . . . . . . . . . 34
3.6 Wrapping of pthread create and pthread exit . . . . . . . . . . 36
3.7 Activation paradigm of the gcc-based profiling infrastructure . . . 37
3.8 Syntax of output format for gcc-based online profiler . . . . . . . 38
Bibliography
[1] Glenn Ammons, Thomas Ball, and James R. Larus. Exploiting hardware
performance counters with flow and context sensitive profiling. SIGPLAN
Not., 32(5):85–96, 1997.
[2] Matthew Arnold and Barbara G. Ryder. A framework for reducing the cost
of instrumented code. In PLDI, pages 168–179, 2001.
[3] Matthew Arnold and Peter F. Sweeney. Approximating the calling context
tree via sampling. Technical Report RC 21789, IBM Research, 2000.
[4] Thomas Ball and James R. Larus. Efficient path profiling. In MICRO
29: Proceedings of the 29th annual ACM/IEEE international symposium on
Microarchitecture, pages 46–57, 1996.
[5] Thomas Ball, Peter Mataga, and Mooly Sagiv. Edge profiling versus path
profiling: the showdown. In POPL, pages 134–148. ACM, 1998.
[6] Andrew R. Bernat and Barton P. Miller. Incremental call-path profiling.
Technical report, University of Wisconsin, 2004.
[7] Michael D. Bond and Kathryn S. McKinley. Probabilistic calling context.
SIGPLAN Not. (proceedings of the 2007 OOPSLA conference), 42(10):97–
112, 2007.
[8] Derek Bruening. Efficient, Transparent, and Comprehensive Runtime Code
Manipulation. PhD thesis, M.I.T., 2004.
[9] Bryan Buck and Jeffrey K. Hollingsworth. An API for runtime code patch-
ing. International Journal of High Performance Computing Applications,
14(4):317–329, 2000.
[10] Jinquan Dai, Jie Huang, Shengsheng Huang, Bo Huang, and Yan Liu. Hi-
tune: dataflow-based performance analysis for big data cloud. In Proceed-
ings of the 3rd USENIX conference on Hot topics in cloud computing, Hot-
Cloud’11, 2011.
[11] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing
on large clusters. In Proceedings of the 6th conference on Symposium on
Opearting Systems Design & Implementation - Volume 6, 2004.
[12] Daniele Cono D’Elia, Camil Demetrescu, and Irene Finocchi. Mining hot
calling contexts in small space. In Proceedings of the 32nd ACM SIGPLAN
conference on Programming language design and implementation, PLDI ’11,
pages 516–527, 2011.
[13] Nathan Froyd, John M. Mellor-Crummey, and Robert J. Fowler. Low-
overhead call path profiling of unmodified, optimized code. In ICS, pages
81–90, 2005.
[14] Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick. gprof: a
call graph execution profiler (with retrospective). In Best of PLDI, pages
49–57, 1982.
[15] Satish Chandra Gupta. Need for speed – eliminating performance bot-
tlenecks. http://www.ibm.com/developerworks/rational/library/05/
1004_gupta/. (accessed September 2, 2012).
[16] Robert J. Hall and Aaron J. Goldberg. Call path profiling of monotonic
program resources in unix. In USENIX Summer, pages 1–14, 1993.
[17] Martin Hirzel and Trishul Chilimbi. Bursty tracing: A framework for low-
overhead temporal profiling. In Proc. 4th ACM Workshop on Feedback-
Directed and Dynamic Optimization, 2001.
[18] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser,
Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood.
Pin: building customized program analysis tools with dynamic instrumen-
tation. In PLDI, pages 190–200, 2005.
[19] Nishad Manerikar and Themis Palpanas. Frequent items in streaming data:
An experimental evaluation of the state-of-the-art. Data Knowl. Eng.,
68(4):415–430, 2009.
[20] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts
over data streams. In VLDB, pages 346–357, 2002.
[21] David Melski and Thomas W. Reps. Interprocedural path profiling. In CC,
pages 47–62, 1999.
[22] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. An integrated
efficient solution for computing frequent and top-k elements in data streams.
ACM Trans. Database Syst., 31(3):1095–1133, 2006.
[23] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations
and Trends in Theoretical Computer Science, 1(2), 2005.
[24] Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavy-
weight dynamic binary instrumentation. In PLDI, pages 89–100, 2007.
[25] Seongbae Park. pthread get,setspecific vs thread local variable (Bits and
pieces). https://blogs.oracle.com/seongbae/entry/pthread_get_set_
specific_vs. (accessed September 2, 2012).
[26] Carl Ponder and Richard J. Fateman. Inaccuracies in program profilers.
Softw., Pract. Exper., 18(5):459–467, 1988.
[27] The GNU project. gcc - the GNU C Compiler. http://gcc.gnu.org. (ac-
cessed September 2, 2012).
[28] Subhajit Roy and Y. N. Srikant. Profiling k-iteration paths: A generalization
of the Ball-Larus profiling algorithm. In CGO, pages 70–80, 2009.
[29] Mauricio Serrano and Xiaotong Zhuang. Building approximate calling con-
text from partial call traces. In Proceedings of the 7th annual IEEE/ACM
International Symposium on Code Generation and Optimization, CGO ’09,
pages 221–230, 2009.
[30] J. Michael Spivey. Fast, accurate call graph profiling. Softw., Pract. Exper.,
34(3):249–264, 2004.
[31] John Whaley. A portable sampling-based profiler for Java virtual machines.
In Proceedings of the ACM 2000 Conference on Java Grande, pages 78–87.
ACM Press, 2000.
[32] Xiaotong Zhuang, Mauricio J. Serrano, Harold W. Cain, and Jong-Deok
Choi. Accurate, efficient, and adaptive calling context profiling. In PLDI,
pages 263–271, 2006.