Sapienza University of Rome
School of Information Engineering,
Computer Science, and Statistics
Master’s Thesis in
Engineering in Computer Science
Mining Hot Calling Contexts in Small Space
Advisor: Prof. Camil Demetrescu
Student: Daniele Cono D’Elia
A.Y. 2011/2012
Homines, dum docent, discunt.
(Seneca)
Introduction
A crucial problem in the development of computer software is identifying badly
designed portions of code that can hurt execution performance. An empirical
rule [15] states that 90% of the time is spent executing 10% of the code: hence,
one must pinpoint and optimize that 10% of the code to address performance
bottlenecks, since optimizing the remaining 90% would have little impact on
overall performance.
Dynamic program analysis encompasses the design and development of tools
for analyzing software by gathering information at execution time. Its main
applications include performance profiling, debugging, and program understanding.
Static techniques for performance profiling, based on the inspection of the
source (or sometimes the object) code and widely used in the past, are no longer
sufficient, mostly because of the pervasive use of dynamic libraries and of
polymorphism and late binding in object-oriented languages.
Traditional profilers such as gprof [14] and valgrind [24] provide information
about the execution frequency of single routines (vertex profiling) or of caller-
callee pairs (edge profiling). However, these data are sometimes insufficient to
accurately describe the dynamics of execution [26, 30]. Context-sensitive profiling
instead provides more valuable, finer-grained information for program
understanding, performance analysis, and runtime optimization. Collecting context
information in modern object-oriented software is very challenging: source
code is composed of a large number of small routines, and the high frequency
of function calls and returns may result in considerable profiling overhead and
huge amounts of data to be analyzed by the profiler.
The main goal of this thesis is to show how sophisticated techniques from
the data streaming computational model can be used to handle the huge amount
of profiling data gathered at execution time, thus achieving low overhead in
the implementation of a context-sensitive profiler. In particular, we adopted
and optimized two well-known algorithms for the frequent items problem, Lossy
Counting and Space Saving, and then realized two implementations of our context-
sensitive profiler. The first implementation is based on Intel Pin and makes it
possible to rapidly profile any C/C++ application, since the instrumentation
is performed at runtime. Although Pin is one of the most efficient and widespread
instrumentation frameworks, preliminary experimental results showed that
the overhead introduced for context-sensitive profiling was large.
For this reason, we developed a much more efficient profiler based on gcc.
Although instrumentation is now performed at compile time, the tool is very
flexible: changing a symbolic link or an environment variable is sufficient to
modify even crucial parameters of the profiler. This tool also allows for partial
instrumentation of the source code and outputs only the calling contexts
representative of the hot spots to which optimizations should be directed.
The experiments show that our profiler is able to correctly identify all the
hottest calling contexts, producing accurate performance metrics. The running
time overhead is kept under control using bursting techniques [2, 17, 32], while
the peak memory usage is only 1% of that of standard context-sensitive profilers. This
thesis is largely based on, and extends, the paper Mining Hot Calling Contexts in
Small Space [12], written by the author of this dissertation together with Profes-
sors Camil Demetrescu and Irene Finocchi (Sapienza University of Rome), and
published in Proceedings of the 32nd ACM SIGPLAN Conference on Program-
ming Language Design and Implementation, PLDI 2011, San Jose, CA, USA,
June 4-8, 2011. Source code and documentation related to this work are hosted
on Google Code (http://code.google.com/p/hcct/).
Thesis structure. The remainder of this thesis is structured as follows.
Chapter 1 describes performance profiling methodologies and the most widely used
program instrumentation tools. Chapter 2 describes the issues related to
context-sensitive profiling and our approach to the problem. Chapter 3 focuses on
implementation and engineering aspects, while Chapter 4 illustrates the outcome
of our experimental study. Conclusions are presented in Chapter 5.
Contents
1 Performance profiling methodologies
   1.1 Survey of techniques
      1.1.1 Profiling
      1.1.2 Static and dynamic analysis
      1.1.3 Collection of data
      1.1.4 Data type
   1.2 Data aggregation and representation
      1.2.1 Calling contexts
      1.2.2 Call graph
      1.2.3 Call tree
      1.2.4 Calling context tree
   1.3 Program instrumentation tools
      1.3.1 gprof
      1.3.2 Dyninst
      1.3.3 DynamoRIO
      1.3.4 Valgrind
      1.3.5 Pin
      1.3.6 gcc
   1.4 Conclusions
2 Mining hot calling contexts
   2.1 Previous approaches
   2.2 Frequent items in data streams
      2.2.1 Space Saving
      2.2.2 Lossy Counting
   2.3 HCCT: Hot Calling Context Tree
      2.3.1 Approach
      2.3.2 Update and query
      2.3.3 Discussion
   2.4 Conclusions
3 Implementation
   3.1 Space Saving (BSS and LSS)
   3.2 Lossy Counting (LC)
   3.3 The pintool
   3.4 The trace analyzer
   3.5 The gcc-based profiler
   3.6 Conclusions
4 Experimental evaluation
   4.1 Methodology
      4.1.1 Benchmarks
      4.1.2 Metrics
      4.1.3 Parameter tuning
   4.2 Experimental results
      4.2.1 Accuracy: exact HCCT
      4.2.2 Accuracy: (φ, ε)-HCCT
      4.2.3 Memory usage and cost of analysis
      4.2.4 Performance
   4.3 Conclusions
5 Conclusions
Appendix A: Using the tools
Bibliography
Chapter 1
Performance profiling
methodologies
1.1 Survey of techniques
1.1.1 Profiling
The development of computer software often leads to programs that are quite
large and typically modular, according to well-known principles of software
engineering. These programs usually consist of a large number of small
routines, written by several developers and possibly updated over time to introduce
enhancements and new functionalities, sometimes deviating from the software's
initial goals. In such large programs, it becomes important to pinpoint and
optimize the pieces of code that dominate the execution time. Indeed, according
to the 90-10 empirical rule [15], 90% of the time is spent executing 10% of the
code: in accordance with the well-known Pareto principle, this rule suggests
focusing optimization on that part of the program to achieve remarkable
improvements in efficiency and execution time.
The main goal of a performance profiler is to identify which parts of the
program should be optimized in order to improve the global execution speed.
Profilers have many other applications as well: for instance, optimal memory
allocation, intrusion detection, and the understanding of program behaviour.
CHAPTER 1. PERFORMANCE PROFILING METHODOLOGIES
1.1.2 Static and dynamic analysis
Static analysis techniques are based on the inspection of the source code (and
sometimes of the object code). The analysis is carried out by an automated tool
without actually executing the program. The granularity of analysis tools varies
from those that only consider individual statements to those that include the
complete source code in their analysis. Static analysis techniques were very
successful in past years, especially for procedural programming; nowadays
they are used mainly by compilers to apply optimizations to the code while
preserving its semantic correctness. Indeed, changes in development methodologies
over the last years have created the need for dynamic analysis of program
behaviour: polymorphism, late binding, and dynamically linked libraries are
emblematic of this need. A static analysis would be inaccurate, since the lack
of statically available information forces the profiler to make conservative
assumptions, thus undermining the validity of the analysis in several cases.
The development of dynamic analysis tools has also posed several methodological
challenges in different areas: building efficient instrumentation techniques that
introduce a moderate slowdown on execution time and do not interfere with the
internal logic of the analysed program (i.e., do not introduce heisenbugs);
designing efficient algorithms to process huge amounts of data with limited time
and resources; and dealing with the sometimes obscure optimizations introduced
by the compiler in the executable code.
1.1.3 Collection of data
A statistical profiler operates by sampling: it usually probes the target program's
program counter at regular intervals using operating system interrupts. The
analysis performed by a statistical profiler is the result of a statistical
approximation, with a high error probability for infrequent functions (sometimes
they are not detected at all). In practice, these profilers can often provide a
more accurate analysis than other approaches, since they are not as intrusive to
the analyzed program (near full-speed execution is achievable, so they can detect
issues that would otherwise be hidden) and do not have as many side effects (such
as on memory caches). Moreover, a statistical profiler can show the amount of
time approximately spent in user mode versus kernel mode by every analyzed rou-
tine. Several drawbacks due to the usage of operating system interrupts can be
overcome by using dedicated hardware that captures the program counter without
interfering with the execution of the program under analysis.
An instrumenting profiler modifies the target program with additional instruc-
tions to collect the required information. Instrumenting the program will always
have an impact on the program execution, typically introducing a slowdown and
potentially causing inaccurate results or heisenbugs. However, instrumenting a
program in a careful way can lead to a minimal impact and a very high degree
of accuracy in the results. For this reason, the profiler we implemented relies on
instrumentation rather than on sampling. The impact of instrumentation on a
particular program depends on many factors: the placement of instrumentation
points, the mechanism used to capture information, and the inner structure of
the program to be analyzed. Several instrumentation methodologies are available
nowadays: manual (i.e. performed by the programmer, by adding instructions
to extract profiling information), compiler assisted, binary translation, runtime
instrumentation, and runtime injection.
An event-based profiler is typical of languages such as Java, .NET, Python,
and Ruby. In this case the runtime provides hooks to profilers for trapping
events such as method entry and exit, object creation, and exceptions. These
profilers are usually very flexible: for instance, it is possible to vary both
the profiled events and the sampling frequency.
1.1.4 Data type
Profilers can also be classified according to the type of information they collect;
then a further distinction can be made according to the different representations
of the gathered information. There are two main categories: profilers that output
counters for instructions/routines, and profilers that estimate the fraction
of execution time spent in each routine. Counters are usually presented
in a simple table, while for time measures there could be more than one value
associated with each routine (for instance, time spent in user mode and in kernel
mode).
It is important to stress the fact that counters are not necessarily repre-
sentative of or proportional to the execution time of the functions. Moreover,
the completion time may vary among different invocations of the same function.
When the goal is to estimate accurately the distribution of execution time, a
counter-based profiler is not the right choice. However, counters are very useful
in a variety of scenarios and can be used in several contexts. For instance, the ex-
act number of invocations of a function can be used to verify the correctness of an
algorithm, or to identify algorithms whose complexity is not adequate to the tasks
they have to complete. Moreover, an accurate analysis of counters can provide
a rough estimation of the execution time for parts of code and help to improve
the performance and refactor already implemented algorithms. Finally, counters
can also be used to identify subtle programming errors and which portions of the
code are executed (for instance, to check whether a new implementation of an
abstraction has completely replaced the old one).
Profilers that estimate time measures operate by approximation: measuring
the exact execution time of each instruction or routine would introduce too
large a slowdown in global execution speed. For this reason, these profilers are
based on sampling techniques as described in the previous paragraph.
1.2 Data aggregation and representation
In this thesis we focus on counter-based profiling, although our approach can be
easily extended to other metrics such as execution time or cache misses. Several
data structures have been proposed over the years to maintain information
about interprocedural control flow1.
The design of a data structure to store profiling information is strictly related
to the aggregation level of such information. A vertex profiler stores the number
of invocations of each function and distinguishes neither the caller nor the context.
An edge profiler instead stores the number of invocations for each ordered pair of
caller-callee routines. This terminology derives from the graphical representation
1An intraprocedural data flow analysis operates on a control-flow graph for a single method;
an interprocedural analysis operates across function boundaries.
of the control flow of a program: every function is represented as a
vertex; two vertices A and B are linked by a directed edge A → B if function
A invokes function B; and a path is a sequence of vertices that can be traversed
starting from a node and following existing edges.
Ball and Larus [4] introduced an intraprocedural path profiling technique to
efficiently determine how many times each acyclic path in a routine executes,
extending the more common edge profiling. Melski and Reps [21] have proposed
interprocedural path profiling to capture both inter- and intra-procedural control
flow; however, their approach suffers from scalability issues, since the number of
paths existing across procedure boundaries can be very large.
1.2.1 Calling contexts
A context-sensitive analysis is an interprocedural analysis that takes into account
the calling context when analyzing the target of a function call. A calling context
is a sequence of routine calls that are concurrently active on the run-time stack
and that lead to a program location. The utility of calling context information
was already clear in the 80s; nowadays, however, collecting context information
in modern object-oriented software is very challenging: since applications are
often structured in a large number of small routines, the high frequency of
function calls and returns might result in considerable profiling overhead, huge
amounts of data to be analyzed by the programmer, and heisenbugs. For this
reason, edge profilers are still the most widely used profilers today. On the
other hand, both context-sensitive and path profilers provide more valuable
information than edge profilers, while it is still possible to reconstruct an
edge profile from this information [5].
We will now define three well-known data structures for counter-based profilers.
1.2.2 Call graph
A call graph (CG) is a succinct representation of function invocations: every
function is represented as a vertex, and every directed arc shows a caller-callee
relationship between two functions. The use of call graph profiles was pioneered
in the 80s by gprof [14]. Although this representation is very space- and time-
efficient, it can lead to misleading results [26, 30] and to an incomplete
understanding of the behaviour of the program. For instance, in Figure 1.1 the call
graph is missing the information that routine D is invoked by C only when C is
invoked by B (calling context A→B→C→D occurs twice, while A→C→D
does not appear in the execution). Call graphs are used by edge profilers and are
not suited for the representation of context-sensitive information.
Figure 1.1: Call graph, Call tree, Calling Context Tree (from left to right).
1.2.3 Call tree
In a call tree (CT) each node represents a different routine invocation: a new node
is added for each method invocation and then attached to the parent node (i.e.,
the caller method). This yields very accurate context information, but requires
extremely large (possibly unbounded) space.
1.2.4 Calling context tree
The inaccuracy of call graphs and the huge size of call trees have motivated the
introduction [1] of calling context trees (CCT). The CCT is a succinct summary
of the call tree, built recursively from the root by merging, for each node, the
children representing different invocations of the same method: in other words,
CCTs do not distinguish between different invocations of the same routine within
the same context. The edge weight of a CCT represents the number of nodes merged
into the target node, i.e., the frequency with which the callee is invoked by the
caller (the parent node) within the particular calling context of the caller. This
representation takes advantage of the tree structure by encoding every calling
context as a single node: the context can be easily reconstructed by moving along
the path from the root to the target node. While preserving good accuracy, CCTs
are typically several orders of magnitude smaller than call trees: the presence of
recursion may heavily increase the number of nodes, but this scenario is infrequent
in practice.
However, as noticed in previous works [7, 32], even CCTs may be very large
and difficult to analyze in several applications. The exhaustive approach to the
construction of a CCT is based on the instrumentation of each routine call and
return, incurring considerable slowdown even when using efficient instrumenta-
tion mechanisms [1, 32]: these two issues will be addressed and analyzed in more
detail in Chapter 2.
1.3 Program instrumentation tools
Here follows a survey of the most used instrumentation tools and frameworks.
1.3.1 gprof
This profiler was introduced in 1982 [14] by a group of researchers at the
University of California, Berkeley, as an extension of the older Unix prof tool.
It uses a hybrid of instrumentation and sampling. Instrumentation is done by the
compiler (e.g., the -pg option of gcc) and is used to count how many times each
procedure of the program is invoked; moreover, the call graph edge corresponding
to each invocation is also recorded. The time spent in each routine is estimated
by statistical sampling: the program counter of the monitored software is probed
at regular intervals using operating system interrupts (e.g., programmed via the
profil syscall). A typical sampling period is 0.01 seconds (10 milliseconds). In
2004 the gprof paper appeared on the list of the 50 most influential PLDI papers
of all time: it revolutionized the performance analysis field and became a
milestone and the tool of choice for many developers.
1.3.2 Dyninst
Dyninst [9] is a multi-platform API for runtime code-patching developed as part
of the Paradyn project at the University of Wisconsin-Madison and University
of Maryland. The goal of this API is to provide a machine independent interface
to permit the creation of tools and applications that use runtime code patch-
ing. Several APIs are available, such as the InstructionAPI for decoding and
representing machine instructions in a platform-independent manner, and the
StackwalkerAPI for walking the run-time stack of a process.
1.3.3 DynamoRIO
DynamoRIO [8] is a runtime code manipulation system: it supports code trans-
formations while the target program executes. Unlike many frameworks, it allows
arbitrary modifications to application instructions (a powerful instruction manip-
ulation library for IA-32 and AMD64 is provided). DynamoRIO’s API abstracts
away the details of the underlying infrastructure and allows the tool builder to
implement efficient dynamic tools for a wide variety of uses, such as program
analysis, profiling, and optimization.
1.3.4 Valgrind
This tool [24] was originally designed as a free memory debugging tool for
Linux, but has since evolved into a generic framework for creating dynamic
analysis tools for memory debugging, memory leak detection, and profiling.
Valgrind is essentially a virtual machine using just-in-time (JIT) compilation
techniques: nothing from the original program ever runs directly on the host
processor. After an initial translation into an intermediate processor-neutral
format, a Valgrind-based tool is free to make any transformation on the code
before Valgrind translates it back into machine code and executes it. Although a
considerable amount of performance is lost in these transformations, the
intermediate format is much more suitable for instrumentation than the original.
Several tools are included with Valgrind, such as Memcheck for memory usage
inspection and Callgrind for the construction of the call graph.
1.3.5 Pin
Pin [18] is an instrumentation system developed by Intel and the University of
Colorado. The goal of this project is to provide easy-to-use, portable, transparent,
and efficient instrumentation. Instrumentation tools (called pintools) are written
in C/C++ using the Pin API, which is designed to be as architecture-independent
as possible. Pin uses just-in-time compilation to instrument executables
while they are running; several techniques are used to optimize instrumentation,
such as inlining, register re-allocation, and instruction scheduling. According to
its authors, this fully automated approach delivers significantly better
instrumentation performance than similar tools such as DynamoRIO and Valgrind.
Another interesting feature of Pin is the possibility to attach to a running
process, instrument it, collect profiles, and eventually detach. Pin preserves the
original behaviour of the analyzed application by providing instrumentation
transparency: the application observes the same addresses and values as it would
in a native execution, which makes the information collected by instrumentation
more accurate and relevant. At the highest level of its architecture, Pin consists
of a virtual machine, a code cache, and an instrumentation API; the virtual
machine consists of a JIT compiler, an emulator (for instructions that cannot be
executed directly, such as system calls), and a dispatcher. When an instrumented
program is running, three binary programs are present in memory: the application,
Pin, and the pintool.
1.3.6 gcc
The GNU C Compiler [27] includes an experimental and somewhat obscure feature:
the -finstrument-functions option. This feature was originally implemented
by Cygnus Solutions and makes the compiler insert calls to two special
user-defined functions at the entry and exit of every function; the only
information provided to the user is a pair of parameters containing the addresses
of the routine and of the call site. These two functions can be used to perform
compiler-assisted instrumentation of any C/C++ application on Linux, achieving a
good degree of performance while maintaining full control of program execution.
Profiling tools can be implemented as dynamic libraries to be linked to programs
compiled with the above-mentioned option. Clearly, the developer of a tool has to
deal manually with aspects such as program launch and termination, multithreading,
etc.; the only functionality provided by gcc is the interception of routine enter
and exit events. Although this may seem discouraging at first glance, this thesis
shows that it is possible to write an efficient gcc-based profiler that handles
these aspects in a limited number of lines of code.
1.4 Conclusions
In this chapter we have provided an overview of general performance profiling
techniques, showing the importance of dynamic program analysis tools for
pinpointing and optimizing the hot spots that dominate the execution time.
Calling context trees have been presented as the main data structure for
compactly representing the context-sensitive profiling information gathered
during the execution of a program. Several instrumentation frameworks are
available to developers nowadays: in particular, Pin can be used to quickly
implement analysis tools on several platforms, while the GNU C Compiler offers a
compiler-assisted instrumentation feature for developing highly tuned tools with
low overhead.
Chapter 2
Mining hot calling contexts
In this chapter we will first discuss some state-of-the-art techniques for
context-sensitive profiling. We will then present the frequent items problem in
the data streaming computational model and two well-known space-efficient
algorithms that solve it. Finally, we will introduce our novel space-efficient
approach, discussing the main choices behind its design and its relation to
other techniques.
2.1 Previous approaches
In the previous chapter we highlighted two relevant issues for context-sensitive
profiling: the exhaustive approach to the construction of the CCT introduces
considerable slowdown even when using efficient instrumentation mechanisms, and
CCTs may be very large and difficult to analyze in several applications. In this
section we will discuss some of the most relevant approaches that have been
proposed in the literature over the past years.
Early approaches. The utility of calling context information was already clear
to the authors of gprof: the tool approximates context-sensitive profiles by
associating procedure timings with caller-callee pairs rather than with single
procedures. This level of context sensitivity, however, may yield severe
inaccuracies [26, 30]. Since exhaustive instrumentation can lead to a large slowdown,
Bernat and Miller [6] proposed generating path profiles that include only the
methods of interest in order to reduce overhead. A very popular approach is based
on sampling [3, 13, 16, 31]: the run-time stack is periodically sampled and the
context is reconstructed through stack-walking. For call-intensive programs,
sample-driven
stack-walking can be significantly (i.e., at least one order of magnitude) faster
than exhaustive instrumentation. This approach, however, can lead to a signif-
icant loss of accuracy with respect to the exhaustive construction of the CCT:
results may be highly inconsistent in different executions, and sampling guaran-
tees neither high coverage [7] nor accuracy of performance metrics [32].
Adaptive bursting. Several works explore the combination of sampling with
bursting [2, 17, 32]. Most recently, Zhuang et al. suggested performing
stack-walking followed by a burst during which the profiler traces every routine
enter and exit event: experimental results show that adaptive bursting can yield
much more accurate results than traditional sample-based stack-walking1.
While static bursting collects a highly accurate profile, it can still cause
significant performance degradation: the authors propose an adaptive mechanism
that dynamically disables profile collection for previously observed calling
contexts and periodically re-enables it according to a re-enablement ratio
parameter, which reflects the trade-off between accuracy and overhead. Their
profiler is based on the Java Virtual Machine Profiler Interface (JVMPI), and
their experiments used a sampling rate of 10 ms and a burst length of 0.2 ms.
Reducing space. A few previous works have addressed techniques to reduce
profile data (or at least the amount of data presented to the user) in
context-sensitive profiling. Bernat and Miller [6] suggested letting the user
choose a subset of routines to be analyzed. More recently, Bond and McKinley [7]
introduced a new approach called probabilistic calling context that continuously
maintains a probabilistically unique value representing the current calling
context. The proposed representation is extremely compact (just a 32-bit value
per context) and, according to the authors, the average overhead introduced in
the Java virtual machine is around 3%. This approach is efficient and accurate
enough for tasks such as residual testing, bug detection, and intrusion
detection. However, it
1According to the authors, a low-overhead solution using sampled stack-walking alone is less
than 50% accurate, based on degree of overlap with a complete CCT.
cannot be used for profiling with the purpose of understanding and improving
application performance, where it is crucial to maintain for each context the
sequence of active routine calls along with performance metrics. Moreover, Bond
and McKinley target applications where coverage of both hot and cold contexts
is necessary; this is not the case in performance analysis, where identifying a few
hot contexts is typically sufficient to guide code optimization.
2.2 Frequent items in data streams
In recent years, much effort has been devoted to designing algorithms that analyze massive data streams on a near-real-time basis; in these streams, input data often arrive at a high rate and cannot be stored due to their possibly unbounded size [23]. This line of research has been mainly motivated
by networking and database applications, in which streams may be very long
and stream items may also be drawn from a very large universe. Streaming
algorithms are designed to address these requirements, providing approximate
answers in contexts where it is impossible to obtain an exact solution using only
limited space.
These techniques can be adapted to analyze profiling information gathered during a program's execution. In fact, the stream of routine enter and exit events collected by a profiler is a good example of a massive data stream: items belong to a possibly large universe (consider an application composed of many small routines frequently invoked from different call sites) and the stream itself is very long. Moreover, in order to keep the introduced slowdown under control, the timing requirements for processing each item are strict.
The Frequent Items (a.k.a. heavy hitters) problem [22] has been extensively studied in data streaming computational models. Given a frequency threshold φ ∈ [0, 1] and a data stream of length N, the problem (in its simplest formulation) is to find all items that appear more than ⌊φN⌋ times, i.e., whose frequency exceeds ⌊φN⌋. For instance, for φ = 0.01 the problem is to find all items that appear at least 1% of the time. It can be proved that any algorithm providing an exact solution requires Ω(N) space [23]. Therefore, research has focused on solving an approximate version of the problem [20, 22]:
Definition 1 (ε-Deficient Frequent Items problem). Given two parameters φ, ε ∈ [0, 1], with ε < φ, return all items with frequency ≥ ⌊φN⌋ and no item with frequency ≤ ⌊(φ − ε)N⌋. The absolute difference between estimated and true frequency is at most εN.
In the approximate solution, false negatives cannot exist: all frequent items are correctly reported. Some false positives are allowed instead; however, their real frequency is guaranteed to be at most εN below the threshold ⌊φN⌋. Many
different algorithms for computing (φ, ε)-heavy hitters have been proposed in the
literature in the last ten years. According to extensive experimental studies [19], counter-based algorithms offer better performance and ease of implementation than sketch-based algorithms. Counter-based algorithms track a subset of items
from the input and monitor counts associated with them; sketch-based algorithms
act as general data stream summaries, and can be used for other types of approx-
imate statistical analysis of the data stream, apart from being used to find the
frequent items.
In a preliminary analysis we considered three counter-based algorithms: Space Saving, Sticky Sampling, and Lossy Counting. Space Saving [22] and Lossy Counting [20] are deterministic and use 1/ε and (1/ε)·log(εN) entries, respectively. Sticky Sampling [20] is probabilistic: it fails to produce the correct answer with a small probability, say δ, and uses at most (2/ε)·log(φ⁻¹δ⁻¹) entries in its data structure. In our experiments, Sticky Sampling consistently proved to be less efficient and accurate than its competitors, so we will not mention it any further.
2.2.1 Space Saving
Space Saving [22] monitors a set of ⌈1/ε⌉ = M triples of the form (item, count_i, ε_i), initialized respectively with the first ⌈1/ε⌉ distinct items, their exact counts, and zero. After the initialization phase, when an item q is observed in the stream, the update operation works as follows:
• if q is monitored, the corresponding counter is incremented;

• if q is not monitored, the (item, count_i, ε_i) triple with the smallest count is chosen as a victim: its item is replaced with q and its count is incremented. Moreover, ε_i is set to the count value of the victim triple: it measures the maximum possible estimation error.
The update time is bounded by the cost of the dictionary operation that checks whether an item is monitored. Heavy-hitter queries are answered by returning the monitored entries such that count > ⌈φN⌉. Space Saving has the following properties [22]:
1. The minimum counter value min is no greater than ⌊εN⌋;

2. For any monitored element e_i, 0 ≤ ε_i ≤ min, that is, f_i ≤ (f_i + ε_i) = count_i ≤ f_i + min. In other words, count_i overestimates the true frequency f_i by at most min;

3. Any element whose true frequency is greater than min is monitored;

4. The algorithm uses min(|A|, ⌈1/ε⌉) counters, where |A| is the cardinality of the universe from which the items are drawn. This can be proved to be optimal;

5. Space Saving solves the ε-Deficient Frequent Items problem, reporting all elements with frequency greater than ⌈φN⌉ such that no reported element has true frequency less than ⌈(φ − ε)N⌉;

6. Items whose true frequency is greater than ⌈(φ − ε)N⌉ but less than ⌈φN⌉ could be output: these are the so-called false positives.
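The update rule can be sketched in C as follows. This is a minimal sketch under simplifying assumptions: a tiny fixed M, a linear scan in place of the constant-time dictionary and Stream-Summary structure of [22], and illustrative names such as `ss_update` and `ss_entry_t`.

```c
#include <assert.h>

#define M 3  /* number of monitored triples, i.e. ceil(1/eps); tiny for illustration */

typedef struct {
    long item;           /* monitored item (-1 marks an unused slot) */
    unsigned long count; /* estimated frequency */
    unsigned long eps;   /* maximum possible overestimation */
} ss_entry_t;

/* Process one stream item q with the Space Saving update rule: increment
   q's counter if it is monitored, otherwise replace the minimum entry and
   remember the victim's count as the error bound for q. */
static void ss_update(ss_entry_t *t, long q) {
    int i, min_i = 0;
    for (i = 0; i < M; i++) {
        if (t[i].item == q) { t[i].count++; return; }
        if (t[i].count < t[min_i].count) min_i = i;
    }
    t[min_i].eps = t[min_i].count;  /* victim's count bounds the error */
    t[min_i].item = q;
    t[min_i].count++;
}
```

During the startup phase the empty slots (count 0) play the role of minimum entries, so the first distinct items are simply installed with exact counts and ε_i = 0.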
2.2.2 Lossy Counting
Lossy Counting [20] maintains a set M of (item, count, ∆) triples, where count represents the exact frequency of the item since it was last inserted in M and ∆ is the maximum possible underestimation in count: the algorithm guarantees that at any time the true frequency of a monitored item is at most count + ∆. While Space Saving overestimates counters, Lossy Counting takes the opposite approach.
The incoming data stream is conceptually divided into buckets of width ⌈1/ε⌉. During the i-th bucket, if the observed item already exists in M, the corresponding count is incremented; otherwise, a new entry is inserted with count = 1 and ∆ = i − 1. At the end of each bucket, M is pruned by deleting all triples such that (count + ∆) ≤ i. Heavy-hitter queries are answered by returning the entries in M such that count ≥ ⌊(φ − ε)N⌋. Lossy Counting has the following properties [20]:
1. Lossy Counting solves the ε-Deficient Frequent Items problem, reporting all elements with frequency greater than φN such that no reported element has true frequency less than (φ − ε)N;

2. Estimated frequencies are less than the true frequencies by at most εN, i.e., count ≤ f_e ≤ count + εN;

3. If an element e does not appear in M, then f_e ≤ εN;

4. Lossy Counting computes an ε-deficient synopsis using at most (1/ε)·log(εN) entries. However, if we assume that in real-world datasets elements with very low frequency (at most εN/2) tend to occur more or less uniformly at random, then Lossy Counting requires no more than 7/ε space in the worst case.
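The bucket scheme above can be sketched in C. This is a minimal sketch, not the thesis implementation: an unordered linked list stands in for the real data structure, `W` plays the role of the bucket width ⌈1/ε⌉, and all names are illustrative.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct lc_entry {
    long item;
    unsigned long count;  /* exact count since insertion */
    unsigned long delta;  /* maximum possible undercount */
    struct lc_entry *next;
} lc_entry_t;

static lc_entry_t *monitored = NULL;  /* unordered list of monitored items */
static unsigned long N = 0;           /* stream length so far */
static const unsigned long W = 4;     /* bucket width, ceil(1/eps); tiny here */

/* Process one stream item; prune the list at every bucket boundary. */
static void lc_update(long item) {
    lc_entry_t *e, **p;
    N++;
    for (e = monitored; e; e = e->next)
        if (e->item == item) { e->count++; break; }
    if (!e) {  /* not monitored: insert with delta = current bucket - 1 */
        e = malloc(sizeof *e);
        e->item = item; e->count = 1; e->delta = (N - 1) / W;
        e->next = monitored; monitored = e;
    }
    if (N % W == 0) {  /* end of bucket b = N/W: delete low-count entries */
        unsigned long b = N / W;
        for (p = &monitored; *p; )
            if ((*p)->count + (*p)->delta <= b) {
                lc_entry_t *dead = *p; *p = dead->next; free(dead);
            } else
                p = &(*p)->next;
    }
}
```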
2.3 HCCT: Hot Calling Context Tree
As suggested in Section 2.2, the execution trace of routine invocations and terminations can be naturally regarded as a stream of items. Each item is a triple of the form (routine address, call site, event type). Table 2.1 reports essential information about some well-known Linux applications. It is easy to notice that the number of nodes of the call graph (i.e., the number of distinct routines) is small compared to the stream length (i.e., the number of nodes of the call tree), even for large-scale applications such as gimp, inkscape, and those belonging to the OpenOffice suite. Hence, a non-contextual profiler can maintain a hash table of size proportional to the number of distinct routines, using routine names as hash keys to efficiently update the corresponding metrics.
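As a sketch, such a non-contextual (vertex) profiler might look like this in C: a hash table keyed by routine address with linear probing. All names (`count_call`, `TABLE_SIZE`) and the probing scheme are illustrative assumptions, not the thesis code.

```c
#include <assert.h>

#define TABLE_SIZE 1024  /* illustrative; a real profiler sizes this to the call graph */

typedef struct {
    void *routine;        /* routine address used as hash key */
    unsigned long calls;  /* execution frequency metric */
} slot_t;

static slot_t table[TABLE_SIZE];

/* Record one invocation of a routine: hash its address and
   resolve collisions by linear probing (no resizing). */
static void count_call(void *routine) {
    unsigned long h = ((unsigned long)routine >> 4) % TABLE_SIZE;
    while (table[h].routine && table[h].routine != routine)
        h = (h + 1) % TABLE_SIZE;
    table[h].routine = routine;
    table[h].calls++;
}
```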
In the case of contextual profiling, the number of distinct calling contexts
(i.e., the number of nodes of the CCT) is too large: hashing would be inefficient,
and for some applications it would even be impossible to maintain the tree in
main memory. Consider, for instance, oocalc: the optimistic assumption that
Application     |Call graph|   Call sites        |CCT|   |Call tree|
amarok              13 754      113 362     13 794 470   991 112 563
ark                  9 933       76 547      8 171 612   216 881 324
audacity             6 895       79 656     13 131 115   924 534 168
bluefish             5 211       64 239      7 274 132   248 162 281
dolphin             10 744       84 152     11 667 974   390 134 028
firefox              6 756      145 883     30 294 063   625 133 218
gedit                5 063       57 774      4 183 946   407 906 721
ghex2                3 816       39 714      1 868 555    80 988 952
gimp                 5 146       93 372     26 107 261   805 947 134
gnome-sudoku         5 340       49 885      2 794 177   325 944 813
gwenview            11 436       86 609      9 987 922   494 753 038
inkscape             6 454       89 590     13 896 175   675 915 815
oocalc              30 807      394 913     48 310 585   551 472 065
ooimpress           16 980      256 848     43 068 214   730 115 446
oowriter            17 012      253 713     41 395 182   563 763 684
pidgin               7 195       80 028     10 743 073   404 787 763
quanta              13 263      113 850     27 426 654   602 409 403
vlc                  5 692       47 481      3 295 907   125 436 877

Table 2.1: Number of nodes of the call graph, call tree, and calling context tree, and number of distinct call sites for different applications.
each CCT node requires 20 bytes2 already results in almost 1 GB just to store a CCT with 48 million nodes for a usage session of a few minutes of the spreadsheet component of the OpenOffice suite. As we will report in Chapter 4, the CCTs
associated with some benchmarks of the SPEC CPU2006 suite cannot be kept
in main memory since the 3 GB user space memory limit for data is reached in
a few minutes. Motivated by the fact that execution traces are typically very
long (i.e., hundreds of millions of routine enter and exit events) and their items
(i.e., calling contexts) are taken from a large universe, we cast the problem of
identifying the most frequent contexts into a data streaming setting.
The choice to focus on most frequent contexts is motivated by several reasons.
2Previous works use larger nodes [1, 30].
Scalability is a key issue: even the approach based on an extremely compact representation (32 bits per context) proposed by Bond and McKinley [7] can run out of memory on large traces with tens of millions of contexts. Probabilistic
calling context was not designed for profiling with the purpose of understanding
and improving application performance, where it is crucial to maintain for each
context the sequence of concurrently active routine calls along with performance
metrics. In this scenario, only the most frequent contexts are of interest, since
they represent the hot spots to which optimizations must be directed according
to the previously mentioned 90-10 rule [15]. In fact, the cumulative distribution
of calling context frequencies shown in Figure 2.1 is extremely skewed: about
90% of the total number of calls takes place in the top 10% of calling contexts
ordered by decreasing frequencies (i.e., from the hottest to the coldest). This
skewness suggests that space could be greatly reduced by keeping information
about hot contexts only and discarding on the fly contexts that are likely to be
cold. As observed in [32]: “Accurately collecting information about hot edges
may be more useful than accurately constructing an entire CCT that includes
rarely called paths.”
2.3.1 Approach
In this thesis we define a novel run-time data structure, called Hot Calling Context
Tree (HCCT), that compactly represents all the hot calling contexts encountered
during a program’s execution. The HCCT is a subtree of the CCT that includes
only hot nodes and their ancestors3 and offers an additional intermediate point in
the spectrum of data structures for representing interprocedural control flow (see
Section 1.2). For each hot calling context, the HCCT also maintains an estimate
of its performance metrics (in this thesis we focus on frequency counts, although
our approach can be easily extended to arbitrary metrics such as execution time
and cache misses).
Let N be the number of calling contexts encountered during a program’s
execution. Clearly, N equals the number of routine invocations in the execution
trace, the number of nodes of the call tree, as well as the sum of the frequency
3Ancestors are needed to reconstruct the calling context starting from the root node.
[Figure 2.1 about here: cumulative distribution plot. x-axis: % of hottest calling contexts; y-axis: % of the total number of calls (degree of overlap with full CCT); curves for amarok, audacity, firefox, gedit, oocalc, pidgin, quanta, vlc.]

Figure 2.1: Skewness of calling contexts distribution.
counts of CCT nodes. Given a frequency threshold parameter φ ∈ [0, 1], we classify a calling context as hot if the frequency count of the corresponding CCT node is greater than ⌊φN⌋; otherwise, the context is considered cold. We define the Hot Calling Context Tree as the unique subtree of the CCT obtained by pruning all cold nodes that are not ancestors of a hot node. It is worth noticing that all the leaves of the HCCT are necessarily hot, while the root and intermediate nodes can correspond to either hot or cold contexts. In graph theory, the HCCT corresponds to the Steiner tree of the CCT with the hot nodes used as terminals.
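Under this definition, the HCCT can be obtained from a full CCT by a bottom-up marking pass: a node survives if and only if it is hot or has a hot descendant. A minimal sketch over a first-child, next-sibling tree (illustrative node layout and names, not the thesis data structure):

```c
#include <assert.h>

typedef struct node {
    unsigned long count;        /* frequency count of the calling context */
    struct node *first_child;
    struct node *next_sibling;
    int keep;                   /* set if node is hot or has a hot descendant */
} node_t;

/* Mark the HCCT subtree: a node is kept when its count exceeds the
   hotness threshold or when any descendant is kept (Steiner-tree
   closure under ancestors). Returns the node's keep flag. */
static int mark_hcct(node_t *n, unsigned long threshold) {
    int keep = n->count > threshold;
    for (node_t *c = n->first_child; c; c = c->next_sibling)
        keep |= mark_hcct(c, threshold);
    return n->keep = keep;
}
```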
According to the space lower bound for the heavy hitters problem [23], the HCCT cannot be computed exactly in space smaller than that required to store the entire CCT. Hence, we relax the problem and compute an approximation of the HCCT, which we denote by (φ, ε)-HCCT. The parameters ε and φ control the degree of approximation and have the same meaning as in the ε-Deficient Frequent Items problem discussed in Section 2.2. The (φ, ε)-HCCT contains all
[Figure 2.2 about here: three example trees with per-node frequency counts; the false positives are marked in (c).]

Figure 2.2: (a) CCT; (b) HCCT; and (c) (φ, ε)-HCCT. Hot nodes are darker. The number close to each node is the frequency count of the corresponding calling context. In this example N = 581, φ = 1/10, and ε = 1/30: the approximate HCCT includes all contexts with frequency ≥ ⌊φN⌋ = 58 and no context with frequency ≤ ⌊(φ − ε)N⌋ = 38.
hot nodes (true positives), but may also contain some cold nodes without hot descendants (false positives). The true frequency of these false positives, however, is guaranteed to be at most εN far from ⌊φN⌋: hence, it is still close to the hotness threshold. Differently from the HCCT, a (φ, ε)-HCCT is not uniquely defined as a subtree of the CCT, since the set of spanned (φ, ε)-heavy hitters is not unique: depending on the behaviour of the adopted algorithm, nodes with frequency smaller than ⌊φN⌋ and larger than ⌊(φ − ε)N⌋ may or may not be included in such a set. Figure 2.2 provides a graphical comparison of the three tree structures.
With the help of the Venn diagram in Figure 2.3, some important relations
can be identified:
• H ⊆ A, where H is the set of hot contexts and A is the set of (φ, ε)-heavy hitters. Nodes in A \ H are false positives.
• H⊆HCCT. Nodes in HCCT \H are cold nodes that have a hot descendant
in H, hence must be kept in the HCCT.
Figure 2.3: Tree data structures and calling contexts classification.
• A⊆ (φ, ε)-HCCT. Nodes in (φ, ε)-HCCT \A are cold nodes that have a
descendant in A, hence must be kept in the (φ, ε)-HCCT.
• HCCT⊆ (φ, ε)-HCCT, as implied by the previous inclusions. Both of them
are connected subtrees of the full CCT.
In Figure 2.3 two additional sets are shown: the set M of monitored nodes
and the subtree MCCT spanning all nodes in M. The counter-based streaming
algorithms used to compute the set A of (φ, ε)-heavy hitters monitor a slightly
larger set M⊇A. When a most frequent elements query is issued, these algo-
rithms extract A from M and return it. To keep track of the context-sensitive
information, we maintain the subtree MCCT of the CCT consisting of the nodes
in M and of all their ancestors (hence, MCCT⊇ (φ, ε)-HCCT). At query time,
the MCCT is appropriately pruned and the (φ, ε)-HCCT is returned.
2.3.2 Update and query
The set M of monitored contexts must be updated at each function call; more-
over, when M is changed, the MCCT needs to be brought up to date as well.
To describe how this happens, we introduce two abstract main functions to be
provided by the interface of the streaming algorithm:
update(x, M) → v: given a calling context x, update M to reflect the new occurrence of x in the stream. If x was already monitored in M, its frequency count is increased by one; otherwise, x is added to M and a victim context v may be evicted from M and returned.
query(M)→ A: remove low-frequency items from M and return the subset A of
(φ, ε)-heavy hitters (see Section 2.3.1 and Figure 2.3).
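As a sketch, the two abstract operations can be exposed behind a small plug-in interface so that different streaming algorithms are interchangeable. The names (`stream_algo_t`, `dummy_update`) are hypothetical, and the trivial single-counter "algorithm" below exists only to exercise the interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plug-in interface mirroring update(x,M) -> v and query(M) -> A.
   Contexts are modeled as long ids for simplicity. */
typedef struct {
    long (*update)(long x);                  /* returns evicted victim, or -1 */
    size_t (*query)(long *out, size_t cap);  /* fills out with heavy hitters  */
} stream_algo_t;

/* A trivial "algorithm" that monitors a single item. */
static long the_item = -1;
static unsigned long the_count = 0;

static long dummy_update(long x) {
    long victim = (x == the_item || the_item == -1) ? -1 : the_item;
    if (x != the_item) { the_item = x; the_count = 0; }
    the_count++;
    return victim;
}

static size_t dummy_query(long *out, size_t cap) {
    if (cap && the_item != -1) { out[0] = the_item; return 1; }
    return 0;
}
```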
Details on the implementation of update and query depend on the specific
streaming algorithm. It is worth mentioning that for Lossy Counting the update
operation does not return a victim since the pruning is performed on a periodical
basis, while for Space Saving a victim is evicted from M every time a new item
is added to M after the initial startup phase4.
During the construction of the MCCT, a pointer to the current calling context is maintained, and a new node is created if the current context x is not already in the tree. The MCCT is pruned periodically in the case of Lossy Counting, or according to the victim context returned by the update operation in the case of Space Saving; for both algorithms, the pseudocode of the pruning procedure is the same and is reported in Listing 2.1.
It is important to notice that in order to keep the tree connected, nodes can
be removed from the MCCT only if they are leaves. An internal node with a
hot descendant can be removed instead only from the set of monitored elements.
Note that removing a victim might expose a path of unmonitored ancestors that
no longer have descendants in M: these nodes are pruned as well in a bottom-
up recursive manner. The current context x is never removed from the MCCT:
this guarantees that no node in the path from the root of the tree to x will be
removed, since at least one of these nodes will be needed to represent the next
calling context to be observed in the execution.
v = update(x, M)
while (v != x && v is a leaf in MCCT && v is not in M) do
    remove v from MCCT
    v = parent(v)

Listing 2.1: prune(x, MCCT)
4Startup phase ends when the number of monitored elements reaches d 1εe.
The (φ, ε)-HCCT can be computed from the MCCT in an analogous way.
The query operation invoked on M returns the support A of the (φ, ε)-HCCT:
pruning is performed on all MCCT nodes that have no descendant in A, following bottom-up path traversals as in Listing 2.1.
2.3.3 Discussion
The space required to store the heavy hitters in the data structure M depends on
the specific streaming algorithm in use, and in both cases is roughly proportional
to 1/ε (see Section 2.2 for the exact bounds). An appropriate choice of ε appears
to be crucial for the effectiveness of our approach: smaller values of ε guarantee more accurate results (more precise counters and possibly fewer false positives), but imply a larger memory footprint. The high skewness of the context frequency distribution guarantees the existence of very convenient trade-offs between accuracy and memory usage, as we will see in Chapter 4.
The space required by the MCCT dominates the space required by M, since
some of the ancestors of contexts monitored in M may be cold contexts without a
corresponding entry in M. The number of cold ancestors depends on the structure
of the CCT corresponding to the particular execution of a computer program
and on the properties of the stream of events associated with the execution trace:
for this reason, it cannot be analyzed theoretically. Experimental results (see Chapter 4) show that this amount is negligible with respect to |M|.

Performing updates of the MCCT in a very short time is crucial for the effectiveness of our approach. As discussed in Section 2.3, hashing is inefficient when the number of distinct contexts is large. Our implementation requires constant time for the streaming update operation and hinges upon very simple and cache-efficient data structures. We will discuss these aspects in Chapter 3.
Our approach differs from previously proposed solutions such as, e.g., adaptive
bursting [32], in the capability to automatically adapt to the case where the hot
calling contexts vary over time. This peculiarity is achieved by exploiting the
characteristics of data streaming algorithms for frequent items: these algorithms
are able to remove from the set of monitored elements those contexts that are
losing their popularity over time, while being gradually replaced by contexts
growing more popular. Hence, heavy-hitter queries can be issued at any point in time (online querying), and the algorithms will always be able to return the right set of hot contexts up to that time.
Our solution is orthogonal to previous techniques such as sample-based stack
walking and adaptive bursting (see Section 2.1), which focused on reducing the
overhead introduced by instrumentation without explicitly taking into account
intensive memory usage and its implications (e.g., performance degradation in-
duced by poor access locality). Moreover, as we will see in Chapter 4, our approach enables the profiling of large-scale applications that may be difficult or sometimes impossible5 to analyze with traditional context-sensitive profiling techniques. Finally, it can be extended to support arbitrary performance metrics, exploiting the ability of some streaming algorithms to compute heavy hitters over sets of items with different weights.
2.4 Conclusions
In this chapter we have seen that state-of-the-art techniques for context-sensitive profiling do not explicitly take space-efficiency issues into account, while the Calling Context Trees built for modern applications may be very large and, in several cases, cannot be kept in main memory. Our approach takes advantage of efficient
algorithms from the data streaming computational model in order to focus the
analysis on relevant contexts only, building a subtree of the Calling Context Tree
called HCCT (Hot Calling Context Tree) while discarding on-the-fly contexts that
are likely to be cold. Since our solution is orthogonal to techniques that have
been proposed for reducing instrumentation overhead, it can be easily integrated
with them in order to keep the introduced slowdown under control.
5When the space needed for the exhaustive construction of the CCT exceeds the 3 GB user
space memory limit of the virtual address space on 32-bit Linux machines.
Chapter 3
Implementation
In this chapter we will discuss the implementation of the HCCT construction
algorithm. In particular, we will present two variants, based on Lossy Counting and Space Saving [22] respectively, cast in a common framework into which different streaming algorithms can also be plugged.
We use a first-child, next-sibling representation for trees since it is very space-
efficient and still guarantees that the children of each node can be explored in
time proportional to their number. Each MCCT node also contains a pointer
to its parent (to make the prune operation more efficient - see Listing 2.1), the
routine ID, the call site, and the performance metrics. The parent field is not
required in CCT nodes since no pruning is needed. As routine ID, we use the
routine address. On 32-bit architectures, CCT and MCCT nodes require 20
and 24 bytes, respectively. Moreover, since call sites can be represented with
the 16 LSBs of the address, 2 bytes can be used to store additional information.
Actually, we use this space - without increasing the number of bytes per node - to
encode a Boolean flag that tells if the calling context associated with the node is
currently monitored by the streaming algorithm. According to our experiments,
the average number of scanned children for a node is a small constant around
2-3: hence the first-child, next-sibling representation turns out to be convenient
also for checking whether a (routine ID, call site) pair already appears among
the children of a node. To improve time and space efficiency, we allocate nodes through a custom page-based allocator, which maintains blocks of fixed size. Any additional algorithm-specific field is stored as a trailing field.
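The allocator just mentioned can be sketched as a bump allocator over fixed-size pages. This is a sketch under assumptions: `PAGE_NODES`, `pool_t`, and `pool_alloc` are illustrative names, and deallocation and error handling are omitted.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_NODES 1024  /* nodes per page; illustrative */

typedef struct page { struct page *next; } page_t;

typedef struct {
    page_t *pages;     /* list of allocated pages */
    char   *free_ptr;  /* next free node in the current page */
    size_t  left;      /* nodes left in the current page */
    size_t  node_size; /* fixed node size in bytes */
} pool_t;

/* Fixed-size bump allocation: nodes are carved sequentially from pages,
   so consecutive allocations are contiguous and cache-friendly. */
static void *pool_alloc(pool_t *p) {
    if (p->left == 0) {  /* current page exhausted: grab a new one */
        page_t *pg = malloc(sizeof(page_t) + PAGE_NODES * p->node_size);
        pg->next = p->pages;
        p->pages = pg;
        p->free_ptr = (char *)(pg + 1);
        p->left = PAGE_NODES;
    }
    p->left--;
    void *n = p->free_ptr;
    p->free_ptr += p->node_size;
    return n;
}
```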
3.1 Space Saving (BSS and LSS)
The authors of Space Saving propose [22] a data structure called Stream-Summary that uses an ordered doubly linked list of buckets and a collection of linked lists of counters. All elements with the same counter value are linked together in a list; every node in the list points to the parent bucket in which the count value is stored. Every bucket points to the first element of its linked list of counters, and buckets are kept in a doubly linked list, sorted by their values.
Initially, M empty counters are created and attached to a single parent bucket with value 0. When an element of the stream is processed, the SS lookup routine verifies whether the element is already monitored and returns a pointer to its counter. If the element is not monitored, the SS update routine selects a victim counter for replacement as described in Section 2.2.1. Finally, the increment counter operation links the counter to the proper bucket. In particular, it first checks the neighboring bucket with the next larger value: if its value equals the new value of the element, the element is detached from its current list and inserted as the first element of the neighbor's list; otherwise, a new bucket with the correct value is created and attached to the ordered bucket list, and the element is attached to the new bucket. The old bucket is deleted if its child list is now empty.
In addition to the implementation described above, we devised a more efficient
variant based on a lazy priority queue. We will refer to our implementation
as Lazy Space Saving (LSS) and to the one proposed by the authors of Space
Saving as Bucket Space Saving (BSS). In both implementations, the dictionary
operation needed by the update(x,M) procedure (see Section 2.3.2) is avoided
by maintaining a cursor pointer to the current calling context x and using it to
directly access the monitored flag of the MCCT node associated with x.
Our approach is based on a lazy priority queue that supports two operations, find-min and increment, which return the item with minimum count and increment a counter, respectively. We use an unordered array M of size 1/ε, where each array entry contains a pointer to a node of the MCCT. We also maintain, in a lazy fashion, the value min of the minimum counter and the smallest index min-idx of an array entry that points to a monitored node with counter equal
to min. It is important to notice that more than one monitored node may have
count equal to min.
The increment operation does not change M, since counters are stored di-
rectly inside MCCT nodes. However, min and min-idx may become temporarily
out of date after an increment: this is the reason why we call the approach lazy.
while (M[min-idx].counter != min and min-idx <= |M|) do
    min-idx = min-idx + 1
if (min-idx > |M|) then
    min = minimum counter value in M
    min-idx = smallest index j s.t. M[j].counter = min
return min

Listing 3.1: find-min()
The find-min operation, described in Listing 3.1, restores the invariant property on min and min-idx: it finds the next index in M whose counter equals min. If no such index exists (i.e., min-idx > |M|), it completely rescans M in order to find the new min value and its corresponding min-idx. In practice, several strategies are possible: an elegant implementation that proved to be very efficient is reported in Listing 3.2.
while (queue[++min_idx]->counter > min);
if (min_idx < epsilon) return;

unsigned long i;
min = queue[second_min_idx]->counter;
for (min_idx = 0, i = 0; i < epsilon; ++i)
    if (queue[i]->counter < min) {
        second_min_idx = min_idx;
        min_idx = i;
        min = queue[i]->counter;
    }
queue[epsilon]->counter = min;  /* sentinel */

Listing 3.2: Implementation of find-min() with sentinel and second_min_idx
Here follows an important theoretical result on the performance of LSS:

Lemma 1. After a find-min query, the lazy priority queue correctly returns the minimum counter value in O(1) amortized time.

Proof. Counters are never decremented. Hence, at any time, if a monitored item with counter equal to min exists, it must be found at a position larger than or equal to min-idx. This yields the correctness of the algorithm.

To analyze the running time, let ∆ be the value of min after k find-min and increment operations. Since there are |M| counters ≥ ∆, counters are initialized to 0, and each increment operation adds 1 to the value of a single counter, it must be that k ≥ |M|∆. For each distinct value assumed by min, the array is scanned at most twice. We therefore have at most 2∆ array scans, each of length |M|, so the total cost of find-min operations throughout the whole sequence is upper bounded by 2|M|∆. It follows that the amortized cost is (2|M|∆)/k ≤ 2, hence O(1) amortized time. □
Our highly tuned implementation of LSS is more efficient in practice than the one proposed by the authors in terms of both performance and memory usage, and might be of independent interest. Note that neither of our implementations stores the ε_i field for monitored elements. This field measures the maximum possible estimation error and is useful to check whether there might be false positives among the heavy hitters output by the algorithm [22]. However, since this check does not provide exact answers, the ε_i field is not stored by default1 in order to save space: in fact, 4 bytes per node would be needed to store the maximum possible overestimation error when a node is inserted into M.
3.2 Lossy Counting (LC)
Our implementation of Lossy Counting [20] maintains an unordered linked list
of monitored contexts. Every node conceptually represents an (item, count, ∆) triple, where count represents the exact frequency of the item since it was last inserted and ∆ is the maximum possible underestimation in count.
1This behaviour can be changed by modifying a macro in the source code of the algorithm.
The pruning operation is performed periodically. When the length of the incoming stream becomes a multiple of the bucket width w (i.e., N mod w == 0), the linked list of monitored items is scanned sequentially: an item is removed from the list if count + ∆ ≤ b_current, where b_current is the index of the bucket in use. In other implementations, the ∆ field is not stored and, during each pruning operation, count is decremented by one, removing the node when count reaches zero. However, this approach loses information on the frequencies of monitored items, hence only an unordered list of heavy hitters can be output.
#define MakeParent(p, bit)   ((long)(p) | (bit))
#define GetParent(x)         ((lc_hcct_node_t*)((x->parent) & (~3)))
#define UnsetMonitored(x)    (MakeParent(GetParent(x), 0))
#define IsMonitored(x)       ((x)->parent & 1)

typedef struct lc_hcct_node_s lc_hcct_node_t;
struct lc_hcct_node_s {
    unsigned long   routine_id;
    unsigned long   counter;
    lc_hcct_node_t *first_child;
    lc_hcct_node_t *next_sibling;
    unsigned short  call_site;
    unsigned long   parent;    /* bit stealing field */
    lc_hcct_node_t *lc_next;   /* linked list of counters */
    unsigned long   delta;
};

Listing 3.3: Tree node structure for Lossy Counting
We modified the original formulation of Lossy Counting in order to take advan-
tage of the tree structure and of the presence of cold ancestors. The bit stealing
technique is used to check whether a node in the tree is currently monitored by
the streaming algorithm: when a context is removed from the set of monitored
elements, the corresponding node is nevertheless kept in the tree when it is not a
leaf (i.e., it is an internal node with a hot descendant). For each non-monitored
cold ancestor we keep track of the values for count and ∆ and, when the node is
inserted into M, the old values are restored and count is simply incremented by
one. We also modified the pruning strategy for the query operation: the deletion
rule proposed by Manku and Motwani [20] removes an element from M if count
< (φ− ε)N, while our implementation takes advantage of maintaining deltas by
modifying the deletion condition into count + ∆ < φN. In our experiments these
changes improved the accuracy of the algorithm, reducing the percentage of false
positives by a factor of 2.
3.3 The pintool
The first version of our profiler is based on Intel Pin (see Section 1.3 and [18]).
We developed a common infrastructure called STool in which different tools can
be plugged in: in particular, we implemented a trace recorder and two online
profilers based on Space Saving and Lossy Counting, respectively. A recorded
timestamped execution trace is fundamental to ensure deterministic replay of
the execution of interactive applications: since the behaviour of the application
changes among different executions, comparing the results from different sessions
might lead to misleading conclusions.
A pintool is conceptually divided into two parts: the instrumentation code,
which decides where analysis routines are inserted and which ones, and the
analysis code, that is, the code executed at the insertion points.
Pin’s just in time (JIT) instrumentation occurs immediately before a code
sequence is executed for the first time. This mode of operation is known as trace
instrumentation and lets the pintool inspect and instrument an executable one
trace at a time. Traces usually begin at the target of a taken branch and end
with an unconditional branch, hence including calls and returns. Pin guarantees
that a trace has an unique entrance, but it may contain multiple exits. A trace is
composed of a sequence of Basic Blocks (BBL). A BBL is a single entrace, single
exit sequence of instructions. Pin offers the developer the possibility to insert a
single analysis call for a BBL, instead of one analysis call for every instruction
(instruction instrumentation mode): obviously, reducing the number of analysis
calls makes the instrumentation more efficient. Basic Blocks are picked out at
execution time, so there might be slight differences with respect to the classical
definition of a BBL for a compiler. Pin also offers other instrumentation modes
such as routine instrumentation and image instrumentation. In this thesis we
adopt the trace instrumentation mode in order to construct the [H]CCT in a
reliable way, by tracking all the invocations and returns from methods that occur
during the execution of a program.
STool has been implemented in C++ in a modular structure and provides an
intermediate layer to write profiling tools on top of Pin’s trace instrumentation.
By using a callback mechanism the instrumentation code becomes transparent
for the developer and is separated from the analysis code. Listing 3.4 reports the
activation paradigm of the infrastructure: multi-threading2 and call site identifi-
cation are natively supported, as well as tracing of routine enter/exit events and
optionally also of memory read/write accesses.
typedef struct {
    int     argc;
    char**  argv;
    size_t  activationRecSize;
    size_t  threadRecSize;
    BOOL    callingSite;
    // callbacks for analysis
    STool_TFiniH    appEnd;
    STool_TThreadH  threadStart;
    STool_TThreadH  threadEnd;
    STool_TRtnH     rtnEnter;
    STool_TRtnH     rtnExit;
    STool_TMemH     memRead;
    STool_TMemH     memWrite;
    STool_TBufH     memBuf;
} STool_TSetup;
Listing 3.4: Activation paradigm of STool
For the construction of the [H]CCT the precision in tracing routine invocations
and terminations is crucial. Distinguishing an invocation from a termination is not
trivial [29], since very often the same control transfer instructions are used in both
scenarios.
2 A separate tree is constructed for each thread.
Consider the following example: A transfers the control to B, C, and
D, then the control returns to A skipping the return instructions in B and C. This
scenario is quite common due to the optimizations introduced by compilers, and
tracking only the rtnEnter and rtnExit events is not enough. A stackwalking
strategy based on frame pointers is not practicable, since most applications and
dynamic libraries are compiled without frame pointers for efficiency reasons. An
alternative approach inspects the opcode of a jump instruction to distinguish
between call and return, but in practice several exceptions to this schema are
found. Another relevant problem is related to the use of a proxy code in order
to transfer the control flow without using a call/return schema: this is the case
of SETJMP/LONGJMP in C and of exception handling in Java. To deal with
these issues, we maintain a copy of the stack, called Shadow Stack, handled by
a method capable of consistently reconstructing the call stack.
Pin provides an API to spawn internal threads transparent to the instru-
mented program. This feature is very useful to realize a timer thread: techniques
such as sampling and bursting (see Section 2.1) can be implemented using a
timer to turn off the profiler and speed up the program's execution. In our trace
recorder we implement a timer thread with a 200 µs granularity in order to have
a recorded timestamped execution trace. Trace recording is highly optimized:
we adopt a double-buffering strategy to reduce the frequency and the number of
I/O operations. In particular, when the main buffer becomes full, the secondary
buffer is empty and their roles are swapped, while an internal thread is spawned
to asynchronously write data from the full buffer to disk. Buffers are organized
in pairs (i.e., a pair for each thread) and the pair reserved for the main thread is
an order of magnitude bigger than the others since most of the routine enter/exit
events take place in the main thread.
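The buffer-swapping logic can be sketched as follows. This toy version (names and the tiny buffer size are illustrative) replaces the asynchronous writer thread with a flush counter, keeping only the swap mechanics:

```c
#define BUF_EVENTS 4   // illustrative; real buffers are much larger

typedef struct {
    unsigned long events[2][BUF_EVENTS];
    int active;        // index of the buffer currently being filled
    int used;          // entries stored in the active buffer
    int flushes;       // stands in for the asynchronous disk writes
} event_buffers;

// Append one event; when the active buffer fills up, swap the roles of
// the two buffers and (conceptually) hand the full one to the writer.
static void record_event(event_buffers* b, unsigned long ev) {
    b->events[b->active][b->used++] = ev;
    if (b->used == BUF_EVENTS) {
        b->active = 1 - b->active;  // roles of the buffers are swapped
        b->used = 0;                // start filling the other buffer
        b->flushes++;               // async write of the full buffer
    }
}
```

In the real recorder the flush would be performed by a spawned internal thread, so event recording never blocks on I/O as long as the secondary buffer drains in time.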
3.4 The trace analyzer
Recorded traces are stored in a compact binary format. Every file starts with
a 52-byte preamble containing a magic number (4 bytes), the offsets of the symbol
and library tables (8+8 bytes), and the number of routine enter/exit and memory
read/write events (8+8+8+8 bytes). The header is followed by a stream of events of
variable length. The length of each entry depends on the type of the event: 7
bytes are needed to represent a routine enter event (1 byte to encode the event
type, 4 bytes for the routine address, and 2 bytes for the call site) while 1 byte
is enough to represent routine exit and timer tick events. Memory access events
are not recorded by our profiler since we are focusing on frequency counts as
performance metrics.
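The variable-length encoding of a routine enter event can be sketched as a round trip, following the layout described above (1 byte for the event type, 4 bytes for the routine address, 2 bytes for the call site); the function names and type-tag values are illustrative:

```c
#include <stdint.h>
#include <string.h>

enum { EV_RTN_ENTER = 1, EV_RTN_EXIT = 2, EV_TIMER_TICK = 3 };

// Pack a routine-enter event into 7 bytes:
// 1-byte type + 4-byte routine address + 2-byte call site.
static size_t encode_enter(uint8_t out[7], uint32_t rtn, uint16_t site) {
    out[0] = EV_RTN_ENTER;
    memcpy(out + 1, &rtn, 4);
    memcpy(out + 5, &site, 2);
    return 7;
}

// Unpack the routine address and call site from a 7-byte entry.
static void decode_enter(const uint8_t in[7],
                         uint32_t* rtn, uint16_t* site) {
    memcpy(rtn, in + 1, 4);
    memcpy(site, in + 5, 2);
}
```

Routine exit and timer tick events would be encoded as the single type byte, which is what makes the stream variable-length.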
We designed and implemented in C a sophisticated trace analyzer in order to
construct the CCT and the (φ, ε)-HCCT in a reproducible and reliable manner.
The main component of the trace analyzer is the driver module, which reads a
recorded trace from disk and produces a stream of events as in a live execution
of a program. This module adopts an internal buffering strategy to deal with
large trace files3 and provides a mechanism to measure the time spent in analysis
subtracting the fraction spent in I/O operations.
The burst module can be configured to simulate techniques such as static
and adaptive bursting (see Section 2.1) and relies on the timer ticks recorded in
the trace. This module also provides a Shadow Stack for stack walking in order
to accurately reconstruct the context and realign the tree when a burst starts.
Finally, it provides an optional support for the adaptive bursting with weight
compensation technique described in [32].
The metrics module plays a key role in the evaluation of accuracy of perfor-
mance metrics: it compares the (φ, ε)-HCCT with the exact CCT and writes the
log to disk. The cct module constructs the CCT, while the modules bss-hcct,
lss-hcct, and lc-hcct implement the streaming algorithms BSS, LSS, and LC,
respectively, for the construction of the (φ, ε)-HCCT. Logs produced by the trace
analyzer use a simple column-based representation and are well-suited for being
processed by graphical visualization tools such as gnuplot.
3.5 The gcc-based profiler
Our online profiling tools based on Pin suffered from an important slowdown while
analyzing several applications. For this reason, we have developed a more effi-
cient version based on gcc, the GNU C compiler [27].
3 Very often larger than 4 GB, hence a special descriptor must be used.
With respect to online instrumentation frameworks such as Pin and Valgrind [24], this approach has
essentially one drawback: the program must be recompiled in order to be analyzed.
However, this initial cost is compensated by more efficient analysis and instrumentation
code: the instrumented program is directly executed by the CPU,
without being translated into an intermediate representation for inserting profiling
code and manipulating the execution.
In order to interfere as little as possible with the instrumented program, we
implemented our gcc-based profiler as a dynamic library. This approach has
an important advantage: analysis code can be modified without the need to
recompile the program. Routine enter and exit events are intercepted
using two special user-defined methods executed just after function entry and
before function exit, respectively. These two methods are automatically inserted
by the compiler into the produced executable when the -finstrument-functions
option is specified. This option was originally implemented by Cygnus Solutions
and has recently been introduced also in other non-C/C++ compilers of the gcc
suite such as gfortran. Listing 3.5 shows the function prototypes for the two
instrumentation methods: the only information provided to the developer consists of
two parameters containing the address of the start of the current function and
the call site.
void __cyg_profile_func_enter(void* this_fn, void* call_site);
void __cyg_profile_func_exit(void* this_fn, void* call_site);
Listing 3.5: Function prototypes for gcc-based instrumentation
This compiler-assisted instrumentation mechanism is flexible: a function may
be given the attribute no_instrument_function, in which case instrumentation will
not be done. The list of functions not to be instrumented can also be specified
through the -finstrument-functions-exclude-file-list option of gcc. This can be
used, for example, for high-priority interrupt routines and any functions from
which the profiling functions cannot safely be called (perhaps signal handlers, if
the profiling routines generate output or allocate memory). Instrumentation is
also done for functions expanded inline in other functions: the profiling calls will
indicate where, conceptually, the inline function is entered and exited.
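A minimal pair of hooks might look as follows; this sketch only maintains a shadow call depth (a stand-in for the real profiler's work), and marks the hooks with no_instrument_function so that, when the program is built with -finstrument-functions, the hooks do not instrument themselves:

```c
static long shadow_depth = 0;

// Called by compiler-inserted code just after every function entry
// (here it can also be invoked manually for illustration).
void __attribute__((no_instrument_function))
__cyg_profile_func_enter(void* this_fn, void* call_site) {
    (void)this_fn; (void)call_site;
    shadow_depth++;   // a real profiler would push onto the shadow stack
}

// Called just before every function exit.
void __attribute__((no_instrument_function))
__cyg_profile_func_exit(void* this_fn, void* call_site) {
    (void)this_fn; (void)call_site;
    shadow_depth--;   // ...and pop the shadow stack here
}
```

The this_fn and call_site arguments are exactly the two pointers described above, so a real implementation can map them to routine and call-site identifiers.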
Since the only functionality provided by this instrumentation framework is
intercepting routine enter and exit events, the developer has to deal manually
with aspects such as program launch/termination and multithreading. In GNU
C/C++, it is possible to declare special attributes constructor and destructor
for functions. The constructor attribute causes the function to be called
automatically before execution enters main(). Similarly, the destructor attribute
causes the function to be called automatically after main() has completed or
exit() has been called. Functions with these attributes are useful for initializing
data that will be used implicitly during the execution of the program (e.g., data
structures for streaming algorithms and MCCT construction).
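A minimal sketch of this mechanism (the flag and the function names are illustrative):

```c
static int profiler_ready = 0;

// Runs automatically before main(): in the profiler this is where the
// tree root and the streaming algorithm's structures would be set up.
static void __attribute__((constructor)) profiler_init(void) {
    profiler_ready = 1;
}

// Runs automatically after main() returns or exit() is called: in the
// profiler this is where the collected tree would be dumped to disk.
static void __attribute__((destructor)) profiler_fini(void) {
    profiler_ready = 0;
}
```

Because the library is loaded before the application starts, the constructor fires even though the application itself never calls into the profiler explicitly.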
Multithreading support requires special attention due to the need for hav-
ing distinct data structures (i.e., trees and sets of monitored elements) for each
thread. The POSIX thread API provides the pthread_getspecific() and
pthread_setspecific() functions to access per-thread data by using a key. Several compilers, instead,
offer non-standard C extensions to support Thread Local Storage (TLS) of ar-
bitrary size. TLS provides significantly better performance than pthread-based
storage [25] and can be used in gcc via the __thread storage class specifier:
this means that, in a multi-threaded application, a unique instance of the variable
is created for each thread that uses it, and destroyed when the thread terminates.
We use __thread variables to store pointers to data structures specific to a thread.
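A minimal sketch of per-thread state with __thread (the worker and counter names are illustrative): every thread that touches thread_events operates on its own instance, so the main thread's copy is never affected by a worker.

```c
#include <pthread.h>
#include <stdint.h>

// One independent instance of this counter exists per thread.
static __thread unsigned long thread_events = 0;

static void* worker(void* arg) {
    unsigned long n = (unsigned long)(uintptr_t)arg;
    for (unsigned long i = 0; i < n; i++)
        thread_events++;           // touches only this thread's copy
    return (void*)(uintptr_t)thread_events;
}

// Spawn a worker, wait for it, and return its private event count.
static unsigned long run_worker(unsigned long n) {
    pthread_t t;
    void* ret = NULL;
    pthread_create(&t, NULL, worker, (void*)(uintptr_t)n);
    pthread_join(t, &ret);
    return (unsigned long)(uintptr_t)ret;
}
```

In the profiler the per-thread variable would hold a pointer to the thread's tree and monitored set rather than a plain counter.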
Thread creation and termination events have to be intercepted in order to
respectively create data structures needed for profiling and save the collected
information to disk. The GNU linker offers the --wrap symbol option to wrap
a particular function: any undefined reference to symbol will be resolved to
__wrap_symbol, while any undefined reference to __real_symbol will be resolved
to symbol (useful to call the original function). Listing 3.6 shows the code used
in our profiler to wrap both pthread_create and pthread_exit.
// handles also thread termination made without pthread_exit
void* __attribute__((no_instrument_function))
aux_pthread_create(void* arg) {
    hcct_init();

    // Retrieve original routine address and argument
    void* orig_arg = ((void**)arg)[1];
    void* (*start_routine)(void*) = ((void**)arg)[0];
    free(arg);

    // Run actual application thread's init routine
    void* ret = (*start_routine)(orig_arg);

    hcct_dump(hcct_get_root(), 1);

    return ret;
}

int __attribute__((no_instrument_function))
__wrap_pthread_create(pthread_t* thread, pthread_attr_t* attr,
                      void* (*start_routine)(void*), void* arg) {
    void** params = malloc(2 * sizeof(void*));
    params[0] = start_routine;
    params[1] = arg;
    return __real_pthread_create(thread, attr,
                                 aux_pthread_create, params);
}

void __attribute__((no_instrument_function))
__wrap_pthread_exit(void* value_ptr) {
    hcct_dump(hcct_get_root(), 1);
    __real_pthread_exit(value_ptr);
}

Listing 3.6: Wrapping of pthread_create and pthread_exit
Sampling and bursting (see Section 2.1) are implemented using a timer thread
in order to turn off analysis code when needed. From kernel version 2.6.21 onwards4,
high-resolution timers (HRT) are available under Linux. The timer thread
uses the clock_nanosleep() system call: like nanosleep(), it allows the caller
to sleep for an interval specified with nanosecond precision. The main difference
between the two system calls is that clock_nanosleep() allows the caller to select
the clock against which the sleep interval is to be measured. In particular,
we use the CLOCK_PROCESS_CPUTIME_ID clock, which measures the CPU time
consumed by all threads in the process. This system call first appeared in Linux
2.6, while glibc support has been available since version 2.1; POSIX.1 specifies that
clock_nanosleep() has no effect on signal dispositions or the signal mask.
The instrumentation framework has been realized in a modular structure in order
to facilitate the development of profiling and analysis tools. As in STool (see Sec-
tion 3.3), a callback mechanism is used to separate instrumentation code from
analysis code. Listing 3.7 reports the simple activation paradigm of the infrastruc-
ture. Parameters such as φ, ε, sampling interval, and burst length can be specified
as environment variables and retrieved through the hcct getenv method. Each
profiling tool is implemented as a shared library. The tool in use can be changed
by simply modifying the symbolic link pointing to the default analysis library.
int   hcct_getenv();
int   hcct_init();
void  hcct_enter(UINT32 routine_id, UINT16 call_site);
void  hcct_exit();
void* hcct_get_root();
void  hcct_dump();
#if BURSTING==1
void  hcct_align();
#endif
Listing 3.7: Activation paradigm of the gcc-based profiling infrastructure
At the time of writing, we have implemented a trace recorder with the same output
format as the Pin-based version (see Section 3.4) and an online profiling tool based
on LSS.
4 Man page: http://www.kernel.org/doc/man-pages/online/pages/man7/time.7.html
We have focused on LSS since, as we will see in Chapter 4, it performs
faster than LC. The constructed (φ, ε)-HCCT is stored in a file using a syntax
that resembles the DIMACS format for graphs5, reported in Listing 3.8.
c <tool> [<epsilon> <phi>] [<sampling interval> <burst length> <burst enter events>]
c <command> <process/thread id> <working directory>
v <node id> <parent id> <counter> <routine id> <call site>
[...]
v <node id> <parent id> <counter> <routine id> <call site>
p <num nodes>
Listing 3.8: Syntax of output format for gcc-based online profiler
We have also implemented a simple analysis tool to reconstruct the tree in
main memory starting from a log file, in order to compute arbitrary statistics.
This tool is a starting point for the implementation of more complex analysis
methods, for instance to compare the characteristics of two HCCTs. Routine
addresses are stored in hexadecimal format, in order to be able to resolve them
using the addr2line command. Given an address and an executable, this pro-
gram uses the debugging information in the executable6 to figure out which file
name and line number are associated with a given address.
3.6 Conclusions
In this chapter we have discussed the details of the implementation of Space Saving
and Lossy Counting, both of which have been optimized to take advantage
of the tree-based data structures. We have also presented the three main software
components developed in this thesis: a pintool, useful to analyze any application
without the need for recompilation, and the related STool infrastructure; a trace
analyzer, which can be used to perform offline analysis and accuracy evaluation; and
a gcc-based profiler, designed to achieve better performance in the analysis of
C/C++ Linux applications by using compiler-assisted instrumentation.
5 http://www.dis.uniroma1.it/challenge9/format.shtml#graph
6 For an accurate address translation, the program should be compiled with the -g option.
Chapter 4
Experimental evaluation
In this chapter we present an extensive experimental study of our data-streaming
based profiling mechanism. We analyze the performance and the accuracy of the
produced (φ, ε)-HCCT with respect to several metrics and using many different
parameter settings. Besides the approach to the construction of the tree based on
exhaustive instrumentation, we integrate our implementations with static burst-
ing and adaptive bursting with re-enablement techniques [32] to reduce running
time overhead. The results of the experimental analysis not only confirm but
reinforce the theoretical predictions: the (φ, ε)-HCCT accurately represents the hot
portions of the complete CCT using only an extremely small percentage of the
space required by the CCT. The use of bursting techniques does not affect accu-
racy in a substantial way, and the global slowdown is often comparable to the one
introduced by gprof while providing a much deeper context-sensitive analysis.
4.1 Methodology
4.1.1 Benchmarks
We use two distinct sets of benchmarks to evaluate the accuracy of the (φ, ε)-
HCCT and the performance of our profilers, respectively.
Interactive applications. Accuracy tests are performed on a variety of well-
known large-scale interactive Linux applications. Since the behaviour of an inter-
active application changes among different executions, comparing the accuracy
of results from different sessions might lead to incorrect conclusions. For this rea-
son, we use the tracer based on Pin described in Section 3.4 to record execution
traces and then analyze them to evaluate accuracy metrics. The complete list of
benchmarks and related essential information (e.g. the number of distinct call-
ing contexts) are reported in Table 2.1, while the distribution of calling context
frequencies is shown in Figure 2.1 - we introduced these aspects in Section 2.3.
Benchmarks include graphics programs (inkscape, gimp and gwenview), audio
players/editors (amarok and audacity), an archiver (ark), an Internet browser
(firefox), an HTML editor (quanta), and the Open Office suite for word pro-
cessing (oowriter), spreadsheets (oocalc), and drawing (ooimpress). After the
startup, in each session we interactively used the application for approximately
ten to twenty minutes, in order to simulate a typical usage session: e.g., we
created a 420×300 pixel image with gimp applying a variety of graphic filters
and color effects, we reproduced fifty images of ∼4.5MB each in slideshow mode
with gwenview, we played a ten-minute OGG audio file with amarok, and we
wrote two pages of formatted text, including tables, with oowriter. Statistical
information about test sets reported in Table 2.1 shows that even short sessions
of a few minutes result in CCTs consisting of tens of millions of calling contexts,
whereas the call graph has only a few thousand nodes.
SPEC CPU2006 suite. Performance metrics are evaluated on some of the
benchmarks of the SPEC CPU2006 suite in order to measure the efficiency of our
gcc-based profiler with CPU-intensive applications.
We have chosen a subset of the SPEC CPU2006 benchmarks according to
three exclusion criteria: implementation using Fortran, CCT size too small (e.g.,
bzip2), and incompatibility with gcc-based instrumentation (e.g., gcc and perl
crashed after a few minutes of execution). Six of the selected benchmarks belong to
the integer component of the suite: astar is a pathfinding library for 2D maps
including the well-known A* algorithm; gobmk is a complex artificial intelligence
game; h264ref is a video compression application; mcf involves combinatorial
optimization in vehicle scheduling; omnetpp uses a discrete event simulator to
model a large Ethernet campus network; sjeng is a highly-ranked chess program
that also plays several chess variants. The two remaining selected benchmarks
belong instead to the floating point component of the suite: povray renders a
1280x1024 pixel anti-aliased image of a chessboard with all the pieces in the
starting position using the ray-tracing technique; soplex solves a linear program
using a simplex algorithm and sparse linear algebra.

Application   Suite component   Language       |CCT|
astar         CINT2006          C++              844
gobmk         CINT2006          C          too large
h264ref       CINT2006          C              3 903
mcf           CINT2006          C             30 507
omnetpp       CINT2006          C++          162 742
povray        CFP2006           C++           71 685
sjeng         CINT2006          C          too large
soplex        CFP2006           C++            7 519

Table 4.1: Benchmarks selected from SPEC CPU2006 for performance evaluation.
The size of the CCTs associated with the benchmarks is highly variable. For
two benchmarks, gobmk and sjeng, the 3 GB user space memory limit is reached
in a few minutes since the CCTs are extremely large. On the other hand, for
other benchmarks such as astar and h264ref, the trees are composed of a few
thousand nodes. As we will see in Section 4.2.4, there is no linear correlation
between the size of the tree and the overhead introduced by instrumentation. We
are not surprised by the variability in the size of the CCTs: these benchmarks are
strictly optimized for performance assessment and adopt different internal logic
(consider, for instance, the impact on the size of the CCT of a massive usage
of recursion in order to solve a problem), while large-scale Linux applications
described in the previous paragraph are based on a GUI and use well-known
heavy libraries such as GTK and QT. For each benchmark we will analyze the
impact of both instrumentation code only and instrumentation plus analysis code,
to evaluate also the effectiveness of gcc as an instrumentation framework.
Platform. Our experiments were performed on a 2.53GHz Intel Core2 Duo
T9400 with 128KB of L1 data cache, 6MB of L2 cache, and 4 GB of main memory
DDR3 1066, running Ubuntu 8.04, Linux Kernel 2.6.24, 32 bit.
4.1.2 Metrics
We test the accuracy of the (φ, ε)-HCCT produced by our profilers according to
a variety of metrics:
1. Degree of overlap, considered in [2, 3, 32], is used to measure the completeness
of the (φ, ε)-HCCT with respect to the full CCT and is defined as follows:

   overlap((φ, ε)-HCCT, CCT) = (1/N) · Σ_{arcs e ∈ (φ,ε)-HCCT} w(e)

   where N is the total number of routine activations (corresponding to the
sum of the counters associated to CCT nodes) and w(e) is the true frequency
of the target node of arc e in the CCT.
2. Hot edge coverage, as defined in [32], measures the percentage of hot edges
of the CCT that are covered by the (φ, ε)-HCCT, using an edge-weight
threshold τ ∈ [0, 1] to determine hotness, as follows:

   cover((φ, ε)-HCCT, CCT, τ) = |{e ∈ (φ,ε)-HCCT : w(e) ≥ τH1}| / |{e ∈ CCT : w(e) ≥ τH2}|

   where H1 and H2 are the weights of the hottest arcs in the (φ, ε)-HCCT
and in the CCT, respectively.
3. Maximum frequency of uncovered calling contexts, where a context is uncovered
if it is not included in the (φ, ε)-HCCT:

   maxUncov((φ, ε)-HCCT, CCT) = ( max_{e ∈ CCT \ (φ,ε)-HCCT} w(e) / H1 ) × 100

   The average frequency of uncovered contexts is defined similarly.
4. Number of false positives, i.e., |A \ H|: the smaller this number, the bet-
ter the (φ, ε)-HCCT approximates the exact HCCT obtained from CCT
pruning.
5. Counter accuracy, i.e., maximum error in the frequency counters of (φ, ε)-HCCT
nodes with respect to their true value in the complete CCT:

   maxError((φ, ε)-HCCT) = max_{e ∈ (φ,ε)-HCCT} ( |w(e) − w̃(e)| / w(e) ) × 100

   where w(e) and w̃(e) are the true and the estimated frequency of context
e, respectively. The average counter error is defined similarly.
An accurate solution should maximize degree of overlap and hot edge coverage,
and minimize the remaining metrics. With respect to performance metrics, in-
stead, we compare execution time of the SPEC CPU2006 benchmarks.
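The first and last of these metrics can be computed directly from the true and estimated counters; the following is a minimal, array-based sketch (the function names are illustrative):

```c
// Degree of overlap (%): fraction of the N routine activations that is
// covered by the n contexts kept in the (phi,eps)-HCCT, given their
// true frequencies w[0..n-1].
static double overlap_pct(const unsigned long* w, int n, unsigned long N) {
    unsigned long sum = 0;
    for (int i = 0; i < n; i++)
        sum += w[i];
    return 100.0 * (double)sum / (double)N;
}

// Maximum relative counter error (%): worst-case deviation of the
// estimated frequencies w_est from the true frequencies w.
static double max_error_pct(const unsigned long* w,
                            const unsigned long* w_est, int n) {
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        double diff = (double)w[i] - (double)w_est[i];
        if (diff < 0) diff = -diff;
        double err = 100.0 * diff / (double)w[i];
        if (err > worst) worst = err;
    }
    return worst;
}
```

The metrics module performs these comparisons against the exact CCT produced by the trace analyzer, so the true frequencies are always available as a reference.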
4.1.3 Parameter tuning
According to the theoretical analysis, the choice of φ and ε parameters affects
both the space used by the algorithms and the accuracy of the solution. In our
study we considered many different choices of φ and ε across rather heterogeneous
sets of execution traces, always obtaining similar results that we summarize in
this section.
A rule of thumb validated by previous experimental studies [19] suggests that
it is sufficient to choose them such that φ/ε = 10 in order to obtain a small
number of false positives and high counter accuracy. We found this choice overly
pessimistic in our scenario: the extremely skewed distribution of calling context
frequencies makes it possible to use much smaller values of this ratio (i.e., larger
values of ε for a fixed φ) without sacrificing accuracy. Since space usage is roughly
proportional to 1/ε, this yields substantial space savings. Unless
otherwise stated, in all our experiments we used φ/ε = 5.
Since the length of the stream N is unknown a priori during the execution,
an appropriate choice of the hotness threshold φ may appear to be difficult.
In fact, too small values might result in returning too many contexts without
being able to discriminate which of them are actually hot, also resulting in a
waste of space. On the other hand, too large values might result in returning
too few contexts, failing to provide any relevant information about
hot spots in the execution of the program. Our experiments suggest that an
appropriate choice of φ is mostly independent of the stream length and of the
specific benchmark: as shown in Table 4.2, different benchmarks1 have a number
of HCCT nodes of the same order of magnitude when using the same φ threshold.
This fact is not surprising since it is a consequence of the skewness of context
frequency distribution, and greatly simplifies the choice of φ in practice. Unless
otherwise stated, in our experiments we used φ = 10−4, which corresponds to
mining roughly the hottest 1 000 calling contexts.
Table 4.2: Typical thresholds.
Benchmark    HCCT nodes     HCCT nodes     HCCT nodes
             (φ = 10^-3)    (φ = 10^-5)    (φ = 10^-7)
audacity            112          9 181        233 362
dolphin              97         14 563        978 544
gimp                 96         15 330        963 708
inkscape             80         16 713        830 191
oocalc              136         13 414      1 339 752
quanta               94         13 881        812 098
4.2 Experimental results
This section is organized as follows. Results related to accuracy are divided into
two parts: in Section 4.2.1 we discuss degree of overlap and hot edge coverage
for the exact HCCT, while in Section 4.2.2 we report results about the remaining
metrics for the (φ, ε)-HCCT. This choice aims to measure the quality of our
approach from a theoretical point of view: degree of overlap and hot edge coverage
can be evaluated for the exact HCCT without involving a particular streaming
algorithm. Moreover, since the HCCT is a subtree of the (φ, ε)-HCCT computed
by streaming algorithms, the results described for the exact HCCT apply to the
(φ, ε)-HCCT as well. In Section 4.2.3 we discuss the cost of our approach in terms
of memory usage and time spent for analysis. Finally, in Section 4.2.4 we analyze
the efficiency of our gcc-based profiler using the SPEC CPU2006 test suite.
1 Results for omitted benchmarks are similar.
4.2.1 Accuracy: exact HCCT
Degree of overlap. Not surprisingly, the variation of the degree of overlap
with the full CCT shown in Figure 4.1 for decreasing values of φ is coherent with
the shape of cumulative distributions of calling context frequencies discussed in
previous sections. This figure illustrates the relationship between degree of over-
lap and hotness threshold, plotting the largest value φ̄ of the hotness threshold
for which a given degree of overlap d can be achieved: using any φ ≤ φ̄, the
achieved degree of overlap will be larger than or equal to d.
[Plot "Largest value of φ that guarantees a given degree of overlap": x-axis degree
of overlap (%), y-axis max φ (log scale, 10^-9 to 1); one curve per benchmark:
amarok, audacity, bluefish, firefox, gedit, gwenview, oocalc, pidgin, quanta, vlc.]
Figure 4.1: Relation between φ and degree of overlap between the exact HCCT
and the full CCT on a representative subset of benchmarks.
Since the HCCT is a subtree of the (φ, ε)-HCCT, the degree of overlap eval-
uated for the HCCT is a lower bound to the corresponding value in the (φ, ε)-
HCCT. The value of φ̄ decreases as d increases: if we want to achieve a larger
degree of overlap, we must include in the HCCT a larger number of nodes, which
corresponds to choosing a smaller hotness threshold. However, when computing
the (φ, ε)-HCCT, the value of φ indirectly affects the space used by the algorithm
(the φ/ε ratio is fixed) and in practice cannot be too small (see Section 4.1.3). By analyzing hot
edge coverage and uncovered frequency for the HCCT and the (φ, ε)-HCCT, re-
spectively, we will show that even when the degree of overlap is not particularly
large, the HCCT and the (φ, ε)-HCCT are nevertheless good approximations of
the full CCT: values of φ ∈ [10−4, 10−6] represent on all our benchmarks a good
tradeoff between accuracy and space reduction.
Hot edge coverage. While degree of overlap measures the similarity of two
trees in their entirety, the hot edge coverage metric proposed by Zhuang et al. [32]
focuses on the similarity of the edges in each tree having relatively high edge
weights. In particular, since in our approach edge weights are represented by
the count field of the target node, hot edge coverage measures the similarity of
the hottest calling contexts. We compare the hot edge coverage of the HCCT
with respect to the full CCT to evaluate the percentage of covered hot contexts.
[Plot; y-axis (log scale, 10−6 to 1): min τ; x-axis (log scale, 10−9 to 10−3): φ; one curve per benchmark: firefox, oocalc, amarok, ark, inkscape. Plot title: minimum τ for which hot edge coverage is 100%.]
Figure 4.2: Hot edge coverage of the exact HCCT: relation between φ and edge-
weight threshold τ .
The analysis of hot edge coverage is reported in Figure 4.2. In this figure we
plot, as a function of φ, the smallest value τ̄ of the edge-weight threshold τ for which
hot edge coverage of the HCCT is 100%. Results are shown only for some of the
less skewed, and thus more difficult, benchmarks. Since the HCCT is a subtree
of the (φ, ε)-HCCT, the hot edge coverage evaluated for the HCCT is a lower
bound to the corresponding value in the (φ, ε)-HCCT.
Note that τ̄ is directly proportional to φ and roughly one order of magnitude
larger. This is because the HCCT contains all contexts with frequency
≥ ⌊φN⌋, and always contains the hottest context (which implies H1 = H2 in the
definition of hot edge coverage in Section 4.1.2). Hence, the hot edge coverage
is 100% as long as τH1 ≥ ⌊φN⌋, which yields τ̄ = ⌊φN⌋/H1. The experiment
shows that 100% hot edge coverage is always obtained for τ ≥ 0.01: even for
values of φ for which the degree of overlap may be small, the HCCT represents
the hot portions of the full CCT remarkably well according to hot edge coverage.
As a frame of comparison, notice that the τ thresholds used in [32] to analyze
hot edge coverage are always larger than 0.05, and for those values we always
guarantee total coverage. Clearly, the higher the threshold, the fewer hot edges
included in the range. For instance, a threshold of 0.95 will use all nodes whose
count values are within 5% of the hottest CCT node in computing the coverage.
Note that this is different, and possibly more challenging, than simply choosing
the top 5% hottest contexts, or nodes whose count values represent 5% of the
sum of the frequency counts of the CCT nodes.
4.2.2 Accuracy: (φ, ε)-HCCT
Uncovered frequency. Analysis of maximum and average frequency of calling
contexts not included in the (φ, ε)-HCCT provides important information on
the effectiveness of our approach, especially when the degree of overlap is small.
Consider, as an example, φ = 10−4, which yields degree of overlap as small as
10% on two of the less skewed benchmarks (oocalc and firefox).
In this apparently bad scenario, Figure 4.3 analyzes how the remaining 90%
of the total CCT weight is distributed among uncovered contexts: the average
frequency of uncovered contexts is about 0.01% of the frequency of the hottest
context, and the maximum frequency is typically less than 10%. This suggests
that uncovered contexts are likely to be uninteresting with respect to the hottest
[Histogram; y-axis (log scale, 10−5 to 100): uncovered frequency (% of the hottest context); x-axis: benchmarks (amarok, ark, audacity, bluefish, dolphin, firefox, gedit, ghex2, gimp, sudoku, gwenview, inkscape, oocalc, ooimpress, oowriter, pidgin, quanta, vlc); bars: LSS avg, LSS max. Plot title: avg/max frequency of contexts not included in (φ, ε)-HCCT.]
Figure 4.3: Maximum and average frequency of calling contexts not included in
the (φ, ε)-HCCT generated by LSS. Results for LC are almost identical and are
not shown in the chart.
contexts, and that the distribution of calling context frequencies obeys a “long-
tail, heavy-tail” phenomenon: the CCT contains a huge number of calling con-
texts that rarely get executed, but overall these low-frequency contexts account
for a significant fraction of the total CCT weight. Discarding on-the-fly informa-
tion about these contexts is essential to reduce space usage, and the efficiency of
data streaming algorithms plays a fundamental role in the process.
In our experiments, Space Saving and Lossy Counting showed an almost iden-
tical behaviour from the point of view of uncovered frequencies. Note that in this
section we do not report accuracy results for BSS, since it computes the same
(φ, ε)-HCCT as LSS.
Partition of the nodes. Figure 4.4 shows the percentages of cold nodes, true
hots, and false positives in the (φ, ε)-HCCT using φ = 10−4 and ε = φ/5. Both
algorithms include in the tree only very few false positives (less than 10% of the
total number of tree nodes in the worst case), and Lazy Space Saving consistently
proved to be better than Lossy Counting.
[Stacked histogram; y-axis: cold nodes / hot nodes / false positives (%), 0 to 100; x-axis: benchmarks (amarok, ark, audacity, bluefish, dolphin, firefox, gedit, ghex2, gimp, sudoku, gwenview, inkscape, oocalc, ooimpress, oowriter, pidgin, quanta, vlc). Plot title: classification of (φ, ε)-HCCT nodes.]
Figure 4.4: Partition of (φ, ε)-HCCT nodes into: cold (bottom bar), hot (mid-
dle bar), and false positives (top bar) for LC and LSS. The two bars for each
benchmark are related to LC (left) and LSS (right), respectively.
These results are not surprising: previous experimental studies [19] show that
these algorithms are very accurate in terms of recall (i.e., the fraction of relevant
instances that are retrieved by the algorithm). The percentage of false positives
is influenced by the choice of the φ/ε ratio: as the maximum possible estimation
error decreases, Lossy Counting and Space Saving identify heavy hitters more
accurately. Figure 4.5 shows how the number of false positives considerably
decreases as we choose smaller values of ε for a fixed φ, getting very close to 0%
on most benchmarks when the ratio φ/ε is larger than 10.
Since the average node degree is a small constant around 2-3, cold HCCT
nodes are typically a fraction of the total number of nodes. In our experiments
we observed that this fraction strongly depends on the hotness threshold φ, and
in particular decreases with φ: cold nodes that have a hot descendant are indeed
more likely to become hot when φ is smaller.
[Plot; y-axis: false positives (% of (φ, ε)-HCCT nodes), 0 to 18; x-axis: φ/ε, 2 to 16; curves: LSS and LC on firefox, oocalc, inkscape, ark. Plot title: number of false positives as a function of ε.]
Figure 4.5: False positives in the (φ, ε)-HCCT as a function of ε on a represen-
tative subset of benchmarks. The value of φ is fixed to 10−4.
Accuracy of counters. A very interesting feature of our approach is that
counter estimates are very close to the true frequencies, as shown in Figure 4.6.
Lossy Counting on average achieves almost exact counters for the hot contexts
(0.057% average difference from the true frequency), and even the maximum
error in the worst case is smaller than 8%. When a heavy hitter is inserted in the
set of monitored items, it is unlikely that at the end of the current bucket it is
removed from M: for this reason very few hot contexts are underestimated in a
non-negligible way. Lazy Space Saving instead computes less accurate counters.
This is due to the fact that many heavy hitters may not appear in the initial
part of the stream: when they are inserted in the set of monitored items for the
first time, the overestimation is often non-negligible. We implemented a variant
of LSS that uses the εi field (see Section 2.2.1) to improve the accuracy of the
computed frequencies by subtracting εi from count for each node output at the
end of the query operation. With this trick, the average error becomes comparable to that of LC.
[Histogram; y-axis: benchmarks (amarok, ark, audacity, bluefish, dolphin, firefox, gedit, ghex2, gimp, sudoku, gwenview, inkscape, oocalc, ooimpress, oowriter, pidgin, quanta, vlc; two panels, avg and max); x-axis: avg/max counter error among hot elements (% of the true frequency), 0 to 18; bars: LSS avg error, LC avg error, LSS max error, LC max error.]
Figure 4.6: Accuracy of context frequencies computed by LSS and LC, measured
on hot contexts included in the (φ, ε)-HCCT. Lossy Counting on average achieves
almost exact counters, performing significantly better than Space Saving.
Integration with bursting. The integration with static and adaptive bursting
techniques [32] does not substantially affect the accuracy of the (φ, ε)-HCCT, with
the exception of adaptive bursting with no re-enablement. This is not surprising,
since without re-enablement information about calling contexts observed in a
previous burst is ignored: hence, the corresponding nodes are more likely to
become cold as the number of routine enter events increases. When re-enablement
is active, the RR parameter specifies the probability of updating the tree with
profiling information for contexts observed during a previous burst rather than
discarding it. A higher RR causes more bursts to be re-enabled, adding higher
overhead, whereas a smaller RR misses more bursts and lowers the quality of
the constructed tree.
[Histogram; y-axis: hot edge coverage (%), 0 to 100; x-axis: static, RR 20%, RR 10%, RR 5%, adaptive; bars: LSS, LC. Plot title: comparison of hot edge coverage results using static and adaptive bursting.]
Figure 4.7: Hot edge coverage analysis on the vlc benchmark for LSS/LC com-
bined with static and adaptive bursting for τ = 0.025, with φ = 4 · 10−5,
ε = φ/2, sampling interval 2 msec, and burst length 0.2 msec. RR stands for
re-enablement, adaptive for adaptive bursting without re-enablement.
We take as an example the vlc benchmark and we focus on hot edge coverage
for τ = 0.025. The HCCT computed without bursting has coverage 100%, which
is also achieved using static bursting and adaptive bursting with re-enablement
ratio 20%, as shown in Figure 4.7. The hot edge coverage decreases as sampling
becomes more aggressive, but in this experiment it is always larger than 78%
and 91% for LSS and LC, respectively.
4.2.3 Memory usage and cost of analysis
Memory usage. Our trace analyzer (see Section 3.4) keeps track of the peak
memory usage of the analysis tools. The peak memory usage is proportional to
the number of MCCT nodes, which is typically larger than the actual number of
hot contexts obtained after pruning. In spite of this, a considerable amount of
space is saved by all our algorithms.
Figure 4.8 plots the peak memory usage of our analysis tools as a percentage
of the full CCT. On many benchmarks, our profilers use less than 1% of the
space required by the full CCT, and even in the worst case the space usage is
about 6.5%. Space Saving is always more efficient than Lossy Counting, and the
lazy approach is preferable to the bucket-based implementations suggested by
the authors [22]. It should be noted, however, that the space used in practice by
Lossy Counting is much smaller than the theoretical prediction (see Section 2.2.2)
and considerably smaller than the 7/ε bound discussed by the authors [20] for
real-world datasets.
Quite surprisingly, the integration with static bursting improves space usage
(results for adaptive bursting are similar and are not reported in the chart). This
depends on the fact that sampling reduces the variance of calling context frequen-
cies: MCCT cold nodes that have a hot descendant are more likely to become hot
when sampling is active, and monitoring these nodes in place of others reduces
the total MCCT size. The histogram also shows that static/adaptive bursting
alone (i.e., without streaming) is not sufficient to reduce the space substantially:
along with hot contexts, a large fraction of cold contexts is also sampled and
included in the CCT. On the oocalc benchmark, even adaptive bursting without
re-enablement requires 127 times more space than LSS. We also observed that
the larger the applications, the larger the space reduction of our approach over
bursting alone.
[Histogram; y-axis: benchmarks (amarok, ark, audacity, bluefish, dolphin, firefox, gedit, ghex2, gimp, sudoku, gwenview, inkscape, oocalc, ooimpress, oowriter, pidgin, quanta, vlc); x-axis (log scale, 0.1 to 100): virtual memory peak (% of full CCT); bars: LSS + static burst, LSS, BSS, LC + static burst, LC, static burst. Plot title: space comparison with full CCT.]
Figure 4.8: Space analysis on several benchmarks of BSS/LSS/LC, static bursting,
and LSS/LC combined with static bursting, with sampling interval 2 msec and
burst length 0.2 msec.
Cost of analysis. Since our approach is independent of any specific instru-
mentation mechanisms, and different techniques for tracing routine enter and
exit events might incur rather different overheads in practice, in this paragraph
we will focus on the time required by the analysis routines only. Our trace ana-
lyzer provides a mechanism to measure the time spent in analysis subtracting the
fraction spent in I/O operations (see Section 3.4). Compared to the construc-
tion of the full CCT, streaming algorithms incur a small overhead, which can be
considerably reduced by exploiting bursting techniques.
[Histogram; y-axis (log scale, 0.25 to 64): speedup w.r.t. CCT construction; three groups: no sampling, sampling interval 2 ms, sampling interval 10 ms; within the sampled groups: static, RR 20%, RR 10%, RR 5%, adaptive; bars: full CCT, bursting, LSS, LC. Plot title: comparison of analysis running times using static and adaptive bursting.]
Figure 4.9: Speedup analysis relative to full CCT construction on the vlc bench-
mark for LSS/LC, static and adaptive bursting alone, and LSS/LC combined
with static and adaptive bursting: (1) no sampling (left histogram); (2) sampling
interval 2 msec and burst length 0.2 msec (middle histogram); (3) sampling inter-
val 10 msec and burst length 0.2 msec (right histogram). The CCT construction
baseline is about 1.5 seconds for 2.5 · 108 routine enter/exit events.
In Figure 4.9 we plot the speedup that can be obtained by combining our
approach with static and adaptive bursting. We compared the performance of
six variants (from slowest to fastest) for both LSS and LC: full (φ, ε)-HCCT
construction (on the entire routine enter/exit stream), static bursting, adaptive
bursting with re-enablement ratio equal to 20%, 10%, and 5%, and adaptive
bursting without re-enablement. We omitted sample-driven stack-walking, which
has been shown to be largely inferior to bursting with respect to accuracy [22].
The construction of the HCCT greatly benefits from bursting, achieving speedups
up to 35× in the execution time of analysis routines. LSS is always faster than
LC, but both are slower than bursting without streaming; the gap, however, is
moderate: LSS is typically less than 2× slower.
4.2.4 Performance
Performance of our gcc-based profiling tools is evaluated against running times of
suitable SPEC CPU2006 benchmarks compiled with the base set of optimizations.
According to the SPEC Run and Reporting Rules³, the median of three runs for each
benchmark is picked. In these experiments we used ε = 2 · 10−5.
In Figure 4.10 we plot the slowdown introduced by our profiling tools and by
gprof. We do not report results for LC, since LSS is always faster. We also
omit results for LSS with adaptive bursting, since the improvement over LSS
with static bursting is not as remarkable as in the analysis-only evaluation at
the end of the previous section: the overhead introduced by instrumentation code
alone (the empty analysis tool) often accounts for a large fraction of total
execution time, hence the benefits of adaptive bursting with re-enablement are
less appreciable.
The overhead introduced by our streaming-based approach with respect to
canonical CCT construction is small, while the integration with static bursting
reduces by half the slowdown in many benchmarks. In the worst case, LSS with
static bursting introduces 1.56x slowdown for C benchmarks (sjeng) and 3.46x
for C++ benchmarks (omnetpp). CCT construction times for gobmk and sjeng
are not available since the corresponding CCTs are too large to be maintained
in main memory. For the sake of completeness, absolute execution times are
reported at the end of this section in Table 4.3.
Note that there is no linear correlation between the size of the CCT (reported
in Table 4.1) and the slowdown introduced by our profiling tools. Consider, for
³ http://www.spec.org/cpu2006/Docs/runrules.html
[Histogram; y-axis: slowdown, 0 to 5.5; x-axis: astar, gobmk, h264ref, mcf, omnetpp, povray, sjeng, soplex; bars: CCT, LSS, LSS+static, gprof, empty. Plot title: analysis of slowdown introduced by profiling wrt native execution.]
Figure 4.10: Analysis of slowdown introduced by profiling tools with respect to
a native execution of a set of SPEC CPU2006 benchmarks. The tool empty
instruments routine enter and exit events without performing any analysis and
is used to evaluate the effectiveness of gcc-based instrumentation. Statistics are
reported also for the traditional exhaustive construction of the CCT.
instance, mcf and povray: the sizes of their CCTs are of the same order of
magnitude, yet the slowdown for mcf is much smaller than for povray. Moreover,
the slowdown is not monotone in the size of the CCT: for instance, the overhead
introduced for mcf is remarkably lower than for astar and omnetpp, whose CCTs
are considerably smaller and larger, respectively, than the CCT of mcf.
In Figure 4.11 we compare the performance of our profilers with gprof. The
reasons behind this choice are several: its widespread use in the profiling
community, its compiler-assisted instrumentation (the -pg option of gcc),
and its compatibility with SPEC CPU2006 benchmarks. gprof is a very efficient
tool: the introduced slowdown with respect to a native execution is 2.07x in the
worst case scenario (omnetpp benchmark, see Figure 4.10), while it is very close
to 1.5x for most benchmarks. We believe that aiming at better performance
than gprof would be a very ambitious, and probably unrealistic, goal, at
least without the adoption of sampling and bursting techniques: gprof has been
largely studied and optimized over the years, and computes much less accurate
information in terms of analysis depth.
[Histogram; y-axis: slowdown, 0 to 3; x-axis: astar, gobmk, h264ref, mcf, omnetpp, povray, sjeng, soplex; bars: CCT, LSS, LSS+static, empty. Plot title: comparison of performance of gcc-based profilers with gprof.]
Figure 4.11: Analysis of slowdown introduced by profiling tools with respect to
an execution of SPEC CPU2006 benchmarks under gprof.
A comparison of the performance of the empty tool with gprof is very useful
for evaluating the effectiveness of gcc-based instrumentation. Quite
surprisingly, the empty tool is consistently slower than gprof on C++
benchmarks (two times slower in the worst case, the soplex benchmark): the
-finstrument-functions instrumentation framework provided by g++ does not
seem to be as efficient as gcc's. Since this is still an experimental feature,
we are confident that performance improvements might be realized soon.
In conclusion, we believe that the results of the comparison with gprof are
encouraging: in three benchmarks (i.e., h264ref, mcf, and sjeng) LSS with static
bursting performs slightly faster, and only in two benchmarks the slowdown is
greater than 1.6x (in particular, it is close to 2x for omnetpp and less than 2.4x
for soplex).
Benchmark native CCT LSS LSS+static gprof empty
astar 667 1705 1748 1452 1001 1271
gobmk 688 N/A 1776 1049 984 867
h264ref 1097 2582 2676 1676 1685 1455
mcf 399 567 581 571 576 460
omnetpp 480 2169 2317 1663 797 1290
povray 393 1992 2056 1241 813 956
sjeng 817 N/A 2368 1278 1296 1037
soplex 495 1412 1567 1229 526 1059
Table 4.3: Execution times (in seconds) for SPEC CPU2006 benchmarks.
4.3 Conclusions
In this chapter we have presented and discussed an extensive experimental study
of our profiling mechanism, using two distinct sets of benchmarks for accuracy
and performance metrics evaluation. Data streaming algorithms have proved
themselves to be very effective in this setting: both algorithms include very
few false positives in the (φ, ε)-HCCT, while average counter error is negligible
especially for LC. On the other hand, the overhead introduced with respect to
canonical CCT construction is small, while the integration with static bursting
reduces by half the slowdown in many benchmarks. Moreover, the results of the
comparison with gprof are encouraging, with our tool performing slightly faster
in some benchmarks while providing much more accurate contextual profiling
information.
Chapter 5
Conclusions
Summary
Program analysis tools are extremely important in software development and
engineering. In particular, profilers are largely used to understand program be-
haviour (e.g., identify critical sections of code), aid program optimization (i.e.,
make some aspects of it work more efficiently or use fewer resources), and detect
anomalies of program execution or code injection attacks. The call graph con-
structed by many performance profiling tools such as gprof does not accurately
describe context-sensitive information, since it represents pairs of caller-callee
routines rather than maintaining a separate node for each call stack that a pro-
cedure can be activated with. For this reason, calling context trees have been
proposed in order to offer a compact representation of all calling contexts encoun-
tered during a program’s execution. However, even for short runs of medium-sized
applications, CCTs can be rather large and difficult to analyze. Moreover, their
sheer size might hurt execution time due to poor access locality during construc-
tion and query.
Motivated by the observation that only a very small fraction of calling contexts
are representative of hot spots to which optimizations must be directed, in this
thesis we have presented a novel technique for improving the space efficiency of the
complete calling context tree without sacrificing accuracy. We have formalized
the concept of Hot Calling Context Tree, and we cast the problem of identifying
the most frequent contexts into a data streaming setting: by adapting fast and
space-efficient algorithms for mining frequent items in data streams, we have
devised an algorithm for interprocedural contextual profiling that can discard
cold contexts on the fly, thus reducing memory usage to as little as 1% of the
space required to maintain the complete CCT in main memory. In particular, we have
realized a highly tuned implementation of the Space Saving algorithm that is more
efficient in practice than the one proposed by the authors and that might be of
independent interest. Our approach appears to be extremely practical and can
be easily integrated with several techniques (e.g., static and adaptive bursting)
that have been proposed in recent years to reduce the overhead introduced by
instrumentation.
We have evaluated the effectiveness of our approach in terms of accuracy
and memory usage on several large-scale Linux applications: results not only
confirm, but reinforce the theoretical predictions concerning these two aspects.
We have analyzed the performance overhead introduced by our gcc-based profiling
tools using a set of benchmarks from the SPEC CPU2006 suite. In particular,
for C benchmarks the combination of Lazy Space Saving with static bursting
introduces in the worst case a 1.56x slowdown with respect to a native execution.
Moreover, this version of our profiler performs slightly faster than gprof in several
C benchmarks. We believe that the results of this experimental evaluation are
encouraging, since gprof does not provide additional context-sensitive information
beyond the extent of a call graph.
Outlook
In this thesis the focus is on frequency counts (i.e., on the number of activations
of each calling context throughout the execution) as performance metrics, but
our approach can be extended to arbitrary metrics such as execution time, cache
misses, or instruction stalls. Even though a preliminary analysis suggests that
the data streaming algorithms presented in this thesis might not adapt well to
scenarios in which items in the data stream are weighted, better data streaming
algorithmic solutions are already available. Another interesting direction to inves-
tigate is whether a similar approach can be applied to certain space-demanding
forms of path profiling: e.g., it would be interesting to understand if our tech-
niques could help address the scalability problems encountered when collecting
performance metrics about interprocedural path profiling in order to capture both
inter- and intra-procedural control flow (i.e., acyclic paths that may cross proce-
dure boundaries) [21], or k-iteration paths [28] (i.e., intraprocedural cyclic paths
spanning up to k loop iterations) for large values of k.
A natural evolution of our work on performance profiling and algorithm design
would be to study methodologies and techniques for analyzing the performance
and behaviour of large distributed applications, developing profiling tools for big
data computing platforms and cloud architectures such as MapReduce [11]. In a
recent work [10] it is shown that collected profiling information can be extremely
large (e.g., about 100GB for a TeraSort job): we believe that an interesting
research direction would be to investigate whether streaming algorithms can be
adapted to this scenario in order to mine on-the-fly only data items that are
possibly relevant with respect to the metrics of interest.
Appendix A: Using the tools
In this appendix we describe how to install and use the profiling tools discussed
in this thesis. Source code can be retrieved through Google Code:
$ svn co -N http://hcct.googlecode.com/svn/trunk/ hcct_read_only
$ cd hcct_read_only
$ svn up thesis
$ cd thesis
The only library required for compilation is GLib2¹, available on most Linux
distributions. Source code has been extensively tested on several Ubuntu releases.
Pin tracer
The trace recorder based on Pin can be used to perform preliminary analysis on
the target program without the need to recompile it. Source code is organized
in two directories, rectrace and stool, within the pin_tracer module. The
Pin instrumentation framework should be installed in a folder called pin under
pin_tracer. Once the toolkit has been downloaded², the main folder should be
renamed accordingly:
$ tar xzvf pin-2.11-49306-gcc.3.4.6-ia32_intel64-linux.tar.gz
$ mv pin-2.11-49306-gcc.3.4.6-ia32_intel64-linux pin
Compiling the pintool is very simple. The STool infrastructure is used by the
trace recorder tool as an intermediate layer towards Pin, hence it should be
compiled first. No configuration is needed.
¹ http://www.gtk.org/
² http://www.pintool.org
In order to compile both modules:
$ cd stool
$ make
$ cd ../rectrace
$ make
The compiled pintool is saved as obj-ia32/rectrace.so in the rectrace folder.
A pintool should be invoked through pin in order to be properly executed. How-
ever, on several recent Linux distributions a preliminary operation is needed³:
$ sudo -i
(insert root password)
# echo 0 > /proc/sys/kernel/yama/ptrace_scope
# exit
Recording an execution is very simple. We will use xcalc as an
example:
$ ../pin/pin -t obj-ia32/rectrace.so -- xcalc
[TID=0] Starting new thread with output file xcalc-0.cs.trace
The analyzed application will work as usual, slowed down by instrumentation
and flush operations needed to record the trace on disk. Note that a separate file
is created for each thread.
Trace analyzer
The trace analyzer is independent of the instrumentation framework used for
recording and can be used to perform preliminary analysis or parameter tuning
of data streaming algorithms and sampling/bursting techniques. Source code is
contained in the trace_analyzer module and can be compiled by simply executing
the make command from inside the module directory.
³ Pin uses parent injection as default mode since it is less likely to cause problems.
However, on many recent Linux distributions the usage of the ptrace system call is restricted
for security reasons. For more information, see the Injection section in the Pin User Manual.
A collection of trace analysis tools is generated in the bin directory, one for
each algorithm, profiling methodology, and type of analysis to perform. The prefix
denotes the algorithm in use: cct for the exhaustive CCT construction; empty
for only parsing the stream; bss-hcct, lss-hcct, and lc-hcct for the con-
struction of the (φ, ε)-HCCT through the BSS, LSS, and LC streaming algorithms,
respectively. The infix denotes the bursting technique in use: burst for static
bursting; adaptive-XX for adaptive bursting⁴; adaptive-no-RR for adaptive bursting
without re-enablement. When no infix is present, the stream of events is analyzed
in its entirety (i.e., no sampling is performed). The suffix denotes the type of
analysis to perform (streaming algorithms only): metrics for evaluating accuracy
metrics with respect to the exhaustive CCT construction; perf for evaluating the
time consumed by analysis routines (i.e., the cost of analysis) and peak memory usage.
The syntax of the tools is straightforward. Three parameters are mandatory:
the name of the trace file to analyze, the benchmark considered (it will
appear in the log), and the name of the output log file. Other parameters depend
on both the algorithm and the methodology in use. For instance, consider
lss-hcct-burst-metrics:
$ bin/lss-hcct-burst-metrics input_file.trace benchmark_name \
logs/output.log -phi <PHI> -epsilon <EPSILON> -tau <TAU> \
-sampling <SAMPLING_RATE> -burst <BURST_LENGTH>
PHI and EPSILON are defined as ⌈1/φ⌉ and ⌈1/ε⌉, respectively. For instance,
a choice of parameters φ = 10−4 and ε = 2 · 10−5 entails PHI=10000 and
EPSILON=50000. Similarly, TAU is defined as ⌈1/τ⌉ and is used to evaluate the
hot edge coverage. SAMPLING RATE and BURST LENGTH are expressed in terms of
timer ticks: the former is defined as the number of timer ticks between two
sampling points, while the latter is the duration of a burst expressed in
number of timer ticks. The default granularity for the Pin tracer is 200 µs.
The syntax of the output log file is quite simple. We will briefly discuss
the format for the *-metrics tools. The first row is used as a header, and
subsequent executions of the trace analysis tools with the same output file would
not replicate it. Rows (one for each execution of the tool) are organized in
⁴ XX is defined as 1/RR. For instance, adaptive-5 entails RR=20%.
columns separated by spaces and can be parsed by graphical visualization
tools such as gnuplot or by user-defined scripts. The first columns contain general
information about the benchmark (e.g., the length of the stream, the number of
nodes of the call graph and of the [H]CCT, and the number of hot/cold nodes
and false positives); then several accuracy metrics are reported (e.g., degree of
overlap, hot edge coverage, and max/avg error on counters). When bursting is in
use, related information (e.g., the number of bursts, the resolution of the timer,
and the sampled percentage of the whole stream) is reported as well.
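Since the log is plain space-separated text, a short awk pipeline is often enough for ad-hoc inspection before reaching for gnuplot. The miniature log below is fabricated and its column names are hypothetical; consult the actual header row of your log for the real column meanings:

```shell
# Fabricated *-metrics log: first row is the header, one row per run.
# Column names here are invented for illustration only.
cat > /tmp/output.log <<'EOF'
benchmark stream_length hcct_nodes
bzip2 1000000 4000
gcc 2500000 9000
EOF
# Skip the header row and print benchmark name with its node count:
awk 'NR > 1 { print $1, $3 }' /tmp/output.log
```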
The gcc-based profiler
The gcc-based profiler is composed of a set of dynamic libraries, one for each
profiling methodology, as usual. To compile the source code it is sufficient
to run the make command from inside the gcc profiler module directory.
The compiled libraries must be reachable by the executables to be profiled:
$ ls libs
libhcct_cct_burst.so libhcct_empty_burst.so libhcct_lss.so
libhcct_cct_malloc.so libhcct_empty.so libhcct_tracer.so
libhcct_cct.so libhcct_lss_burst.so
$ sudo -i cp libs/* /usr/lib/
(insert root password)
The user can choose a library as the default one and link applications against it during
recompilation. By changing a symbolic link, it is possible to switch the profiling
library in use transparently. For instance, to choose LSS with static
bursting as the default profiling library:
$ sudo ln -s /usr/lib/libhcct_lss_burst.so /usr/lib/libhcct.so
Recompiling the application to be analyzed does not require modifications to
the source code. It is sufficient to edit the Makefile to specify certain
options for the compiler (the CFLAGS variable for C sources, CXXFLAGS for C++) and
for the linker (the LDFLAGS variable). In particular, the compiler has to be instructed to
generate instrumentation calls on function entry and exit and also to produce
debugging information⁵, while the linker should link the profiling library into the
executable and also wrap the thread creation and termination functions⁶. Hence, the
required options are:
# Makefile
CFLAGS="-finstrument-functions -g"
LDFLAGS="-lhcct -lpthread -Wl,-wrap,pthread_create \
-Wl,-wrap,pthread_exit"
A collection of demo programs is available under the test folder: the same
demo program is compiled and linked against each of the profiling libraries
separately. Parameters for the streaming algorithms and for the bursting mechanism
can be specified as environment variables:
$ DUMPPATH="/tmp" EPSILON=1000 PHI=100 SINTVL=200000 \
BLENGTH=20000 test/lss_burst
The meaning of the environment variables is straightforward. DUMPPATH specifies
the path for storing the output file(s); the values for EPSILON and PHI are defined as
for the trace analyzer; SINTVL and BLENGTH specify (in nanoseconds) the sampling
interval and the burst length, respectively.
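Under static bursting, the expected fraction of the event stream that gets sampled is roughly BLENGTH/SINTVL; with the example values used above (20000 ns bursts every 200000 ns) that is 10%. A quick check:

```shell
# Expected sampled fraction of the stream for the example values above
# (burst length divided by sampling interval, both in nanoseconds).
SINTVL=200000
BLENGTH=20000
awk -v s="$SINTVL" -v b="$BLENGTH" 'BEGIN { printf "%.0f%%\n", 100 * b / s }'
```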
For each thread a separate output file is produced, containing the tree
constructed by the profiling tool. File names follow the syntax
<benchmark>-<PID/TID>.tree, and the tree is dumped with an in-order visit
to ease its reconstruction. A simple tool called analysis, provided under the
tools folder, computes simple statistics about the tree; it is a starting
point for the implementation of more complex analysis methods.
Routine addresses are stored in hexadecimal format so that they can be resolved
using addr2line. Given an address and an executable, this tool uses the debug-
ging information in the executable to reconstruct the source file name and line
number associated with the address. As an example, we executed the testing
program for the CCT construction and extracted a random routine address from
⁵ Debugging information is needed to resolve addresses using the addr2line command.
⁶ The profiling library will invoke the Pthread library after an initialization phase.
the output files; as the reader can see, the information reported by addr2line is very
accurate:
$ addr2line -e test/cct 8048e00
/home/pctips/hcct_read_only/thesis/gcc_profiler/example.c:29
Finally, the libhcct_tracer.so library provides a trace recorder that is fully
equivalent to the Pin tracer; hence, its output files can be analyzed with the
trace analyzer as well.
Note: since this profiler is still experimental, many new features and improvements
will be implemented soon. Please check the project page
http://code.google.com/p/hcct/ for news and releases.
List of Figures
1.1 Call graph, Call tree, and Calling Context Tree . . . . . . . . . . 6
2.1 Skewness of calling contexts distribution . . . . . . . . . . . . . . 19
2.2 CCT, HCCT, and (φ, ε)-HCCT . . . . . . . . . . . . . . . . . . . 20
2.3 Tree data structures and calling contexts classification . . . . . . . 21
4.1 Relation between φ and degree of overlap . . . . . . . . . . . . . . 45
4.2 Hot edge coverage - relation between φ and τ . . . . . . . . . . . 46
4.3 Max and avg frequency of calling contexts not in (φ, ε)-HCCT . . 48
4.4 Partition of (φ, ε)-HCCT nodes into cold, hot, and false positives . 49
4.5 False positives in the (φ, ε)-HCCT as a function of ε . . . . . . . . 50
4.6 Accuracy of context frequencies computed by LSS and LC . . . . 51
4.7 Hot edge coverage analysis with bursting enabled . . . . . . . . . 52
4.8 Space analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.9 Cost of analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.10 Slowdown introduced with respect to a native execution . . . . . . 57
4.11 Slowdown introduced with respect to gprof . . . . . . . . . . . . . 58
List of Tables
2.1 Number of nodes of call graph, call tree, calling context tree, and
number of distinct call sites for different applications. . . . . . . . 17
4.1 Benchmarks selected from SPEC CPU2006. . . . . . . . . . . . . 41
4.2 Typical thresholds. . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Execution times (in seconds) for SPEC CPU2006 benchmarks. . . 59
Listings
2.1 prune(X,MCCT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1 find-min() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Implementation of find-min() with sentinel and second min idx . 27
3.3 Tree node structure for Lossy Counting . . . . . . . . . . . . . . . 29
3.4 Activation paradigm of STool . . . . . . . . . . . . . . . . . . . . 31
3.5 Function prototypes for gcc-based instrumentation . . . . . . . . . 34
3.6 Wrapping of pthread create and pthread exit . . . . . . . . . . 36
3.7 Activation paradigm of the gcc-based profiling infrastructure . . . 37
3.8 Syntax of output format for gcc-based online profiler . . . . . . . 38
Bibliography
[1] Glenn Ammons, Thomas Ball, and James R. Larus. Exploiting hardware
performance counters with flow and context sensitive profiling. SIGPLAN
Not., 32(5):85–96, 1997.
[2] Matthew Arnold and Barbara G. Ryder. A framework for reducing the cost
of instrumented code. In PLDI, pages 168–179, 2001.
[3] Matthew Arnold and Peter F. Sweeney. Approximating the calling context
tree via sampling. Technical Report RC 21789, IBM Research, 2000.
[4] Thomas Ball and James R. Larus. Efficient path profiling. In MICRO
29: Proceedings of the 29th annual ACM/IEEE international symposium on
Microarchitecture, pages 46–57, 1996.
[5] Thomas Ball, Peter Mataga, and Mooly Sagiv. Edge profiling versus path
profiling: the showdown. In POPL, pages 134–148. ACM, 1998.
[6] Andrew R. Bernat and Barton P. Miller. Incremental call-path profiling.
Technical report, University of Wisconsin, 2004.
[7] Michael D. Bond and Kathryn S. McKinley. Probabilistic calling context.
SIGPLAN Not. (proceedings of the 2007 OOPSLA conference), 42(10):97–
112, 2007.
[8] Derek Bruening. Efficient, Transparent, and Comprehensive Runtime Code
Manipulation. PhD thesis, M.I.T., 2004.
[9] Bryan Buck and Jeffrey K. Hollingsworth. An API for runtime code patch-
ing. International Journal of High Performance Computing Applications,
14(4):317–329, 2000.
[10] Jinquan Dai, Jie Huang, Shengsheng Huang, Bo Huang, and Yan Liu. Hi-
tune: dataflow-based performance analysis for big data cloud. In Proceed-
ings of the 3rd USENIX conference on Hot topics in cloud computing, Hot-
Cloud’11, 2011.
[11] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing
on large clusters. In Proceedings of the 6th conference on Symposium on
Opearting Systems Design & Implementation - Volume 6, 2004.
[12] Daniele Cono D’Elia, Camil Demetrescu, and Irene Finocchi. Mining hot
calling contexts in small space. In Proceedings of the 32nd ACM SIGPLAN
conference on Programming language design and implementation, PLDI ’11,
pages 516–527, 2011.
[13] Nathan Froyd, John M. Mellor-Crummey, and Robert J. Fowler. Low-
overhead call path profiling of unmodified, optimized code. In ICS, pages
81–90, 2005.
[14] Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick. gprof: a
call graph execution profiler (with retrospective). In Best of PLDI, pages
49–57, 1982.
[15] Satish Chandra Gupta. Need for speed – eliminating performance bot-
tlenecks. http://www.ibm.com/developerworks/rational/library/05/
1004_gupta/. (accessed September 2, 2012).
[16] Robert J. Hall and Aaron J. Goldberg. Call path profiling of monotonic
program resources in unix. In USENIX Summer, pages 1–14, 1993.
[17] Martin Hirzel and Trishul Chilimbi. Bursty tracing: A framework for low-
overhead temporal profiling. In Proc. 4th ACM Workshop on Feedback-
Directed and Dynamic Optimization, 2001.
[18] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser,
Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood.
Pin: building customized program analysis tools with dynamic instrumen-
tation. In PLDI, pages 190–200, 2005.
[19] Nishad Manerikar and Themis Palpanas. Frequent items in streaming data:
An experimental evaluation of the state-of-the-art. Data Knowl. Eng.,
68(4):415–430, 2009.
[20] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts
over data streams. In VLDB, pages 346–357, 2002.
[21] David Melski and Thomas W. Reps. Interprocedural path profiling. In CC,
pages 47–62, 1999.
[22] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. An integrated
efficient solution for computing frequent and top-k elements in data streams.
ACM Trans. Database Syst., 31(3):1095–1133, 2006.
[23] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations
and Trends in Theoretical Computer Science, 1(2), 2005.
[24] Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavy-
weight dynamic binary instrumentation. In PLDI, pages 89–100, 2007.
[25] Seongbae Park. pthread get,setspecific vs thread local variable (Bits and
pieces). https://blogs.oracle.com/seongbae/entry/pthread_get_set_
specific_vs. (accessed September 2, 2012).
[26] Carl Ponder and Richard J. Fateman. Inaccuracies in program profilers.
Softw., Pract. Exper., 18(5):459–467, 1988.
[27] The GNU project. gcc - the GNU C Compiler. http://gcc.gnu.org. (ac-
cessed September 2, 2012).
[28] Subhajit Roy and Y. N. Srikant. Profiling k-iteration paths: A generalization
of the Ball-Larus profiling algorithm. In CGO, pages 70–80, 2009.
[29] Mauricio Serrano and Xiaotong Zhuang. Building approximate calling con-
text from partial call traces. In Proceedings of the 7th annual IEEE/ACM
International Symposium on Code Generation and Optimization, CGO ’09,
pages 221–230, 2009.
[30] J. Michael Spivey. Fast, accurate call graph profiling. Softw., Pract. Exper.,
34(3):249–264, 2004.
[31] John Whaley. A portable sampling-based profiler for Java virtual machines.
In Proceedings of the ACM 2000 Conference on Java Grande, pages 78–87.
ACM Press, 2000.
[32] Xiaotong Zhuang, Mauricio J. Serrano, Harold W. Cain, and Jong-Deok
Choi. Accurate, efficient, and adaptive calling context profiling. In PLDI,
pages 263–271, 2006.