Operating System-Level
On-Chip Resource Management in
the Multicore Era
by
Xiao Zhang
Submitted in Partial Fulfillment
of the
Requirements for the Degree
Doctor of Philosophy
Supervised by
Professor Sandhya Dwarkadas
Department of Computer Science
Arts, Sciences, and Engineering
Edmund A. Hajim School of Engineering and Applied Sciences
University of Rochester
Rochester, New York
2010
Curriculum Vitae
Xiao Zhang was born in Jishou, a beautiful county-level city in Hunan province
of the People’s Republic of China, on September 2, 1982. In 2000, he entered
the University of Science and Technology of China and graduated in 2004 with a
Bachelor of Science degree in Computer Science. From 2005 to 2010, he attended
the University of Rochester where he pursued a Doctor of Philosophy in Computer
Science under the direction of Professor Sandhya Dwarkadas. He received the
Master of Science degree in Computer Science from the University of Rochester
in 2008. During the summers of 2008 and 2009, he interned at VMware, Inc.,
performing collaborative research with Richard West, Puneet Zaroo, and Carl
Waldspurger.
Acknowledgments
This dissertation would not have been possible without Dr. Sandhya
Dwarkadas who not only serves as my advisor but also motivates and challenges
me throughout my academic program at the University of Rochester. I am heartily
thankful for your encouragement, guidance, and support.
I am also grateful to Dr. Kai Shen, who unreservedly offered helpful and
insightful suggestions and taught me how to tackle system problems. I greatly
appreciate your advice and guidance.
It is an honor for me to thank my committee members Chen Ding and Michael
Huang, my thesis defense chair Paul Ampadu, and the other faculty in the systems
group Michael Scott and Engin Ipek, for introducing me to systems research and
shaping my sense of research.
During my internship at VMware, I had the privilege of working with Carl
Waldspurger, Puneet Zaroo, Richard West, and Haoqiang Zheng. Their help and
encouragement built up my confidence during my stays at VMware.
I am also indebted to my friends and colleagues at the University of Rochester:
Arrvindh Shriraman, Tongxin Bai, Girts Folkmanis (now at Google), Rongrong
Zhong, Xiaoming Gu, and Qi Ge.
I would like to thank my parents, Ping Zhang and Lijuan Yang. They have
always supported me and helped me make the right decision to come to the
University of Rochester.
Lastly, but most importantly, I would like to thank my wife, Yang Gao. She
has always been there cheering me on and standing by me through the good and
bad times.
This material is based upon research supported by the National Science
Foundation (grant numbers: CNS-0411127, CAREER Award CCF-0448413, CNS-
0509270, CNS-0615045, CNS-0615139, CCF-0621472, CCF-0702505, ITR/IIS-
0312925, CCR-0306473, and CNS-0834451), the National Institutes of Health (5
R21 GM079259-02 and 1 R21 HG004648-01), IBM Faculty Partnership Awards,
and the University of Rochester. Any opinions, findings, and conclusions or rec-
ommendations expressed in this material are those of the author(s) and do not
necessarily reflect the views of the above named organizations.
Abstract
CPU manufacturers are trending toward designs with multiple cores on a chip in
order to continue to scale with technology. One common feature of these multicore
chips is resource sharing among sibling cores that sit on the same chip, such as
shared last level cache and memory bandwidth. Without careful management,
such sharing could open a loophole in terms of performance, fairness, and security
concerns.
My dissertation addresses resource management issues on multicore chips at
the operating system level. Specifically, I introduce three techniques to control
resource usage and study a variety of resource management policies that consider
fairness, quality of service, performance, or power.
First, I propose a hot-page coloring approach that enforces cache partitioning
on only a small set of frequently accessed (or hot) pages to segregate most inter-
thread cache conflicts. Cache colors are allocated using miss ratio curves. The
cost of identifying hot pages online is reduced by leveraging knowledge of spatial
locality during a page table scan of access bits. Hotness-based page coloring
greatly alleviates the disadvantages of naive page coloring (memory allocation
constraint and recoloring overhead) in practice.
Second, I demonstrate that resource-aware scheduling on multicore-based SMP
platforms can mitigate resource contention. Resource-aware scheduling employs a
simple heuristic that can be easily derived from hardware performance counters.
By grouping applications with similar memory access behaviors, resource con-
tention can be reduced and better overall system performance can be achieved.
Aside from the benefits of reduced hardware resource contention, it also provides
opportunities for CPU power savings and thermal reduction.
Finally, I show how to reuse existing hardware features to control resource
usage. I demonstrate an online framework based on hardware execution throttling
(e.g., voltage/frequency scaling, duty-cycle modulation, and cache prefetcher
adjustment) that effectively controls shared resource usage (regardless of
resource type) on multicore chips.
Table of Contents
Curriculum Vitae ii
Acknowledgments iii
Abstract v
List of Tables x
List of Figures xi
Foreword 1
1 Motivation and Introduction 2
1.1 Multicore Resource Management Concerns . . . . . . . . . . . . . 2
1.2 Challenges to Addressing Multicore Resource Management . . . . 5
1.3 Dissertation Statement . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . 7
2 Background and Related Work 9
2.1 Hardware Performance Counters . . . . . . . . . . . . . . . . . . . 9
2.2 Resource-aware Scheduling . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Power Management . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Cache Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Hardware Execution Throttling . . . . . . . . . . . . . . . . . . . 21
3 Toward Practical Page Coloring 23
3.1 Issues of Page Coloring in Practice . . . . . . . . . . . . . . . . . 24
3.2 Page Hotness Identification . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 Sequential Page Table Scan . . . . . . . . . . . . . . . . . 25
3.2.2 Acceleration for Non-Accessed Pages . . . . . . . . . . . . 29
3.3 Hot Page Coloring . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 MRC-Driven Partition Policy . . . . . . . . . . . . . . . . 32
3.3.2 Hotness-Driven Page Recoloring . . . . . . . . . . . . . . . 33
3.4 Relief of Memory Allocation Constraints . . . . . . . . . . . . . . 34
3.5 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.6 Related Work and Summary . . . . . . . . . . . . . . . . . . . . . 49
4 Resource-aware Scheduling on Multi-chip Multicore Machines 53
4.1 Resource Contention on Multi-chip Multicore Machines . . . . . . 54
4.1.1 Mitigating Memory Bandwidth Contention . . . . . . . . . 54
4.1.2 Efficient Cache Sharing . . . . . . . . . . . . . . . . . . . . 57
4.2 Additional Benefits on CPU Power Savings . . . . . . . . . . . . . 59
4.2.1 Constraint of DVFS on Multicore Chips . . . . . . . . . . 59
4.2.2 Model-Driven Frequency Setting . . . . . . . . . . . . . . . 60
4.3 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4 Discussion and Summary . . . . . . . . . . . . . . . . . . . . . . . 70
5 Hardware Execution Throttling 72
5.1 Comparisons of Existing Multicore Management Mechanisms . . . 72
5.1.1 Effectiveness . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1.2 Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Hardware Throttling Based Multicore Management . . . . . . . . 77
5.2.1 Throttling Mechanisms in Consideration . . . . . . . . . . 77
5.2.2 Resource Management Policies . . . . . . . . . . . . . . . . 78
5.2.3 A Simple Heuristic-Based Greedy Solution . . . . . . . . . 79
5.3 A Flexible Model-Driven Iterative Refinement Framework . . . . . 81
5.3.1 Performance Prediction Models . . . . . . . . . . . . . . . 82
5.3.2 Online Deployment Issues . . . . . . . . . . . . . . . . . . 86
5.4 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.4.1 Offline Evaluation . . . . . . . . . . . . . . . . . . . . . . . 89
5.4.2 Online Evaluation . . . . . . . . . . . . . . . . . . . . . . . 96
5.5 Related Work and Summary . . . . . . . . . . . . . . . . . . . . . 100
6 A Unified Middleware 103
6.1 Design and Implementation . . . . . . . . . . . . . . . . . . . . . 103
6.2 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7 Conclusions and Future Directions 112
Bibliography 115
List of Tables
2.1 Brief description of four L1/L2 cache prefetchers on Intel Core 2
Duo processors [Intel Corporation, 2006]. . . . . . . . . . . . . . 22
3.1 Memory footprint sizes and numbers of excess page table entries
for 12 SPECCPU2000 benchmarks. The excess page table entries
are those that do not correspond to physically allocated pages. . . 28
4.1 Benchmark suites and scheduling partitions of 5 tests. Comple-
mentary mixing mingles high-low miss-ratio applications such that
two chips are equally pressured in memory bandwidth. Similarity
grouping separates high and low miss-ratio applications on different
chips (Chip-0 hosts high miss-ratio ones in these partitions). . . . 63
5.1 Summary of the comparison among methods. . . . . . . . . . . . 95
5.2 Average runtime overhead in milliseconds of calculating best duty
cycle configuration. Before each round of sampling, Exhaus-
tive searches and compares all possible configurations while Hill-
Climbing limits calculation to a small portion. . . . . . . . . . . . 97
List of Figures
1.1 Performance comparison between cache sharing and partitioning.
We run three pairs of SPECCPU2000 benchmarks on a 3 GHz Intel
Woodcrest dual-core chip (two cores share a 4 MB L2 cache). Ideal
represents the application running alone and serves as a baseline
performance. Cache partitioning applies page coloring to partition
the 4 MB cache among two applications. Default cache sharing is
the hardware default cache sharing without any control. . . . . . . 4
2.1 An illustration of the page coloring technique. . . . . . . . . . . . 19
3.1 Unused bits of page table entry (PTE) for 4K page on 64-bit and
32-bit x86 platforms. Bits 11-9 are hardware defined unused bits for
both platforms [Intel Corporation, 2006; AMD Corporation, 2008].
Bits 62-48 on the 64-bit platform are reserved but not used by
hardware right now. Our current implementation utilizes 8 bits in
this range for maintaining the page hotness counter. . . . . . . . 27
3.2 Illustration of a page non-access correlation as a function of the
spatial page distance. Results are for 12 SPECCPU2000 bench-
marks with 2-millisecond sampled access time windows. For each
distance value D, the non-access correlation is defined as the prob-
ability that the next D pages are not accessed in a time window if
the current page is not accessed. We take snapshots of each bench-
mark’s page table every 5 seconds and present average non-access
correlation results here. . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Illustration of sequential page table scan with locality jumping. . 31
3.4 An example of our cache partitioning policy between swim and
mcf. The cache miss ratio curve for each application is constructed
(offline or during an online learning phase) by measuring the miss
ratio at a wide range of possible cache partition sizes. Given the
estimation of application performance at each cache partitioning
point, we determine that the best partition point for the two ap-
plications is if 1 MB cache is allocated to swim and 3 MB cache to
mcf. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Procedure for hotness-based page recoloring. A key goal is that hot
pages are distributed to all assigned colors in a balanced way. . . 35
3.6 Overhead comparisons under different page hotness identification
methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.7 Proportion of skipped page table entries (PTEs) due to our locality-
jumping approach in page hotness identification. . . . . . . . . . . 39
3.8 Jeffrey divergence on identified page hotness between various ap-
proaches and the baseline (an approximation of “true page hotness”). 40
3.9 Rank error rate on identified page hotness between various ap-
proaches and the baseline (an approximation of “true page hotness”). 41
3.10 All-page comparison of page hotness identification results for sequential
table scan with locality-jumping approach (at once-per-100-millisecond
sampling frequency) and the baseline page hotness. Pages are sorted by
their baseline hotness. The hotness is normalized so that the hotness of
all pages in an application sum up to 1. . . . . . . . . . . . . . . . . 42
3.11 Normalized execution time of different victim applications under
different cache pollution schemes. The polluting application is
swim. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.12 Contention relations of two groups of SPECCPU2000 benchmarks.
If A points to B, that means B has more than 50% performance
degradation when running together with A on a shared cache, com-
pared to running alone when B can monopolize the whole cache. . 44
3.13 Performance comparisons under different cache management poli-
cies for 6 multi-programmed tests (four applications each) on a
dual-core platform. . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.14 Unfairness comparisons (the lower the better) under different cache
management policies for 6 multi-programmed tests (four applica-
tions each) on a dual-core platform. . . . . . . . . . . . . . . . . . 48
4.1 Cache miss-ratio (L2 cache misses per kilo data references) and
cache miss-rate (L2 misses per kilo instructions) of 12 SPEC-
CPU2000 benchmarks. In general, these two metrics show high
correlation. We label the first six benchmarks (mcf, swim, equake,
applu, wupwise, and mgrid) as high miss-ratio applications and
the latter six (parser, bzip, gzip, mesa, twolf, and art) as low
miss-ratio applications. . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Normalized miss ratios of 12 SPECCPU2000 benchmarks at differ-
ent cache sizes. The normalization base for each application is its
miss ratio at 512 KB cache space. Cache size allocation is enforced
using page coloring [Zhang et al., 2009b]. Solid lines mark the six
applications with the highest miss ratios while dotted lines mark
the six applications with the lowest miss ratios. Threshold of label-
ing high/low miss-ratio is based on their miss-ratio values shown in
Figure 4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 The accuracy of our variable-frequency performance model. Fig-
ure (A) shows the measured normalized performance (to that of
running at the full CPU speed of 3 GHz). Figure (B) shows our
model’s prediction error (defined as (prediction − measurement)/measurement). . . . . 62
4.4 Performance (higher is better) of the different scheduling policies
at full CPU speed. . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Performance comparisons of different scheduling policies when
Chip-0 is scaled to 2 GHz. In subfigure (A), the performance nor-
malization base is the default scheduling without frequency scaling
in all cases. In subfigure (B), the performance loss is calculated
relative to the same scheduling policy without frequency scaling in
each case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.6 Performance and power consumption for per-chip frequency scaling
under the similarity grouping schedule. Figure (B) only shows the
range of active power (from idle power at around 224 watts), which
is mostly consumed by the CPU and memory in our platform. . . 66
4.7 Power efficiency for per-chip frequency scaling under the similarity
grouping schedule. Figure (A) uses whole system power while (B)
uses active power in the efficiency calculation. . . . . . . . . . . . 67
4.8 Performance and power consumption for baseline and fair per-chip
frequency scaling under the similarity grouping scheduling. . . . . 68
4.9 On-chip temperature changes in Celsius degree for the per-chip fre-
quency scaling under the similarity grouping scheduling. In each
case, we present a relative number beyond(+) or below(-) the tem-
perature measured under the default scheduling. . . . . . . . . . . 69
5.1 SPECJbb’s performance when its co-runner swim is regulated using
two different approaches: scheduling quantum adjustment (default
100-millisecond quantum) and hardware throttling. Each point
in the plot represents performance measured over a 50-millisecond
window. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 We co-schedule swim and SPECWeb on an Intel Woodcrest chip
where two sibling cores share a 4MB L2 cache. Here we compare
the effectiveness of different mechanisms in reducing unfairness. . 74
5.3 Accuracy comparison of our model and a naive method. Performance
prediction error is defined as |prediction − measurement|/measurement. The average predic-
tion error of each application in each set is reported here. Solid lines
represent prediction by our model and dashed lines represent prediction
by a naive method. . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4 Examples of our iterative model for some real tests. X-axis shows the
N -th sample. For the top half of the figure, the Y-axis is the L1 dis-
tance (or Manhattan distance) from the current sample to optimal (best
configuration as chosen by the Oracle). Configuration is represented as
a quad-tuple (u, v, w, z) with each dimension indicating the duty cycle
level of the corresponding core. For the bottom half of the figure, Y-axis
is the average performance prediction error of all considered points over
applications in the set. Here considered points are selected according to
the hill climbing algorithm in Section 5.3.2. . . . . . . . . . . . . . . 91
5.5 Comparison of methods with unfairness ≤ 0.10. In (a), the unfair-
ness target threshold is indicated by a solid horizontal line (lower
is good). In (b), performance is normalized to that of Oracle. In
(c), Oracle requires zero samples. . . . . . . . . . . . . . . . . . . . 93
5.6 Comparison of methods for high-priority thread QoS ≥ 0.60. In (a),
the QoS target is indicated by a horizontal line (higher is good).
In (b), performance is normalized to that of Oracle. In (c), Oracle
requires zero samples. . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.7 Online test results of 5 SPECCPU2000 sets. Default is the default
system running without any throttling. Only duty cycle modula-
tion is used by Model as the throttling mechanism. . . . . . . . . 96
5.8 Online unfairness test of four server applications on platforms
“Woodcrest” and “Nehalem”. Default is the default system running
without any throttling. Model here only uses duty cycle modula-
tion as throttling mechanism. . . . . . . . . . . . . . . . . . . . . 98
5.9 Online QoS test of four server applications on “Woodcrest” and
“Nehalem”. (a) shows results of 4 different tests with each selecting
a different server application as the high-priority QoS one. Same
applies to (b). Default refers to the default system running without
any throttling. Model only uses duty cycle modulation as throttling
mechanism. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.10 Online test of power efficiency (performance per watt). Default
is the default system running without any throttling. Model w.o.
DVFS only uses duty cycle modulation as throttling mechanism.
Model w. DVFS combines two throttling mechanisms (duty cycle
modulation and dynamic voltage/frequency scaling). . . . . . . . 100
6.1 Comparison results of experiment where CPUs are not over-
committed (number of concurrently running applications equals
number of cores). . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.2 Sensitivity tests with varying sampling intervals (10 milliseconds, 100
milliseconds, and 1 second) and restart frequency (5, 10, 20, and
30 samples). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.3 Comparison results of experiment where CPUs are over-committed
(number of concurrently running applications is larger than number
of cores). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Foreword
I am very fortunate and honored to have collaborated with professors and students
at the University of Rochester. Chapter 3 is based on work published at Eu-
roSys’09 [Zhang et al., 2009b], in collaboration with Sandhya Dwarkadas and
Kai Shen. I initiated and implemented the hotness-based page coloring project.
Chapter 4 is based on work published at USENIX ATC’10 [Zhang et al., 2010b],
in collaboration with Kai Shen, Sandhya Dwarkadas, and Rongrong Zhong. I
revealed the opportunity and challenge of voltage/frequency scaling on existing
multichip multicore machines and came up with the idea of similarity grouping.
Rongrong Zhong helped set up the MySQL benchmark for this project. Chapter 5
is based on work published at USENIX ATC’09 [Zhang et al., 2009a], in collabora-
tion with Sandhya Dwarkadas and Kai Shen; and work under submission [Zhang
et al., 2010c], in collaboration with Rongrong Zhong, Sandhya Dwarkadas, and
Kai Shen. I described a new hardware throttling mechanism for multicore resource
management and developed an iterative refinement framework to automatically
configure its settings. Rongrong Zhong contributed to the core performance pre-
diction model and proposed the hill-climbing search algorithm in this project. I
was the principal developer for this project. Needless to say, Professor Dwarkadas
and Professor Shen provided valuable suggestions and guidance for all projects. I
could not accomplish these projects without their tremendous support.
1 Motivation and Introduction
Multicore chips, for instance, Intel’s Nehalem, AMD’s Opteron, IBM’s Cell,
NVIDIA’s GPGPU, and ARM’s Cortex-A9, are dominant on today’s market.
These vendors largely cover server, PC, home entertainment, and mobile device
markets. One of the common features of the multicore architecture is that all
cores on a single chip share some cache (usually the last level cache) and off-chip
memory bandwidth. Such sharing presents new challenges due to the uncontrolled
resource competition from simultaneously executing processes. However, today’s
operating systems manage multicore processors in a time-shared manner similar to
traditional single-core uniprocessor systems and are oblivious to on-chip resource
contention. Some attention is paid to cache locality among the multiple cores by
hierarchical load balancing, which preferentially migrates processes to sibling
cores. The additional challenges due to the subtle interactions of simultaneously
executing processes sharing on-chip resources have not been addressed in main-
stream operating systems, largely due to the complex nature of the interactions.
1.1 Multicore Resource Management Concerns
The major issue with multicore resource management is uncontrolled resource
contention. For example, processes that are simultaneously accessing the shared
cache can conflict with each other and result in skewed performance. The perfor-
mance of a process that would normally have been high due to the cache being
large enough to fit its working set could be severely impacted by a simultane-
ously executing process with aggressive and massive cache demand, resulting in
the first process’s cache lines being evicted by the second process. Figure 1.1
shows examples of running pairs of SPECCPU2000 benchmarks on an
Intel Woodcrest dual-core chip with two cores sharing a 4 MB L2 cache. Here Ideal
means without resource contention (i.e., application runs alone). Cache partition-
ing applies page coloring to partition the shared cache between the two applications.¹
Default cache sharing is the hardware default cache sharing without any control.
From this figure we can see that careful cache space management like cache parti-
tioning can achieve significant overall performance and fairness improvement over
default cache sharing.
The contention resulting from uncontrolled resource utilization raises the con-
cern of performance isolation on multicore chips. On one hand, performance can
fluctuate and is hard to predict. In Figure 1.1, for example, swim has a relative
performance (normalized to ideal) of about 0.9 when run together with twolf, and
its performance drops to around 0.7 when it is co-scheduled with equake. On
the other hand, fairness is not well maintained since aggressive threads tend to
occupy more resources and therefore make more progress, while victim threads ex-
hibit poor performance even given equal amount of CPU time. Figure 1.1 shows
that art achieves a relative performance (normalized to ideal) of 0.3 while its
co-runner swim can sustain performance above 0.7.
Uncontrolled resource usage also triggers possible security loopholes. A mali-
cious thread can take advantage of this loophole to launch a denial of service (DoS)
attack at the chip level [Moscibroda and Mutlu, 2007] and make a service hosted in
a cloud computing facility (e.g., Amazon [Amazon] and GoGRID [GoGrid, 2008])
¹Details on how we actually partition the cache can be found in Chapter 3.
[Figure 1.1 comprises three bar-chart panels (swim with art, swim with equake, and swim with twolf); the y-axis of each panel is normalized performance (0.2 to 1), with bars for Ideal, Cache Partitioning, and Default Cache Sharing.]
Figure 1.1: Performance comparison between cache sharing and partitioning. We
run three pairs of SPECCPU2000 benchmarks on a 3 GHz Intel Woodcrest dual-
core chip (two cores share a 4 MB L2 cache). Ideal represents the application
running alone and serves as a baseline performance. Cache partitioning applies
page coloring to partition the 4 MB cache among two applications. Default cache
sharing is the hardware default cache sharing without any control.
totally inaccessible. In addition to DoS attacks, multicore chips are also prone to
information leakage. Malicious hackers can infer other applications’ cache miss
patterns on a shared cache and hence their execution behaviors. Previous work
[Percival, 2005; Zhang et al., 2007] shows that it is possible to steal the private
RSA key in OpenSSL [OpenSSL, 2007] by a sibling thread/core eavesdropping
on RSA encryption/decryption execution patterns.²
²Security is imperative to multicore resource management, but this dissertation does not
explore security implications directly.
1.2 Challenges to Addressing Multicore Re-
source Management
The first challenge is that commodity operating systems such as Linux lack ca-
pabilities to learn applications’ chip level resource consumption and competition.
Unlike other system resources such as memory and disk, operating systems basi-
cally treat processors as black boxes and have no knowledge of how chip resources
are allocated among competing threads. For example, commodity operating sys-
tems cannot determine how much cache space a running thread actually occupies
due to lack of low-level hardware resource accounting.
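What the operating system can observe are raw event counts from hardware performance counters, and the heuristics used in this dissertation are derived from ratios of such counts. A minimal sketch of such a derivation (the event choices and the 0.05 threshold are illustrative assumptions; the actual counter events are processor-model-specific):

```python
def l2_miss_ratio(l2_misses: int, data_refs: int) -> float:
    """Derive a miss-ratio heuristic from two raw counter deltas.

    l2_misses and data_refs are event-count deltas read from hardware
    performance counters over a sampling window; the exact events to
    program (e.g., L2 lines in, L1 data references) vary by processor.
    """
    if data_refs == 0:
        return 0.0
    return l2_misses / data_refs

def is_high_miss_ratio(l2_misses: int, data_refs: int,
                       threshold: float = 0.05) -> bool:
    """Label a thread memory-intensive when its miss ratio is high.
    The 0.05 cutoff is a hypothetical value, not the dissertation's."""
    return l2_miss_ratio(l2_misses, data_refs) >= threshold
```

Chapter 4 uses this kind of labeling to split applications into high and low miss-ratio groups.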
The second challenge is limited available mechanisms for operating systems to
enforce a thread’s chip level resource allocation/usage. The state of the art ex-
isting mechanism to partition shared cache space is page coloring. This technique
itself exerts adverse effects in practice: expensive overhead during re-partitioning
and memory allocation constraints. Another studied mechanism is to adjust a
thread’s CPU time-slice to compensate or penalize threads for under-utilization
or over-utilization of shared resources. Modern operating systems schedule threads
in a round robin fashion: a CPU runs a thread for a time-slice defined by its pri-
ority and then performs a context switch to run the next available thread. By
modifying its time-slice, operating systems can effectively control threads’ resource
usage. However, this mechanism complicates CPU scheduling and works at a coarse
granularity, since a typical time-slice is tens to hundreds of milliseconds.
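For concreteness, page coloring (the first mechanism above) assigns each physical page a "color" from the address bits that select cache sets; pages of the same color contend for the same slice of the cache. A sketch of the color computation, assuming a 4 MB shared L2 (as on the Woodcrest platform used later), 4 KB pages, and an assumed 16-way associativity; the real color count depends on the actual cache geometry:

```python
PAGE_SIZE = 4096                  # 4 KB pages
CACHE_SIZE = 4 * 1024 * 1024      # 4 MB shared L2
ASSOCIATIVITY = 16                # assumed; hardware-specific

# One color corresponds to one page-sized slice of the cache's
# set-index range.
NUM_COLORS = CACHE_SIZE // (ASSOCIATIVITY * PAGE_SIZE)   # 64 here

def page_color(phys_addr: int) -> int:
    """Color of the physical page containing phys_addr: the low bits
    of the physical page number that also index cache sets."""
    return (phys_addr // PAGE_SIZE) % NUM_COLORS
```

Giving one application 16 of the 64 colors and its co-runner the other 48 then enforces a 1:3 cache split, at the cost of constraining which physical pages each application may be allocated.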
The last, but not least, challenge is the absence of appropriate management
policies for a selected management mechanism. A good policy should address
practical concerns (e.g., fairness) and be easy to adopt. Since multicore resource
contention is a complicated issue, a well-designed policy should also be flexible
across a range of conditions (e.g., varying architectural parameters) and different
management objectives (e.g., performance vs. power).
1.3 Dissertation Statement
This dissertation addresses multicore resource management with a focus on fair-
ness, performance, and power. We present three novel system-level approaches to
tackle this problem: resource-aware scheduling, hotness-based page coloring, and
hardware execution throttling. We demonstrate that our approaches achieve bet-
ter or competitive performance over the default system and provide capabilities
to satisfy a variety of other management objectives such as fairness, quality of
service (QoS), and power savings.
1.4 Contributions
The approaches described in this dissertation utilize a series of system-level tools
and mechanisms, such as performance counters, page coloring, duty cycle modula-
tion, and frequency/voltage scaling, to address resource management on multicore
chips.
• We devise and implement an efficient way to track memory page access
frequency (i.e., page hotness). The cost of identifying hot pages online is
reduced by leveraging knowledge of spatial locality during a page table scan
of access bits. Based on this, we propose hot-page-based page coloring, which
enforces coloring on only a small set of frequently accessed (or hot) pages for
each process. Guided by a miss-ratio-curve driven partitioning policy, hot-
page-based selective coloring can significantly alleviate the coloring-induced
adverse effects in practice and considerably improve performance over naive
page coloring.
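The hotness tracking in this bullet can be sketched as a periodic pass over PTE access bits that decays each page's counter and credits pages touched since the last pass; the decay factor and counter representation below are illustrative, not the dissertation's exact parameters:

```python
def update_hotness(access_bits, hotness, decay=0.5, bump=1.0):
    """One scan pass: decay old hotness, credit pages whose access
    bit was set since the last pass, then clear the bits (modeling
    the clearing of hardware access bits in the page table).

    access_bits: list of 0/1 flags, one per page, as read from PTEs.
    hotness:     list of float counters, same length, updated in place.
    """
    for i, bit in enumerate(access_bits):
        hotness[i] = hotness[i] * decay + (bump if bit else 0.0)
        access_bits[i] = 0
    return hotness

def hottest_pages(hotness, k):
    """Indices of the k hottest pages: the candidates for coloring."""
    return sorted(range(len(hotness)), key=lambda i: -hotness[i])[:k]
```

Coloring only the pages returned by `hottest_pages` is what confines the memory-allocation constraint and recoloring cost to a small fraction of the address space.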
• We present a simple yet efficient resource-aware scheduling policy for
multicore-based symmetric multiprocessors. The scheduling policy considers both
memory bandwidth congestion and cache space interference, and has ad-
ditional benefits in the ability to engage chip-wide CPU power savings.
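The grouping heuristic in this bullet reduces, at its core, to sorting applications by a miss-ratio estimate and packing similar ones onto the same chip. A minimal two-chip sketch (the miss-ratio numbers are made up for illustration):

```python
def similarity_grouping(apps, cores_per_chip):
    """Partition apps across two chips so applications with similar
    miss ratios share a chip: high miss-ratio apps together, low
    miss-ratio apps together.

    apps: dict mapping app name -> miss-ratio heuristic.
    Returns (chip0, chip1); chip0 receives the high miss-ratio group,
    mirroring the convention of Table 4.1.
    """
    ranked = sorted(apps, key=apps.get, reverse=True)
    return ranked[:cores_per_chip], ranked[cores_per_chip:]

chip0, chip1 = similarity_grouping(
    {"mcf": 0.12, "swim": 0.10, "mesa": 0.01, "twolf": 0.02}, 2)
# chip0 holds the two high miss-ratio apps, chip1 the two low ones.
```

Because the memory-intensive group is confined to one chip, the other chip's low miss-ratio applications keep their cache working sets, and the memory-intensive chip becomes a natural candidate for frequency scaling.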
• We advocate hardware execution throttling as an effective tool to support
fair use of shared resources on multicore chips. We also propose a flexi-
ble framework to automatically find a proper hardware execution throttling
configuration for a user-specified objective. A variety of resource manage-
ment objectives, such as fairness, QoS, performance, and power efficiency
can be targeted. The essence of our framework is an iterative prediction
refinement procedure and a customizable model that currently incorporates
both duty cycle modulation and voltage/frequency scaling effects. Our ex-
perimental results show that our approach quickly arrives at the exact or
close to optimal configuration.
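The iterative refinement in this bullet can be pictured as a hill climb over per-core throttling levels: from the current configuration, score only the neighboring configurations (one core's duty-cycle level moved one step) under the prediction model, and move while the predicted objective improves. A sketch with the model abstracted as a callable (the level range and objective are placeholders, not the dissertation's exact formulation):

```python
def hill_climb(start, levels, objective, max_iters=100):
    """Greedy hill climbing over per-core throttling configurations.

    start:     tuple of duty-cycle levels, one per core.
    levels:    (min_level, max_level) allowed for each core.
    objective: callable(config) -> score (higher is better); stands in
               for the model-predicted management objective.
    """
    lo, hi = levels
    current, best = start, objective(start)
    for _ in range(max_iters):
        improved = False
        for core in range(len(current)):
            for step in (-1, +1):
                lvl = current[core] + step
                if not lo <= lvl <= hi:
                    continue
                cand = current[:core] + (lvl,) + current[core + 1:]
                score = objective(cand)
                if score > best:
                    current, best, improved = cand, score, True
        if not improved:
            break
    return current, best
```

Evaluating only neighbors is what keeps each refinement round cheap even when the full configuration space holds hundreds or thousands of points.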
1.5 Dissertation Organization
Chapter 2 discusses background and related work, including hardware perfor-
mance counters, CPU scheduling, hardware cache partitioning, page coloring,
power management, and hardware execution throttling.
Chapter 3 elaborates our contribution of making page coloring more practical [Zhang et al., 2009b] in general systems. Page coloring is the only pure software solution for partitioning a cache without any hardware support. However, traditional page coloring places additional constraints on memory space allocation and incurs substantial overhead for page recoloring. We propose a hot-page coloring approach that enforces coloring on only a small set of frequently accessed (or hot) pages to segregate most inter-thread cache conflicts. We also design an efficient online hot-page identification mechanism that leverages knowledge of spatial locality during a page table scan of access bits. Our results demonstrate that hot-page identification and selective coloring can significantly alleviate the coloring-induced adverse effects in practice.
Chapter 4 draws attention to resource-aware scheduling on multicore-based SMP platforms. Specifically, our scheduling policy (similarity grouping) groups applications with different cache miss ratios on different chips. On one hand, it avoids memory bandwidth over-saturation, since memory-intensive applications never run concurrently on all chips at once. On the other hand, it helps separate low-miss-ratio applications, which may be more sensitive to cache pressure, from high-miss-ratio applications, which aggressively occupy cache space with little benefit. Such scheduling also creates the opportunity for non-uniform per-chip voltage/frequency settings.
In Chapter 5, we describe hardware execution throttling, which can effectively control both cache space and memory bandwidth usage. By throttling down the execution speed of some of the cores, we can control an application's relative resource utilization to achieve desired management objectives. In addition, we introduce a model-based iterative refinement framework to automatically and quickly determine an optimal (or close-to-optimal) hardware execution throttling configuration for a given user-specified optimization target. This fast-search capability makes the approach particularly useful on platforms with hundreds or thousands of possible configurations.
The three multicore resource management solutions described above are orthogonal yet complementary. Chapter 6 presents a unified prototype middleware combining similarity grouping scheduling and hardware execution throttling. We conclude and discuss future research directions in Chapter 7.
2 Background and Related Work
In this chapter, we provide necessary background on the system techniques described in this dissertation and discuss related work in those areas.
2.1 Hardware Performance Counters
Hardware performance counters are a set of on-chip registers that can be programmed to count various hardware events. The counters increase monotonically and can be initialized with an arbitrary starting value. Counter overflow can be captured by a hardware-triggered interrupt, but overflow is rare because the counters are wide (usually between 40 and 64 bits).
Architected performance counters were introduced on modern processors in the early 1990s, and have since provided a rich source of statistical information about program execution characteristics. Nowadays, processors from major vendors such as Intel, IBM, AMD, and Sun are all equipped with performance counters, although the number of counters varies. For example, the Intel Pentium 4 processor with hyper-threading has 18 general-purpose counters shared by two sibling hardware threads [Intel Corporation, 2006]. The Sun UltraSPARC series has 2 performance counters in each virtual processor [Sun Microsystems, Inc, 2005]. IBM PowerPC 64-bit processors usually contain 6 to 8 counters depending on the model [Oprofile].
Configuring performance counters only requires writing platform-specific registers, which typically takes a few hundred cycles. This extremely low overhead makes them broadly used in systems research for a variety of purposes. Early utilization of performance counters was mainly focused on workload profiling, debugging, and modeling. Sweeney et al. [Sweeney et al., 2004] utilized performance counters to monitor program behavior. On a multiprocessor platform, they modified the Jikes Research Virtual Machine (RVM) to correctly attribute counter values to each Java thread in multithreaded applications. Relying on the traced counter statistics, they filtered out hardware events with low correlation to performance (they used instructions per cycle as their performance metric) and made some interesting observations on the pseudojbb benchmark (a variant of SPECjbb2000). One interesting "anomaly" they found was that an application's performance improved automatically over time in Jikes. The reason was that the Jikes RVM had an adaptive optimization system (AOS) that behaved conservatively at the beginning of application execution. During execution, it gradually learned to choose more advanced optimization levels for certain code segments based on runtime feedback. Luo et al. [Luo and John, 2001] and Seshadri et al. [Seshadri and Mericas, 2001] also studied performance issues of server applications by leveraging performance counters. Luo's work focused on the scalability of Java applications such as SPECjbb2000 and VolanoMark. Their findings indicated that with an increasing number of threads, applications could exhibit better instruction locality, while resource stalls also increased and eventually dwarfed the benefits of improved instruction locality. Seshadri's study suggested that the instruction cache and L2 cache are the two primary hotspots affecting application performance on PowerPC processors. Eeckhout et al. [Eeckhout et al., 2002] used a time series of counter statistics to compare the mutual behavioral differences among different program inputs and to help select representative input data sets. In the work of Sherwood [Sherwood et al., 2003], Balasubramonian [Balasubramonian et al., 2000], and Shen [Shen et al., 2004], performance counters were used to determine program phases. The rationale was that program phases are execution durations over which behavior remains more or less stable, and phase transitions can be detected from changes in hardware event counts.
Performance counters have also been widely used in power and thermal man-
agement. Bellosa et al. [Bellosa, 2000; Bellosa et al., 2003] first proposed pro-
cessor counter-based power consumption modeling, namely event-driven energy
accounting. They pre-calculated/calibrated energy consumption base units for
a variety of hardware events such as cache references, cache misses, and branch
instructions, and converted each observed event into the corresponding energy
consumption. Such an event-driven energy accounting method made it possible to
accurately predict processor power consumption and greatly facilitated operat-
ing systems’ support for fine-grained power management. Later on, Heath et
al. [Heath et al., 2006] incorporated this counter-based energy accounting in their
Mercury project to manage thermal emergencies in server clusters. The basic idea
was that when estimated servers’ temperatures went beyond a red-flag threshold,
a load adjustment would take place to mitigate this thermal emergency. Some
other studies [Weissel and Bellosa, 2002; Isci et al., 2006; Kotla et al., 2004] used
performance counters as guidance to tune voltage/frequency scaling for power
savings. We will discuss them in Section 2.3.
Most counter-based work has been evaluated on single-thread/process or multiprogrammed workloads. When a single server application (consisting of many concurrent requests) runs on a machine, it is beneficial to analyze application behavior at request granularity. A server request usually goes through multiple components during its execution. For example, it may first be handled by a front-end server layer, then be handed to a decision-making layer, and eventually
triggers an update in a back-end database. Shen et al. [Shen et al., 2008] proposed
a mechanism to intercept the layer (or component) transition point and propagate
request context properly to attribute counter statistics to individual requests. Un-
like Magpie [Barham et al., 2004], which is only capable of analyzing per-request behavior off-line, on-the-fly request characterization can greatly facilitate online system adaptations (e.g., admission control for different types of requests).
There have been efforts, such as PAPI [Browne et al., 2000], perfmon2 [Eranian, 2006], and perfctr [Pettersson, 2009b], to standardize the performance counter API across different platforms. Other investigations aimed to support performance counter monitoring at a large scale. For example, Azimi et al. [Azimi et al., 2005] proposed time-multiplexing hardware counters to simultaneously cover more events, linearly scaling up the partially sampled counter values to mimic the results without counter sharing/multiplexing. Wisniewski et al. [Wisniewski and Rosenburg, 2003] implemented an infrastructure that logs events in per-CPU buffers to support event storage and tracing. Blue Gene [Salapura et al., 2008] was designed to provide concurrent access to a large number of counters.
Lastly, there has also been a group of proposals on enriching existing hardware
counters. El-Moursy et al. [El-Moursy et al., 2006] suggested new counters (the
number of ready instructions and the number of in-flight instructions) to help
derive metrics more correlated to hardware utilization than instructions per cycle.
Settle et al. [Settle et al., 2004] proposed new counters to collect cache references
and misses at cache set granularity. These new counters could be used to estimate
the usage of cache sets and guide the scheduler to co-execute threads that have
fewer conflicts. Zhao et al. [Zhao et al., 2007] investigated tagging the cache at
block granularity to provide more fine-grained information on cache sharing and
contention.
2.2 Resource-aware Scheduling
Multiprocessor systems such as simultaneous multithreading (SMT), chip multiprocessing (CMP, more often referred to as a multicore processor), and symmetric multiprocessing (SMP) are commonplace nowadays. Commodity operating systems such as the Linux kernel [Linux Open Source Community, 2010] mainly deal with two problems in multiprocessor scheduling: load balancing and cache affinity. Load balancing attempts to assign each processor a roughly equal amount of work. If the workload is unbalanced, the scheduler migrates some tasks from the heavily loaded processors to less loaded processors to re-balance them. However, task migration has associated costs: when a task migrates to a remote processor, it can no longer take advantage of a warmed-up cache. Newer versions of the Linux kernel scheduler mitigate such cache affinity issues by preferentially migrating a task within a processor domain, in which the source and target processors share some levels of cache. This is achieved by hierarchical load balancing starting from a basic scheduling domain. For example, on a multicore-based SMP platform, all sibling cores on a chip form the basic domain and all chips form a higher-level domain. Load balancing starts within each basic domain and then moves to the higher domain. By doing so, the scheduler first tries to eliminate load imbalance by moving tasks within a chip; if imbalance still exists, it performs inter-chip task migration.
Resource sharing further complicates the OS scheduler's task, mainly due to extensive contention for shared resources. A number of studies have explored resource-aware CPU scheduling to improve system performance and fairness. Most work in this direction tries to find simple yet effective heuristics to guide workload co-scheduling in a way that mitigates resource contention.
Parekh et al. [Parekh et al., 2000] and Snavely et al. [Snavely and Tullsen, 2000] first studied scheduling on SMT processors. Parekh found that the best overall system instruction throughput was achieved when threads with the highest instruction rates (instructions per cycle, or IPC) were co-scheduled together. Their explanation was that, with a shared instruction queue on SMT processors, low-IPC threads tend to hold buffers longer and may slow down the instruction flow of other high-IPC threads. Snavely used the term "symbiosis" to refer to co-scheduling of threads that share resources in a harmonious fashion. Their symbiotic scheduler periodically permuted threads for some time (the so-called sampling phase). After sampling, the scheduler picked the best co-schedule permutation according to certain metrics (whole-system instruction throughput, cache hit rate, etc.) measured during the sampling phase. Their work confirmed Parekh's IPC-based heuristic for SMT scheduling. In contrast to this IPC-grouping heuristic, Fedorova et al. [Fedorova et al., 2004] suggested co-scheduling a low-IPC thread together with a high-IPC thread. They argued that low-IPC threads usually have low pipeline resource requirements due to extensive memory accesses and long-latency instructions, and thus are more likely to leave functional units idle, whereas high-IPC threads have high pipeline resource requirements because they spend much less time stalled.
SMT processors implement resource sharing to an extreme, with almost all resources, such as the pipeline, functional units, and all levels of cache, shared among sibling hardware threads. In contrast, SMP processors typically share only off-chip memory bandwidth1. Since there is only one bottleneck shared resource, resource management is relatively straightforward. Antonopoulos et al. [Antonopoulos et al., 2003] and Zhang et al. [Zhang et al., 2007] advocated bandwidth-aware scheduling to mitigate memory bus congestion on SMP platforms. The idea was to co-schedule memory-intensive and non-memory-intensive applications on different chips, preventing the memory bus from being either underutilized or over-saturated. Such guidance not only eliminated severe bottleneck resource contention but also made efficient use of the available bandwidth. Antonopoulos's work assumed a constant peak bandwidth limit and used it to guide co-scheduling of jobs whose total bandwidth demand would not exceed the saturation limit. Zhang's work measured applications' memory bandwidth usage at runtime.
1Of course, an SMP processor itself could implement SMT, but we do not attribute SMT-sharing to SMP-sharing.
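The saturation-limit heuristic described above amounts to a simple feasibility check. The sketch below is ours, with hypothetical bandwidth figures in arbitrary units:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if a candidate co-schedule's total bandwidth demand stays at
 * or below the (assumed constant) bus saturation limit, 0 otherwise. */
static int coschedule_feasible(const double *bw_demand, size_t n,
                               double bw_limit)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += bw_demand[i];
    return total <= bw_limit;
}
```

A scheduler using this check would greedily admit jobs into a co-schedule only while the check still passes.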
On multicore processors, last level cache and memory bus are typically shared
by sibling cores. Chandra et al. [Chandra et al., 2005] and Zhuravlev et al. [Zhu-
ravlev et al., 2010] proposed predicting inter-thread cache space contention based
on applications’ reuse distance profiles. A reuse distance profile was a histogram
with individual buckets corresponding to different reuse distances in an LRU-like
stack. Given reuse distance profiles of multiple threads, Chandra’s stack distance
competition model would merge them into a single profile and simulate how they
would compete for cache space. Zhuravlev’s Pain model introduced two concepts:
cache sensitivity and cache intensity. Sensitivity indicates how many cache hits
from a thread running alone could turn into cache misses when multiple threads
are running concurrently. To simplify the computation burden, they assumed that
a cache line at position i in the stack had probability 1/i of being evicted by the
next distinct data access. Intuitively speaking, a cache line at position 1 means
it is least recently used and is very likely to be replaced. Intensity indicates how aggressively an application occupies the cache, and is measured by the application's cache misses per instruction. They defined the performance penalty (or pain, in their terms) as the product of one thread's sensitivity and its co-runner's intensity. The absolute value of this metric is meaningless, but the relative order of multiple pain values can be used to predict which co-schedule is better. Besides the computation overhead, the model's inputs (reuse distance profiles) are also very expensive to obtain.
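As a sketch of the pain heuristic just described (the sensitivity and intensity figures below are made up; only the relative order of pain values matters):

```c
#include <assert.h>
#include <math.h>

/* Pain of thread A when co-running with thread B: A's cache sensitivity
 * times B's cache intensity (misses per instruction). */
static double pain(double sensitivity_a, double intensity_b)
{
    return sensitivity_a * intensity_b;
}

/* Total pain of placing threads A and B on the same shared cache. */
static double co_pain(double sens_a, double inten_a,
                      double sens_b, double inten_b)
{
    return pain(sens_a, inten_b) + pain(sens_b, inten_a);
}
```

Comparing `co_pain` across candidate pairings predicts which co-schedule should hurt less.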
Instead of using reuse distance profiles, Merkel et al. [Merkel and Bellosa, 2008a] and Zhuravlev et al. [Zhuravlev et al., 2010] suggested using the miss rate (misses per instruction) as a simple heuristic to guide co-scheduling on multicore processors. Specifically, they suggested co-scheduling a high-miss-rate thread with a low-miss-rate thread within a multicore processor.
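A minimal sketch of this pairing heuristic, assuming threads are described only by a (made-up) miss-rate figure: sort by miss rate and pair opposite ends.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator: ascending miss rate. */
static int cmp_miss(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Pair the lowest-miss-rate thread with the highest, the second lowest
 * with the second highest, and so on. pairs[i][0]/pairs[i][1] receive
 * the miss rates of the i-th co-scheduled pair. Sorts rates in place. */
static void pair_by_miss_rate(double *rates, size_t n, double pairs[][2])
{
    qsort(rates, n, sizeof(double), cmp_miss);
    for (size_t i = 0; i < n / 2; i++) {
        pairs[i][0] = rates[i];         /* low-miss-rate thread  */
        pairs[i][1] = rates[n - 1 - i]; /* high-miss-rate thread */
    }
}
```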
Fedorova et al. [Fedorova et al., 2007] adjusted threads' CPU time slices as a way to control resource sharing. Their policy increases the time slices of threads with under-fair cache usage and shortens the time slices of threads with over-fair cache usage. Guan et al. [Guan et al., 2009b,a] performed theoretical analysis of the schedulability of deadline-driven real-time applications on multiprocessors. Jiang et al. [Jiang et al., 2008] proved that optimal co-scheduling on a multicore is NP-complete when the number of cores is larger than 2, and provided a divide-and-conquer approximation algorithm that solves the problem in polynomial time. Ghoting et al. and Zhang et al. [Ghoting et al., 2007; Zhang et al., 2010a] observed the need to match the development and compilation of multithreaded applications to the underlying platform in order to exploit the shared cache between cores.
2.3 Power Management
Power and energy consumption are prominent resource concerns in large data
centers. Bianchini and Rajamony [Bianchini and Rajamony, 2004] presented a
good survey of research efforts on power management strategies.
Power management usually employs hardware mechanisms such as voltage/frequency scaling and sleep states to transition a machine from high-power to low-power modes. There are two directions of power management in the research community. The first is power/energy management of large-scale systems such as data centers or server clusters. A typical server consumes a considerable amount of power (e.g., hundreds of watts) even when the system is idle. In large data centers, server machines are over-provisioned for peak workload, and most of the time they are idle or underutilized. Pinheiro et al. [Pinheiro et al., 2001] and Chase et al. [Chase et al., 2001] suggested concentrating the workload on a few machines during off-peak periods and keeping the other, idle machines in low-power modes or even shutting them down. Elnozahy et al. [Elnozahy et al., 2003] further introduced a request batching technique that accumulates incoming requests in memory while CPUs are kept in a low-power state during periods of sporadic workload. Weissel et al. [Weissel and Bellosa, 2004] and Wang et al. [Wang et al., 2005] advocated throttling processors to keep systems within a certain power/thermal budget envelope.
The other direction is to optimize active power on relatively small-scale machines. Many researchers have targeted the CPU since it has a wide range of active power consumption. Specifically, they used dynamic voltage/frequency scaling (DVFS) to control CPU power consumption. DVFS is a hardware mechanism on modern processors that trades processing speed for power savings. Typically, each CPU frequency level is paired with a minimum operating voltage, so that a frequency reduction lowers both power and energy consumption. Frequency scaling-based CPU power/energy optimization has been studied for over a decade. Weiser et al. [Weiser et al., 1994] first proposed adjusting the CPU speed according to its utilization. Pillai and Shin [Pillai and Shin, 2001] applied DVFS to deadline-driven embedded operating systems. The basic principle was that when the CPU was not fully utilized, its processing capability could be lowered to improve power efficiency. When the CPU was already fully utilized, DVFS could still be applied without hurting performance much, especially for memory-intensive applications. The rationale was that memory-bound applications do not have sufficient instruction-level parallelism to keep the CPU busy while waiting for memory accesses to complete, and therefore decreasing their CPU frequency does not result in a significant performance penalty. Some other studies focused on modeling the effects of DVFS on performance. A couple of studies [Weissel and Bellosa, 2002; Isci et al., 2006] utilized offline-constructed frequency selection lookup tables; such an approach requires a large amount of offline profiling. Merkel and Bellosa employed a linear model based on memory bus utilization [Merkel and Bellosa, 2008a], but it could only support a single frequency adjustment level. Kotla et al. [Kotla et al., 2004] constructed a performance model for variable CPU frequency levels. Specifically, they assumed that cache and memory stalls are not affected by CPU frequency scaling while other delays scale linearly. Their model was not evaluated on real frequency scaling platforms.
Barroso and Hölzle [Barroso and Hölzle, 2007] advocated that hardware design should trend toward energy-proportional computing; that is, hardware power consumption would be proportional to its computing load on future platforms. They showed that the CPU is currently the most energy-proportional component and urged the manufacturers of other components (memory and disk) to catch up.
2.4 Cache Partitioning
On multicore processors, the cache is designed to be large enough (e.g., the Intel Xeon 5160 CPU has a 4 MB L2 cache, and the Nehalem processor has an 8 MB L3 cache) to accommodate multiple concurrently executing threads. The trend toward larger and larger caches strongly motivates research on how to allocate/partition cache space among multiple competing threads.
Most hardware-based cache partitioning schemes require modifying the cache block replacement policy. Usually such a scheme tags each cache block with a thread ID and replaces blocks according to threads' shares rather than the least recently used (LRU) principle. Assuming such a block replacement mechanism was available, Suh et al. [Suh et al., 2001a] proposed an analytical cache model to estimate the miss rate of applications for any cache size at a given time quantum. They demonstrated that the estimated utility functions could be applied to cache partitioning to achieve better system instruction throughput. A coarser-granularity scheme is column caching/partitioning [Chiou et al., 2000]. Column caching treats each way in an n-way set-associative cache as a column, and cache block replacement is restricted within columns; it thus partitions the cache at way granularity.
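Way-granularity partitioning as column caching enforces it can be sketched as follows (the recency ordering and per-thread way mask here are illustrative, not any processor's actual mechanism):

```c
#include <assert.h>

/* Given a cache set's recency ordering (lru_to_way[0] is the least
 * recently used way) and a thread's allowed-column bitmask, return the
 * LRU way the thread may replace, or -1 if it owns no column. */
static int victim_way(const unsigned *lru_to_way, unsigned nways,
                      unsigned allowed_mask)
{
    for (unsigned i = 0; i < nways; i++) {
        unsigned w = lru_to_way[i];
        if (allowed_mask & (1u << w))
            return (int)w;
    }
    return -1;
}
```

Restricting each thread's victim selection to its own columns is what confines its data to its share of the ways.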
Figure 2.1: An illustration of the page coloring technique.
Systems without special hardware support can also partition the cache in a purely software way using the page coloring technique. The basic idea of page coloring is to control the mapping of physical memory pages to a processor's cache blocks, since the last level cache is typically indexed by physical address. Memory pages that map to the same cache blocks are assigned the same color (as illustrated in Figure 2.1). By controlling the colors of the pages assigned to an application, the operating system can manipulate cache blocks at page granularity (more precisely, the granularity is the product of the page size and the cache associativity). This granularity is the unit of cache space that can be allocated to an application.
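Concretely, the color arithmetic looks like this (the cache parameters below are hypothetical, resembling a 4 MB, 16-way L2 cache with 4 KB pages):

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_SIZE    (4u * 1024 * 1024) /* hypothetical 4 MB cache */
#define ASSOCIATIVITY 16u
#define PAGE_SIZE     4096u

/* Number of page colors = cache size / (page size x associativity). */
#define NUM_COLORS (CACHE_SIZE / (PAGE_SIZE * ASSOCIATIVITY))

/* A physical page's color is given by the cache-index bits just above
 * the page offset; consecutive physical pages cycle through the colors. */
static unsigned page_color(uint64_t phys_addr)
{
    return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}
```

With these parameters there are 64 colors, so each color corresponds to 4 MB / 64 = 64 KB of cache, i.e., exactly page size times associativity.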
Page coloring was first implemented in a MIPS operating system in the 1980s [Taylor et al., 1990]. The problem at that time was unstable performance due to random virtual-to-physical page mappings; the solution was page coloring, created to enforce a constant offset in page mappings. Kessler et al. [Kessler and Hill, 1992] surveyed several static page mapping/placement policies. They defined page coloring and bin hopping as two different techniques: page coloring maps pages close in virtual address (spatial locality) to different cache blocks, while bin hopping maps pages close in access time (temporal locality) to different cache blocks. Today, page coloring has been generalized to include bin hopping. Bershad et al. and Romer et al. [Bershad et al., 1994; Romer et al., 1994] examined dynamic page placement in hardware and software, respectively. Bershad et al. proposed a novel hardware component: the cache miss lookaside (CML) buffer. Upon a cache miss to a physical page, the miss counter of the corresponding CML entry is incremented; if a page accumulates many misses, it is better to remap (recolor) it to different cache blocks. Romer's work relied on software and existing hardware (the TLB and a cache miss counter) to detect conflicts: when the miss counter for the whole cache reached a certain threshold, the operating system took a snapshot of the TLB and recolored one of the pages that appeared to have the most conflicts. Bugnion et al. [Bugnion et al., 1996] utilized hints generated at compilation time to guide page allocation. Sherwood et al. [Sherwood et al., 1999] summarized the previous work and proposed their own software and hardware page placement schemes. Their software method was based on profiling: given page reference sequences, a greedy algorithm calculated good page colors so that conflicts were minimized. Their hardware method was similar to Bershad's CML buffer but used a modified hardware TLB that avoided copying memory pages during recoloring.
A few recent studies introduced the use of page coloring to control multicore cache partitioning in the operating system [Tam et al., 2007a; Lin et al., 2008]. Guided by information on an application's data access pattern (such as its miss ratio curve or stall rate curve), page coloring has the potential to reduce inter-thread cache conflicts and improve fairness. In Tam's work [Tam et al., 2007a], the partition point was fixed and no dynamic repartitioning/recoloring was involved. Lin et al. [Lin et al., 2008] extended that work by implementing dynamic page coloring.
2.5 Hardware Execution Throttling
Recent studies [Herdrich et al., 2009; Zhang et al., 2009a] advocate using exist-
ing hardware throttling mechanisms for multicore resource management. Specifi-
cally, there are three available mechanisms on Intel x86 platforms: dynamic volt-
age/frequency scaling (DVFS), duty cycle modulation, and hardware prefetching.
We have discussed DVFS in Section 2.3 and will focus on duty cycle modulation
and hardware prefetching in the following paragraphs.
Duty cycle modulation [Intel Corporation, 2006] is a hardware feature introduced by Intel. It allows the operating system to specify a portion (in multiples of 1/8) of regular CPU cycles as duty cycles by writing to the logical processor's IA32_CLOCK_MODULATION register. The processor is effectively halted during non-duty cycles for a duration of ∼3 microseconds [Intel Corporation, 2009a]. Different duty cycle ratios are achieved by keeping the duration for which the processor is halted constant at ∼3 microseconds and adjusting the period for which the processor is enabled. Duty cycle modulation is per-core controllable and was originally designed for thermal management: the system can throttle an overheated core without affecting its sibling cores.
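Composing the register value can be sketched as follows. The bit layout (bit 4 enables on-demand modulation, bits 3:1 encode the duty fraction in eighths) follows Intel's documentation for Core 2-era processors, but it is model-specific and should be verified against the manual for the target processor:

```c
#include <assert.h>
#include <stdint.h>

#define CLOCK_MOD_ENABLE (1u << 4) /* bit 4: on-demand modulation on */

/* Encode a duty cycle of duty_eighths/8 (valid values 1..7) into an
 * IA32_CLOCK_MODULATION value. Actually applying it requires a
 * privileged MSR write, e.g., via /dev/cpu/N/msr on Linux. */
static uint64_t clock_mod_value(unsigned duty_eighths)
{
    return CLOCK_MOD_ENABLE | ((uint64_t)(duty_eighths & 0x7u) << 1);
}
```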
Hardware prefetching is a widely used technique that hides memory latency by taking advantage of otherwise unused bandwidth. There are multiple hardware prefetchers on a single chip, and they are usually configurable by writing to platform-specific registers (e.g., the IA32_MISC_ENABLE register on Intel processors).

Prefetcher        Description
L1 IP             Keeps track of the instruction pointer and looks for sequential load history.
L1 DCU            When detecting multiple loads from the same line within a time limit, prefetches the next line.
L2 Adjacent Line  Prefetches the line adjacent to the requested data.
L2 Stream         Looks at streams of data for regular patterns.

Table 2.1: Brief description of the four L1/L2 cache prefetchers on Intel Core 2 Duo processors [Intel Corporation, 2006].

Table 2.1 lists the four prefetchers on Intel Core 2 Duo processors.
There are two L1 cache prefetchers (DCU and IP prefetchers) and two L2 cache
prefetchers (adjacent line and stream prefetchers) [Intel Corporation, 2006]. Each
can be selectively turned on/off, providing partial control over an application's bandwidth utilization.
Hardware execution throttling does not require significant modifications to
operating systems, and incurs little overhead in configuration (hundreds or thou-
sands of cycles). These properties make it a good choice for multicore resource
management.
3 Toward Practical Page Coloring
The shared last level cache is a critical resource on a multicore chip; contention for it or unfair allocation of it can result in performance anomalies. A process whose performance would normally be high, because the cache is large enough to fit its working set, can be severely impacted by a simultaneously executing process with high cache demand that evicts the first process's cache lines.
Without specific hardware support to control cache sharing, the operating
system’s only recourse in a physically addressed cache is to control the virtual
to physical mappings used by individual processes. Traditional page coloring at-
tempts to ensure that contiguous pages in virtual memory are allocated to physical
pages that will be spread across the cache [Kessler and Hill, 1992; Romer et al.,
1994; Bugnion et al., 1996; Sherwood et al., 1999]. In order to accomplish this,
contiguous pages of physical memory are allocated different colors, with the max-
imum number of colors being a function of the size and associativity of the cache
relative to the page size. Free page lists are organized to differentiate these colors,
and contiguous virtual pages are guaranteed to be assigned distinct colors.
Recently, several studies have recognized the potential of utilizing page coloring
to manage the shared cache space on multicore platforms [Tam et al., 2007a; Lin
et al., 2008; Soares et al., 2008]. However, several challenges remain to make page
coloring practical for resource partitioning purposes.
3.1 Issues of Page Coloring in Practice
The first issue is the high overhead of online recoloring in a dynamic, multiprogrammed execution environment. An adaptive system may require online adjustments of the cache partitioning policy (e.g., a context switch at one of the cores brings in a new program with a different allocation and different requirements from the program that was switched out). Such an adjustment requires a change of color for some application pages. Without special hardware support, recoloring a page implies memory copying, which takes several microseconds on commodity platforms. Frequent recoloring of a large number of application pages may incur excessive overhead that more than negates the benefit of page coloring.
The second issue is that of constraining the allocated memory space. Imposing page color restrictions on an application implies that only a portion of memory can be allocated to that application. When the system runs out of pages of a certain color, the application is under memory pressure even though there may still be abundant memory of other colors. The application can either evict some of its own pages to secondary storage or steal pages from other page colors. The former can result in dramatic slowdown due to page swapping, while the latter may negatively affect other applications' performance due to cache conflicts.
We propose a hot-page coloring approach [Zhang et al., 2009b] in which cache mapping colors are enforced on only a small set of frequently accessed (or hot) pages for each process. Hot-page coloring can realize much of the benefit of all-page coloring, but with a reduced memory space allocation constraint and much less online recoloring overhead in an adaptive and dynamic environment.
3.2 Page Hotness Identification
Our hot-page coloring approach builds atop effective identification of frequently
accessed pages for each application. Its overhead must be kept low for online
continuous identification during dynamic application execution.
3.2.1 Sequential Page Table Scan
The operating system (OS) has two main mechanisms for monitoring access to
individual pages. First, on most hardware-implemented TLB platforms (e.g., Intel
processors), each page table entry has an access bit, which is automatically set by
hardware when the page is accessed [Intel Corporation, 2008b]. By periodically
checking and clearing this access bit, one can estimate each page’s access frequency
(or hotness). The second mechanism is via page read/write protection so that
accesses to one page will be caught by page faults. One drawback for the page
protection approach is the high page fault overhead. On the other hand, it has the
advantage (in comparison to the access bit checking) that overhead is only incurred
when pages are indeed accessed. Given this tradeoff, Zhou et al. [Zhou et al.,
2004] proposed a combined method to track page accesses for an application—
link together frequently accessed pages and periodically check their access bits;
invalidate those infrequently accessed pages and catch accesses to them by page
faults.
However, traversing the list of frequently accessed pages involves pointer chasing,
which exhibits poor locality on modern processor architectures. In
contrast, a sequential scan of the application’s page table can be much faster on
platforms with high peak memory bandwidth and hardware prefetching. For a set
of 12 SPECCPU2000 applications, our experiments on a dual-core Intel Xeon 5160
3.0 GHz “Woodcrest” processor show that the sequential table scan takes tens of
cycles (36 cycles on average) per page entry while the list traversal takes hundreds
of cycles (258 cycles on average) per entry. Given the trend that memory latency
improvement lags memory bandwidth improvement [Patterson, 2004], sequential
table scan is favored over random pointer chasing in our design.
We consider several issues in the design and implementation of the sequential
page table scan-based hot page identification. An accurate page hotness mea-
sure requires cumulative statistics on continuous page access checking. Given the
necessity of checking the page table entries and the high efficiency of sequential
table scan, we maintain the page access statistics (typically in the form of an
access count) using a small number of unused bits within the page table entry.
Specifically, we utilize 8 unused page table entry bits in our implementation on
a 64-bit Intel platform (as illustrated in Figure 3.1). Some, albeit fewer, unused
bits are also available in the smaller page table entry on 32-bit platforms. Fewer
bits may incur more frequent counter overflow but do not fundamentally affect
our design efficiency. In the worst case when no spare bit is available, we could
maintain a separate “hotness counter table” that shadows the layout of the page
table. In that case, two parallel sequential table scans are required instead of one,
which would incur slightly more overhead.
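As a rough sketch of this bookkeeping, the fragment below maintains an 8-bit hotness counter inside a 64-bit PTE. The specific bit positions (counter in bits 55-48 of the reserved 62-48 range, access bit at bit 5, as on x86) are illustrative assumptions, not the exact layout of the implementation described here:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative PTE layout: bit 5 is the hardware-maintained access bit (as
 * on x86); we assume the 8-bit hotness counter lives in bits 55-48, inside
 * the reserved 62-48 range of a 64-bit PTE. */
#define PTE_ACCESSED (1ULL << 5)
#define HOT_SHIFT    48
#define HOT_MASK     (0xFFULL << HOT_SHIFT)

static inline unsigned hotness_get(uint64_t pte)
{
    return (unsigned)((pte & HOT_MASK) >> HOT_SHIFT);
}

static inline uint64_t hotness_set(uint64_t pte, unsigned h)
{
    return (pte & ~HOT_MASK) | ((uint64_t)(h & 0xFF) << HOT_SHIFT);
}

/* One sampling step for a single PTE: if the hardware set the access bit
 * during the last window, bump the counter (saturating at 255) and clear
 * the bit so the next window starts fresh. */
static inline uint64_t hotness_sample(uint64_t pte)
{
    if (pte & PTE_ACCESSED) {
        unsigned h = hotness_get(pte);
        if (h < 255)
            pte = hotness_set(pte, h + 1);
        pte &= ~PTE_ACCESSED;
    }
    return pte;
}
```

Because the counter rides along in the PTE itself, the sequential table scan touches no extra memory beyond the entries it must read anyway.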
In our hardware-implemented TLB platform, the OS is not allowed to directly
read TLB contents. With hypothetical hardware modification to allow this, we
could then sample TLB entries to gather hotness information. Walking through
TLBs (e.g., 256 entries on our experimental platform) is much lighter-weight than
walking through the page table (usually 1 to 3 orders of magnitude larger than
the TLB).
The hotness counter for a page is incremented at each scan that the page
is found to be accessed. To deal with potential counter overflows, we apply a
fractional decay (e.g., halving or quartering the access counters) for all pages
when counter overflows are possibly imminent (e.g., every 128/192 scans for
halving/quartering). Applied continuously, the fractional decay also serves the
purpose of gradually screening out stale statistics, as in the widely used
exponentially-weighted moving average (EWMA) filters.

Figure 3.1: Unused bits of the page table entry (PTE) for a 4K page on 64-bit
and 32-bit x86 platforms. Bits 11-9 are hardware-defined unused bits on both
platforms [Intel Corporation, 2006; AMD Corporation, 2008]. Bits 62-48 on the
64-bit platform are reserved but not currently used by the hardware. Our current
implementation utilizes 8 bits in this range for maintaining the page hotness
counter.
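One property worth noting: with the halving schedule above, an 8-bit counter cannot overflow even for a page accessed in every sampling window, because any counter value below 256 maps to at most 127 + 128 = 255 after a decay-plus-accumulate epoch. A small simulation using the 128-scan halving period from the text illustrates this:

```c
/* Demonstrates that halving every DECAY_PERIOD scans bounds an 8-bit
 * hotness counter even under worst-case (every-window) access. */
#define DECAY_PERIOD 128

unsigned max_counter_over(int epochs)
{
    unsigned c = 0, peak = 0;
    for (int e = 0; e < epochs; e++) {
        for (int s = 0; s < DECAY_PERIOD; s++)
            c++;              /* page accessed in every sampling window */
        if (c > peak)
            peak = c;
        c /= 2;               /* fractional decay: halving */
    }
    return peak;
}
```

The peak converges to exactly 255, so the counter stays within its 8 spare PTE bits.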
We decouple the frequency at which the hotness sampling is performed from
the time window during which the access bits are sampled (by clearing the access
bits at the beginning and reading them at the end of the access time window).
We call the former the sampling frequency and the latter the sampled access time window. In
practice, one may want to enforce an infrequent page table scan for low overhead
while at the same time collecting access information over a much smaller time
window to avoid hotness information loss. The latter allows distinguishing the
relative hotness across different pages accessed in the recent past. Consider a
concrete example in which the sampling frequency is once per 100 milliseconds
and the sampled access time window is 2 milliseconds. In the first sampling, we
clear all page access bits at time 0-millisecond and then check the bits at time
2-millisecond. In the next sampling, the clearing and checking occur at time
100-millisecond and 102-millisecond respectively.
Benchmark    # of physically      # of excess page
             allocated pages      table entries
gzip         46181                1141
wupwise      45008                1617
swim         48737                1617
mgrid        14185                1582
applu        45981                4135
mesa         2117                 1255
art          903                  1028
mcf          21952                1334
equake       12413                1057
parser       10183                699
bzip         47471                954
twolf        1393                 88

Table 3.1: Memory footprint sizes and numbers of excess page table entries for 12
SPECCPU2000 benchmarks. The excess page table entries are those that do not
correspond to physically allocated pages.
A page table scan is expensive since there is no a priori knowledge of whether
each page has been accessed, let alone allocated. There may be invalid page table
entries that are not yet mapped and mapped virtual pages that are not yet physi-
cally substantiated (some heap management systems may only commit a physical
page when it is first accessed). As shown in Table 3.1, however, such excess page
table entries are usually few in practice (particularly for applications with larger
memory footprints). We believe the excess checking of non-substantiated page
table entries does not constitute a serious overhead.
3.2.2 Acceleration for Non-Accessed Pages
A conventional page table scan checks every entry regardless of whether the corre-
sponding page was accessed in the last time window. Given that a page list traver-
sal approach [Zhou et al., 2004] only requires continuous checking of frequently
accessed pages, the checking of non-accessed page table entries may significantly
offset the sequential scan’s performance advantage on per-entry checking cost.
We propose an accelerated page table scan that skips the checking of many
non-accessed pages. Our acceleration is based on the widely observed data access
spatial locality—i.e., if a page was not accessed in a short time window, then
pages spatially close to it were probably not accessed either. Intuitively, the
non-access correlation of two nearby pages degrades when the spatial distance
between them increases. To quantitatively understand this trend, we calculate
such non-access correlation as a function of the spatial page distance. Figure 3.2
illustrates that in most cases (except mcf), the correlation is quite high (around
0.9) for a spatial distance as far as 64 pages. Beyond that, the correlation starts
dropping, sometimes precipitously. Further investigation of mcf shows that such
correlation changes significantly as time goes by probably due to major phase
changes (suggested by sizable changes in its memory-related performance counter
statistics).
Driven by such page non-access correlation, we propose to quickly bypass cold
regions of the page table through an approach we call locality jumping. Specifically,
when encountering a non-accessed page table entry during the sequential scan,
we jump page table entries while assuming that the intermediate pages were not
accessed (thus requiring no increment of their hotness counters). To minimize false
jumps, we gradually increase the jump distance in an exponential fashion until we
reach a maximum distance (empirically determined to be 64 in our case) or touch
an accessed page table entry. In the former case, we will continue jumping at the
maximum distance without further increasing it. In the latter case, we jump back
to the last seen non-accessed entry and restart the sequential scan. Figure 3.3
provides a simple illustration of our approach.

Figure 3.2: Illustration of page non-access correlation as a function of the spatial
page distance. Results are for 12 SPECCPU2000 benchmarks with 2-millisecond
sampled access time windows. For each distance value D, the non-access correla-
tion is defined as the probability that the next D pages are not accessed in a time
window if the current page is not accessed. We take snapshots of each benchmark's
page table every 5 seconds and present average non-access correlation results here.
Locality jumping that follows a deterministic pattern (e.g., doubling the dis-
tance after each jump) runs the risk of synchronizing with a worst-case appli-
cation access pattern to incur abnormally high false jump rates. To avoid such
unwanted correlation with application access patterns, we randomly adjust the
jump distance by a small amount at each step. Note, for instance, that the fourth
jump in Figure 3.3 has a distance of 6 (as opposed to 8 in a perfectly exponential
pattern).

Figure 3.3: Illustration of sequential page table scan with locality jumping.
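A minimal sketch of the scan with locality jumping follows, operating on an in-memory stand-in for the page table. The random jitter discussed above is omitted for determinism, and the jump-back step is modeled as a sequential rescan of the skipped interval; both are simplifying assumptions:

```c
#include <stddef.h>

/* accessed[i] stands in for the PTE access bit; hot[i] for the hotness
 * counter. Returns the number of entries actually examined. On hitting a
 * non-accessed entry, the jump distance grows exponentially up to
 * MAX_JUMP; landing on an accessed entry triggers a sequential rescan of
 * the skipped interval. */
#define MAX_JUMP 64

size_t scan_with_jumping(const unsigned char *accessed,
                         unsigned *hot, size_t n)
{
    size_t visited = 0, i = 0;
    while (i < n) {
        visited++;
        if (accessed[i]) {
            hot[i]++;                  /* accessed: count and move on */
            i++;
            continue;
        }
        /* Cold entry: jump, growing the distance exponentially. */
        size_t last_cold = i, dist = 1;
        while (i + dist < n) {
            i += dist;
            visited++;
            if (accessed[i]) {
                /* Jump back: rescan the skipped interval sequentially. */
                size_t p = i;
                for (size_t j = last_cold + 1; j < p; j++) {
                    visited++;
                    if (accessed[j])
                        hot[j]++;
                }
                hot[p]++;
                break;
            }
            last_cold = i;
            if (dist < MAX_JUMP)
                dist *= 2;
        }
        i++;                           /* resume after the landing point */
    }
    return visited;
}
```

Note that a hot region entirely contained within one jump can still be missed (a false jump); regions longer than MAX_JUMP pages are always caught.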
It is important to note that by breaking the sequential scan pattern, we may
increase the per-entry checking cost (particularly by degrading the effectiveness
of hardware prefetching). Quantitatively, we observe that the per-entry overhead
increases from 36 cycles to 56 cycles on average. Such an increase of per-entry cost
is substantially outweighed by the significant reduction of page entry checking.
Finally, it is worth pointing out that spatial locality also applies to accessed
pages. However, jumping over accessed page table entries is not useful in our case
for at least two reasons. First, in the short time window for fine-grained hotness
checking (e.g., 2 milliseconds), the number of non-accessed pages far exceeds that
of accessed pages. Second, a jump over an accessed page table entry would leave
no chance to increment its hotness counter.
3.3 Hot Page Coloring
In this section, we utilize hotness-based partial page coloring to alleviate the online
recoloring overhead in an adaptive and dynamic environment.
3.3.1 MRC-Driven Partition Policy
For a given set of co-running applications, the goal of our cache partition pol-
icy is to improve overall system performance (defined as the geometric mean of
co-running applications’ performance relative to running independently). The re-
alization of this goal depends on an estimation of the optimization target at each
candidate partitioning point. Given the dominance of data access time on mod-
ern processors, we estimate that the application execution time is proportional
to the combined memory and cache access latency, i.e., roughly hit + r · miss,
where hit/miss is the cache hit/miss ratio and r indicates the ratio between cache
and memory access latency. For a given application, the cache miss ratio under
a specific cache allocation size can be estimated from a cache miss ratio curve
(or MRC). Note that while the cache MRC generation requires profiling, the cost
per application is independent of the number of processes running in the system.
An on-the-fly mechanism to learn the cache MRC is possible [Tam et al., 2009].
Figure 3.4 illustrates a simple example of our cache partitioning policy.
Figure 3.4: An example of our cache partitioning policy between swim and mcf.
The cache miss ratio curve for each application is constructed (offline or during
an online learning phase) by measuring the miss ratio at a wide range of possible
cache partition sizes. Given the estimation of application performance at each
cache partitioning point, we determine that the best partition for the two
applications allocates 1 MB of cache to swim and 3 MB to mcf.
3.3.2 Hotness-Driven Page Recoloring
In a multi-programmed system where context switches occur fairly often, an adap-
tive cache partitioning policy may need to recolor pages to reflect the dynamically
changing co-running applications. Frequent page recoloring may incur substantial
page copying overhead, in some cases more than negating the benefit of adaptive
cache partitioning. Our approach is to recolor a subset of hot (or frequently ac-
cessed) pages, which may realize much of the benefit of all-page coloring at much
reduced cost. Specifically, we specify an overhead budget as the maximum number
of recolored pages (or page copying operations) allowed at each recoloring. Given
this budget, we attempt to recolor the hottest (most frequently accessed) pages
to reach the maximal recoloring effect.
Given a budget K, we want to find the hottest K pages for recoloring. This
can be achieved by locating the hotness threshold value of the K-th hottest page.
One fast, constant-space approach is to maintain a hotness page count array to
record the number of pages at each possible hotness value. We can scan from the
highest hotness value downward until we have accumulated K pages, at
which point we find the hotness threshold. In our implementation, we maintain
the hotness page count array in each task (process or thread)’s control structure
in the operating system. To better control its space usage, we group multiple,
similar hotness values into one bin so that we only need to record the number of
pages at each possible bin. With 8 bins and a 4-byte page counter at each bin,
we incur a space cost of 32 bytes per task.
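The threshold search over the bin array might look like the following sketch (the bin granularity and counter types are illustrative):

```c
/* bins[b] counts the pages whose (binned) hotness value is b, with the
 * 8 bins kept in the task's control structure as described in the text. */
#define NBINS 8

/* Returns the smallest bin value t such that the pages in bins t..NBINS-1
 * number at least k; recoloring all pages of hotness >= t then covers the
 * hottest k pages. Returns 0 if the task has fewer than k pages total. */
int hotness_threshold(const unsigned bins[NBINS], unsigned k)
{
    unsigned acc = 0;
    for (int b = NBINS - 1; b >= 0; b--) {
        acc += bins[b];          /* scan downward from the hottest bin */
        if (acc >= k)
            return b;
    }
    return 0;
}
```

The search is constant time (one pass over 8 bins) regardless of how many pages the task owns.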
Given the set of hot pages to be recolored, we try to uniformly assign these
pages to the new colors. This uniform recoloring helps to achieve low intra-
application cache conflicts. Pseudocode for our recoloring approach is shown in
Figure 3.5.
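The balanced assignment itself can be as simple as dealing the selected hot pages round-robin across the new colors; the sketch below uses illustrative stand-ins (plain arrays) for the kernel's page and color structures:

```c
/* Deal the selected hot pages round-robin across the thread's new colors
 * so that hot pages spread evenly and intra-application cache conflicts
 * stay low. page_color[i] receives the new color of the i-th hot page. */
void recolor_round_robin(int *page_color, int npages,
                         const int *new_colors, int ncolors)
{
    for (int i = 0; i < npages; i++)
        page_color[i] = new_colors[i % ncolors];  /* even distribution */
}
```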
3.4 Relief of Memory Allocation Constraints
Page coloring introduces new constraints on the memory space allocation. When
a system has plenty of free memory but is short of pages in certain colors, an
otherwise avoidable memory pressure may arise. As a concrete example, two ap-
plications on a dual-core platform would like to equally partition the cache by
page coloring (to follow the simple fairness goal of equal resource usage). Conse-
quently each can only use up to half of the total memory space. However, one of
the applications is an aggressive memory user and would benefit from more than
its memory share. At the same time, the other application needs much less mem-
ory than its entitled half. The system faces two imperfect choices—to enforce
the equal cache use (and thus force expensive disk swapping for the aggressive
memory user); or to allow an efficient memory sharing (and consequently let the
aggressive memory user pollute the other’s cache share).

procedure Recolor(budget, old-colors, new-colors)
  // budget: recoloring budget (maximum number of page copies)
  // old-colors: thread's color set under the old partition
  // new-colors: thread's color set under the new partition
  if new-colors is a subset of old-colors then
    subtract-colors = old-colors − new-colors
    Find the hot pages in subtract-colors within the budget limit, and then
    recolor them round-robin over new-colors.
  end if
  if old-colors is a subset of new-colors then
    addition-colors = new-colors − old-colors
    Find the hot pages in old-colors within the
    (|new-colors| / |addition-colors|) × budget limit, and then move at most
    budget of them (i.e., an |addition-colors| / |new-colors| proportion)
    to addition-colors.
  end if

Figure 3.5: Procedure for hotness-based page recoloring. A key goal is that hot
pages are distributed to all assigned colors in a balanced way.
In the latter case of memory sharing, a naive approach that colors some ran-
dom pages from the aggressive application to the victim’s cache partition may
result in unnecessary contention. Since a page’s cache occupancy is directly re-
lated to its access frequency, preferentially coloring cold pages to the victim’s
cache partition would mitigate the effect of cache pollution. Our page hotness
identification can be naturally employed to support such an approach. Note that
the resulting reduction of cache pollution can benefit adaptive as well as static
cache partitioning policies (like the example given above).
3.5 Evaluation Results
We implemented the proposed page hotness identification approach and used
it to drive hot page coloring (including adaptive recoloring in dynamic, multi-
programmed environments) in the Linux 2.6.18 kernel. We have also implemented
lazy page copying (proposed earlier by Lin et al. [Lin et al., 2008]), which delays
the copying to the time of first access, to further reduce the coloring overhead.
Specifically, each to-be-recolored page is marked invalid in its page table entry, and the
actual page copying is performed within the page fault handler triggered by the
next access to the page.
We performed experiments on the dual-core Intel Xeon 5160 3.0GHz “Wood-
crest” platform. The two cores share a single 4 MB L2 cache (16-way set-
associative, 64-byte cache line, 14 cycles latency, writeback). Our evaluation
benchmarks are a set of 12 programs from SPECCPU2000.
Figure 3.6: Overhead comparisons under different page hotness identification
methods.
Overhead of Page Hotness Identification We compare the page hotness
identification overheads of three methods—page linked list traversal [Zhou et al.,
2004] and our proposed sequential table scan with and without locality-jumping.
In our approach, the page table is traversed twice per scan: once to clear the
access bits at the beginning of the sampled access time window and once to check
them at the end of the window. We set the access time window to 2 milliseconds
in our experiments.
The list traversal approach [Zhou et al., 2004] maintains a linked list of fre-
quently accessed pages while the remaining pages are invalidated and monitored
through page faults. The size of the frequently accessed page linked list is an
important parameter that requires careful attention. If the size is too large, list
traversal overhead dominates; if the size is too small, page fault overhead can be
prohibitively high. Raghuraman [Raghuraman, 2003; Zhou et al., 2004] suggests
that a good list size is 30 K pages. Our evaluation revealed that even a value
of 30 K was insufficient to keep the page fault rate low in some instances. We
therefore measured performance using both the 30 K list size and no limit for the
linked list size (meaning all accessed pages are included into the list), and present
the better of the two as a comparison point.
The overhead results at two different sampling frequencies (once per 10 mil-
liseconds and once per 100 milliseconds) are shown in Figure 3.6. When the
memory footprint is small, the linked list of pages can be cached and the over-
head is close to that of a sequential table scan. As the memory footprint becomes
larger, the advantage of spatial locality with a sequential table scan becomes more
apparent. On average, sequential table scan with locality jumping involves mod-
est (7.1%, 1.9%) overhead at 10 and 100 millisecond sampling frequencies. It
improves over list traversal by 71.7% and 47.2%, and over sequential table scan
without locality jumping by 58.1% and 19.6%, at 10 and 100 milliseconds sam-
pling frequencies. To understand the direct effect of locality jumping, Figure 3.7
shows the percentage of page table entries skipped during the scan. On average
we save checking on 63.3% of all page table entries.
Accuracy of Page Hotness Identification We measure the accuracy of our
page hotness identification methods. We are also interested in knowing whether
the locality jumping technique (which saves overhead) would lead to less accurate
identification. The ideal measurement goal is to tell how close our identified page
hotness is to the "true page hotness". Acquiring the true page hotness, however,
is challenging. We approximate it by scanning the page table entries at high
frequency without any locality jumping. Specifically, we employ a high sampling
frequency of once per 2 milliseconds in this approximation and we call its identified
page hotness the baseline.

Figure 3.7: Proportion of skipped page table entries (PTEs) due to our locality-
jumping approach in page hotness identification.
For a given hotness identification approach, we measure its accuracy by cal-
culating the difference between its identified page hotness and the baseline. To
mitigate the potential weakness of using a single difference metric, we use two
difference metrics in our evaluation. The first is the Jeffrey divergence, which is
a numerically robust variant of the Kullback-Leibler divergence. More precisely, the
Jeffrey-divergence of two probability distributions p and q is defined as:
JD(p, q) = \sum_i \left( p(i)\, \log_2 \frac{2\, p(i)}{p(i) + q(i)} + q(i)\, \log_2 \frac{2\, q(i)}{p(i) + q(i)} \right).
Figure 3.8: Jeffrey divergence on identified page hotness between various ap-
proaches and the baseline (an approximation of "true page hotness").

JD(p, q) measures the divergence, in terms of relative entropy, from p and q to
(p + q)/2, and it lies in the range [0, 2]. In order to calculate the Jeffrey-divergence, page
hotness is normalized such that hotness sums up to 1. Here p(i) and q(i) represent
page i’s measured hotness from the two methods being compared.
The second difference metric we utilize is the rank error rate. Specifically, we
rank pages in hotness order (pages of the same hotness are ranked equally at the
highest rank available) and sum up the absolute rank difference between the two
methods being compared. The rank error rate is the average rank difference per
page divided by the total number of pages.
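A direct sketch of this metric follows (O(n²) for clarity), with ties sharing the highest available rank as described; a page's rank here is the number of pages strictly hotter than it:

```c
/* Rank error rate between two hotness vectors h1 and h2 over n pages:
 * average absolute rank difference per page, divided by the total number
 * of pages. Equally hot pages share the highest rank available. */
double rank_error_rate(const unsigned *h1, const unsigned *h2, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        int r1 = 0, r2 = 0;
        for (int j = 0; j < n; j++) {
            if (h1[j] > h1[i]) r1++;   /* pages strictly hotter under h1 */
            if (h2[j] > h2[i]) r2++;   /* pages strictly hotter under h2 */
        }
        total += (r1 > r2) ? (r1 - r2) : (r2 - r1);
    }
    return (total / n) / n;
}
```

Under this definition the naive method (all pages equally hot) gives every page rank 0, i.e., the highest rank, matching the convention used below.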
We measure the page hotness identification of our sequential table scan ap-
proach and its enhancement with locality-jumping. These approaches employ a
sampling frequency of once per 100 milliseconds. As a point of comparison, we
also measure the accuracy of a naive page hotness identification approach which
considers all pages to be equally hot. Note that under our rank order definition,
all pages under the naive method have the highest rank.
Figure 3.9: Rank error rate on identified page hotness between various approaches
and the baseline (an approximation of "true page hotness").

Figure 3.10 visually presents the deviation between our identified page hotness
and the baseline for all 12 applications. The results suggest that our hotness
identification is fairly accurate overall.
Relieving Memory Allocation Constraints As explained in Section 3.4,
page coloring introduces new memory allocation constraints that may cause oth-
erwise avoidable memory pressure or cache pollution. We examine the effective-
ness of hot-page coloring in reducing the negative effect of such coloring-induced
memory allocation constraints. In this experiment, two applications on a dual
core platform would like to equally partition the cache by page coloring (to follow
the simple fairness goal of equal resource usage). Consequently each can only use
up to half of the total memory space. However, one of the applications uses more
memory than its entitled half. Without resorting to expensive disk swapping, this
application would have to use some memory pages beyond its allocated colors and
therefore pollute the other application’s cache partition.
Figure 3.10: All-page comparison of page hotness identification results for sequential
table scan with the locality-jumping approach (at once-per-100-millisecond sampling
frequency) and the baseline page hotness. Pages are sorted by their baseline hotness.
The hotness is normalized so that the hotness of all pages in an application sums
up to 1.

[1] The relatively small system memory size is chosen to match the small memory usage in our
SPECCPU benchmarks. We expect that the results of our experiment should also reflect the
behaviors of larger-memory-footprint applications in larger systems.

Specifically, we consider a system with 256 MB of memory.[1] We pick swim as
the polluting application with a 190 MB memory footprint. When only half of the
total 256 MB memory is available, swim has to steal about 62 MB from the victim
application’s page colors. Figure 3.10 shows that in swim, 20% of the pages
are substantially hotter than the other 80%, which provides a good
opportunity for our hot-page coloring. We choose six victim applications with
small memory footprints that, without the coloring-induced allocation constraint,
would fit well into the system memory together with swim. They are mesa, mgrid,
equake, parser, art, and twolf.
Figure 3.11: Normalized execution time of different victim applications under
different cache pollution schemes. The polluting application is swim.
We evaluate three policies: random, in which the polluting application ran-
domly picks the pages to move to the victim application’s entitled colors; hot-page
coloring, which uses the page hotness information to pollute the victim applica-
tion’s colors with the coldest (least frequently used) pages; and no pollution, a
hypothetical comparison base that is only possible with expensive disk swapping.
Figure 3.11 shows the victim applications’ slowdowns under different cache pollu-
tion policies. Compared to random pollution, the hotness-aware policy reduces the
slowdown for applications with high cache space sensitivity. Specifically, for the
two most sensitive victims (art and twolf), the random cache pollution yields 55%
and 124% execution time increases (from no pollution) while the hotness-aware
pollution causes 36% and 86% execution time increases.
Alleviating Page Recoloring Cost In a multi-programmed system where
context switches occur fairly often, an adaptive cache partitioning policy may
need to recolor pages to reflect the dynamically changing co-running applications.

Figure 3.12: Contention relations of two groups of SPECCPU2000 benchmarks.
If A points to B, B suffers more than a 50% performance degradation when
running together with A on a shared cache, compared to running alone when B
can monopolize the whole cache.
Each of our multi-programmed experiments runs four applications on a dual-
core processor. Specifically, we employ two such four-application groups with
significant intra-group cache contentions. These two groups are {swim, mgrid,
bzip, mcf} and {art, mcf, equake, twolf}, and their contention relations are shown
in Figure 3.12. Within each group, we assign two applications to each sibling core
on a dual-core processor and run all possible combinations. In total, there are 6
tests:
test1 = {swim, mgrid} vs. {mcf, bzip};
test2 = {swim, mcf} vs. {mgrid, bzip};
test3 = {swim, bzip} vs. {mgrid, mcf};
test4 = {art, mcf} vs. {equake, twolf};
test5 = {art, equake} vs. {mcf, twolf};
test6 = {art, twolf} vs. {mcf, equake}.
We compare system performance under several static cache management poli-
cies.
• In default sharing, applications freely compete for the shared cache space.
• In equal partition, the two cores statically partition the cache evenly and
applications can only use their cores’ entitled cache space. Under such
equal partition, there is no need for recoloring when co-running applications
change in a dynamic execution environment.
We then consider several adaptive page coloring schemes. As described in Sec-
tion 3.3.2, adaptive schemes utilize the miss-ratio-curve (MRC) to determine a
desired cache partition between co-running applications. Whenever an applica-
tion’s co-runner changes, the application re-calculates an optimal partition point
and recolors pages.
• In all-page coloring, we recolor all pages necessary to achieve the new desired
cache partition after a change of co-running applications. This is the obvious
alternative without the guidance of our hot-page identification.
• The ideal page coloring is a hypothetical approach that models the all-page
coloring but without incurring any recoloring overhead. Specifically, con-
sider the test of {A,B} vs. {C,D}. We run each possible pairing (A-C,
A-D, B-C, and B-D) on two dedicated cores (without context switches) and
assume that the resulting average performance for each application would
match its performance in the multi-programmed setting.
• In hot page coloring, we utilize our page hotness identification to only recolor
hot pages within a target recoloring budget that limits its overhead. The
recoloring budget is defined as an estimated relative slowdown of the appli-
cation (specifically as the cost of each recoloring divided by the time interval
between adjacent recoloring events, which is estimated as the CPU schedul-
ing quantum length). Our experiments consider two recoloring-caused ap-
plication slowdown budgets—5% (conservative) and 20% (aggressive). In
[Figure: six panels, one per test, plotting average system normalized performance (y-axis) against the scheduling time quantum in milliseconds (x-axis: 100, 200, 500, 800), with one curve each for Default, Equal, Hot (5% budget), Hot (20% budget), All-page, and Ideal.]
Figure 3.13: Performance comparisons under different cache management policies
for 6 multi-programmed tests (four applications each) on a dual-core platform.
our implementation, a given recoloring budget is translated into a cap on
the number of recolored pages according to the page copying cost. Copying
one page takes roughly 3 microseconds on our experimental platform.
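The translation from a slowdown budget to a page cap can be sketched as follows. `recolor_page_cap` is a hypothetical helper, not the kernel implementation; the 3 microsecond copy cost is the figure quoted above:

```python
def recolor_page_cap(budget, quantum_ms, page_copy_us=3.0):
    """Translate a relative slowdown budget (fraction of the scheduling
    quantum spent copying) into a cap on pages recolored per quantum."""
    budget_us = budget * quantum_ms * 1000.0   # copying time we may spend
    return int(budget_us / page_copy_us)

# A 5% budget at a 100 ms quantum permits about 1666 page copies.
print(recolor_page_cap(0.05, 100))   # -> 1666
print(recolor_page_cap(0.20, 100))   # -> 6666
```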
The recoloring overhead in the adaptive schemes depends on the change fre-
quency of co-running applications, and therefore it is directly affected by the
CPU scheduling quantum. We evaluate this effect by experimenting with a range
of scheduling quantum lengths (100–800 milliseconds). Figure 3.13 presents the
system performance of the 6 tests under different cache management policies. Our
performance metric is defined as the geometric mean of individual applications’
relative performance compared to when running alone and utilizing the whole
cache. All performance numbers are normalized to that of the equal partition
policy.
Our first observation is that the simple policy of equal cache partition achieves
quite good performance generally speaking. It does so by reducing inter-core cache
conflicts without incurring any adjustment costs in multi-programmed environ-
ments. On average, it has a 3.5% performance improvement over default sharing
and its performance is about 7.7% away from that of ideal page coloring.
All-page coloring achieves quite poor performance overall. Compared to equal
partitioning, it degrades performance by 20.1%, 11.7%, and 1.7% at 100, 200, and
500 milliseconds scheduling time quanta respectively. It only manages to achieve
a slight improvement of 1.6% at the long 800 milliseconds scheduling quantum.
The poor performance of all-page coloring is due to the large recoloring overhead
at context switches. To provide an intuition of such cost, we did a simple back-of-
the-envelope calculation as follows. The average working set of the 7 benchmarks
used in these experiments is 82.1 MB. If only 10% of the working set is recolored
at every time quantum (default 100 milliseconds), the page copying cost alone
would incur 6.3% application slowdown, negating most of the benefit brought by
the ideal page coloring.
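This back-of-the-envelope figure can be reproduced directly (assuming 4 KB pages):

```python
# Average working set of the 7 benchmarks: 82.1 MB; recolor 10% of it
# every scheduling quantum (100 ms) at ~3 microseconds per page copy.
working_set_pages = 82.1 * 1024 * 1024 / 4096    # 4 KB pages (assumed size)
copy_time_us = 0.10 * working_set_pages * 3.0    # cost of recoloring 10%
slowdown = copy_time_us / (100 * 1000.0)         # fraction of a 100 ms quantum
print(f"{slowdown:.1%}")                         # -> 6.3%
```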
The hot page coloring greatly improves performance over all-page coloring. It
can also improve the performance over equal partitioning at 500 and 800 millisec-
onds scheduling time quanta. Specifically, the conservative hot page coloring (at
5% budget) achieves 0.3% and 4.3% performance improvement while aggressive
hot page coloring (at 20% budget) achieves 2.9% and 4.0% performance improve-
ment. However, it is somewhat disappointing that the page copying overhead
still outweighs the adaptive page coloring’s benefit when context switches happen
fairly often (every 100 or 200 milliseconds). Specifically, the conservative hot page
coloring yields 3.8% and 0.5% performance degradation compared to equal par-
tition while the aggressive hot page coloring yields 7.1% and 2.3% performance
degradation.
We notice that in test4 of Figure 3.13, the ideal scheme does not always provide
the best performance. One possible explanation for this unintuitive result is that
[Figure: six panels, one per test, plotting average unfairness (y-axis) against the scheduling time quantum in milliseconds (x-axis: 100, 200, 500, 800) for the same six policies as in Figure 3.13.]
Figure 3.14: Unfairness comparisons (the lower the better) under different cache
management policies for 6 multi-programmed tests (four applications each) on a
dual-core platform.
our page recoloring algorithm (described in Section 3.3.2) also considers intra-
thread cache conflicts by distributing pages to all assigned colors in a balanced
way. Such intra-thread cache conflicts are not considered in our ideal scheme.
An Evaluation of Fairness We also study how these cache management poli-
cies affect the system fairness. We use an unfairness metric, defined as the coef-
ficient of variation (standard deviation divided by the mean) of all applications’
normalized performance. Here, each application’s performance is normalized to its
execution time when it monopolizes the whole cache resource. If normalized per-
formance varies across different applications, unfairness tends to be large;
if every application has a uniform speedup/slowdown, then unfairness tends to
be small.
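The unfairness metric is straightforward to compute; a small sketch with hypothetical normalized-performance values:

```python
from statistics import mean, pstdev

def unfairness(normalized_perfs):
    """Unfairness = coefficient of variation (population standard deviation
    divided by the mean) of the applications' normalized performance."""
    return pstdev(normalized_perfs) / mean(normalized_perfs)

# A uniform slowdown is perfectly fair; a skewed one is not.
print(unfairness([0.8, 0.8, 0.8, 0.8]))            # -> 0.0
print(round(unfairness([1.0, 0.9, 0.5, 0.6]), 3))  # -> 0.275
```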
We evaluate the execution unfairness of the 6 tests we examined in Section 3.5.
Figure 3.14 shows the results under different cache management policies. Results
show that equal partition performs poorly, simply because it allocates cache space
without knowledge of how individual applications’ performance will be affected.
The unfairness of default sharing is not as high as one may expect, because this
set of benchmarks exhibits contention in both directions for most pairs, resulting
in relatively uniform performance degradation across applications. Ideal page coloring
is generally better (lower unfairness metric value) than default sharing and equal
partition. Hot and all-page coloring perform similarly to what they did in the
performance results: they gradually approach the fairness of ideal page coloring
as the page copying cost becomes amortized by longer scheduling time quanta. It
also suggests that expensive page coloring may be worthwhile in cases where the
quality of service (as specified in service level agreements) is the first priority and
customized resource allocation is needed. Note that our cache partition policy
does not directly take fairness into consideration. It should be possible to derive
other metrics that directly optimize for fairness.
3.6 Related Work and Summary
Hardware-based cache partitioning schemes mainly focus on modifying cache
replacement policies and can be categorized by partition granularity: way-
partitioning [Chiou et al., 2000; Qureshi and Patt, 2006] and block-
partitioning [Suh et al., 2001b; Zhao et al., 2007; Rafique et al., 2006]. Way-
partitioning (also called column partitioning in [Chiou et al., 2000]) restricts cache
block replacement for a process to within a certain way, resulting in a maximum
of n slices or partitions with an n-way associative cache. Block-partitioning allows
partitioning blocks within a set, but is more expensive to implement. It usually
requires hardware support to track cache line ownership. When a cache miss
occurs, a cache line belonging to an over-allocated owner is preferentially evicted.
Cho and Jin [Cho and Jin, 2006] first proposed the use of page coloring to
manage data placement in a tiled CMP architecture. Their goal was to reduce
a single application’s cache miss and access latency. Tam et al. [Tam et al.,
2007a] first implemented page coloring in the Linux kernel for cache partitioning
purposes, but restrict their implementation and analysis to static partitioning
of the cache among two competing threads. Lin et al. [Lin et al., 2008] further
extended the above to dynamic page coloring. They admitted that recoloring a
page is clearly an expensive operation and should be attempted rarely in order
to make page coloring beneficial. Soares et al. [Soares et al., 2008] remap high
cache miss pages to dedicated cache sets to avoid polluting pages with low cache
misses. These previous works either only consider a single application or two co-
running competing threads, where frequent page recoloring is not incurred. Also,
they mainly target one beneficial aspect of page coloring, rather than developing a
practical and viable solution within the operating system. Our approach alleviates
two important obstacles: memory pressure and frequent recoloring when using the
page coloring technique.
Kim et al. [Kim et al., 2004] proposed 5 different metrics for L2 cache fairness.
They use cache miss or cache miss ratio as performance (or normalized perfor-
mance) and define fairness as the difference between the maximum and minimum
performance of all applications. Our fairness metric in Section 3.5 takes all appli-
cations’ performance into consideration and tends to be more numerically robust
than only considering max and min. Iyer et al. [Iyer et al., 2007] proposed 3
types of quality-of-service metrics (resource oriented, individual, or overall perfor-
mance oriented) and statically/dynamically allocated cache/memory resources to
meet these QoS goals. Hsu et al. [Hsu et al., 2006] studied various performance
metrics under communist, utilitarian, and capitalist cache polices and made the
conclusion that thread-aware cache resource allocation is required to achieve good
performance and fairness. All these studies focus on resource management in the
space domain. Another piece of work by Fedorova et al. [Fedorova et al., 2007]
proposed to compensate/penalize threads that went under/over their fair cache
share by modifying their CPU time quanta.
We present an efficient approach to tracking application page hotness on-the-
fly. Beyond supporting hot-page coloring in this work, the page hotness identification
has a range of additional uses in operating systems. We provide some
examples here. The page hotness information we acquire is an approximation of
page access frequency. Therefore our approach can support the implementation
of LFU (Least-Frequently-Used) memory page replacement. As far as we know,
existing LFU systems [Lee et al., 2001; Sokolinsky, 2004] are in the areas of storage
buffers, database caches, and web caches where each data access is a heavy-duty
operation and precise data access tracking does not bring significant additional
cost. In comparison, it is challenging to efficiently track memory page access fre-
quency for LFU replacement and our page hotness identification helps tackle this
problem. In service hosting platforms, multiple services (often running inside vir-
tual machines) may share a single physical machine. It is desirable to allocate the
shared memory resource among the services according to their needs. The page
hotness identification may help such adaptive allocation by estimating the service
memory needs at a given hotness threshold. This is a valuable addition to ex-
isting methods. For instance, it provides more fine-grained, accurate information
than sampling-based working set estimation [Waldspurger, 2002]. Additionally,
it incurs much less runtime overhead than tracking exact page accesses through
minor page faults [Lu and Shen, 2007].
Driven by the page hotness information, we propose new approaches to mit-
igate practical obstacles faced by current page coloring-based cache partitioning
on multicore platforms. The results of our work make page coloring-based cache
management a more viable option for general-purpose systems, although with
cost amortization time-frames that are still longer than typical operating system
time-slices. In parallel, computer architecture researchers are also investigating
new address translation hardware to make page coloring extremely lightweight.
We expect features provided by new hardware in the near future to allow more
efficient operating system control. In the meantime, we hope our proposed
approach can aid performance isolation in existing multicore processors on today's
market.
4 Resource-aware Scheduling on
Multi-chip Multicore
Machines
In the previous chapter, we showed that hot-page coloring can alleviate the adverse
effects of naive page coloring. However, its effectiveness is somewhat constrained
by the frequency of expensive page recoloring. Therefore, in this chapter we explore
a more flexible solution: resource-aware scheduling.
It is well known that different pairings of applications on resource-sharing mul-
tiprocessors may result in different levels of resource contention and thus differ-
ences in performance. Resource-aware scheduling tries to co-schedule applications
in a way such that performance penalty due to contention on shared resources is
minimized. We propose a simple resource-aware scheduling heuristic which groups
applications with similar cache miss ratios on the same multicore chip on multi-
chip multicore machines. Our experimental results show that it not only improves
performance but also creates more opportunities for power savings.
4.1 Resource Contention on Multi-chip Multicore Machines
Due to the scalability limitations of today’s multicore microarchitectures, multi-
chip multicore machines are commonplace. These machines can be organized
either as symmetric multiprocessors (SMP) or as non-uniform memory access
(NUMA) architectures.
On an SMP machine, there is typically a shared memory bus directly connected
to all chips. This shared memory bus has the key advantage of deterministic mem-
ory access. Its disadvantage is limited bus bandwidth and contention as more cores
are added into the system. For machines based on a NUMA architecture, each chip
has its dedicated memory controller to its local memory. Remote memory accesses
are completed by inter-chip communication through a point-to-point interconnect
such as the HyperTransport in AMD technology or the QuickPath Interconnect
in Intel technology. The key advantage of this design is that the aggregated bus
bandwidth scales with the number of chips. The cost is a loss of uniform memory
access.
In this work, we focus on SMP-based multi-chip multicore machines since
the memory bandwidth contention is more severe. We show how a
simple yet effective scheduling policy can help mitigate contention on both memory
bandwidth and cache space.
4.1.1 Mitigating Memory Bandwidth Contention
Merkel and Bellosa [Merkel and Bellosa, 2008a] profiled a set of SPECCPU benchmarks
and found that memory bus bandwidth is a critical resource on multicore chips.
Based on this observation, they advocated mixing memory-bound (indicated by
high cache misses per instruction) and CPU-bound applications (indicated by
low cache misses per instruction) on sibling cores of the same chip to mitigate
bandwidth contention. Such a mixing approach mitigates memory bandwidth
contention within a multicore chip and we refer to this method as complementary
mixing scheduling. A natural extension is to apply the same method for each
multicore chip on a multi-chip machine.
As an alternative, we propose a similarity grouping method to tackle band-
width contention on SMP-based multi-chip platforms. Specifically, we group ap-
plications with similar cache miss ratios on the same multicore chip. For example,
on a two-chip machine, one chip hosts high miss ratio applications while the other
chip only hosts low miss ratio applications. Since cache miss ratio is significantly
correlated with the memory intensity as shown in Figure 4.1, our approach avoids
saturating memory bandwidth (which would occur if both chips ran memory-hungry
applications) and under-utilizing memory bandwidth (which would occur if both
chips ran memory non-intensive applications).
Although our similarity grouping appears to contradict complementary mix-
ing, in reality, they both accomplish mitigating memory bandwidth congestion
by avoiding simultaneously running memory intensive applications on all cores.
Complementary mixing focuses on temporal scheduling of applications on a single
chip, while similarity grouping focuses on spatial partitioning of applications over
multiple chips.
There does exist a subtle difference between the two methods with respect to
reducing memory bandwidth contention. Suppose we have two memory inten-
sive and two non-intensive applications running on a two-chip dual-core machine.
Complementary mixing will place two memory intensive applications on two dif-
ferent chips and such placement will likely consume more bandwidth than that of
putting the two on the same chip (which is what similarity grouping will do). If
the co-scheduled other two non-intensive applications have no memory access at
all, then complementary mixing has the advantage of using up available memory
[Figure: for each of the 12 benchmarks, two bars compare the miss-ratio (L2 misses per kilo data references) and the miss-rate (L2 misses per kilo instructions), on a 0 to 100 scale.]
Figure 4.1: Cache miss-ratio (L2 misses per kilo data references) and cache
miss-rate (L2 misses per kilo instructions) of 12 SPECCPU2000 benchmarks. In
general, these two metrics show high correlation. We label the first six benchmarks
(mcf, swim, equake, applu, wupwise, and mgrid) as high miss-ratio applications
and the latter six (parser, bzip, gzip, mesa, twolf, and art) as low miss-ratio
applications.
bandwidth. However, this extreme assumption about memory access behavior does not hold for
most commodity applications. Furthermore, Moscibroda and Mutlu [Moscibroda
and Mutlu, 2007] pointed out that current memory access scheduling algorithms
favor streaming applications because they usually have good DRAM row buffer lo-
cality. It also implies that non-intensive applications’ memory accesses will likely
experience longer delays when they compete against memory intensive applica-
tions. In contrast, similarity grouping throttles memory intensive applications'
bandwidth consumption by putting them on the same chip and makes room for
other applications to get their share of limited memory bandwidth.
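The grouping heuristic itself reduces to sorting and splitting; a minimal sketch for a two-chip machine (the miss-ratio values are illustrative, not measured data):

```python
def similarity_grouping(miss_ratios, num_chips=2):
    """Sort applications by cache miss ratio (high to low) and place each
    contiguous group on one chip, so every chip hosts applications of
    similar memory intensity."""
    ranked = sorted(miss_ratios, key=miss_ratios.get, reverse=True)
    per_chip = len(ranked) // num_chips
    return [ranked[i * per_chip:(i + 1) * per_chip] for i in range(num_chips)]

# Illustrative miss-ratio values (L2 misses per kilo data references).
ratios = {'mcf': 80, 'applu': 45, 'art': 9, 'twolf': 5}
print(similarity_grouping(ratios))   # -> [['mcf', 'applu'], ['art', 'twolf']]
```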
[Figure: normalized cache miss ratio (y-axis, 0 to 1) versus cache size in kilobytes (x-axis: 512, 1024, 2048, 4096), with one curve per benchmark; applu, equake, and mgrid are labeled as a group.]
Figure 4.2: Normalized miss ratios of 12 SPECCPU2000 benchmarks at different
cache sizes. The normalization base for each application is its miss ratio at 512 KB
cache space. Cache size allocation is enforced using page coloring [Zhang et al.,
2009b]. Solid lines mark the six applications with the highest miss ratios while
dotted lines mark the six applications with the lowest miss ratios. Threshold of
labeling high/low miss-ratio is based on their miss-ratio values shown in Figure 4.1.
4.1.2 Efficient Cache Sharing
In practice, memory intensive applications do not necessarily benefit much from large
cache capacity. For example, streaming applications do not need much cache
space. Applications typically exhibit high miss ratios because their working sets
do not fit in the cache. Increasing available cache space is not likely to improve
performance until the cache size exceeds its total working set, which is typically
at least one order of magnitude larger than current cache size (typically 4∼16
MB). This can be observed from the normalized L2 cache miss ratio curves of
12 SPECCPU2000 benchmarks, shown in Figure 4.2. With the exception of mcf,
most high miss ratio applications (applu, equake, mgrid, swim, and wupwise) show
small or no benefit from additional cache space beyond 512 KB. In contrast,
low miss ratio applications (twolf, art, bzip, and mesa) are more sensitive to cache
space. Interestingly, we also see two low miss ratio applications (parser and gzip)
are not that sensitive to cache space. These two applications take large input files
and temporally work on a small part of the files before moving to next part. Their
temporal working set can fit in cache and they do not need large cache space even
though their total memory footprint is huge (hundreds of MB).
While they do not benefit from larger cache capacity, memory intensive ap-
plications will more aggressively occupy cache space than memory non-intensive
applications and result in adverse cache thrashing effects on memory non-intensive
applications. Recall from Figure 1.1 that when low miss ratio applications art and twolf
run together with high miss ratio application swim, it is always the low miss ratio
applications that suffer more severe performance degradation. When another high
miss ratio application equake runs together with swim, equake exhibits less perfor-
mance degradation than art and twolf do. This cache thrashing effect on memory
non-intensive applications will increase their cache misses and make them gradu-
ally become memory intensive, which in turn results in more bandwidth pressure.
Similarity grouping helps reduce these adverse effects by separating low and high
miss ratio applications on different chips. When low miss ratio
applications run together, they can hold more cache space than when they co-
run with high miss ratio applications, simply because their co-runners are less
aggressive. Therefore, in addition to mitigating memory bandwidth contention,
similarity grouping can lead to more efficient cache sharing on multicore chips.
4.2 Additional Benefits on CPU Power Savings
Besides effective hardware resource sharing, power is another important factor
in systems. Our similarity grouping achieves additional CPU power savings by
creating chip-wide voltage/frequency scaling opportunities.
4.2.1 Constraint of DVFS on Multicore Chips
Dynamic voltage/frequency scaling (DVFS) has been studied for more than ten
years, but most previous work is focused on uniprocessors. On multicore chips,
voltage/frequency scaling is subject to an important constraint. Most current
processors use off-chip voltage regulators (some use on-chip regulators but on a
per-chip rather than per-core basis), they require that all sibling cores be set to
the same voltage level. For example, a single voltage/frequency setting applies
to the entire multicore chip on Intel processors [Naveh et al., 2006]. AMD family
10h processors do support per-core frequency selection, but they still maintain
the highest voltage level required for all cores [AMD Corporation, 2009], which
limits power savings. Per-core on-chip voltage regulators add design complexity
and die real estate cost and are a subject of ongoing architecture research [Kim
et al., 2008].
Voltage/frequency scaling is most efficient for memory intensive applications
since their performance is largely dependent on the memory component rather
than CPUs. To maximize power savings from per-chip frequency scaling while
minimizing performance loss, it is essential to group applications with similar
memory intensities to sibling cores on a processor chip. A simple metric that
indicates such behavior is the application’s on-chip cache miss ratio—a higher
miss ratio indicates a larger delay due to off-chip resource (typically memory)
accesses that are not subject to frequency scaling-based speed reduction. This
property makes our similarity grouping a good scheduling policy in this problem
context.
4.2.2 Model-Driven Frequency Setting
It is desirable to trade performance in a controlled fashion (e.g., bounded per-
formance loss) for power savings. For that purpose, we need an estimation of
the target metrics at candidate CPU frequency levels. Several previous stud-
ies [Weissel and Bellosa, 2002; Isci et al., 2006] utilized offline constructed fre-
quency selection lookup tables. Such an approach requires a large amount of
offline profiling. Merkel and Bellosa employed a linear model based on memory
bus utilization [Merkel and Bellosa, 2008a] but it only supports a single frequency
adjustment level. Kotla et al. [Kotla et al., 2004] constructed a performance model
for variable CPU frequency levels. Specifically, they assume that all cache and
memory stalls are not affected by the CPU frequency scaling while other delays
are scaled in a linear fashion. However, their model was not evaluated on real
frequency scaling platforms.
In practice, on-chip cache accesses are also affected by frequency scaling, which
typically applies to the entire chip. We correct this aspect of Kotla et al.'s model.
Specifically, our variable-frequency performance model assumes that the execution
time is dominated by memory and cache access latencies, and that the execution
of all other instructions can be overlapped with these accesses. Accesses to off-chip
memory are not affected by frequency scaling while on-chip cache access latencies
are linearly scaled with the CPU frequency. Let T (f) be the average execution
time of an application when the CPU runs at frequency f . Then:
T(f) ∝ (F/f) · (1 − R_CacheMiss) · L_Hit + R_CacheMiss · L_Miss,
where F is the maximum CPU frequency, and L_Hit and L_Miss are the access latencies
of a cache hit and a cache miss respectively, measured at full speed. We assume that these
access latencies are platform-specific constants that apply to all applications. Us-
ing a micro-benchmark, we measured that the cache hit and miss latencies are
around 3 and 121 nanoseconds respectively on our experimental platform. Strictly
speaking, a cache miss also spends some cycles on-chip, but that portion is
relatively small compared to the cycles spent on memory and does not change our
model's accuracy qualitatively. For simplicity, we assume the cache miss latency does
not vary as the frequency changes. The miss ratio R_CacheMiss represents the proportion
of data accesses that go to memory. Specifically, it is measured as the ratio
between the L2 cache misses (L2_LINES_IN, with hardware prefetches included)
and data references (L1D_ALL_REF) performance counters on our processors [Intel
Corporation, 2006].
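With the measured constants above (a hit latency of about 3 ns, a miss latency of about 121 ns, and a full speed of 3 GHz), the model can be evaluated numerically; a small sketch:

```python
F = 3.0                       # full CPU frequency (GHz)
L_HIT, L_MISS = 3.0, 121.0    # measured hit/miss latencies (ns) at full speed

def exec_time(f, miss_ratio):
    """Relative execution time at frequency f: on-chip hit latency scales
    with F/f while the off-chip miss latency stays constant."""
    return (F / f) * (1 - miss_ratio) * L_HIT + miss_ratio * L_MISS

def normalized_perf(f, miss_ratio):
    """Performance at frequency f relative to running at the full speed F."""
    return exec_time(F, miss_ratio) / exec_time(f, miss_ratio)

# Scaling down to 2 GHz barely hurts a memory-intensive application but
# substantially slows a cache-friendly one.
print(round(normalized_perf(2.0, 0.30), 3))   # -> 0.973
print(round(normalized_perf(2.0, 0.01), 3))   # -> 0.738
```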
With the definition of T (·), the normalized performance (as compared to run-
ning at the full CPU speed) at a throttled frequency f is T(F)/T(f). To calculate it
online, we also need to estimate the application’s cache miss ratio when it runs at
the full CPU speed. Fortunately, R_CacheMiss does not change across different CPU
frequency settings, so we can simply use the online measured cache miss ratio.
Figure 4.3 shows the accuracy of our model when predicting the performance of 12
SPECCPU2000 benchmarks and two server benchmarks (TPC-H and SPECJbb)
at different frequencies. Results show that our model achieves a prediction error
of no more than 6% for all 14 applications.
The variable-frequency performance model allows us to set the per-chip CPU
frequencies according to specific performance objectives. For instance, we can
bound the slowdown of any application while achieving the maximal power saving
possible. The online adaptive frequency setting must react to dynamic execution
behavior changes. Specifically, we monitor our model parameter R_CacheMiss and
make changes to the CPU frequency setting when necessary.
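One plausible sketch of such a bounded-slowdown frequency policy (the 10% bound and the selection procedure are illustrative assumptions, not the dissertation's exact algorithm):

```python
F = 3.0                                  # full speed (GHz)
L_HIT, L_MISS = 3.0, 121.0               # latencies (ns) at full speed
LEVELS = [3.0, 2.67, 2.33, 2.0]          # supported per-chip frequencies (GHz)

def normalized_perf(f, miss_ratio):
    t = lambda g: (F / g) * (1 - miss_ratio) * L_HIT + miss_ratio * L_MISS
    return t(F) / t(f)

def pick_frequency(chip_miss_ratios, max_slowdown=0.10):
    """Lowest chip-wide frequency that keeps every co-scheduled
    application's predicted slowdown within the bound."""
    for f in sorted(LEVELS):             # try the lowest frequencies first
        if all(normalized_perf(f, r) >= 1.0 - max_slowdown
               for r in chip_miss_ratios):
            return f
    return F

print(pick_frequency([0.30, 0.25]))      # memory-intensive chip -> 2.0
print(pick_frequency([0.01, 0.02]))      # cache-friendly chip   -> 2.67
```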
[Figure: two panels over the 12 SPECCPU2000 benchmarks plus TPC-H and SPECjbb. Panel (A): measured performance at throttled CPU frequencies, normalized, with one bar per throttled frequency level. Panel (B): model prediction error at throttled CPU frequencies, ranging within ±6%.]
Figure 4.3: The accuracy of our variable-frequency performance model. Figure
(A) shows the measured normalized performance (relative to running at the full
CPU speed of 3 GHz). Figure (B) shows our model's prediction error (defined as
(prediction − measurement)/measurement).
4.3 Evaluation Results
Our experimental platform is a 2-chip SMP running Linux 2.6.18 kernel. Each
chip is an Intel 3 GHz multicore processor with two cores sharing a 4 MB L2 cache.
We modified the kernel to support per-chip DVFS at 3, 2.67, 2.33, and 2 GHz
on our platform. Configuring the CPU frequency on a chip requires writing to
platform-specific IA32_PERF_CTL registers, which takes around 300 cycles on our
processor. Because the off-chip voltage switching regulators operate at a relatively
low speed, it may require some additional delay (typically at tens of microsecond
timescales [Kim et al., 2008]) for a new frequency and voltage configuration to
take effect.
Test  Chip  Similarity grouping                 Complementary mixing
#1    0     {equake, swim}                      {swim, parser}
      1     {parser, bzip}                      {equake, bzip}
#2    0     {mcf, applu}                        {mcf, art}
      1     {art, twolf}                        {applu, twolf}
#3    0     {wupwise, mgrid}                    {wupwise, mesa}
      1     {mesa, gzip}                        {mgrid, gzip}
#4    0     {mcf, swim, equake,                 {swim, equake, applu,
             applu, wupwise, mgrid}              wupwise, gzip, twolf}
      1     {parser, bzip, gzip,                {mcf, mgrid, parser,
             mesa, twolf, art}                   bzip, mesa, art}
#5    0     2 SPECJbb threads                   1 SPECJbb thread and 1 TPC-H thread
      1     2 TPC-H threads                     1 SPECJbb thread and 1 TPC-H thread
Table 4.1: Benchmark suites and scheduling partitions of 5 tests. Complementary
mixing mingles high-low miss-ratio applications such that two chips are equally
pressured in memory bandwidth. Similarity grouping separates high and low miss-
ratio applications on different chips (Chip-0 hosts high miss-ratio ones in these
partitions).
Our experiments employ 12 SPECCPU2000 benchmarks (applu, art, bzip,
equake, gzip, mcf, mesa, mgrid, parser, swim, twolf, wupwise) and two server-
style applications (TPC-H and SPECJbb2005). We design five multi-program
test scenarios using our suite of applications. Each test includes both memory
intensive and non-intensive benchmarks. Benchmarks and scheduling partitions
are detailed in Table 4.1.
[Figure: normalized performance (0 to 1.2) for Test-1 through Test-5 and their average, with one bar each for Default, Similarity grouping, and Complementary mixing.]
Figure 4.4: Performance (higher is better) of the different scheduling policies at
full CPU speed.
Scheduling Comparison First, we compare the overall performance of the de-
fault Linux (version 2.6.18) scheduler, complementary mixing (within each chip),
and similarity grouping (across chips) scheduling policies.
Figure 4.4 compares the performance of the different scheduling policies when
both chips are running at full CPU speed. For each test, the geometric mean of
the applications’ performance normalized to the default scheduler is reported. On
average, similarity grouping is about 4% and 8% better than default and comple-
mentary mixing respectively. Test-2 shows particularly encouraging performance
improvement with similarity grouping, about 12.8% and 19% respectively over the
default system and complementary mixing. In this test, we also observe average
cache miss reductions (over the four applications) of 25% and 30% relative to
default and complementary mixing, respectively. This result demonstrates that
similarity grouping can help reduce cache space interference and memory band-
width contention to achieve better performance. We also measure the power con-
sumption of these policies using a WattsUpPro meter [Watts Up] which measures
whole system power at a 1 Hz frequency. Our test platform consumes 224 watts
when idle and 322 watts when running our highest power-consuming workload.
We notice that similarity grouping consumes slightly more power, up to 3 watts
as compared to the default Linux scheduler. However, the small power increase is
offset by its superior performance, leading to improved power efficiency.
[Figure: two panels over Test-1 through Test-5 and their average. Panel (A): performance comparison of Default, Similarity grouping, and Complementary mixing, each with chip-0 at 2 GHz. Panel (B): performance loss due to frequency scaling.]
Figure 4.5: Performance comparisons of different scheduling policies when Chip-
0 is scaled to 2 GHz. In subfigure (A), the performance normalization base is
the default scheduling without frequency scaling in all cases. In subfigure (B),
the performance loss is calculated relative to the same scheduling policy without
frequency scaling in each case.
Next, we examine how performance degrades when the frequency of one of the
two chips is scaled down. Default scheduling does not employ CPU binding and
applications have equal chances of running on any chip, so deploying frequency
scaling on either Chip-0 or Chip-1 has the same results. We only scale Chip-0 for
similarity grouping scheduling since it hosts the high miss-ratio applications. For
complementary mixing, scaling Chip-0 shows slightly better results than scaling
Chip-1. Hence, we report results for all three scheduling policies with Chip-0
scaled to 2 GHz. Figure 4.5 shows that similarity grouping still achieves the
best overall performance (shown by subfigure (A)) and the lowest self-relative
performance loss under frequency scaling (shown by subfigure (B)).
Figure 4.6: Performance and power consumption for per-chip frequency scaling
under the similarity grouping schedule. Figure (B) only shows the range of active
power (from idle power at around 224 watts), which is mostly consumed by the
CPU and memory in our platform.
Nonuniform Frequency Scaling We then evaluate the performance and
power consumption of per-chip nonuniform frequency scaling under similarity
grouping. We keep Chip-1 at 3 GHz and only vary the frequency on Chip-0 where
high miss-ratio applications are hosted. Figure 4.6(B) shows significant power
saving due to frequency scaling—specifically, 8.4, 15.8, and 23.6 watts power sav-
ings on average for throttling Chip-0 to 2.67, 2.33, and 2 GHz respectively. At the
same time, Figure 4.6(A) shows that the performance when throttling Chip-0 is
still quite comparable to that with the default scheduler.
Figure 4.7: Power efficiency for per-chip frequency scaling under the similarity
grouping schedule. Figure (A) uses whole system power while (B) uses active
power in the efficiency calculation.
We next evaluate the power efficiency of our system. We use performance
per watt as our metric of power efficiency. Figure 4.7(A) shows that, on aver-
age, per-chip nonuniform frequency scaling achieves a modest (4–6%) increase in
power efficiency over default scheduling. The idle power on our platform is sub-
stantial (224 watts). Considering a hypothetical energy-proportional computing
platform [Barroso and Hölzle, 2007] on which the idle power is negligible, we use
the active power (full operating power minus idle power) to estimate the power
efficiency improvement. In this case, as shown in Figure 4.7(B), scaling Chip-0
to 2.67, 2.33, and 2 GHz achieves 13%, 21%, and 32% better active power
efficiency respectively.
Figure 4.8: Performance and power consumption for baseline and fair per-chip
frequency scaling under the similarity grouping scheduling.
Application Fairness While it shows encouraging overall performance, the
baseline per-chip nonuniform frequency scaling does not provide any performance
guarantee for individual applications. For example, setting Chip-0 to 2 GHz causes
a 26% performance loss for mgrid as compared to the same schedule without
frequency scaling.
To be fair to all applications, we want to achieve power savings with bounded
individual performance loss. Based on the frequency-performance model mentioned
in Section 4.2.2, our system dynamically configures the frequency setting
to control the performance degradation of running applications within a certain
threshold (e.g., 10% in this experiment). Note that in this case the system may
scale down any processor chip as long as the performance degradation threshold
is not exceeded.
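As an illustrative sketch (our own simplified code, not the dissertation's implementation; the latency constants and function names are hypothetical), such a threshold-bounded controller could pick the lowest frequency whose model-predicted slowdown stays within the bound:

```python
# Sketch: threshold-bounded frequency selection using the Section 4.2.2 model.
# Assumptions: cache-hit latency scales with frequency, off-chip miss latency
# does not, and the miss ratio is unchanged across frequencies.

def predicted_slowdown(f, f_max, miss_ratio, l_hit=1.0, l_miss=10.0):
    """Predicted T(f)/T(F) >= 1; l_hit and l_miss are illustrative constants."""
    base = (1 - miss_ratio) * l_hit + miss_ratio * l_miss
    scaled = (f_max / f) * (1 - miss_ratio) * l_hit + miss_ratio * l_miss
    return scaled / base

def choose_frequency(frequencies, f_max, miss_ratios, threshold=0.10):
    """Lowest frequency such that no application on the chip is predicted
    to lose more than `threshold` of its performance."""
    for f in sorted(frequencies):                      # try slowest first
        worst = max(predicted_slowdown(f, f_max, r) for r in miss_ratios)
        if worst - 1.0 <= threshold:                   # slowdown within bound
            return f
    return f_max                                       # fall back to full speed

print(choose_frequency([2.0, 2.33, 2.67, 3.0], 3.0, [0.4, 0.6]))  # → 2.0
```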
Figure 4.8(A) shows the normalized performance of the most degraded appli-
cation in each test. We observe that fairness-controlled frequency scaling is closer
(than the baseline scaling) to the 90% performance threshold line. It completely
satisfies the threshold for three tests while it exhibits slight violations in test-3 and
test-4. The most degraded application in these cases is mgrid, whose performance
is 6% and 3% away from the 90% threshold in test-3 and test-4 respectively.
Figure 4.3 in Section 4.2.2 shows that our model over-estimates mgrid's performance
by up to 6%; this inaccuracy causes the fairness violations in test-3 and test-4.
Figure 4.8(B) shows power savings for both baseline and fairness-controlled
frequency scaling. Fairness-controlled frequency scaling provides better quality-
of-service while achieving comparable power savings to the baseline scheme.
Figure 4.9: On-chip temperature changes in Celsius degree for the per-chip fre-
quency scaling under the similarity grouping scheduling. In each case, we present
a relative number beyond(+) or below(-) the temperature measured under the
default scheduling.
Thermal Reduction A by-product of power savings is the reduction of CPU
heat dissipation. We can observe this by reading the CPU temperature from the
on-chip digital thermal meter on our Intel processor [Intel Corporation, 2006].
The output of this digital meter has a resolution of 1◦ Celsius and is reported
relative to a hardware-specific temperature threshold (typically ranging from
85◦ to 105◦ Celsius). A recently published data sheet [Intel Corporation, 2009b]
suggests that this threshold on our platform is 105◦ Celsius, which translates to
an average CPU working temperature of 59◦ Celsius in the original system.
Figure 4.9 shows that per-chip nonuniform frequency scaling can reduce the
average CPU temperature (averaged over four cores) by up to 5◦ Celsius. This
could yield additional power savings by reducing cooling needs. We observe
that throttled cores usually have lower temperatures than unthrottled ones.
One may be concerned that unbalanced heat dissipation can adversely affect the
hardware reliability and lifetime. A possible solution to alleviate such concern
is to periodically migrate or swap workloads across different chips [Merkel and
Bellosa, 2008b].
4.4 Discussion and Summary
A number of previous studies have explored adaptive CPU scheduling to im-
prove system performance. We refer readers to Section 2.2 for a more complete
elaboration. On SMP platforms, Antonopoulos et al. [Antonopoulos
et al., 2003] were the first to demonstrate performance benefits of bandwidth-aware
scheduling on a real SMP machine. We believe memory bandwidth is a critical
issue on future machines and take a further step by targeting multicore-based
symmetric multiprocessors. We have seen a slow but steady trend of increasing core
numbers on a single chip, which will exacerbate the contention for memory band-
width. Fortunately, memory technology advancement offers significant help to
tackle this problem. Measured using the STREAM benchmark [McCalpin, 1995],
our testbed with Intel Woodcrest 3 GHz CPUs (two dual-core chips) and 2GB
DDR2 533 MHz memory achieves 2.6 GB/sec memory bandwidth. In comparison,
a newer Intel Nehalem machine with 2.27 GHz CPUs (one quad-core chip) and
6GB DDR3 1,066 MHz memory achieves an 8.6 GB/sec memory bandwidth.
The idle power constitutes a substantial part (about 70%) of the full system
power consumption on our testbed, which questions the practical benefits of opti-
mizations on active power consumption. However, we are optimistic about future
hardware designs toward more energy-proportional platforms [Barroso and Hölzle,
2007]. We have already observed this trend—the idle power constitutes a smaller
part (about 60%) of the full power on the newer Nehalem machine. In addition,
our measurement shows that per-chip nonuniform frequency scaling can reduce
the average CPU temperature (by up to 5 degrees Celsius, averaged over four
cores), which may lead to additional power savings on cooling.
To summarize, we advocate a simple scheduling policy that groups applica-
tions with similar cache miss ratios on the same multicore chip. On one hand,
such scheduling improves the performance due to reduced cache interference and
memory contention. On the other hand, it facilitates per-chip frequency scaling
to save CPU power and reduce heat dissipation.
Guided by a variable-frequency performance model, our CPU frequency scal-
ing can save about 20 watts of CPU power and reduce up to 5◦ Celsius of CPU
temperature on average on our multicore platform. These benefits were realized
without exceeding the performance degradation bound for almost all applications.
This result demonstrates the strong benefits possible from per-chip adaptive fre-
quency scaling on multi-chip, multicore platforms.
5 Hardware Execution Throttling
Modern processors provide hardware mechanisms, such as duty-cycle modulation,
voltage/frequency scaling, and cache prefetcher adjustment, to control the execu-
tion speed or resource access latency for an application. Although these mecha-
nisms were originally designed for other purposes, we argue that they can be an
effective tool to support resource management of shared resources on multicores.
We refer to these hardware features as hardware execution throttling mecha-
nisms. Compared to other software-based resource management mechanisms, we
find that hardware execution throttling is very flexible and lightweight in providing
resource usage control. We further propose a flexible framework to automatically
find an optimal (or close to optimal) hardware execution throttling configuration
for a user-specified management objective.
5.1 Comparisons of Existing Multicore Management Mechanisms
We first compare the effectiveness and overhead of three existing multicore re-
source management mechanisms: CPU scheduling quantum adjustment, page col-
oring, and hardware execution throttling.
Figure 5.1: SPECJbb’s performance when its co-runner swim is regulated us-
ing two different approaches: scheduling quantum adjustment (default 100-
millisecond quantum) and hardware throttling. Each point in the plot represents
performance measured over a 50-millisecond window.
5.1.1 Effectiveness
CPU scheduling quantum adjustment is a mechanism proposed by Fedorova et
al. [Fedorova et al., 2007] to maintain fair resource usage on multicores. They
advocate adjusting the CPU scheduling time quantum to increase or decrease an
application’s relative CPU share. By compensating/penalizing applications un-
der/over fair cache usage, the system tries to maintain equal cache miss rates
across all applications (which is considered fair). This mechanism does not work
for real-time applications, where there is no concept of a scheduling quantum. It
works best when CPUs are over-committed, in other words, when the number of
threads is larger than the number of CPUs. When CPUs are under-committed,
modifying threads' time quantum does not change their resource usage since each CPU hosts no more
Figure 5.2: We co-schedule swim and SPECWeb on an Intel Woodcrest chip
where two sibling cores share a 4MB L2 cache. Here we compare the effectiveness
of different mechanisms in reducing unfairness.
than one thread. De-scheduling is a possible extension in this case. However,
the key disadvantage of this kind of mechanism is that it manages resources
at a coarse grain comparable to the scheduling quantum size, which may cause
fluctuating performance when scrutinized at finer granularity. As a demonstra-
tion, we run SPECJbb and swim on a dual-core chip. Consider a hypothetical
resource management scenario where we need to slow down swim by a factor of
two. We compare two approaches—the first adds an equal-priority idle process
on swim’s core; the second throttles the duty cycle at swim’s core to half the
full speed. Figure 5.1 illustrates SPECJbb’s performance over time under these
two approaches. For scheduling quantum adjustment, SPECJbb’s performance
fluctuates dramatically because it highly depends on whether its co-runner is the
idle process or swim. In comparison, hardware throttling leads to more stable
performance behaviors due to its fine-grained execution speed regulation.
Page coloring only controls cache space allocation and has no direct control
over memory bandwidth. Also its effectiveness is curbed by the memory alloca-
tion constraint. For example, Figure 5.2 shows the unfairness comparison of those
mechanisms when applied to the co-scheduling of swim and SPECWeb. Although
all three mechanisms are able to reduce the unfairness factor as compared to
default hardware sharing, page coloring shows comparatively higher unfairness than
the other two alternatives. Under page coloring, if swim were entitled to a very
small portion of the cache space, its mapped memory pages might be fewer than
its required memory footprint, resulting in thrashing (page swapping to disk).
If swim's cache usage is not curtailed, SPECWeb's performance is significantly
affected. These two competing constraints prevent page coloring from achieving
sufficient fairness in this case. In contrast, scheduling quantum adjustment
and hardware execution throttling are more flexible in controlling individual appli-
cations’ resource usage. In addition, hardware execution throttling can effectively
throttle both cache space and bandwidth usage by controlling a thread’s running
speed.
5.1.2 Overhead
CPU scheduling quantum adjustment needs to modify the existing kernel
scheduling module to reflect the compensation or penalty, which requires a modest
amount of effort.
The overhead of page coloring mainly comes from expensive recoloring. Without
extra hardware support, recoloring a page means copying a memory page, which
usually takes several microseconds on typical commodity platforms (3 microseconds
on our test platform). In addition to the runtime overhead, its implementation
requires significant modification to existing kernel memory
management (our implementation involves more than 700 lines of Linux source
code changes in more than 10 files).
Hardware execution throttling uses existing hardware features and incurs very
little overhead. On a 3.0 GHz machine, configuring the duty cycle takes 265+350
(read plus write register) cycles; configuring the prefetchers takes 298+2065 (read
plus write register) cycles; configuring voltage/frequency scaling takes 240 + 310
(read plus write register) cycles. The control registers also specify other fea-
tures in addition to our control targets, so we need to read their values before
writing. The longer time for a new prefetching configuration to take effect is
possibly due to clearing obsolete prefetch requests in queues. Voltage/frequency
changes usually happen after some additional delay (typically at tens of microsec-
ond timescales [Kim et al., 2008]) because the off-chip voltage switching regulators
operate at a relatively low speed. Enabling these hardware features requires very
little kernel modification. Our changes to the Linux kernel source are ∼50 lines
of code in a single file.
We have focused on comparing the various existing resource control
mechanisms. Even built on a good mechanism, it may still be challenging to identify
the best control policy during online execution, and an exhaustive search of all
possible control policies may be very expensive. In such cases, our hardware execution
throttling approaches are far more appealing than page coloring due to their
substantially cheaper re-configuration costs. Nevertheless, more efficient techniques
to identify the best control policy are desirable.
5.2 Hardware Throttling Based Multicore Management
In the previous section, we have shown the advantages of hardware execution
throttling mechanisms. In the following paragraphs, we show how one
can build a management infrastructure based on hardware execution throttling.
We first describe the actual throttling mechanisms used in our framework and
the management objectives we focus on. We then present a simple heuristic-based
solution and point out its limitations. A more advanced approach is described in
Section 5.3.
5.2.1 Throttling Mechanisms in Consideration
We mainly consider duty cycle modulation and dynamic voltage/frequency scaling
(DVFS) as throttling mechanisms due to their relatively predictable effects on
application execution.
On our experimental platform, the operating system can specify a fraction
(e.g., multiplier of 1/8) of total CPU cycles during which the CPU is on duty.
The processor is effectively halted during non-duty cycles for a duration of ∼3
microseconds [Intel Corporation, 2009a]. Different duty cycle ratios are achieved
by keeping the time for which the processor is halted at a constant duration of
∼3 microseconds (for all ratios other than 1 when the processor is operating at
full speed) and adjusting the time period for which the processor is enabled.
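To make the interface concrete, the following sketch encodes a duty-cycle setting as a value for Intel's IA32_CLOCK_MODULATION MSR (0x19A). This is our own illustrative code based on our reading of Intel's documentation, not the dissertation's kernel patch; the bit layout is hardware-specific and should be verified against the processor manual.

```python
# Sketch: encode a duty-cycle level for the IA32_CLOCK_MODULATION MSR.
# Per Intel documentation (as we understand it): bit 4 enables on-demand
# clock modulation; the duty-cycle value occupies bits 3:1 (eighths of
# full speed). The actual wrmsr would be issued by kernel code.

IA32_CLOCK_MODULATION = 0x19A
ENABLE_BIT = 1 << 4          # on-demand clock modulation enable

def duty_cycle_msr_value(eighths):
    """Encode a duty cycle of `eighths`/8 (1..7) into an MSR value;
    8/8 (full speed) simply clears the enable bit."""
    if eighths == 8:
        return 0                          # modulation disabled, full speed
    if not 1 <= eighths <= 7:
        raise ValueError("duty cycle must be 1..8 eighths")
    return ENABLE_BIT | (eighths << 1)    # duty value lives in bits 3:1

# e.g. half speed (4/8 duty):
print(hex(duty_cycle_msr_value(4)))
```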
DVFS has a relatively smaller range of throttling effectiveness with a maximal
scaling factor 2/3 on our platforms (e.g., scale from 3 GHz to 2 GHz). DVFS is
often deployed to achieve good active power efficiency (performance divided by
active power), since it not only throttles hardware activities but also the voltage
level at which hardware is running. In contrast, duty cycle modulation does not
reduce the voltage level and is less efficient in terms of power usage. Support for
DVFS on Intel processors [Naveh et al., 2006; Intel Corporation, 2008a] applies to
the entire chip rather than to individual cores, however.
These two mechanisms also differ in their throttling effectiveness on memory
intensive applications. The effect of DVFS is that throttled cores slow their rate of
computation at a fine per-cycle granularity, although outstanding memory accesses
continue to be processed at regular speed. On applications with high demand for
memory bandwidth, the resulting effect may be that of matching processor speed
to memory bandwidth rather than that of throttling memory bandwidth utiliza-
tion. In contrast, the microsecond granularity of duty cycle modulation ensures
that memory bandwidth utilization is reduced, since any back-log of requests is
drained quickly, so that no memory and cache requests are made for most of the
3 microsecond non-duty cycle duration. Thus, duty cycle modulation has a more
direct influence on memory bandwidth consumption [Herdrich et al., 2009].
5.2.2 Resource Management Policies
Our goal is to find an n-core hardware throttling configuration that best satisfies
a service-level agreement. Our targeted service-level agreement is performance
oriented rather than resource-allocation oriented [Hsu et al., 2006; Waldspurger
and Weihl, 1994]. This is more challenging because predicting an application’s
performance in the presence of competition for shared resources is difficult.
We consider two kinds of constraints in service-level agreements:
• The fairness-centric constraint enforces roughly equal performance progress
among multiple applications. We are aware that there are several possible
definitions of fair use of shared resources [Hsu et al., 2006]. The particular
choice of fairness measure should not affect the main purpose of our evalu-
ation. In our evaluation, we use communist fairness, or equal performance
degradation compared to a standalone run for the application. Based on
this fairness goal, we define an unfairness factor metric as the coefficient
of variation (standard deviation divided by the mean) of all applications’
performance normalized to that of their individual standalone run.
• The QoS-centric constraint provides a guarantee of a certain level of perfor-
mance to a high priority application. In this case, we call this application
the QoS application and we call the core that the QoS application runs on
the QoS core.
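The unfairness factor defined above reduces to a direct computation, the coefficient of variation of each application's performance normalized to its standalone run:

```python
# Unfairness factor: coefficient of variation (population stddev / mean)
# of per-application performance normalized to the standalone run.
import statistics

def unfairness_factor(norm_perfs):
    """norm_perfs: per-application performance relative to a standalone run."""
    return statistics.pstdev(norm_perfs) / statistics.mean(norm_perfs)

# perfectly equal degradation -> unfairness 0
print(unfairness_factor([0.8, 0.8, 0.8, 0.8]))  # → 0.0
```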
Given a service-level agreement with one such constraint, the best configuration
should maximize performance (or power efficiency in case of DVFS) while meet-
ing the constraint. In the rare case that no possible configuration can meet the
agreement constraint, we deem the closest one as the best. For example, for con-
figurations C1 and C2, if both C1 and C2 meet the agreement constraint, but C1 has
better performance than C2 does, then we say C1 is a better configuration than
C2. Also, if neither of the two configurations meet the constraint but C1 is closer
to the target constraint than C2, then we say C1 is better than C2.
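This comparison rule can be written as a small predicate (a sketch with our own function names; `meets`, `perf`, and `distance` stand in for the measured constraint check, performance metric, and distance to the constraint target):

```python
# Sketch of the configuration-comparison rule: a configuration that meets
# the constraint beats one that does not; if both meet it, higher
# performance wins; if neither does, closer to the target wins.

def better(c1, c2, meets, perf, distance):
    m1, m2 = meets(c1), meets(c2)
    if m1 != m2:
        return m1                        # only the constraint-satisfying one wins
    if m1:                               # both satisfy: prefer higher performance
        return perf(c1) > perf(c2)
    return distance(c1) < distance(c2)   # neither: prefer closer to the target
```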
5.2.3 A Simple Heuristic-Based Greedy Solution
Previous research has shown that heuristic-based approaches can be effective mul-
ticore resource management policies, particularly in the case of partitioning the
shared cache space [Suh et al., 2004]. Therefore we first consider a heuristic-driven
greedy approach for our problem of determining a desirable execution throttling
configuration. Our algorithm begins by sampling the configuration with every
CPU running at full speed. At each step of the greedy process, we lower one
core’s duty cycle level by one. To guide such a greedy move, we use a perfor-
mance metric that correlates with our constraint optimization to estimate the
effect of lowering each core’s duty cycle level.
This simple solution is driven by hardware performance counters on modern
processors. After investigating the relationships between counter metrics such
as last level cache references, last level cache misses, CPU cycles, and retired
instructions, we found a cycles-per-instruction (CPI) ratio metric to be a useful
guide to the greedy exploration. Specifically, the metric is the ratio between the
CPI when the application runs alone (without resource contention) and the CPI
when it runs along with other applications. Intuitively, this ratio provides direct
information on how an application’s performance is affected by contention.
For resource management with a fairness-centric constraint, we start from
the configuration that every CPU runs at full speed and then slow down one
core’s running speed by one level (could be either duty cycle or DVFS) at each
greedy step. The chosen core at each step is the one with the highest CPI
ratio. The rationale is that this core contributes the most unfairness in the
system and therefore slowing it down would most likely lead to the largest fairness
improvement. The greedy moves stop when the fairness constraint is met. We
also stop when the last configuration results in worse fairness than the previous
one. In this case, we return to the previous configuration, which has the best
fairness among those examined. Note that at each step of the greedy approach,
we need to measure the performance and fairness of the current configuration.
For resource management with a QoS-centric constraint, we again start from
the configuration with every CPU running at the full speed. At each greedy step,
we decrease one level of the core with the highest CPI ratio among the non-QoS
cores. This heuristic is based on the assumption that the higher the CPI ratio is,
the more aggressive the application tends to be in competition for shared cache
and memory bandwidth resources. By slowing down this core, a higher-priority
core has a better chance of meeting its QoS target with fewer duty cycle downward
adjustments.
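The QoS-centric greedy step can be sketched as follows (our own simplified code; `cpi_ratio` and `qos_met` are placeholders for the performance-counter measurement and QoS check described above):

```python
# Sketch of the greedy heuristic for the QoS-centric constraint: repeatedly
# lower, by one level, the duty cycle of the non-QoS core whose application
# has the highest CPI ratio (contended CPI / standalone CPI).

def greedy_qos(levels, qos_core, cpi_ratio, qos_met, min_level=1):
    """levels: per-core duty-cycle levels (8 = full speed), adjusted greedily
    until qos_met(levels) reports the QoS target is satisfied."""
    while not qos_met(levels):
        candidates = [c for c in range(len(levels))
                      if c != qos_core and levels[c] > min_level]
        if not candidates:
            break                          # nothing left to throttle
        victim = max(candidates, key=lambda c: cpi_ratio(c, levels))
        levels[victim] -= 1                # one duty-cycle level down
    return levels
```

Each iteration corresponds to one greedy step, after which the system would be reconfigured and re-measured before the next decision.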
By nature, a greedy approach may not settle on the globally optimal solution.
Further, since our heuristic only serves as a hint of the performance effect of
throttling adjustment, it may also lead to non-optimal configurations. Another
problem is that since each greedy step makes a small configuration adjustment,
the algorithm often requires many steps to arrive at its chosen configuration and
consequently requires a large number of measurements at sample configurations.
Finally, in a dynamic online environment, when a system condition change or
application phase change calls for a new configuration, the greedy exploration-
based approach needs to redo the sample-based search. With this realization, we
propose a more advanced solution in Section 5.3.
5.3 A Flexible Model-Driven Iterative Refinement Framework
In light of the weaknesses of the heuristic-based greedy exploration approach, we
propose a more advanced approach that uses model-driven iterative refinement.
The approach maintains a set of system throttling configurations that have been
executed before and records metrics of interest at these configurations. We call
each such previously executed configuration a reference configuration and we call
the whole set the reference set. The heart of this approach is a customizable model
that estimates the per-core performance using collected metrics at the reference
configurations. The predicted per-core performance can then be used to predict
the whole system performance and unfairness metrics. In each iteration, the model
will pick a predicted “best” configuration and configure the system to execute at
this configuration, which then becomes a new reference.
If the model can reflect the trend of performance changes across different con-
figurations, after each iteration, the predicted “best” configuration is more likely
to be better than existing references. As we get more reference configurations in
an area, the model’s accuracy in that area tends to improve. Therefore it is more
likely to make better predictions and increase the chance of finding the optimal
configuration in the area.
This is an iterative refinement process. The refinement ends when a predicted
best configuration is already a previously executed reference configuration (and
therefore would not lead to a growth of the reference set or better modeling). In
some cases, such an ending condition may lead to too many configuration samples
with associated cost and little benefit. Therefore, we introduce an early ending
condition so that refinement stops when no better configurations (as defined in
Section 5.2.2) are identified after several steps.
The new approach addresses some of the weaknesses of the heuristic-based
greedy exploration approach. Unlike the small configuration adjustment at each
greedy step, the model predicts performance for all candidate configurations and
picks the “best” one. This allows a faster convergence than the greedy approach.
Further, the iterative refinement nature of this approach makes it better suited
to a dynamic online environment. New performance measurements (potentially
replacing older samples at the same configuration) can be added to the reference
set, allowing the model to gradually adapt its predictions based on the new data.
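The overall loop can be sketched schematically (function names are ours, not the dissertation's; `predict_best`, `execute`, and `better` stand in for the model, the measured run, and the comparison rule of Section 5.2.2):

```python
# Schematic of the model-driven iterative refinement loop: predict over all
# candidates using the current reference set, execute the predicted best,
# and stop when it is already a reference or when no improvement is seen
# for `patience` consecutive steps (the early ending condition).

def refine(candidates, predict_best, execute, better, patience=3):
    references = {}                        # configuration -> measured metrics
    best, stale = None, 0
    while True:
        choice = predict_best(candidates, references)
        if choice in references:           # model re-picked a known point
            break
        references[choice] = execute(choice)
        if best is None or better(choice, best, references):
            best, stale = choice, 0
        else:
            stale += 1                     # early stop: no recent improvement
            if stale >= patience:
                break
    return best
```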
5.3.1 Performance Prediction Models
To include a throttling mechanism in our framework, we need to construct a model
to predict the performance of possible throttling configurations using collected
metrics at previously executed references. Next we present such performance
models for two throttling mechanisms, as well as an approach to predict the
performance of a hybrid configuration involving both mechanisms.
Duty Cycle Modulation Suppose we have an n-core system and each core
hosts a CPU intensive application. Our model utilizes a set of reference
configurations whose performance is already known through past measurements. At the
minimum, the sample reference set contains n + 1 configurations: n single-core
running-alone configurations (i.e., ideal runs) and a configuration of all cores run-
ning at full speed (i.e., default). Note that the running-alone performance can
be measured offline. Also note that more reference sample configurations may
become available as the iterative refinement progresses.
We represent a throttling configuration as $s = (s_1, s_2, \ldots, s_n)$, where the
$s_i$ correspond to individual cores' duty cycle levels. We collect the CPI of each
running application using performance counters and calculate app$_i$'s normalized
performance, $P_i^s$. Specifically, $P_i^s$ is the ratio between the CPI when
app$_i$ runs alone (without resource contention) and its CPI when running at
configuration $s$.
Generally speaking, an application will suffer more resource contention if its
sibling cores run at higher speed. To quantify that, we define the sibling pressure
of application app$_i$ under configuration $s = (s_1, s_2, \ldots, s_n)$ as:

(5.1)   $B_i^s = \sum_{j=1, j \neq i}^{n} s_j$.
We first assume that an application's performance degrades linearly with its
sibling pressure, and the linear coefficient $k$ can be approximated as:

(5.2)   $k = \dfrac{P_i^{ideal} - P_i^{default}}{B_i^{ideal} - B_i^{default}}$,

where $P_i^{ideal}$ and $P_i^{default}$ are app$_i$'s performance under the ideal
and default configurations respectively.
For a given target configuration $t = (t_1, t_2, \ldots, t_n)$, we need to choose a
reference configuration $r = (r_1, r_2, \ldots, r_n)$ that is the closest to $t$ in
our reference set. We introduce the sibling Manhattan distance between
configurations $r$ and $t$ with respect to app$_i$ as:

(5.3)   $D_i(r, t) = \sum_{j=1, j \neq i}^{n} |r_j - t_j|$.

The closest reference $r$ is the one with the minimum such distance.
If $r$ and $t$ have the same duty cycle level for app$_i$ (i.e., $r_i = t_i$), we
simply apply the linear coefficient $k$ to them. If not, we assume the
application's performance is linear in its duty cycle level as long as its sibling
pressure does not change. So app$_i$'s performance under configuration $t$ can be
estimated as:

(5.4)   $E(P_i^t) = P_i^r \cdot \dfrac{t_i}{r_i} + k \cdot (B_i^t - B_i^r)$.
Equation (5.4) says that an application’s performance is affected by two main
factors: the duty cycle level of the application itself and sibling pressure from its
sibling cores. The first part of the equation assumes a linear relationship between
the application’s performance to its duty cycle level. The second part assumes
performance degradation caused by inter-core resource contention is linear to the
sum of duty cycle levels of sibling cores.
The first assumption largely holds in our experience with duty cycle modulation,
and the duty cycle level is usually a major factor in determining an application's
performance. The second assumption is a simplified approximation. Admittedly,
different cores may exert different amounts of resource contention on different
resource components even if they are set to the same duty cycle level. It is even
possible to use individual cores' resource usage heuristics (e.g., cache miss
rate) as weights to enhance our sibling pressure calculation. We address this
over-simplification by searching for the closest reference configuration, which
approximates similar resource contention from all sibling cores. In addition, we
argue that model imperfections can be mitigated by the iterative refinement
nature of our framework.
Voltage/Frequency Scaling We borrow a simple frequency-to-performance
model from Section 4.2.2. Specifically, it assumes that the execution time is dom-
inated by memory and cache access latencies, and accesses to off-chip memory are
not affected by frequency scaling while on-chip cache access latencies are linearly
scaled with the CPU frequency. Let F be the maximal frequency and f a scaled frequency, and let T(F) and T(f) be the execution times of an application when the CPU runs at frequencies F and f. Then the performance at f (normalized to running at the full frequency F) is:

(5.5)    T(F) / T(f) = [(1 − R_F) · L_CacheHit + R_F · L_CacheMiss] / [(F/f) · (1 − R_f) · L_CacheHit + R_f · L_CacheMiss]
L_CacheHit and L_CacheMiss are the cache hit and miss access latencies, respectively, measured at full speed, which we assume are platform-specific constants. R_f and R_F are run-time cache miss ratios measured by performance counters at frequencies f and F. Since DVFS is applied to the whole chip, it changes shared cache space competition among sibling cores on the same chip only modestly at most. We therefore assume R_F equals R_f as long as all cores' duty cycle configurations are the same for the two runs.
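Equation (5.5) can be sketched directly; the function name and parameters are ours, and the assumption R_f = R_F from the text is baked in:

```python
def normalized_perf_at_freq(f, F, miss_ratio, l_hit, l_miss):
    """T(F)/T(f) from Equation (5.5), assuming R_f == R_F.

    Off-chip miss latency is unaffected by frequency scaling, while
    cache-hit latency stretches by F/f when the clock is scaled to f.
    """
    full_speed = (1 - miss_ratio) * l_hit + miss_ratio * l_miss
    scaled = (F / f) * (1 - miss_ratio) * l_hit + miss_ratio * l_miss
    return full_speed / scaled
```

The two limiting cases behave as expected: a fully cache-resident workload (miss ratio 0) slows down in proportion to f/F, while a fully memory-bound workload (miss ratio 1) is unaffected by scaling.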
A Hybrid Model Recall that in Section 5.3.1 we need to find a reference duty
cycle configuration to estimate the normalized performance of a target configu-
ration. After adding DVFS, we have two components (duty cycle and DVFS) in
a configuration setting. Thus, when we pick a closest reference configuration, we
first find the set of samples with the closest DVFS configuration on appi, then we
pick the one with minimum sibling Manhattan distance as we did in Section 5.3.1.
When we estimate the performance of the target, if the reference has the same
DVFS settings as the target, the estimation is exactly the same as Equation (5.4).
Otherwise, we first estimate the reference's performance at the target's DVFS settings using Equation (5.5), and then use that estimated reference performance to predict the performance at the target configuration.
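The two-step reference selection for the hybrid model can be sketched as below; representing a sample as a (frequency, duty-cycle configuration) pair is our choice for illustration:

```python
def sibling_distance(r, t, i):
    """Sibling Manhattan distance between duty-cycle configs r and t (Eq. 5.3)."""
    return sum(abs(rj - tj) for j, (rj, tj) in enumerate(zip(r, t)) if j != i)

def closest_reference(samples, target, i):
    """Two-step reference selection for the hybrid model: first keep the
    sampled configurations whose chip frequency is nearest the target's,
    then pick the one with the smallest sibling Manhattan distance over
    the duty-cycle levels."""
    t_freq, t_duty = target
    best_gap = min(abs(s_freq - t_freq) for s_freq, _ in samples)
    candidates = [s for s in samples if abs(s[0] - t_freq) == best_gap]
    return min(candidates, key=lambda s: sibling_distance(s[1], t_duty, i))
```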
5.3.2 Online Deployment Issues
In an online system, we continuously monitor applications’ behavior and adapt
system-wide throttling accordingly. The online system uses cycles-per-instruction
(CPI, captured by hardware performance counters) as run-time performance guid-
ance and only requires baseline performance (running each application alone) and
SLA targets as inputs. We discuss some important design issues below.
Accelerating Duty Cycle Search Since our model can estimate the perfor-
mance at any duty cycle configuration, we can simply apply the model to all
possible configurations and choose the best. Given an n-core system with a maximum of m modulation levels, we would need to apply the model computation m^n times, once for each possible configuration.
On our test platform (a quad-core 2.27 GHz Nehalem chip), it takes about 10 microseconds to estimate a configuration. If we calculated all 8^4 = 4096 configurations (a 4-core system with 8 modulation levels) each round, that would incur considerable overhead. To reduce computation overhead, we introduce a hill climbing algorithm to prune the m^n search space. Using our Nehalem platform as an example, suppose we are currently at a configuration {x, y, z, u}; we then calculate
(or fork) 4 children configurations: {x− 1, y, z, u}, {x, y− 1, z, u}, {x, y, z − 1, u},
and {x, y, z, u−1}. The best one of the 4 configurations will be chosen as the next
fork position. Note that the sum of the modulation-level of the next fork position
{x′, y′, z′, u′} is 1 less than the sum of the current fork position {x, y, z, u}:
x′ + y′ + z′ + u′ = x + y + z + u − 1.
In our example, the first fork position is {8, 8, 8, 8} (default configuration that
every core runs at full speed). The end condition is that we either cannot fork
any more or find a configuration that meets our unfairness or QoS constraint.
The rationale for ending at the first satisfying configuration is that we assume an
ancestor with configuration {x, y, z, u} has no worse overall performance than its
descendant configuration {x′, y′, z′, u′} (here x′ ≤ x, y′ ≤ y, z′ ≤ z, u′ ≤ u).
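The search loop above can be sketched as follows; `estimate` and `satisfies` stand in for the prediction model and the unfairness/QoS check, and their exact forms are ours:

```python
def hill_climb(estimate, satisfies, n=4, m=8):
    """Hill-climbing search over duty-cycle configurations.

    Starting from full speed on every core, each step forks the children
    obtained by lowering one core's level by one, and moves to the child
    the model scores best, until the constraint is met or no child exists.
    """
    current = (m,) * n                      # default: every core at full speed
    while not satisfies(current):
        children = [current[:i] + (current[i] - 1,) + current[i + 1:]
                    for i in range(n) if current[i] > 1]
        if not children:                    # reached (1, 1, ..., 1) without success
            return current
        current = max(children, key=estimate)
    return current
```

Note that each step lowers the sum of levels by exactly one, matching the fork-position invariant described in the text.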
Under this hill climbing algorithm, the worst-case search cost for a system
with n cores and m modulation levels occurs when forking from {m,m, ...,m} to
{1, 1, ..., 1}. Since the difference between the sum of the modulation-levels of two
consecutive forking positions is 1, and the first fork position has a configuration
sum of m · n while the last one has a configuration sum of n, the total number of possible fork positions is m · n − n. Each of these fork positions probes at most n children. So, we examine at most (m − 1)n^2 configurations in the worst case, which is substantially cheaper than enumerating all m^n configurations.
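Plugging in our platform's parameters makes the gap concrete:

```python
n, m = 4, 8                          # quad-core Nehalem, 8 duty-cycle levels
exhaustive = m ** n                  # every possible configuration
worst_case = (m - 1) * n ** 2        # (mn - n) fork positions x at most n children
print(exhaustive, worst_case)        # 4096 vs. 112
```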
Robustness to Behavior Changes A robust system needs to be adaptive to
behavior changes and we consider this aspect in our design. Our online system
continuously monitors applications' behavior and overwrites old samples with new samples of the same configuration to reflect recency. By doing so, our iterative framework treats a phase change like a mistaken prediction and automatically incorporates the behavior at the current configuration into the model to correct the next round's prediction.
A long sampling interval increases the time required to determine the appro-
priate configuration, while a short sampling interval can result in instability due to
frequent changes in behavior. We use a sampling frequency of once every second,
which was empirically determined to avoid instability due to fine-grained behavior
variation.
5.4 Evaluation Results
Experimental Setup Our evaluation is conducted on two platforms: the first
is an Intel Xeon E5520 2.27 GHz “Nehalem” quad-core processor running a Linux
2.6.30 kernel. Each core has a 32 KB L1 data and instruction cache and a 256 KB
unified L2 cache. The four cores share a 16-way 8 MB L3 cache. We disable
hyper-threading on this platform. The other is a 2-chip SMP running a Linux
2.6.18 kernel. Each chip is an Intel “Woodcrest” dual-core with a 32 KB per-core
L1 data cache and a 4 MB L2 cache shared by the two sibling cores. We implemented the necessary kernel support for performance counters, duty cycle modulation, and DVFS on our platforms.
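As one concrete piece of that kernel support, per-core duty-cycle modulation is programmed through the IA32_CLOCK_MODULATION MSR. The sketch below shows only the value encoding under the commonly documented layout (bit 4 enables modulation, bits 3:1 hold the level); the exact field width varies by processor, so consult the Intel manuals before relying on it:

```python
# Encoding sketch for the IA32_CLOCK_MODULATION MSR (address 0x19A).
IA32_CLOCK_MODULATION = 0x19A

def clock_modulation_value(level, max_level=8):
    """Encode a duty-cycle level; level == max_level means full speed,
    i.e., modulation disabled (the register is simply cleared)."""
    if not 1 <= level <= max_level:
        raise ValueError("duty-cycle level out of range")
    if level == max_level:
        return 0
    return (1 << 4) | (level << 1)   # enable bit | 3-bit duty-cycle field
```

The resulting value would be written with wrmsr (for instance through the /dev/cpu/N/msr interface) by the kernel driver; this snippet shows only the encoding step.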
Each of our experiments runs four different applications, with each one pinned to a specific core. We picked 5 combinations of four applications (out of 8 SPECCPU2000 benchmarks) that showed severe resource contention when run together.
set-1 = {mesa, art, mcf, equake},
set-2 = {swim, mgrid, mcf, equake},
set-3 = {swim, art, equake, twolf},
set-4 = {swim, applu, equake, twolf},
set-5 = {swim, mgrid, art, equake}.
We also include 4 server-style applications:
{TPC-H, WebClip, SPECWeb, SPECJbb}.
TPC-H runs on the MySQL 5.1.30 database. Both WebClip and SPECWeb
use independent copies of the Apache 2.0.63 web server. WebClip hosts a set of
video clips, synthetically generated following the file size and access popularity
distribution of the 1998 World Cup workload [Arlitt and Jin, 1999]. SPECWeb
hosts a set of static web content following a Zipf distribution. SPECJbb runs on
IBM 1.6.0 Java. All applications are configured with 300∼400 MB footprints so
that they can fit into the memory we have on our test platforms.
5.4.1 Offline Evaluation
We first populate possible configurations of 5 SPECCPU2000 sets on the Nehalem
platform. Since DVFS is only applied on a per-chip basis, we only consider duty
cycle modulation in this first experiment. Our Nehalem platform supports 8 duty-cycle levels for each individual core, resulting in a total of 8^4 = 4096 possibilities. Since the configurations with lower duty cycles have very long execution times, we only populate duty-cycle levels from 8 (full speed) down to 4 (half speed) to limit our experimental time for the exhaustive search. We also avoid configurations in which all cores are throttled (i.e., we want at least one core to run at full speed). So in total we try 5^4 − 4^4 = 369 configurations for each set. Each configuration runs
for tens of minutes and the average execution times are used as stable results. In
total, it took us two weeks to populate the configuration space for 5 test sets. In
the following sections we present the offline evaluation on these populated sets.
Evaluation Methodology Our examined service level agreements (SLAs) are
the two discussed in Section 5.2.2. For fairness-centric tests, we consider unfair-
ness 0.05, 0.10, 0.15, and 0.20 as thresholds. For QoS-centric tests, we consider
normalized performance 0.50, 0.55, 0.60, and 0.65 as targets for a selected application in each set. Here, we pick mcf in set-1 and set-2, twolf in set-3 and set-4, and art in set-5 as the high-priority QoS applications, because they are the most negatively affected applications in the default co-run (i.e., no throttling at all).
There may be multiple configurations satisfying an agreement target, so we
also calculate an overall performance metric to compare their quality. For a set
of applications, their overall performance is defined as the geometric mean of
their normalized performance. We use execution time as the performance met-
ric for SPECCPU2000 applications, and throughput for server applications. For
the fairness-centric test, the overall performance includes all co-running applications. For the QoS-centric test, the overall performance only includes the non-prioritized applications (i.e., those with no QoS guarantee). Our goal is therefore to find a configuration that maximizes overall performance while satisfying the SLA targets.
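The overall performance metric is a plain geometric mean, which can be computed in log space for numerical robustness (the function name is ours):

```python
import math

def overall_performance(normalized_perfs):
    """Geometric mean of per-application normalized performance, the
    quality metric used to compare satisfying configurations."""
    log_sum = sum(math.log(p) for p in normalized_perfs)
    return math.exp(log_sum / len(normalized_perfs))
```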
We also compare the convergence speed of different methods, i.e., the number
of configurations sampled before selecting a configuration that meets the con-
straints. We assume that we have the performance samples of the applications’
standalone runs beforehand, so they are not counted in the number of samples.
[Figure 5.3 plot panels (a) Set-1 through (e) Set-5 omitted; each plots average performance prediction error (0 to 0.6) against the number of samples (2 to 10), with one line per application.]

Figure 5.3: Accuracy comparison of our model and a naive method. Performance prediction error is defined as |prediction − measurement| / measurement. The average prediction error of each application in each set is reported here. Solid lines represent prediction by our model and dashed lines represent prediction by a naive method.
[Figure 5.4 plot panels omitted. In each top-half panel, the sampling trajectory starts at (8, 8, 8, 8) and moves through intermediate configurations toward the Oracle-chosen optimum; each bottom-half panel plots average prediction error (0 to 0.1) against the N-th sample.]

Figure 5.4: Examples of our iterative model for some real tests. The X-axis shows the N-th sample. For the top half of the figure, the Y-axis is the L1 distance (or Manhattan distance) from the current sample to the optimal (best) configuration as chosen by the Oracle. A configuration is represented as a quad-tuple (u, v, w, z) with each dimension indicating the duty cycle level of the corresponding core. For the bottom half of the figure, the Y-axis is the average performance prediction error of all considered points over the applications in the set. Here the considered points are selected according to the hill climbing algorithm in Section 5.3.2. Panels: (a) Set-1 with unfairness 0.10, (b) Set-2 with QoS 0.60, (c) Set-5 with unfairness 0.05, (d) Set-2 with unfairness 0.10.
Accuracy of Iterative Model We first evaluate the accuracy of our model
in predicting the performance of arbitrary duty-cycle configurations. Here we
randomly sample x configurations and our model will use them as a reference pool
to calculate the performance of other unobserved configurations. As a comparison,
we also consider a naive method that uses the average performance of sampled
configurations to estimate other configurations.
Figure 5.3 shows that our method achieves reasonable accuracy (≤0.17 error
rate after 5 samples). It also consistently beats the naive method across all tests.
This is because the performance variation of an application is large in our cases
(even though we only profile from full to half duty cycle level) and the average
value cannot be used to make an acceptable prediction. The accuracy of our model
remains stable (and in some cases improves) as more configurations are sampled,
converging quickly to a stable value after at most 5 samples. The naive method
on the other hand is sensitive to the specific samples across which averaging is
performed at these small sample numbers.
While these experiments demonstrate that our model is reasonably accurate
when using random sampling, in reality, we do not randomly sample configura-
tions. Given a service level target, our model tends to sample a region where the
optimal configuration resides. We show four examples of real tests on the Nehalem
platform in Figure 5.4. In all cases, average accuracy of the considered points is
improved relative to random sampling, demonstrating that adding samples close
to the configuration points of interest does improve accuracy. We present con-
figurations as a quad-tuple (u, v, w, z) with each letter indicating the duty cycle
level of the corresponding core (as shown in top half of Figure 5.4). The first
sample (8, 8, 8, 8) (i.e. default configuration where every core runs at full speed)
is usually not close to the optimal configuration (measured by the L1 distance
from the best configuration as chosen by the Oracle), but our model automat-
ically adjusts subsequent samples toward the optimal region (represented by a
smaller L1 distance). The iterative procedure terminates when the predicted best
configuration is the same as the current configuration, which is the configuration
picked by the Oracle in Figures 5.4 (a) and (b). It is possible that our model will
terminate at a different configuration from that chosen by the Oracle (as in Fig-
ure 5.4 (d), where the L1 distance is not zero when the algorithm terminates) by
discovering a local minimum, although the SLA is satisfied. The model may also
continue sampling even after discovering a satisfying configuration in the hopes of
discovering a better configuration: in Figure 5.4 (c), it finds the Oracle-predicted
configuration (7, 5, 8, 7) at the 5th sample, but continues to explore (8, 6, 8, 7). If the next prediction is within the set of sampled configurations ((7, 5, 8, 7) in this case), the algorithm concludes that a better configuration will not be found and stops exploration.

[Figure 5.5 bar charts omitted: (a) unfairness comparison, (b) overall system performance comparison, (c) sample number comparison; methods: Oracle, Model Search, Non-iterative Model Search, Greedy Explore, Random.]

Figure 5.5: Comparison of methods with unfairness ≤ 0.10. In (a), the unfairness target threshold is indicated by a solid horizontal line (lower is better). In (b), performance is normalized to that of Oracle. In (c), Oracle requires zero samples.
Comparison of Different Methods We compare the results of several meth-
ods: Oracle is the optimal baseline that always automatically selects the op-
timal configuration — the configuration with the best overall performance (as
defined in Section 5.4.1) while satisfying the unfairness or QoS constraint. Model
Search is the model-driven iterative refinement approach we propose in this work.
Non-iterative Model Search is the same as Model Search but without iterative
refinement. Greedy Explore is the heuristic-based greedy exploration approach as
described in Section 5.2.3. Random Search randomly samples 15 configurations
and picks the best one.
Figure 5.5 shows the results using a 0.10 unfairness threshold. From Figure 5.5
a), we can see that only Oracle and Model Search satisfy the constraints for each
experiment (indicated by unfairness below the horizontal solid line). Figure 5.5
b) shows the corresponding overall performance normalized to the performance of Oracle. In some tests, Non-iterative Model Search, Greedy Explore, and Random Search show better performance than Oracle, but in each case they fail to meet the unfairness target. Only Model Search meets all unfairness requirements and comes very close to (within 2% of) the performance of Oracle. Figure 5.5 c) shows the number of samples before a method settles on a configuration. We see that Model Search and Greedy Explore are comparable in terms of convergence speed in this test.

[Figure 5.6 bar charts omitted: (a) QoS comparison of the high-priority app, (b) overall performance of the other three low-priority apps, (c) sample number comparison; methods: Oracle, Model Search, Non-iterative Model Search, Greedy Explore, Random.]

Figure 5.6: Comparison of methods for high-priority thread QoS ≥ 0.60. In (a), the QoS target is indicated by a horizontal line (higher is better). In (b), performance is normalized to that of Oracle. In (c), Oracle requires zero samples.
Figure 5.6 shows results of QoS tests with 0.60 performance target for a selected
high-priority application. From Figure 5.6 a), we can see that Oracle, Model
Search, and Greedy Explore all meet the QoS target (equal or higher than the 0.6
horizontal line). However, Model Search consistently achieves better performance
than Greedy Explore: Model Search stays within 7% of Oracle, while Greedy Explore can be as much as 30% below it. The tests (e.g., set-2) in which Non-iterative Model Search and Random Search show better performance than Oracle fail to meet the QoS target. In set-1, Random Search achieves lower performance while also failing the QoS test. Figure 5.6 c) shows that Model Search has a more stable convergence speed (3∼5 samples) than Greedy Explore (2∼13 samples) across different tests. The convergence speed of Greedy Explore is largely determined by how far the satisfying configuration is from the starting point, since it moves only one duty-cycle level at each step. This could be a serious limitation for systems with many cores and more configurations. Model Search converges quickly because it can estimate the whole search space at each step.

Method                 Passes targets   Avg. number   Avg. norm. perf.    Avg. norm. perf.
                       (of 40)          of samples    (18 common tests)   (any passing test)
Oracle                 39/40            0             100%                100%
Model                  39/40            4.1           99.6%               99.4%
Non-iterative Model    23/40            1             94.1%               95.0%
Greedy explore         33/40            4.2           98.1%               96.8%
Random                 25/40            15            90.9%               91.1%

Table 5.1: Summary of the comparison among methods.
In total, we have 8 tests (4 parameters for both unfairness and QoS) for 5 sets
and we summarize the 40 tests in Table 5.1. Model Search meets SLA targets in
almost all cases except in one (set-2 with QoS target ≥ 0.65) where there is no
configuration in the populated range we explore (duty cycle levels from 4 to 8)
that can meet the target (i.e., even Oracle failed on this one). We compare overall
performance in 2 ways: 1) we pick 18 common tests for which all methods meet
the SLA targets in order to provide a fair comparison; 2) we include any passing
test of any method in the performance calculation for that method. Performance
is normalized to Oracle’s. In both cases, Model Search shows the best results,
achieving 99% of Oracle’s performance.
[Figure 5.7 bar charts omitted: (a) unfairness under threshold 0.10 and (b) high-priority app performance under QoS target 0.90, comparing Default and Model for each set.]

Figure 5.7: Online test results of 5 SPECCPU2000 sets. Default is the default system running without any throttling. Only duty cycle modulation is used by Model as the throttling mechanism.
5.4.2 Online Evaluation
In this section, we implement our model as a daemon thread in the runtime system and evaluate it in a dynamic environment. In this experiment, any core's duty cycle can be set from a minimum of 1/8 up to 1 (full speed).
Evaluation using SPECCPU2000 We first evaluate our duty cycle modu-
lation model using an unfairness threshold of 0.10 and a QoS target of 0.90 for the 5 SPECCPU2000 sets on the Nehalem platform. Figure 5.7 shows the results of the
online tests. Default is the default running without any hardware throttling. It
exhibits poor fairness among applications and has no control in providing QoS
for selected applications. Model almost meets all targets except in providing QoS
target 0.90 for mcf in set-1 and set-2. The reason is that the current duty-cycle
modulation on our platform can only throttle the CPU to a minimum of 1/8 —
we do not attempt to de-schedule any application (i.e. virtually throttle CPU to 0
speed), which would be necessary to give mcf enough room in the shared resource
to maintain 90% of its ideal performance. Nevertheless, Model manages to keep
mcf’s performance fairly close to that target (within 10%).
Set   Target              Hill-Climbing   Exhaustive
#1    Unfairness ≤ 0.1    0.32            15.94
      QoS ≥ 0.9           1.06            27.48
#2    Unfairness ≤ 0.1    0.49            31.64
      QoS ≥ 0.9           1.28            66.14
#3    Unfairness ≤ 0.1    0.18            9.93
      QoS ≥ 0.9           0.88            21.92
#4    Unfairness ≤ 0.1    0.21            6.54
      QoS ≥ 0.9           1.81            35.03
#5    Unfairness ≤ 0.1    0.19            10.04
      QoS ≥ 0.9           1.33            28.82

Table 5.2: Average runtime overhead in milliseconds of calculating the best duty cycle configuration. Before each round of sampling, Exhaustive searches and compares all possible configurations while Hill-Climbing limits calculation to a small portion.
The runtime overhead of our approach mainly comes from the computational load of predicting the best configuration based on the existing reference pool (reading performance counters and setting modulation take only several microseconds). Recall that we introduce a hill climbing algorithm in Section 5.3.2, which significantly reduces the number of evaluated configurations from m^n to (m − 1)n^2 for an n-core system with a maximum of m modulation levels. As shown in Table 5.2,
the hill climbing optimization reduces computation overhead by 20x ∼ 60x and
mostly incurs less than 1 millisecond overhead in our tests. Such optimization
makes our approach affordable in cases where frequent (e.g., tens of milliseconds)
sampling is desirable.
Tests of Server benchmarks Our iterative framework does not make any
assumption about the particular bottleneck resource and is applicable to different
resource management scenarios. We test server benchmarks on both the Nehalem and Woodcrest platforms with different models and management objectives to demonstrate this.

[Figure 5.8 bar chart omitted: unfairness on Nehalem and Woodcrest against the 0.10 threshold, comparing Default and Model.]

Figure 5.8: Online unfairness test of four server applications on platforms "Woodcrest" and "Nehalem". Default is the default system running without any throttling. Model here only uses duty cycle modulation as the throttling mechanism.
First we only consider duty cycle modulation as the throttling mechanism.
There are 4 cores on each platform and we bind each server application to one core.
On the “Woodcrest” platform, TPC-H and WebClip run on a chip, and SPECWeb
and SPECJbb run on the other chip. We choose an unfairness threshold of 0.10
and QoS target of 0.90. For the QoS-centric tests, we rotate high priority among
the 4 server applications in each test. The final performance is calculated in terms of throughput, though our run-time daemon uses IPC as guidance. This could be problematic for applications whose instruction mix varies across runs, but that is not the case in our experiments.
In Figure 5.8, our model significantly reduces unfairness although the target
is not met for the test on the Woodcrest platform. Figure 5.9 shows our model
provides good performance isolation of the high-priority application on both platforms, providing performance above or close to the 0.9 performance target.

[Figure 5.9 bar charts omitted: high-priority application performance against the 0.90 QoS target for TPC-H, WebClip, SPECWeb, and SPECJbb, comparing Default and Model on each platform.]

Figure 5.9: Online QoS test of four server applications on "Woodcrest" and "Nehalem". (a) shows results of 4 different tests, each selecting a different server application as the high-priority QoS one. The same applies to (b). Default refers to the default system running without any throttling. Model only uses duty cycle modulation as the throttling mechanism.
In order to demonstrate the more general applicability of our approach, we
add DVFS as another source of throttling, and change the management objective
from overall performance to power efficiency. We use performance per watt as
our metric of power efficiency. We are mainly interested in active power (whole
system operating power minus idle power) in this work. We empirically model active power as quadratic in frequency and linear in duty cycle level. Since DVFS is applied to the whole chip and not per-core on our Intel processors, we
only test this new model on the 2-chip SMP Woodcrest platform. Figure 5.10
shows that this new model achieves much better power efficiency while providing
good fairness.
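The empirical power model can be sketched as follows; combining the two terms multiplicatively and the platform constant c are our assumptions for illustration, not the dissertation's fitted model:

```python
def active_power(freq, duty, f_max=2.27, duty_max=8, c=1.0):
    """Active-power sketch: quadratic in frequency, linear in duty-cycle
    level, normalized so full speed at full duty consumes c watts."""
    return c * (freq / f_max) ** 2 * (duty / duty_max)

def power_efficiency(norm_perf, freq, duty):
    """Performance per watt of active power, the management objective here."""
    return norm_perf / active_power(freq, duty)
```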
[Figure 5.10 bar charts omitted: (a) unfairness and (b) normalized active power efficiency, comparing Default, Model w.o. DVFS, and Model w. DVFS.]

Figure 5.10: Online test of power efficiency (performance per watt). Default is the default system running without any throttling. Model w.o. DVFS only uses duty cycle modulation as the throttling mechanism. Model w. DVFS combines the two throttling mechanisms (duty cycle modulation and dynamic voltage/frequency scaling).
5.5 Related Work and Summary
There has been considerable focus on the issue of quality of service for appli-
cations executing on multicore processors. Several new hardware mechanisms
have been proposed in order to collect statistics at the last-level cache or at the
memory [Suh et al., 2001b, 2004; Zhao et al., 2007; Awasthi et al., 2009; Nesbit
et al., 2006; Qureshi and Patt, 2006]. Suh et al. [Suh et al., 2001b, 2004] use
hardware counters to estimate marginal gains from increasing cache allocations to
individual processes, along with a greedy exploration algorithm, in order to find
a cache partition that minimizes overall miss rate. Zhao et al. [Zhao et al., 2007] propose the CacheScouts architecture to determine cache occupancy, interference, and sharing among concurrently running applications, and use this information to determine which applications to co-schedule. Tam et al. [Tam et al., 2007b] use
the data sampling feature available in the Power5 performance monitoring unit
in order to sample accesses. The resulting access signature is used to deter-
mine similar information to CacheScouts, but in software, which is then used for
clustering/co-scheduling. Awasthi et al. [Awasthi et al., 2009] use an additional
layer of translation to control the placement of pages in a multicore shared cache.
Mutlu et al. [Mutlu and Moscibroda, 2008] propose parallelism-aware batch
scheduling at the DRAM level in order to reduce inter-thread interference at the
memory level. These techniques are orthogonal and complementary to controlling
the amount of a resource utilized by individual threads.
Alternatively, without extra hardware support, software techniques such as
page coloring to achieve cache partitioning [Cho and Jin, 2006; Tam et al., 2007a;
Lin et al., 2008; Soares et al., 2008; Zhang et al., 2009b] and CPU scheduling
quantum adjustment to achieve fair resource utilization [Fedorova et al., 2007]
have been explored. However, page coloring requires significant changes in the
operating system memory management, places artificial constraints on system
memory allocation policies, and incurs expensive re-coloring (page copying) costs
in dynamic execution environments. CPU scheduling quantum adjustment suffers
from its inability to provide fine-grained quality of service guarantees [Zhang et al.,
2009a].
Iyer et al. [Iyer et al., 2007] show how priority can be taken into account when
defining quality of service policies on either a resource or performance basis. Hsu
et al. [Hsu et al., 2006] demonstrate the importance of the objective function in
guiding QoS policy decisions. Nathuji et al. [Nathuji et al., 2010] apply a multi-
input multi-output model to allocate surplus CPU resources among applications
for QoS purposes. In all cases, the cost of making the policy decision and the
amount of time needed to arrive at the correct configuration are not discussed.
Ebrahimi et al. [Ebrahimi et al., 2010] propose a new hardware design to
track contention at different cache/memory levels and throttle ones causing unfair
resource usage or disproportionate progress. We address the same problem but
without requiring special hardware support.
In our work, we propose an iterative framework to enforce SLAs by controlling multicore resources through two hardware execution throttling mechanisms: duty
cycle modulation and voltage/frequency scaling. Besides the iterative refinement
property, the essence of our framework is a customizable prediction model that
determines the effect of a configuration change on the metric of interest. We
devise a hill climbing algorithm to make the prediction model computationally
efficient for online deployment. We analyze our approach using 8 SPECCPU2000 benchmarks (mesa, art, mcf, equake, swim, mgrid, applu, and twolf) and 4 server-style applications (TPC-H, WebClip, SPECWeb, and SPECJbb). We test our
approach on a variety of resource management objectives such as fairness, QoS,
performance, and power efficiency using two different multicore platforms. Our
results suggest that CPU execution speed throttling coupled with our iterative
framework effectively supports multiple forms of service level agreements (SLAs)
for multicore platforms in an efficient and flexible manner.
6 A Unified Middleware
In previous chapters we describe various multicore resource management mech-
anisms and show how they can be applied to manage shared resources. These
approaches are orthogonal yet complementary to each other. For example, the
similarity grouping described in Chapter 4 affects performance and fairness across
groups of applications while the hardware throttling utilized in Chapter 5 affects
performance and fairness for concurrently running applications. In this chapter,
we present a prototype middleware that unifies similarity grouping and hardware
execution throttling to realize multiple benefits simultaneously.
6.1 Design and Implementation
Our prototype middleware consists of kernel and user parts. The kernel part
implements necessary driver support for hardware execution throttling (duty cy-
cle modulation and voltage/frequency scaling) and performance counter profiling.
We apply Mikael Pettersson’s perfctr patch [Pettersson, 2009a] to a recent Linux
2.6.30 kernel. On our Intel dual-core platform, there are two general-purpose per-
formance counters and three additional fixed performance counters. The general-
purpose counters can be programmed to measure hundreds of hardware events.
Each of the fixed counters is dedicated to a pre-defined hardware event: the number
of retired instructions, unhalted CPU cycles, and unhalted CPU reference
cycles.¹ These counters are 40 bits wide on the Intel Dual-Core processor or 48 bits
on the Nehalem platform and can be read by either the rdpmc or rdmsr instruction.
The difference between the two is that rdmsr always executes at privilege
level 0 (the highest), while the rdpmc privilege requirement can be relaxed via
the performance-monitoring counter enable (PCE) flag in register CR4
[Intel Corporation, 2006]. When the PCE flag is set, the system allows the
rdpmc instruction to be executed at any privilege level.
We also developed a user-level tool to facilitate configuring various hardware
event counters. Monitoring a hardware event involves a pair of registers: a select
register and its corresponding counter. One can modify the “Unit Mask” (bits 15-8)
and “Event Select” (bits 7-0) fields of the select register to specify a particular
performance event, and read the event value from the associated counter. Our
tool takes the names of the desired hardware events from the command line (it
requires root privilege to run) and invokes the perfctr driver via the ioctl interface.
Currently our tool supports a selected set of the most frequently used hardware
events for two popular Intel multicore processors (Dual-Core and Nehalem).
Other events can be added as needed.
Setting duty cycle modulation is relatively easy: the Intel manual [Intel
Corporation, 2006] specifies the layout and functionality of each bit in the
IA32_CLOCK_MODULATION register. Configuring DVFS is a bit more complex
since it is not well documented in the Intel manual. Basically, there is
an IA32_PERF_CTL register to control the CPU's performance state (i.e.,
frequency/voltage operating point), but the document does not specify the values
to be written to this register. By reading Intel's cpufreq device driver code in
Linux, we find that IA32_PERF_CTL uses bits 7-0 to encode the voltage level
and bits 15-8 to encode the frequency level. We modified the cpufreq driver to
extract these codes and wrote our own DVFS support. On the Intel chip, each
core has its own register to specify a desired operating point, but the highest
operating point among all sibling cores is the one in effect. To effectively scale
a particular core's frequency, we have to set the IA32_PERF_CTL registers on
all sibling cores of the same chip.

¹Unhalted CPU cycles (CPU_CLK_UNHALTED.CORE) may change over time due to
hardware frequency changes, but unhalted CPU reference cycles
(CPU_CLK_UNHALTED.REF) do not. For example, suppose a 3 GHz CPU scales
down to 2 GHz: CPU_CLK_UNHALTED.CORE reports 2,000,000,000 cycles for 1 second
while CPU_CLK_UNHALTED.REF still reports 3,000,000,000 cycles for 1 second,
assuming the CPU does not enter the halt state.
The user-level part is a daemon process that takes management policies as input
to guide resource-aware scheduling and hardware execution throttling. Upon
starting execution, a job first registers its signature information (e.g., ideal
IPC and cache miss ratio when it runs alone) by invoking a system call.
These signatures can be learned in profiling runs and we assume they are available
beforehand. The kernel scheduler signals the daemon process at context switch
time to update its view of the currently running jobs. Based on the signature
information, the daemon process determines how to group running jobs according
to the similarity grouping described in Chapter 4. Specifically, it modifies a
thread's CPU affinity to bind it to a particular core. For this reason, the daemon
runs with root privilege. It also continuously monitors applications' performance
(instructions per cycle, or IPC) and adjusts hardware execution throttling
accordingly for the given management objective. Once a new job begins running,
the daemon process erases the records of previous runs and restarts sampling
from the default system throttling settings (full duty cycle and highest frequency).
Commodity operating systems implement asynchronous scheduling, which
means CPUs do not synchronize their job dispatch. The expected number of
context switches per scheduling quantum is therefore equal to the number of
CPUs. In reality, context switches occur more frequently due to applications'
sporadic I/O operations. This affects the choice of sampling duration. By
default, we choose a 1 second sampling interval (for batch scheduling); it can go
as low as 10 milliseconds (for frequent context switches). A long sampling
interval provides more stable results, although it increases the time required to
determine the appropriate configuration.
Old samples measured at the same configuration are replaced by new ones to
reflect recency. By doing so, our iterative framework treats a phase change as
a mistaken prediction and automatically incorporates the behavior at the current
configuration into the model to correct the next round of prediction.
6.2 Evaluation Results
Our evaluation is conducted on a 2-chip SMP machine running our modified Linux
2.6.30 kernel. Each chip is an Intel “Woodcrest” dual-core with a 32 KB per-core
L1 data cache and a 4 MB L2 cache shared by the two sibling cores.
Our benchmarks are 12 SPECCPU2000 benchmarks, which we first divide
into four groups based on their memory intensities, with group-0 the most
intensive and group-3 the least intensive:
group-0 = {swim, mcf, equake},
group-1 = {applu, wupwise, mgrid},
group-2 = {art, bzip, twolf},
group-3 = {gzip, mesa, parser}.
Each group issues one job (i.e., starts a benchmark) at a time and awaits the
last job's termination before issuing the next one. We have four groups and four
cores on our platform, so exactly one job is running on each core at any time.
Jobs are initially bound at random by the default scheduler. They are
subsequently bound based on similarity grouping by the daemon after they make
the system call to specify their signatures. We continuously run our experiments
for a sufficient amount of time and use the average execution time (not response
time) of individual benchmarks as their performance measure.
[Figure omitted in this text-only rendering. Its four panels, (a) Overall System
Performance (higher is better), (b) System Unfairness (lower is better),
(c) Active System Power (lower is better), and (d) Active Power Efficiency
(higher is better), each compare the Default System, Similarity Grouping, and
the Unified Middleware.]

Figure 6.1: Comparison results of experiment where CPUs are not over-committed
(number of concurrently running applications equals number of cores).
We compare a system with our middleware support against the default Linux
system with respect to performance, fairness, and active power efficiency. We
first enable only similarity grouping scheduling to see how much overall system
performance improvement it can gain over the default system. Figure 6.1 (a)
shows that similarity grouping achieves a 7% performance improvement over the
default. We then enable hardware execution throttling with a 0.10 unfairness
control threshold. Meanwhile, we also try to optimize active power efficiency
(performance per watt, calculated as normalized performance divided by active
power in watts) under
[Figure omitted in this text-only rendering: it plots unfairness (y-axis) against
restart frequency in number of samples (x-axis), with one curve per sampling
interval (1 second, 100 milliseconds, and 10 milliseconds).]

Figure 6.2: Sensitivity tests with varying sampling interval (10 milliseconds,
100 milliseconds, and 1 second) and restart frequency (5, 10, 20, and 30 samples).
the constrained unfairness threshold. Figure 6.1 (b) shows that our policy
manages to achieve an unfairness of 0.107, a 45% and 35% reduction from the
default system and resource-aware scheduling, respectively. However, we also
notice that its performance drops by 15% compared to the default system. The
reason is that, to make more resources available for co-running applications, we
sometimes have to throttle applications that aggressively occupy shared
resources while making relatively fast progress. While reducing unfairness, our
middleware also cuts active power consumption by almost 30 watts, a 31%
reduction from the default system, as shown in Figure 6.1 (c). We can see from
Figure 6.1 (d) that our power savings offset the performance loss and translate
to a 21% improvement in power efficiency over the default system.
In the previous test, context switches occur fairly infrequently (30 samples
between context switches on average). In order to determine sensitivity to
context switch frequency and sampling interval (time slice), we repeat the test
while periodically forcing the daemon to restart (i.e., restart sampling from the
default throttling settings) to emulate context switch effects. We tried 5, 10,
20, and 30 samples as the restart frequency, with 10 millisecond, 100 millisecond,
and 1 second sampling intervals. The unfairness curves for the different
parameters are plotted in Figure 6.2.
The general trend is that the unfairness curves get closer to the target (0.10)
as restarts become less frequent. Another interesting observation is that long
sampling intervals generally work better than short ones. This is because we
apply hardware throttling settings asynchronously. When a new throttling
configuration needs to be set, we write it to a per-CPU kernel data structure.
The scheduler reads it at the next tick (1 millisecond by default), at which time
the actual setting is changed. The timer tick is triggered asynchronously on
each CPU, so we may see a couple of milliseconds of skew.
Next we perform an experiment where all 12 benchmarks run concurrently
(i.e., the cores are over-committed). Recall that commodity operating systems
implement asynchronous scheduling, so CPUs do not synchronize their scheduling
quanta. Our middleware process has to restart sampling with the default system
settings (full duty cycle and highest frequency) upon every context switch, which
on average occurs 4 times per scheduling quantum on a 4-core system. To
alleviate the inefficiency due to frequent context switches, we set the scheduling
time quantum to 10 seconds in this experiment. Figure 6.3 (a) shows that
similarity grouping exhibits a 5% performance gain over the default, while the
unified middleware is 5% worse than the default. For fairness, the unified
middleware reduces the unfairness factor by 35% and 25% compared to the
default and to similarity grouping, respectively, although its absolute value of
0.169 is higher than our 0.1 unfairness target. The unified middleware's ability
[Figure omitted in this text-only rendering. Its four panels, (a) Overall System
Performance (higher is better), (b) System Unfairness (lower is better),
(c) Active System Power (lower is better), and (d) Active Power Efficiency
(higher is better), each compare the Default System, Similarity Grouping, and
the Unified Middleware.]

Figure 6.3: Comparison results of experiment where CPUs are over-committed
(number of concurrently running applications is larger than number of cores).
to save power is limited by frequent context switches, and it achieves only about
7 watts in savings. Nevertheless, the unified middleware is still 3.6% better than
the default in power efficiency, as shown in Figure 6.3 (d).
6.3 Summary
In this chapter, we present a prototype middleware that combines similarity
grouping and hardware execution throttling to realize multiple benefits (fairness,
performance, and power savings) simultaneously. We demonstrate its benefits in
two different multi-programmed execution environments. To share our experience
with these hardware features (performance counters, duty cycle modulation, and
voltage/frequency scaling), we plan to make our prototype implementation
publicly accessible. We hope it will inspire more research in this area.
7 Conclusions and Future Directions
In this dissertation, we focus on multicore resource management with respect to
performance, fairness, and power efficiency. In particular, we:
• devise and implement an efficient way to track memory page access frequency
(i.e., page hotness). The cost of identifying hot pages online is reduced by
leveraging knowledge of spatial locality during a page table scan of access
bits. Based on this, we propose hot-page-based page coloring, which enforces
coloring on only a small set of frequently accessed (or hot) pages for each
process. Hot-page-based selective coloring can significantly alleviate mem-
ory allocation constraints and recoloring overhead induced by naive page
coloring.
• present a simple yet efficient similarity grouping scheduling policy for
SMP-based multi-chip multicore machines. This policy mitigates cache space
and memory bandwidth contention and achieves up to 12% performance
improvement for a set of SPECCPU2000 benchmarks and two server applications
on a 2-chip dual-core machine. In addition, similarity grouping enables
chip-wide DVFS-based CPU power savings. Guided by a frequency-to-performance
model, it achieves about 20 watts of power savings and a 3 degree Celsius CPU
thermal reduction on average with performance competitive with the default
system.
• advocate hardware execution throttling as an effective tool to support fair
use of shared resources on multicores. We also propose a flexible framework
to automatically find a proper hardware execution throttling configuration
for a user-specified objective. A variety of resource management objectives,
such as fairness, QoS, performance, and power efficiency are targeted and
evaluated in our experiments. The essence of our framework is an iterative
prediction refinement procedure and a customizable model that currently
incorporates both duty cycle modulation and voltage/frequency scaling ef-
fects. Our experimental results on a quad-core Intel Nehalem machine show
that our approach can quickly arrive at the optimal or a near-optimal
configuration.
• present a prototype middleware that combines similarity grouping and hard-
ware execution throttling to realize multiple benefits (fairness, performance,
and power savings) simultaneously.
Throughout this dissertation, we have focused on single chip or SMP-based
multi-chip multicore machines. We are also interested in other platforms such as
NUMA machines and mobile devices. Due to the memory bandwidth limitation in
SMP machines, NUMA architectures have been deployed for high-end multi-chip
multicore machines. On NUMA-based machines, each chip/node has a dedicated
memory controller to its local memory. Remote memory accesses are completed
via inter-chip communication through a point-to-point interconnect (e.g., Hyper-
Transport in AMD technology or QuickPath Interconnect in Intel technology).
By doing so, the aggregated memory bandwidth scales with the number of chips.
However, the cost is a loss of uniform memory access. Depending on the num-
ber of hops and the capacity of the links, the latency of remote memory accesses
varies dramatically. Consequently, an application’s execution time may fluctuate
because its memory pages are allocated in different nodes during different runs.
An existing solution to mitigate this effect is to interleave memory pages among
all nodes to obtain more stable performance. This addresses the non-uniformity
but does not necessarily achieve optimal performance. Another, more desirable yet
challenging solution is migrating an application to different nodes according to its
memory access patterns. When migrating an application is expensive, we can also
consider migrating hot (frequently accessed) pages to local memory and swapping
cold pages to remote memory. This approach tries to maximize performance by
making most memory accesses local.
We are also interested in applying our techniques to multithreaded parallel
applications. For example, resource-aware scheduling can take advantage of data
communication information available at the programming language level to dy-
namically co-schedule threads on the same chip to reduce communication over-
head. Our hardware throttling technique can be applied to prioritize a thread
during its critical phase of execution by throttling other competing sibling cores’
speed. Additional challenges will be faced to ensure that the resource control tech-
niques transparently handle both multiprogram and multithreaded workloads.
A further place to employ our techniques is virtual machine-driven shared ser-
vice hosting platforms. In particular, cloud computing platform providers (e.g.,
Amazon [Amazon] and GoGRID [GoGrid, 2008]) typically charge customers in a
pay-as-you-go manner. However, these systems are oblivious to the actual amount
of resources used by individual virtual machines when they share a physical
machine. Using our techniques, we can augment existing cloud computing billing
systems with finer-grained, performance counter-based resource metering. We
can apply our techniques to carefully manage various resource conflicts (especially
those at the chip level) and provide better performance guarantees as desired by
customers.
Bibliography
Amazon. 2008. Amazon elastic compute cloud. http://aws.amazon.com/ec2/.
AMD Corporation. 2008. AMD-64 architecture programmer’s manual.
AMD Corporation. 2009. BIOS and kernel developer’s guide (BKDG) for AMD
family 10h processors.
Antonopoulos, Christos, Dimitrios Nikolopoulos, and Theodore Papatheodorou.
2003. Scheduling algorithms with bus bandwidth considerations for SMPs. In
Proc. of the 32nd Int’l Conf. on Parallel Processing.
Arlitt, Martin and Tai Jin. 1999. Workload Characterization of the 1998 World
Cup Web Site. Technical Report HPL-1999-35, HP Laboratories Palo Alto.
Awasthi, Manu, Kshitij Sudan, Rajeev Balasubramonian, and John Carter. 2009.
Dynamic hardware-assisted software-controlled page placement to manage ca-
pacity allocation and sharing within large caches. In 15th Int’l Symp. on High-
Performance Computer Architecture. Raleigh, NC.
Azimi, Reza, Michael Stumm, and Robert Wisniewski. 2005. Online performance
analysis by statistical sampling of microprocessor performance counters. In The
19th ACM International Conference on Supercomputing. Boston MA.
Balasubramonian, Rajeev, David Albonesi, Alper Buyuktosunoglu, and Sandhya
Dwarkadas. 2000. Memory hierarchy reconfiguration for energy and perfor-
mance in general-purpose processor architectures. In International Symposium
on Microarchitecture. Monterey, CA.
Barham, Paul, Austin Donnelly, Rebecca Isaacs, and Richard Mortier. 2004. Using
Magpie for request extraction and workload modeling. In Proc. of the 6th
USENIX Symp. on Operating Systems Design and Implementation, pages 259–
272. San Francisco, CA.
Barroso, Luiz André and Urs Hölzle. 2007. The case for energy-proportional
computing. IEEE Computer, pages 33–37.
Bellosa, Frank. 2000. The benefits of event-driven energy accounting in power-
sensitive systems. In SIGOPS European Workshop. Kolding, Denmark.
Bellosa, Frank, Andreas Weißel, Martin Waitz, and Simon Kellner. 2003. Event-
driven energy accounting for dynamic thermal management. In Workshop on
Compilers and Operating Systems for Low Power. New Orleans, Louisiana.
Bershad, Brian, Dennis Lee, Theodore Romer, and Bradley Chen. 1994. Avoiding
conflict misses dynamically in large direct-mapped caches. In Proc. of the 6th
Int’l Conf. on Architectural Support for Programming Languages and Operating
Systems, pages 158–170. San Jose, CA.
Bianchini, Ricardo and Ram Rajamony. 2004. Power and energy management for
server systems. In IEEE Computer, volume 37.
Browne, S., J. Dongarra, N. Garner, K. London, and P. Mucci. 2000. A scalable
cross-platform infrastructure for application performance tuning using hardware
counters. In Proc. of the IEEE/ACM SC2000 Conf. Dallas, TX.
Bugnion, Edouard, Jennifer M. Anderson, Todd C. Mowry, Mendel Rosenblum,
and Monica S. Lam. 1996. Compiler-directed page coloring for multiproces-
sors. In Proc. of the 7th Int’l Conf. on Architectural Support for Programming
Languages and Operating Systems, pages 244–255. Cambridge, MA.
Chandra, Dhruba, Fei Guo, Seongbeom Kim, and Yan Solihin. 2005. Predicting
inter-thread cache contention on a chip multi-processor architecture. In Pro-
ceedings of the 11th International Symposium on High-Performance Computer
Architecture, pages 340–351.
Chase, Jeffrey, Darrell Anderson, Prachi Thakar, and Amin Vahdat. 2001. Man-
aging energy and server resources in hosting centers. In Proc. of the 18th ACM
Symp. on Operating Systems Principles. Banff, Canada.
Chiou, Derek, Prabhat Jain, Larry Rudolph, and Srini Devadas. 2000. Dynamic
cache partitioning for simultaneous multithreading systems. In Proceedings of
the ASP-DAC 2000. Asia and South Pacific.
Cho, Sangyeun and Lei Jin. 2006. Managing distributed, shared L2 caches through
OS-level page allocation. In Proc. of the 39th Int’l Symp. on Microarchitecture,
pages 455–468. Orlando, FL.
Ebrahimi, Eiman, Chang Joo Lee, Onur Mutlu, and Yale Patt. 2010. Fairness via
source throttling: A configurable and high-performance fairness substrate for
multi-core memory systems. In Proc. of the 15th Int’l Conf. on Architectural
Support for Programming Languages and Operating Systems, pages 335–346.
Pittsburgh, PA.
Eeckhout, Lieven, Hans Vandierendonck, and Koen De Bosschere. 2002. Workload
design: Selecting representative program-input pairs. In Int'l Conf. on Parallel
Architectures and Compilation Techniques. Charlottesville, Virginia.
El-Moursy, Ali, Rajeev Garg, David Albonesi, and Sandhya Dwarkadas. 2006.
Compatible phase co-scheduling on a CMP of multi-threaded processors. In
Proceedings of the 20th International Parallel and Distributed Processing
Symposium. Rhodes Island, Greece.
Elnozahy, Mootaz, Michael Kistler, and Ramakrishnan Rajamony. 2003. Energy
conservation policies for web servers. In Proc. of the 4th USENIX Symposium
on Internet Technologies and Systems.
Eranian, Stephane. 2006. perfmon2: A flexible performance monitoring interface
for Linux. In Proc. of the Linux Symposium, pages 269–288.
Fedorova, Alexandra, Margo Seltzer, and Michael D. Smith. 2007. Improving
performance isolation on chip multiprocessors via an operating system sched-
uler. In Proc. of the 16th Int’l Conf. on Parallel Architecture and Compilation
Techniques, pages 25–36. Brasov, Romania.
Fedorova, Alexandra, Christopher Small, Daniel Nussbaum, and Margo Seltzer.
2004. Chip multithreading systems need a new operating system scheduler. In
Proc. of the SIGOPS European Workshop. Leuven, Belgium.
Ghoting, Amol, Gregory Buehrer, Srinivasan Parthasarathy, Daehyun Kim, An-
thony Nguyen, Yen-Kuang Chen, and Pradeep Dubey. 2007. Cache-conscious
frequent pattern mining on modern and emerging processors. In Int’l Journal
of Very Large Data Bases (VLDB). Vienna, Austria.
GoGrid. 2008. http://www.gogrid.com.
Guan, Nan, Martin Stigge, Wang Yi, and Ge Yu. 2009a. Cache-aware scheduling
and analysis for multicores. In International Conference on Embedded Software.
Grenoble, France.
Guan, Nan, Martin Stigge, Wang Yi, and Ge Yu. 2009b. New response time
bounds for fixed priority multiprocessor scheduling. In The 30th IEEE Real-
Time Systems Symposium. Washington, D.C.
Heath, Taliver, Ana Paula Centeno, Pradeep George, Luiz Ramos, Yogesh Jaluria,
and Ricardo Bianchini. 2006. Mercury and freon: Temperature emulation and
management for server systems. In Architectural Support for Programming Lan-
guages and Operating Systems. San Jose, CA.
Herdrich, Andrew, Ramesh Illikkal, Ravi Iyer, Don Newell, Vineet Chadha, and
Jaideep Moses. 2009. Rate-based QoS techniques for cache/memory in CMP
platforms. In 23rd International Conference on Supercomputing (ICS). Yorktown
Heights, NY.
Hsu, Lisa R., Steven K. Reinhardt, Ravishankar Iyer, and Srihari Makineni. 2006.
Communist, utilitarian, and capitalist cache policies on CMPs: Caches as a
shared resource. In Int'l Conf. on Parallel Architectures and Compilation
Techniques.
Intel Corporation. 2006. IA-32 Intel architecture software developer’s manual,
volume 3: System programming guide.
Intel Corporation. 2008a. Intel turbo boost technology in intel core microarchi-
tecture (Nehalem) based processors.
Intel Corporation. 2008b. TLBs, paging-structure caches, and their invalidation.
http://www.intel.com/design/processor/applnots/317080.pdf.
Intel Corporation. 2009a. Intel core2 duo and dual-core thermal and mechanical
design guidelines. http://www.intel.com/design/core2duo/documentation.htm.
Intel Corporation. 2009b. Intel core2 duo mobile processor, intel core2 solo mo-
bile processor and intel core2 extreme mobile processor on 45-nm process -
datasheet.
Isci, Canturk, Gilberto Contreras, and Margaret Martonosi. 2006. Live, runtime
phase monitoring and prediction on real systems with application to dynamic
power management. In International Symposium on Microarchitecture. Orlando,
FL.
Iyer, Ravi, Li Zhao, Fei Guo, Ramesh Illikkal, Srihari Makineni, Don Newell, Yan
Solihin, Lisa Hsu, and Steve Reinhardt. 2007. QoS policies and architecture for
cache/memory in CMP platforms. In ACM SIGMETRICS, pages 25–36. San
Diego.
Jiang, Yunlian, Xipeng Shen, Jie Chen, and Rahul Tripathi. 2008. Analysis
and approximation of optimal co-scheduling on CMP. In Int'l Conf. on Parallel
Architecture and Compilation Techniques (PACT). Toronto, Canada.
Kessler, R.E. and Mark D. Hill. 1992. Page placement algorithms for large real-
indexed caches. ACM Trans. on Computer Systems, 10(4):338–359.
Kim, Seongbeom, Dhruba Chandra, and Yan Solihin. 2004. Fair cache sharing
and partitioning in a chip multiprocessor architecture. In Int’l Conf. on Parallel
Architectures and Compilation Techniques.
Kim, Wonyoung, Meeta S. Gupta, Gu-Yeon Wei, and David Brooks. 2008. System
level analysis of fast, per-core DVFS using on-chip switching regulators. In
HPCA'08. Salt Lake City, UT.
Kotla, Ramakrishna, Anirudh Devgan, Soraya Ghiasi, Tom Keller, and Freeman
Rawson. 2004. Characterizing the impact of different memory-intensity levels.
In IEEE 7th Annual Workshop on Workload Characterization. Austin, Texas.
Lee, Donghee, Jongmoo Choi, JongHun Kim, Sam H. Noh, Sang Lyul Min,
Yookun Cho, and Chong Sang Kim. 2001. LRFU: A spectrum of policies that
subsumes the least recently used and least frequently used policies. IEEE Trans.
on Computers, 50(12):1352–1361.
Lin, Jiang, Qingda Lu, Xiaoming Ding, Zhao Zhang, Xiaodong Zhang, and P. Sa-
dayappan. 2008. Gaining insights into multicore cache partitioning: Bridging
the gap between simulation and real systems. In Proc. of the 14th Int’l Symp.
on High-Performance Computer Architecture. Salt Lake City, UT.
Linux Open Source Community. 2010. Linux kernel archives.
http://www.kernel.org.
Lu, Pin and Kai Shen. 2007. Virtual machine memory access tracing with hyper-
visor exclusive cache. In Proc. of the USENIX Annual Technical Conf., pages
29–43. Santa Clara, CA.
Luo, Yue and Lizy Kurian John. 2001. Workload characterization of multithreaded
Java servers. In IEEE International Symposium on Performance Analysis of
Systems and Software. Tucson, Arizona.
McCalpin, John. 1995. Memory bandwidth and machine balance in current high
performance computers. In IEEE Technical Committee on Computer Architec-
ture newsletter.
Merkel, Andreas and Frank Bellosa. 2008a. Memory-aware scheduling for energy
efficiency on multicore processors. In Workshop on Power Aware Computing
and Systems, HotPower’08. San Diego, CA.
Merkel, Andreas and Frank Bellosa. 2008b. Task activity vectors: A new metric
for temperature-aware scheduling. In 3rd European Conf. on Computer systems.
Glasgow, Scotland.
Moscibroda, Thomas and Onur Mutlu. 2007. Memory performance attacks: De-
nial of memory service in multi-core systems. In USENIX Security Symp., pages
257–274. Boston, MA.
Mutlu, Onur and Thomas Moscibroda. 2008. Parallelism-aware batch scheduling:
Enhancing both performance and fairness of shared DRAM systems. In
International Symposium on Computer Architecture (ISCA), pages 63–74. Beijing,
China.
Nathuji, Ripal, Aman Kansal, and Alireza Ghaffarkhah. 2010. Q-Clouds:
Managing performance interference effects for QoS-aware clouds. In Proceedings
of the Fifth EuroSys Conference. Paris, France.
Naveh, Alon, Efraim Rotem, Avi Mendelson, Simcha Gochman, Rajshree Chabuk-
swar, Karthik Krishnan, and Arun Kumar. 2006. Power and thermal manage-
ment in the Intel Core Duo processor. Intel Technology Journal, 10(2):109–122.
Nesbit, Kyle, Nidhi Aggarwal, James Laudon, and James Smith. 2006. Fair queu-
ing memory systems. In 39th Int’l Symp. on Microarchitecture (Micro), pages
208–222. Orlando, FL.
OpenSSL. 2007. OpenSSL: The open source toolkit for SSL/TLS.
http://www.openssl.org.

Oprofile. 2009. Oprofile project. http://oprofile.sourceforge.net.
Parekh, Sujay, Susan Eggers, Henry Levy, and Jack Lo. 2000. Thread-sensitive
scheduling for SMT processors. Technical report, Department of Computer
Science and Engineering, University of Washington.
Patterson, David A. 2004. Latency lags bandwidth. Communications of the ACM,
47(10):71–75.
Percival, Colin. 2005. Cache missing for fun and profit. In BSDCan 2005. Ottawa,
Canada. http://www.daemonology.net/papers/htt.pdf.
Pettersson, Mikael. 2009a. Linux performance counters driver.
http://sourceforge.net/projects/perfctr/.

Pettersson, Mikael. 2009b. Perfctr. http://user.it.uu.se/~mikpe/linux/perfctr/.
Pillai, Padmanabhan and Kang G. Shin. 2001. Real-time dynamic voltage scaling
for low-power embedded operating systems. In Proc. of the 18th ACM Symp.
on Operating Systems Principles. Banff, Canada.
Pinheiro, Eduardo, Ricardo Bianchini, Enrique V. Carrera, and Taliver Heath.
2001. Load balancing and unbalancing for power and performance in cluster-
based systems. In Proc. of the Workshop on Compilers and Operating Systems
for Low Power.
Qureshi, Moinuddin and Yale Patt. 2006. Utility-based cache partitioning: A low-
overhead, high-performance, runtime mechanism to partition shared caches. In
39th Int'l Symp. on Microarchitecture (Micro), pages 423–432. Orlando, FL.
Rafique, Nauman, Wontaek Lim, and Mithuna Thottethodi. 2006. Architectural
support for operating system-driven CMP cache management. In Int’l Conf.
on Parallel Architectures and Compilation Techniques (PACT), pages 2–12.
Raghuraman, Anand. 2003. Miss-ratio curve directed memory management for
high performance and low energy. Master’s thesis, Dept. of Computer Science,
UIUC.
Romer, Theodore, Dennis Lee, Brian Bershad, and Bradley Chen. 1994. Dynamic
page mapping policies for cache conflict resolution on standard hardware. In
Proc. of the First USENIX Symp. on Operating Systems Design and Implemen-
tation, pages 255–266. Monterey, CA.
Salapura, Valentina, Karthik Ganesan, Alan Gara, Michael Gschwind, James Sex-
ton, and Robert Walkup. 2008. Next-generation performance counters: Towards
monitoring over thousand concurrent events. In IEEE International Symposium
on Performance Analysis of Systems and Software. Austin, TX.
Seshadri, Pattabi and Alex Mericas. 2001. Workload characterization of multi-
threaded Java servers on two PowerPC processors. In IEEE 4th Workshop on
Workload Characterization. Austin, TX.
Settle, Alex, Joshua Kihm, and Andrew Janiszewski. 2004. Architectural support
for enhanced SMT job scheduling. In Int'l Conf. on Parallel Architectures and
Compilation Techniques.
Shen, Kai, Ming Zhong, Chuanpeng Li, Sandhya Dwarkadas, Chris Stewart, and
Xiao Zhang. 2008. Hardware counter driven on-the-fly request signatures. In
Thirteenth International Conference on Architectural Support for Programming
Languages and Operating Systems. Seattle.
Shen, Xipeng, Yutao Zhong, and Chen Ding. 2004. Locality phase prediction.
In 11th Int’l Conf. on Architectural Support for Programming Languages and
Operating Systems (ASPLOS), pages 165–176. Boston, MA.
Sherwood, Timothy, Brad Calder, and Joel Emer. 1999. Reducing cache misses
using hardware and software page replacement. In Proc. of the 13th Int’l Conf.
on Supercomputing, pages 155–164. Rhodes, Greece.
Sherwood, Timothy, Suleyman Sair, and Brad Calder. 2003. Phase tracking and
prediction. In International Symposium on Computer Architecture. San Diego,
CA.
Snavely, Allan and Dean M. Tullsen. 2000. Symbiotic job scheduling for a simulta-
neous multithreading processor. In Proc. of the 9th Int’l Conf. on Architectural
Support for Programming Languages and Operating Systems, pages 234–244.
Cambridge, MA.
Soares, Livio, David Tam, and Michael Stumm. 2008. Reducing the harmful
effects of last-level cache polluters with an OS-level, software-only pollute buffer.
In 41th Int’l Symp. on Microarchitecture (Micro), pages 258–269. Lake Como,
ITALY.
Sokolinsky, Leonid B. 2004. LFU-K: An effective buffer management replacement
algorithm. In 9th Int’l Conf. on Database Systems for Advanced Applications,
pages 670–681.
Suh, G. Edward, Srinivas Devadas, and Larry Rudolph. 2001a. Analytical cache
models with applications to cache partitioning. In Proc. of the 15th Int’l Conf.
on Supercomputing, pages 1–12. Sorrento, Italy.
Suh, G. Edward, Larry Rudolph, and Srini Devadas. 2001b. Dynamic cache par-
titioning for simultaneous multithreading systems. In Proc. of the IASTED
International Conference on Parallel and Distributed Computing and Systems.
Anaheim, USA.
Suh, G. Edward, Larry Rudolph, and Srini Devadas. 2004. Dynamic partitioning
of shared cache memory. The Journal of Supercomputing, 28:7–26.
Sun Microsystems, Inc. 2005. UltraSPARC IV+ Processor Manual.
http://www.sun.com/processors/documentation.html.
Sweeney, Peter, Matthias Hauswirth, Brendon Cahoon, Perry Cheng, Amer Di-
wan, David Grove, and Michael Hind. 2004. Using hardware performance mon-
itors to understand the behaviors of Java applications. In Proc. of the Third
USENIX Virtual Machine Research and Technology Symp., pages 57–72. San
Jose, CA.
Tam, David, Reza Azimi, Livio Soares, and Michael Stumm. 2007a. Managing
shared L2 caches on multicore systems in software. In Workshop on the In-
teraction between Operating Systems and Computer Architecture. San Diego,
CA.
Tam, David, Reza Azimi, Livio Soares, and Michael Stumm. 2007b. Thread
clustering: Sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In Pro-
ceedings of the 2nd ACM SIGOPS/Eurosys European Conference on Computer
Systems. Lisbon, Portugal.
Tam, David, Reza Azimi, Livio Soares, and Michael Stumm. 2009. RapidMRC:
Approximating L2 miss rate curves on commodity systems for online optimiza-
tions. In 14th Int’l Conf. on Architectural Support for Programming Languages
and Operating Systems (ASPLOS). Washington, DC.
Taylor, George, Peter Davies, and Michael Farmwald. 1990. The TLB slice: A
low-cost high-speed address translation mechanism. In Proceedings of the 17th
Annual International Symposium on Computer Architecture, pages 355–363.
Waldspurger, Carl. 2002. Memory resource management in VMware ESX Server. In
5th USENIX Symp. on Operating Systems Design and Implementation (OSDI),
pages 181–194. Boston, MA.
Waldspurger, Carl and William Weihl. 1994. Lottery scheduling: Flexible
proportional-share resource management. In Proc. of the First USENIX Symp.
on Operating Systems Design and Implementation, pages 1–11. Monterey, CA.
Wang, Xiaorui, Charles Lefurgy, and Malcolm Ware. 2005. Managing peak system-
level power with feedback control. Technical Report RC23835, IBM Research.
Watts Up. 2009. Watts Up power meter. https://www.wattsupmeters.com.
Weiser, Mark, Brent Welch, Alan Demers, and Scott Shenker. 1994. Scheduling
for reduced CPU energy. In 1st USENIX Symp. on Operating Systems Design
and Implementation (OSDI), pages 13–23.
Weissel, Andreas and Frank Bellosa. 2002. Process cruise control: Event-driven
clock scaling for dynamic power management. In International Conference
on Compilers, Architecture, and Synthesis for Embedded Systems. Grenoble,
France.
Weissel, Andreas and Frank Bellosa. 2004. Dynamic thermal management for dis-
tributed systems. In Proc. of the 1st Workshop on Temperature-aware Computer
Systems. Munich, Germany.
Wisniewski, Robert and Bryan Rosenburg. 2003. Efficient, unified, and scalable
performance monitoring for multiprocessor operating systems. In 2003 Super-
computing Conference. Phoenix, AZ.
Zhang, Eddy Z., Yunlian Jiang, and Xipeng Shen. 2010a. Does cache sharing on
modern CMP matter to the performance of contemporary multithreaded pro-
grams? In 15th ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming (PPoPP). Bangalore, India.
Zhang, Xiao, Sandhya Dwarkadas, Girts Folkmanis, and Kai Shen. 2007. Processor
hardware counter statistics as a first-class system resource. In HotOS XI. San
Diego, CA.
Zhang, Xiao, Sandhya Dwarkadas, and Kai Shen. 2009a. Hardware execution
throttling for multi-core resource management. In USENIX Annual Technical
Conf. (USENIX). San Diego, CA.
Zhang, Xiao, Sandhya Dwarkadas, and Kai Shen. 2009b. Towards practical page
coloring-based multicore cache management. In 4th European Conf. on Com-
puter systems. Nuremberg, Germany.
Zhang, Xiao, Kai Shen, Sandhya Dwarkadas, and Rongrong Zhong. 2010b. An
evaluation of per-chip nonuniform frequency scaling on multicores. In USENIX
Annual Technical Conf. (USENIX). Boston, MA.
Zhang, Xiao, Rongrong Zhong, Sandhya Dwarkadas, and Kai Shen. 2010c. Flexi-
ble hardware throttling based multicore management. Under submission.
Zhao, Li, Ravi Iyer, Ramesh Illikkal, Jaideep Moses, Don Newell, and Srihari
Makineni. 2007. CacheScouts: Fine-grain monitoring of shared caches in CMP
platforms. In Proc. of the 16th Int’l Conf. on Parallel Architecture and Compi-
lation Techniques, pages 339–352. Brasov, Romania.
Zhou, Pin, Vivek Pandey, Jagadeesan Sundaresan, Anand Raghuraman,
Yuanyuan Zhou, and Sanjeev Kumar. 2004. Dynamic tracking of page miss
ratio curve for memory management. In Proc. of the 11th Int’l Conf. on Ar-
chitectural Support for Programming Languages and Operating Systems, pages
177–188. Boston, MA.
Zhuravlev, Sergey, Sergey Blagodurov, and Alexandra Fedorova. 2010. Managing
contention for shared resources on multicore processors. In Proc. of the 15th
Int’l Conf. on Architectural Support for Programming Languages and Operating
Systems, pages 129–142. Pittsburgh, PA.