WLRU CPU Cache Replacement Algorithm
(Thesis Format: Monograph)
by
Qufei Wang
Graduate Program in Computer Science
Submitted in partial fulfilment of the requirements for the degree of
Doctor of Philosophy
Faculty of Graduate Studies
The University of Western Ontario
London, Ontario
December, 2006

© Qufei Wang 2006
THE UNIVERSITY OF WESTERN ONTARIO
FACULTY OF GRADUATE STUDIES
CERTIFICATE OF EXAMINATION
Advisors Examining Board
Dr. Hanan Lutfiyya Dr. Marin Litou
Dr. Abdallah Shami
Dr. Mark Daley
Dr. Mike Katchabaw
The thesis by Qufei Wang
entitled
WLRU CPU CACHE REPLACEMENT ALGORITHM
is accepted in partial fulfilment of the requirements for the degree of
Doctor of Philosophy
Date Chair of Examining Board
Abstract
A CPU consists of two parts: the CPU cores and the CPU caches. CPU caches are small but
fast memories, usually on the same die as the CPU cores, that store recently used instructions
and data. Accessing a CPU cache takes a quarter of a nanosecond to five nanoseconds,
but accessing the main memory takes 100 to 150 nanoseconds. The main memory is so
slow that the CPU is idle for more than 80% of the time waiting for memory accesses. This
problem is known as the memory wall. The memory wall implies that faster or more CPU
cores are of little use if the performance of CPU caches does not improve.

Generally, larger CPU caches have higher performance, but the improvement is very small.
A smarter CPU cache replacement algorithm offers more potential. The CPU cache replacement
algorithm decides which cache contents to replace. Currently, Least Recently Used
(LRU) replacement and its variants are the most widely used in CPUs. However, the performance
of LRU is not satisfactory for applications with poor locality, such as network protocols
and applications. We found that there is a pattern in the memory references of these applications
that makes LRU fail. Based on this discovery, we developed a new CPU cache
replacement algorithm called Weighted Least Recently Used (WLRU). Trace-based simulations show
that WLRU is a significant improvement over LRU for applications with poor locality. For
example, for web servers, WLRU has 50% fewer L2 cache misses than LRU. This means
WLRU can immediately improve the performance of web servers by more than 200%.
CPU caches have been intensively studied in the past thirty years, and WLRU offers by far the
biggest improvement. Our studies also indicate that WLRU is very close to the theoretical
upper limit of cache replacement algorithms. This means any further improvement in CPU
cache performance will have to come from changes to the software. In future work, we will
investigate how to write operating systems and software to achieve better CPU cache performance.
Acknowledgements
I would like to gratefully acknowledge the supervision of Professor Hanan Lutfiyya during
this work. Many thanks to her for her patience, tolerance and support.

I am grateful to all my friends in the Department of Computer Science, University of Western Ontario.
From the staff, Janice Wiersma and Cheryl McGrath are especially thanked for their
care and attention.

Finally, I am forever indebted to my wife Min and my parents. The support from Min is the
source of strength that helped me through the many years.
Table of Contents

CERTIFICATE OF EXAMINATION
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES

1 Introduction
  1.1 Background and Motivation
  1.2 Contributions
  1.3 Outline of Dissertation

2 Background and Related Research
  2.1 Background on CPU Caches
    2.1.1 Memory Hierarchy
    2.1.2 Cache Lines and Cache Hits
    2.1.3 Set Associative Caches
    2.1.4 Multiple Level CPU Caches
  2.2 Efforts to Improve Cache Hit Rates
    2.2.1 Cache Line Size, Prefetching and Stream Buffer
    2.2.2 Cache Sizes and Hit Rates
    2.2.3 Cache Associativity and Victim Cache
    2.2.4 Split Instruction and Data Cache
  2.3 Cache Replacement Algorithms Other Than LRU
    2.3.1 Pseudo-LRU Replacements
    2.3.2 First-In-First-Out and Random Replacements
    2.3.3 LRU-k and LIRS Replacement Algorithms
    2.3.4 LFU and FBR Replacement Algorithms
    2.3.5 LRFU, Multi-Queue and EELRU Replacement Algorithms
    2.3.6 Dead Alive Prediction Replacement Algorithms
    2.3.7 Off-Line Optimal Replacement Algorithm
    2.3.8 Summary of Replacements
  2.4 CPU Cache Issues of Network Protocols and Applications
  2.5 Summary

3 Principle of Locality and Property of Short Lifetime
  3.1 Memory Reference Traces
  3.2 Principle of Temporal Locality and LRU
  3.3 Inter-Reference Gaps and Temporal Locality
    3.3.1 Inter-Reference Gaps and LRU
    3.3.2 Complete Program Stream and Per Set IRG Values
    3.3.3 Distributions of Per Set IRG Values and Temporal Locality
  3.4 Reference Counts and Property of Short Lifetime
    3.4.1 Property of Short Lifetime
    3.4.2 Reference Counts of Cache Lines
  3.5 Relationship between Average Reference Counts and LRU Hit Rates
  3.6 L2 IRG and Reference Count Distributions
    3.6.1 L2 Reference Count Distributions
    3.6.2 L2 IRG Distributions
  3.7 Summary

4 Locality Characteristics of Network Protocols and Applications
  4.1 Motivation
  4.2 Memory Traces of Web Servers
  4.3 Average Reference Counts of Web Server Memory Traces
  4.4 Reference Count Distributions of Web Server Memory Traces
  4.5 L2 Distributions of Reference Counts of Web Server Memory Traces
  4.6 L2 IRG Distributions of Web Server Memory Traces
  4.7 Summary

5 WLRU Cache Replacement
  5.1 Correlation of IRG and Reference Counts
  5.2 Problems with LRU and LFU
  5.3 WLRU Cache Replacement
  5.4 Notation Used to Represent WLRU Parameter Settings
  5.5 WLRU Mimicking LRU
  5.6 Comparison of WLRU with Other Cache Replacement Algorithms
  5.7 Summary

6 Hardware Implementations of WLRU
  6.1 Space Requirements of WLRU
  6.2 Overall Structure of WLRU CPU Cache
  6.3 Hit/Miss Logic
  6.4 Weight Control Logic
  6.5 Replacement and Line-Fill/Cast-Out Logic
  6.6 Comparison of WLRU and LRU
  6.7 Summary

7 WLRU CPU Cache Simulator
  7.1 Memory Trace Based CPU Cache Simulations
  7.2 Architecture of CPU Cache Simulator
    7.2.1 SimuEngine Object and Trace Synthesizing
    7.2.2 CacheDevice Interface
  7.3 Cache Sets and Replacement Objects
  7.4 WLRU Replacements
  7.5 Optimal Replacement Algorithm
  7.6 Victim Analysis
  7.7 Validation of Simulator
  7.8 Summary

8 Simulation Results
  8.1 Experimental Design
  8.2 WLRU on Web Server Memory Traces
  8.3 WLRU on SPEC CPU2000 Benchmarks
  8.4 WLRU Performance on Multi-threaded Workloads
  8.5 Comparison of LRU and WLRU Using Victim Analysis
  8.6 Summary

9 Conclusions and Future Research
  9.1 Conclusions
  9.2 Future Research
    9.2.1 Hardware Prototype
    9.2.2 Locality Analysis for More Application Domains
    9.2.3 OS and Algorithm Design Issues

A Analysis and Simulation Results

References

VITA
List of Figures

2.1 The structure of a four-level memory hierarchy.
2.2 The structure of a CPU cache line.
2.3 The mapping of main memory words into a direct-mapped cache and a two-way associative cache.
2.4 The structure of an eight-way set associative cache.
2.5 Storage arrangements of an eight-way associative cache set using real LRU and PLRU replacements.
2.6 Storage arrangements of an eight-way associative cache set using PLRU-tree replacement.
2.7 Storage arrangements of an eight-way associative cache set using PLRU-msb replacement.
2.8 An example of the Optimal replacement decision.
3.1 Two per set IRG values and their corresponding whole stream IRG values.
3.2 IRG strings of three addresses in the CC1 trace [PG95]. IRG index is the index number of the first reference of an IRG gap.
3.3 The distributions of per set IRG values of eight SPEC benchmarks.
3.4 The distributions of per address reference counts of eight SPEC benchmarks.
3.5 The distributions of reference counts of cache lines of eight SPEC benchmarks.
3.6 The average reference counts of SPEC integer benchmarks and their miss rates under LRU.
3.7 The average reference counts of SPEC floating point benchmarks and their miss rates under LRU.
3.8 The distributions of L2 reference counts of eight SPEC benchmarks.
3.9 The distributions of L2 IRG values of eight SPEC benchmarks.
3.10 The distributions of L2 IRG values of SPEC benchmarks on log2 scale.
4.1 The distributions of per address reference counts of four web server memory traces.
4.2 The distributions of reference counts of cache lines of four web server memory traces.
4.3 The distributions of reference counts of cache lines of four web server memory traces at the L2 cache.
4.4 The distributions of IRG values of four web server memory traces at the L2 cache.
5.1 Comparison of the replacement decision of WLRU and LRU.
6.1 Storage arrangement of an eight-way associative cache set using WLRU replacement.
6.2 The structure of a CPU cache using WLRU replacement.
6.3 The RAM memory arrays used in an associative set of the WLRU CPU cache.
6.4 The data path, address path and control signals of the WLRU CPU cache.
6.5 The hit/miss logic of the WLRU CPU cache.
6.6 The weight control logic of the WLRU CPU cache.
6.7 The weight arithmetic circuit of the weight control logic.
6.8 The line-fill/cast-out logic of the WLRU CPU cache.
6.9 The replacement logic of the WLRU CPU cache.
7.1 The architecture of the CPU cache simulator.
7.2 An example trace synthesizing scenario which includes context switching effects.
7.3 The UML graph of the CacheDevice interface.
7.4 A flow chart of the cyclePing() method of class SetCache.
7.5 The UML graph of the CacheSet class, which is the base class of all replacements.
7.6 The flow chart of the referenced() and replace() methods of the WLRU class.
8.1 Comparison of miss rates of WLRU, LRU and Optimal on Apache traces.
8.2 Comparison of miss rates of WLRU, LRU and Optimal on Apache traces with mixed web page sizes.
8.3 Comparison of miss rates of WLRU, LRU and Optimal on thttpd traces.
8.4 Comparison of miss rates of WLRU, LRU and Optimal on thttpd traces with mixed web page sizes.
8.5 Comparison of miss rates of OPT, LRU and WLRU on SPEC integer benchmarks.
8.6 Comparison of miss rates of OPT, LRU and WLRU on SPEC floating point benchmarks.
8.7 Comparison of miss rates of OPT, LRU and WLRU on SPEC integer benchmarks, where WLRU uses i64r256b32.
8.8 Comparison of miss rates of OPT, LRU and WLRU on SPEC floating point benchmarks, where WLRU uses i64r256b32.
8.9 The miss rates of LRU, WLRU and OPT replacements on multi-threading SPEC INT benchmarks.
8.10 The miss rates of LRU, WLRU and OPT replacements on multi-threading SPEC FLT benchmarks.
8.11 The distributions of idle time of WLRU, LRU and OPT replacements of network trace t20kr50.
8.12 The distributions of stay time of WLRU, LRU and OPT replacements of network trace t20kr50.
8.13 The distributions of idle time of WLRU, LRU and OPT replacements of network trace t20kr90.
8.14 The distributions of stay time of WLRU, LRU and OPT replacements of network trace t20kr90.
8.15 The distributions of idle time of WLRU, LRU and OPT replacements of SPEC benchmark crafty.
8.16 The distributions of stay time of WLRU, LRU and OPT replacements of SPEC benchmark crafty.
List of Tables

2.1 Typical miss rates of LRU and Random with different cache sizes and associativities [HP96].
4.1 Names of network traces and their configurations.
4.2 Average reference counts of network traces.
4.3 Percentages of IRG values < 16 and percentages of IRG values ≥ 256 of SPEC benchmarks.
4.4 Percentages of IRG values < 16 and percentages of IRG values ≥ 256 of network traces.
5.1 The IRG values of address tags mapping to set 0 of SPEC benchmark crafty.
5.2 The IRG values of address tags mapping to set 0 of network trace a20kr50.
5.3 Comparison of total cache misses of LRU and weight formulas mimicking LRU.
8.1 The IRG values of address tags with reference count of two in set 0 of SPEC benchmark swim.
8.2 The distribution of victim hit counts of WLRU and LRU replacements on network trace t20kr50.
8.3 The distribution of victim hit counts of WLRU and LRU replacements on SPEC benchmark crafty.
Chapter 1
Introduction
1.1 Background and Motivation
The speed of CPUs is much faster than the speed of the main memory. CPU caches are used
to bridge the speed gap. A CPU cache is a small memory which is usually on the same die as
the CPU [dLJ03]. A CPU cache is much faster than the main memory but much smaller in
size. Instructions and data recently accessed from the main memory are stored in the CPU
cache. When the CPU requests an address, the CPU cache is checked. If the address is found
in the cache, it is called a cache hit; otherwise it is called a cache miss. The proportion of
addresses found in the cache is called the cache hit rate. The difference in access time between
the main memory and the CPU cache is defined as the cache miss penalty. This work assumes that the
cache miss penalty is measured as the number of CPU cycles needed to retrieve the information
from the main memory. For example, if accessing the CPU cache requires only one
CPU cycle but accessing the main memory requires 100 CPU cycles, the cache miss penalty
is 100. Currently, the cache miss penalties of most CPUs are already much more than 100
[Jac03, Tho03, FH05]. Since most CPU caches are smaller than the program image in the
main memory, when the CPU cache is full an existing cache entry must be chosen to be replaced.
A cache replacement algorithm decides the cache entry to be replaced. The most
commonly used CPU cache replacement algorithm is Least Recently Used (LRU) replacement
[PH05]. LRU replacement evicts the cache entry which is least recently accessed. The
use of LRU is based on the assumption that programs exhibit the property of temporal locality,
which is phrased as 'recently accessed items are likely to be accessed in the near future'
[HP96].
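The LRU policy on a single cache set can be sketched in a few lines. This is a toy model to illustrate the eviction rule only, not the hardware mechanism used in real CPUs:

```python
from collections import OrderedDict

class LRUSet:
    """A toy model of one set of an LRU-managed cache."""
    def __init__(self, ways):
        self.ways = ways            # associativity of the set
        self.lines = OrderedDict()  # tag -> data, least recently used first

    def access(self, tag):
        """Return True on a cache hit, False on a miss."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # hit: becomes most recently used
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)   # miss: evict least recently used
        self.lines[tag] = None               # line fill
        return False

s = LRUSet(ways=2)
hits = [s.access(t) for t in ["A", "B", "A", "C", "B"]]
# "A" miss, "B" miss, "A" hit, "C" miss (evicting "B"), "B" miss again
```

The last access illustrates LRU's weakness: "B" was evicted just before it was needed again, because recency alone decided the victim.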
In the past twenty years, the speed of CPUs doubled every 18 months, but the memory speed
increased only 7% each year [HP02]. The speed gap between the CPU and the main memory
keeps widening [1], but the CPU cache hit rate is seldom higher than 99% [HP02]. Assuming
a cache hit rate of 99% and a cache miss penalty of 100, the CPU is idle for 50% of the
time. Currently, mainstream CPU speeds are between 2 and 4 GHz, and the main memory is
clocked between 500 MHz and 800 MHz. Besides the data transfer time, the main memory,
made of DRAM (Dynamic Random Access Memory), also has a large latency. The latency
of DRAM is the delay between the receipt of a read request and the readiness of data
for transfer. The latency of current DDR DRAM memory is at least 90 nanoseconds,
and the total transfer time of a cache line is around 120 ns [2]. Assuming a CPU speed of 1
GHz, the cache miss penalty is 120 CPU cycles. Faster CPU speeds give even larger cache
miss penalties. Faster DRAM technologies help little, since the latency of these faster memories
remains constant, if not longer. The Semiconductor Industry Association (SIA) is
now calculating cache miss penalties of more than 300 CPU cycles [FH05]. If the cache
hit rate cannot be improved, then as the speed gap reaches a certain point, further increasing
CPU speeds will not generate any gain in effective computing power. This is known as the
Memory Wall problem [WM95].
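The idle-time figure above follows from simple average-access-time arithmetic. A back-of-the-envelope check, assuming every cache hit costs one CPU cycle:

```python
def avg_access_cycles(hit_rate, hit_cycles, miss_penalty):
    """Average memory access time in CPU cycles."""
    return hit_rate * hit_cycles + (1 - hit_rate) * miss_penalty

# A 99% hit rate with a 100-cycle miss penalty gives an average
# access of 1.99 cycles, so roughly half of all memory-access time
# is spent stalled waiting for the main memory.
amat = avg_access_cycles(0.99, 1, 100)
stall_fraction = (amat - 1) / amat  # fraction of access time spent waiting
```

With the 300-cycle penalties now projected, the same formula gives an average of nearly 4 cycles per access, i.e. the CPU would stall about 75% of the time at the same 99% hit rate.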
The CPU cache is a dominant factor in computing power. Generally, a larger CPU cache
has higher hit rates. However, there is a limit on the die area available for CPU caches. Recent
processors have already spent 50% of the die area and more than 80% of the transistors on CPU
caches [PHS98]. Larger CPU caches are unlikely unless revolutionary circuit technologies
are used. This suggests that approaches to improving CPU cache performance other than
increasing the size of CPU caches should be examined.

One approach to improving CPU cache performance is to find better cache replacement algorithms.
LRU is currently the most widely used CPU cache replacement algorithm. LRU was developed
decades ago, and current computing environments are very different from that time.
1.2 Contributions
The contributions of this work include the following:
[1] Although CPU speed has stagnated in recent years, faster CPUs are always coming. For example, IBM Power6 is targeted at around 5 GHz. (http://realworldtech.com/page.cfm?ArticleID=RWT101606194731)
[2] Source: www.powerlogix.com/downloads/SDRDDR.pdf
Property of Short Lifetime. This work presents an analysis of the pattern of memory references
of programs. Of special interest is the study of inter-reference gaps (IRGs) and reference
counts of addresses. The reference count of an address is the number of times that
the address is referenced. An inter-reference gap (IRG) is defined as the number of references
between two consecutive references of an address. Per set IRG values are IRG values
of an individual cache set. Our studies find that the majority of per set IRG values are small.
This is especially true at the first-level (L1) cache, where it is found that 90% of the per set
IRG values are of size one. At the level-two (L2) cache, per set IRG values are still small.
This provides strong evidence of temporal locality. However, our studies also show that a
large portion of addresses have low reference counts. At the L2 cache, nearly 50% of all addresses
are referenced only once, and nearly 90% of all addresses are referenced fewer than ten
times. This pattern is named the property of short lifetime. It suggests that LRU is less
effective for programs that have a large portion of their addresses with low reference counts,
since LRU does not distinguish between addresses with low reference counts and addresses
with high reference counts; this turns out to be the case for many networked applications.
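Both quantities can be computed directly from a memory reference trace. A minimal sketch, measuring each gap as the difference in trace positions between two consecutive references of the same address:

```python
from collections import defaultdict

def trace_stats(trace):
    """Compute reference counts and inter-reference gaps (IRGs)
    for each address in a memory reference trace."""
    counts = defaultdict(int)   # address -> reference count
    irgs = defaultdict(list)    # address -> list of IRG values
    last_seen = {}              # address -> trace position of last reference
    for i, addr in enumerate(trace):
        counts[addr] += 1
        if addr in last_seen:
            # gap between this reference and the previous one
            irgs[addr].append(i - last_seen[addr])
        last_seen[addr] = i
    return counts, irgs

counts, irgs = trace_stats(["a", "b", "a", "c", "a", "b"])
# "a" is referenced 3 times with gaps [2, 2]; "c" only once (short lifetime)
```

Run over a real trace, the distribution of `counts` values exposes the property of short lifetime, and the distribution of `irgs` values exposes temporal locality.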
Development of a New Cache Replacement Algorithm. Based on the property of short
lifetime, a new cache replacement algorithm, which is a modification of LRU, was developed.
This new algorithm is referred to as Weighted Least Recently Used (WLRU). Simulations
show that WLRU has significantly fewer cache misses than LRU for network protocols
and applications. For other programs, such as the SPEC benchmark programs, the difference in
the hit rates of WLRU and LRU is unnoticeable. This means the superiority of WLRU over
LRU for network protocols and applications does not harm the performance of traditional
applications like the SPEC benchmarks. WLRU can replace LRU in general purpose CPUs.
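The full WLRU algorithm and its parameter settings are developed later in this work. Purely as an illustration of the general idea of biasing LRU against short-lived lines, one can rank resident lines by a weight that grows on hits and decays with time; the initial weight, hit reward and decay values below are made-up, not the thesis's actual WLRU formulas:

```python
class WeightedSet:
    """Illustrative weighted replacement for one cache set: lines earn
    weight on hits, lose weight as time passes, and the lightest line
    is evicted.  NOTE: the parameter values are hypothetical, chosen
    only to demonstrate the effect, not WLRU's real settings."""
    def __init__(self, ways, init_w=1, hit_w=8, decay=1):
        self.ways, self.init_w, self.hit_w, self.decay = ways, init_w, hit_w, decay
        self.weights = {}  # tag -> current weight

    def access(self, tag):
        # every reference ages all resident lines
        for t in self.weights:
            self.weights[t] = max(0, self.weights[t] - self.decay)
        if tag in self.weights:
            self.weights[tag] += self.hit_w      # reward a hit
            return True
        if len(self.weights) >= self.ways:
            victim = min(self.weights, key=self.weights.get)
            del self.weights[victim]             # evict the lightest line
        self.weights[tag] = self.init_w          # a new line starts light
        return False
```

Under this scheme a once-referenced line never accumulates weight and is evicted ahead of a frequently hit line, which is exactly the behavior the property of short lifetime calls for.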
Example Circuit and Simulator. An example circuit of a CPU cache using WLRU replacement
is presented in this work. The circuit shows that the cost of implementing WLRU
is minimal: WLRU requires less than 3% more space than LRU. A trace-based simulator
was also developed. The simulator implements WLRU, LRU, pseudo-LRU replacements,
and the off-line optimal replacement. The simulator is written in Java and collects bookkeeping
information not found in other simulators. This information is used to investigate the
behavior of different cache replacements and designs.
1.3 Outline of Dissertation
The rest of this work is organized as follows.
Chapter 2 describes related research in cache replacement algorithms. Some background
introduction to CPU cache designs is included. Cache replacements in fields other than CPU
caches, such as database buffer caches, are introduced in chapter 2. Chapter 2 also discusses
previous studies on the impact of cache performance on network protocols and applications.
Chapter 3 discusses the empirical analysis methods used for the study of the memory accesses
of programs and the results of the analysis. The property of short lifetime is introduced
in chapter 3.
Chapter 4 discusses the locality characteristics of network protocols and applications.
Chapter 5 presents a new CPU cache replacement algorithm called the WLRU replacement
algorithm.
Chapter 6 presents an example hardware implementation of a WLRU cache in a CPU. The hardware
cost of implementing WLRU replacement is analyzed and compared with the cost of
implementing LRU replacement in CPU caches.
Chapter 7 describes the design of the CPU cache simulator. The CPU cache simulator in
this work is different from other CPU cache simulators in that its focus is on cache replacement
algorithms. Other unique features include victim analysis and a fast implementation
of the off-line optimal replacement algorithm.
Chapter 8 presents a simulation comparison of the hit rates of the WLRU and LRU replacement
algorithms on the SPEC benchmark programs and on network protocols and applications. Simulation
results for the off-line optimal replacement (OPT) are provided to better understand
the improvement of WLRU over LRU.
Chapter 9 presents conclusions and a plan for future research.
Chapter 2
Background and Related Research
CPU caches have been intensively studied for the last thirty years. This chapter briefly examines
the design issues of current CPU caches and research on the CPU cache performance
of network protocols and applications.
2.1 Background on CPU Caches
This section introduces the basics of CPU cache design.
2.1.1 Memory Hierarchy
Modern CPUs have a hierarchy of memories. A higher level of memory is faster than a
lower level, but the higher level memory is also smaller in size and more expensive.
The highest level or levels of memory are called the CPU cache. Currently, the CPU
cache is on the same die as the CPU execution unit. CPU caches are always made of SRAM
(Static Random Access Memory), while the main memory is made of DRAM (Dynamic Random
Access Memory). Visiting the main memory incurs a long latency, typically around 100 ns,
and fetching the data then costs another 2 ns per word [1]. The time to visit the main memory
is equal to several hundred CPU cycles. CPU caches can reduce this time to a single CPU
cycle, since CPU caches are made of SRAM and are usually on the same die as the CPU
execution units. Figure 2.1 shows a hierarchy of memories. The first and the second levels
of the hierarchy are CPU caches, and the third level is the main memory. The fourth level
of the hierarchy is the virtual memory on disk storage. CPU caches contain a subset of
the main memory.

[1] Source: www.powerlogix.com/downloads/SDRDDR.pdf
Figure 2.1: The structure of a four-level memory hierarchy (L1 and L2 caches in SRAM, main memory in DRAM, virtual memory on disk).
2.1.2 Cache Lines and Cache Hits
The unit of transfer of data between the CPU execution unit and the cache is a word. Data
transfer between the cache and the main memory is multiple memory words. This takes
advantage of the spatial locality principle in that if one memory location is read then nearby
memory locations are likely to be read [HP96]. Thus CPU caches are organized into cache
lines where each cache line consists of the words read in a single transfer of data between the
main memory and CPU cache. A cache line (depicted in Figure 2.2) consists of an address
tag, status bits and data from the main memory. Transfer of more than one word also has
advantages with respect to the memory bandwidth. The latency of visiting the main memory
is amortized among multiple words. Cache lines also save space since multiple words share
an address tag.
Part of the address of a main memory word is referred to as the address tag. When
the CPU references a main memory word, the address tag part of the address of the main
memory word is extracted. The tag of each cache line is compared with the address tag
of the memory word being referenced by the CPU. If there is a match between the tag of a
cache line and the address tag of the word, there is said to be a cache hit; otherwise it
is a cache miss. In the case of a cache hit, the referenced word is accessed directly from
the cache, avoiding the latency of retrieving the memory word from the main memory. In
the case of a cache miss, the referenced word is accessed from the main memory. There are
two status bits in a cache line. The valid status bit is used to indicate that a cache line is not
empty. The dirty status bit is set when the data in a cache line changes.
Figure 2.2: The structure of a CPU cache line (address tag, valid bit, dirty bit, and data).
2.1.3 Set Associative Caches
An important design aspect of CPU caches is determining where in the cache data retrieved
from the main memory can be placed. If a main memory word can only be placed in a
single cache location, the cache is called a direct-mapped cache. Figure 2.3(a) shows
the mapping of a main memory word into a direct-mapped cache. The direct-mapped cache
has m cache lines, and the lowest log2(m) bits of the address are used to map a main memory
word into the cache. When deciding cache hits or misses, a direct-mapped cache only needs
to compare a single address tag. Thus direct-mapped caches are fast. The problem with
a direct-mapped cache is that it incurs more cache misses, which can be illustrated with the
following example. Suppose a program generates a series of memory references such as the
following: 0x1CD, 0x3CD, 0x1CD, 0x3CD. Both memory words are mapped to
the same cache line, so this sequence of references causes a continuous stream of evictions
and replacements of cache lines. Thus, direct-mapped caches are fast but also have lower
hit rates. Studies [Prz90, HS89] found that reducing associativity from two-way to direct-mapped
increases the miss rate by 25%.
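The conflict in this example is easy to reproduce. Assuming, for illustration, a 512-line direct-mapped cache so that the low nine address bits select the line:

```python
def dm_index(addr, num_lines):
    """Direct-mapped placement: the lowest log2(num_lines) address bits."""
    return addr % num_lines

# 0x1CD and 0x3CD differ only above the index bits, so in a 512-line
# direct-mapped cache both select the same cache line.
resident = None   # the word currently held in that line
misses = 0
for addr in [0x1CD, 0x3CD, 0x1CD, 0x3CD]:
    if resident != addr:
        misses += 1        # conflict miss: the other word evicted us
        resident = addr
# every reference misses, even though only two words are in play
```

With two-way associativity the same two words would share a set and, after the first two compulsory misses, every later reference would hit.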
Another approach is to allow a unit of data transfer to be placed in any one of the cache
lines in the cache. This is called a fully-associative cache. Replacement of data in a cache
line only occurs when the entire cache has filled up. The replacement algorithm in a fully-associative
cache can replace any cache line in the cache with the incoming cache line.
Fully-associative caches are believed to have the highest hit rates [HR00] for a large number
of replacement algorithms. The address tag is compared in parallel with the tags of all
the cache lines in order to retrieve the data quickly. However, a CPU cache typically consists
of hundreds of thousands of cache lines, and the circuitry needed to do the parallel comparison
of all tags is expensive. Thus, except for some very small caches, no CPU caches are fully
associative [PH05].
A set associative cache combines concepts from the direct-mapped cache and the fully-associative
cache. Cache lines are organized into cache sets. A main memory word maps to exactly one
cache set but may be placed in any of the cache lines in that set. Essentially this means that a
memory word can only be placed in a subset of the cache. A main memory address is divided
into three fields: an address tag, a set index and a block offset. The set index field selects the
cache set, the address tag uniquely identifies the memory word, and the block offset locates
the word within the cache line. A set associative cache becomes a fully associative cache
when the cache has only one cache set.
The number of cache lines in a cache set is referred to as the associativity of the cache. For
example, if there are four cache lines in a cache set, the associativity is four, and the cache is
called a four-way set associative cache. Figure 2.3(b) shows the mapping of a main memory
word into a two-way set associative cache. The cache has the same number m of cache lines
but is arranged into m/2 cache sets. Each main memory word has two possible locations in
the cache. The lowest log₂ (m/2) bits of the address are used to map the word into the cache.
Figure 2.4 shows the structure of an eight-way set associative CPU cache. The cache has 1024
cache sets. Each cache set has eight cache lines, and each cache line stores eight words. The
address the CPU is currently referencing is stored in the address latch. The middle ten bits of
the address latch select one of the 1024 cache sets. The lowest three bits are the block offset
used to index into the eight words of a cache line. The highest 19 bits form the address tag.
All eight address tags of the selected cache set are compared with the address tag in the address
latch. If there is a match, the hit/miss signal indicates a cache hit; otherwise it indicates a miss.
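The address decomposition for this eight-way cache can be sketched as follows. The field widths (3 offset bits, 10 set-index bits, 19 tag bits for a 32-bit word address) come from the description above; the function name and the sample address are illustrative.

```python
# Sketch: splitting a 32-bit word address for the eight-way set
# associative cache of Figure 2.4 (1024 sets, 8 words per line).
NUM_SETS = 1024          # -> 10 set-index bits
WORDS_PER_LINE = 8       # -> 3 block-offset bits

def split_address(addr: int) -> tuple[int, int, int]:
    """Return (tag, set_index, block_offset) for a word address."""
    offset = addr & (WORDS_PER_LINE - 1)          # lowest 3 bits
    set_index = (addr >> 3) & (NUM_SETS - 1)      # middle 10 bits
    tag = addr >> 13                              # highest 19 bits
    return tag, set_index, offset

print(split_address(0x12345))  # (9, 104, 5)
```

On a lookup, only the eight tags stored in set `set_index` need to be compared with `tag`, which is the comparator step shown in Figure 2.4.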
2.1.4 Multiple Level CPU Caches
Modern CPUs usually have a hierarchy of caches. Most current CPUs have two levels of
CPU caches. The first level cache is called the L1 cache and can be accessed in one or two
cycles. Gate delay and wire delay limit the size of the L1 cache. Typically, the L1 cache is
only 32KB or 64KB. The same speed constraint also limits the associativity of the L1 cache.
To achieve high speed, the L1 cache may be direct-mapped.
The second level cache is called the L2 cache. L2 caches are typically accessed in around ten
CPU cycles and are much larger than L1 caches [PH05]. The L2 cache usually has higher
associativity. L2 caches can be 16 or 32 way associative. Besides the L1 and the L2 caches,
some CPUs have a level three cache. L3 caches are slower and larger than L1 and L2 caches.
[Figure omitted: (a) a direct-mapped cache with m lines, where main memory word n maps to exactly one line; (b) a two-way set associative cache with m/2 sets, where word n may be placed in either line of one set.]
Figure 2.3: The mapping of main memory words into a direct-mapped cache and a two-way set associative cache.
Currently, both the L1 and the L2 caches use LRU replacement. LRU replacement at the L2
and L3 caches is actually least recently missed replacement. The references at the L1 cache
are invisible to the lower level caches, so the most recently referenced or loaded address at
the L2 or L3 cache is not necessarily the address the CPU most recently referenced but the
address most recently missed in the higher level cache. LRU at the L2 cache therefore does
not exactly follow the definition of temporal locality [PHS98]. The hit rates of LRU
replacement at the L2 or L3 cache are low, which is considered by [PHS98] to be a problem
of LRU.
[Figure omitted: an address latch split into tag, set index and block offset fields; a decoder selects one of 1024 cache sets (set 0 to set 1023); each set holds eight lines (Line0 to Line7), each with a TAG and DATA field; a comparator matches the incoming tag against the stored tags and drives a hit/miss signal to the CPU.]
Figure 2.4: The structure of an eight-way set associative cache.
2.2 Efforts to Improve Cache Hit Rates
CPU cache hit rates are critical to the performance of CPUs. Many efforts have been made
to improve the CPU cache hit rates. This section briefly describes some of the efforts to
improve the CPU cache hit rate.
2.2.1 Cache Line Size, Prefetching and Stream Buffer
A study [Smi87] found that the optimal cache line size is between 16 and 64 bytes. This
conclusion may no longer be accurate. Due to the advances in CPU speed and the almost
constant memory latency, larger cache lines are favored. The slight success of the always-fetch
pre-fetch policy suggests that larger cache lines should be favored [Smi82]. In always-fetch
pre-fetching, the CPU cache always reads in the cache line next to the cache line that contains
the word the CPU is currently referencing. Always-fetch shows a slight but noticeable and
consistent improvement in cache hit rates. Studies [SS01] of the cache performance of
multimedia workloads suggest that larger cache lines, for example 128 bytes or more, have
better performance for multimedia applications, since audio and video data come in large
packets.
However, larger cache lines also increase cache pollution. Cache pollution [Smi82]
refers to the effect that useful cache contents are flushed out by content that is almost never
re-used. The larger the cache line, the higher the chance of cache pollution. This has
a negative impact on multi-programming workloads [LP98, VL00, Smi82, Hig90]. In
multi-programming environments, larger cache lines cause the cache content of one process
to be flushed more quickly by the other processes, introducing extra cache misses.
The advantage of larger cache lines is that they provide better memory bandwidth. One
approach that allows the use of larger cache lines but reduces their cache pollution is to
separate transfer sizes from cache line sizes [Prz90]. Small cache line sizes are used inside
the cache, but on cache misses, multiple adjacent cache lines are transferred together. Work
described in [Jou98] suggests that the extra cache lines fetched should be stored in a small
special on-chip buffer instead of in the cache, thus reducing the cache pollution. If a cache
line stored in the buffer is referenced, it is then put into the cache. This buffer is called the
stream buffer. Studies [TFMP95, RTT+98, VL00, PCD+01, BAYT01, KS02, CH02] show
that the stream buffer works well. The possibility of using stream buffers to replace the
entire L2 cache is discussed in [PK94]: a stream buffer with only 10 entries achieves more
than 50% hit rates and in many cases is comparable to an L2 cache.
Different transfer sizes and a fixed small cache line size have been adopted by adaptive
cache systems [VTG+99, JmWH97, KBK02, LKW02]. One of the main purposes of these
adaptive cache systems is to provide better service for multimedia workloads [LKW02], as
multimedia data requires large cache lines [SS01].
The problem with cache line sizes is that there is no single fixed optimal cache line size for
all applications, since the probability of visiting adjacent words varies among applications.
This leads to an adaptive cache line size approach [VTG+99], where the cache line size is
dynamically increased or decreased depending on whether the recorded probability of
visiting the adjacent lines is high or low enough. Initially, the cache line size is large. The
cache line size then halves or doubles to approach the optimal size. The use of adaptive line
sizes showed great improvement both in hit rates and memory traffic. For applications that
need smaller cache lines, the decrease in memory traffic is as much as 50% relative to a fixed
32-byte line size and even more for larger fixed line sizes. These results can be viewed as
evidence favoring smaller line sizes. Due to the complexity and high cost of implementing
variable size cache lines, adaptive cache lines are seldom used in CPU caches.
False sharing in multiprocessors suggests that smaller cache lines are preferable [EK89].
False sharing refers to the situation where two or more processors do not refer to the same
addresses, but the addresses visited happen to be in the same cache line. The larger the
cache line, the higher the probability of false sharing. On the other hand, other work [RSG93]
suggests that false sharing is not a problem, since multiple processors seldom share data at
the granularity of a cache line. However, for multi-threading, which operates in the same
address space and has a high frequency of sharing, false sharing is an issue [EK89]. Studies
by [Tor94] show that the poor spatial locality of the shared data has an even larger impact
on CPU cache performance.
The key point in deciding the optimal line size is whether the predicted probability of
visiting the adjacent addresses is high enough to compensate for the latency of reading many
words and for the cache pollution. This is related to the accuracy of the cache replacement
algorithm. A better cache replacement algorithm, one that can tolerate cache pollution, will
benefit from larger cache line sizes.
The cache line size is important because it is actually the simplest and most efficient form
of cache pre-fetching. Many studies have tried to get the most out of the cache line size
design [RBC02, CMT94, PH90a, CHL99, CDL99]. For example, the work in [CHL99,
CDL99] packs data structure elements likely to be accessed together into a cache line.
However, this requires the programmer to modify the code. Section 2.4 introduces a code
placement method [MPBO96] that simply puts frequently used code and data in the same
cache lines.
2.2.2 Cache Sizes and Hit Rates
The CPU cache size is the most important factor in cache hit rates. Larger caches are
certainly better. However, there is no simple way to calculate the cache hit rate for a given
cache size, since the hit rate varies with workloads. In the 1970s, many formulas were
developed to express the cache hit rate as a function of cache size [Smi82]. These formulas
seldom had any real value in estimating cache performance since the workloads used were
not representative of real workloads. A formula was valid for the workload from which it
was derived but was usually not valid for other workloads.
In 1968, a working set model was proposed in [Den68]. The working set is defined as the
subset of the program image currently in active use. The assumption is that if the cache size
is smaller than the working set, there will be a large number of cache misses. The working
set model predicts that the cache size must be larger than some threshold, which is the size
of the working set, and that increasing the cache size beyond the threshold will only generate
marginal gains. For years, the working set behavior, or more accurately the threshold
behavior of cache sizes, has been widely observed [Smi82].
A study in 1983 [Goo83] claimed that small caches are surprisingly effective. The
effectiveness of small caches is attributed to the small size of the working sets of applications
at that time. This may have been true then, but it may no longer be true for applications
today. Current applications such as databases and network based applications are huge,
thousands of times larger than applications of 20 years ago. As a general rule, the miss rate
of a current cache is still around 1% regardless of the cache replacement algorithm used
[HP96].
A set of well known rules of thumb for analyzing cache performance [Hig90] includes the
following rule for cache sizes:
doubling the cache size decreases the miss rate by 25%.
Larger cache sizes are also beneficial in multi-programming environments, where several
independent streams of execution compete for the limited cache space and cause extra cache
misses; it is possible for larger caches to accommodate them all. However, up to now, no
known study quantitatively relates cache sizes to multi-programming performance.
2.2.3 Cache Associativity and Victim Cache
The associativity of a cache also has an important impact on cache performance. One rule
of thumb [HP96] states that ‘the miss rate of a direct-mapped cache of size N is about the
same as a two-way set-associative cache of size N/2.’ Higher associativities are believed
to have better hit rates but come with a cost. Before CMOS (Complementary Metal Oxide
Semiconductor) technology, the cost of doubling cache associativity was almost the same as
doubling the cache size [Hil88].
In a direct mapped cache, one main memory word can go to only one cache line. The only
address comparison is between the cache line tag and the address tag of the word the CPU is
referencing. There is no need for a replacement algorithm. For set associative caches, more
comparisons between the cache line tags and the address tag of the memory word the CPU
is referencing are needed, and a replacement algorithm is required to determine the cache
line to be replaced. The Least Recently Used (LRU) replacement is typically used. LRU
can be effectively implemented in hardware [Smi82].
Direct mapped and set associative caches cause conflict misses. Conflict misses are cache
misses that would not happen in a fully associative cache. They are caused by the limited
replacement choices within a single cache set. Conflict misses typically account for 20% to
40% of all misses of a direct mapped cache [Smi82].
Fully associative caches avoid conflict misses but come with a high cost. A fully associative
cache requires comparing the tags of all cache lines with the address tag of the memory word
the CPU is referencing, and the circuitry to do all these comparisons in parallel is expensive.
An alternative is a software based fully associative cache [HR00]. Modern CPUs seldom
have fully associative caches or caches of higher than 32-way associativity. For example,
the Pentium 4 has a four-way set associative L1 cache and an eight-way L2 cache [Int04].
It is observed [Smi82] that the optimal associativity is four to eight; beyond eight, the miss
rate decreases very little, if at all. One study [HS89] provides quantitative values for how
the miss rate changes with associativity. It shows that when the associativity decreases from
eight to four, from four to two, and from two to direct-mapped, the increases in miss rates
are about 5, 10 and 30 percent respectively. This conforms with other studies [PHH88].
Direct-mapped caches are fast but also have high miss rates [Hil88]. One approach to having
the speed benefits of direct-mapped caches while keeping miss rates comparable to set-
associative caches is suggested in [Jou98]. A small fully associative cache called the victim
cache is attached to a direct-mapped cache. When a line is evicted from the direct-mapped
cache, it is temporarily placed in the victim cache. The victim cache uses LRU replacement.
The victim cache is usually only one to five entries in size but can provide good results. For
example, a five-entry victim cache can remove an average of 50%, and in some cases 90%,
of the conflict misses.
2.2.4 Split Instruction and Data Cache
Instructions and data can be stored in different caches. Separating the instruction and data
caches has many advantages [Smi82]. First, it permits parallel access to the instruction and
data caches by CPU pipelines. Second, separate instruction and data caches are smaller, can
thus be placed near the execution units, and therefore have higher access speeds. Finally,
instructions and data display different characteristics of locality [Prz90, FTP94, ASW+93,
Smi82]; split instruction and data caches reduce possible interference between them.
Currently, instruction and data caches are often of the same size. However, this decision is
made more for simplicity than based on research. One study showed evidence that a
balanced cache should have a larger instruction cache [FTP94]. Another study analyzed
optimal replacement and also suggested that larger instruction caches are appropriate
[ASW+93].
2.3 Cache Replacement Algorithms Other Than LRU
LRU is the most widely used cache replacement algorithm. LRU is based on the property
of temporal locality [Smi82]. Temporal locality [Smi82] refers to the observation that ‘the
information that will be used in the near future is likely to be in use already.’ A corollary of
temporal locality is that recently used blocks are likely to be used again. Thus the best
candidate cache line for eviction is the least recently used block [HP96]. The LRU
replacement algorithm works as a stack and always puts the just referenced cache line on
top of the stack. In CPU caches, LRU can be efficiently implemented with a small number
of bits since the associativities of CPU caches are very limited [Smi82]. For example, in a
two-way associative cache, only one bit is required. This section discusses other cache
replacement algorithms.
2.3.1 Pseudo-LRU Replacements
LRU keeps all cache entries in the order of their last reference time. To keep a strict LRU
order, a relatively large number of bits is needed [Tan87]. For example, for an eight-way
associative cache, 8 × 8 = 64 bits are needed to keep all eight cache lines in LRU order. For
a higher associativity, for example 16-way, LRU requires 16 × 16 = 256 bits. Figure 2.5
shows the space needed to implement LRU on an eight-way associative cache set.
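One classic hardware scheme behind this 8 × 8 = 64-bit cost is the LRU reference matrix: on an access to line i, row i is set to all ones and column i is cleared, so the least recently used line is the one whose row is all zeros. This is a standard technique rather than a detail given here; a simulation sketch with illustrative names:

```python
class MatrixLRU:
    """Matrix (reference-bit) LRU for one cache set of n lines."""
    def __init__(self, n: int):
        self.n = n
        self.m = [[0] * n for _ in range(n)]  # n*n bits, e.g. 64 for n=8

    def touch(self, i: int) -> None:
        # Set row i to all ones, then clear column i: line i is now
        # marked "newer" than every other line.
        for j in range(self.n):
            self.m[i][j] = 1
        for j in range(self.n):
            self.m[j][i] = 0

    def victim(self) -> int:
        # The least recently used line is the one whose row has the
        # fewest ones (all zeros once every line has been touched).
        return min(range(self.n), key=lambda i: sum(self.m[i]))

lru = MatrixLRU(4)
for line in [0, 1, 2, 3, 1]:
    lru.touch(line)
print(lru.victim())  # 0: line 0 is the least recently used
```

The update is a simple row-set and column-clear, which is why true LRU is cheap at low associativity but costs n² bits per set as associativity grows.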
To save space, Pseudo-LRU replacement algorithms have been proposed. Two kinds of
Pseudo-LRU replacement are widely used, Pseudo-LRU-tree (PLRU-tree) and Pseudo-LRU-msb
(PLRU-msb). In Pseudo-LRU replacement algorithms, the LRU order of cache lines is only
approximately kept. For example, in PLRU-tree, only the just referenced line is accurately
recorded, and the order of the other cache lines is not precise. At the cost of precision,
Pseudo-LRU replacement algorithms need fewer bits for replacement decision making. For
an eight-way associative cache, PLRU-tree needs seven bits and PLRU-msb needs eight bits,
roughly one bit per cache line. Figures 2.6 and 2.7 show the space arrangement of the
PLRU-tree and PLRU-msb replacements respectively.
One study [AZMM04] shows that pseudo-LRU replacement algorithms do not significantly
impact the hit rate.
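For an eight-way set, the seven PLRU-tree bits form a binary tree of internal nodes, each bit pointing toward the half of its subtree that was less recently used. A sketch of this scheme (the class, method names, and the 0/1 convention are illustrative; real designs vary):

```python
class PLRUTree:
    """Tree pseudo-LRU for an 8-way set: 7 internal-node bits.
    bits[k] = 1 means 'the left subtree of node k is older' here."""
    def __init__(self):
        self.bits = [0] * 7  # heap layout: node k has children 2k+1, 2k+2

    def touch(self, way: int) -> None:
        # Walk from the root toward the accessed way, pointing each
        # node bit AWAY from the path just taken.
        node = 0
        for level in (4, 2, 1):                    # half-widths in ways
            right = (way % (2 * level)) >= level
            self.bits[node] = 0 if right else 1    # point at the other side
            node = 2 * node + (2 if right else 1)

    def victim(self) -> int:
        # Follow the bits from the root; they lead to a pseudo-LRU way.
        node, way, level = 0, 0, 4
        while level >= 1:
            if self.bits[node]:
                way += level
                node = 2 * node + 2
            else:
                node = 2 * node + 1
            level //= 2
        return way

p = PLRUTree()
for w in range(8):
    p.touch(w)
print(p.victim())  # 0: way 0 is (approximately) least recently used
```

Only the path of three bits is updated per access, which is why PLRU-tree tracks the just-referenced line exactly but keeps only an approximate order for the rest.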
[Figure omitted: an eight-way cache set (Line 0 to Line 7), each line holding tag + data + status (19 + 256 + 2 = 277 bits), plus the LRU matrix bits needed for true LRU.]
Figure 2.5: Storage arrangement of an eight-way associative cache set using real LRU replacement.
2.3.2 First-In-First-Out and Random Replacements
The majority of CPU caches use the Least Recently Used (LRU) replacement or its variants.
Other frequently mentioned but seldom used CPU cache replacement algorithms include
Least Frequently Used (LFU), First In First Out (FIFO), and Random.
FIFO evicts the oldest cache line, even if the oldest line was just visited. Random does not
exhibit any pattern in selecting a cache line to evict. These two replacement algorithms are
considered inferior to LRU, but the differences in hit rates between these replacement
algorithms and LRU are small, except for LFU. Random replacement is less than one percent
worse than LRU in CPU cache hit rates [HP96]. Table 2.1 compares the miss rates of the
Random and LRU replacement algorithms.
[Figure omitted: an eight-way cache set (Line 0 to Line 7), each line holding tag + data + status (19 + 256 + 2 = 277 bits), plus the seven PLRU-tree bits.]
Figure 2.6: Storage arrangement of an eight-way associative cache set using PLRU-tree replacement.
            Two-way          Four-way         Eight-way
Size        LRU     Random   LRU     Random   LRU     Random
16 KB       5.18%   5.69%    4.67%   5.29%    4.39%   4.96%
64 KB       1.88%   2.01%    1.54%   1.66%    1.39%   1.53%
256 KB      1.15%   1.17%    1.13%   1.13%    1.12%   1.12%

Table 2.1: Typical miss rates of LRU and Random with different cache sizes and associativities [HP96].
2.3.3 LRU-k and LIRS Replacement Algorithms
In the field of database disk buffer caches, many replacement algorithms other than LRU
and LFU have been proposed. In database buffer caches, the locality of reference is less
dominant than in CPU caches [OOW93]. The LRU replacement algorithm is found incapable
of handling some scenarios, and modifications of LRU and LFU have been developed. These
include LRU-k [OOW93], LIRS [JZ02], FBR [RD90], LRFU [LCK+01], 2Q [JS94] and
Multi-Queue [ZCL04]. In the field of virtual memory page buffers, a modification of LRU
called EELRU [SKW99] has been proposed.
One modification of LRU is LRU-k [OOW93]. LRU-k replaces the cache item whose kth
most recent reference is the oldest; the age of this kth most recent reference is called the
backward k-distance. If a cache
[Figure omitted: an eight-way cache set (Line 0 to Line 7), each line holding tag + data + status (19 + 256 + 2 = 277 bits), plus the per-line MSB bits.]
Figure 2.7: Storage arrangement of an eight-way associative cache set using PLRU-msb replacement.
item has not yet been referenced k times, its backward k-distance is defined as infinite. The
cache item with the largest backward k-distance is replaced. The LRU replacement
algorithm can be viewed as a special case of LRU-k where k equals one. The work in
[OOW93] suggests that for most cases LRU-2 is sufficient: LRU-2 shows no difference
from LRU-3, but LRU-2 has higher hit rates than LRU for database traces. Experiments in
[OOW93] show that LRU-2 usually requires smaller cache sizes than LRU to reach a given
hit rate.
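The eviction rule just described can be sketched directly: each block keeps the times of its past references, and the victim is the block whose kth most recent reference is oldest, with blocks referenced fewer than k times treated as infinitely distant (the function name and the tie-break for such blocks are illustrative assumptions):

```python
def lru_k_victim(history: dict[str, list[int]], k: int = 2) -> str:
    """history maps block id -> reference times (ascending).
    Returns the block to evict under LRU-k."""
    def key(block: str):
        times = history[block]
        if len(times) < k:
            # Backward k-distance is infinite; prefer evicting these,
            # oldest last reference first (one common tie-break).
            return (0, times[-1])
        return (1, times[-k])            # older kth-last reference first
    return min(history, key=key)

# A and B were referenced twice, C only once: C has an infinite
# backward 2-distance and is evicted first.
h = {"A": [1, 5], "B": [2, 6], "C": [4]}
print(lru_k_victim(h))  # C
```

With k = 1 the rule degenerates to plain LRU, matching the special-case relationship noted above.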
A buffer cache replacement algorithm called LIRS [JZ02] was developed to deal with the
following scenarios where LRU fails:
• Sequential scans, which are bursts of references to infrequently used blocks;
• Cyclic (loop-like) accesses, where the interval of the loop access may be larger than the
cache size;
• Multi-user applications, which exhibit a property similar to cyclic accesses but caused
by massive independent user inputs.
In these scenarios, the just referenced item will not be hit again in a short time. LIRS divides
addresses into two groups, a good cacheable group and a bad cacheable group, and makes
its replacement decisions based on the number of blocks referenced between two consecutive
references to the same block. LIRS assumes that future references will likely have the same
scale of distances. LIRS calls the number of other blocks referenced between the last two
references of a block the IRR value of the block. LIRS records the IRR value of each block
and divides all blocks into two groups: blocks with high IRR (HIR) values are in the bad
group, and blocks with low IRR (LIR) values are in the good group. The cache is also
partitioned into two parts, one for the HIR blocks and the other for the LIR blocks. The LIR
partition is the larger one, receiving 99% of the cache size, while only one percent of the
cache size is given to the HIR partition. When replacing a cache block, one entry in the HIR
partition is evicted, and the LIR partition remains intact. Blocks are moved between the LIR
and HIR partitions: when a block in the HIR part is hit, its IRR value is recalculated, and if
the new IRR value is smaller than that of the maximum-IRR block in the LIR part, the two
blocks are exchanged.
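The IRR of a block, as defined above, can be computed from a reference trace by counting the distinct other blocks seen between its last two references (a sketch; the function name is illustrative):

```python
def irr(trace: list[str], block: str) -> float:
    """Number of distinct other blocks between the last two
    references to `block`; infinite if referenced fewer than twice."""
    positions = [i for i, b in enumerate(trace) if b == block]
    if len(positions) < 2:
        return float("inf")
    between = trace[positions[-2] + 1 : positions[-1]]
    return len(set(between))

t = ["A", "B", "C", "B", "A", "D", "A"]
print(irr(t, "A"))  # 1: only D appears between the last two A's
print(irr(t, "C"))  # inf: C was referenced once
```

In this trace A would be a LIR (good) block and C a HIR (bad) one, so a scan of once-touched blocks like C can only displace the small HIR partition.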
In LIRS, if a block is referenced only once, its IRR value is defined as infinite, just as in
LRU-k a cache line that has not been referenced k times has its backward k-distance set to
infinite. LIRS and LRU-k are not suitable for CPU caches: they are too expensive to
implement in hardware, and due to the heavy locality of reference seen by CPU caches, they
may not be able to outperform LRU.
2.3.4 LFU and FBR Replacement Algorithms
In the Least Frequently Used (LFU) replacement algorithm, the reference count of each
address is recorded. When replacing, the address with the minimal reference count is evicted.
Addresses which are not hit immediately after being brought into the cache are replaced
quickly by LFU. However, LFU tends to keep some addresses in the cache too long. If some
addresses are referenced heavily only in a specific period and never again, these addresses
will be fixed in the cache for a long time by LFU. This effectively reduces the usable cache
size and results in poor LFU hit rates. Variants of LFU use aging mechanisms that reduce
the reference counts of addresses periodically.
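A small sketch of LFU with periodic aging, as just described (the class name, halving rule, and aging interval are illustrative assumptions, not taken from any specific LFU variant):

```python
class AgedLFU:
    """LFU with periodic halving of reference counts ('aging')."""
    def __init__(self, capacity: int, age_every: int = 100):
        self.capacity, self.age_every = capacity, age_every
        self.counts: dict[str, int] = {}
        self.ticks = 0

    def access(self, addr: str) -> None:
        self.ticks += 1
        if addr not in self.counts and len(self.counts) >= self.capacity:
            victim = min(self.counts, key=self.counts.get)  # least frequent
            del self.counts[victim]
        self.counts[addr] = self.counts.get(addr, 0) + 1
        if self.ticks % self.age_every == 0:
            # Aging: halve every count so once-hot, now-idle blocks decay.
            self.counts = {a: c // 2 for a, c in self.counts.items()}

cache = AgedLFU(capacity=2)
for a in ["A", "A", "A", "B", "C"]:
    cache.access(a)
print(sorted(cache.counts))  # ['A', 'C']: B, with count 1, was evicted
```

Without the halving step, a block like A would keep its high count forever even if never referenced again, which is exactly the pathology described above.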
The Frequency Based Replacement (FBR) algorithm is a modification of the LFU
replacement algorithm that is also used in database buffer caches [RD90]. FBR differs from
LFU in that ‘locality is factored out.’ In FBR, blocks are kept in LRU order, and the top
portion of the LRU stack is defined as the new sector. The reference count of a block is not
increased if the block is hit in the new sector. By doing this, the references within a burst to
a block are not counted, and the locality
is factored out.
The Least Frequently Used replacement algorithm and its variants are never used in CPU
caches, since CPU caches are too small compared with the main memory. LFU and its
variants keep previously frequently visited items too long; if used in CPU caches, their hit
rates would be very low.
2.3.5 LRFU, Multi-Queue and EELRU Replacement Algorithms
The Least Recently/Frequently Used (LRFU) replacement algorithm [LCK+01] uses
numerical values to represent the replacement priority of cache lines. LRFU was proposed
for database buffer caches and is designed to subsume both LRU and LFU. LRFU assigns
a value called the Combined Recency and Frequency (CRF) to each cache block, and when
replacing a block, the block with the minimal CRF value is chosen. The CRF value is
calculated by summing up the weights of the references to a block. Every reference is
assigned a weight based on the time span from the point of reference up to the current time.
Newer references have higher weights, so the function used to calculate the weight must be
a decreasing function. All these weights are summed up, and the sum is the CRF value of
the block. Depending on how the weight is calculated, the CRF value can be a real number.
For example, LRFU mimics LRU by using an exponential weighting function with values
smaller than one. A shortcoming of LRFU is that the calculation of weights is very complex
and costly, since the entire reference history of a block is needed. A further drawback is that
the CRF values of all blocks, and thus the weights of all past references, have to be calculated
again for every new reference. For example, suppose a block was referenced at times 1, 2, 5
and 8, and the current time is 10. The CRF value C of the block at time 10 is calculated as
C = f(10 − 1) + f(10 − 2) + f(10 − 5) + f(10 − 8) = f(9) + f(8) + f(5) + f(2). When the
current time is 12, the CRF value is recalculated as C = f(11) + f(10) + f(7) + f(4). To
simplify the calculation, functions where f(x + y) = f(x) ∗ f(y) should be used.
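Why the condition f(x + y) = f(x) ∗ f(y) helps: with such a function (an exponential is the natural choice), the CRF at a new reference can be obtained by decaying the previous sum and adding f(0), instead of re-summing the whole history. A sketch using the example times above (the decay constant and function names are illustrative):

```python
def f(x: float, lam: float = 0.1) -> float:
    # Exponential weight: satisfies f(x + y) == f(x) * f(y).
    return 0.5 ** (lam * x)

def crf_full(ref_times: list[float], now: float) -> float:
    # Definition: sum the weight of every past reference.
    return sum(f(now - t) for t in ref_times)

def crf_step(prev_crf: float, prev_time: float, now: float) -> float:
    # Incremental update at a new reference: decay the old sum, add f(0).
    return prev_crf * f(now - prev_time) + f(0)

refs = [1, 2, 5, 8]
c = 0.0
for i, t in enumerate(refs):
    c = crf_step(c, refs[i - 1] if i else t, t)
assert abs(c - crf_full(refs, now=8)) < 1e-9  # both methods agree
```

The incremental form needs only the previous CRF value and the time of the previous reference per block, removing the need to store the entire reference history.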
Early Eviction LRU (EELRU) addresses LRU's failure to predict cyclic references [SKW99].
EELRU uses the LRU order for replacement most of the time. When it detects that recently
evicted pages are being referenced again after a short time, it evicts the eth most recently
referenced page instead of the least recently used one, where e is a pre-determined recency
position.
Another replacement algorithm for database buffer caches is Multi-Queue replacement for
second level buffer caches [ZCL04]. Multi-Queue is a further development of the 2Q
replacement. Based on the belief that LRU-2 is too expensive to implement, the 2Q
replacement was developed [JS94]. In 2Q, one FIFO queue A1in and two LRU lists A1out
and Am are used. Blocks are first put into A1in and, when evicted, their identifiers are put
into A1out. When a block is referenced while in A1out, it is moved to Am. Multi-Queue
further develops this idea and changes the Am queue into many ranked LRU queues. Blocks
referenced in a lower queue are promoted to higher ranked queues; a function controls how
many references place a block in which rank of queue. When replacing, entries from the
lower ranked queues are evicted first. Multi-Queue was found to outperform FBR, 2Q,
LRFU, LRU-2, LRU, and LFU in L2 database buffer caches [ZCL04].
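The 2Q mechanism described above can be sketched as follows (the class name and queue sizes are illustrative; the actual algorithm also tunes the A1in/A1out thresholds):

```python
from collections import OrderedDict, deque

class TwoQ:
    """Simplified 2Q sketch: A1in (FIFO), A1out (ghost ids), Am (LRU)."""
    def __init__(self, a1in_size=2, a1out_size=2, am_size=2):
        self.a1in = deque()      # recently admitted blocks
        self.a1out = deque()     # identifiers of blocks evicted from A1in
        self.am = OrderedDict()  # "hot" blocks, kept in LRU order
        self.sizes = (a1in_size, a1out_size, am_size)

    def access(self, b):
        in_sz, out_sz, am_sz = self.sizes
        if b in self.am:                    # hit in Am: refresh LRU position
            self.am.move_to_end(b)
        elif b in self.a1out:               # re-reference after eviction:
            self.a1out.remove(b)            # promote into Am
            if len(self.am) >= am_sz:
                self.am.popitem(last=False)
            self.am[b] = True
        elif b not in self.a1in:            # cold miss: admit into A1in
            if len(self.a1in) >= in_sz:
                old = self.a1in.popleft()   # evict oldest, remember its id
                self.a1out.append(old)
                if len(self.a1out) > out_sz:
                    self.a1out.popleft()
            self.a1in.append(b)

q = TwoQ()
for b in ["A", "B", "C", "A"]:  # A falls out of A1in, then is re-referenced
    q.access(b)
print("A" in q.am)  # True: the repeat reference promoted A into Am
```

Only blocks that are referenced again after leaving A1in reach Am, which is how 2Q keeps one-shot scan blocks out of the hot part of the cache; Multi-Queue refines Am into several ranked queues.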
2.3.6 Dead Alive Prediction Replacement Algorithms
As the associativity of a CPU cache increases, the time that LRU keeps a dead cache line
increases. A dead cache line is a cache line that will not be referenced again. Limiting the
cache stay time of dead cache lines gives space to cache lines that are still alive. A set of
replacement algorithms predicting the dead/alive status of cache lines has been proposed
[HKM02, LFF01, KS05]. In [KS05], two prediction methods, AIP and LvP, are proposed.
AIP/LvP uses a counter per cache line to record events that happen to the cache line. An
event threshold is set; if the number of recorded events exceeds the threshold, the cache line
is marked for eviction. In AIP, the event recorded is the number of references to the cache
set between consecutive accesses to the cache line. In LvP, the event recorded is the number
of hits to the line. AIP/LvP evicts a predicted dead line faster than LRU. For L2 cache sizes
of 512KB and 1MB, AIP/LvP shows a 5% improvement in hit rates over LRU for a number
of SPEC CPU2000 benchmarks. A shortcoming of AIP/LvP is that it requires extra space
to store the event counters.
2.3.7 Off-Line Optimal Replacement Algorithm
If the future references of the CPU are known, a replacement decision can be made by
looking forward and evicting the cache entry that results in the lowest miss rate. This is
called the off-line optimal replacement algorithm. Off-line optimal replacement has the
highest possible hit rate for a given CPU cache configuration.
The optimal replacement algorithm is defined in [ADU71] as one that always evicts the
block ‘which has the longest expected time until next reference.’ When replacing, the cache
line whose next reference is farthest in the future is chosen. This replacement is called the
Optimal (OPT) replacement algorithm, and its hit rate is the theoretical upper limit of all
replacements. Figure 2.8 illustrates how the Optimal replacement decision is made. In the
example, there are four cache lines, containing addresses A, B, C, and D. Cache line A is to
be referenced again in the farthest future, and thus it is chosen to be replaced.
[Figure omitted: four cache lines holding A, B, C and D, with the future reference string E B F E B G C B C D F B C A; A's next use lies farthest in the future, so A is evicted.]
Figure 2.8: An example of the Optimal replacement decision.
Off-line optimal replacement is used in the competitive analysis of on-line replacement
algorithms [ST85]. The hit rate of optimal replacement is used to evaluate the performance
of cache replacement algorithms.
One result observed when comparing the miss rates of optimal replacement and LRU is that
LRU makes incorrect choices when choosing the block to evict [SA93]. For fully associative
caches, optimal replacement has 70% fewer cache misses, and for two-way associative
caches, optimal replacement has 32% fewer cache misses. Another finding is that the miss
rates of fully associative LRU caches on the SPEC benchmarks were sometimes worse than
those of direct-mapped or set-associative caches. In these cases, the limited choices of
direct-mapped and low-associativity caches actually helped LRU make better choices than
in fully or highly associative caches: with set-associative and direct-mapped caches, LRU
cannot always evict the globally least recently referenced line and is forced to keep it longer,
and this turns out to be the right choice. This phenomenon suggests that the oldest line is
still of value.
Optimal replacement is unrealizable for real-world caches, but it is useful in providing an
upper bound on the achievable cache hit rate. Chapter 8 compares the hit rates of two
replacement algorithms against the hit rates of the optimal replacement algorithm to better
understand the improvements.
2.3.8 Summary of Replacements
For CPU caches, there are few alternatives to LRU replacement. Either LRU outperforms
the new replacement algorithm, or the new replacement algorithm is too expensive to implement
in CPU caches. LRU performs well because temporal locality is much more pronounced and
dominant in the memory references seen by CPU caches than in database buffer caches. Hit
rates of even a small CPU cache are higher than 90%, but hit rates of database buffer caches
are much lower. For example, in [OOW93], the hit rates of the OLTP trace experiments are
no more than 50%.
2.4 CPU Cache Issues of Network Protocols and Applications
Network protocols and applications are believed to be programs with poor locality and thus
poor CPU cache hit rates. They exhibit inherently poor cache behavior for two reasons.
First, network protocols and applications work on multiprogramming platforms, and
multiprogramming has a negative impact on CPU cache performance. In multiprogramming,
multiple processes or threads compete for the limited CPU cache and may flush each other's
cache content, causing extra cache misses. Second, network protocols and applications lack
substantial computation. A huge amount of data is involved, but the operations on the data
are simple and few. The lower reuse of data means intrinsically fewer cache hits. One study
[CJRS89] examined the TCP code by profiling and counting the instructions of TCP processing.
The study found that the number of instructions TCP executed was very small; for example,
receiving a packet involves only 335 instructions. TCP has not significantly changed since
this paper was published, and thus its conclusions are still valid.
Network protocols themselves are simple but may interact with the operating system in costly
ways. Network protocols involve a small number of instructions [CJRS89], but studies
[NYKT97] show that instruction cache misses have the greatest impact on protocol latency.
Optimization methods targeting the instruction locality of protocols generated great
improvements [Bla96, MPBO96]; the study in [Bla96] eliminated 90% of instruction cache misses.
One study [NYKT97] compared the instruction cache references and data cache references
of both TCP and UDP. The study found that instruction references outnumber data references
two to one without checksumming and three to one with checksumming, and that the contribution
of instruction references to latency also outweighs that of data references for both UDP and
TCP. This finding indicates that the simplicity of protocol code does not imply that the
code causes fewer CPU cache problems than the data does.
Another study suggests that the many instruction cache misses of network protocols are caused
by the lack of locality in protocol code [Bla96]. This work studied the trace of TCP code
and found distinct phases. The processing cycle of TCP receiving and acknowledging is
divided into three phases: entry, device interrupt, and exit. The code traces of different
phases do not overlap, so the TCP processing instructions are not reused across phases.
The recorded instruction and data cache misses per packet were found to remain constant
across different arrival rates. This means that even when packets arrive quickly, the
processing of each packet is still independent and there is no increased reuse of
instructions in the cache.
An algorithm called Locality Driven Layer Processing (LDLP) is used to increase the reuse
of TCP code [Bla96]. Each TCP code stage is allowed to process as many packets as possible
before entering the next code stage. The instruction cache (I-cache) misses per message
were significantly decreased, from 900 to 100, for fast arrival rates. The data cache
(D-cache) misses increased somewhat with faster arrival rates, but the increase is negligible.
The work shows that before optimization I-cache misses were ten times more numerous than
D-cache misses, and with optimization the number of I-cache misses is almost the same as the
number of D-cache misses. LDLP scheduling greatly improved CPU cache hit rates. However,
LDLP scheduling cannot be used in a real network since it relies on cooperative packet
arrival rates.
A similar idea was proposed to increase the reuse of protocol code in cache in a multiprocessor
environment [SKT96]. This work proposed scheduling processors based on the affinity of the
code they run. Another approach to improving the poor locality of protocol code was proposed
in [MPBO96]. The TCP code is re-arranged by out-lining and cloning: frequently executed
instructions are compacted together, so the reuse of cache lines increases. This work
reported improvements in TCP latency by a factor of 1.35 to 5.8.
2.5 Summary
CPU caches rely on the property of locality in the memory references of programs. LRU
replacement is widely used in CPU caches because it is believed to best exploit temporal
locality. In databases, where the locality of references is not as intense as in CPU caches,
replacement algorithms other than LRU have been developed. These algorithms show significant
improvements over LRU. However, they are not used in CPU caches: either they are too
expensive to implement there, or LRU outperforms them.
Network protocols and applications have poor CPU cache hit rates. The LDLP scheduling
developed by Blackwell significantly reduced the cache misses of TCP processing. Other
locality optimizations also showed improvements in the CPU cache hit rates of network
protocols and applications. These studies show that there is room for optimizing CPU cache
design to achieve better performance for network protocols and applications. Because of the
large CPU cache miss penalty, optimizing the CPU cache performance of network protocols and
applications can greatly improve the performance of servers.
Chapter 3
Principle of Locality and Property of
Short Lifetime
Currently, LRU is the CPU cache replacement algorithm most widely used in computing
systems. LRU is based on the concept of temporal locality that states that ‘recently accessed
items are likely to be accessed in the near future’ [HP96]. LRU replaces the cache entry
that has not been referenced for the longest time. Under LRU replacement, if an address
is just referenced, it will stay in the cache for the longest possible time. If the program ex-
hibits strong temporal locality, then the address is most likely to be referenced in the future.
Chapter 2 introduces research that suggests that not all programs (specifically networked
applications) strongly exhibit temporal locality. This chapter and the next chapter describe
the research conducted in analyzing memory access patterns. This work is the basis for a
new cache replacement algorithm described in Chapter 5.
3.1 Memory Reference Traces
This study analyzes memory reference traces. A memory reference trace of a program
(henceforth referred to as a memory trace) is the stream of main memory addresses issued by
the CPU when executing the program. Standard memory traces, such as the traces of the SPEC
CPU2000 benchmarks [Hen00], provide a consistent foundation for meaningful comparison of
different CPU cache designs. SPEC benchmarks are the de facto benchmarks for CPU performance
comparisons, and CPU2000 is the most up-to-date version.
In this work, memory traces are empirically analyzed to better understand the locality
characteristics of the programs. The memory traces used include those of the SPEC CPU2000
benchmarks and memory traces of web servers (the latter analysis is in chapter 4). The SPEC
CPU2000 benchmarks consist of 12 integer programs and 14 floating point programs. The SPEC
CPU2000 integer programs are the following:

Name     Remarks
gzip     Data compression utility
vpr      FPGA circuit placement and routing
gcc      C compiler
mcf      Minimum cost network flow solver
crafty   Chess program
parser   Natural language processing
eon      Ray tracing
perl     Perl interpreter
gap      Computational group theory
vortex   Object-oriented database
bzip     Data compression utility
twolf    Place and route simulator
The SPEC CPU2000 floating point programs are the following:
Name      Remarks
wupwise   Quantum chromodynamics
swim      Shallow water modeling
mgrid     Multi-grid solver in 3D potential field
applu     Parabolic/elliptic partial differential equations
mesa      3D graphics library
galgel    Fluid dynamics: analysis of oscillatory instability
art       Neural network simulation; adaptive resonance theory
equake    Finite element simulation; earthquake modeling
facerec   Computer vision: recognizes faces
ammp      Computational chemistry
lucas     Number theory: primality testing
fma3d     Finite element crash simulation
sixtrack  Particle accelerator model
apsi      Solves problems regarding temperature, wind velocity, and distribution of pollutants
The SPEC CPU2000 benchmarks are frequently used as standard workloads in CPU and CPU cache
studies [CH01, PH90b, AZMM04]. The memory traces of the SPEC benchmarks are from the BYU
(Brigham Young University) trace archive1. The BYU SPEC benchmark traces were generated with
the CPU caches disabled and captured by hardware. Each BYU SPEC benchmark trace is slightly
more than ten million references long. The BYU trace archive also has L2 cache traces but
does not offer many choices of L1 cache configurations. The L2 cache traces used in this
chapter are therefore generated from the BYU L1 traces by simulating the L1 cache. We wrote
an L1 cache simulator, which offers a wide range of L1 cache configurations, and fed the BYU
L1 traces into it. The misses of the simulated L1 cache are dumped as the L2 cache traces.
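The L1-filtering step described above can be sketched as follows. This is a minimal illustrative sketch, not the actual simulator used in this work; the default geometry parameters are placeholders:

```python
from collections import OrderedDict

def l1_filter(trace, sets=256, ways=2, line_bytes=32):
    """Simulate a set-associative LRU L1 cache over an address trace and
    return the stream of L1 misses, which serves as the L2 cache trace."""
    cache = [OrderedDict() for _ in range(sets)]  # one LRU-ordered dict per set
    l2_trace = []
    for addr in trace:
        line = addr // line_bytes          # cache-line address
        s = line % sets                    # set index bits of the line address
        if line in cache[s]:
            cache[s].move_to_end(line)     # hit: refresh LRU position
        else:
            l2_trace.append(line)          # miss: this reference reaches the L2
            if len(cache[s]) >= ways:
                cache[s].popitem(last=False)  # evict the least recently used line
            cache[s][line] = True
    return l2_trace
```

For example, the trace [0, 4, 0] produces a single L2 reference, because addresses 0 and 4 fall in the same 32-byte line and only the first access misses.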
3.2 Principle of Temporal Locality and LRU
The Principle of Locality [HP96] refers to the belief that 'programs tend to reuse data and
instructions they have used recently'. The principle of locality consists of two kinds of locality:
1http://tds.cs.byu.edu/tds/
temporal locality and spatial locality. Temporal locality is defined [HP96] as 'recently
accessed items are likely to be accessed in the near future.' Spatial locality is defined as
'items whose addresses are near one another tend to be referenced close together in time'
[HP96]. Spatial locality is exploited by the cache line: multiple adjacent memory words are
loaded together as a cache line to increase the possibility of hits. For CPU cache replacement
algorithms, temporal locality is more important, and this work focuses on it. Unless
specified otherwise, in this work the term locality refers to temporal locality.
The Least Recently Used (LRU) cache replacement algorithm is based on this definition of
temporal locality. LRU selects for eviction the cache entry that has not been referenced for
the longest time. If an address has just been referenced, it stays in the cache for the
longest possible time; consequently, if the address is re-referenced, it is most likely to
result in a cache hit.
3.3 Inter-Reference Gaps and Temporal Locality
The term Inter-Reference Gap (IRG) [PG95, Quo94] refers to the number of memory references
between two consecutive references of the same address. For example, if there are 100 memory
accesses between two references of an address, then the IRG value is 100. Figure 3.1
illustrates the calculation of IRG values through an example sequence of main memory
references. This section presents the analysis of IRG values for the address traces
described in section 3.1.
[Figure: reference stream a b c b d a c e b f d a d c b e b c f d, with whole stream IRG values 5 and 7 marked; Set0 sub-trace a c a c e a c e c and Set1 sub-trace b b d b f d d b b f d, with the corresponding per set IRG values 2 and 4 marked]
Figure 3.1: Two per set IRG values and their corresponding whole stream IRG values.
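The IRG computation can be sketched as below. This sketch is ours, not from [PG95]; it records a gap as the difference between the stream positions of two consecutive references of the same address (so an immediate re-reference yields an IRG of one, which is the convention that matches the per set IRG discussion later in this chapter):

```python
def irg_values(trace):
    """Whole-stream IRGs: for each re-reference of an address, record the
    gap (in references) since its previous reference."""
    last_seen, gaps = {}, []
    for t, addr in enumerate(trace):
        if addr in last_seen:
            gaps.append(t - last_seen[addr])  # gap since the last reference
        last_seen[addr] = t
    return gaps
```

On the prefix a b c b d a of the Figure 3.1 stream, this yields gaps of 2 (for b) and 5 (for a).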
3.3.1 Inter-Reference Gaps and LRU
Under the LRU cache replacement algorithm, IRG values determine whether a memory reference
is a cache hit or a miss. Assuming a fully associative (see section 2.1.3) CPU cache of size
256 KB, all IRG values below 256K can be satisfied in cache. IRG values larger than the cache
size may still result in cache hits, since it is the number of unique addresses in a gap that
decides whether it is a hit or a miss. For two IRG gaps of the same value, the gap with more
unique addresses is more likely to result in a cache miss than the one with fewer unique
addresses.
The study of IRG values in this chapter is not the first study of IRG values. One study
[PG95] found that the distribution of IRG values of a single address is highly clustered. A
Markov chain model was proposed to predict IRG values, and cache replacement decisions are
based on these predictions. Figure 3.2 shows the IRG values of three addresses in the memory
trace of the program CC1, a C compiler [PG95]. The IRG index in Figure 3.2 is the time of
the reference; the first reference to an address has an IRG index of one. The IRG value is
the size of the IRG gap in units of memory references. The three addresses in the figure are
the most referenced, the 10th most referenced, and the 100th most referenced address of the
memory trace. The clustered distribution of IRG values is seen in Figure 3.2. However, the
three addresses chosen in [PG95] are among the most intensely referenced addresses. The IRG
distributions of less intensely referenced addresses, especially of those addresses which
are referenced only several times, are not provided in that study. The work presented in
this chapter is the first investigative study that analyzes the IRG values of all addresses.
3.3.2 Complete Program Stream and Per Set IRG Values
The analysis described in [PG95] measures IRG values over the entire program memory
reference stream, in which every memory reference counts in calculating an IRG value.
However, CPU caches are set associative; memory references to one cache set have no impact
on the cache hit rates of the other cache sets. The analysis conducted in this work
therefore measures IRG values on a per set basis, i.e., only memory references mapped to the
same cache set are counted in calculating an IRG value. For example, suppose address a is
referenced again after ten other memory addresses have been referenced. Measured in the
complete program memory reference stream, the IRG value is ten. However, if the ten memory
references are all mapped to different cache sets than address a, the per set IRG value is
one. In this work, IRG values measured on a single cache set are referred to as per set IRG
values, and the IRG values measured in [PG95] are referred to as whole stream IRG values.

Figure 3.2: IRG strings of three addresses in the CC1 trace [PG95]. IRG index is the index
number of the first reference of an IRG gap.
To get the per set IRG values, we first split the trace of memory references into per-set
sub-traces. Each sub-trace represents the memory references to a single cache set; the set
index part of the address is used to map a reference to a cache set. IRG values are then
calculated within the sub-traces, yielding the per set IRG values. Figure 3.1 presents an
example illustrating the derivation of per set IRG values from whole stream IRG values. In
the example, there are two cache sets: set0 and set1. Addresses a, c, and e are mapped to
set0, and addresses b, d, and f are mapped to set1. Two whole stream IRG values of size five
and seven, after being mapped to the two cache sets, become per set IRG values of size two
and four. In this work, the term IRG refers to per set IRG unless specified otherwise.
We wrote a software tool to map whole stream IRG values into per set IRG values and to
measure the distribution of per set IRG values. With this tool, we gathered the
distributions of per set IRG values for all SPEC CPU2000 benchmarks. All of these results can be found
in appendix A.
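The per-set split performed by this tool can be sketched as follows. This is an illustrative sketch, assuming the set index is taken from the line-address bits; the geometry defaults are placeholders, not the tool's actual parameters:

```python
def per_set_irgs(trace, sets=256, line_bytes=32):
    """Split an address trace into per-set sub-traces and measure IRG
    values within each sub-trace (per set IRGs)."""
    positions = [0] * sets   # current length of each set's sub-trace
    last_seen = {}           # cache line -> its last position in its sub-trace
    gaps = []
    for addr in trace:
        line = addr // line_bytes
        s = line % sets      # set index of the line
        if line in last_seen:
            gaps.append(positions[s] - last_seen[line])  # gap within the set
        last_seen[line] = positions[s]
        positions[s] += 1    # this reference extends set s's sub-trace
    return gaps
```

With two sets and one-word lines, the trace [0, 1, 2, 1, 3, 0] (even addresses in set 0, odd in set 1) yields per-set IRGs of 1 and 2, illustrating how whole-stream gaps shrink once the intervening references of other sets are excluded.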
3.3.3 Distributions of Per Set IRG Values and Temporal Locality
The property of temporal locality is seen in the distributions of per set IRG values.
Figure 3.3 shows the distributions of per set IRG values of eight SPEC CPU2000 benchmarks,
chosen randomly from the 26. The four programs gcc, gzip, crafty, and perl are integer
programs from the SPEC CPU2000 INT suite. The other four programs, wupwise, ammp, apsi, and
fma3d, are floating point programs. In appendix A, the distributions of per set IRG values
of all 26 SPEC CPU2000 benchmarks are provided.
The IRG values in Figure 3.3 are measured on a cache line basis instead of on individual
addresses. To save space, a CPU cache line always consists of multiple words; currently, a
cache line usually has eight or more words. The contiguous words of a cache line are loaded
into and evicted from the cache as a whole, so a hit on one word in the cache line is also a
hit for the other words in the line. IRG values measured on a cache line basis are therefore
smaller than those measured on individual addresses. The results presented in Figure 3.3 are
based on a cache configuration of 256 cache sets with 32-byte cache lines. The cache lines
of all cache sets are of the same size [PH05].
As seen in Figure 3.3, most of the IRG values are small: more than 90% of per set IRG gaps
are equal to one. Out of the 26 SPEC benchmarks, there are only two memory traces, mcf and
vortex, where the proportion of IRG values equal to one is less than 90%. Some programs have
as many as 98% of their per set IRG values equal to one.
The small IRG values are direct evidence of temporal locality. Since the majority of per set
IRG values are so small, even small CPU caches, for example a 2 KB cache, will have high hit
rates using cache replacement algorithms such as FIFO and Random. It has been observed in
[HP96, page 379] that the difference in hit rates between LRU and Random is minimal, with
LRU having slightly better hit rates. The small IRG values can explain this: they are so
small that most addresses are re-referenced immediately, before any replacement decision is
made, and thus any replacement algorithm can achieve high hit rates.
Using LRU replacement, IRG values are directly related to cache hits and misses. LRU
replacement guarantees that every address stays in the cache for a time span at least as
long as the associativity of the cache, so all IRG values below the associativity of the
cache result in cache hits under LRU. Because of the potentially large number of comparisons
needed to compare an address tag with the cache line tags, it is difficult for
set-associative CPU caches to achieve more than 32-way associativity. Assuming 16-way
associativity under LRU replacement, IRG values below 16 are cache hits; these IRG values
account for more than 95% of all per set IRG values (see table 4.3). However, increasing the
associativity of LRU caches, for example from 16-way to 32-way, may not be a good idea. The
number of IRG values between 16 and 32 is very limited, and there is a considerable
percentage of IRG values as high as one hundred. Such IRG values cannot easily be
accommodated by LRU replacement through increasing the associativity or the size of the cache.

[Figure: bar charts of per set IRG distributions (IRG value buckets 1, 2 to 9, 10~100, 100~1000, 10^3~10^4, 10^4~10^5; y-axis: percent) for the SPEC INT benchmarks crafty, gcc, gzip, and perl and the SPEC FLT benchmarks ammp, apsi, fma3d, and wupwise at the L1 cache]
Figure 3.3: The distributions of per set IRG values of eight SPEC benchmarks.
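The relationship between per set IRGs and LRU hits can be made concrete with a single-set LRU sketch (ours, for illustration): a re-reference hits exactly when fewer than `ways` distinct lines were touched in the set since its last use, so a per-set IRG below the associativity guarantees a hit.

```python
def lru_hits(set_trace, ways):
    """Count hits for the references of one cache set under LRU."""
    stack, hits = [], 0          # stack[0] is least recently used
    for line in set_trace:
        if line in stack:
            hits += 1
            stack.remove(line)   # hit: pull the line out of the stack...
        elif len(stack) >= ways:
            stack.pop(0)         # miss in a full set: evict the LRU line
        stack.append(line)       # ...and reinsert as most recently used
    return hits
```

For instance, with two-way associativity the set trace [1, 2, 1, 2] yields two hits, while [1, 2, 3, 1] yields none, since line 1 is evicted by line 3 before its re-reference.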
3.4 Reference Counts and Property of Short Lifetime
This section presents the results of analyzing the reference counts of programs.
3.4.1 Property of Short Lifetime
The reference count of an address is the number of times the address is referenced in the
lifetime of the program. The analysis of reference counts presented in this work finds that
the distribution of reference counts is heavily long-tailed and that the majority of
addresses have low reference counts. Reference counts tend to cluster around small values,
which has a negative impact on the hit rates of CPU caches using LRU replacement, because
LRU assumes that all addresses are equally likely to be referenced again. The analysis shows
instead that the majority of addresses are unlikely to be re-referenced often, while a very
small portion of addresses are heavily re-referenced.
Figure 3.4 shows the distributions of reference counts of eight SPEC benchmarks; in appendix
A, the distributions of reference counts of all 26 SPEC CPU2000 benchmarks are provided. The
figure demonstrates that a large portion of addresses, 10% to 75%, are referenced only once,
and the majority of addresses, nearly 90%, are referenced fewer than 10 times. We name this
phenomenon the property of short lifetime.
The property of short lifetime is found in all 26 SPEC benchmarks: the majority of addresses
have small reference counts. On the other hand, each of the SPEC benchmark programs has a
small number of addresses that are very intensively referenced. In each of the 26 SPEC
benchmarks there are always some addresses referenced hundreds of thousands of times. Three
SPEC benchmarks, lucas, mesa, and mgrid, have two or three addresses referenced more than a
million times; for each of these benchmarks, such addresses account for more than one tenth
of the memory trace. The distribution of reference counts is heavily long-tailed. Frequently
referenced addresses account for only a small portion, no more than 10%, of all addresses by
count, but they form the majority of the memory references.
3.4.2 Reference Counts of Cache Lines
[Figure: bar charts of per-address reference count distributions (buckets 1, 2 to 9, 10~100, 100~1000, 10^3~10^4, 10^4~10^5, 10^5~10^6; y-axis: percent) for the SPEC INT benchmarks crafty, gcc, gzip, and perl and the SPEC FLT benchmarks ammp, apsi, fma3d, and galgel]
Figure 3.4: The distributions of per address reference counts of eight SPEC benchmarks.

To save space, a CPU cache line typically consists of multiple words; currently, a cache
line often has eight or more words. The contiguous words of a cache line are
loaded into and evicted from the cache as a whole. Even if a word is referenced for the first
time, if the cache line containing the word is already in the cache, it is still a cache hit. Thus
the analysis focuses on calculating reference counts of cache lines. The reference count of
a cache line is the sum of the reference counts of all the words in the cache line.
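This aggregation can be sketched as follows; the line size parameter is an illustrative default matching the 32-byte lines used elsewhere in the chapter:

```python
from collections import Counter

def line_ref_counts(trace, line_bytes=32):
    """Reference count of each cache line: the sum of the reference
    counts of all the words that fall in that line."""
    return Counter(addr // line_bytes for addr in trace)
```

For example, word addresses 0, 4, and 31 all fall in line 0, so that line's reference count is three even though each word is touched only once.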
Figure 3.5 shows the distributions of reference counts of cache lines for eight SPEC
benchmarks; in appendix A, the distributions of cache line reference counts of all 26 SPEC
CPU2000 benchmarks are provided. Compared with the per-address reference counts shown in
Figure 3.4, reference counts of cache lines tend to be a little larger. The proportions of
small reference counts are smaller, but the long-tailed trend is still obvious: 90% of cache
lines have reference counts of less than 100. The property of short lifetime still holds in
the distribution of reference counts of cache lines.
[Figure: bar charts of per-line reference count distributions (buckets 1, 2 to 9, 10~100, 100~1000, 10^3~10^4, 10^4~10^5, 10^5~10^6; y-axis: percent) for the SPEC INT benchmarks crafty, gcc, gzip, and perl and the SPEC FLT benchmarks ammp, apsi, fma3d, and galgel]
Figure 3.5: The distributions of reference counts of cache lines of eight SPEC benchmarks.
3.5 Relationship between Average Reference Counts and
LRU Hit Rates
The study of memory traces done in this work shows that the average reference count is
representative of the temporal locality of a program. Programs with good temporal locality
have high average reference counts, while programs with poor temporal locality tend to have
small average reference counts. Programs with high average reference counts also have higher
hit rates in CPU caches using LRU replacement than programs with lower average reference
counts. Figures 3.6 and 3.7 show the average reference counts of the SPEC integer and
floating point benchmarks and their cache miss rates under LRU replacement. The average
reference counts of the SPEC benchmarks vary from 12 to 539.
Programs with higher average reference counts have higher cache hit rates. The reason for
this correlation is that with high average reference counts there are fewer unique addresses
or cache lines in an IRG gap than with low average reference counts. Section 3.3.2 showed
that the distributions of IRG values are similar across the programs under examination, yet
their average reference counts vary a great deal. A high average reference count implies a
larger coverage of IRG values with the same cache size, and thus a higher CPU cache hit
rate. The average reference count of a program is determined by the nature of its computation.
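The metric itself reduces to one line. This sketch is ours and assumes the average is taken over cache lines (the thesis does not spell out whether it averages over addresses or lines); the line size is an illustrative default:

```python
def average_ref_count(trace, line_bytes=32):
    """Average reference count: total references divided by the number of
    unique cache lines touched. Higher values suggest stronger reuse."""
    unique_lines = {addr // line_bytes for addr in trace}
    return len(trace) / len(unique_lines)
```

For example, a trace of four references touching two distinct lines has an average reference count of 2.0.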
[Figure: bar charts of the average reference counts (0 to 550) and LRU miss rates (0% to 8%) of the SPEC INT benchmarks bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, and vpr]
Figure 3.6: The average reference counts of SPEC integer benchmarks and their miss rates under LRU.
[Figure: bar charts of the average reference counts (0 to 350) and LRU miss rates (0% to 2.25%) of the SPEC FLT benchmarks ammp, applu, apsi, art, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, and wupwise]
Figure 3.7: The average reference counts of SPEC floating point benchmarks and their miss rates under LRU.
3.6 L2 IRG and Reference Count Distributions
This section presents the results of the analysis of reference counts and IRG distributions
for L2 cache memory references.
3.6.1 L2 Reference Count Distributions
As shown in Figure 3.3, the majority of IRG values at the L1 cache are small, and small IRG
values result in L1 cache hits. The L2 cache sees only the first reference to a cache line,
which loads it from main memory into both the L1 and the L2 cache. The hits, if any, of the
cache line in the L1 cache are not visible to the L2 cache. Thus, the number of references
to cache lines seen at the L2 cache is much smaller than at the L1 cache. Figure 3.8 shows
the distributions of reference counts of eight SPEC benchmarks at the L2 cache; the L1 cache
in Figure 3.8 is a two-way associative 8 KB cache. In appendix A, the distributions of L2
reference counts of all 26 SPEC benchmarks are provided.
[Figure: bar charts of L2 reference count distributions (buckets 1, 2 to 9, 10~100, 100~1000, 10^3~10^4; y-axis: percent) for the SPEC INT benchmarks crafty, gcc, gzip, and perl and the SPEC FLT benchmarks ammp, apsi, fma3d, and galgel]
Figure 3.8: The distributions of L2 reference counts of eight SPEC benchmarks.
Figure 3.8 shows that at the L2 cache a higher percentage of addresses have small reference
counts; in particular, the proportion of addresses referenced only once increases
dramatically. The shape of the reference count distribution is similar at the L1 and the L2
caches, but the property of short lifetime is more pronounced at the L2 cache.
3.6.2 L2 IRG Distributions
Figure 3.9 shows the distributions of L2 IRG values of eight SPEC benchmarks. Since the size
and associativity of a CPU cache are always powers of two, the distributions of L2 IRG
values are also presented, in Figure 3.10, on a log2 scale. In appendix A, the distributions
of L2 IRG values of all 26 SPEC benchmarks are provided.
[Figure: bar charts of L2 IRG distributions (buckets 1, 2 to 9, 10~100, 100~1000, 10^3~10^4; y-axis: percent) for the SPEC INT benchmarks crafty, gcc, gzip, and perl and the SPEC FLT benchmarks ammp, apsi, fma3d, and wupwise]
Figure 3.9: The distributions of L2 IRG values of eight SPEC benchmarks.
From the figures it is observed that, at the L2 cache, there are fewer IRG values of size
one; the proportion of IRG values of size one is less than 30% for most of the SPEC
benchmarks. However, at the L2 cache, the majority of IRG values are still small: more than
60% of L2 IRG values are between two and eight.
L2 IRG values that are higher than the associativity of the L2 cache are more likely to
result in cache misses than in cache hits. At the L1 cache, most IRG values are of size one,
which means an address is immediately and repeatedly referenced. Even with a large IRG
value, at the L1 cache there are fewer unique addresses in the gap, and thus the gap is
likely to end in a cache hit. At the L2 cache, however, the percentage of IRG values of size
one is much smaller, and for the same IRG value there are more unique addresses; thus, L2
IRG values larger than the associativity are likely to be cache misses.

[Figure: bar charts of L2 IRG distributions on a log2 scale (buckets 1, 2, 4, 8, ..., 16384; y-axis: percent) for the SPEC INT benchmarks crafty, gcc, gzip, and perl and the SPEC FLT benchmarks ammp, apsi, fma3d, and wupwise]
Figure 3.10: The distributions of L2 IRG values of SPEC benchmarks on a log2 scale.
3.7 Summary
The distributions of per set IRG values clearly show the property of temporal locality:
nearly 90% of per set IRG values are equal to one, and the majority of IRG values are small.
Besides the temporal locality, it is observed that a large portion of addresses are
referenced only once and the majority of addresses are referenced a small number of times.
This phenomenon is named the property of short lifetime. All SPEC benchmarks, both integer
and floating point, exhibit it.
As indicated by the distribution of L1 IRG values, addresses are re-referenced immediately
and repeatedly; the memory references at the L1 cache exhibit strong locality. LRU
replacement has nearly perfect hit rates at the L1 cache, even for small, low-associativity
caches, since even a small cache under LRU covers the short IRGs of almost every address.
The temporal locality is so overwhelming that any cache replacement algorithm, such as FIFO
or Random, achieves more than 90% hit rates [HP96].
Memory references at the L2 cache show the property of short lifetime more clearly.
Even with a very small L1 cache, such as a two-way associative 8KB cache, there is a higher per-
centage of addresses with low reference counts and a lower percentage of small IRG values
at the L2 cache than at the L1 cache. This implies that LRU replacement is inappropriate at the
L2 cache.
Chapter 4
Locality Characteristics of Network
Protocols and Applications
This chapter presents an analysis of the IRG values and reference counts of the memory traces
of network protocols and applications. The results show that network protocols and appli-
cations have different locality characteristics than SPEC benchmarks.
4.1 Motivation
This work began with an investigation of the drop in web server performance under over-
load. The abrupt drop in web server throughput and responsiveness is a commonly
seen phenomenon. Experiments were done that flooded a web server with requests from
a client. The server and the client were connected through a 100Mbps Ethernet. To avoid
disk activity, the client requested the same static web page repeatedly. The web servers,
Apache [1] and thttpd [2], were compiled with gcc [3] without loop-nest optimization. When over-
loading happened, the server's network usage was only half of the link capacity, and there were
no packet losses or page faults. The server's memory image was well below the main mem-
ory size. However, the CPU utilization rate of the server was always below 50% during periods of
overloading. The low CPU utilization rates of web servers, even under overload conditions,
suggest poor CPU cache performance.
[1] http://www.apache.com/
[2] http://www.acme.com/software/thttpd/
[3] http://gcc.gnu.org/
Programs such as the SPEC CPU2000 benchmarks have much higher CPU utilization rates.
SPEC CPU2000 benchmarks are not I/O bound, and they perform more intensive computation
than web servers. SPEC CPU2000 benchmarks more strongly exhibit the property of temporal
locality, which results in higher than 99% CPU cache hit rates. Network protocols and
applications are believed to be of poor temporal locality (see section 2.4). This chapter
applies the locality analysis methods discussed in chapter 3 to network memory traces. The
empirical analysis shows that network protocols and applications have less temporal locality
than SPEC benchmarks and more strongly exhibit the property of short lifetime.
4.2 Memory Traces of Web Servers
One of the major obstacles in studying the CPU cache performance of network protocols
and applications is the lack of suitable memory traces. Network applications run on
multiprogramming platforms, and thus the interference of the operating system on network
performance cannot be ignored. The network memory reference trace must be a full-system
one, including interrupt handlers, OS kernel tasks, and user programs.
A full-system simulator called Simics [MCE+02] is used to generate the network memory
traces. Since the web is the most prevalent network application, a set of 32 memory traces of
web servers was generated. Three factors were used in configuring the generation of web
server traces: server architecture, request rate and web page size. The analysis is based on
two web page sizes, 20KB and 200KB, with request rates from 50 to 150 requests per second
for each page size. Larger file sizes and higher request rates were not used because the
simulator did not have sufficient computing power. The 20KB page size was chosen to reflect
the average size of text-based web pages, and the 200KB page size represents a typical image
file size. Memory traces for mixed request file sizes were also generated. In the mixed
scenarios, two web pages, of size 20KB and 200KB respectively, were requested at rates of
50 to 150 requests per second. Two arbitrarily chosen mixture ratios were used: in one,
one 200KB page is requested for every four 20KB pages, and in the other, one 200KB page is
requested for every nine 20KB pages.
The server architecture refers to the mechanism for handling concurrent connections in web
servers: select()-based or fork()-based. Concurrent connection handling with select()
requires only one process for all connections, but fork() requires one process per connec-
tion. The two architectures are very different: fork()-based web servers have much larger
process images than select()-based servers, and the internal scheduling of the two
architectures differs. We expected that these architectural differences might lead to
different CPU cache behaviors. An example of a web server using select() is thttpd by
Poskanzer [4]; an example of a fork()-based web server is Apache [5], currently the most
widely used web server. Memory traces were generated for each combination of request rate
and web page size for both the thttpd and Apache web servers.
The combinations of web architecture, web page size and request rate number 32 in total, and
32 web server memory traces were generated. Examples of the notation used to denote the
memory traces are 'a20kr50' and 't200kr50': 'a' represents the Apache web server, 't' stands
for the thttpd web server, '20k' refers to the file size, and 'r50' is the request rate.
Table 4.1 lists the names of the memory traces and their configurations.
4.3 Average Reference Counts of Web Server Memory Traces
Table 4.2 shows the average reference counts of the 32 web server memory traces. The av-
erage reference counts vary to some extent, but not as much as those of the SPEC CPU2000
benchmarks (see Section 3.5). Except for one trace, the average reference counts of web
server memory traces are between 20 and 60, whereas the average reference counts of SPEC
CPU2000 benchmark traces range from 12 to 539. Since the average reference count is
correlated with the LRU hit rate of a program, and higher average reference counts mean
higher hit rates (see section 3.5), the average reference counts of web server memory traces
imply that web servers have poorer temporal locality than some SPEC benchmarks and thus
higher miss rates. This is supported by the simulation results (see chapter 8). Table 4.2
also shows that, for both the thttpd and Apache web servers, small request file sizes have
higher average reference counts, and that thttpd traces have higher average reference counts
than Apache traces.
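The average reference count of a trace is simply the total number of references divided by the number of unique cache line addresses touched. A minimal sketch (the 64-byte line size is an assumption for illustration):

```python
from collections import Counter

def avg_reference_count(trace, line_bits=6):
    """Average number of references per unique cache line in a trace."""
    counts = Counter(addr >> line_bits for addr in trace)
    return sum(counts.values()) / len(counts)
```

For instance, a trace touching one line twice and another line once has an average reference count of 1.5.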
[4] http://www.acme.com/software/thttpd/
[5] http://www.apache.org/
Trace name        Server type  File size (KB)             Request rate (r/s)
a20kr50           apache       20                         50
a20kr90           apache       20                         90
a20kr120          apache       20                         120
a20kr150          apache       20                         150
a200kr50          apache       200                        50
a200kr90          apache       200                        90
a200kr120         apache       200                        120
a200kr150         apache       200                        150
mixapr50-ap       apache       mixed (1 200KB : 4 20KB)   50
mixapr90-ap       apache       mixed (1 200KB : 4 20KB)   90
mixapr120-ap      apache       mixed (1 200KB : 4 20KB)   120
mixapr150-ap      apache       mixed (1 200KB : 4 20KB)   150
mix1-9apr50-ap    apache       mixed (1 200KB : 9 20KB)   50
mix1-9apr90-ap    apache       mixed (1 200KB : 9 20KB)   90
mix1-9apr120-ap   apache       mixed (1 200KB : 9 20KB)   120
mix1-9apr150-ap   apache       mixed (1 200KB : 9 20KB)   150
t20kr50           thttpd       20                         50
t20kr90           thttpd       20                         90
t20kr120          thttpd       20                         120
t20kr150          thttpd       20                         150
t200kr50          thttpd       200                        50
t200kr90          thttpd       200                        90
t200kr120         thttpd       200                        120
t200kr150         thttpd       200                        150
mixthr50-th       thttpd       mixed (1 200KB : 4 20KB)   50
mixthr90-th       thttpd       mixed (1 200KB : 4 20KB)   90
mixthr120-th      thttpd       mixed (1 200KB : 4 20KB)   120
mixthr150-th      thttpd       mixed (1 200KB : 4 20KB)   150
mix1-9thr50-th    thttpd       mixed (1 200KB : 9 20KB)   50
mix1-9thr90-th    thttpd       mixed (1 200KB : 9 20KB)   90
mix1-9thr120-th   thttpd       mixed (1 200KB : 9 20KB)   120
mix1-9thr150-th   thttpd       mixed (1 200KB : 9 20KB)   150
Table 4.1: Names of network traces and their configurations.
4.4 Reference Count Distributions of Web Server Memory
Traces
Figure 4.1 shows the distributions of per-address reference counts of four web server memory
traces. The distributions of all 32 web server memory traces can be found in appendix A. The
distributions show that the property of short lifetime is obvious in web server memory
traces. The distributions of reference counts of web server memory traces do not differ much
from those of the SPEC benchmark traces.
Trace            Avg refcount    Trace            Avg refcount
a20kr50          61              t20kr50          150
a20kr90          40              t20kr90          106
a20kr120         41              t20kr120         93
a20kr150         51              t20kr150         80
a200kr50         294             t200kr50         21
a200kr90         22              t200kr90         21
a200kr120        28              t200kr120        18
a200kr150        23              t200kr150        20
mixapr50         37              mixthr50         51
mixapr90         37              mixthr90         36
mixapr120        27              mixthr120        29
mixapr150        35              mixthr150        31
mix1-9-apr50     32              mix1-9-thr50     56
mix1-9-apr90     21              mix1-9-thr90     44
mix1-9-apr120    32              mix1-9-thr120    37
mix1-9-apr150    37              mix1-9-thr150    31
Table 4.2: Average reference counts of network traces.
Figure 4.2 shows the distributions of reference counts of cache lines for the same four web
server memory traces as Figure 4.1. For more precision, the distributions are on log2 scales.
The figure shows that different configurations of web server type, request rate and web page
size exhibit great differences in the distributions of reference counts.
4.5 L2 Distributions of Reference Counts of Web Server
Memory Traces
Figure 4.3 shows the distributions of reference counts of cache lines of four web server mem-
ory traces at the L2 cache. Compared with the L2 reference count distributions of SPEC
benchmarks, web server memory traces have a lower percentage of cache lines with low
reference counts. This should not be interpreted as meaning that web server memory traces
have better locality than SPEC benchmarks: the property of short lifetime is still very
obvious in their reference count distributions. Given that web server memory traces have
low average reference counts, even with smaller portions of cache lines with small reference
counts, web server memory traces are still of poor temporal locality.
[Figure: distributions of per-address reference counts at the L1 cache for traces A20kr90, A200kr90, T20kr90 and T200kr90.]
Figure 4.1: The distributions of per-address reference counts of four web server memory traces.
4.6 L2 IRG Distributions of Web Server Memory Traces
The poor temporal locality of web server memory traces can also be seen by examining the
distributions of L2 IRG values. Figure 4.4 depicts the L2 IRG values of four web server
memory traces. Distributions for the other web server memory traces, and the distributions
of IRG values at the L1 cache, can be found in appendix A. Compared with the IRG
distributions of SPEC benchmarks, the L2 IRG values of web server memory traces are
noticeably larger. Assuming a 16-way set associative cache, the percentages of IRG values
larger than the associativity, 16, are much larger in the web server memory traces than in
the SPEC benchmarks.
Tables 4.3 and 4.4 present, for SPEC benchmarks and web server memory traces respectively,
the percentages of IRG values smaller than 16 and the percentages of IRG values greater than
or equal to 256. Except for a200kr50, web server memory traces generally have 60% to 70% of
IRG values below 16. In comparison, SPEC benchmarks typically have more than 85% of IRG
values below 16, except mcf, which has only about 23%. Assuming a 16-way associative L2
cache using LRU replacement, IRG values under 16 are guaranteed to result in cache hits. Web
server memory traces have fewer IRG values below the associativity of the CPU cache, and
thus fewer guaranteed cache hits. Even worse, due to the low average reference counts, the
large IRG values in web server memory traces are more likely to result in cache misses than
those of SPEC benchmarks with very high average reference counts.
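Given an IRG histogram, the two statistics tabulated below — the share of IRG values under the associativity (guaranteed hits under 16-way LRU) and the share of very large IRG values — can be derived as follows. This is a sketch; the histogram format, a mapping from IRG value to count, is an assumption:

```python
def irg_shares(hist, assoc=16, large=256):
    """Fractions of IRG values below `assoc` and at or above `large`."""
    total = sum(hist.values())
    below = sum(c for v, c in hist.items() if v < assoc)   # guaranteed LRU hits
    above = sum(c for v, c in hist.items() if v >= large)  # near-certain misses
    return below / total, above / total
```

For example, a histogram with six IRGs of one, two of 20 and two of 300 yields shares of 0.6 and 0.2.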
[Figure: distributions of L1 per-line reference counts, on a log2 x-axis, for traces A20kr90, A200kr90, T20kr90 and T200kr90.]
Figure 4.2: The distributions of reference counts of cache lines of four web server memory traces.
Meanwhile, web server memory traces have more IRG values above 256 than SPEC bench-
mark traces do. Among the web server memory traces, the traces with the small, 20KB, web
page size, for both Apache and thttpd, have less than 10% of IRG values larger than 256. Web
server memory trace a200kr50 has the second smallest percentage of IRG values larger than
256, 1.75%. It also has by far the largest percentage of IRG values below 16, 79%. This is
reflected in a very good cache hit rate for trace a200kr50.
4.7 Summary
Memory traces of web servers are different from the memory traces of SPEC CPU2000
benchmarks. Web server memory traces have lower percentages of small IRG values and
higher percentages of large IRG values. The average reference counts of web server memory
traces are smaller than those of the SPEC benchmarks with good locality. Since the average
reference count of a program is representative of the program's temporal locality, this
suggests that web servers have poorer temporal locality than the SPEC CPU2000 benchmarks.
[Figure: distributions of L2 per-line reference counts for traces A20kr90, A200kr90, T20kr90 and T200kr90.]
Figure 4.3: The distributions of reference counts of cache lines of four web server memory traces at the L2 cache.
[Figure: distributions of L2 IRG values for traces A20kr90, A200kr90, T20kr90 and T200kr90.]
Figure 4.4: The distributions of IRG values of four web server memory traces at the L2 cache.
% of IRG values < 16:
  ammp 93.11%    applu 97.37%    apsi 94.02%    art 70.83%     bzip 95.17%    crafty 89.24%   eon 96.53%
  equake 89.66%  facerec 96.86%  fma3d 97.31%   galgel 90.40%  gap 93.02%     gcc 90.80%      gzip 71.05%
  lucas 92.69%   mcf 23.12%      mesa 96.77%    mgrid 97.65%   parser 75.83%  perl 84.58%     sixtrack 89.61%
  swim 74.35%    twolf 70.05%    vortex 85.90%  vpr 56.40%     wupwise 94.75%

% of IRG values ≥ 256:
  ammp 0.22%     applu 0.25%     apsi 0.64%     art 16.40%     bzip 0.03%     crafty 1.21%    eon 0.58%
  equake 0.59%   facerec 0.11%   fma3d 0.30%    galgel 0.08%   gap 0.70%      gcc 0.64%       gzip 1.33%
  lucas 0.17%    mcf 58.91%      mesa 0.15%     mgrid 0.39%    parser 0.49%   perl 1.11%      sixtrack 5.99%
  swim 0.41%     twolf 8.98%     vortex 1.99%   vpr 11.30%     wupwise 0.33%
Table 4.3: Percentages of IRG values < 16 and percentages of IRG values ≥ 256 of SPEC benchmarks.
% of IRG values < 16:
  a20kr50 60.81%         a20kr90 59.76%         a20kr120 60.62%         a20kr150 61.94%
  a200kr50 79.40%        a200kr90 65.35%        a200kr120 66.47%        a200kr150 65.31%
  mixapr50-ap 63.16%     mixapr90-ap 62.48%     mixapr120-ap 62.52%     mixapr150-ap 62.20%
  mix1-9apr50-ap 62.85%  mix1-9apr90-ap 59.88%  mix1-9apr120-ap 59.69%  mix1-9apr150-ap 60.85%
  t20kr50 63.23%         t20kr90 59.82%         t20kr120 61.59%         t20kr150 61.83%
  t200kr50 66.87%        t200kr90 69.00%        t200kr120 68.41%        t200kr150 68.69%
  mixthr50-th 63.68%     mixthr90-th 62.82%     mixthr120-th 64.52%     mixthr150-th 63.46%
  mix1-9thr50-th 63.84%  mix1-9thr90-th 62.72%  mix1-9thr120-th 61.81%  mix1-9thr150-th 65.21%

% of IRG values ≥ 256:
  a20kr50 3.51%          a20kr90 7.60%          a20kr120 6.87%          a20kr150 7.39%
  a200kr50 1.75%         a200kr90 13.97%        a200kr120 11.98%        a200kr150 15.01%
  mixapr50-ap 12.43%     mixapr90-ap 12.04%     mixapr120-ap 11.10%     mixapr150-ap 11.34%
  mix1-9apr50-ap 8.09%   mix1-9apr90-ap 11.37%  mix1-9apr120-ap 11.51%  mix1-9apr150-ap 9.41%
  t20kr50 1.21%          t20kr90 2.72%          t20kr120 4.07%          t20kr150 5.07%
  t200kr50 15.95%        t200kr90 11.65%        t200kr120 10.71%        t200kr150 12.00%
  mixthr50-th 9.12%      mixthr90-th 11.57%     mixthr120-th 12.28%     mixthr150-th 11.78%
  mix1-9thr50-th 6.14%   mix1-9thr90-th 8.46%   mix1-9thr120-th 9.57%   mix1-9thr150-th 8.73%
Table 4.4: Percentages of IRG values < 16 and percentages of IRG values ≥ 256 of network traces.
Chapter 5
WLRU Cache Replacement
The studies presented in Chapters 3 and 4 suggest that good temporal locality is exhibited
by relatively few addresses. The property of short lifetime shows that a large number of
addresses will never be re-referenced. The LRU replacement algorithm does not distinguish
between the heavily referenced addresses and the infrequently referenced addresses. This
chapter describes a new cache replacement algorithm called WLRU, a modification of LRU
that differentiates between addresses.
5.1 Correlation of IRG and Reference Counts
Addresses have different cache values. The cache value of an address is its likelihood of
being re-referenced. A good replacement algorithm always tries to keep in the cache the
addresses that are most likely to be referenced again. Addresses that are referenced only
once have no cache value. Generally, addresses with higher reference counts have higher
cache value than addresses with lower reference counts. However, IRG values must also be
taken into consideration: of two addresses that will both be referenced again, the one that
will be referenced sooner has the higher cache value.
The off-line optimal cache replacement looks forward in the memory trace and replaces the
cache line whose next reference is farthest in the future [ADU71]. Details of the off-line
optimal cache replacement can be found in Section 7.5. In terms of IRG values, the off-line
optimal replacement evicts the address with the largest IRG value. The IRG values of an
address are correlated with its reference count: addresses with higher reference counts have
more small IRG values, and large IRG gaps are more likely to be associated with addresses
with low reference counts. Table 5.1 shows the IRG values of addresses that map to cache
set 0 of SPEC benchmark crafty, and Table 5.2 shows the IRG values of addresses that map to
cache set 0 of network trace a20kr50. All IRG values are at the L2 cache, and the L1 is a
two-way set associative 8KB cache. Due to space limitations, only a portion of the addresses
are shown in Tables 5.1 and 5.2.
address  refcount   1    2    <4   <8   <16  <32  <64  <128 <256 <512
6f0e00   279        8    40   56   108  65   1    0    0    0    0
4e3900   275        5    102  93   41   18   12   2    1    0    0
4f2700   262        6    81   91   50   21   9    2    1    0    0
......
44ed00   2          0    0    0    0    0    0    0    0    1    0
44ef00   2          0    0    0    0    0    0    0    0    1    0
47b300   2          0    0    0    0    0    1    0    0    0    0
7aa800   2          0    0    0    0    0    0    0    0    0    1
7aa900   2          0    0    0    0    0    0    0    0    0    1
7aaa00   2          0    0    0    0    0    0    0    0    0    1
Table 5.1: The IRG values of address tags mapping to set 0 of SPEC benchmark crafty.
address  refcount   1    2    <4   <8   <16  <32  <64  <128 <256 <512
11b00    854        0    170  203  346  90   0    0    43   0    0
10e00    850        0    48   151  340  253  9    47   1    0    0
602100   383        46   53   164  62   2    0    16   10   28   1
60d600   363        47   43   142  69   7    0    16   9    28   1
......
608500   4          0    0    0    0    0    0    0    0    3    0
609400   4          0    0    0    0    0    0    0    0    3    0
68fb00   4          0    0    0    0    0    0    0    1    2    0
5fee00   3          0    0    0    0    0    0    0    0    2    0
5ff300   3          0    0    0    0    1    0    0    0    1    0
Table 5.2: The IRG values of address tags mapping to set 0 of network trace a20kr50.
Tables 5.1 and 5.2 show that the IRG values of addresses with low reference counts are large,
and that small IRG values are bound to addresses with high reference counts. We gathered the
distributions of IRG values at the L2 cache for the addresses of all SPEC CPU2000 benchmarks
(these distributions can be found in Appendix A). It was found that, except for one program,
swim (see Section 8.3 and Table 8.1), the reference count of an address and its IRG values
are correlated: large IRG values tend to be associated with addresses with low reference
counts, and small IRG values with addresses with high reference counts. This is the basis of
WLRU replacement. Addresses of good cache value show themselves quickly: if an address is
not re-referenced within a short time after it is brought into the cache, it may never be
re-referenced, but if an address is hit quickly after being brought into the cache, it is
likely to be hit again and again. WLRU judges the cache value of an address by the number of
hits it receives immediately after being brought into the cache. If an address is not hit
within a short time after being brought into the cache, it can be evicted quickly.
5.2 Problems with LRU and LFU
LRU does not differentiate between the two kinds of references: a hit in the cache and the
initial reference to an address are treated in the same way. The initial reference to an
address is the reference that first brings the address into the cache. Initial references and
hits have different cache values. The correlation of reference counts and IRG values suggests
that hits occurring shortly after initial references are likely to represent addresses with
high reference counts. LRU keeps addresses that are never hit for too long a period of time.
If the cache size is limited, high cache value addresses are likely to be flushed out under
LRU by a large number of addresses with low reference counts or large IRG gaps.
The Least Frequently Used (LFU) replacement and its variants differentiate between addresses
with different cache values [EH84]. Under LFU, addresses that are not hit immediately after
being brought into the cache are replaced quickly. However, LFU tends to keep addresses of
good cache value in the cache too long. If some addresses are referenced thousands of times
within a certain period only and never again, LFU will fix these addresses in the cache for
a long time, which results in poor hit rates. To address this problem, variants of LFU have
aging mechanisms that periodically reduce the reference counts of addresses. Even with aging,
LFU still cannot adapt quickly enough to the phase changes of programs. During a phase
change, a large number of new addresses replace the old ones, and it takes LFU too long to
evict the formerly good addresses. LFU and its variants are never used in CPU caches.
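To make the contrast concrete, here is a toy single-set LFU with a periodic aging step. This is an illustrative sketch, not a hardware design; the halve-every-N-references aging policy and the tie-breaking by insertion order are assumptions:

```python
def lfu_aging_set(refs, ways=4, age_every=8):
    """Single-set LFU: evict the line with the smallest reference count;
    periodically halve all counts so formerly hot lines can eventually leave."""
    counts = {}          # address -> reference count (insertion-ordered dict)
    replaced = []
    for t, addr in enumerate(refs, 1):
        if addr in counts:
            counts[addr] += 1
        else:
            if len(counts) == ways:
                victim = min(counts, key=counts.get)   # ties: oldest entry
                del counts[victim]
                replaced.append(victim)
            counts[addr] = 1
        if t % age_every == 0:                         # aging step
            counts = {a: c // 2 for a, c in counts.items()}
    return replaced
```

With a very long aging period this shows exactly the pathology described above: a line that was hot early keeps its large count and pins itself in the set.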
56
5.3 WLRU Cache Replacement
WLRU addresses the limitations of LRU and LFU by limiting the cache stay time of both
infrequently hit and frequently hit addresses. WLRU uses weights to achieve this. Weights in
WLRU are integers representing the replacement priority of cache lines: when replacing, the
cache line with the minimal weight is chosen. Weights change based on the reference history
of cache lines.
Weights have also been used in the LFU cache replacement and its variants. The reference
count of an address in LFU behaves as a weight: when choosing an entry to replace, the en-
try with the minimum reference count is chosen. The difference between WLRU and LFU
is in the calculation of weights. Weights in WLRU have clearly defined upper limits, but
reference counts in LFU do not. LFU may have some limit on the largest possible reference
count, but that is only a consequence of the physical constraint on the storage of the
counts. The upper limit of weights in WLRU prevents cache lines from being fixed in the
cache.
The following equation shows how the weight of a cache line changes as a function of its
previous weight and the hit/miss status of the current reference. There are three
configurable parameters in WLRU: the increment of weights, i, the upper limit of weights, r,
and the initial weight, b. In the equation, wt is the weight of a cache line at the t-th
reference to the cache set, and wt+1 is the weight at the (t+1)-th reference. The equation
reflects that if a cache line is hit, its weight increases until reaching the upper limit,
and that on every reference to a cache set, each cache line that is not hit has its weight
decremented by one until the weight reaches zero.
wt+1 =  b         if referenced for the first time,
        wt + i    if hit and wt + i ≤ r,
        r         if hit and wt + i > r,
        wt − 1    if not hit and wt > 1,
        0         if not hit and wt ≤ 1.                  (5.0)
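Equation 5.0 translates directly into a small update function. A sketch; the parameter defaults follow the i64r128b2 example used later in section 5.4:

```python
def next_weight(w, hit, first=False, i=64, r=128, b=2):
    """Weight of one cache line at the next reference to its set (Eq. 5.0)."""
    if first:                 # initial reference: start at the initial weight b
        return b
    if hit:                   # hit: add the increment i, saturating at r
        return min(w + i, r)
    return max(w - 1, 0)      # not hit: decay by one, floored at zero
```

For example, a line at weight 100 that is hit saturates at 128, while a line at weight 1 that misses one more set reference decays to 0.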
In Equation 5.0, the weight of a cache line is always decremented by one when it is not hit.
It is possible to decrement weights by more than one, but doing so would require a larger
weight increment and a larger upper limit of weights, which would increase the complexity
of the circuits.
As a general rule, programs that exhibit poor temporal locality have a large portion of ad-
dresses that will not be referenced again. For these programs, the initial weight should be
small, and the increment and upper limit of weights should be large compared with the
initial weight. With small initial weights, cache lines that will not be referenced again are
replaced more quickly than under LRU, while valuable cache contents are kept longer by the
large weight increment and the large upper limit of weights.
The upper limit of weights controls how long a previously frequently used address stays in
the cache without being hit again. Due to phase changes in programs, where the set of
frequently referenced addresses changes, the upper limit should not be set too large compared
with the increment of weights, or some addresses may become fixed in the CPU cache. The
upper limit of weights in WLRU should be no more than several times the weight increment, so
that during a phase change, new heavily referenced addresses can easily replace the old ones.
For programs of poor temporal locality, a low initial weight purges addresses that are not
referenced again faster. Initial weights, although small, are necessary: a small initial
weight, as low as two or four, is enough for addresses of good cache value to show
themselves. The exact parameter settings of WLRU also depend on the size and associativity
of the CPU cache.
5.4 Notation Used to Represent WLRU Parameter Settings
There are three configurable parameters in WLRU. Notation of the form i64r128b2 is used to
represent a setting of these parameters; such notations are called weight formulas. The i in
i64r128b2 stands for the increment of weights; in the example, the weight is increased by 64
when the cache line is hit. The r represents the upper limit, here set at 128. The b stands
for the initial weight, here set at 2. Thus, under weight formula i64r128b2, when an address
is first loaded into the cache, its weight is two. Every time the cache line is hit, its
weight increases by 64 until it reaches 128. On every reference where the cache line is not
hit, its weight is decremented by one until it reaches zero.
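Weight formulas such as i64r128b2 are easy to parse mechanically. A small sketch; the function name and the dictionary return format are illustrative choices:

```python
import re

def parse_weight_formula(formula):
    """Split a weight formula like 'i64r128b2' into its three parameters."""
    m = re.fullmatch(r"i(\d+)r(\d+)b(\d+)", formula)
    if m is None:
        raise ValueError(f"not a weight formula: {formula!r}")
    i, r, b = map(int, m.groups())
    return {"i": i, "r": r, "b": b}
```

For example, parse_weight_formula("i64r128b2") yields increment 64, upper limit 128 and initial weight 2.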
Figure 5.1 compares the replacement decisions of WLRU and LRU on an example reference
string. The reference string exhibits the property of short lifetime: a few addresses, A and
B, are referenced very often, while many other addresses are referenced only once or twice.
The weight formula used in Figure 5.1 is i6r8b1.
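The behavior shown in Figure 5.1 can be reproduced with a short single-set simulation. The sketch below is an illustration; the tie-breaking rule, evicting the line that has been at the bottom of the weight-ordered stack longest, is inferred from the figure rather than stated in the text, and replacements are counted the way the figure counts misses, i.e. excluding cold fills of empty ways:

```python
def wlru_set(refs, ways=4, i=6, r=8, b=1):
    """Single-set WLRU with weight formula parameters (i, r, b).
    Returns the sequence of replaced (evicted) addresses."""
    lines = []                                  # [addr, weight], weight-ordered
    replaced = []
    for addr in refs:
        if any(a == addr for a, _ in lines):    # hit: boost it, others decay
            lines = [[a, min(w + i, r) if a == addr else max(w - 1, 0)]
                     for a, w in lines]
        else:                                   # miss
            if len(lines) == ways:
                replaced.append(lines.pop()[0]) # evict minimal-weight line
            lines = [[a, max(w - 1, 0)] for a, w in lines]
            lines.append([addr, b])             # insert with initial weight b
        lines.sort(key=lambda e: -e[1])         # stable: ties keep their order
    return replaced

def lru_set(refs, ways=4):
    """Single-set LRU, for comparison; returns replaced addresses."""
    lines, replaced = [], []                    # lines[0] is the MRU entry
    for addr in refs:
        if addr in lines:
            lines.remove(addr)
        elif len(lines) == ways:
            replaced.append(lines.pop())        # evict the LRU entry
        lines.insert(0, addr)
    return replaced
```

On the reference string of Figure 5.1, lru_set produces 14 replacements and wlru_set with i6r8b1 produces 10, matching the miss counts quoted in the figure.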
5.5 WLRU Mimicking LRU
WLRU is very versatile: it can be configured to behave as LFU, FIFO (First In First Out), or
LRU. When the upper limit of weights is very large, the initial weight and the increment of
weights are one, and weights never decrease, WLRU becomes LFU. When the initial weight equals
the upper limit of weights, both are large, and the weight increment is zero or very small,
WLRU is a FIFO replacement. When the upper limit, the initial weight and the increment of
weights are all the same and large enough, WLRU behaves exactly as LRU replacement. Ta-
ble 5.3 compares the total cache misses of a set of WLRU weight formulas with the LRU
replacement. Weight formula i512r512b512 has exactly the same number of cache misses as LRU,
and i256r256b256 is nearly identical.
128KB cache:
        i512r512b512  i256r256b256  i128r128b128  i64r64b64  i32r32b32  i16r16b16
LRU     211743        211743        211743        211743     211743     211743
WLRU    211743        211743        211742        211748     211728     203653

256KB cache:
        i512r512b512  i256r256b256  i128r128b128  i64r64b64  i32r32b32  i16r16b16
LRU     120817        120817        120817        120817     120817     120817
WLRU    120817        120820        120800        120613     117482     110028

Table 5.3: Comparison of total cache misses of LRU and weight formulas mimicking LRU.
5.6 Comparison of WLRU with Other Cache Replacement
Algorithms
The difference between WLRU and LRU is that WLRU discriminates against addresses with low
reference counts, especially addresses that are referenced only once. Other cache
replacement algorithms discriminate against low reference count addresses as a side effect;
LRU-k [OOW93] and LIRS [JZ02] are such examples. In LRU-k, the most recent reference
[Figure content: on the reference string A B C A B D H E F A C G B H A B K A C J L B, one cache set under LRU replaces C A B D H E F A C G H B K A (14 misses), while under WLRU with weight formula i6r8b1 it replaces C D H E F C G H K C (10 misses).]
Figure 5.1: Comparison of the replacement decision of WLRU and LRU.
to a cache line is not used to make the replacement decision; instead, the k-th backward
reference is used. If a cache line has not been referenced k times, its priority is minimal.
A side effect of this arrangement is that addresses that are never hit in the cache are
always discriminated against: they always have the lowest priority and are frequently
replaced. In LIRS, the time span between the two most recent references to the same cache
line represents the replacement priority of the cache line. If a cache line is never hit in
the cache, it has the lowest possible priority; LIRS works against cache lines with no hit
history in the cache, and such cache lines are evicted fast.
The problem with LRU-k and LIRS is that these replacement algorithms are too expensive to
implement in CPU caches. LRU-k needs to store the reference history up to the last k
references and to compare the order of the last k references of cache lines. LIRS stores the
last two references and maintains two ordered queues. Assuming that storing one reference
history entry requires eight bits, LRU-2 and LIRS need to store two entries, requiring at
least 16 bits per cache line, twice the space of WLRU. Another difference between LRU-k and
LIRS and WLRU is that, although LRU-k and LIRS can evict addresses that are not hit in the
cache faster than LRU can, they do not treat frequently referenced addresses as accurately
as WLRU. For LRU-2, cache lines hit three times have no advantage over cache lines hit
twice, and for LIRS, the number of times a cache line is hit does not matter once it is more
than one. This makes LRU-k and LIRS worse than WLRU.
The Least Recently/Frequently Used (LRFU) replacement algorithm [LCK+01] is also a replacement algorithm using weights. LRFU was proposed for database buffer caches. LRFU assigns a value called the CRF (Combined Recency and Frequency) to each cache block, and when replacing a block, the block with the minimal CRF value is chosen (see section 2.3). The calculation of CRF values involves multiplication of real numbers, and floating point calculations are much more expensive than integer calculations: an integer ALU requires around 600 to 700 transistors1, but a floating point unit requires 40K transistors2. The CRF value of a block needs to be recalculated for each reference in the history. In comparison, WLRU involves only a simple addition or a subtraction by one, calculated once per reference. LRFU requires storage of the whole reference history, while WLRU does not store any references. Although LRFU is proposed to subsume both LRU and LFU, it is based on totally different observations and theories than WLRU.
1 http://ieeexplore.ieee.org/iel5/8641/27381/01217869.pdf
2 http://www.intel.com/standards/floatingpoint.pdf
WLRU differs from the replacement algorithms that predict dead and alive cache lines [HKM02, LFF01, KS05] in that WLRU does not predict whether a cache line is dead or alive. WLRU predicts whether the distance to the next reference of a cache line is large or short. WLRU tries to evict addresses with large IRG gaps, but these addresses are not necessarily dead. In fact, a considerable portion of the victim cache lines of WLRU are still alive. Dead cache lines are good choices for replacement, but when the size of the cache is small compared with the size of the working set of the workload, the probability of finding a dead line also decreases. In this scenario, the best choice is the line with the largest IRG gap.
AIP/LvP [KS05] is a CPU cache replacement algorithm that predicts dead cache lines. AIP/LvP tries to evict a cache line referenced more than a specified threshold faster than LRU. When the L2 cache is small, for example 128KB, the time WLRU keeps a dead cache line is within a small constant of the optimal replacement algorithm. The maximum time WLRU keeps dead cache lines is equal to the upper limit of weights. In this case, WLRU can be viewed as a simplified version of AIP/LvP. AIP/LvP adapts the keep time of dead lines; WLRU uses a fixed setting and thus has lower costs.
5.7 Summary
The property of short lifetimes shows that only a small portion of addresses are worth being kept in the cache. However, LRU does not distinguish between addresses with high reference counts and addresses with low reference counts. LRU always keeps the just-referenced addresses as long as possible. For programs with poor locality, where a large number of addresses will not be referenced again, LRU is not optimal. WLRU addresses this shortcoming of LRU by differentiating addresses.
The IRG values of an address are found to be correlated with the reference count of the address. Addresses with high reference counts tend to have small IRG values, and large IRG values are likely to be associated with addresses of low reference counts. This finding is exploited by WLRU to identify addresses of high cache value. In WLRU, if an address is not hit within a short time after it is brought into the cache, the address is evicted quickly. For programs with many addresses that are never re-referenced, WLRU can keep the valuable cache contents longer than LRU.
WLRU uses integers called weights to denote the cache replacement priority of cache lines. When replacing, the cache line with the minimal weight is chosen. Weights change according to the reference history of cache lines. By controlling how weights change, WLRU can suit a wide range of localities and mimic other cache replacement algorithms such as LRU, LFU and FIFO. However, WLRU is intended mainly for programs with poor locality. Guidelines on how to set the WLRU parameters are provided.
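The policy summarized above can be sketched as a small behavioural model. This is a sketch under assumptions, not the hardware described in the next chapter: the parameter names i (increment on a hit), b (initial weight) and limit (upper limit of weights) follow the weight formula of this chapter, and a Python dict stands in for one cache set.

```python
# Behavioural sketch of one WLRU cache set (illustrative, not the hardware).
# i = weight increment on a hit, b = initial weight of a new line,
# limit = upper limit of weights (fixed by the weight storage width).

class WLRUSet:
    def __init__(self, ways=8, i=64, b=1, limit=255):
        self.ways, self.i, self.b, self.limit = ways, i, b, limit
        self.lines = {}  # tag -> weight

    def reference(self, tag):
        hit = tag in self.lines
        # Every reference deducts one from the weight of all other lines.
        for t in self.lines:
            if t != tag:
                self.lines[t] = max(0, self.lines[t] - 1)
        if hit:
            # Hit: increase the weight, saturating at the upper limit.
            self.lines[tag] = min(self.limit, self.lines[tag] + self.i)
        else:
            if len(self.lines) >= self.ways:
                # Evict the line with the minimal weight.
                victim = min(self.lines, key=self.lines.get)
                del self.lines[victim]
            self.lines[tag] = self.b  # new line starts at the initial weight
        return hit
```

An address that is never re-referenced stays near its low initial weight b and is evicted quickly, while a frequently hit address accumulates weight and survives, which is the behaviour this chapter describes.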
Chapter 6
Hardware Implementations of WLRU
In this chapter, an example implementation of WLRU replacement in a CPU cache is presented.
6.1 Space Requirements of WLRU
Besides space for the data, the address tag and the status bits of a cache line, each cache line needs circuitry to implement the replacement. LRU needs space to store the LRU order, and WLRU needs space to store the weight. Figure 6.1 shows an example storage arrangement of a cache set using WLRU replacement. Eight bits are used by each cache line to store the weight, which fixes the upper limit of the weight at 255. Simulations in Chapter 8 show that an upper limit of weights of 256 gives good hit rates for both SPEC CPU2000 benchmarks and web server memory traces. This example implementation of the circuits uses this upper limit only. Besides the space used by each cache line to store the weight, a CPU cache using WLRU replacement needs two registers to store the increment of the weight on a hit and the initial weight.
6.2 Overall Structure of WLRU CPU Cache
Figure 6.2 is the overall structure of a CPU cache implementing the WLRU replacement
algorithm. The WLRU CPU cache in the figure is an L1 cache. Without any internal changes, the WLRU CPU cache can also be used as an L2 cache.

[Figure 6.1: Storage arrangement of an eight-way associative cache set using WLRU replacement. Each line (line0 to line7) stores VALID <1>, DIRTY <1>, TAG <19>, WEIGHT <8> and DATA <256> bits.]

The CPU cache in Figure 6.2 is an eight-
way set associative cache, 256KB in size, using 32 byte cache lines. The cache lines are organized in 1024 cache sets, and each cache set has eight cache lines, identified as line0 to line7. The storage of the CPU cache is organized into eight ways, identified as way0 to way7. A way is the collective storage for the cache lines of the same number from all cache sets. Each way has five RAM memory arrays: the tag RAM, data RAM, weight RAM, dirty status RAM and valid status RAM. These RAM arrays provide storage for the tag, weight, data, dirty status and valid status parts of the cache lines. The dirty status bit is set when the cache line has been written. The valid status bit indicates that the cache line contains valid cache content, i.e., it is not empty and has not been invalidated explicitly.
Figure 6.3 shows the weight RAM, the data RAM, the tag RAM, the dirty status RAM and the valid status RAM of a way. Each RAM array has 1024 storage cells. A 10 bit wide address line is used by the RAM arrays to index a single storage cell among the 1024 cells. The number of storage cells in the RAM arrays corresponds to the number of cache sets. The weight RAM cells are eight bits wide, the tag RAM cells are 19 bits wide, and the dirty and valid status RAM cells are one bit wide. It is also possible to merge the dirty and valid status RAM arrays into a single RAM array with two bit wide cells. The data RAM supports two operation modes: line mode and word mode. In line mode, a 256 bit cache line is read or written; in word mode, a 32 bit word is accessed. The data RAM has a 13 bit wide address line, of which only the highest 10 bits are used in line mode. Cells in the data RAM are 256 bits wide in line mode and 32 bits wide otherwise.
[Figure 6.2: The structure of a CPU cache using WLRU replacement. The cache connects to the CPU and MMU over the internal bus and comprises, per way (associative0/line0 to associative7/line7), the Tag, Data, Weight, Valid Status and Dirty Status RAMs, together with the Address Latch, Cast-out Buffer, Cache Load Queue, Weight Increment Register, Initial Weight Register, and the hit/miss, read/write, weight control, replacement and line-fill/cast-out logic.]
[Figure 6.3: The RAM memory arrays used in one associative way of the WLRU CPU cache: Tag RAM (1024×19), Data RAM (1024×256, with 256/32 bit line/word access over a 10/13 bit address and a W/L mode signal), Weight RAM (1024×8), Dirty Status RAM (1024×1) and Valid Status RAM (1024×1), each with RD/WR, ADDR, CLOCK, DATA IN and DATA OUT signals.]
The WLRU CPU cache in Figure 6.2 supports cache lookup and line-fill operations. Cache lookups are initiated by the CPU when the CPU is reading or writing an address. Line-fills refer to the loading of cache lines from the main memory and are initiated by the Memory Management Unit (MMU) when the MMU has finished a main memory read. In cache lookup operations, the address of the word the CPU is currently referencing is stored in the Address Latch. The set index part of the address, bit 3 to bit 12, is used by the RAM arrays to index a single cache set. The eight tag RAM arrays are read and compared in parallel with the tag part of the Address Latch. If a match is found, it is a cache hit; otherwise it is a cache miss. A hit/miss signal is sent back to the CPU indicating the result of the comparison.
In the case of a cache hit, a 32 bit word is read from or written to the data RAM cell of the matching way. If it is a cache miss, the CPU sends the memory request to the MMU, and the MMU retrieves the data from the main memory, which incurs long delays. Cache lookup operations cause the weights of cache lines to change based on the hit or miss results of the lookups.
A line-fill operation is the result of the MMU visiting the main memory. A cache line, 256 bits in size, is read from the main memory and stored in the Cache Load Queue of the MMU. The MMU then initiates a line-fill operation, which replaces the cache line chosen by the replacement algorithm with the new cache line stored in the Cache Load Queue. If the replaced cache line is valid and marked dirty, there is a cast-out operation that first transfers the contents of the replaced cache line into the Cast-out Buffer and then writes the contents of the Cast-out Buffer to the main memory. If the replaced cache line is not valid or its dirty bit is not set, the selected cache line is simply overwritten. The address of the cache line containing the word the CPU is referencing is stored in the Address Latch. The highest 19 bits of the address are stored in the tag RAM array, and the 256 bit data is stored in the data RAM. The weight RAM and the status RAMs of the cache line are also initialized.
The WLRU cache has two software configurable registers: the weight increment register and the initial weight register. The value stored in the weight increment register is the value of i (see the weight formula in chapter 5), and the initial weight register stores the value of b. The upper limit of weights is implicitly implemented by the size of the weight storage. In this example, eight bits per cache line are used to store the weight, and accordingly the upper limit of weights is fixed at 255.
Figure 6.4 shows the data path, the address path and the control signals of the WLRU CPU cache. These include a bidirectional 256 bit wide data path, a bidirectional 32 bit wide address path, a read/write (RD/WR) signal, a lookup/line-fill signal, a hit/miss signal and a cast-out signal. The lookup/line-fill signal distinguishes the two types of cache operations: cache lookups and cache line-fills. In cache lookup operations, the RD/WR signal indicates read or write access to the cache, and the hit/miss signal indicates whether a cache line with the same address as the referenced address is found in the cache. For cache line-fill operations, the cast-out signal indicates whether the replaced cache line needs to be written to the main memory. In this example, the data part of a cache line is 256 bits long, and thus the data path is 256 bits wide. In cache lookup operations, only one word is transferred at a time, so only the first 32 bits of the data path are used; in cache line-fill operations, all 256 bits are used. Assuming a 32 bit CPU core, the address path is 32 bits wide. In cache line-fill operations, the lowest three bits of the address path are not used.
[Figure 6.4: The data path, address path and control signals of the WLRU CPU cache: a 256 bit data path and a 32 bit address path to the internal bus, plus the RD/WR, lookup/line-fill, hit/miss and cast-out signals.]
6.3 Hit/Miss Logic
Figure 6.5 graphically depicts the hit/miss control logic of the WLRU CPU cache. The hit/miss logic is involved in cache lookup operations and changes the weights of cache lines. The address of the word that the CPU is currently referencing is stored in the Address Latch. The middle 10 bits of the address are the set index bits. The set index bits are connected to all the RAM memory arrays of the eight ways except the data RAM memory arrays, which are connected to the lower 13 bits of the Address Latch.
The core of the hit/miss logic is a parallel comparator. The comparator simultaneously compares the stored address tags of the eight cache lines with the tag part of the Address Latch and outputs the hit/miss signal and the line_select signal. Only valid cache lines are involved in the tag comparison. If one of the eight tags stored in the tag RAMs is the same as that in the Address Latch and the valid bit of the cache line is set, there is a cache hit and the hit/miss signal is set high. The line_select signal indicates the number of the cache line that was hit.

If no matching address tag is found, it is a cache miss. In such a case, the hit/miss signal is set low, and the line_select signal is not used. There is no reading from or writing to cache lines. The line_select signal and the hit/miss signal are needed in the weight control logic to update the weights of all cache lines in the cache set, as illustrated in Figure 6.6.
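The parallel comparison can be sketched sequentially in software (in hardware the eight tag comparisons happen at once; the function name is illustrative):

```python
# Sketch of the hit/miss logic: compare the lookup tag against the
# stored tags of the eight ways, gated by the valid bits.

def hit_miss(tags, valid, lookup_tag):
    """Return (hit, line_select); line_select is undefined (None) on a miss."""
    for line, (t, v) in enumerate(zip(tags, valid)):
        if v and t == lookup_tag:  # only valid lines take part
            return True, line
    return False, None
```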
[Figure 6.5: The hit/miss logic of the WLRU CPU cache. A parallel comparator matches the tag part of the Address Latch against the eight tag RAMs, gated by the valid bits, and drives the hit/miss and line_select signals; a 3:8 decoder and a multiplexer steer word-level reads and writes to the hit way.]

When a cache hit happens, if it is a cache read, the line_select signal controls the output of a multiplexer and one word of the hit cache line is read out. In the case of a cache write, the line_select signal is decoded in a decoder. The output of the decoder activates only one cache line, the hit cache line, and one word of that cache line is overwritten by the value on the data input bus. When a write occurs, the dirty status bit of the hit cache line is set. Later, when the line is replaced, the content of the cache line, whose dirty status bit is set, will be written to the Cast-out Buffer, as illustrated in Figure 6.8. In cache lookup operations, only one word is read from or written to the cache line. The block offset part of the address, which is stored in the lowest 3 bits of the Address Latch, selects which of the eight words in the cache line data is accessed.
6.4 Weight Control Logic
Figure 6.6 graphically depicts weight control logic involved in cache lookup operations.
Weights are recalculated for every reference to the cache set. The set index bits of the Ad-
dress Latch are connected to the weight RAM memory arrays of each of the eight ways.
Only the weights of the selected cache set are changed. If there is a cache hit, as indicated
by the hit/miss signal, a decoder decodes the lineselect signal and selects the hit cache as-
sociative. The weight of the hit cache line is increased, butthe weights of all other cache
lines are deducted by one. In the case of cache misses, when the hit/miss signal is low, the
weights of all cache lines in the cache set are deducted by one. The weight arithmetic circuit
takes the old weight as input and outputs the new weight. Thisis illustrated in Figure 6.7.
Figure 6.7 shows an example weight arithmetic circuit. The core of the circuit is an eight bit wide full adder. The weight arithmetic circuit performs two operations: increasing the weight and deducting the weight by one. The input signal Inc/deduct controls the choice between the two operations. Input lines w0 to w7 are the bits of the old weight, and output lines o0 to o7 are the bits of the new weight.

If a cache line is hit, the Inc/deduct signal is set high, and its weight arithmetic circuit performs an unsigned addition of the old weight and the value of the weight increment register. The weight increment register is software configurable. The size of the register is eight bits, which supports weight formulas whose increments are at most 255. If the sum of the old weight and the number in the weight increment register overflows, which is indicated by the carry-out signal of the adder, the weight-out bits o0 to o7 are all set to one. The new weight becomes 255, the upper limit of weights.
71
In the case of a cache miss, the Inc/deduct signal is set low, and the weight arithmetic circuit of a cache line performs a signed deduction of the old weight by one. When the old weight is already zero, deducting it by one results in underflow, and in that case the new weight is zero. Underflow is detected when all bits of the old weight are zero. Thus, weights in this example are always greater than or equal to zero and less than or equal to the upper limit.
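The two clamped operations amount to a saturating add and a saturating decrement. A functional sketch, assuming the fixed eight bit limit of this example:

```python
# Sketch of the weight arithmetic circuit: an 8-bit saturating add on a
# hit (clamped on carry-out), or a deduction by one (clamped on underflow).

LIMIT = 255  # fixed by the eight-bit weight storage

def new_weight(old, inc_deduct, increment):
    if inc_deduct:                    # hit: unsigned add, clamp at the limit
        return min(LIMIT, old + increment)
    return max(0, old - 1)            # otherwise: deduct by one, clamp at zero
```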
6.5 Replacement and Line-Fill/Cast-Out Logic
Figure 6.8 is a block diagram illustrating the replacement and line-fill/cast-out logic circuits of a WLRU cache. Cache line-fill operations are initiated by the MMU. The Address Latch contains the address of the new cache line, and the 256 bit wide data input carries the data of the new cache line to be filled. All eight words of the new cache line are loaded simultaneously. The replacement circuit takes all eight weights and their valid status bits as inputs and outputs the number of the cache line to be replaced. The 3-bit wide victim line signal contains the number of the line to be replaced and is decoded at a decoder to enable writing to the chosen cache line. The words in the RAM memory arrays of the selected cache line are overwritten: the tag and the data receive new contents, the valid status bit is set, and the dirty status bit is cleared. The newly loaded cache line is assigned an initial weight from the initial weight register, which is software configurable.
If the cache line selected for replacement is valid and its dirty status bit is set, the data and the address tag of the cache line are written to the Cast-out Buffer before the cache line is overwritten. The Cast-out Buffer consists of the tag, the set index and the data. The tag and the data are read from the RAM memory arrays of the cache line to be replaced, but the set index is read from the Address Latch: although the Address Latch contains the address of the new cache line, its set index part is the same as the set index of the address of the to-be-replaced cache line. The castout signal is set high when the Cast-out Buffer is loaded. If the replaced cache line is invalid or not dirty, the existing cache line is simply overwritten, there is no cast-out, and the castout signal is set low. If the castout signal is high, the MMU reads the content of the Cast-out Buffer into its write buffer and will later write it to the main memory when the memory bus is not busy.
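The line-fill and cast-out decision can be sketched as follows; the dictionary fields are illustrative stand-ins for the RAM contents, not the hardware signal names:

```python
# Sketch of the line-fill decision: cast out only when the victim is
# both valid and dirty; otherwise the line is simply overwritten.

def line_fill(victim, new_tag, new_data, initial_weight):
    castout = victim["valid"] and victim["dirty"]
    evicted = (victim["tag"], victim["data"]) if castout else None
    victim.update(tag=new_tag, data=new_data, valid=True,
                  dirty=False, weight=initial_weight)
    return castout, evicted  # evicted content goes to the Cast-out Buffer
```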
Figure 6.9 shows an example of the replacement circuit used in Figure 6.8. Input signals w0 to w7, each eight bits wide, are the weights of cache line0 to cache line7, and input signals v0 to v7 are the valid status bits of the eight cache lines. Output signals o0, o1 and o2 encode the number of the cache line to be replaced, and the valid signal indicates whether the to-be-replaced cache line is valid. The core of the replacement circuit is a comparator. The comparator takes the weights of all cache lines in the cache set as inputs and outputs the number of the cache line with the minimum weight. If there is more than one cache line with the minimum weight, the cache line with the smallest number is chosen.
If there are invalid cache lines, the replacement circuit chooses an invalid cache line and ignores the output of the comparator. Invalid cache lines can be empty cache lines or cache lines that were invalidated explicitly by other processors in a multi-processor environment. In both cases, invalid cache lines are replaced first, and if there is more than one invalid cache line, the one with the smallest number is chosen. If all cache lines are valid, the output of the comparator is used, and the valid signal is set high. The valid signal is used in Figure 6.8 to determine whether the Cast-out Buffer should be loaded. If the to-be-replaced cache line is invalid, the valid signal is set low; in this case, there is no need to copy the content of the to-be-replaced cache line to the Cast-out Buffer.
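A behavioural sketch of the replacement circuit: invalid lines first, then the minimum weight, with ties going to the smallest line number (Python's min() already returns the first minimum):

```python
# Sketch of the replacement circuit. Returns (victim_line, valid_signal):
# valid_signal is low (False) when an invalid line was chosen, so no
# cast-out is needed.

def choose_victim(weights, valid):
    for line, v in enumerate(valid):
        if not v:                 # invalid lines are replaced first
            return line, False
    line = min(range(len(weights)), key=weights.__getitem__)
    return line, True
```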
6.6 Comparison of WLRU and LRU
Compared to LRU, the primary cost of WLRU is the space to store the weight: eight bits per cache line. Assuming an eight bit weight storage per cache line, for an n-way associative cache, WLRU requires 8 × n bits per cache set. In comparison, LRU needs n² bits to store the LRU order of the cache lines [Tan87]. For an eight-way associative cache set, WLRU replacement needs 8 × 8 = 64 bits per cache set, and LRU replacement also needs 8² = 64 bits. For higher associativity, WLRU requires less space than LRU; for example, for a 16-way associative cache set, LRU requires 16² = 256 bits, while WLRU requires 8 × 16 = 128 bits.
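The comparison above reduces to two one-line formulas; the break-even point is at eight ways:

```python
# Replacement-state storage per cache set, as derived above: WLRU stores
# an eight-bit weight per line; LRU stores the full order in n*n bits
# [Tan87]. The two meet at n = 8.

def wlru_bits(n):
    return 8 * n

def lru_bits(n):
    return n * n
```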
Pseudo-LRU replacement algorithms require much less space to store the pseudo-LRU orders. For an n-way set associative cache, pseudo-LRU-tree requires n − 1 bits, and pseudo-LRU-msb needs n bits. For an eight-way associative cache set, pseudo-LRU-tree requires seven bits of storage, and pseudo-LRU-msb requires eight bits to store the pseudo-LRU order (see section 2.3.1).
WLRU has different weight control logic than LRU. On each reference to the WLRU CPU cache, all of the weights of the affected cache set are changed: most of the weights are deducted by one, and the weight of the hit line is increased. There are n such weight control circuits for an n-way set associative cache. The core of the weight control logic is an adder. Since the weight change is simple, a full adder is not needed, and the circuit of the weight control logic can be simplified. LRU and the pseudo-LRU replacement algorithms have similar circuits too: on every reference, the LRU or pseudo-LRU order needs to be updated, and the circuit for maintaining that order is of the same order of complexity as the WLRU weight control logic.
The only difference between the replacement circuit of WLRU and the replacement circuits of LRU and the pseudo-LRU replacement algorithms is the comparator of WLRU weights. However, only one comparator is needed for the entire cache. The comparison of weights in the replacement circuit of WLRU is not expected to slow down the CPU cache, for two reasons. First, cache line replacements occur infrequently, especially at the L2 cache. Second, the replacement of the old cache line by a new cache line happens in parallel with the loading of the CPU execution units with the word from the new cache line, or with the loading of the L1 cache. When the memory word arrives at the cache load queue, the CPU execution unit loads the word and the stalled pipeline resumes immediately. The loading of the cache happens in parallel with the resumption of the pipeline, so no CPU cycles are wasted in replacing and loading the cache.
6.7 Summary
The example implementation of a WLRU CPU cache described in this chapter is very similar to a standard set associative CPU cache using LRU replacement. The cache lookup logic is exactly the same, and the other logic blocks are only slightly different. Compared with LRU and pseudo-LRU caches, a WLRU CPU cache requires a little more storage for weights: pseudo-LRU replacements need one bit per cache line of replacement information, while WLRU requires eight bits per cache line. However, the total storage of a cache line, including storage for the tag, the data, the weight and the status bits, is 285 bits. The eight bits of weight storage per cache line amount to less than 8/285 ≈ 2.8% of the 285 bits of storage of a cache line.
[Figure 6.6: The weight control logic of the WLRU CPU cache. The line_select signal is decoded by a 3:8 decoder and, together with the hit/miss and lookup/linefill signals, drives a Weight Arithmetic Circuit per way; each circuit reads the old eight-bit weight from its Weight RAM (1024×8) and writes back the new weight.]
[Figure 6.7: The weight arithmetic circuit of the weight control logic. An eight bit full adder takes the weight-in bits w0 to w7 and the Weight Increment Register as inputs; the Inc/deduct signal selects between addition and deduction by one, and the carry-out clamps the weight-out bits o0 to o7 at the upper limit.]
[Figure 6.8: The line-fill/cast-out logic of the WLRU CPU cache. The replacement circuit takes the eight weights w0 to w7 and valid bits v0 to v7 and outputs the victim line number, which a 3:8 decoder uses to enable writing of the new tag, data, initial weight (from the Initial Weight Register) and status bits into the per-way RAMs; the victim's tag and data are multiplexed into the Cast-out Buffer together with the set index from the Address Latch, and the castout signal is raised when the buffer is loaded.]
[Figure 6.9: The replacement logic of the WLRU CPU cache. A comparator takes the eight eight-bit weights w0 to w7; the valid bits v0 to v7 override its output so that invalid lines are chosen first; outputs o0, o1, o2 encode the victim line number, and the valid signal reports whether the victim is a valid line.]
Chapter 7
WLRU CPU Cache Simulator
Simulation is a commonly used technique for analyzing the performance of computer systems. A simulation using a memory trace as its input is called a trace based simulation [UM97]. A memory trace is the sequence of addresses of the memory references made during the execution of a program. Trace-based simulations have been a standard technique for studying CPU cache behavior since the 1960s [UM97]. There are a number of CPU cache simulators available, including Dinero IV1, Cheetah [Sug93], Tycho [Hil87] and Fast-Cache2. A detailed survey on trace based simulation of CPU caches can be found in [UM97]. This work did not use an existing CPU cache simulator for the following reasons. Existing CPU cache simulators focus on cache designs (e.g., the size, the line size, and the associativity of caches, and multiple levels of caches) with relatively little support for evaluating cache replacement algorithms. As a result, much of the analysis needed for comparing CPU cache replacement algorithms (e.g., victim analysis) is not included. This chapter describes the cache simulator designed and implemented for this work, which addresses these weaknesses of existing CPU cache simulators.

Trace based simulations are used in this work to compare the WLRU and LRU replacements. An object-oriented CPU cache simulator, written in Java, was developed that is capable of testing multiple cache configurations simultaneously. It generates miss rates and hit rates, performs victim analysis, and provides statistics. The cache simulator can be configured to mimic multi-threading environments, especially simultaneous multi-threading (SMT) [MCFT99]. SMT is a multi-threading technology used in multi-issue CPUs. Multi-issue CPUs can execute multiple instructions every cycle. In SMT, the CPU issues, in a single cycle, multiple instructions from multiple threads simultaneously.

1 http://www.cs.wisc.edu/~markhill/DineroIV
2 http://www.cs.duke.edu/~alvy/fast-cache/
7.1 Memory Trace Based CPU Cache Simulations
Trace based CPU cache studies consist of trace generation, trace reduction and trace processing stages [UM97]. In trace generation, the addresses of memory references and other information are dumped to files, referred to as memory trace files. Trace generation is done by hardware or by software simulation. The traces in the BYU trace archive [FNAG92] were generated using hardware. The challenge of trace generation is to find a representative workload. To study CPU caches, SPEC benchmarks are frequently used as a standard workload. In this work, web servers are used as workloads for network memory traces.
Memory traces are very large and need to be trimmed of extra information. This is called trace reduction. Good reductions may shrink the trace size by a factor of 10 to 100 [UM97]. In trace processing, a CPU cache simulator simulates one or more cache configurations. Addresses are read from the trace files and provided as input. The simulator generates results including miss rates, misses per instruction (MPI) and cycles per instruction (CPI) for the cache configurations.
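The trace processing stage reduces to a simple loop. A minimal sketch, assuming a trace file with one hexadecimal address per line and a direct-mapped cache model (real trace formats and the thesis' Java simulator differ):

```python
# Minimal direct-mapped cache model for a trace-driven simulation.

class DirectMapped:
    def __init__(self, sets=1024, line_bytes=32):
        self.sets, self.line_bytes = sets, line_bytes
        self.tags = [None] * sets

    def reference(self, addr):
        block = addr // self.line_bytes
        idx, tag = block % self.sets, block // self.sets
        hit = self.tags[idx] == tag
        self.tags[idx] = tag
        return hit

def simulate(trace_lines, cache):
    """Feed each traced address to the cache and return the miss rate."""
    hits = misses = 0
    for line in trace_lines:
        addr = int(line.strip(), 16)
        if cache.reference(addr):
            hits += 1
        else:
            misses += 1
    total = hits + misses
    return misses / total if total else 0.0
```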
Trace based CPU cache simulations are time consuming. There are many factors to consider in CPU cache design, including the cache size, the cache associativity, the cache line size, the replacement algorithm, the fetch policy, the write policy, split or unified caches, and multi-level caches. The number of combinations of these factors may exceed several million, so finding the best configuration for a workload may require hundreds of thousands of simulations. Each simulation may take tens of minutes to run, since memory traces usually contain millions of references. Depending on the size of the available main memory, several simulations can be run in parallel. In parallel simulations, several CPU caches are constructed and independently provided with the same inputs, and the miss rates and hit rates of the caches are gathered separately. The time spent reading the same memory trace is saved. Depending on the available memory, the simulator of this work can run up to eight simulations in parallel.
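Parallel simulation can be sketched as feeding the same address stream to several independently configured caches in one pass, so the trace is read only once (class and function names are illustrative):

```python
# Sketch of parallel simulation: several cache configurations consume
# the same reference stream in a single pass over the trace.

class DMCache:  # minimal direct-mapped cache model
    def __init__(self, sets, line_bytes=32):
        self.sets, self.line_bytes = sets, line_bytes
        self.tags = [None] * sets

    def reference(self, addr):
        block = addr // self.line_bytes
        idx, tag = block % self.sets, block // self.sets
        hit = self.tags[idx] == tag
        self.tags[idx] = tag
        return hit

def simulate_parallel(addresses, caches):
    misses = [0] * len(caches)
    for addr in addresses:
        for k, cache in enumerate(caches):  # same input to every cache
            if not cache.reference(addr):
                misses[k] += 1
    return [m / len(addresses) for m in misses]
```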
Multi-configuration simulation algorithms were developed to effectively compare a large number of cache configurations in one pass. Multi-configuration simulation is different from parallel simulation: in multi-configuration simulation, a single simulation run generates miss rate results for a set of cache configurations. Stack processing is the first and the most important multi-configuration simulation method [MGS+70]. Stack processing and its variants can achieve an overhead of less than 5% over single configuration simulations, but stack processing primarily assumes LRU replacement. For replacements other than LRU, multi-configuration simulations are not applicable, and this work does not use them.
7.2 Architecture of CPU Cache Simulator
This section describes the architecture of the CPU cache simulator developed in this work.
The architecture is graphically depicted in figure 7.1.
[Figure: the SimuEngine object reads trace files under a synthesizing policy and drives the CacheDevice objects (direct-mapped cache, set associative cache, two-level cache, multi-level cache), which produce the simulation results.]
Figure 7.1: The architecture of the CPU cache simulator.
7.2.1 SimuEngine Object and Trace Synthesizing
The simulator starts with the SimuEngine object opening a trace file. The SimuEngine then
carries out the trace synthesizing operation. A unique feature of the simulator is its trace
synthesizing ability. For most simulation tasks, there is only one trace file. However, since
multi-threading, especially Simultaneous Multi-threading (SMT) [MCFT99], is frequently used in
modern CPUs, there is a need for a simulator that supports multi-threading. The SimuEngine
object can mimic a multi-threading reference stream by merging multiple single-thread trace
files.
A true multi-threading memory trace is accurate but very difficult to generate. A compromise ap-
proach is trace synthesizing: to mimic a multi-threading memory trace, multiple single-
thread memory traces are merged into a single memory reference stream. A synthesized
memory trace reproduces the cache flushing effect of competing threads, but it cannot reflect the unseen inter-
dependencies among threads. A major shortcoming of synthesized memory traces is the ab-
sence of operating system activity.
To synthesize multi-threading traces, the SimuEngine uses a WorkLoad object to provide
the synthesizing policy guiding the merging of multi-thread traces. The simplest syn-
thesizing policy is to read in a round-robin manner from a set of single-thread memory trace
files. The SimuEngine also supports more complex reading policies that can model context switch
effects. For example, trace files are arranged into groups, where each group represents a set of
simultaneously running threads. The SimuEngine reads a specific number of references from a group
of trace files and then switches to another group. Inside a group,
the files are read in a round-robin manner. Figure 7.2 presents an example of a trace syn-
thesizing arrangement. It mimics an SMT CPU issuing two instructions from two
threads in a cycle. Traces A and B form a group, and traces C and D form another group. A
specific number of references are read from traces A and B. Context switching then occurs,
which results in reading a specific number of references from traces C and D.
[Figure: four single-thread traces A, B, C and D; A and B form one group, C and D another. The synthesized stream is A1 B1 A2 B2 A3 B3 C1 D1 C2 D2 ... Cn Dn.]

Figure 7.2: An example trace synthesizing scenario which includes context switching effects.
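The grouped round-robin policy above can be sketched as follows; the class and method names are illustrative, not the simulator's actual WorkLoad implementation, and the quantum parameter stands in for the "specific number of references" read before a context switch.

```java
import java.util.*;

public class TraceSynthesizer {
    // groups: each group is a list of single-thread traces; each trace is a
    // list of reference labels. `quantum` references are taken from a group
    // (round-robin across its traces) before switching to the next group.
    static List<String> synthesize(List<List<List<String>>> groups, int quantum) {
        List<String> merged = new ArrayList<>();
        int[][] cursor = new int[groups.size()][];
        for (int g = 0; g < groups.size(); g++)
            cursor[g] = new int[groups.get(g).size()];
        boolean progress = true;
        while (progress) {
            progress = false;
            for (int g = 0; g < groups.size(); g++) {   // context switch here
                List<List<String>> traces = groups.get(g);
                int taken = 0, t = 0;
                while (taken < quantum) {
                    int tried = 0;                       // skip exhausted traces
                    while (tried < traces.size()
                            && cursor[g][t] >= traces.get(t).size()) {
                        t = (t + 1) % traces.size();
                        tried++;
                    }
                    if (tried == traces.size()) break;   // whole group exhausted
                    merged.add(traces.get(t).get(cursor[g][t]++));
                    t = (t + 1) % traces.size();
                    taken++;
                    progress = true;
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<List<String>>> groups = List.of(
            List.of(List.of("A1", "A2", "A3"), List.of("B1", "B2", "B3")),
            List.of(List.of("C1", "C2", "C3"), List.of("D1", "D2", "D3")));
        // Reproduces the ordering of figure 7.2.
        System.out.println(String.join(" ", synthesize(groups, 6)));
    }
}
```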
A memory reference is encapsulated as an Addr object inside the SimuEngine. Addr objects
are sent to the CacheDevice objects together with an integer called seqTime, which represents the
time of the reference. It begins at zero and increases by one every time the SimuEngine
reads a memory reference record from a trace file. The integer seqTime loosely represents the
cycle time of the CPU. However, the exact execution timing of the CPU is unpredictable due to
branching, data dependencies, and cache misses. Thus, seqTime should not be interpreted
as the CPU execution time or the main memory access time.
7.2.2 CacheDevice Interface
CacheDevice is the interface of all CPU cache objects. CacheDevice is the abstraction of
all CPU caches, including direct-mapped and set associative caches, and single-level and
multi-level caches. The interface consists of the beginSimulation() method, the cyclePing()
method, the contextSwitching() method, and the endSimulation() method. When the SimuEngine
object opens the trace files and begins reading the first memory reference record, it calls the
beginSimulation() method of each CacheDevice object.
The most frequently called method of the CacheDevice interface is cyclePing(Addr a,
int seqTime). Each memory reference record read from a trace file is converted to an Addr
object. The SimuEngine then calls the cyclePing() method of each CacheDevice object. The
cyclePing() method represents a cache lookup operation on the CPU cache, during which the
CacheDevice object records whether there is a cache hit or miss and updates the content of
the cache if a replacement is needed.
When the SimuEngine object is configured to mimic multi-threading or multi-programming
environments, the contextSwitching() method is called when the SimuEngine switches to a new
thread. The default action of a CacheDevice in the contextSwitching() method is to report
the hit and miss rates of the thread or process being scheduled out.
At the end of the simulation, the SimuEngine calls the endSimulation() method of every CacheDe-
vice object. This signals the caches to report the simulation results, such as hit or miss rates,
and to process the bookkeeping records. The miss rates and bookkeeping information of each cache
are written to a text file. The bookkeeping information includes the stay time, idle time, and
hit count of every address. This is used in the victim analysis (see section 7.6).
For this work, direct-mapped and set associative caches, and one-level and multi-level
CPU caches were implemented. Figure 7.3 is the UML graph of the CacheDevice interface.
A two-level CPU cache object is implemented as two cache objects. Addr objects need
to be dispatched from the L1 to the L2 cache; the L2 cache is checked only when the L1
cache generates a miss.
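The interface described above might look roughly like this in Java; the signatures are paraphrased from the text, and the Addr and CountingCache classes are minimal stand-ins added to make the sketch runnable.

```java
public class CacheDeviceDemo {
    // Minimal stand-in for the simulator's Addr object.
    static final class Addr {
        final long tag;
        Addr(long t) { tag = t; }
    }

    // The CacheDevice interface as described in section 7.2.2.
    interface CacheDevice {
        void beginSimulation();
        // One cache lookup per memory reference; records hit/miss and
        // performs a replacement when needed.
        void cyclePing(Addr a, int seqTime);
        // Called when the SimuEngine switches threads; the default action is
        // to report the rates of the thread being scheduled out.
        void contextSwitching();
        void endSimulation();
    }

    // A trivial stub that just counts lookups, to show the call protocol.
    static final class CountingCache implements CacheDevice {
        int lookups = 0;
        public void beginSimulation() {}
        public void cyclePing(Addr a, int seqTime) { lookups++; }
        public void contextSwitching() {}
        public void endSimulation() {}
    }

    public static void main(String[] args) {
        CountingCache c = new CountingCache();
        c.beginSimulation();
        for (int t = 0; t < 5; t++) c.cyclePing(new Addr(t), t);
        c.endSimulation();
        System.out.println(c.lookups);
    }
}
```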
7.3 Cache Sets and Replacement Objects
Since the candidates for a replacement are limited to a single cache set, the replacement logic is
implemented in the cache set object. The abstract base class of all cache sets in set associa-
tive caches is the CacheSet class. The SetCache class implements the CacheDevice inter-
face and represents a set associative cache. The interaction with the SimuEngine is done in the
SetCache object. A CacheSet object is only concerned with the replacement of cache lines.

Figure 7.3: The UML graph of the CacheDevice interface.
A SetCache object constructs all cache sets and initializes the bookkeeping in the beginSimula-
tion() method, dispatches the Addr object to the appropriate cache set based on its address
in the cyclePing() method, and gathers the miss and hit results and other statistics, such as
the victim analysis data (see section 7.6), from each cache set in the endSimulation() method. Fig-
ure 7.4 is a flow chart of the cyclePing() method of the SetCache class. The replace() method
of the CacheSet class is called when the referenced() method returns false. This causes the
corresponding replacement logic of the CacheSet object to be invoked.
The CacheSet class is the base class of all cache sets, each of which may use a different
replacement algorithm. A CacheSet object consists of an array of CacheLine objects, with
each object encapsulating a CPU cache line. In trace based cache simulations, the data of a
CPU cache line is not of interest. Thus a CacheLine object stores only the address tag
and the bookkeeping information of a cache line.
The CacheSet interface has two methods related to replacement: referenced() and re-
place(). The referenced() method returns a Boolean value indicating the cache hit or miss
of a memory reference. The replace() method is called only when the referenced() method
returns false, indicating a cache miss. The replace() method returns a Victim object that is
[Flow chart: the index bits of the address map it to a single cache set; the referenced() method of that set is called with the Addr and seqTime arguments; on a hit (referenced() returns true) the weights of all cache lines of the set are updated; on a miss the weights are updated and the replace() method of the cache set is called.]

Figure 7.4: A flow chart of the cyclePing() method of class SetCache.
null if the cache set is not yet full. The Victim object returned by the replace() method con-
sists of the address tag of the cache line being replaced and a copy of the bookkeeping in-
formation of that line. Victim objects are gathered and put into a disk file for further
analysis. The separation of the referenced() and replace() methods is to support cache by-
passing [Smi82], where some cache misses do not cause replacements. In cache bypassing,
some main memory addresses are explicitly marked uncacheable. These addresses do not
replace existing cache contents.
Subclasses of the CacheSet class implement different replacement algorithms. Three LRU
replacements and a general WLRU replacement with a configurable weight formula are im-
plemented. Figure 7.5 is the UML graph of the CacheSet base class.
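The referenced()/replace() split can be illustrated with a toy LRU cache set; the LinkedList-based LRU order and the bare-bones Victim class here are simplifications, not the thesis code.

```java
import java.util.*;

// Sketch of the CacheSet contract: referenced() reports hit/miss only,
// and replace() is a separate step so that cache bypassing (a miss with
// no replacement) remains possible.
public class LruCacheSet {
    static final class Victim {
        final long tag;
        Victim(long t) { tag = t; }
    }

    private final int ways;
    private final LinkedList<Long> lines = new LinkedList<>(); // MRU at front
    private long pendingTag;

    LruCacheSet(int ways) { this.ways = ways; }

    // True on a hit (and refresh the LRU order); false on a miss.
    boolean referenced(long tag) {
        pendingTag = tag;
        if (lines.remove(tag)) { lines.addFirst(tag); return true; }
        return false;
    }

    // Called only after a miss; returns null while the set is not yet full.
    Victim replace() {
        Victim v = lines.size() < ways ? null : new Victim(lines.removeLast());
        lines.addFirst(pendingTag);
        return v;
    }

    public static void main(String[] args) {
        LruCacheSet set = new LruCacheSet(2);
        set.referenced(1); set.replace();
        set.referenced(2); set.replace();
        set.referenced(1);                 // hit: 1 becomes most recently used
        boolean hit = set.referenced(3);   // miss
        Victim v = set.replace();          // evicts the LRU line, tag 2
        System.out.println(hit + " " + v.tag);
    }
}
```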
7.4 WLRU Replacements
The WLRU subclass of CacheSet implements the WLRU replacement. The referenced()
method of the WLRU class recalculates the weights of all cache lines of the set:
the hit line has its weight increased, and the other lines have their weights deducted by one un-
less the weight is already zero. Figure 7.6 is the flow chart of the referenced() and replace()
methods of the WLRU class.
Each WLRU cache set is given a WeightFunction object, which controls how weights are
calculated. The increment of weights, the upper limit of weights, and the initial weight are
all held in the weight function object.
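A minimal sketch of the weight bookkeeping just described, assuming the decay/increment/limit behavior from the flow chart; the constructor parameters mirror the thesis' weight-formula notation (e.g. i64r256b4: increment 64, upper limit 256, initial weight 4), but the class itself is illustrative, not the simulator's WLRU code.

```java
public class WlruSet {
    final int increment, upperLimit, initialWeight;
    final long[] tags;
    final int[] weights;
    int used = 0;

    WlruSet(int ways, int increment, int upperLimit, int initialWeight) {
        this.increment = increment;
        this.upperLimit = upperLimit;
        this.initialWeight = initialWeight;
        tags = new long[ways];
        weights = new int[ways];
    }

    // On every reference the other lines decay by one (never below zero);
    // a hit line gains `increment`, capped at `upperLimit`.
    boolean referenced(long tag) {
        int hit = -1;
        for (int i = 0; i < used; i++) {
            if (tags[i] == tag) hit = i;
            else if (weights[i] > 0) weights[i]--;
        }
        if (hit >= 0) {
            weights[hit] = Math.min(upperLimit, weights[hit] + increment);
            return true;
        }
        return false;
    }

    // On a miss: fill an empty way, otherwise evict the minimal-weight line.
    // Returns the victim tag, or -1 if the set was not yet full.
    long replace(long tag) {
        int slot;
        long victim = -1;
        if (used < tags.length) {
            slot = used++;
        } else {
            slot = 0;
            for (int i = 1; i < tags.length; i++)
                if (weights[i] < weights[slot]) slot = i;
            victim = tags[slot];
        }
        tags[slot] = tag;
        weights[slot] = initialWeight;   // new lines start at the initial weight
        return victim;
    }

    public static void main(String[] args) {
        WlruSet set = new WlruSet(2, 4, 8, 2);  // 2 ways, inc 4, limit 8, init 2
        set.referenced(1); set.replace(1);
        set.referenced(2); set.replace(2);
        set.referenced(1);                      // hit: w(1) grows, w(2) decays
        set.referenced(3);                      // miss: everything decays
        System.out.println(set.replace(3));     // line 2 has the minimal weight
    }
}
```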
7.5 Optimal Replacement Algorithm
To meaningfully compare CPU cache replacement algorithms, the hit rates or miss rates
of the off-line optimal replacement algorithm are used for comparison. The optimal re-
placement algorithm (OPT) requires knowledge of future references. The miss rate of the optimal re-
placement algorithm is important in comparing cache replacements because it shows the potential for im-
provement. In Chapter 8, the miss rates of WLRU are compared with the miss rates of LRU,
and the miss rates of the optimal replacement algorithm are provided to show how much of
the possible improvement WLRU has achieved.
A naive implementation of the optimal replacement algorithm reads ahead over every future ref-
erence; it takes many days to finish a memory trace of ten million references. In this work,
two methods are used to reduce the simulation time of the optimal replacement algorithm.
First, the memory trace is split into smaller trace files by cache set before the simulation,
which significantly shortens each look-ahead. The references map-
ping to the same cache set are stored in a single trace file, and simulations are done on one cache
set at a time. The miss rates of all cache sets are averaged as the final miss rate. Second, a
Java database, jdbm, is used to store the entire reference history of every address. Instead of
looking ahead when making a replacement decision, the database is consulted for the time of
the next reference to the cache line. This significantly reduces the amount of disk ac-
cess. The use of a database is made practical by the splitting of the trace into cache
sets, which reduces each trace length by a factor of more than 1000; otherwise, no simple
database could easily hold the whole trace file or files. With these two optimizations, an op-
timal replacement simulation that would take days can be finished in half an hour.
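The database-backed OPT can be sketched as follows for a single cache set; a HashMap of per-address reference positions plays the role of the jdbm database here, and the fully associative set is a simplification for brevity.

```java
import java.util.*;

public class OptSim {
    static int countHits(long[] trace, int ways) {
        // Index the full reference history first: address -> its use positions.
        // (The thesis stores this index in a jdbm database.)
        Map<Long, ArrayDeque<Integer>> nextUse = new HashMap<>();
        for (int t = 0; t < trace.length; t++)
            nextUse.computeIfAbsent(trace[t], k -> new ArrayDeque<>()).add(t);

        Set<Long> lines = new HashSet<>();
        int hits = 0;
        for (long addr : trace) {
            nextUse.get(addr).pollFirst();        // consume the current use
            if (lines.contains(addr)) { hits++; continue; }
            if (lines.size() == ways) {           // evict the furthest next use
                long victim = -1;
                long furthest = -1;
                for (long l : lines) {
                    Integer n = nextUse.get(l).peekFirst();
                    long d = (n == null) ? Long.MAX_VALUE : n;
                    if (d > furthest) { furthest = d; victim = l; }
                }
                lines.remove(victim);
            }
            lines.add(addr);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(countHits(new long[]{1, 2, 3, 1, 2, 4, 1, 2}, 3));
    }
}
```

The point of the index is that a replacement decision is a lookup (peekFirst) rather than a forward scan of the remaining trace, which is what makes OPT tractable on long traces.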
7.6 Victim Analysis
Victims in CPU cache replacement refer to the cache lines being replaced. The CPU cache simulator
developed here performs statistical analysis on the victims of a replacement algorithm. These statistics are helpful
in evaluating the accuracy of replacements and may also provide hints for improving the
performance of a replacement algorithm.
In the CPU cache simulator, the bookkeeping information of a cache line, such as the hit count,
stay time, idle time, and the weight of the cache line when evicted, is recorded. This in-
formation is copied to a Victim object when the cache line is evicted from the cache. For
LRU replacement, which does not use weights, the weight information is empty. The statis-
tical analysis of the victim objects is called the victim analysis. Victim analysis currently in-
cludes the study of the distribution of the idle time, stay time, and hit counts of all cache lines.
Victim analysis is a unique feature of the simulator of this work. It is also one of the ma-
jor motivations for developing a CPU cache simulator from scratch. In section 8.5, we will
present victim analysis results on SPEC benchmark traces and network traces.
7.7 Validation of Simulator
The simulator in this work implements three cache replacement algorithms: LRU, WLRU,
and OPT (the optimal replacement algorithm). The simulator is validated in several ways.
The overall architecture of the simulator and the implementation of the LRU replacement
algorithm are validated by checking the hit rates against other cache simulators. The reading and
dispatching mechanism for memory trace records and the LRU cache simulation are compared
with a cache simulator written in C on Unix [3]. The two simulators generate exactly the same
results for the BYU traces.
For OPT, two versions of the optimal replacement algorithm were implemented in this work.
One version uses the traditional read-ahead approach, and the other uses a database.
The results of the two implementations are the same. The first version of OPT
reads ahead in the trace file to decide replacements, but reading ahead is very
slow (see section 7.5). The database version of OPT is much faster and is the one used in the sim-
ulations. The only purpose of the look-ahead version of OPT is to check the correctness of
the database version. The two versions of OPT do not share any common code.

3. http://traces.byu.edu/new/Tools/b2asrc/byucache.c
Since the WLRU replacement algorithm is new, there is no other simulator to check it against.
A set of test cases is applied to the WLRU simulator to ensure its correctness. The
test cases are manually generated and checked. Besides test cases, there are also a large
number of assertions in the code of WLRU, such as assertions on the calculation of the weight of a
cache line, to ensure that the WLRU cache always has the correct status.
7.8 Summary
Trace based simulation is the main tool for studying CPU caches. The main challenge is to gen-
erate accurate memory traces. The simulator developed in this work is written in Java and
can be easily extended. The simulator contains a set of unique features: a fast imple-
mentation of the optimal replacement algorithm and the victim analysis are two important
contributions of the simulator. The victim analysis is very useful in studying the detailed behav-
ior of CPU cache replacement algorithms.
Figure 7.5: The UML graph of the CacheSet class, which is the base class of all replacements.
START START
Compare the address tags of cache lines with the incoming address tag
Choose the cache line with the
minimal weight.
Increase the weight of the hit cache line and deduce the weights of other lines
by one.
Load the first empty line with
new address tag.
Found?
Empty lines?
Replace the cache line with new address tag.
Assign initial weight to the new
cache line.
END
Return true.
Deduce the weights of all lines by one.
Return false.
END
Y
Y
N
N
A B
Figure 7.6: The flow chart of thereferenced()andreplace()method ofWLRUclass.
Chapter 8
Simulation Results
This chapter presents the results of the simulations.
8.1 Experimental Design
The first goal of the experiments is to compare the WLRU, LRU and OPT (optimal) replace-
ment algorithms. The metric used for the comparison is the hit rate. The second goal of the
experiments is to gain an understanding of the behavior of each replacement algorithm through
victim analysis.
The factors of the experiments are divided into three groups: WLRU weight formula pa-
rameters, the workloads, and the CPU cache configurations. The workloads include web
servers, SPEC benchmarks, and the combination of the two through multi-threading. The full
combination of all these factors would result in a very large number of experiments, requiring
effort better spent elsewhere. However, WLRU is a new cache replace-
ment algorithm whose design space has not been investigated; if only a small fraction of the
combinations were tested, important discoveries might be missed. Due to these considerations,
the following approach is used in the experimental design of this work: for the WLRU weight
formula parameters and the web server workloads, a full combination of factors is tested.
The advantage of this experimental design is that it provides a nearly thorough examination
of WLRU while keeping the number of experiments manageable.
The factors associated with CPU cache configurations include the cache size, the associativity
of the cache, the cache line size, the number of cache levels, and the size and associativity of each level
(a complete list can be found in chapter 2). The number of combinations is high. However,
constraints such as circuit costs limit the choice of CPU cache configurations, so only a
small number of the most common CPU cache configurations are tested.
Since current CPUs usually use two-level CPU caches, the CPU cache configurations used
in the experiments are two-level caches. Due to the overwhelming locality exhibited in the
L1 memory reference stream, LRU is nearly optimal for the L1 cache if the L1 cache is small
and has low associativity (see chapter 3). WLRU is therefore used only in the L2 cache; the L1 cache
uses LRU. Since the number of pins on a CPU is a primary constraint [FH05], larger cache lines
require more pins and impose hard challenges on CPU designs. Almost all current CPUs use
32 byte cache lines [1]. All experiments assume 32 byte cache lines.
Three sizes of the L1 cache are tested: 8 KB, 16 KB and 32 KB. Since the L1 cache needs
to be accessed in one or two cycles, current CPUs seldom have L1 caches larger than 64 KB.
For higher frequencies, current CPUs tend to have L1 caches of smaller sizes and lower
associativities, such as two or four way. For example, the Intel Pentium 4 Prescott CPU has a
16 KB L1 data cache [2], and the Sun Sparc T1 has a 16 KB L1 instruction cache and an 8 KB
L1 data cache [3]. The L1 cache in the experiments is a two-way set associative cache. For
most workloads, four-way set associative L1 caches were
also experimented with. The results of these experiments show no differences in the
hit rates of the L2 cache between two-way and four-way set associative L1 caches.
These results can be found in appendix A.
Since the L2 cache has a longer delay, typically 10 CPU cycles, it can support higher associa-
tivities. 16-way set associative caches are frequently used in current L2 caches, and associativities higher than
16 are seldom seen. The L2 cache in the experiments is a 16-way
set associative cache. Two L2 cache sizes, 128 KB and 256 KB, are tested. Larger
L2 cache sizes are not tested due to the limited length of the memory traces: large L2
caches, such as 1 or 2 MB, require long memory traces of 100 million to 1 billion refer-
ences; otherwise, a large L2 cache may have empty entries. Memory traces of 1 billion
references would be more than 20 GB long, even in a compact format. The disk storage for
all 32 web server memory traces already exceeds 100 GB. Simulations also need storage for
data such as index files and trace splits (see section 7.5) and space for the databases of the
optimal replacement algorithm; a disk array of 1000 GB would be required. The simulation
1. http://www.geek.com/procspec/
2. http://www.geek.com/procspec/intel/prescott.htm
3. http://www.sun.com/processors/UltraSPARC-T1/specs.xml
time also increases sharply. At this stage, only 128 KB and 256 KB L2 cache sizes
are tested. In fact, L2 caches of 128 KB and 256 KB are common configurations
for current low end CPUs; for example, the Intel Pentium 4 Willamette CPUs have an L2
cache of 256 KB. Only recently have CPUs had L2 caches larger than 512 KB. The total
number of CPU cache configurations in the experiments is 3 × 2 = 6.
WLRU weight formulas have three configurable parameters: the initial weight,
the increment of weights, and the upper limit of weights. In the experiments, these param-
eters are set from 1 to 1024, increasing exponentially, so each parameter of the WLRU for-
mula has 11 settings. For an upper limit of weights of value n, where n = 2^i, the number of
combinations of the increment of weights and the initial weight is (i + 1)^2. The total num-
ber of WLRU weight formulas tested for each pair of workload and cache configuration is
Σ_{i=0}^{10} (i + 1)^2 = 506.
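The count can be checked mechanically; this throwaway snippet just sums (i+1)^2 over the eleven upper-limit settings.

```java
public class FormulaCount {
    // For an upper limit of 2^i there are i+1 power-of-two choices
    // (2^0 .. 2^i) for both the increment and the initial weight,
    // giving (i+1)^2 combinations per limit.
    static int count() {
        int total = 0;
        for (int i = 0; i <= 10; i++)   // upper limits 1, 2, 4, ..., 1024
            total += (i + 1) * (i + 1);
        return total;
    }

    public static void main(String[] args) {
        System.out.println(count());    // 506
    }
}
```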
The number of web server traces generated and tested is 32 (see chapter 4). The web server
traces use static web pages to eliminate the impact of disk I/O. Two kinds of web servers,
Apache and thttpd, are tested. The total number of simulation experiments on the web server
traces is 32 × 6 × 506 = 97,152.
For the SPEC benchmarks, due to time limitations, only a small number of WLRU weight for-
mulas are tested; in future work, all WLRU weight formulas will be tested against the SPEC
benchmark traces. The tested weight formulas are chosen because they result in good
improvements in hit rates for the web servers. Some other arbitrarily chosen weight formulas
are also tested to observe the trend of hit rate changes for different settings of the WLRU weight
formula parameters. For each SPEC benchmark trace, about 10 to 20 weight formulas are
tested. The cache configurations are the same as those used for the web server memory traces.
WLRU is designed to exploit the unique locality features of web server memory references.
The goal of the experiments using the SPEC benchmarks is to show that the speedup of WLRU
on web servers does not harm traditional applications, as represented by the SPEC bench-
marks. If WLRU does not show noticeably lower hit rates than LRU for a large number of
SPEC benchmarks, WLRU can be said to be a success in improving the CPU cache perfor-
mance of web servers.
It is possible that, in a multi-threaded environment, both networked applications and traditional
applications are running. Experiments are designed to test the performance of WLRU on
multi-threading computing platforms by synthesizing the SPEC benchmark traces and the
web server traces (see section 7.2.1). Each SPEC benchmark trace is synthesized with the
same web server memory trace. The synthesized multi-threading traces are simulated on
a CPU cache configuration of a two-way 32 KB L1 cache and a 16-way 128 KB L2 cache.
This CPU cache configuration is the same as one of those used in the web server and SPEC
benchmark experiments. The weight formulas that were tested on the SPEC benchmarks were
also tested in the multi-threading simulations. The hit rates of the WLRU, LRU and OPT replace-
ment algorithms on the multi-threading experiments are compared.
For every combination of CPU cache configuration and workload, both
web servers and SPEC benchmarks, the hit rate of the optimal replacement algorithm is gen-
erated. The number of optimal replacement experiments is 6 × (32 + 26) = 348. The hit rates of OPT
can be found in appendix A. The number of experiments on SPEC benchmarks and multi-
threaded workloads, including WLRU, LRU and OPT, is more than 1,500.
8.2 WLRU on Web Server Memory Traces
The experiments do not show that a single weight formula always has the best hit rate for
all cache configurations and web server memory traces. However, the experiments show
that the same small number, 8 to 12, of weight formulas are always among the best for all
combinations of cache configurations and web server traces. Of them, weight formula
i64r256b4 has the most consistent performance across all scenarios: it
has the best hit rates for nearly half of the combinations of cache configurations and web
server memory traces, and for the remaining half, the difference between i64r256b4 and the
weight formula with the best hit rate is within 1%. Weight formula i64r256b4 is therefore chosen for
presentation; the hit rates of the other weight formulas can be found in appendix A. The hit rate
of WLRU i64r256b4 is compared with the hit rates of the LRU and OPT replacement algorithms.
Figures 8.1 to 8.4 compare the miss rates of the WLRU i64r256b4 and LRU replacement
algorithms on the web server memory traces. The CPU cache configurations in the figures
are two-level caches with a two-way 32 KB L1 cache and 16-way 128 KB and 256 KB L2
caches.
The reason that WLRU i64r256b4 shows significant improvement over LRU for web servers
can be found in the IRG distributions of the web server memory traces (see chapter 4). Table 4.6
shows that web server memory traces have a smaller percentage of small IRG values
(less than 16) and a higher percentage of large IRG values (larger than 256) than the SPEC bench-
marks. There is also a higher percentage of IRG values of medium size, 32 to 256, in the web
server memory traces. The medium sized IRG gaps are likely to result in cache misses under
[Figure: four panels (Apache, 20K, L2 128KB; Apache, 20K, L2 256KB; Apache, 200K, L2 128KB; Apache, 200K, L2 256KB) plotting miss rate against request rates of 50, 90, 120 and 150 for LRU, WLRU and OPT.]

Figure 8.1: Comparison of miss rates of WLRU, LRU and Optimal on Apache traces.
LRU. Under WLRU i64r256b4, the small initial weight evicts addresses of low caching value
faster than LRU does. The result is that these medium sized IRG gaps are more likely to become
cache hits.
WLRU shows better improvements over LRU when the web page size is small. The dif-
ferences in the miss rates of WLRU and LRU are larger when the L2 cache size is 128 KB
instead of 256 KB. The figures show that WLRU has a higher improvement in hit rates for
the thttpd web server than for the Apache web server, especially when the web
page sizes are small: there are nearly 50% fewer L2 cache misses. For large web pages, the
improvement of WLRU for thttpd is 5%. The improvements of WLRU on the Apache server
are less sensitive to web page sizes. The different behaviors of the thttpd and Apache servers under
[Figure: four panels (Apache, mixed1-4, L2 128KB; Apache, mixed1-4, L2 256KB; Apache, mixed1-9, L2 128KB; Apache, mixed1-9, L2 256KB) plotting miss rate against request rate for LRU, WLRU and OPT.]

Figure 8.2: Comparison of miss rates of WLRU, LRU and Optimal on Apache traces with mixed web page sizes.
WLRU can be traced to their different locality features, as indicated by the distributions
of reference counts and IRG values (see chapter 4). The two ways of handling concurrent
connections in web servers have an impact on the temporal locality of the servers. The
results of this work show that the two kinds of web servers, thttpd and Apache, call for
different management strategies.
For the Apache web server, WLRU always has more than 15% fewer L2 cache misses than
LRU, except for trace a200kr50. Trace a200kr50 has good hit rates, almost optimal, for both
WLRU and LRU. This is because trace a200kr50 has very good temporal locality, as reflected
by its average reference count of more than 200; the other web server memory traces have
average reference counts of no more than 60.
[Figure: four panels (thttpd, 20K, L2 128KB; thttpd, 20K, L2 256KB; thttpd, 200K, L2 128KB; thttpd, 200K, L2 256KB) plotting miss rate against request rate for LRU, WLRU and OPT.]

Figure 8.3: Comparison of miss rates of WLRU, LRU and Optimal on thttpd traces.
8.3 WLRU on SPEC CPU2000 Benchmarks
WLRU weight formula i64r256b4, which shows significant improvement for web servers,
is tested on all SPEC CPU2000 benchmarks. Figure 8.5 compares the WLRU and
LRU miss rates of the SPEC integer programs, and figure 8.6 compares the WLRU and LRU
miss rates of the SPEC floating point programs. The simulation results in figures 8.5 and 8.6
are for a two-level cache configuration; WLRU replacement is used only in the L2 cache.
For eight SPEC benchmarks, WLRU has higher hit rates than LRU. For three SPEC bench-
marks, ammp, applu and art, WLRU and LRU are almost exactly the same. For the remain-
ing 15 SPEC benchmarks, LRU has slightly better hit rates than WLRU. For the SPEC INT
benchmarks, LRU has slightly better hit rates than WLRU, but both LRU and WLRU have
near optimal hit rates. WLRU is slightly better than LRU for most of the SPEC FLT benchmarks.

[Figure: four panels (thttpd, mixed1-4, L2 128KB; thttpd, mixed1-4, L2 256KB; thttpd, mixed1-9, L2 128KB; thttpd, mixed1-9, L2 256KB) plotting miss rate against request rate for LRU, WLRU and OPT.]

Figure 8.4: Comparison of miss rates of WLRU, LRU and Optimal on thttpd traces with mixed web page sizes.

It is believed that the SPEC FLT benchmarks have worse locality than the SPEC
INT benchmarks [CH01]. This is reflected in the hit rates. The miss rates of WLRU and
LRU for most of the SPEC INT benchmarks are well below 1%.
The reason that WLRU with weight formula i64r256b4 has slightly worse hit rates than
LRU for some SPEC benchmarks is that addresses with high reference counts represent a
larger portion of the memory references in these benchmarks. The initial weight of
weight formula i64r256b4 is set too low, which results in evicting new cache contents too
early. If provided with a larger initial weight, for example b = 32, WLRU has higher hit
rates. An initial weight of 32 keeps new cache contents a little longer, but still for much
shorter than LRU does.
[Figure: miss rates of the SPEC integer benchmarks bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex and vpr under LRU, WLRU and OPT.]

Figure 8.5: Comparison of miss rates of OPT, LRU and WLRU on SPEC integer benchmarks.
Figures 8.7 and 8.8 compare the miss rates for the SPEC CPU2000 benchmarks when using
WLRU i64r256b4 and WLRU i64r256b32 with LRU and OPT. The figures show that WLRU
i64r256b32 noticeably improves the hit rates of WLRU for the SPEC benchmarks when com-
pared with WLRU i64r256b4. Four more SPEC benchmarks, applu, art, mcf and perl, have
WLRU outperforming LRU. For two SPEC benchmarks, eon and ammp, the hit rates of
WLRU and LRU are almost exactly the same. With i64r256b32, WLRU outperforms LRU
on more than half of the SPEC CPU2000 benchmarks. For web servers, however, WLRU i64r256b32
is worse than i64r256b4 (see appendix A).
The SPEC benchmark swim shows a noticeable difference between the hit rates obtained using WLRU
and LRU. This is the result of swim having a large number of addresses referenced only
twice, with small IRG values. For example, in cache set 0 of swim,
the majority, 164 out of 211, of the IRG values of the addresses referenced twice are less
than 32, with the remaining 44 IRG values less than 64. These IRG values are smaller than
the IRG values of addresses with low reference counts in the web server traces and the other SPEC
benchmarks (see tables 5.1 and 5.2). The SPEC benchmark swim has a unique memory
reference pattern, and WLRU i64r256b4 keeps these addresses longer than LRU would.
Figure 8.6: Comparison of miss rates of OPT, LRU and WLRU on SPEC floating point benchmarks.
Table 8.1 shows the IRG values of some of the addresses of the SPEC benchmark swim that
map to cache set 0. There are only five address tags in set 0 of swim that have reference
counts of 20 or more. One address tag has a reference count as high as 1618.
8.4 WLRU Performance on Multi-threaded Workloads
WLRU shows significant improvements over LRU for web servers. For SPEC benchmarks,
the difference between WLRU and LRU is small, within 5%. Ideally WLRU would replace
LRU in CPUs used in machines where there is a mix of applications running. This section
describes results for experiments that used a mix of SPEC benchmarks and web server memory
traces.
In the simulations, one copy of a SPEC benchmark trace and one copy of a web server trace are
synthesized to simulate a simultaneous multi-threading CPU. There are 26 SPEC benchmarks
and 32 web server traces. The number of combinations of SPEC benchmarks and
web servers is 26 × 32 = 832. This work tested all SPEC benchmarks on a single web
address  ref count     1     2    <4    <8   <16   <32   <64  <128  <256  <512
351a00        1618   785   780    33     5     2    12     0     0     0     0
26a00           32     0     0     0    30     0     0     1     0     0     0
7e3400          32     2     7     6     2     2     2     0     1     0     9
398200          27     1     7     5     2     1     0     0     1     1     8
54f900          20     0     1     3     2     2     0     1     1     0     9
......
2aa300           2     0     1     0     0     0     0     0     0     0     0
2aa500           2     0     1     0     0     0     0     0     0     0     0
2d6a00           2     0     0     0     0     0     1     0     0     0     0
2d6b00           2     0     0     0     0     0     1     0     0     0     0
31a300           2     0     0     0     0     0     1     0     0     0     0
31a400           2     0     0     0     0     0     1     0     0     0     0
Table 8.1: The IRG values of address tags with reference count of two in set 0 of SPEC benchmark swim.
Figure 8.7: Comparison of miss rates of OPT, LRU and WLRU on SPEC integer benchmarks, where WLRU uses i64r256b32.
server trace t20kr90, since t20kr90 shows the best improvement of WLRU over LRU.
Figures 8.9 and 8.10 show the miss rates of the LRU, WLRU and OPT replacement algorithms
on a two-level cache. The weight formula used by WLRU in the figures is i64r256b4, which
shows significant improvement for web server traces (see section 8.2). The web server memory
trace used in the figures is t20kr90. The CPU cache has a two-way 32KB L1 cache
and a 16-way 128KB L2 cache. For most of the SPEC benchmarks, WLRU has more than
25% fewer L2 cache misses than LRU. The improvement of WLRU over LRU in multi-threading
simulations is consistent and not affected by the choice of SPEC benchmarks.
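The synthesized multi-threading traces above combine one SPEC trace with one web server trace. The merge policy is not specified in this chapter, so the sketch below assumes a simple round-robin interleave; the chunk parameter (how many references a thread issues before switching) is likewise an assumption.

```python
def interleave(trace_a, trace_b, chunk=1):
    """Round-robin merge of two memory reference traces, mimicking an SMT
    CPU that alternates between two threads. The interleaving policy is an
    assumption; the thesis does not state how the traces were combined."""
    out, i, j = [], 0, 0
    while i < len(trace_a) or j < len(trace_b):
        out.extend(trace_a[i:i + chunk]); i += chunk
        out.extend(trace_b[j:j + chunk]); j += chunk
    return out
```

With chunk = 1, interleave([1, 2, 3], [9, 8]) produces [1, 9, 2, 8, 3]; a larger chunk models coarser-grained switching between the two threads.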
8.5 Comparison of LRU and WLRU Using Victim Analysis
The victim analysis applied to SPEC benchmark and web server memory traces shows that
LRU and WLRU have different behaviors. As expected, victim analysis shows that WLRU
Figure 8.8: Comparison of miss rates of OPT, LRU and WLRU on SPEC floating point benchmarks, where WLRU uses i64r256b32.
Figure 8.9: The miss rates of LRU, WLRU and OPT replacements on multi-threading SPEC INT benchmarks.
Figure 8.10: The miss rates of LRU, WLRU and OPT replacements on multi-threading SPEC FLT benchmarks.
evicts addresses that will not be referenced again faster than LRU does. Victim analysis shows that
WLRU is better than LRU at keeping addresses with high cache value. Victim analysis is
helpful in understanding the decision-making process of a cache replacement algorithm
and in fine-tuning it. The differences
between the locality qualities of SPEC benchmarks and web servers are also reflected in the
victim analysis.
Victim analysis is the empirical analysis of the hit counts, stay time and idle time of victims.
The difference between LRU and WLRU is manifested in the distribution of victim
hit counts. Table 8.2 shows the distribution of the hit counts of victims of LRU and WLRU
replacement on web server trace t20kr50. The distributions of hit counts for other web server
traces are provided in appendix A. LRU replacement has more cache misses than WLRU
and thus more victims. The table shows that LRU evicts a large number, more than 20%,
of victims that are referenced more than once, and a significant number of victims are
evicted even though these addresses are referenced more than a hundred times. In comparison,
98% of the victims under WLRU replacement are not hit at all, and there are almost no victims
whose hit counts are more than a hundred. The distribution of hit counts of victims
makes the cache flushing effect of LRU replacement obvious. LRU is not good at
keeping high value cache contents.
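The power-of-two bucketing used in the victim hit-count tables can be reproduced with a short helper. The sample victim list in the usage note is made up for illustration, not taken from the thesis traces.

```python
from collections import Counter

def hit_count_bins(victim_hits, max_bin=1024):
    """Bucket victim hit counts the way tables 8.2 and 8.3 do:
    exact bins for 0, 1 and 2, then the smallest power of two
    >= the count (4, 8, ..., max_bin)."""
    bins = Counter()
    for h in victim_hits:
        if h <= 2:
            bins[h] += 1
        else:
            b = 4
            while b < h and b < max_bin:
                b *= 2  # climb to the smallest power of two >= h
            bins[b] += 1
    return bins
```

For example, hit_count_bins([0, 0, 1, 3, 9, 100]) puts the two never-hit victims in bin 0, the count of 3 in bin 4, 9 in bin 16, and 100 in bin 128, matching the column layout of the tables.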
Table 8.3 shows the distribution of victim hit counts of the SPEC benchmark crafty. The distributions
of hit counts of other SPEC benchmarks can be found in appendix A. Although WLRU
keeps high reference count addresses longer, table 8.3 shows that the distributions of victim
hit counts of WLRU and LRU are quite similar. This is because SPEC benchmarks such as
crafty have different temporal locality than web servers. Re-references to the same address
happen within a short time. This is reflected in the L2 IRG distributions of SPEC benchmarks:
SPEC benchmarks have fewer medium-sized L2 IRG gaps than the web
server traces. The IRG values are so small that both LRU and WLRU can result in cache
hits for these IRG values. There is no additional value in WLRU keeping a previously hit
cache line longer than LRU does. For some SPEC benchmarks, WLRU has slightly higher miss
rates. This is because the small initial weight of WLRU evicts new cache contents too early.
The idle time and cache stay time of victims are important indicators of the accuracy of a
replacement algorithm. The idle time is defined as the time span between the last hit or reference
to a cache line and its eviction. The stay time is the time between a cache line being
loaded and evicted. Using the distributions of idle time and stay time of the optimal replacement
as a target, better replacement algorithms should have smaller differences in idle
time and stay time from the optimal replacement. Figures 8.11 and 8.12 present the distributions
of idle time and stay time of LRU, WLRU and OPT on web server trace t20kr50.
Figures 8.13 and 8.14 show the distributions of idle time and stay time of the LRU, WLRU and
OPT replacement algorithms on web server trace t20kr90. The CPU cache configuration
in the figures is a two-level cache with a two-way 32KB L1 cache and a 16-way 128KB L2
cache. The weight formula of the WLRU replacement in the figures is i64r256b4. The idle
time and stay time statistics for other cache configurations and memory traces are provided
in appendix A.
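The two definitions above translate directly into bookkeeping over a cache simulator's event stream. The sketch below assumes the simulator can emit (time, tag, kind) events with time counted in references, as on the figures' axes; the event format itself is hypothetical.

```python
def victim_times(events):
    """Compute (stay_time, idle_time) per victim from cache events.

    events: (time, tag, kind) tuples, kind in {"load", "hit", "evict"},
    time counted in references. Per the definitions in the text:
    stay time = eviction - load; idle time = eviction - last reference.
    """
    loaded, last_ref, out = {}, {}, []
    for t, tag, kind in events:
        if kind == "load":
            loaded[tag] = last_ref[tag] = t  # loading counts as a reference
        elif kind == "hit":
            last_ref[tag] = t
        elif kind == "evict":
            out.append((t - loaded.pop(tag), t - last_ref.pop(tag)))
    return out
```

For a line loaded at reference 0, last hit at reference 3 and evicted at reference 10, this yields a stay time of 10 and an idle time of 7; histogramming these pairs gives distributions of the kind shown in figures 8.11 through 8.16.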
For web server traces, the idle time and stay time distributions of WLRU are much closer
to the distributions of the optimal replacement than the idle time and stay time distributions
of LRU are. The idle times of addresses under LRU replacement are packed around 32 references.
This is the result of LRU replacement keeping an address in the cache for at least a time
span equal to the associativity of the cache set. This minimal stay time of LRU is too long,
since a large number of addresses will not be re-referenced. As the associativity of modern
CPU caches tends to increase, LRU keeps the addresses that will not be re-referenced even
longer. The victim analysis shows that WLRU can limit the stay time of the not-to-be-re-referenced
addresses.
However, for SPEC benchmarks, the difference between the distributions of idle time and
stay time of victims of LRU and WLRU is not manifested as much as it is in the distributions
of web server memory traces. Figures 8.15 and 8.16 present the distributions of idle time
LRU (total victims: 388276)
Hit count        0      1      2      4     8    16    32    64   128   256   512  1024
#Victims    302502  47276  13282  11237  9359  3011  1051   357   131    50    17     3
%            77.91  12.18   3.42   2.89  2.41  0.78  0.27  0.09  0.03  0.01   0.0   0.0

WLRU (total victims: 231291)
Hit count        0      1      2      4     8    16    32    64   128   256   512  1024
#Victims    228430   1178    318    258   504   319   205    65    12     2     0     0
%            98.76   0.51   0.14   0.11  0.22  0.14  0.09  0.03  0.01     0     0     0
Table 8.2: The distribution of victim hit counts of the WLRU and LRU replacements on network trace t20kr50.
LRU (total victims: 60121)
Hit count       0     1     2     4     8    16    32    64   128   256   512
#Victims    47007  4364   803  1125  1176   824   357   313    40    12     4
%           78.19  7.26  1.34  1.87  1.96  1.37  0.59  0.52  0.07  0.02  0.01

WLRU (total victims: 63572)
Hit count       0     1     2     4     8    16    32    64   128   256   512
#Victims    51461  3942   952  1000   901   560   303   290    45    14     8
%           80.95  6.20  1.50  1.57  1.42  0.88  0.48  0.46  0.07  0.02  0.01
Table 8.3: The distribution of victim hit counts of the WLRU and LRU replacements on SPEC benchmark crafty.
Figure 8.11: The distributions of idle time of WLRU, LRU and OPT replacements of network trace t20kr50.
and stay time of SPEC benchmark crafty. The figures show that the curves of idle and stay
time for the OPT, WLRU and LRU replacement algorithms have nearly the same shape.
The OPT curve is leftmost, the LRU curve is rightmost, and the WLRU curve lies slightly
to the left of the LRU curve. This is the result of SPEC benchmark crafty having
good locality. The WLRU and LRU replacement algorithms are very similar to the optimal
replacement algorithm. Both WLRU and LRU have almost the same hit rates, which
are very close to the optimal hit rate.
The idle time and stay time distributions of the OPT replacement algorithm for the SPEC
benchmark crafty are very different from those of the web server memory traces t20kr50 and
t20kr90. For web server memory traces, the idle time and stay time distributions of the optimal
replacement are much flatter than those of the SPEC benchmark. This is because web server traces
have more addresses with medium-sized IRG values (see chapter 4).
Victim analysis is a very powerful tool for studying the details of a cache replacement algorithm.
It also reveals information about the temporal locality of programs. It
has been observed for a long time that the locality of network protocols and applications is
different from that of other computation intensive programs such as SPEC benchmarks (see chapter
2). However, no previous studies showed detailed differences. Victim analysis shows that the
curves of the distributions of victim idle and stay time are very different for web servers
and SPEC benchmarks. Victim analysis shows that LRU is near optimal for SPEC benchmarks.
This will affect the micro-architecture design philosophies of CPUs targeting SPEC
benchmarks.

Figure 8.12: The distributions of stay time of WLRU, LRU and OPT replacements of network trace t20kr50.
8.6 Summary
WLRU shows significant improvements over LRU for web server memory traces. It is found
that the best weight formulas for web server memory traces are those with a small initial
weight and large weight increments and upper limits of weights. An example of such weight
formulas is i64r256b4. In the best cases, WLRU i64r256b4 has more than 50% fewer L2
cache misses than LRU for web servers. For SPEC CPU2000 benchmarks, the difference
between WLRU i64r256b4 and LRU is small. If WLRU uses weight formulas with higher initial
weights, such as i64r256b32, the hit rates of WLRU on SPEC CPU2000 benchmarks
improve. The advantage of WLRU is that the speed-up of WLRU for web servers does not
harm traditional applications such as the SPEC CPU2000 benchmarks. Simulation on synthesized
multi-threading memory traces shows WLRU consistently has nearly 25% fewer L2
cache misses.
Victim analysis shows that the distributions of hit counts, stay time and idle time of victims
of WLRU are closer to the distributions of victims of OPT than those of LRU are. Victim analysis
Figure 8.13: The distributions of idle time of WLRU, LRU and OPT replacements of network trace t20kr90.
also shows that the distributions of hit counts, stay time and idle time of victims of the OPT
replacement algorithm are different for SPEC benchmarks and web server memory traces.
This implies that web servers have different temporal locality patterns than SPEC benchmarks.
SPEC benchmarks have better temporal locality than web servers. This is reflected in the
distributions of victim stay time and idle time of the OPT replacement. The distributions of the web
servers are flat, but the distributions of the SPEC benchmarks clearly show a peak.
WLRU does not show as large an improvement for SPEC CPU2000 benchmarks as it
does for web servers. The difference in the hit rates of WLRU and LRU for SPEC CPU2000
benchmarks is small, with WLRU slightly better than LRU for half of the SPEC CPU2000
benchmarks. Although the use of WLRU results in higher miss rates for some of the SPEC
CPU2000 benchmarks, it was observed that the miss rate for the optimal replacement algorithm
is not that much lower. It is observed that WLRU on SPEC CPU2000 benchmarks
results in better hit rates compared to other alternative
replacement algorithms. For example, the improvements in hit rates of WLRU on SPEC
CPU2000 benchmarks are better than the hit rates reported for the AIP/LvP algorithms [KS05]. The
experiments in [KS05] used L2 caches that were 512 KB and 1 MB in size. In this work the
results are based on using L2 caches of 128 KB and 256 KB. Since the primary shortcoming
of WLRU on SPEC CPU2000 benchmarks is the small initial weight, if given larger
L2 cache sizes, new cache lines can stay longer with small initial weights, and WLRU can
Figure 8.14: The distributions of stay time of WLRU, LRU and OPT replacements of network trace t20kr90.
show even more improvement.
Figure 8.15: The distributions of idle time of WLRU, LRU and OPT replacements of SPEC benchmark crafty.
Figure 8.16: The distributions of stay time of WLRU, LRU and OPT replacements of SPEC benchmark crafty.
Chapter 9
Conclusions and Future Research
9.1 Conclusions
In this work, empirical analysis is applied to the memory reference traces of SPEC CPU2000
benchmarks and web servers. The existence of the property of temporal locality
is supported by the distribution of per set IRG values. The majority of per set IRG values
are very small. This is especially true at the L1 cache, where 90% of per set IRG values are of
size one. At the L2 cache, per set IRG values are still small. This provides strong evidence
of temporal locality. However, the study presented in this work also shows that a large portion
of addresses have low reference counts. At the L2 cache, nearly 50% of addresses are
referenced only once, and nearly 90% are referenced under ten times. This pattern is called
the property of short lifetime. The property of short lifetime suggests that LRU cache replacement
is not optimal for programs that have a large number of addresses with low reference
counts. One class of such applications is network applications and protocols, such as web
servers. The distributions of reference counts and per set IRG values of web server traces
are found to be different from the distributions of SPEC benchmarks. The L2 IRG distributions
of web server memory traces have fewer small IRG values and more medium-sized
IRG values. This analysis led to the development of a new CPU cache replacement algorithm called
WLRU.
WLRU addresses the shortcoming of LRU by differentiating addresses. It is found that the
per set IRG values of an address are correlated with the reference count of the address. Addresses
that have higher reference counts also have more IRG gaps of small size, and large
IRG gaps are more likely to be associated with addresses that have low reference counts.
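This correlation can be checked mechanically from a trace. The sketch below gathers, per cache set, each address tag's reference count and its IRG values (gaps counted in references to the same set); the modulo set-index mapping and the parameter defaults are illustrative, not the thesis's exact cache configuration.

```python
from collections import defaultdict

def per_set_irg(trace, num_sets=64, line_size=64):
    """Per cache set, map each tag to [reference_count, irg_list, last_pos].
    An IRG is the number of intervening references to the same set since
    the tag's previous reference."""
    stats = defaultdict(dict)  # set index -> {tag: [count, irgs, last_pos]}
    pos = defaultdict(int)     # references seen so far, per set
    for addr in trace:
        line = addr // line_size
        s, tag = line % num_sets, line // num_sets
        entry = stats[s].setdefault(tag, [0, [], None])
        entry[0] += 1
        if entry[2] is not None:
            entry[1].append(pos[s] - entry[2])  # gap since previous reference
        entry[2] = pos[s]
        pos[s] += 1
    return stats
```

Plotting reference counts against the IRG lists this produces would reproduce the kind of correlation described above: high-count tags dominated by small gaps, low-count tags by large ones.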
The correlation of reference counts and IRG values suggests the basis of WLRU replacement.
Good cache value addresses show themselves quickly. Thus, if an address is not
referenced again within a short time after it is brought into the cache, it may never be re-referenced.
If an address is hit quickly after being brought into the cache, it is likely to be
hit repeatedly. WLRU distinguishes an address by its number of hits in a short time. Addresses
that are hit in a short time after being brought into the cache are kept in the cache,
and addresses that are not hit in a short time after being brought into the cache are evicted
fast.
Trace based simulations show that for SPEC benchmarks, WLRU has hit rates as good as
LRU. For web server traces, WLRU shows significant improvements over LRU in hit rates.
In the best scenario, WLRU has more than 50% fewer cache misses than LRU. The huge
speed up of WLRU on web servers does not harm traditional applications such as SPEC
benchmarks. Simulations on synthesized multi-threading memory traces show that WLRU
consistently has significant improvements over LRU for all SPEC benchmarks in multi-threading
environments.
WLRU is better than LRU for web server traces, but WLRU still has around 30% more
cache misses than the off-line optimal cache replacement. There is still room for better CPU
cache replacements and designs.
9.2 Future Research
The number of simulation experiments in Chapter 8 is not even near full-factorial. The best
WLRU weight formula for each application domain is related to the CPU cache configuration,
such as the cache associativity and cache size, especially the cache size. In future
research, we plan to do more experiments to reveal the relation between the WLRU weight
formula and CPU cache configurations. We will also compare the hit rates of WLRU with more
replacement algorithms, such as those mentioned in Chapter 5, for various workloads.
In future research, we also plan to investigate the locality characteristics of more application
domains, such as multimedia, graphics, and more network equipment and applications.
A hardware implementation of WLRU is also planned. More importantly, we will conduct
research to develop more cache-friendly software systems, such as operating systems and application
algorithms.
9.2.1 Hardware Prototype
Trace based simulation is unavoidably the first step to evaluate WLRU CPU cache replace-
ment. We are planning to implement WLRU CPU cache in FPGA. Current FPGA works
at frequency of 500 Mhz, and the cache miss penalty is around 50. Compared with cache
miss penalties of high end CPUs, which are several hundreds,the cache miss penalty of
FPGA system is small but still large enough to manifest the difference between the hit rates
of WLRU and LRU replacements. We plan to test the throughput and responsiveness of
web servers and other network systems on the FPGA system.
9.2.2 Locality Analysis for More Application Domains
The analysis of reference counts and per set IRG values reveals the difference in locality
between web servers and SPEC CPU2000 benchmarks. We plan to apply the analysis of locality
to more application domains and compare the hit rates of WLRU and LRU for these
domains. Application domains that will be further explored include database
servers, search engines, network file servers and DNS servers. The difficulty in studying
these domains is that trace based simulation is inappropriate. An FPGA prototype CPU will
be the primary means of investigation. We plan to run TPC-C (http://www.tpc.org/) benchmarks on the FPGA
prototype CPU. The hit rates of WLRU and LRU will be compared. We will test the hit
rates of different WLRU weight formulas and CPU cache configurations.
Besides network servers, network equipment, such as routers, and wireless and handset equipment
are interesting topics for locality analysis. Network equipment used to be built on
DSPs and ASICs. Increasing management requirements and security challenges are changing
the design philosophy of network equipment: there is more and more software in network
equipment. For wireless equipment, higher security demands are driving the use
of more complex and more computationally intensive encryption and management software.
On handset equipment, video and game applications are gaining popularity. Not
surprisingly, CPU caches will become the performance bottleneck. We plan to investigate
the CPU cache performance of this equipment, apply locality analysis to it,
and examine the possibility of using WLRU caches in it to improve the CPU
cache hit rates.
9.2.3 OS and Algorithm Design Issues
The importance of the CPU cache to the performance of applications is recognized in the design
of algorithms. Cache-oblivious algorithms [FLPR99] are one example of an effort to prevent
the bad impact of CPU caches. There is a good deal of other research with a goal of designing
better cached programs (e.g. [CHL99]). However, the cache models this research is based on
involve only LRU replacement. The findings in this work will help in understanding the CPU
cache requirements of operations and provide more accurate cache models and guidelines
for designing new algorithms.
The findings in this work and the WLRU replacement change one of the most basic constraints
of computer architecture. This will cause re-evaluations of CPU design trade-offs and approaches.
Because of WLRU's built-in mechanism against cache pollution [GC90],
more memory bandwidth is possible by using larger cache lines and more aggressive prefetching.
The advantage of WLRU replacement for multi-threading is also an interesting
research topic.
WLRU cache replacement will also have an impact on the design of operating systems. In
chapter 4, it is seen that web memory trace a200kr50 has by far the best temporal locality
and the best CPU cache hit rates. Web server trace a200kr50 shows that minor changes in
the scheduling of OS operations cause huge differences in the temporal locality and thus
the CPU cache performance of server applications. We plan to investigate the details of the
great difference in the temporal localities of OS scheduling policies. We will also investigate
the CPU cache cost of OS operations under the WLRU replacement. Under WLRU replacement,
the scheduling policy of the operating system may differ from that under
LRU replacement. CPU cache gains will be one of the most important considerations of
OS scheduling and management strategies. Web server memory trace a200kr50 shows there
is great potential in CPU cache aware OS designs.
The discovery of the property of short lifetime improves our understanding of the locality
of programs. WLRU exploits the property of short lifetime. Compared with other approaches
to improving CPU cache performance, such as re-writing the software for better
cache hit rates, a WLRU CPU cache is simpler and of much lower cost. Since the hit rates
of WLRU are close to those of the OPT replacement algorithm, changing the software seems unavoidable
for further improvement. Since the WLRU cache replacement algorithm has more parameters to fine-tune than LRU
has, WLRU can better support the re-writing of software for higher hit rates.
Appendix A
Analysis and Simulation Results
The attached CD contains the analysis and simulation results not described in the main body
of the dissertation. The materials on the CD are organized into three directories:
locality-analysis, hit-rates and victim-analysis. In each directory, there is a
README file describing the structure of the directory.
The directory locality-analysis contains the results of the statistical studies of the SPEC CPU2000
benchmarks and web server memory traces. The distributions of reference counts and per
set IRG values are provided in this directory. The distributions include both the L1 cache
and the L2 cache.
The directory hit-rates contains the simulation results of the SPEC CPU2000 benchmarks and web
server memory traces. Multi-threading simulation results are also provided. This directory
has three sub-directories for the SPEC CPU2000 hit rates, web server results and
multi-threading simulations. The hit rates of web servers are grouped by CPU cache configuration.
The last first-level directory is victim-analysis. The results of the victim analysis of the WLRU,
LRU and OPT replacement algorithms are provided in this directory. The subdirectories are
arranged by CPU cache configuration; each subdirectory represents one configuration of
CPU caches. The victim analysis results of the WLRU, LRU and OPT replacement algorithms
are stored under the same subdirectory.
References
[ADU71] Alfred V. Aho, Peter J. Denning, and Jeffrey D. Ullman. Principles of optimal page replacement. J. ACM, 18(1):80–93, 1971.

[ASW+93] Santosh G. Abraham, Rabin A. Sugumar, Daniel Windheiser, B. R. Rau, and Rajiv Gupta. Predictability of load/store instruction latencies. In Proceedings of the 26th Annual International Symposium on Microarchitecture, pages 139–152. IEEE Computer Society Press, 1993.

[AZMM04] Hussein Al-Zoubi, Aleksandar Milenkovic, and Milena Milenkovic. Performance evaluation of cache replacement policies for the SPEC CPU2000 benchmark suite. In Proceedings of the 42nd Annual Southeast Regional Conference, pages 267–272. ACM Press, 2004.

[BAYT01] Abdel-Hameed A. Badawy, Aneesh Aggarwal, Donald Yeung, and Chau-Wen Tseng. Evaluating the impact of memory system performance on software prefetching and locality optimizations. In Proceedings of the 15th International Conference on Supercomputing, pages 486–500. ACM Press, 2001.

[Bla96] Trevor Blackwell. Speeding up protocols for small messages. In Conference Proceedings on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 85–95. ACM Press, 1996.

[CDL99] Trishul M. Chilimbi, Bob Davidson, and James R. Larus. Cache-conscious structure definition. In Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, pages 13–24. ACM Press, 1999.

[CH01] Jason F. Cantin and Mark D. Hill. Cache performance for selected SPEC CPU2000 benchmarks. SIGARCH Comput. Archit. News, 29(4):13–18, 2001.

[CH02] Trishul M. Chilimbi and Martin Hirzel. Dynamic hot data stream prefetching for general-purpose programs. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, pages 199–209. ACM Press, 2002.
[CHL99] Trishul M. Chilimbi, Mark D. Hill, and James R. Larus. Cache-conscious structure layout. In Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, pages 1–12. ACM Press, 1999.

[CJRS89] David D. Clark, Van Jacobson, John Romkey, and Howard Salwen. An analysis of TCP processing overhead. IEEE Communications Magazine, 27(6), June 1989.

[CMT94] Steve Carr, Kathryn S. McKinley, and Chau-Wen Tseng. Compiler optimizations for improving data locality. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 252–262. ACM Press, 1994.

[Den68] Peter J. Denning. The working set model for program behavior. Commun. ACM, 11(5):323–333, 1968.

[dLJ03] P. J. de Langen and B. H. H. Juurlink. Reducing conflict misses in caches. In Proceedings of the 14th Annual Workshop on Circuits, Systems and Signal Processing, ProRisc 2003, pages 505–510, November 2003.

[EH84] Wolfgang Effelsberg and Theo Haerder. Principles of database buffer management. ACM Trans. Database Syst., 9(4):560–595, 1984.

[EK89] S. J. Eggers and R. H. Katz. The effect of sharing on the cache and bus performance of parallel programs. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 257–270. ACM Press, 1989.

[FH05] Michael J. Flynn and Patrick Hung. Microprocessor design issues: Thoughts on the road ahead. IEEE Micro, 25(3):16–31, 2005.

[FLPR99] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-oblivious algorithms. In FOCS, pages 285–298, 1999.

[FNAG92] J. Kelly Flanagan, Brent E. Nelson, James K. Archibald, and Knut Grimsrud. BACH: BYU Address Collection Hardware, the collection of complete traces. In Proc. of the 6th Int. Conf. on Modelling Techniques and Tools for Computer Performance Evaluation, pages 128–137, 1992.

[FTP94] M. Farrens, G. Tyson, and A. R. Pleszkun. A study of single-chip processor/cache organizations for large numbers of transistors. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 338–347. IEEE Computer Society Press, 1994.
[GC90] Rajiv Gupta and Chi-Hung Chi. Improving instruction cache behavior by reducing cache pollution. In Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, pages 82–91. IEEE Computer Society, 1990.

[Goo83] James R. Goodman. Using cache memory to reduce processor-memory traffic. In ISCA '83: Proceedings of the 10th Annual International Symposium on Computer Architecture, pages 124–131, Los Alamitos, CA, USA, 1983. IEEE Computer Society Press.

[Hen00] John L. Henning. SPEC CPU2000: Measuring CPU performance in the new millennium. Computer, 33(7):28–35, 2000.

[Hig90] Lee Higbie. Quick and easy cache performance analysis. SIGARCH Comput. Archit. News, 18(2):33–44, 1990.

[Hil87] Mark Donald Hill. Aspects of cache memory and instruction buffer performance. PhD thesis, University of California, Berkeley, 1987.

[Hil88] Mark D. Hill. A case for direct-mapped caches. Computer, 21(12):25–40, 1988.

[HKM02] Zhigang Hu, Stefanos Kaxiras, and Margaret Martonosi. Timekeeping in the memory system: predicting and optimizing memory behavior. In ISCA '02: Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 209–220, Washington, DC, USA, 2002. IEEE Computer Society.

[HP96] John L. Hennessy and David A. Patterson. Computer Architecture (2nd ed.): A Quantitative Approach. Morgan Kaufmann Publishers Inc., 1996.

[HP02] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002.

[HR00] Erik G. Hallnor and Steven K. Reinhardt. A fully associative software-managed cache design. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 107–116. ACM Press, 2000.

[HS89] M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Trans. Comput., 38(12):1612–1630, 1989.

[Int04] Intel. The Microarchitecture of the Pentium 4 Processor, May 2004.

[Jac03] B. Jacob. A case for studying DRAM issues at the system level. IEEE Micro, 23(4):44–56, 2003.
[JmWH97] Teresa L. Johnson and Wen-mei W. Hwu. Run-time adaptive cache hierarchy management via reference analysis. In Proceedings of the 24th annual international symposium on Computer architecture, pages 315–326. ACM Press, 1997.

[Jou98] Norman P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In 25 years of the international symposia on Computer architecture (selected papers), pages 388–397. ACM Press, 1998.

[JS94] Theodore Johnson and Dennis Shasha. 2Q: A low overhead high performance buffer management replacement algorithm. In VLDB '94: Proceedings of the 20th International Conference on Very Large Data Bases, pages 439–450. Morgan Kaufmann Publishers Inc., 1994.

[JZ02] Song Jiang and Xiaodong Zhang. LIRS: an efficient low inter-reference recency set replacement policy to improve buffer cache performance. In Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 31–42. ACM Press, 2002.

[KBK02] Changkyu Kim, Doug Burger, and Stephen W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In Proceedings of the 10th international conference on Architectural support for programming languages and operating systems (ASPLOS-X), pages 211–222. ACM Press, 2002.

[KS02] Gokul B. Kandiraju and Anand Sivasubramaniam. Going the distance for TLB prefetching: an application-driven study. In Proceedings of the 29th annual international symposium on Computer architecture, pages 195–206. IEEE Computer Society, 2002.

[KS05] Mazen Kharbutli and Yan Solihin. Counter-based cache replacement algorithms. In ICCD '05: Proceedings of the 2005 International Conference on Computer Design, pages 61–68, Washington, DC, USA, 2005. IEEE Computer Society.

[LCK+01] D. Lee, J. Choi, J. H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim. LRFU: A spectrum of policies that subsumes the least recently used and least frequently used policies. IEEE Trans. Comput., 50(12):1352–1361, 2001.

[LFF01] An-Chow Lai, Cem Fide, and Babak Falsafi. Dead-block prediction & dead-block correlating prefetchers. In Proceedings of the 28th annual international symposium on Computer architecture, pages 144–154, 2001.
[LKW02] Jung-Hoon Lee, Shin-Dug Kim, and Charles Weems. Application-adaptive intelligent cache memory system. Trans. on Embedded Computing Sys., 1(1):56–78, 2002.

[LP98] Butler W. Lampson and Kenneth A. Pier. A processor for a high-performance personal computer. In 25 years of the international symposia on Computer architecture (selected papers), pages 180–194. ACM Press, 1998.

[MCE+02] Peter S. Magnusson, Magnus Christensson, Jesper Eskilson, Daniel Forsgren, Gustav Hållberg, Johan Högberg, Fredrik Larsson, Andreas Moestedt, and Bengt Werner. Simics: A full system simulation platform. Computer, 35(2):50–58, 2002.

[MCFT99] Nicholas Mitchell, Larry Carter, Jeanne Ferrante, and Dean Tullsen. ILP versus TLP on SMT. In Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM), page 37. ACM Press, 1999.

[MGS+70] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2):78–117, 1970.

[MPBO96] David Mosberger, Larry L. Peterson, Patrick G. Bridges, and Sean O'Malley. Analysis of techniques to improve protocol processing latency. In Conference proceedings on Applications, technologies, architectures, and protocols for computer communications, pages 73–84. ACM Press, 1996.

[NYKT97] Erich Nahum, David Yates, Jim Kurose, and Don Towsley. Cache behavior of network protocols. In Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 169–180. ACM Press, 1997.

[OOW93] Elizabeth J. O'Neil, Patrick E. O'Neil, and Gerhard Weikum. The LRU-K page replacement algorithm for database disk buffering. In Proceedings of the 1993 ACM SIGMOD international conference on Management of data, pages 297–306. ACM Press, 1993.

[PCD+01] P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandercappelle, and P. G. Kjeldsberg. Data and memory optimization techniques for embedded systems. ACM Trans. Des. Autom. Electron. Syst., 6(2):149–206, 2001.

[PG95] Vidyadhar Phalke and Bhaskarpillai Gopinath. An inter-reference gap model for temporal locality in program behavior. In Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems, pages 291–300. ACM Press, 1995.
[PH90a] Karl Pettis and Robert C. Hansen. Profile guided code positioning. In Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation, pages 16–27. ACM Press, 1990.

[PH90b] Dionisios N. Pnevmatikatos and Mark D. Hill. Cache performance of the integer SPEC benchmarks on a RISC. SIGARCH Comput. Archit. News, 18(2):53–68, 1990.

[PH05] David A. Patterson and John L. Hennessy. Computer organization & design: the hardware/software interface. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.

[PHH88] S. Przybylski, M. Horowitz, and J. Hennessy. Performance tradeoffs in cache design. In Proceedings of the 15th Annual International Symposium on Computer architecture, pages 290–298. IEEE Computer Society Press, 1988.

[PHS98] Jih-Kwon Peir, Windsor W. Hsu, and Alan J. Smith. Implementation issues in modern cache memory. Technical report, Berkeley, CA, USA, 1998.

[PK94] S. Palacharla and R. E. Kessler. Evaluating stream buffers as a secondary cache replacement. In Proceedings of the 21st annual international symposium on Computer architecture, pages 24–33. IEEE Computer Society Press, 1994.

[Prz90] Steven A. Przybylski. Cache and memory hierarchy design: a performance-directed approach. Morgan Kaufmann Publishers Inc., 1990.

[Quo94] R. W. Quong. Expected i-cache miss rates via the gap model. In Proceedings of the 21st annual international symposium on Computer architecture, pages 372–383. IEEE Computer Society Press, 1994.

[RBC02] Shai Rubin, Rastislav Bodík, and Trishul Chilimbi. An efficient profile-analysis framework for data-layout optimizations. In Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 140–153. ACM Press, 2002.

[RD90] John T. Robinson and Murthy V. Devarakonda. Data cache management using frequency-based replacement. In Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems, pages 134–142. ACM Press, 1990.

[RSG93] Edward Rothberg, Jaswinder Pal Singh, and Anoop Gupta. Working sets, cache sizes, and node granularity issues for large-scale multiprocessors. In Proceedings of the 20th annual international symposium on Computer architecture, pages 14–26. ACM Press, 1993.
[RTT+98] Jude A. Rivers, Edward S. Tam, Gary S. Tyson, Edward S. Davidson, and Matt Farrens. Utilizing reuse information in data cache management. In Proceedings of the 12th international conference on Supercomputing, pages 449–456. ACM Press, 1998.

[SA93] Rabin A. Sugumar and Santosh G. Abraham. Efficient simulation of caches under optimal replacement with applications to miss characterization. In Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems, pages 24–35. ACM Press, 1993.

[SKT96] James D. Salehi, James F. Kurose, and Don Towsley. The effectiveness of affinity-based scheduling in multiprocessor network protocol processing (extended version). IEEE/ACM Transactions on Networking, 4(4):516–530, August 1996.

[SKW99] Yannis Smaragdakis, Scott Kaplan, and Paul Wilson. EELRU: simple and effective adaptive page replacement. In SIGMETRICS '99: Proceedings of the 1999 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 122–133. ACM Press, 1999.

[Smi82] Alan Jay Smith. Cache memories. ACM Comput. Surv., 14(3):473–530, 1982.

[Smi87] A. J. Smith. Line (block) size choice for CPU cache memories. IEEE Trans. Comput., 36(9):1063–1076, 1987.

[SS01] Nathan T. Slingerland and Alan Jay Smith. Cache performance for multimedia applications. In Proceedings of the 15th international conference on Supercomputing, pages 204–217. ACM Press, 2001.

[ST85] Daniel D. Sleator and Robert E. Tarjan. Amortized efficiency of list update and paging rules. Commun. ACM, 28(2):202–208, 1985.

[Sug93] Rabin A. Sugumar. Multi-Configuration Simulation Algorithms for the Evaluation of Computer Architecture Designs. PhD thesis, University of Michigan, 1993. Technical Report CSE-TR-173-93, with Santosh G. Abraham.

[Tan87] Andrew S. Tanenbaum. Operating systems: design and implementation. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1987.

[TFMP95] Gary Tyson, Matthew Farrens, John Matthews, and Andrew R. Pleszkun. A modified approach to data cache management. In Proceedings of the 28th annual international symposium on Microarchitecture, pages 93–103. IEEE Computer Society Press, 1995.

[Tho03] Mark Thorson. Internet nuggets. SIGARCH Comput. Archit. News, 31(4):26–32, 2003.
[Tor94] J. Torrellas. False sharing and spatial locality in multiprocessor caches. IEEE Trans. Comput., 43(6):651–663, June 1994.

[UM97] Richard A. Uhlig and Trevor N. Mudge. Trace-driven memory simulation: a survey. ACM Comput. Surv., 29(2):128–170, 1997.

[VL00] Steven P. Vanderwiel and David J. Lilja. Data prefetch mechanisms. ACM Comput. Surv., 32(2):174–199, 2000.

[VTG+99] Alexander V. Veidenbaum, Weiyu Tang, Rajesh Gupta, Alexandru Nicolau, and Xiaomei Ji. Adapting cache line size to application behavior. In Proceedings of the 13th international conference on Supercomputing, pages 145–154. ACM Press, 1999.

[WM95] Wm. A. Wulf and Sally A. McKee. Hitting the memory wall: implications of the obvious. SIGARCH Comput. Archit. News, 23(1):20–24, 1995.

[ZCL04] Yuanyuan Zhou, Zhifeng Chen, and Kai Li. Second-level buffer cache management. IEEE Trans. Parallel Distrib. Syst., 15(6):505–519, 2004.