WLRU CPU Cache Replacement Algorithm
(Thesis Format: Monograph)
by
Qufei Wang
Graduate Program in Computer Science
Submitted in partial fulfilment of the requirements for the degree of
Doctor of Philosophy
Faculty of Graduate Studies
The University of Western Ontario
London, Ontario
December, 2006

© Qufei Wang 2006
THE UNIVERSITY OF WESTERN ONTARIO
FACULTY OF GRADUATE STUDIES
CERTIFICATE OF EXAMINATION
Advisors Examining Board
Dr. Hanan Lutfiyya Dr. Marin Litou
Dr. Abdallah Shami
Dr. Mark Daley
Dr. Mike Katchabaw
The thesis by Qufei Wang
entitled
WLRU CPU CACHE REPLACEMENT ALGORITHM
is accepted in partial fulfilment of the requirements for the degree of
Doctor of Philosophy
Date Chair of Examining Board
Abstract
A CPU consists of two parts: the CPU cores and the CPU caches. CPU caches are small but
fast memories, usually on the same die as the CPU cores, that store recently used instructions
and data. Accessing a CPU cache takes a quarter of a nanosecond to five nanoseconds,
but accessing the main memory takes 100 to 150 nanoseconds. The main memory is so
slow that the CPU is idle for more than 80% of the time waiting for memory accesses. This
problem is known as the memory wall. The memory wall implies that faster or more CPU
cores are of little use if the performance of CPU caches does not improve.

Generally, larger CPU caches have higher performance, but the improvement is very small.
A smarter CPU cache replacement algorithm offers more potential. The CPU cache replacement
algorithm decides which cache contents to replace. Currently, Least Recently Used
(LRU) replacement and its variants are the most widely used in CPUs. However, the performance
of LRU is not satisfactory for applications with poor locality, such as network protocols
and applications. We found that there is a pattern in the memory references of these applications
that makes LRU fail. Based on this discovery, we developed a new CPU cache
replacement algorithm called Weighted Least Recently Used (WLRU). Trace-based simulations show
that WLRU is a significant improvement over LRU for applications with poor locality. For
example, for web servers, WLRU has 50% fewer L2 cache misses than LRU. This means
WLRU can immediately improve the performance of web servers by more than 200%.
CPU caches have been intensively studied in the past thirty years, and WLRU offers by far the
biggest improvement. Our studies also indicate that WLRU is very close to the theoretical
upper limit of cache replacement algorithms. This means any further improvement in CPU
cache performance will have to come from changes to the software. In future work, we will
investigate how to write operating systems and software to achieve better CPU cache performance.
Acknowledgements
I would like to gratefully acknowledge the supervision of Professor Hanan Lutfiyya during
this work. Many thanks to her for her patience, tolerance and support.

I am grateful to all my friends in the Department of Computer Science, University of Western Ontario.
From the staff, Janice Wiersma and Cheryl McGrath are especially thanked for their
care and attention.

Finally, I am forever indebted to my wife Min and my parents. The support from Min is the
source of strength that helped me through the many years.
Table of Contents

CERTIFICATE OF EXAMINATION
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES

1 Introduction
  1.1 Background and Motivation
  1.2 Contributions
  1.3 Outline of Dissertation

2 Background and Related Research
  2.1 Background on CPU Caches
    2.1.1 Memory Hierarchy
    2.1.2 Cache Lines and Cache Hits
    2.1.3 Set Associative Caches
    2.1.4 Multiple Level CPU Caches
  2.2 Efforts to Improve Cache Hit Rates
    2.2.1 Cache Line Size, Prefetching and Stream Buffer
    2.2.2 Cache Sizes and Hit Rates
    2.2.3 Cache Associativity and Victim Cache
    2.2.4 Split Instruction and Data Cache
  2.3 Cache Replacement Algorithms Other Than LRU
    2.3.1 Pseudo-LRU Replacements
    2.3.2 First-In-First-Out and Random Replacements
    2.3.3 LRU-k and LIRS Replacement Algorithms
    2.3.4 LFU and FBR Replacement Algorithms
    2.3.5 LRFU, Multi-Queue and EELRU Replacement Algorithms
    2.3.6 Dead Alive Prediction Replacement Algorithms
    2.3.7 Off-Line Optimal Replacement Algorithm
    2.3.8 Summary of Replacements
  2.4 CPU Cache Issues of Network Protocols and Applications
  2.5 Summary

3 Principle of Locality and Property of Short Lifetime
  3.1 Memory Reference Traces
  3.2 Principle of Temporal Locality and LRU
  3.3 Inter-Reference Gaps and Temporal Locality
    3.3.1 Inter-Reference Gaps and LRU
    3.3.2 Complete Program Stream and Per Set IRG Values
    3.3.3 Distributions of Per Set IRG Values and Temporal Locality
  3.4 Reference Counts and Property of Short Lifetime
    3.4.1 Property of Short Lifetime
    3.4.2 Reference Counts of Cache Lines
  3.5 Relationship between Average Reference Counts and LRU Hit Rates
  3.6 L2 IRG and Reference Count Distributions
    3.6.1 L2 Reference Count Distributions
    3.6.2 L2 IRG Distributions
  3.7 Summary

4 Locality Characteristics of Network Protocols and Applications
  4.1 Motivation
  4.2 Memory Traces of Web Servers
  4.3 Average Reference Counts of Web Server Memory Traces
  4.4 Reference Count Distributions of Web Server Memory Traces
  4.5 L2 Distributions of Reference Counts of Web Server Memory Traces
  4.6 L2 IRG Distributions of Web Server Memory Traces
  4.7 Summary

5 WLRU Cache Replacement
  5.1 Correlation of IRG and Reference Counts
  5.2 Problems with LRU and LFU
  5.3 WLRU Cache Replacement
  5.4 Notation Used to Represent WLRU Parameter Settings
  5.5 WLRU Mimicking LRU
  5.6 Comparison of WLRU with Other Cache Replacement Algorithms
  5.7 Summary

6 Hardware Implementations of WLRU
  6.1 Space Requirements of WLRU
  6.2 Overall Structure of WLRU CPU Cache
  6.3 Hit/Miss Logic
  6.4 Weight Control Logic
  6.5 Replacement and Line-Fill/Cast-Out Logic
  6.6 Comparison of WLRU and LRU
  6.7 Summary

7 WLRU CPU Cache Simulator
  7.1 Memory Trace Based CPU Cache Simulations
  7.2 Architecture of CPU Cache Simulator
    7.2.1 SimuEngine Object and Trace Synthesizing
    7.2.2 CacheDevice Interface
  7.3 Cache Sets and Replacement Objects
  7.4 WLRU Replacements
  7.5 Optimal Replacement Algorithm
  7.6 Victim Analysis
  7.7 Validation of Simulator
  7.8 Summary

8 Simulation Results
  8.1 Experimental Design
  8.2 WLRU on Web Server Memory Traces
  8.3 WLRU on SPEC CPU2000 Benchmarks
  8.4 WLRU Performance on Multi-threaded Workloads
  8.5 Comparison of LRU and WLRU Using Victim Analysis
  8.6 Summary

9 Conclusions and Future Research
  9.1 Conclusions
  9.2 Future Research
    9.2.1 Hardware Prototype
    9.2.2 Locality Analysis for More Application Domains
    9.2.3 OS and Algorithm Design Issues

A Analysis and Simulation Results

References

VITA
List of Figures

2.1 The structure of a four-level memory hierarchy.
2.2 The structure of a CPU cache line.
2.3 The mapping of main memory words into a direct-mapped cache and a two-way associative cache.
2.4 The structure of an eight-way set associative cache.
2.5 Storage arrangements of an eight-way associative cache set using real LRU and PLRU replacements.
2.6 Storage arrangements of an eight-way associative cache set using PLRU-tree replacement.
2.7 Storage arrangements of an eight-way associative cache set using PLRU-msb replacement.
2.8 An example of the Optimal replacement decision.
3.1 Two per set IRG values and their corresponding whole stream IRG values.
3.2 IRG strings of three addresses in the CC1 trace [PG95]. IRG index is the index number of the first reference of an IRG gap.
3.3 The distributions of per set IRG values of eight SPEC benchmarks.
3.4 The distributions of per address reference counts of eight SPEC benchmarks.
3.5 The distributions of reference counts of cache lines of eight SPEC benchmarks.
3.6 The average reference counts of SPEC integer benchmarks and their miss rates under LRU.
3.7 The average reference counts of SPEC floating point benchmarks and their miss rates under LRU.
3.8 The distributions of L2 reference counts of eight SPEC benchmarks.
3.9 The distributions of L2 IRG values of eight SPEC benchmarks.
3.10 The distributions of L2 IRG values of SPEC benchmarks on log2 scale.
4.1 The distributions of per address reference counts of four web server memory traces.
4.2 The distributions of reference counts of cache lines of four web server memory traces.
4.3 The distributions of reference counts of cache lines of four web server memory traces at the L2 cache.
4.4 The distributions of IRG values of four web server memory traces at the L2 cache.
5.1 Comparison of the replacement decision of WLRU and LRU.
6.1 Storage arrangement of an eight-way associative cache set using WLRU replacement.
6.2 The structure of a CPU cache using WLRU replacement.
6.3 The RAM memory arrays used in an associative set of the WLRU CPU cache.
6.4 The data path, address path and control signals of the WLRU CPU cache.
6.5 The hit/miss logic of the WLRU CPU cache.
6.6 The weight control logic of the WLRU CPU cache.
6.7 The weight arithmetic circuit of the weight control logic.
6.8 The line-fill/cast-out logic of the WLRU CPU cache.
6.9 The replacement logic of the WLRU CPU cache.
7.1 The architecture of the CPU cache simulator.
7.2 An example trace synthesizing scenario which includes context switching effects.
7.3 The UML graph of the CacheDevice interface.
7.4 A flow chart of the cyclePing() method of class SetCache.
7.5 The UML graph of the CacheSet class, which is the base class of all replacements.
7.6 The flow chart of the referenced() and replace() methods of the WLRU class.
8.1 Comparison of miss rates of WLRU, LRU and Optimal on Apache traces.
8.2 Comparison of miss rates of WLRU, LRU and Optimal on Apache traces with mixed web page sizes.
8.3 Comparison of miss rates of WLRU, LRU and Optimal on thttpd traces.
8.4 Comparison of miss rates of WLRU, LRU and Optimal on thttpd traces with mixed web page sizes.
8.5 Comparison of miss rates of OPT, LRU and WLRU on SPEC integer benchmarks.
8.6 Comparison of miss rates of OPT, LRU and WLRU on SPEC floating point benchmarks.
8.7 Comparison of miss rates of OPT, LRU and WLRU on SPEC integer benchmarks, where WLRU uses i64r256b32.
8.8 Comparison of miss rates of OPT, LRU and WLRU on SPEC floating point benchmarks, where WLRU uses i64r256b32.
8.9 The miss rates of LRU, WLRU and OPT replacements on multi-threading SPEC INT benchmarks.
8.10 The miss rates of LRU, WLRU and OPT replacements on multi-threading SPEC FLT benchmarks.
8.11 The distributions of idle time of WLRU, LRU and OPT replacements of network trace t20kr50.
8.12 The distributions of stay time of WLRU, LRU and OPT replacements of network trace t20kr50.
8.13 The distributions of idle time of WLRU, LRU and OPT replacements of network trace t20kr90.
8.14 The distributions of stay time of WLRU, LRU and OPT replacements of network trace t20kr90.
8.15 The distributions of idle time of WLRU, LRU and OPT replacements of SPEC benchmark crafty.
8.16 The distributions of stay time of WLRU, LRU and OPT replacements of SPEC benchmark crafty.
List of Tables

2.1 Typical miss rates of LRU and Random with different cache sizes and associativities [HP96].
4.1 Names of network traces and their configurations.
4.2 Average reference counts of network traces.
4.3 Percentages of IRG values < 16 and percentages of IRG values ≥ 256 of SPEC benchmarks.
4.4 Percentages of IRG values < 16 and percentages of IRG values ≥ 256 of network traces.
5.1 The IRG values of address tags mapping to set 0 of SPEC benchmark crafty.
5.2 The IRG values of address tags mapping to set 0 of network trace a20kr50.
5.3 Comparison of total cache misses of LRU and weight formulas mimicking LRU.
8.1 The IRG values of address tags with reference count of two in set 0 of SPEC benchmark swim.
8.2 The distribution of victim hit counts of WLRU and LRU replacements on network trace t20kr50.
8.3 The distribution of victim hit counts of WLRU and LRU replacements on SPEC benchmark crafty.
Chapter 1
Introduction
1.1 Background and Motivation
The speed of CPUs is much faster than the speed of the main memory. CPU caches are used
to bridge the speed gap. A CPU cache is a small memory which is usually on the same die as
the CPU [dLJ03]. A CPU cache is much faster than the main memory but much smaller in
size. Instructions and data recently accessed from the main memory are stored in the CPU
cache. When the CPU requests an address, the CPU cache is checked. If the address is found
in the cache, it is called a cache hit; otherwise it is called a cache miss. The proportion of
addresses found in the cache is called the cache hit rate. The difference in access time between
the main memory and the CPU cache is defined as the cache miss penalty. This work assumes that the
cache miss penalty is measured as the number of CPU cycles needed to retrieve the information
from the main memory. For example, if accessing the CPU cache requires only one
CPU cycle but accessing the main memory requires 100 CPU cycles, the cache miss penalty
is 100. Currently, the cache miss penalties of most CPUs are already much more than 100
[Jac03, Tho03, FH05]. Since most CPU caches are smaller than the program image in the
main memory, when the CPU cache is full an existing cache entry must be chosen to be replaced.
A cache replacement algorithm decides the cache entry to be replaced. The most
commonly used CPU cache replacement algorithm is Least Recently Used (LRU) replacement
[PH05]. LRU replacement evicts the cache entry which is least recently accessed. The
use of LRU is based on the assumption that programs exhibit the property of temporal locality,
which is phrased as 'recently accessed items are likely to be accessed in the near future'
[HP96].
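The LRU policy on a single cache set can be sketched in a few lines. This is a toy model to illustrate the eviction rule only, not the hardware mechanism used in real CPUs:

```python
from collections import OrderedDict

class LRUSet:
    """A toy model of one set of an LRU-managed cache."""
    def __init__(self, ways):
        self.ways = ways            # associativity of the set
        self.lines = OrderedDict()  # tag -> data, least recently used first

    def access(self, tag):
        """Return True on a cache hit, False on a miss."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # hit: becomes most recently used
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)   # miss: evict least recently used
        self.lines[tag] = None               # line fill
        return False

s = LRUSet(ways=2)
hits = [s.access(t) for t in ["A", "B", "A", "C", "B"]]
# "A" miss, "B" miss, "A" hit, "C" miss (evicting "B"), "B" miss again
```

The last access illustrates LRU's weakness: "B" was evicted just before it was needed again, because recency alone decided the victim.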
In the past twenty years, the speed of CPUs doubled every 18 months, but the memory speed
increased only 7% each year [HP02]. The speed gap between the CPU and the main memory
keeps widening [1], but the CPU cache hit rate is seldom higher than 99% [HP02]. Assuming
a cache hit rate of 99% and a cache miss penalty of 100, the CPU is idle for 50% of the
time. Currently, mainstream CPU speeds are between 2 and 4 GHz, and the main memory is
clocked between 500 MHz and 800 MHz. Besides the data transfer time, the main memory,
made of DRAM (Dynamic Random Access Memory), also has a large latency. The latency
of DRAM is the delay between the receipt of a read request and the readiness of data
for transfer. The latency of current DDR DRAM memory is at least 90 nanoseconds,
and the total transfer time of a cache line is around 120 ns [2]. Assuming a CPU speed of 1
GHz, the cache miss penalty is 120 CPU cycles. Faster CPU speeds give even larger cache
miss penalties. Faster DRAM technologies help little, since the latency of these faster memories
remains constant, if not longer. The Semiconductor Industry Association (SIA) is
now calculating cache miss penalties of more than 300 CPU cycles [FH05]. If the cache
hit rate cannot be improved, then as the speed gap reaches a certain point, further increasing
CPU speeds will not generate any gain in effective computing power. This is known as the
Memory Wall problem [WM95].
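The idle-time figure above follows from simple average-access-time arithmetic. A back-of-the-envelope check, assuming every cache hit costs one CPU cycle:

```python
def avg_access_cycles(hit_rate, hit_cycles, miss_penalty):
    """Average memory access time in CPU cycles."""
    return hit_rate * hit_cycles + (1 - hit_rate) * miss_penalty

# A 99% hit rate with a 100-cycle miss penalty gives an average
# access of 1.99 cycles, so roughly half of all memory-access time
# is spent stalled waiting for the main memory.
amat = avg_access_cycles(0.99, 1, 100)
stall_fraction = (amat - 1) / amat  # fraction of access time spent waiting
```

With the 300-cycle penalties now projected, the same formula gives an average of nearly 4 cycles per access, i.e. the CPU would stall about 75% of the time at the same 99% hit rate.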
The CPU cache is a dominant factor in computing power. Generally, a larger CPU cache
has higher hit rates. However, there is a limit on the die area available for CPU caches. Recent
processors have already spent 50% of the die area and more than 80% of the transistors on CPU
caches [PHS98]. Larger CPU caches are unlikely unless revolutionary circuit technologies
are used. This suggests that approaches to improving CPU cache performance other than
increasing the size of CPU caches should be examined.

One approach to improving CPU cache performance is to find better cache replacement algorithms.
LRU is currently the most widely used CPU cache replacement algorithm. LRU was developed
decades ago, and current computing environments are very different from that time.
1.2 Contributions
The contributions of this work include the following:
[1] Although CPU speed has stagnated in recent years, faster CPUs are always coming. For example, IBM Power6 is targeted at around 5 GHz. (http://realworldtech.com/page.cfm?ArticleID=RWT101606194731)
[2] Source: www.powerlogix.com/downloads/SDRDDR.pdf
Property of Short Lifetime. This work presents an analysis of the pattern of memory references
of programs. Of special interest is the study of inter-reference gaps (IRGs) and reference
counts of addresses. The reference count of an address is the number of times that
the address is referenced. An inter-reference gap (IRG) is defined as the number of references
between two consecutive references of an address. Per set IRG values are IRG values
of an individual cache set. Our studies find that the majority of per set IRG values are small.
This is especially true at the first-level (L1) cache, where it is found that 90% of the per set
IRG values are of size one. At the level-two (L2) cache, per set IRG values are still small.
This provides strong evidence of temporal locality. However, our studies also show that a
large portion of addresses have low reference counts. At the L2 cache, nearly 50% of all addresses
are referenced only once, and nearly 90% of all addresses are referenced fewer than ten
times. This pattern is named the property of short lifetime. It suggests that LRU is less
effective for programs that have a large portion of their addresses with low reference counts,
since LRU does not distinguish between addresses with low reference counts and addresses
with high reference counts; this turns out to be the case for many networked applications.
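Both quantities can be computed directly from a memory reference trace. A minimal sketch, measuring each gap as the difference in trace positions between two consecutive references of the same address:

```python
from collections import defaultdict

def trace_stats(trace):
    """Compute reference counts and inter-reference gaps (IRGs)
    for each address in a memory reference trace."""
    counts = defaultdict(int)   # address -> reference count
    irgs = defaultdict(list)    # address -> list of IRG values
    last_seen = {}              # address -> trace position of last reference
    for i, addr in enumerate(trace):
        counts[addr] += 1
        if addr in last_seen:
            # gap between this reference and the previous one
            irgs[addr].append(i - last_seen[addr])
        last_seen[addr] = i
    return counts, irgs

counts, irgs = trace_stats(["a", "b", "a", "c", "a", "b"])
# "a" is referenced 3 times with gaps [2, 2]; "c" only once (short lifetime)
```

Run over a real trace, the distribution of `counts` values exposes the property of short lifetime, and the distribution of `irgs` values exposes temporal locality.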
Development of a New Cache Replacement Algorithm. Based on the property of short
lifetime, a new cache replacement algorithm, which is a modification of LRU, was developed.
This new algorithm is referred to as Weighted Least Recently Used (WLRU). Simulations
show that WLRU has significantly fewer cache misses than LRU for network protocols
and applications. For other programs, such as the SPEC benchmark programs, the difference in
the hit rates of WLRU and LRU is unnoticeable. This means the superiority of WLRU over
LRU for network protocols and applications does not harm the performance of traditional
applications like the SPEC benchmarks. WLRU can replace LRU in general purpose CPUs.
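The full WLRU algorithm and its parameter settings are developed later in this work. Purely as an illustration of the general idea of biasing LRU against short-lived lines, one can rank resident lines by a weight that grows on hits and decays with time; the initial weight, hit reward and decay values below are made-up, not the thesis's actual WLRU formulas:

```python
class WeightedSet:
    """Illustrative weighted replacement for one cache set: lines earn
    weight on hits, lose weight as time passes, and the lightest line
    is evicted.  NOTE: the parameter values are hypothetical, chosen
    only to demonstrate the effect, not WLRU's real settings."""
    def __init__(self, ways, init_w=1, hit_w=8, decay=1):
        self.ways, self.init_w, self.hit_w, self.decay = ways, init_w, hit_w, decay
        self.weights = {}  # tag -> current weight

    def access(self, tag):
        # every reference ages all resident lines
        for t in self.weights:
            self.weights[t] = max(0, self.weights[t] - self.decay)
        if tag in self.weights:
            self.weights[tag] += self.hit_w      # reward a hit
            return True
        if len(self.weights) >= self.ways:
            victim = min(self.weights, key=self.weights.get)
            del self.weights[victim]             # evict the lightest line
        self.weights[tag] = self.init_w          # a new line starts light
        return False
```

Under this scheme a once-referenced line never accumulates weight and is evicted ahead of a frequently hit line, which is exactly the behavior the property of short lifetime calls for.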
Example Circuit and Simulator. An example circuit of a CPU cache using WLRU replacement
is presented in this work. The circuit shows that the cost of implementing WLRU
is minimal: WLRU requires less than 3% more space than LRU. A trace-based simulator
was also developed. The simulator implements WLRU, LRU, pseudo-LRU replacements,
and the off-line optimal replacement. The simulator is written in Java and collects bookkeeping
information not found in other simulators. This information is used to investigate the
behavior of different cache replacements and designs.
1.3 Outline of Dissertation
The rest of this work is organized as follows.
Chapter 2 describes related research in cache replacement algorithms. Some background
introduction to CPU cache designs is included. Cache replacements in fields other than CPU
caches, such as database buffer caches, are introduced in chapter 2. Chapter 2 also discusses
previous studies on the impact of cache performance on network protocols and applications.
Chapter 3 discusses the empirical analysis methods used for the study of the memory accesses
of programs and the results of the analysis. The property of short lifetime is introduced
in chapter 3.
Chapter 4 discusses the locality characteristics of network protocols and applications.
Chapter 5 presents a new CPU cache replacement algorithm called the WLRU replacement
algorithm.
Chapter 6 presents an example hardware implementation of a WLRU cache in a CPU. The hardware
cost of implementing WLRU replacement is analyzed and compared with the cost of
implementing LRU replacement in CPU caches.
Chapter 7 describes the design of the CPU cache simulator. The CPU cache simulator in
this work is different from other CPU cache simulators in that its focus is on cache replacement
algorithms. Other unique features include victim analysis and a fast implementation
of the off-line optimal replacement algorithm.
Chapter 8 presents a simulation comparison of the hit rates of the WLRU and LRU replacement
algorithms on the SPEC benchmark programs and on network protocols and applications. Simulation
results for the off-line optimal replacement (OPT) are provided to better understand
the improvement of WLRU over LRU.
Chapter 9 presents conclusions and a plan for future research.
Chapter 2
Background and Related Research
CPU caches have been intensively studied for the last thirty years. This chapter briefly examines
the design issues of current CPU caches and research on the CPU cache performance
of network protocols and applications.
2.1 Background on CPU Caches
This section introduces the basics of CPU cache design.
2.1.1 Memory Hierarchy
Modern CPUs have a hierarchy of memories. A higher level of memory is faster than a
lower level, but the higher level memory is also smaller in size and more expensive.
The highest level or levels of memory are called the CPU cache. Currently, the CPU
cache is on the same die as the CPU execution unit. CPU caches are always made of SRAM
(Static Random Access Memory), while the main memory is made of DRAM (Dynamic Random
Access Memory). Visiting the main memory incurs a long latency, typically around 100 ns,
and fetching the data then costs another 2 ns per word [1]. The time to visit the main memory
is equal to several hundred CPU cycles. CPU caches can reduce this time to a single CPU
cycle, since CPU caches are made of SRAM and are usually on the same die as the CPU
execution units. Figure 2.1 shows a hierarchy of memories. The first and the second levels
of the hierarchy are CPU caches, and the third level is the main memory. The fourth level
of the hierarchy is the virtual memory on disk storage. CPU caches contain a subset of
the main memory.

[1] Source: www.powerlogix.com/downloads/SDRDDR.pdf
Figure 2.1: The structure of a four-level memory hierarchy (L1 and L2 caches in SRAM, main memory in DRAM, virtual memory on disk).
2.1.2 Cache Lines and Cache Hits
The unit of transfer of data between the CPU execution unit and the cache is a word. Data
transfer between the cache and the main memory is multiple memory words. This takes
advantage of the spatial locality principle in that if one memory location is read then nearby
memory locations are likely to be read [HP96]. Thus CPU caches are organized into cache
lines where each cache line consists of the words read in a single transfer of data between the
main memory and CPU cache. A cache line (depicted in Figure 2.2) consists of an address
tag, status bits and data from the main memory. Transfer of more than one word also has
advantages with respect to the memory bandwidth. The latency of visiting the main memory
is amortized among multiple words. Cache lines also save space since multiple words share
an address tag.
Part of the address of a main memory word is referred to as the address tag. When
the CPU references a main memory word, the address tag part of the address of the main
memory word is extracted. The tag of each cache line is compared with the address tag
of the memory word being referenced by the CPU. If there is a match between the tag of a
cache line and the address tag of the word, there is said to be a cache hit; otherwise it
is a cache miss. In the case of a cache hit, the referenced word is accessed directly from
the cache, avoiding the latency of retrieving the memory word from the main memory. In
the case of a cache miss, the referenced word is accessed from the main memory. There are
two status bits in a cache line. The valid status bit is used to indicate that a cache line is not
empty. The dirty status bit is set when the data in a cache line changes.
Figure 2.2: The structure of a CPU cache line (address tag, valid bit, dirty bit, and data).
2.1.3 Set Associative Caches
An important design aspect of CPU caches is determining where in the cache data retrieved
from the main memory can be placed. If a main memory word can only be placed in a
single cache location, the cache is called a direct-mapped cache. Figure 2.3(a) shows
the mapping of a main memory word into a direct-mapped cache. The direct-mapped cache
has m cache lines, and the lowest log2(m) bits of the address are used to map a main memory
word into the cache. When deciding cache hits or misses, a direct-mapped cache only needs
to compare a single address tag. Thus direct-mapped caches are fast. The problem with
a direct-mapped cache is that it incurs more cache misses, which can be illustrated with the
following example. Suppose a program generates a series of memory references such as the
following: 0x1CD, 0x3CD, 0x1CD, 0x3CD. Both memory words are mapped to
the same cache line, so this sequence of references causes a continuous stream of evictions
and replacements of cache lines. Thus, direct-mapped caches are fast but also have lower
hit rates. Studies [Prz90, HS89] found that reducing associativity from two-way to direct-mapped
increases the miss rate by 25%.
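The conflict in this example is easy to reproduce. Assuming, for illustration, a 512-line direct-mapped cache so that the low nine address bits select the line:

```python
def dm_index(addr, num_lines):
    """Direct-mapped placement: the lowest log2(num_lines) address bits."""
    return addr % num_lines

# 0x1CD and 0x3CD differ only above the index bits, so in a 512-line
# direct-mapped cache both select the same cache line.
resident = None   # the word currently held in that line
misses = 0
for addr in [0x1CD, 0x3CD, 0x1CD, 0x3CD]:
    if resident != addr:
        misses += 1        # conflict miss: the other word evicted us
        resident = addr
# every reference misses, even though only two words are in play
```

With two-way associativity the same two words would share a set and, after the first two compulsory misses, every later reference would hit.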
Another approach is to allow a unit of data transfer to be placed in any one of the cache
lines in the cache. This is called a fully-associative cache. Replacement of data in a cache
line only occurs when the entire cache has filled up. The replacement algorithm in a fully-associative
cache can replace any cache line in the cache with the incoming cache line.
Fully-associative caches are believed to have the highest hit rates [HR00] for a large number
of replacement algorithms. The address tag is compared in parallel with the tags of all
the cache lines in order to retrieve the data quickly. However, a CPU cache typically consists
of hundreds of thousands of cache lines, and the circuitry needed to do the parallel comparison
of all tags is expensive. Thus, except for some very small caches, no CPU caches are fully
associative [PH05].
A set associative cache combines concepts from the direct-mapped cache and the fully-associative
cache. Cache lines are organized into cache sets. A main memory word maps to exactly one
cache set but may be placed in any of the cache lines in that set. Essentially this means that a
memory word can only be placed in a subset of the cache. A main memory address is divided
into three fields: an address tag, a set index and a block offset. The set index field selects the
cache set, the address tag uniquely identifies the memory word, and the block offset locates
the word within the cache line. A set associative cache becomes a fully associative cache
when the cache has only one cache set.
The number of cache lines in a cache set is referred to as the associativity of the cache. For
example, if there are four cache lines in a cache set, the associativity is four, and the cache is
called a four-way set associative cache. Figure 2.3(b) shows the mapping of a main memory
word into a two-way set associative cache. The cache has the same number m of cache lines
but is arranged into m/2 cache sets. Each main memory word has two possible locations in
the cache. The lowest log₂ (m/2) bits of the address are used to map the word into the cache.
Figure 2.4 shows the structure of an eight-way set associative CPU cache. The cache has 1024
cache sets. Each cache set has eight cache lines, and each cache line stores eight words. The
address the CPU is currently referencing is stored in the address latch. The middle ten bits of
the address latch select one of the 1024 cache sets. The lowest three bits are the block offset
used to index into the eight words of a cache line. The highest 19 bits form the address tag.
All eight address tags of the selected cache set are compared with the address tag in the address
latch. If there is a match, the hit/miss signal indicates a cache hit; otherwise it indicates a miss.
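The address decomposition for this eight-way cache can be sketched as follows. The field widths (3 offset bits, 10 set-index bits, 19 tag bits for a 32-bit word address) come from the description above; the function name and the sample address are illustrative.

```python
# Sketch: splitting a 32-bit word address for the eight-way set
# associative cache of Figure 2.4 (1024 sets, 8 words per line).
NUM_SETS = 1024          # -> 10 set-index bits
WORDS_PER_LINE = 8       # -> 3 block-offset bits

def split_address(addr: int) -> tuple[int, int, int]:
    """Return (tag, set_index, block_offset) for a word address."""
    offset = addr & (WORDS_PER_LINE - 1)          # lowest 3 bits
    set_index = (addr >> 3) & (NUM_SETS - 1)      # middle 10 bits
    tag = addr >> 13                              # highest 19 bits
    return tag, set_index, offset

print(split_address(0x12345))  # (9, 104, 5)
```

On a lookup, only the eight tags stored in set `set_index` need to be compared with `tag`, which is the comparator step shown in Figure 2.4.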
2.1.4 Multiple Level CPU Caches
Modern CPUs usually have a hierarchy of caches. Most current CPUs have two levels of
CPU caches. The first level cache is called the L1 cache and can be accessed in one or two
cycles. Gate delay and wire delay limit the size of the L1 cache. Typically, the L1 cache is
only 32KB or 64KB. The same speed constraint also limits the associativity of the L1 cache.
To achieve high speed, the L1 cache may be direct-mapped.
The second level cache is called the L2 cache. L2 caches are typically accessed in around ten
CPU cycles and are much larger than L1 caches [PH05]. The L2 cache usually has higher
associativity. L2 caches can be 16 or 32 way associative. Besides the L1 and the L2 caches,
some CPUs have a level three cache. L3 caches are slower and larger than L1 and L2 caches.
[Figure omitted: (a) a direct-mapped cache with m lines, where main memory word n maps to exactly one line; (b) a two-way set associative cache with m/2 sets, where word n may be placed in either line of one set.]
Figure 2.3: The mapping of main memory words into a direct-mapped cache and a two-way set associative cache.
Currently, both the L1 and the L2 caches use LRU replacement. LRU replacement at the L2
and L3 caches is actually least recently missed replacement. The references at the L1 cache
are invisible to the lower level caches, so the most recently referenced or loaded address at
the L2 or L3 cache is not necessarily the address the CPU most recently referenced but the
address most recently missed in the higher level cache. LRU at the L2 cache therefore does
not exactly follow the definition of temporal locality [PHS98]. The hit rates of LRU
replacement at the L2 or L3 cache are low, which is considered by [PHS98] to be a problem
of LRU.
[Figure omitted: an address latch split into tag, set index and block offset fields; a decoder selects one of 1024 cache sets (set 0 to set 1023); each set holds eight lines (Line0 to Line7), each with a TAG and DATA field; a comparator matches the incoming tag against the stored tags and drives a hit/miss signal to the CPU.]
Figure 2.4: The structure of an eight-way set associative cache.
2.2 Efforts to Improve Cache Hit Rates
CPU cache hit rates are critical to the performance of CPUs. Many efforts have been made
to improve the CPU cache hit rates. This section briefly describes some of the efforts to
improve the CPU cache hit rate.
2.2.1 Cache Line Size, Prefetching and Stream Buffer
A study [Smi87] found that the optimal cache line size is between 16 and 64 bytes. This
conclusion may no longer be accurate. Due to the advances in CPU speed and the almost
constant memory latency, larger cache lines are favored. The slight success of the always-fetch
pre-fetch policy suggests that larger cache lines should be favored [Smi82]. In always-fetch
pre-fetching, the CPU cache always reads in the cache line next to the cache line that contains
the word the CPU is currently referencing. Always-fetch shows a slight but noticeable and
consistent improvement in cache hit rates. Studies [SS01] of the cache performance of
multimedia workloads suggest that larger cache lines, for example 128 bytes or more, have
better performance for multimedia applications, since audio and video data come in large
packets.
However, larger cache lines also increase cache pollution. Cache pollution [Smi82]
refers to the effect that useful cache contents are flushed out by content that is almost never
re-used. The larger the cache line, the higher the chance of cache pollution. This has
a negative impact on multi-programming workloads [LP98, VL00, Smi82, Hig90]. In
multi-programming environments, larger cache lines cause the cache content of one process
to be flushed more quickly by the other processes, introducing extra cache misses.
The advantage of larger cache lines is that they provide better memory bandwidth. One
approach that allows the use of larger cache lines but reduces their cache pollution is to
separate transfer sizes from cache line sizes [Prz90]. Small cache line sizes are used inside
the cache, but on cache misses, multiple adjacent cache lines are transferred together. Work
described in [Jou98] suggests that the extra cache lines fetched should be stored in a small
special on-chip buffer instead of in the cache, thus reducing the cache pollution. If a cache
line stored in the buffer is referenced, it is then put into the cache. This buffer is called the
stream buffer. Studies [TFMP95, RTT+98, VL00, PCD+01, BAYT01, KS02, CH02] show
that the stream buffer works well. The possibility of using stream buffers to replace the
entire L2 cache is discussed in [PK94]: a stream buffer with only 10 entries achieves more
than 50% hit rates and in many cases is comparable to an L2 cache.
Different transfer sizes and a fixed small cache line size have been adopted by adaptive
cache systems [VTG+99, JmWH97, KBK02, LKW02]. One of the main purposes of these
adaptive cache systems is to provide better service for multimedia workloads [LKW02], as
multimedia data requires large cache lines [SS01].
The problem with cache line sizes is that there is no single fixed optimal cache line size for
all applications, since the probability of visiting adjacent words varies among applications.
This leads to an adaptive cache line size approach [VTG+99], where the cache line size is
dynamically increased or decreased depending on whether the recorded probability of
visiting the adjacent lines is high or low enough. Initially, the cache line size is large. The
cache line size then halves or doubles to approach the optimal size. The use of adaptive line
sizes showed great improvement both in hit rates and memory traffic. For applications that
need smaller cache lines, the decrease in memory traffic is as much as 50% relative to a fixed
32-byte line size and even more for larger fixed line sizes. These results can be viewed as
evidence favoring smaller line sizes. Due to the complexity and high cost of implementing
variable size cache lines, adaptive cache lines are seldom used in CPU caches.
False sharing in multiprocessors suggests that smaller cache lines are preferable [EK89].
False sharing refers to the situation where two or more processors do not refer to the same
addresses, but the addresses visited happen to be in the same cache line. The larger the
cache line, the higher the probability of false sharing. On the other hand, other work [RSG93]
suggests that false sharing is not a problem, since multiple processors seldom share data at
the granularity of a cache line. However, for multi-threading, which operates in the same
address space and has a high frequency of sharing, false sharing is an issue [EK89]. Studies
by [Tor94] show that the poor spatial locality of the shared data has an even larger impact
on CPU cache performance.
The key point in deciding the optimal line size is whether the predicted probability of
visiting the adjacent addresses is high enough to compensate for the latency of reading many
words and for the cache pollution. This is related to the accuracy of the cache replacement
algorithm. A better cache replacement algorithm, one that can tolerate cache pollution, will
benefit from larger cache line sizes.
The cache line size is important because it is actually the simplest and most efficient form
of cache pre-fetching. Many studies have tried to get the most out of the cache line size
design [RBC02, CMT94, PH90a, CHL99, CDL99]. For example, the work in [CHL99,
CDL99] packs data structure elements likely to be accessed together into a cache line.
However, this requires the programmer to modify the code. Section 2.4 introduces a code
placement method [MPBO96] that simply puts frequently used code and data in the same
cache lines.
2.2.2 Cache Sizes and Hit Rates
The CPU cache size is the most important factor in cache hit rates. Larger caches are
certainly better. However, there is no simple way to calculate the cache hit rate for a given
cache size, since the hit rate varies with workloads. In the 1970s, many formulas were
developed to express the cache hit rate as a function of cache size [Smi82]. These formulas
seldom had any real value in estimating cache performance since the workloads used were
not representative of real workloads. A formula was valid for the workload from which it
was derived but was usually not valid for other workloads.
In 1968, a working set model was proposed in [Den68]. The working set is defined as the
subset of the program image currently in active use. The assumption is that if the cache size
is smaller than the working set, there will be a large number of cache misses. The working
set model predicts that the cache size must be larger than some threshold, which is the size
of the working set, and that increasing the cache size beyond the threshold will only generate
marginal gains. For years, the working set behavior, or more accurately the threshold
behavior of cache sizes, has been widely observed [Smi82].
A study in 1983 [Goo83] claimed that small caches are surprisingly effective. The
effectiveness of small caches is attributed to the small size of the working sets of applications
at that time. This may have been true then, but it may no longer be true for applications
today. Current applications such as databases and network based applications are huge,
thousands of times larger than applications of 20 years ago. As a general rule, the miss rate
of a current cache is still around 1% regardless of the cache replacement algorithm used
[HP96].
A set of well known rules of thumb for analyzing cache performance [Hig90] includes the
following rule for cache sizes:
doubling the cache size decreases the miss rate by 25%.
Larger cache sizes are also beneficial in multi-programming environments, where several
independent streams of execution compete for the limited cache space and cause extra cache
misses; it is possible for larger caches to accommodate them all. However, up to now, no
known study quantitatively relates cache sizes to multi-programming performance.
2.2.3 Cache Associativity and Victim Cache
The associativity of a cache also has an important impact on cache performance. One rule
of thumb [HP96] states that ‘the miss rate of a direct-mapped cache of size N is about the
same as a two-way set-associative cache of size N/2.’ Higher associativities are believed
to have better hit rates but come with a cost. Before CMOS (Complementary Metal Oxide
Semiconductor) technology, the cost of doubling cache associativity was almost the same as
doubling the cache size [Hil88].
In a direct mapped cache, one main memory word can go to only one cache line. The only
address comparison is between the cache line tag and the address tag of the word the CPU is
referencing. There is no need for a replacement algorithm. For set associative caches, more
comparisons between the cache line tags and the address tag of the memory word the CPU
is referencing are needed, and a replacement algorithm is required to determine the cache
line to be replaced. The Least Recently Used (LRU) replacement is typically used. LRU
can be effectively implemented in hardware [Smi82].
Direct mapped and set associative caches cause conflict misses. Conflict misses are cache
misses that would not happen in a fully associative cache. They are caused by the limited
replacement choices within a single cache set. Conflict misses typically account for 20% to
40% of all misses of a direct mapped cache [Smi82].
Fully associative caches avoid conflict misses but come with a high cost. A fully associative
cache requires comparing the tags of all cache lines with the address tag of the memory word
the CPU is referencing, and the circuitry to do all these comparisons in parallel is expensive.
An alternative is a software based fully associative cache [HR00]. Modern CPUs seldom
have fully associative caches or caches of higher than 32-way associativity. For example,
the Pentium 4 has a four-way set associative L1 cache and an eight-way L2 cache [Int04].
It is observed [Smi82] that the optimal associativity is four to eight; beyond eight, the miss
rate decreases very little, if at all. One study [HS89] provides quantitative values for how
the miss rate changes with associativity. It shows that when the associativity decreases from
eight to four, from four to two, and from two to direct-mapped, the increases in miss rates
are about 5, 10 and 30 percent respectively. This conforms with other studies [PHH88].
Direct-mapped caches are fast but also have high miss rates [Hil88]. One approach to having
the speed benefits of direct-mapped caches while keeping miss rates comparable to set-
associative caches is suggested in [Jou98]. A small fully associative cache called the victim
cache is attached to a direct-mapped cache. When a line is evicted from the direct-mapped
cache, it is temporarily placed in the victim cache. The victim cache uses LRU replacement.
The victim cache is usually only one to five entries in size but can provide good results. For
example, a five-entry victim cache can remove an average of 50%, and in some cases 90%,
of the conflict misses.
2.2.4 Split Instruction and Data Cache
Instructions and data can be stored in different caches. Separating the instruction and data
caches has many advantages [Smi82]. First, it permits parallel access to the instruction and
data caches by CPU pipelines. Second, separate instruction and data caches are smaller, can
thus be placed near the execution units, and therefore have higher access speeds. Finally,
instructions and data display different characteristics of locality [Prz90, FTP94, ASW+93,
Smi82]; split instruction and data caches reduce possible interference between them.
Currently, instruction and data caches are often of the same size. However, this decision is
made more for simplicity than based on research. One study showed evidence that a
balanced cache should have a larger instruction cache [FTP94]. Another study analyzed
optimal replacement and also suggested that larger instruction caches are appropriate
[ASW+93].
2.3 Cache Replacement Algorithms Other Than LRU
LRU is the most widely used cache replacement algorithm. LRU is based on the property
of temporal locality [Smi82]. Temporal locality [Smi82] refers to the observation that ‘the
information that will be used in the near future is likely to be in use already.’ A corollary of
temporal locality is that recently used blocks are likely to be used again. Thus the best
candidate cache line for eviction is the least recently used block [HP96]. The LRU
replacement algorithm works as a stack and always puts the just referenced cache line on
top of the stack. In CPU caches, LRU can be efficiently implemented with a small number
of bits since the associativities of CPU caches are very limited [Smi82]. For example, in a
two-way associative cache, only one bit is required. This section discusses other cache
replacement algorithms.
2.3.1 Pseudo-LRU Replacements
LRU keeps all cache entries in the order of their last reference time. To keep a strict LRU
order, a relatively large number of bits is needed [Tan87]. For example, for an eight-way
associative cache, 8 × 8 = 64 bits are needed to keep all eight cache lines in LRU order. For
a higher associativity, for example 16-way, LRU requires 16 × 16 = 256 bits. Figure 2.5
shows the space needed to implement LRU on an eight-way associative cache set.
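One classic hardware scheme behind this 8 × 8 = 64-bit cost is the LRU reference matrix: on an access to line i, row i is set to all ones and column i is cleared, so the least recently used line is the one whose row is all zeros. This is a standard technique rather than a detail given here; a simulation sketch with illustrative names:

```python
class MatrixLRU:
    """Matrix (reference-bit) LRU for one cache set of n lines."""
    def __init__(self, n: int):
        self.n = n
        self.m = [[0] * n for _ in range(n)]  # n*n bits, e.g. 64 for n=8

    def touch(self, i: int) -> None:
        # Set row i to all ones, then clear column i: line i is now
        # marked "newer" than every other line.
        for j in range(self.n):
            self.m[i][j] = 1
        for j in range(self.n):
            self.m[j][i] = 0

    def victim(self) -> int:
        # The least recently used line is the one whose row has the
        # fewest ones (all zeros once every line has been touched).
        return min(range(self.n), key=lambda i: sum(self.m[i]))

lru = MatrixLRU(4)
for line in [0, 1, 2, 3, 1]:
    lru.touch(line)
print(lru.victim())  # 0: line 0 is the least recently used
```

The update is a simple row-set and column-clear, which is why true LRU is cheap at low associativity but costs n² bits per set as associativity grows.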
To save space, Pseudo-LRU replacement algorithms have been proposed. Two kinds of
Pseudo-LRU replacement are widely used, Pseudo-LRU-tree (PLRU-tree) and Pseudo-LRU-msb
(PLRU-msb). In Pseudo-LRU replacement algorithms, the LRU order of cache lines is only
approximately kept. For example, in PLRU-tree, only the just referenced line is accurately
recorded, and the order of the other cache lines is not precise. At the cost of precision,
Pseudo-LRU replacement algorithms need fewer bits for replacement decision making. For
an eight-way associative cache, PLRU-tree needs seven bits and PLRU-msb needs eight bits,
roughly one bit per cache line. Figures 2.6 and 2.7 show the space arrangement of the
PLRU-tree and PLRU-msb replacements respectively.
One study [AZMM04] shows that pseudo-LRU replacement algorithms do not significantly
impact the hit rate.
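For an eight-way set, the seven PLRU-tree bits form a binary tree of internal nodes, each bit pointing toward the half of its subtree that was less recently used. A sketch of this scheme (the class, method names, and the 0/1 convention are illustrative; real designs vary):

```python
class PLRUTree:
    """Tree pseudo-LRU for an 8-way set: 7 internal-node bits.
    bits[k] = 1 means 'the left subtree of node k is older' here."""
    def __init__(self):
        self.bits = [0] * 7  # heap layout: node k has children 2k+1, 2k+2

    def touch(self, way: int) -> None:
        # Walk from the root toward the accessed way, pointing each
        # node bit AWAY from the path just taken.
        node = 0
        for level in (4, 2, 1):                    # half-widths in ways
            right = (way % (2 * level)) >= level
            self.bits[node] = 0 if right else 1    # point at the other side
            node = 2 * node + (2 if right else 1)

    def victim(self) -> int:
        # Follow the bits from the root; they lead to a pseudo-LRU way.
        node, way, level = 0, 0, 4
        while level >= 1:
            if self.bits[node]:
                way += level
                node = 2 * node + 2
            else:
                node = 2 * node + 1
            level //= 2
        return way

p = PLRUTree()
for w in range(8):
    p.touch(w)
print(p.victim())  # 0: way 0 is (approximately) least recently used
```

Only the path of three bits is updated per access, which is why PLRU-tree tracks the just-referenced line exactly but keeps only an approximate order for the rest.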
[Figure omitted: an eight-way cache set (Line 0 to Line 7), each line holding tag + data + status (19 + 256 + 2 = 277 bits), plus the LRU matrix bits needed for true LRU.]
Figure 2.5: Storage arrangement of an eight-way associative cache set using real LRU replacement.
2.3.2 First-In-First-Out and Random Replacements
The majority of CPU caches use the Least Recently Used (LRU) replacement or its variants.
Other frequently mentioned but seldom used CPU cache replacement algorithms include
Least Frequently Used (LFU), First In First Out (FIFO), and Random.
FIFO evicts the oldest cache line, even if the oldest line was just visited. Random does not
exhibit any pattern in selecting a cache line to evict. These two replacement algorithms are
considered inferior to LRU, but the differences in hit rates between these replacement
algorithms and LRU are small, except for LFU. Random replacement is less than one percent
worse than LRU in CPU cache hit rates [HP96]. Table 2.1 compares the miss rates of the
Random and LRU replacement algorithms.
[Figure omitted: an eight-way cache set (Line 0 to Line 7), each line holding tag + data + status (19 + 256 + 2 = 277 bits), plus the seven PLRU-tree bits.]
Figure 2.6: Storage arrangement of an eight-way associative cache set using PLRU-tree replacement.
            Two-way          Four-way         Eight-way
Size        LRU     Random   LRU     Random   LRU     Random
16 KB       5.18%   5.69%    4.67%   5.29%    4.39%   4.96%
64 KB       1.88%   2.01%    1.54%   1.66%    1.39%   1.53%
256 KB      1.15%   1.17%    1.13%   1.13%    1.12%   1.12%

Table 2.1: Typical miss rates of LRU and Random with different cache sizes and associativities [HP96].
2.3.3 LRU-k and LIRS Replacement Algorithms
In the field of database disk buffer caches, many replacement algorithms other than LRU
and LFU have been proposed. In database buffer caches, the locality of reference is less
dominant than in CPU caches [OOW93]. The LRU replacement algorithm is found incapable
of handling some scenarios, and modifications of LRU and LFU have been developed. These
include LRU-k [OOW93], LIRS [JZ02], FBR [RD90], LRFU [LCK+01], 2Q [JS94] and
Multi-Queue [ZCL04]. In the field of virtual memory page buffers, a modification of LRU
called EELRU [SKW99] has been proposed.
One modification of LRU is LRU-k [OOW93]. LRU-k replaces the cache item whose kth
most recent reference is the oldest; the age of this kth most recent reference is called the
backward k-distance. If a cache
[Figure omitted: an eight-way cache set (Line 0 to Line 7), each line holding tag + data + status (19 + 256 + 2 = 277 bits), plus the per-line MSB bits.]
Figure 2.7: Storage arrangement of an eight-way associative cache set using PLRU-msb replacement.
item has not yet been referenced k times, its backward k-distance is defined as infinite. The
cache item with the largest backward k-distance is replaced. The LRU replacement
algorithm can be viewed as a special case of LRU-k where k equals one. The work in
[OOW93] suggests that for most cases LRU-2 is sufficient: LRU-2 shows no difference
from LRU-3, but LRU-2 has higher hit rates than LRU for database traces. Experiments in
[OOW93] show that LRU-2 usually requires smaller cache sizes than LRU to reach a given
hit rate.
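The eviction rule just described can be sketched directly: each block keeps the times of its past references, and the victim is the block whose kth most recent reference is oldest, with blocks referenced fewer than k times treated as infinitely distant (the function name and the tie-break for such blocks are illustrative assumptions):

```python
def lru_k_victim(history: dict[str, list[int]], k: int = 2) -> str:
    """history maps block id -> reference times (ascending).
    Returns the block to evict under LRU-k."""
    def key(block: str):
        times = history[block]
        if len(times) < k:
            # Backward k-distance is infinite; prefer evicting these,
            # oldest last reference first (one common tie-break).
            return (0, times[-1])
        return (1, times[-k])            # older kth-last reference first
    return min(history, key=key)

# A and B were referenced twice, C only once: C has an infinite
# backward 2-distance and is evicted first.
h = {"A": [1, 5], "B": [2, 6], "C": [4]}
print(lru_k_victim(h))  # C
```

With k = 1 the rule degenerates to plain LRU, matching the special-case relationship noted above.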
A buffer cache replacement algorithm called LIRS [JZ02] was developed to deal with the
following scenarios where LRU fails:
• Sequential scans, which are bursts of references to infrequently used blocks;
• Cyclic (loop-like) accesses, where the interval of the loop access may be larger than the
cache size;
• Multi-user applications, which exhibit a property similar to cyclic accesses but caused
by massive independent user inputs.
In these scenarios, the just referenced item will not be hit again in a short time. LIRS divides
addresses into two groups, a good cacheable group and a bad cacheable group, and makes
its replacement decisions based on the number of blocks referenced between two consecutive
references to the same block. LIRS assumes that future references will likely have the same
scale of distances. LIRS calls the number of other blocks referenced between the last two
references of a block the IRR value of the block. LIRS records the IRR value of each block
and divides all blocks into two groups: blocks with high IRR (HIR) values are in the bad
group, and blocks with low IRR (LIR) values are in the good group. The cache is also
partitioned into two parts, one for the HIR blocks and the other for the LIR blocks. The LIR
partition is the larger one, receiving 99% of the cache size, while only one percent of the
cache size is given to the HIR partition. When replacing a cache block, one entry in the HIR
partition is evicted, and the LIR partition remains intact. Blocks are moved between the LIR
and HIR partitions: when a block in the HIR part is hit, its IRR value is recalculated, and if
the new IRR value is smaller than that of the maximum-IRR block in the LIR part, the two
blocks are exchanged.
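The IRR of a block, as defined above, can be computed from a reference trace by counting the distinct other blocks seen between its last two references (a sketch; the function name is illustrative):

```python
def irr(trace: list[str], block: str) -> float:
    """Number of distinct other blocks between the last two
    references to `block`; infinite if referenced fewer than twice."""
    positions = [i for i, b in enumerate(trace) if b == block]
    if len(positions) < 2:
        return float("inf")
    between = trace[positions[-2] + 1 : positions[-1]]
    return len(set(between))

t = ["A", "B", "C", "B", "A", "D", "A"]
print(irr(t, "A"))  # 1: only D appears between the last two A's
print(irr(t, "C"))  # inf: C was referenced once
```

In this trace A would be a LIR (good) block and C a HIR (bad) one, so a scan of once-touched blocks like C can only displace the small HIR partition.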
In LIRS, if a block is referenced only once, its IRR value is defined as infinite, just as in
LRU-k a cache line that has not been referenced k times has its backward k-distance set to
infinite. LIRS and LRU-k are not suitable for CPU caches: they are too expensive to
implement in hardware, and due to the heavy locality of reference seen by CPU caches, they
may not be able to outperform LRU.
2.3.4 LFU and FBR Replacement Algorithms
In the Least Frequently Used (LFU) replacement algorithm, the reference count of each
address is recorded. When replacing, the address with the minimal reference count is evicted.
Addresses which are not hit immediately after being brought into the cache are replaced
quickly by LFU. However, LFU tends to keep some addresses in the cache too long. If some
addresses are referenced heavily only in a specific period and never again, these addresses
will be fixed in the cache for a long time by LFU. This effectively reduces the usable cache
size and results in poor LFU hit rates. Variants of LFU use aging mechanisms that reduce
the reference counts of addresses periodically.
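A small sketch of LFU with periodic aging, as just described (the class name, halving rule, and aging interval are illustrative assumptions, not taken from any specific LFU variant):

```python
class AgedLFU:
    """LFU with periodic halving of reference counts ('aging')."""
    def __init__(self, capacity: int, age_every: int = 100):
        self.capacity, self.age_every = capacity, age_every
        self.counts: dict[str, int] = {}
        self.ticks = 0

    def access(self, addr: str) -> None:
        self.ticks += 1
        if addr not in self.counts and len(self.counts) >= self.capacity:
            victim = min(self.counts, key=self.counts.get)  # least frequent
            del self.counts[victim]
        self.counts[addr] = self.counts.get(addr, 0) + 1
        if self.ticks % self.age_every == 0:
            # Aging: halve every count so once-hot, now-idle blocks decay.
            self.counts = {a: c // 2 for a, c in self.counts.items()}

cache = AgedLFU(capacity=2)
for a in ["A", "A", "A", "B", "C"]:
    cache.access(a)
print(sorted(cache.counts))  # ['A', 'C']: B, with count 1, was evicted
```

Without the halving step, a block like A would keep its high count forever even if never referenced again, which is exactly the pathology described above.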
The Frequency Based Replacement (FBR) algorithm is a modification of the LFU
replacement algorithm that is also used in database buffer caches [RD90]. FBR differs from
LFU in that ‘locality is factored out.’ In FBR, blocks are kept in LRU order, and the top
portion of the LRU stack is defined as the new sector. The reference count of a block is not
increased if the block is hit in the new sector. By doing this, the references within a burst to
a block are not counted, and the locality
is factored out.
The Least Frequently Used replacement algorithm and its variants are never used in CPU
caches, since CPU caches are too small compared with the main memory. LFU and its
variants keep previously frequently visited items too long; if used in CPU caches, their hit
rates would be very low.
2.3.5 LRFU, Multi-Queue and EELRU Replacement Algorithms
The Least Recently/Frequently Used (LRFU) replacement algorithm [LCK+01] uses
numerical values to represent the replacement priority of cache lines. LRFU was proposed
for database buffer caches and is designed to subsume both LRU and LFU. LRFU assigns
a value called the Combined Recency and Frequency (CRF) to each cache block, and when
replacing a block, the block with the minimal CRF value is chosen. The CRF value is
calculated by summing up the weights of the references to a block. Every reference is
assigned a weight based on the time span from the point of reference up to the current time.
Newer references have higher weights, so the function used to calculate the weight must be
a decreasing function. All these weights are summed up, and the sum is the CRF value of
the block. Depending on how the weight is calculated, the CRF value can be a real number.
For example, LRFU mimics LRU by using an exponential weighting function with values
smaller than one. A shortcoming of LRFU is that the calculation of weights is very complex
and costly, since the entire reference history of a block is needed. A further drawback is that
the CRF values of all blocks, and thus the weights of all past references, have to be calculated
again for every new reference. For example, suppose a block was referenced at times 1, 2, 5
and 8, and the current time is 10. The CRF value C of the block at time 10 is calculated as
C = f(10 − 1) + f(10 − 2) + f(10 − 5) + f(10 − 8) = f(9) + f(8) + f(5) + f(2). When the
current time is 12, the CRF value is recalculated as C = f(11) + f(10) + f(7) + f(4). To
simplify the calculation, functions where f(x + y) = f(x) ∗ f(y) should be used.
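Why the condition f(x + y) = f(x) ∗ f(y) helps: with such a function (an exponential is the natural choice), the CRF at a new reference can be obtained by decaying the previous sum and adding f(0), instead of re-summing the whole history. A sketch using the example times above (the decay constant and function names are illustrative):

```python
def f(x: float, lam: float = 0.1) -> float:
    # Exponential weight: satisfies f(x + y) == f(x) * f(y).
    return 0.5 ** (lam * x)

def crf_full(ref_times: list[float], now: float) -> float:
    # Definition: sum the weight of every past reference.
    return sum(f(now - t) for t in ref_times)

def crf_step(prev_crf: float, prev_time: float, now: float) -> float:
    # Incremental update at a new reference: decay the old sum, add f(0).
    return prev_crf * f(now - prev_time) + f(0)

refs = [1, 2, 5, 8]
c = 0.0
for i, t in enumerate(refs):
    c = crf_step(c, refs[i - 1] if i else t, t)
assert abs(c - crf_full(refs, now=8)) < 1e-9  # both methods agree
```

The incremental form needs only the previous CRF value and the time of the previous reference per block, removing the need to store the entire reference history.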
Early Eviction LRU (EELRU) addresses LRU's failure to predict cyclic references [SKW99].
EELRU uses the LRU order for replacement most of the time. When it detects that recently
evicted pages are being referenced again after a short time, it evicts the eth most recently
referenced page instead of the least recently used one, where e is a pre-determined recency
position.
Another replacement algorithm for database buffer caches is Multi-Queue replacement for
second level buffer caches [ZCL04]. Multi-Queue is a further development of the 2Q
replacement. Based on the belief that LRU-2 is too expensive to implement, the 2Q
replacement was developed [JS94]. In 2Q, one FIFO queue A1in and two LRU lists A1out
and Am are used. Blocks are first put into A1in and, when evicted, their identifiers are put
into A1out. When a block is referenced while in A1out, it is moved to Am. Multi-Queue
further develops this idea and changes the Am queue into many ranked LRU queues. Blocks
referenced in a lower queue are promoted to higher ranked queues; a function controls how
many references place a block in which rank of queue. When replacing, entries from the
lower ranked queues are evicted first. Multi-Queue was found to outperform FBR, 2Q,
LRFU, LRU-2, LRU, and LFU in L2 database buffer caches [ZCL04].
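The 2Q mechanism described above can be sketched as follows (the class name and queue sizes are illustrative; the actual algorithm also tunes the A1in/A1out thresholds):

```python
from collections import OrderedDict, deque

class TwoQ:
    """Simplified 2Q sketch: A1in (FIFO), A1out (ghost ids), Am (LRU)."""
    def __init__(self, a1in_size=2, a1out_size=2, am_size=2):
        self.a1in = deque()      # recently admitted blocks
        self.a1out = deque()     # identifiers of blocks evicted from A1in
        self.am = OrderedDict()  # "hot" blocks, kept in LRU order
        self.sizes = (a1in_size, a1out_size, am_size)

    def access(self, b):
        in_sz, out_sz, am_sz = self.sizes
        if b in self.am:                    # hit in Am: refresh LRU position
            self.am.move_to_end(b)
        elif b in self.a1out:               # re-reference after eviction:
            self.a1out.remove(b)            # promote into Am
            if len(self.am) >= am_sz:
                self.am.popitem(last=False)
            self.am[b] = True
        elif b not in self.a1in:            # cold miss: admit into A1in
            if len(self.a1in) >= in_sz:
                old = self.a1in.popleft()   # evict oldest, remember its id
                self.a1out.append(old)
                if len(self.a1out) > out_sz:
                    self.a1out.popleft()
            self.a1in.append(b)

q = TwoQ()
for b in ["A", "B", "C", "A"]:  # A falls out of A1in, then is re-referenced
    q.access(b)
print("A" in q.am)  # True: the repeat reference promoted A into Am
```

Only blocks that are referenced again after leaving A1in reach Am, which is how 2Q keeps one-shot scan blocks out of the hot part of the cache; Multi-Queue refines Am into several ranked queues.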
2.3.6 Dead Alive Prediction Replacement Algorithms
As the associativity of a CPU cache increases, the time that LRU keeps a dead cache line
increases. A dead cache line is a cache line that will not be referenced again. Limiting the
cache stay time of dead cache lines gives space to cache lines that are still alive. A set of
replacement algorithms predicting the dead/alive status of cache lines has been proposed
[HKM02, LFF01, KS05]. In [KS05], two prediction methods, AIP and LvP, are proposed.
AIP/LvP uses a counter per cache line to record events that happen to the cache line. An
event threshold is set; if the number of recorded events exceeds the threshold, the cache line
is marked for eviction. In AIP, the event recorded is the number of references to the cache
set between consecutive accesses to the cache line. In LvP, the event recorded is the number
of hits to the line. AIP/LvP evicts a predicted dead line faster than LRU. For L2 cache sizes
of 512KB and 1MB, AIP/LvP shows a 5% improvement in hit rates over LRU for a number
of SPEC CPU2000 benchmarks. A shortcoming of AIP/LvP is that it requires extra space
to store the event counters.
2.3.7 Off-Line Optimal Replacement Algorithm
If the future references of the CPU are known, a replacement decision can be made by
looking forward and evicting the cache entry that results in the lowest miss rate. This is
called the off-line optimal replacement algorithm. Off-line optimal replacement has the
highest possible hit rate for a given CPU cache configuration.
The optimal replacement algorithm is defined in [ADU71] as one that always evicts the
block ‘which has the longest expected time until next reference.’ When replacing, the cache
line whose next reference is farthest in the future is chosen. This replacement is called the
Optimal (OPT) replacement algorithm, and its hit rate is the theoretical upper limit of all
replacements. Figure 2.8 illustrates how the Optimal replacement decision is made. In the
example, there are four cache lines, containing addresses A, B, C, and D. Cache line A is to
be referenced again in the farthest future, and thus it is chosen to be replaced.
[Figure omitted: four cache lines holding A, B, C and D, with the future reference string E B F E B G C B C D F B C A; A's next use lies farthest in the future, so A is evicted.]
Figure 2.8: An example of the Optimal replacement decision.
Off-line optimal replacement is used in the competitive analysis of on-line replacement
algorithms [ST85]. The hit rate of optimal replacement is used to evaluate the performance
of cache replacement algorithms.
One result observed when comparing the miss rates of optimal replacement and LRU is that
LRU makes incorrect choices when choosing the block to evict [SA93]. For fully associative
caches, optimal replacement has 70% fewer cache misses, and for two-way associative
caches, optimal replacement has 32% fewer cache misses. Another finding is that the miss
rates of fully associative LRU caches on the SPEC benchmarks were sometimes worse than
those of direct-mapped or set-associative caches. In these cases, the limited choices of
direct-mapped and low-associativity caches actually helped LRU make better choices than
in fully or highly associative caches: with set-associative and direct-mapped caches, LRU
cannot always evict the globally least recently referenced line and is forced to keep it longer,
and this turns out to be the right choice. This phenomenon suggests that the oldest line is
still of value.
Optimal replacement is unrealizable for real-world caches, but it is useful in providing an
upper bound on the achievable cache hit rate. Chapter 8 compares the hit rates of two
replacement algorithms against the hit rates of the optimal replacement algorithm to better
understand the improvements.
2.3.8 Summary of Replacements
For CPU caches, there are few alternatives to LRU replacement. Either LRU outperforms
the new replacement algorithm, or the new replacement algorithm is too expensive to implement
in CPU caches. LRU performs well because temporal locality is much more pronounced and
dominant in the memory references seen by CPU caches than in database buffer caches. Hit
rates of even a small CPU cache are higher than 90%, but hit rates of database buffer caches
are much lower. For example, in [OOW93], the hit rates of the OLTP trace experiments are
no more than 50%.
2.4 CPU Cache Issues of Network Protocols and Applications
Network protocols and applications are believed to be programs with poor locality and thus
poor CPU cache hit rates. They exhibit inherently poor cache behavior for two reasons.
First, network protocols and applications work on multiprogramming platforms, and
multiprogramming has a negative impact on CPU cache performance. In multiprogramming,
multiple processes or threads compete for the limited CPU cache and may flush each other's
cache content, causing extra cache misses. Second, network protocols and applications lack
substantial computation. A huge amount of data is involved, but the operations on the data
are simple and few. The lower reuse of data means intrinsically fewer cache hits. One study
[CJRS89] examined the TCP code by profiling and counting the instructions of TCP processing.
The study found that the number of instructions TCP executed was very small; for example,
receiving a packet involves only 335 instructions. TCP has not significantly changed since
this paper was published, and thus its conclusions are still valid.
Network protocols themselves are simple but may interact with the operating system in costly
ways. Network protocols involve a small number of instructions [CJRS89], but studies
[NYKT97] show that instruction cache misses have the greatest impact on protocol latency.
Optimization methods targeting the instruction locality of protocols generated great
improvements [Bla96, MPBO96]; the study in [Bla96] eliminated 90% of instruction cache misses.
One study [NYKT97] compared the instruction cache references and data cache references
of both TCP and UDP. The study found that instruction references outnumber data references
two to one without checksumming and three to one with checksumming, and that the contribution
of instruction references to latency also outweighs that of data references for both UDP and
TCP. This finding indicates that the simplicity of protocol code does not imply that the
code causes fewer CPU cache problems than the data does.
Another study suggests that the many instruction cache misses of network protocols are caused
by the lack of locality in protocol code [Bla96]. This work studied the trace of TCP code
and found distinct phases. The processing cycle of TCP receiving and acknowledging is
divided into three phases: entry, device interrupt, and exit. The code traces of different
phases do not overlap, so the TCP processing instructions are not reused across phases.
The recorded instruction and data cache misses per packet were found to remain constant
across different arrival rates. This means that even when packets arrive quickly, the
processing of each packet is still independent and there is no increased reuse of
instructions in the cache.
An algorithm called Locality Driven Layer Processing (LDLP) is used to increase the reuse
of TCP code [Bla96]. Each TCP code stage is allowed to process as many packets as possible
before entering the next code stage. The instruction cache (I-cache) misses per message
were significantly decreased, from 900 to 100, for fast arrival rates. The data cache
(D-cache) misses increased somewhat with faster arrival rates, but the increase is negligible.
The work shows that before optimization I-cache misses were ten times more numerous than
D-cache misses, and with optimization the number of I-cache misses is almost the same as the
number of D-cache misses. LDLP scheduling greatly improved CPU cache hit rates. However,
LDLP scheduling cannot be used in a real network since it relies on cooperative packet
arrival rates.
A similar idea was proposed to increase the reuse of protocol code in cache in a multiprocessor
environment [SKT96]. This work proposed scheduling processors based on the affinity of the
code they run. Another approach to improving the poor locality of protocol code was proposed
in [MPBO96]. The TCP code is re-arranged by out-lining and cloning: frequently executed
instructions are compacted together, so the reuse of cache lines increases. This work
reported improvements in TCP latency by a factor of 1.35 to 5.8.
2.5 Summary
CPU caches rely on the property of locality in the memory references of programs. LRU
replacement is widely used in CPU caches because it is believed to best exploit temporal
locality. In databases, where the locality of references is not as intense as in CPU caches,
replacement algorithms other than LRU have been developed. These algorithms show significant
improvements over LRU. However, they are not used in CPU caches: either they are too
expensive to implement there, or LRU outperforms them.
Network protocols and applications have poor CPU cache hit rates. The LDLP scheduling
developed by Blackwell significantly reduced the cache misses of TCP processing. Other
locality optimizations also showed improvements in the CPU cache hit rates of network
protocols and applications. These studies show that there is room for optimizing CPU cache
design to achieve better performance for network protocols and applications. Because of the
large CPU cache miss penalty, optimizing the CPU cache performance of network protocols and
applications can greatly improve the performance of servers.
Chapter 3
Principle of Locality and Property of
Short Lifetime
Currently, LRU is the CPU cache replacement algorithm most widely used in computing
systems. LRU is based on the concept of temporal locality that states that ‘recently accessed
items are likely to be accessed in the near future’ [HP96]. LRU replaces the cache entry
that has not been referenced for the longest time. Under LRU replacement, if an address
is just referenced, it will stay in the cache for the longest possible time. If the program ex-
hibits strong temporal locality, then the address is most likely to be referenced in the future.
Chapter 2 introduces research that suggests that not all programs (specifically networked
applications) strongly exhibit temporal locality. This chapter and the next chapter describe
the research conducted in analyzing memory access patterns. This work is the basis for a
new cache replacement algorithm described in Chapter 5.
3.1 Memory Reference Traces
This study analyzes memory reference traces. A memory reference trace of a program
(henceforth referred to as a memory trace) is the stream of main memory addresses issued by
the CPU when executing the program. Standard memory traces, such as the traces of the SPEC
CPU2000 benchmarks [Hen00], provide a consistent foundation for meaningful comparison of
different CPU cache designs. SPEC benchmarks are the de facto benchmarks for CPU performance
comparisons, and CPU2000 is the most up-to-date version.
In this work, memory traces are empirically analyzed to better understand the locality
characteristics of the programs. The memory traces used include those of the SPEC CPU2000
benchmarks and memory traces of web servers (the latter analysis is in chapter 4). The SPEC
CPU2000 benchmarks consist of 12 integer programs and 14 floating point programs. The SPEC
CPU2000 integer programs are the following:

Name     Remarks
gzip     Data compression utility
vpr      FPGA circuit placement and routing
gcc      C compiler
mcf      Minimum cost network flow solver
crafty   Chess program
parser   Natural language processing
eon      Ray tracing
perl     Perl interpreter
gap      Computational group theory
vortex   Object-oriented database
bzip     Data compression utility
twolf    Place and route simulator
The SPEC CPU2000 floating point programs are the following:
Name      Remarks
wupwise   Quantum chromodynamics
swim      Shallow water modeling
mgrid     Multi-grid solver in 3D potential field
applu     Parabolic/elliptic partial differential equations
mesa      3D graphics library
galgel    Fluid dynamics: analysis of oscillatory instability
art       Neural network simulation; adaptive resonance theory
equake    Finite element simulation; earthquake modeling
facerec   Computer vision: recognizes faces
ammp      Computational chemistry
lucas     Number theory: primality testing
fma3d     Finite element crash simulation
sixtrack  Particle accelerator model
apsi      Solves problems regarding temperature, wind velocity, and distribution of pollutants
The SPEC CPU2000 benchmarks are frequently used as standard workloads in CPU and CPU cache
studies [CH01, PH90b, AZMM04]. The memory traces of the SPEC benchmarks are from the BYU
(Brigham Young University) trace archive1. The BYU SPEC benchmark traces were generated with
the CPU caches disabled and captured by hardware. Each BYU SPEC benchmark trace is slightly
more than ten million references long. The BYU trace archive also has L2 cache traces but
does not offer many choices of L1 cache configurations. The L2 cache traces used in this
chapter are therefore generated from the BYU L1 traces by simulating the L1 cache. We wrote
an L1 cache simulator, which offers a wide range of L1 cache configurations, and fed the BYU
L1 traces into it. The misses of the simulated L1 cache are dumped as the L2 cache traces.
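The L1-filtering step described above can be sketched as follows. This is a minimal illustrative sketch, not the actual simulator used in this work; the default geometry parameters are placeholders:

```python
from collections import OrderedDict

def l1_filter(trace, sets=256, ways=2, line_bytes=32):
    """Simulate a set-associative LRU L1 cache over an address trace and
    return the stream of L1 misses, which serves as the L2 cache trace."""
    cache = [OrderedDict() for _ in range(sets)]  # one LRU-ordered dict per set
    l2_trace = []
    for addr in trace:
        line = addr // line_bytes          # cache-line address
        s = line % sets                    # set index bits of the line address
        if line in cache[s]:
            cache[s].move_to_end(line)     # hit: refresh LRU position
        else:
            l2_trace.append(line)          # miss: this reference reaches the L2
            if len(cache[s]) >= ways:
                cache[s].popitem(last=False)  # evict the least recently used line
            cache[s][line] = True
    return l2_trace
```

For example, the trace [0, 4, 0] produces a single L2 reference, because addresses 0 and 4 fall in the same 32-byte line and only the first access misses.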
3.2 Principle of Temporal Locality and LRU
The Principle of Locality [HP96] refers to the belief that 'programs tend to reuse data and
instructions they have used recently'. The principle of locality consists of two kinds of locality:
1http://tds.cs.byu.edu/tds/
temporal locality and spatial locality. Temporal locality is defined [HP96] as 'recently
accessed items are likely to be accessed in the near future.' Spatial locality is defined as
'items whose addresses are near one another tend to be referenced close together in time'
[HP96]. Spatial locality is exploited by the cache line: multiple adjacent memory words are
loaded together as a cache line to increase the possibility of hits. For CPU cache replacement
algorithms, temporal locality is more important, and this work focuses on it. Unless
specified otherwise, in this work the term locality refers to temporal locality.
The Least Recently Used (LRU) cache replacement algorithm is based on this definition of
temporal locality. LRU selects for eviction the cache entry that has not been referenced for
the longest time. If an address has just been referenced, it stays in the cache for the
longest possible time; consequently, if the address is re-referenced, it is most likely to
result in a cache hit.
3.3 Inter-Reference Gaps and Temporal Locality
The term Inter-Reference Gap (IRG) [PG95, Quo94] refers to the number of memory references
between two consecutive references of the same address. For example, if there are 100 memory
accesses between two references of an address, then the IRG value is 100. Figure 3.1
illustrates the calculation of IRG values through an example sequence of main memory
references. This section presents the analysis of IRG values for the address traces
described in section 3.1.
[Figure: reference stream a b c b d a c e b f d a d c b e b c f d, with whole stream IRG values 5 and 7 marked; Set0 sub-trace a c a c e a c e c and Set1 sub-trace b b d b f d d b b f d, with the corresponding per set IRG values 2 and 4 marked]
Figure 3.1: Two per set IRG values and their corresponding whole stream IRG values.
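The IRG computation can be sketched as below. This sketch is ours, not from [PG95]; it records a gap as the difference between the stream positions of two consecutive references of the same address (so an immediate re-reference yields an IRG of one, which is the convention that matches the per set IRG discussion later in this chapter):

```python
def irg_values(trace):
    """Whole-stream IRGs: for each re-reference of an address, record the
    gap (in references) since its previous reference."""
    last_seen, gaps = {}, []
    for t, addr in enumerate(trace):
        if addr in last_seen:
            gaps.append(t - last_seen[addr])  # gap since the last reference
        last_seen[addr] = t
    return gaps
```

On the prefix a b c b d a of the Figure 3.1 stream, this yields gaps of 2 (for b) and 5 (for a).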
3.3.1 Inter-Reference Gaps and LRU
Under the LRU cache replacement algorithm, IRG values determine whether a memory reference
is a cache hit or a miss. Assuming a fully associative (see section 2.1.3) CPU cache of size
256 KB, all IRG values below 256K can be satisfied in cache. IRG values larger than the cache
size may still result in cache hits, since it is the number of unique addresses in a gap that
decides whether it is a hit or a miss. For two IRG gaps of the same value, the gap with more
unique addresses is more likely to result in a cache miss than the one with fewer unique
addresses.
The study of IRG values in this chapter is not the first study of IRG values. One study
[PG95] found that the distribution of IRG values of a single address is highly clustered. A
Markov chain model was proposed to predict IRG values, and cache replacement decisions are
based on these predictions. Figure 3.2 shows the IRG values of three addresses in the memory
trace of the program CC1, a C compiler [PG95]. The IRG index in Figure 3.2 is the time of
the reference; the first reference to an address has an IRG index of one. The IRG value is
the size of the IRG gap in units of memory references. The three addresses in the figure are
the most referenced, the 10th most referenced, and the 100th most referenced address of the
memory trace. The clustered distribution of IRG values is seen in Figure 3.2. However, the
three addresses chosen in [PG95] are among the most intensely referenced addresses. The IRG
distributions of less intensely referenced addresses, especially of those addresses which
are referenced only several times, are not provided in that study. The work presented in
this chapter is the first investigative study that analyzes the IRG values of all addresses.
3.3.2 Complete Program Stream and Per Set IRG Values
The analysis described in [PG95] measures IRG values over the entire program memory
reference stream, in which every memory reference counts in calculating an IRG value.
However, CPU caches are set associative; memory references to one cache set have no impact
on the cache hit rates of the other cache sets. The analysis conducted in this work
therefore measures IRG values on a per set basis, i.e., only memory references mapped to the
same cache set are counted in calculating an IRG value. For example, suppose address a is
referenced again after ten other memory addresses have been referenced. Measured in the
complete program memory reference stream, the IRG value is ten. However, if the ten memory
references are all mapped to different cache sets than address a, the per set IRG value is
one. In this work, IRG values measured on a single cache set are referred to as per set IRG
values, and the IRG values measured in [PG95] are referred to as whole stream IRG values.

Figure 3.2: IRG strings of three addresses in the CC1 trace [PG95]. IRG index is the index
number of the first reference of an IRG gap.
To get the per set IRG values, we first split the trace of memory references into per-set
sub-traces. Each sub-trace represents the memory references to a single cache set; the set
index part of the address is used to map a reference to a cache set. IRG values are then
calculated within the sub-traces, yielding the per set IRG values. Figure 3.1 presents an
example illustrating the derivation of per set IRG values from whole stream IRG values. In
the example, there are two cache sets: set0 and set1. Addresses a, c, and e are mapped to
set0, and addresses b, d, and f are mapped to set1. Two whole stream IRG values of size five
and seven, after being mapped to the two cache sets, become per set IRG values of size two
and four. In this work, the term IRG refers to per set IRG unless specified otherwise.
We wrote a software tool to map whole stream IRG values into per set IRG values and to
measure the distribution of per set IRG values. With this tool, we gathered the
distributions of per set IRG values for all SPEC CPU2000 benchmarks. All of these results can be found
in appendix A.
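The per-set split performed by this tool can be sketched as follows. This is an illustrative sketch, assuming the set index is taken from the line-address bits; the geometry defaults are placeholders, not the tool's actual parameters:

```python
def per_set_irgs(trace, sets=256, line_bytes=32):
    """Split an address trace into per-set sub-traces and measure IRG
    values within each sub-trace (per set IRGs)."""
    positions = [0] * sets   # current length of each set's sub-trace
    last_seen = {}           # cache line -> its last position in its sub-trace
    gaps = []
    for addr in trace:
        line = addr // line_bytes
        s = line % sets      # set index of the line
        if line in last_seen:
            gaps.append(positions[s] - last_seen[line])  # gap within the set
        last_seen[line] = positions[s]
        positions[s] += 1    # this reference extends set s's sub-trace
    return gaps
```

With two sets and one-word lines, the trace [0, 1, 2, 1, 3, 0] (even addresses in set 0, odd in set 1) yields per-set IRGs of 1 and 2, illustrating how whole-stream gaps shrink once the intervening references of other sets are excluded.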
3.3.3 Distributions of Per Set IRG Values and Temporal Locality
The property of temporal locality is seen in the distributions of per set IRG values.
Figure 3.3 shows the distributions of per set IRG values of eight SPEC CPU2000 benchmarks,
chosen randomly from the 26. The four programs gcc, gzip, crafty, and perl are integer
programs from the SPEC CPU2000 INT suite. The other four programs, wupwise, ammp, apsi, and
fma3d, are floating point programs. In appendix A, the distributions of per set IRG values
of all 26 SPEC CPU2000 benchmarks are provided.
The IRG values in Figure 3.3 are measured on a cache line basis instead of on individual
addresses. To save space, a CPU cache line always consists of multiple words; currently, a
cache line usually has eight or more words. The contiguous words of a cache line are loaded
into and evicted from the cache as a whole, so a hit on one word in the cache line is also a
hit for the other words in the line. IRG values measured on a cache line basis are therefore
smaller than those measured on individual addresses. The results presented in Figure 3.3 are
based on a cache configuration of 256 cache sets with 32-byte cache lines. The cache lines
of all cache sets are of the same size [PH05].
As seen in Figure 3.3, most of the IRG values are small: more than 90% of per set IRG gaps
are equal to one. Out of the 26 SPEC benchmarks, there are only two memory traces, mcf and
vortex, where the proportion of IRG values equal to one is less than 90%. Some programs have
as many as 98% of their per set IRG values equal to one.
The small IRG values are direct evidence of temporal locality. Since the majority of per set
IRG values are so small, even small CPU caches, for example a 2 KB cache, will have high hit
rates using cache replacement algorithms such as FIFO and Random. It has been observed in
[HP96, page 379] that the difference in hit rates between LRU and Random is minimal, with
LRU having slightly better hit rates. The small IRG values can explain this: they are so
small that most addresses are re-referenced immediately, before any replacement decision is
made, and thus any replacement algorithm can achieve high hit rates.
Using LRU replacement, IRG values are directly related to cache hits and misses. LRU
replacement guarantees that every address stays in the cache for a time span at least as
long as the associativity of the cache, so all IRG values below the associativity of the
cache result in cache hits under LRU. Because of the potentially large number of comparisons
needed to compare an address tag with the cache line tags, it is difficult for
set-associative CPU caches to achieve more than 32-way associativity. Assuming 16-way
associativity under LRU replacement, IRG values below 16 are cache hits; these IRG values
account for more than 95% of all per set IRG values (see table 4.3). However, increasing the
associativity of LRU caches, for example from 16-way to 32-way, may not be a good idea. The
number of IRG values between 16 and 32 is very limited, and there is a considerable
percentage of IRG values as high as one hundred. Such IRG values cannot easily be
accommodated by LRU replacement through increasing the associativity or the size of the cache.

[Figure: bar charts of per set IRG distributions (IRG value buckets 1, 2 to 9, 10~100, 100~1000, 10^3~10^4, 10^4~10^5; y-axis: percent) for the SPEC INT benchmarks crafty, gcc, gzip, and perl and the SPEC FLT benchmarks ammp, apsi, fma3d, and wupwise at the L1 cache]
Figure 3.3: The distributions of per set IRG values of eight SPEC benchmarks.
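The relationship between per set IRGs and LRU hits can be made concrete with a single-set LRU sketch (ours, for illustration): a re-reference hits exactly when fewer than `ways` distinct lines were touched in the set since its last use, so a per-set IRG below the associativity guarantees a hit.

```python
def lru_hits(set_trace, ways):
    """Count hits for the references of one cache set under LRU."""
    stack, hits = [], 0          # stack[0] is least recently used
    for line in set_trace:
        if line in stack:
            hits += 1
            stack.remove(line)   # hit: pull the line out of the stack...
        elif len(stack) >= ways:
            stack.pop(0)         # miss in a full set: evict the LRU line
        stack.append(line)       # ...and reinsert as most recently used
    return hits
```

For instance, with two-way associativity the set trace [1, 2, 1, 2] yields two hits, while [1, 2, 3, 1] yields none, since line 1 is evicted by line 3 before its re-reference.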
3.4 Reference Counts and Property of Short Lifetime
This section presents the results of analyzing the reference counts of programs.
3.4.1 Property of Short Lifetime
The reference count of an address is the number of times the address is referenced in the
lifetime of the program. The analysis of reference counts presented in this work finds that
the distribution of reference counts is heavily long-tailed and that the majority of
addresses have low reference counts. Reference counts tend to cluster around small values,
which has a negative impact on the hit rates of CPU caches using LRU replacement, because
LRU assumes that all addresses are equally likely to be referenced again. The analysis shows
instead that the majority of addresses are unlikely to be re-referenced often, while a very
small portion of addresses are heavily re-referenced.
Figure 3.4 shows the distributions of reference counts of eight SPEC benchmarks; in appendix
A, the distributions of reference counts of all 26 SPEC CPU2000 benchmarks are provided. The
figure demonstrates that a large portion of addresses, 10% to 75%, are referenced only once,
and the majority of addresses, nearly 90%, are referenced fewer than 10 times. We name this
phenomenon the property of short lifetime.
The property of short lifetime is found in all 26 SPEC benchmarks: the majority of addresses
have small reference counts. On the other hand, each of the SPEC benchmark programs has a
small number of addresses that are very intensively referenced. In each of the 26 SPEC
benchmarks there are always some addresses referenced hundreds of thousands of times. Three
SPEC benchmarks, lucas, mesa, and mgrid, have two or three addresses referenced more than a
million times; for each of these benchmarks, such addresses account for more than one tenth
of the memory trace. The distribution of reference counts is heavily long-tailed. Frequently
referenced addresses account for only a small portion, no more than 10%, of all addresses by
count, but they form the majority of the memory references.
3.4.2 Reference Counts of Cache Lines
[Figure: bar charts of per-address reference count distributions (buckets 1, 2 to 9, 10~100, 100~1000, 10^3~10^4, 10^4~10^5, 10^5~10^6; y-axis: percent) for the SPEC INT benchmarks crafty, gcc, gzip, and perl and the SPEC FLT benchmarks ammp, apsi, fma3d, and galgel]
Figure 3.4: The distributions of per address reference counts of eight SPEC benchmarks.

To save space, a CPU cache line typically consists of multiple words; currently, a cache
line often has eight or more words. The contiguous words of a cache line are
loaded into and evicted from the cache as a whole. Even if a word is referenced for the first
time, if the cache line containing the word is already in the cache, it is still a cache hit. Thus
the analysis focuses on calculating reference counts of cache lines. The reference count of
a cache line is the sum of the reference counts of all the words in the cache line.
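This aggregation can be sketched as follows; the line size parameter is an illustrative default matching the 32-byte lines used elsewhere in the chapter:

```python
from collections import Counter

def line_ref_counts(trace, line_bytes=32):
    """Reference count of each cache line: the sum of the reference
    counts of all the words that fall in that line."""
    return Counter(addr // line_bytes for addr in trace)
```

For example, word addresses 0, 4, and 31 all fall in line 0, so that line's reference count is three even though each word is touched only once.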
Figure 3.5 shows the distributions of reference counts of cache lines for eight SPEC
benchmarks; in appendix A, the distributions of cache line reference counts of all 26 SPEC
CPU2000 benchmarks are provided. Compared with the per-address reference counts shown in
Figure 3.4, reference counts of cache lines tend to be a little larger. The proportions of
small reference counts are smaller, but the long-tailed trend is still obvious: 90% of cache
lines have reference counts of less than 100. The property of short lifetime still holds in
the distribution of reference counts of cache lines.
[Figure: bar charts of per-line reference count distributions (buckets 1, 2 to 9, 10~100, 100~1000, 10^3~10^4, 10^4~10^5, 10^5~10^6; y-axis: percent) for the SPEC INT benchmarks crafty, gcc, gzip, and perl and the SPEC FLT benchmarks ammp, apsi, fma3d, and galgel]
Figure 3.5: The distributions of reference counts of cache lines of eight SPEC benchmarks.
3.5 Relationship between Average Reference Counts and
LRU Hit Rates
The study of memory traces done in this work shows that the average reference count is
representative of the temporal locality of a program. Programs with good temporal locality
have high average reference counts, while programs with poor temporal locality tend to have
small average reference counts. Programs with high average reference counts also have higher
hit rates in CPU caches using LRU replacement than programs with lower average reference
counts. Figures 3.6 and 3.7 show the average reference counts of the SPEC integer and
floating point benchmarks and their cache miss rates under LRU replacement. The average
reference counts of the SPEC benchmarks vary from 12 to 539.
Programs with higher average reference counts have higher cache hit rates. The reason for
this correlation is that with high average reference counts there are fewer unique addresses
or cache lines in an IRG gap than with low average reference counts. Section 3.3.2 showed
that the distributions of IRG values are similar across the programs under examination, yet
their average reference counts vary a great deal. A high average reference count implies a
larger coverage of IRG values with the same cache size, and thus a higher CPU cache hit
rate. The average reference count of a program is determined by the nature of its computation.
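The metric itself reduces to one line. This sketch is ours and assumes the average is taken over cache lines (the thesis does not spell out whether it averages over addresses or lines); the line size is an illustrative default:

```python
def average_ref_count(trace, line_bytes=32):
    """Average reference count: total references divided by the number of
    unique cache lines touched. Higher values suggest stronger reuse."""
    unique_lines = {addr // line_bytes for addr in trace}
    return len(trace) / len(unique_lines)
```

For example, a trace of four references touching two distinct lines has an average reference count of 2.0.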
[Figure: bar charts of the average reference counts (0 to 550) and LRU miss rates (0% to 8%) of the SPEC INT benchmarks bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, and vpr]
Figure 3.6: The average reference counts of SPEC integer benchmarks and their miss rates under LRU.
[Figure: bar charts of the average reference counts (0 to 350) and LRU miss rates (0% to 2.25%) of the SPEC FLT benchmarks ammp, applu, apsi, art, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, and wupwise]
Figure 3.7: The average reference counts of SPEC floating point benchmarks and their miss rates under LRU.
3.6 L2 IRG and Reference Count Distributions
This section presents the results of the analysis of reference counts and IRG distributions
for L2 cache memory references.
3.6.1 L2 Reference Count Distributions
As shown in Figure 3.3, the majority of IRG values at the L1 cache are small, and small IRG
values result in L1 cache hits. The L2 cache sees only the first reference to a cache line,
which loads it from main memory into both the L1 and the L2 cache. The hits, if any, of the
cache line in the L1 cache are not visible to the L2 cache. Thus, the number of references
to cache lines seen at the L2 cache is much smaller than at the L1 cache. Figure 3.8 shows
the distributions of reference counts of eight SPEC benchmarks at the L2 cache; the L1 cache
in Figure 3.8 is a two-way associative 8 KB cache. In appendix A, the distributions of L2
reference counts of all 26 SPEC benchmarks are provided.
[Figure: bar charts of L2 reference count distributions (buckets 1, 2 to 9, 10~100, 100~1000, 10^3~10^4; y-axis: percent) for the SPEC INT benchmarks crafty, gcc, gzip, and perl and the SPEC FLT benchmarks ammp, apsi, fma3d, and galgel]
Figure 3.8: The distributions of L2 reference counts of eight SPEC benchmarks.
Figure 3.8 shows that at the L2 cache a higher percentage of addresses have small reference
counts; in particular, the proportion of addresses referenced only once increases
dramatically. The shape of the reference count distribution is similar at the L1 and the L2
caches, but the property of short lifetime is more pronounced at the L2 cache.
3.6.2 L2 IRG Distributions
Figure 3.9 shows the distributions of L2 IRG values of eight SPEC benchmarks. Since the size
and associativity of a CPU cache are always powers of two, the distributions of L2 IRG
values are also presented, in Figure 3.10, on a log2 scale. In appendix A, the distributions
of L2 IRG values of all 26 SPEC benchmarks are provided.
[Figure: bar charts of L2 IRG distributions (buckets 1, 2 to 9, 10~100, 100~1000, 10^3~10^4; y-axis: percent) for the SPEC INT benchmarks crafty, gcc, gzip, and perl and the SPEC FLT benchmarks ammp, apsi, fma3d, and wupwise]
Figure 3.9: The distributions of L2 IRG values of eight SPEC benchmarks.
From the figures it is observed that, at the L2 cache, there are fewer IRG values of size
one; the proportion of IRG values of size one is less than 30% for most of the SPEC
benchmarks. However, at the L2 cache, the majority of IRG values are still small: more than
60% of L2 IRG values are between two and eight.
L2 IRG values that are higher than the associativity of the L2 cache are more likely to
result in cache misses than in cache hits. At the L1 cache, most IRG values are of size one,
which means an address is immediately and repeatedly referenced. Even with a large IRG
value, at the L1 cache there are fewer unique addresses in the gap, and thus the gap is
likely to end in a cache hit. At the L2 cache, however, the percentage of IRG values of size
one is much smaller, and for the same IRG value there are more unique addresses; thus, L2
IRG values larger than the associativity are likely to be cache misses.

[Figure: bar charts of L2 IRG distributions on a log2 scale (buckets 1, 2, 4, 8, ..., 16384; y-axis: percent) for the SPEC INT benchmarks crafty, gcc, gzip, and perl and the SPEC FLT benchmarks ammp, apsi, fma3d, and wupwise]
Figure 3.10: The distributions of L2 IRG values of SPEC benchmarks on a log2 scale.
3.7 Summary
The distributions of per set IRG values clearly show the property of temporal locality:
nearly 90% of per set IRG values are equal to one, and the majority of IRG values are small.
Besides the temporal locality, it is observed that a large portion of addresses are
referenced only once and the majority of addresses are referenced a small number of times.
This phenomenon is named the property of short lifetime. All SPEC benchmarks, both integer
and floating point, exhibit it.
As indicated by the distribution of L1 IRG values, addresses are re-referenced immediately
and repeatedly; the memory references at the L1 cache exhibit strong locality. LRU
replacement has nearly perfect hit rates at the L1 cache, even for small, low-associativity
caches, since even a small cache under LRU covers the short IRGs of almost every address.
The temporal locality is so overwhelming that any cache replacement algorithm, such as FIFO
or Random, achieves more than 90% hit rates [HP96].
Memory references at the L2 cache show the property of short lifetime more clearly.
Even with a very small L1 cache, such as a two-way associative 8KB cache, there is a higher per-
centage of addresses with low reference counts and a lower percentage of small IRG values
at the L2 cache than at the L1 cache. This implies that LRU replacement is inappropriate at the
L2 cache.
Chapter 4
Locality Characteristics of Network
Protocols and Applications
This chapter presents an analysis of the IRG values and reference counts of the memory traces
of network protocols and applications. The results show that network protocols and appli-
cations have different locality characteristics than SPEC benchmarks.
4.1 Motivation
This work began with an investigation of the drop in web server performance under over-
load. The abrupt drop in web server throughput and responsiveness is a commonly
seen phenomenon. Experiments were done that flooded a web server with requests from
a client. The server and the client were connected through a 100Mbps Ethernet. To avoid
disk activity, the client requested the same static web page repeatedly. The web servers,
Apache [1] and thttpd [2], were compiled with gcc [3] without loop-nest optimization. When over-
loading happened, the server's network usage was only half of the link capacity, and there were
no packet losses or page faults. The server's memory image was well below the main mem-
ory size. However, the CPU utilization rate of the server was always below 50% during periods of
overloading. The low CPU utilization rates of web servers, even under overload conditions,
suggest poor CPU cache performance.
[1] http://www.apache.com/
[2] http://www.acme.com/software/thttpd/
[3] http://gcc.gnu.org/
Programs such as the SPEC CPU2000 benchmarks have much higher CPU utilization rates.
SPEC CPU2000 benchmarks are not I/O bound, and they perform more intensive computation
than web servers. SPEC CPU2000 benchmarks more strongly exhibit the property of temporal
locality, which results in higher than 99% CPU cache hit rates. Network protocols and
applications are believed to be of poor temporal locality (see section 2.4). This chapter
applies the locality analysis methods discussed in chapter 3 to network memory traces. The
empirical analysis shows that network protocols and applications have less temporal locality
than SPEC benchmarks and more strongly exhibit the property of short lifetime.
4.2 Memory Traces of Web Servers
One of the major obstacles in studying the CPU cache performance of network protocols
and applications is the lack of suitable memory traces. Network applications run on
multiprogramming platforms, and thus the interference of the operating system on network
performance cannot be ignored. The network memory reference trace must be a full-system
one, including interrupt handlers, OS kernel tasks, and user programs.
A full-system simulator called Simics [MCE+02] is used to generate the network memory
traces. Since the web is the most prevalent network application, a set of 32 memory traces of
web servers was generated. Three factors were used in configuring the generation of web
server traces: server architecture, request rate and web page size. The analysis is based on
two web page sizes, 20KB and 200KB, with request rates from 50 to 150 requests per second
for each page size. Larger file sizes and higher request rates were not used because the
simulator did not have sufficient computing power. The 20KB page size was chosen to reflect
the average size of text-based web pages, and the 200KB page size represents a typical image
file size. Memory traces for mixed request file sizes were also generated. In the mixed
scenarios, two web pages, of size 20KB and 200KB respectively, were requested at rates of
50 to 150 requests per second. Two arbitrarily chosen mixture ratios were used: in one,
one 200KB page is requested for every four 20KB pages, and in the other, one 200KB page is
requested for every nine 20KB pages.
The server architecture refers to the mechanism for handling concurrent connections in web
servers: select()-based or fork()-based. Concurrent connection handling with select()
requires only one process for all connections, but fork() requires one process per connec-
tion. The two architectures are very different: fork()-based web servers have much larger
process images than select()-based servers, and the internal scheduling of the two
architectures differs. We expected that these architectural differences might lead to
different CPU cache behaviors. An example of a web server using select() is thttpd by
Poskanzer [4]; an example of a fork()-based web server is Apache [5], currently the most
widely used web server. Memory traces were generated for each combination of request rate
and web page size for both the thttpd and Apache web servers.
The combinations of web architecture, web page size and request rate number 32 in total, and
32 web server memory traces were generated. Examples of the notation used to denote the
memory traces are 'a20kr50' and 't200kr50': 'a' represents the Apache web server, 't' stands
for the thttpd web server, '20k' refers to the file size, and 'r50' is the request rate.
Table 4.1 lists the names of the memory traces and their configurations.
4.3 Average Reference Counts of Web Server Memory Traces
Table 4.2 shows the average reference counts of the 32 web server memory traces. The av-
erage reference counts vary to some extent, but not as much as those of the SPEC CPU2000
benchmarks (see Section 3.5). Except for one trace, the average reference counts of web
server memory traces are between 20 and 60, whereas the average reference counts of SPEC
CPU2000 benchmark traces range from 12 to 539. Since the average reference count is
correlated with the LRU hit rate of a program, and higher average reference counts mean
higher hit rates (see section 3.5), the average reference counts of web server memory traces
imply that web servers have poorer temporal locality than some SPEC benchmarks and thus
higher miss rates. This is supported by the simulation results (see chapter 8). Table 4.2
also shows that, for both the thttpd and Apache web servers, small request file sizes have
higher average reference counts, and that thttpd traces have higher average reference counts
than Apache traces.
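The average reference count of a trace is simply the total number of references divided by the number of unique cache line addresses touched. A minimal sketch (the 64-byte line size is an assumption for illustration):

```python
from collections import Counter

def avg_reference_count(trace, line_bits=6):
    """Average number of references per unique cache line in a trace."""
    counts = Counter(addr >> line_bits for addr in trace)
    return sum(counts.values()) / len(counts)
```

For instance, a trace touching one line twice and another line once has an average reference count of 1.5.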
[4] http://www.acme.com/software/thttpd/
[5] http://www.apache.org/
Trace name        Server type  File size (KB)             Request rate (r/s)
a20kr50           apache       20                         50
a20kr90           apache       20                         90
a20kr120          apache       20                         120
a20kr150          apache       20                         150
a200kr50          apache       200                        50
a200kr90          apache       200                        90
a200kr120         apache       200                        120
a200kr150         apache       200                        150
mixapr50-ap       apache       mixed (1 200KB : 4 20KB)   50
mixapr90-ap       apache       mixed (1 200KB : 4 20KB)   90
mixapr120-ap      apache       mixed (1 200KB : 4 20KB)   120
mixapr150-ap      apache       mixed (1 200KB : 4 20KB)   150
mix1-9apr50-ap    apache       mixed (1 200KB : 9 20KB)   50
mix1-9apr90-ap    apache       mixed (1 200KB : 9 20KB)   90
mix1-9apr120-ap   apache       mixed (1 200KB : 9 20KB)   120
mix1-9apr150-ap   apache       mixed (1 200KB : 9 20KB)   150
t20kr50           thttpd       20                         50
t20kr90           thttpd       20                         90
t20kr120          thttpd       20                         120
t20kr150          thttpd       20                         150
t200kr50          thttpd       200                        50
t200kr90          thttpd       200                        90
t200kr120         thttpd       200                        120
t200kr150         thttpd       200                        150
mixthr50-th       thttpd       mixed (1 200KB : 4 20KB)   50
mixthr90-th       thttpd       mixed (1 200KB : 4 20KB)   90
mixthr120-th      thttpd       mixed (1 200KB : 4 20KB)   120
mixthr150-th      thttpd       mixed (1 200KB : 4 20KB)   150
mix1-9thr50-th    thttpd       mixed (1 200KB : 9 20KB)   50
mix1-9thr90-th    thttpd       mixed (1 200KB : 9 20KB)   90
mix1-9thr120-th   thttpd       mixed (1 200KB : 9 20KB)   120
mix1-9thr150-th   thttpd       mixed (1 200KB : 9 20KB)   150
Table 4.1: Names of network traces and their configurations.
4.4 Reference Count Distributions of Web Server Memory
Traces
Figure 4.1 shows the distributions of per-address reference counts of four web server memory
traces. The distributions of all 32 web server memory traces can be found in appendix A. The
distributions show that the property of short lifetime is obvious in web server memory
traces. The distributions of reference counts of web server memory traces do not differ much
from those of the SPEC benchmark traces.
Trace            Avg refcount    Trace            Avg refcount
a20kr50          61              t20kr50          150
a20kr90          40              t20kr90          106
a20kr120         41              t20kr120         93
a20kr150         51              t20kr150         80
a200kr50         294             t200kr50         21
a200kr90         22              t200kr90         21
a200kr120        28              t200kr120        18
a200kr150        23              t200kr150        20
mixapr50         37              mixthr50         51
mixapr90         37              mixthr90         36
mixapr120        27              mixthr120        29
mixapr150        35              mixthr150        31
mix1-9-apr50     32              mix1-9-thr50     56
mix1-9-apr90     21              mix1-9-thr90     44
mix1-9-apr120    32              mix1-9-thr120    37
mix1-9-apr150    37              mix1-9-thr150    31
Table 4.2: Average reference counts of network traces.
Figure 4.2 shows the distributions of reference counts of cache lines for the same four web
server memory traces as Figure 4.1. For more precision, the distributions are on log2 scales.
The figure shows that different configurations of web server type, request rate and web page
size exhibit great differences in the distributions of reference counts.
4.5 L2 Distributions of Reference Counts of Web Server
Memory Traces
Figure 4.3 shows the distributions of reference counts of cache lines of four web server mem-
ory traces at the L2 cache. Compared with the L2 reference count distributions of SPEC
benchmarks, web server memory traces have a lower percentage of cache lines with low
reference counts. This should not be interpreted as meaning that web server memory traces
have better locality than SPEC benchmarks: the property of short lifetime is still very
obvious in their reference count distributions. Given that web server memory traces have
low average reference counts, even with smaller portions of cache lines with small reference
counts, web server memory traces are still of poor temporal locality.
[Figure: distributions of per-address reference counts at the L1 cache for traces A20kr90, A200kr90, T20kr90 and T200kr90.]
Figure 4.1: The distributions of per-address reference counts of four web server memory traces.
4.6 L2 IRG Distributions of Web Server Memory Traces
The poor temporal locality of web server memory traces can also be seen by examining the
distributions of L2 IRG values. Figure 4.4 depicts the L2 IRG values of four web server
memory traces. Distributions for the other web server memory traces, and the distributions
of IRG values at the L1 cache, can be found in appendix A. Compared with the IRG
distributions of SPEC benchmarks, the L2 IRG values of web server memory traces are
noticeably larger. Assuming a 16-way set associative cache, the percentages of IRG values
larger than the associativity, 16, are much larger in the web server memory traces than in
the SPEC benchmarks.
Tables 4.3 and 4.4 present, for SPEC benchmarks and web server memory traces respectively,
the percentages of IRG values smaller than 16 and the percentages of IRG values greater than
or equal to 256. Except for a200kr50, web server memory traces generally have 60% to 70% of
IRG values below 16. In comparison, SPEC benchmarks typically have more than 85% of IRG
values below 16, except mcf, which has only about 23%. Assuming a 16-way associative L2
cache using LRU replacement, IRG values under 16 are guaranteed to result in cache hits. Web
server memory traces have fewer IRG values below the associativity of the CPU cache, and
thus fewer guaranteed cache hits. Even worse, due to the low average reference counts, the
large IRG values in web server memory traces are more likely to result in cache misses than
those of SPEC benchmarks with very high average reference counts.
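Given an IRG histogram, the two statistics tabulated below — the share of IRG values under the associativity (guaranteed hits under 16-way LRU) and the share of very large IRG values — can be derived as follows. This is a sketch; the histogram format, a mapping from IRG value to count, is an assumption:

```python
def irg_shares(hist, assoc=16, large=256):
    """Fractions of IRG values below `assoc` and at or above `large`."""
    total = sum(hist.values())
    below = sum(c for v, c in hist.items() if v < assoc)   # guaranteed LRU hits
    above = sum(c for v, c in hist.items() if v >= large)  # near-certain misses
    return below / total, above / total
```

For example, a histogram with six IRGs of one, two of 20 and two of 300 yields shares of 0.6 and 0.2.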
[Figure: distributions of L1 per-line reference counts, on a log2 x-axis, for traces A20kr90, A200kr90, T20kr90 and T200kr90.]
Figure 4.2: The distributions of reference counts of cache lines of four web server memory traces.
Meanwhile, web server memory traces have more IRG values above 256 than SPEC bench-
mark traces do. Among the web server memory traces, the traces with the small, 20KB, web
page size, for both Apache and thttpd, have less than 10% of IRG values larger than 256. Web
server memory trace a200kr50 has the second smallest percentage of IRG values larger than
256, 1.75%. It also has by far the largest percentage of IRG values below 16, 79%. This is
reflected in a very good cache hit rate for trace a200kr50.
4.7 Summary
Memory traces of web servers are different from the memory traces of SPEC CPU2000
benchmarks. Web server memory traces have lower percentages of small IRG values and
higher percentages of large IRG values. The average reference counts of web server memory
traces are smaller than those of the SPEC benchmarks with good locality. Since the average
reference count of a program is representative of the program's temporal locality, this
suggests that web servers have poorer temporal locality than the SPEC CPU2000 benchmarks.
[Figure: distributions of L2 per-line reference counts for traces A20kr90, A200kr90, T20kr90 and T200kr90.]
Figure 4.3: The distributions of reference counts of cache lines of four web server memory traces at the L2 cache.
[Figure: distributions of L2 IRG values for traces A20kr90, A200kr90, T20kr90 and T200kr90.]
Figure 4.4: The distributions of IRG values of four web server memory traces at the L2 cache.
% of IRG values < 16:
  ammp 93.11%    applu 97.37%    apsi 94.02%    art 70.83%     bzip 95.17%    crafty 89.24%   eon 96.53%
  equake 89.66%  facerec 96.86%  fma3d 97.31%   galgel 90.40%  gap 93.02%     gcc 90.80%      gzip 71.05%
  lucas 92.69%   mcf 23.12%      mesa 96.77%    mgrid 97.65%   parser 75.83%  perl 84.58%     sixtrack 89.61%
  swim 74.35%    twolf 70.05%    vortex 85.90%  vpr 56.40%     wupwise 94.75%

% of IRG values ≥ 256:
  ammp 0.22%     applu 0.25%     apsi 0.64%     art 16.40%     bzip 0.03%     crafty 1.21%    eon 0.58%
  equake 0.59%   facerec 0.11%   fma3d 0.30%    galgel 0.08%   gap 0.70%      gcc 0.64%       gzip 1.33%
  lucas 0.17%    mcf 58.91%      mesa 0.15%     mgrid 0.39%    parser 0.49%   perl 1.11%      sixtrack 5.99%
  swim 0.41%     twolf 8.98%     vortex 1.99%   vpr 11.30%     wupwise 0.33%
Table 4.3: Percentages of IRG values < 16 and percentages of IRG values ≥ 256 of SPEC benchmarks.
% of IRG values < 16:
  a20kr50 60.81%         a20kr90 59.76%         a20kr120 60.62%         a20kr150 61.94%
  a200kr50 79.40%        a200kr90 65.35%        a200kr120 66.47%        a200kr150 65.31%
  mixapr50-ap 63.16%     mixapr90-ap 62.48%     mixapr120-ap 62.52%     mixapr150-ap 62.20%
  mix1-9apr50-ap 62.85%  mix1-9apr90-ap 59.88%  mix1-9apr120-ap 59.69%  mix1-9apr150-ap 60.85%
  t20kr50 63.23%         t20kr90 59.82%         t20kr120 61.59%         t20kr150 61.83%
  t200kr50 66.87%        t200kr90 69.00%        t200kr120 68.41%        t200kr150 68.69%
  mixthr50-th 63.68%     mixthr90-th 62.82%     mixthr120-th 64.52%     mixthr150-th 63.46%
  mix1-9thr50-th 63.84%  mix1-9thr90-th 62.72%  mix1-9thr120-th 61.81%  mix1-9thr150-th 65.21%

% of IRG values ≥ 256:
  a20kr50 3.51%          a20kr90 7.60%          a20kr120 6.87%          a20kr150 7.39%
  a200kr50 1.75%         a200kr90 13.97%        a200kr120 11.98%        a200kr150 15.01%
  mixapr50-ap 12.43%     mixapr90-ap 12.04%     mixapr120-ap 11.10%     mixapr150-ap 11.34%
  mix1-9apr50-ap 8.09%   mix1-9apr90-ap 11.37%  mix1-9apr120-ap 11.51%  mix1-9apr150-ap 9.41%
  t20kr50 1.21%          t20kr90 2.72%          t20kr120 4.07%          t20kr150 5.07%
  t200kr50 15.95%        t200kr90 11.65%        t200kr120 10.71%        t200kr150 12.00%
  mixthr50-th 9.12%      mixthr90-th 11.57%     mixthr120-th 12.28%     mixthr150-th 11.78%
  mix1-9thr50-th 6.14%   mix1-9thr90-th 8.46%   mix1-9thr120-th 9.57%   mix1-9thr150-th 8.73%
Table 4.4: Percentages of IRG values < 16 and percentages of IRG values ≥ 256 of network traces.
Chapter 5
WLRU Cache Replacement
The studies presented in Chapters 3 and 4 suggest that good temporal locality is exhibited
by relatively few addresses. The property of short lifetime shows that a large number of
addresses will never be re-referenced. The LRU replacement algorithm does not distinguish
between the heavily referenced addresses and the infrequently referenced addresses. This
chapter describes a new cache replacement algorithm called WLRU, a modification of LRU
that differentiates between addresses.
5.1 Correlation of IRG and Reference Counts
Addresses have different cache values. The cache value of an address is its likelihood of
being re-referenced. A good replacement algorithm always tries to keep in the cache the
addresses that are most likely to be referenced again. Addresses that are referenced only
once have no cache value. Generally, addresses with higher reference counts have higher
cache value than addresses with lower reference counts. However, IRG values must also be
taken into consideration: of two addresses that will both be referenced again, the one that
will be referenced sooner has the higher cache value.
The off-line optimal cache replacement looks forward in the memory trace and replaces the
cache line whose next reference is farthest in the future [ADU71]. Details of the off-line
optimal cache replacement can be found in Section 7.5. In terms of IRG values, the off-line
optimal replacement evicts the address with the largest IRG value. The IRG values of an
address are correlated with its reference count: addresses with higher reference counts have
more small IRG values, and large IRG gaps are more likely to be associated with addresses
with low reference counts. Table 5.1 shows the IRG values of addresses that map to cache
set 0 of SPEC benchmark crafty, and Table 5.2 shows the IRG values of addresses that map to
cache set 0 of network trace a20kr50. All IRG values are at the L2 cache, and the L1 is a
two-way set associative 8KB cache. Due to space limitations, only a portion of the addresses
are shown in Tables 5.1 and 5.2.
address  refcount   1    2    <4   <8   <16  <32  <64  <128 <256 <512
6f0e00   279        8    40   56   108  65   1    0    0    0    0
4e3900   275        5    102  93   41   18   12   2    1    0    0
4f2700   262        6    81   91   50   21   9    2    1    0    0
......
44ed00   2          0    0    0    0    0    0    0    0    1    0
44ef00   2          0    0    0    0    0    0    0    0    1    0
47b300   2          0    0    0    0    0    1    0    0    0    0
7aa800   2          0    0    0    0    0    0    0    0    0    1
7aa900   2          0    0    0    0    0    0    0    0    0    1
7aaa00   2          0    0    0    0    0    0    0    0    0    1
Table 5.1: The IRG values of address tags mapping to set 0 of SPEC benchmark crafty.
address  refcount   1    2    <4   <8   <16  <32  <64  <128 <256 <512
11b00    854        0    170  203  346  90   0    0    43   0    0
10e00    850        0    48   151  340  253  9    47   1    0    0
602100   383        46   53   164  62   2    0    16   10   28   1
60d600   363        47   43   142  69   7    0    16   9    28   1
......
608500   4          0    0    0    0    0    0    0    0    3    0
609400   4          0    0    0    0    0    0    0    0    3    0
68fb00   4          0    0    0    0    0    0    0    1    2    0
5fee00   3          0    0    0    0    0    0    0    0    2    0
5ff300   3          0    0    0    0    1    0    0    0    1    0
Table 5.2: The IRG values of address tags mapping to set 0 of network trace a20kr50.
Tables 5.1 and 5.2 show that the IRG values of addresses with low reference counts are large,
and that small IRG values are bound to addresses with high reference counts. We gathered the
distributions of IRG values at the L2 cache for the addresses of all SPEC CPU2000 benchmarks
(these distributions can be found in Appendix A). It was found that, except for one program,
swim (see Section 8.3 and Table 8.1), the reference count of an address and its IRG values
are correlated: large IRG values tend to be associated with addresses with low reference
counts, and small IRG values with addresses with high reference counts. This is the basis of
WLRU replacement. Addresses of good cache value show themselves quickly: if an address is
not re-referenced within a short time after it is brought into the cache, it may never be
re-referenced, but if an address is hit quickly after being brought into the cache, it is
likely to be hit again and again. WLRU judges the cache value of an address by the number of
hits it receives immediately after being brought into the cache. If an address is not hit
within a short time after being brought into the cache, it can be evicted quickly.
5.2 Problems with LRU and LFU
LRU does not differentiate between the two kinds of references: a hit in the cache and the
initial reference to an address are treated in the same way. The initial reference to an
address is the reference that first brings the address into the cache. Initial references and
hits have different cache values. The correlation of reference counts and IRG values suggests
that hits occurring shortly after initial references are likely to represent addresses with
high reference counts. LRU keeps addresses that are never hit for too long a period of time.
If the cache size is limited, high cache value addresses are likely to be flushed out under
LRU by a large number of addresses with low reference counts or large IRG gaps.
The Least Frequently Used (LFU) replacement and its variants differentiate between addresses
with different cache values [EH84]. Under LFU, addresses that are not hit immediately after
being brought into the cache are replaced quickly. However, LFU tends to keep addresses of
good cache value in the cache too long. If some addresses are referenced thousands of times
within a certain period only and never again, LFU will fix these addresses in the cache for
a long time, which results in poor hit rates. To address this problem, variants of LFU have
aging mechanisms that periodically reduce the reference counts of addresses. Even with aging,
LFU still cannot adapt quickly enough to the phase changes of programs. During a phase
change, a large number of new addresses replace the old ones, and it takes LFU too long to
evict the formerly good addresses. LFU and its variants are never used in CPU caches.
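To make the contrast concrete, here is a toy single-set LFU with a periodic aging step. This is an illustrative sketch, not a hardware design; the halve-every-N-references aging policy and the tie-breaking by insertion order are assumptions:

```python
def lfu_aging_set(refs, ways=4, age_every=8):
    """Single-set LFU: evict the line with the smallest reference count;
    periodically halve all counts so formerly hot lines can eventually leave."""
    counts = {}          # address -> reference count (insertion-ordered dict)
    replaced = []
    for t, addr in enumerate(refs, 1):
        if addr in counts:
            counts[addr] += 1
        else:
            if len(counts) == ways:
                victim = min(counts, key=counts.get)   # ties: oldest entry
                del counts[victim]
                replaced.append(victim)
            counts[addr] = 1
        if t % age_every == 0:                         # aging step
            counts = {a: c // 2 for a, c in counts.items()}
    return replaced
```

With a very long aging period this shows exactly the pathology described above: a line that was hot early keeps its large count and pins itself in the set.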
56
5.3 WLRU Cache Replacement
WLRU addresses the limitations of LRU and LFU by limiting the cache stay time of both
infrequently hit and frequently hit addresses. WLRU uses weights to achieve this. Weights in
WLRU are integers representing the replacement priority of cache lines: when replacing, the
cache line with the minimal weight is chosen. Weights change based on the reference history
of cache lines.
Weights have also been used in the LFU cache replacement and its variants. The reference
count of an address in LFU behaves as a weight: when choosing an entry to replace, the en-
try with the minimum reference count is chosen. The difference between WLRU and LFU
is in the calculation of weights. Weights in WLRU have clearly defined upper limits, but
reference counts in LFU do not. LFU may have some limit on the largest possible reference
count, but that is only a consequence of the physical constraint on the storage of the
counts. The upper limit of weights in WLRU prevents cache lines from being fixed in the
cache.
The following equation shows how the weight of a cache line changes as a function of its
previous weight and the hit/miss status of the current reference. There are three
configurable parameters in WLRU: the increment of weights, i, the upper limit of weights, r,
and the initial weight, b. In the equation, wt is the weight of a cache line at the t-th
reference to the cache set, and wt+1 is the weight at the (t+1)-th reference. The equation
reflects that if a cache line is hit, its weight increases until reaching the upper limit,
and that on every reference to a cache set, each cache line that is not hit has its weight
decremented by one until the weight reaches zero.
wt+1 =  b         if referenced for the first time,
        wt + i    if hit and wt + i ≤ r,
        r         if hit and wt + i > r,
        wt − 1    if not hit and wt > 1,
        0         if not hit and wt ≤ 1.                  (5.0)
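Equation 5.0 translates directly into a small update function. A sketch; the parameter defaults follow the i64r128b2 example used later in section 5.4:

```python
def next_weight(w, hit, first=False, i=64, r=128, b=2):
    """Weight of one cache line at the next reference to its set (Eq. 5.0)."""
    if first:                 # initial reference: start at the initial weight b
        return b
    if hit:                   # hit: add the increment i, saturating at r
        return min(w + i, r)
    return max(w - 1, 0)      # not hit: decay by one, floored at zero
```

For example, a line at weight 100 that is hit saturates at 128, while a line at weight 1 that misses one more set reference decays to 0.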
In Equation 5.0, the weight of a cache line is always decremented by one when it is not hit.
It is possible to decrement weights by more than one, but doing so would require a larger
weight increment and a larger upper limit of weights, which would increase the complexity
of the circuits.
As a general rule, programs that exhibit poor temporal locality have a large portion of ad-
dresses that will not be referenced again. For these programs, the initial weight should be
small, and the increment and upper limit of weights should be large compared with the
initial weight. With small initial weights, cache lines that will not be referenced again are
replaced more quickly than under LRU, while valuable cache contents are kept longer by the
large weight increment and the large upper limit of weights.
The upper limit of weights controls how long a previously frequently used address stays in
the cache without being hit again. Due to phase changes in programs, where the set of
frequently referenced addresses changes, the upper limit should not be set too large compared
with the increment of weights, or some addresses may become fixed in the CPU cache. The
upper limit of weights in WLRU should be no more than several times the weight increment, so
that during a phase change, new heavily referenced addresses can easily replace the old ones.
For programs of poor temporal locality, a low initial weight purges addresses that are not
referenced again faster. Initial weights, although small, are necessary: a small initial
weight, as low as two or four, is enough for addresses of good cache value to show
themselves. The exact parameter settings of WLRU also depend on the size and associativity
of the CPU cache.
5.4 Notation Used to Represent WLRU Parameter Settings
There are three configurable parameters in WLRU. Notation of the form i64r128b2 is used to
represent a setting of these parameters; such notations are called weight formulas. The i in
i64r128b2 stands for the increment of weights; in the example, the weight is increased by 64
when the cache line is hit. The r represents the upper limit, here set at 128. The b stands
for the initial weight, here set at 2. Thus, under weight formula i64r128b2, when an address
is first loaded into the cache, its weight is two. Every time the cache line is hit, its
weight increases by 64 until it reaches 128. On every reference where the cache line is not
hit, its weight is decremented by one until it reaches zero.
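Weight formulas such as i64r128b2 are easy to parse mechanically. A small sketch; the function name and the dictionary return format are illustrative choices:

```python
import re

def parse_weight_formula(formula):
    """Split a weight formula like 'i64r128b2' into its three parameters."""
    m = re.fullmatch(r"i(\d+)r(\d+)b(\d+)", formula)
    if m is None:
        raise ValueError(f"not a weight formula: {formula!r}")
    i, r, b = map(int, m.groups())
    return {"i": i, "r": r, "b": b}
```

For example, parse_weight_formula("i64r128b2") yields increment 64, upper limit 128 and initial weight 2.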
Figure 5.1 compares the replacement decisions of WLRU and LRU on an example reference
string. The reference string exhibits the property of short lifetime: a few addresses, A and
B, are referenced very often, while many other addresses are referenced only once or twice.
The weight formula used in Figure 5.1 is i6r8b1.
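The behavior shown in Figure 5.1 can be reproduced with a short single-set simulation. The sketch below is an illustration; the tie-breaking rule, evicting the line that has been at the bottom of the weight-ordered stack longest, is inferred from the figure rather than stated in the text, and replacements are counted the way the figure counts misses, i.e. excluding cold fills of empty ways:

```python
def wlru_set(refs, ways=4, i=6, r=8, b=1):
    """Single-set WLRU with weight formula parameters (i, r, b).
    Returns the sequence of replaced (evicted) addresses."""
    lines = []                                  # [addr, weight], weight-ordered
    replaced = []
    for addr in refs:
        if any(a == addr for a, _ in lines):    # hit: boost it, others decay
            lines = [[a, min(w + i, r) if a == addr else max(w - 1, 0)]
                     for a, w in lines]
        else:                                   # miss
            if len(lines) == ways:
                replaced.append(lines.pop()[0]) # evict minimal-weight line
            lines = [[a, max(w - 1, 0)] for a, w in lines]
            lines.append([addr, b])             # insert with initial weight b
        lines.sort(key=lambda e: -e[1])         # stable: ties keep their order
    return replaced

def lru_set(refs, ways=4):
    """Single-set LRU, for comparison; returns replaced addresses."""
    lines, replaced = [], []                    # lines[0] is the MRU entry
    for addr in refs:
        if addr in lines:
            lines.remove(addr)
        elif len(lines) == ways:
            replaced.append(lines.pop())        # evict the LRU entry
        lines.insert(0, addr)
    return replaced
```

On the reference string of Figure 5.1, lru_set produces 14 replacements and wlru_set with i6r8b1 produces 10, matching the miss counts quoted in the figure.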
5.5 WLRU Mimicking LRU
WLRU is very versatile: it can be configured to behave as LFU, FIFO (First In First Out), or
LRU. When the upper limit of weights is very large, the initial weight and the increment of
weights are one, and weights never decrease, WLRU becomes LFU. When the initial weight equals
the upper limit of weights, both are large, and the weight increment is zero or very small,
WLRU is a FIFO replacement. When the upper limit, the initial weight and the increment of
weights are all the same and large enough, WLRU behaves exactly as LRU replacement. Ta-
ble 5.3 compares the total cache misses of a set of WLRU weight formulas with the LRU
replacement. Weight formula i512r512b512 has exactly the same number of cache misses as LRU,
and i256r256b256 is nearly identical.
128KB cache:
        i512r512b512  i256r256b256  i128r128b128  i64r64b64  i32r32b32  i16r16b16
LRU     211743        211743        211743        211743     211743     211743
WLRU    211743        211743        211742        211748     211728     203653

256KB cache:
        i512r512b512  i256r256b256  i128r128b128  i64r64b64  i32r32b32  i16r16b16
LRU     120817        120817        120817        120817     120817     120817
WLRU    120817        120820        120800        120613     117482     110028

Table 5.3: Comparison of total cache misses of LRU and weight formulas mimicking LRU.
5.6 Comparison of WLRU with Other Cache Replacement
Algorithms
The difference between WLRU and LRU is that WLRU discriminates against addresses with low
reference counts, especially addresses that are referenced only once. Other cache
replacement algorithms discriminate against low reference count addresses as a side effect;
LRU-k [OOW93] and LIRS [JZ02] are such examples. In LRU-k, the most recent reference
[Figure content: on the reference string A B C A B D H E F A C G B H A B K A C J L B, one cache set under LRU replaces C A B D H E F A C G H B K A (14 misses), while under WLRU with weight formula i6r8b1 it replaces C D H E F C G H K C (10 misses).]
Figure 5.1: Comparison of the replacement decision of WLRU and LRU.
to a cache line is not used to make the replacement decision; instead, the k-th backward
reference is used. If a cache line has not been referenced k times, its priority is minimal.
A side effect of this arrangement is that addresses that are never hit in the cache are
always discriminated against: they always have the lowest priority and are frequently
replaced. In LIRS, the time span between the two most recent references to the same cache
line represents the replacement priority of the cache line. If a cache line is never hit in
the cache, it has the lowest possible priority; LIRS works against cache lines with no hit
history in the cache, and such cache lines are evicted fast.
The problem with LRU-k and LIRS is that these replacement algorithms are too expensive to
implement in CPU caches. LRU-k needs to store the reference history up to the last k
references and to compare the order of the last k references of cache lines. LIRS stores the
last two references and maintains two ordered queues. Assuming that storing one reference
history entry requires eight bits, LRU-2 and LIRS need to store two entries, requiring at
least 16 bits per cache line, twice the space of WLRU. Another difference between LRU-k and
LIRS and WLRU is that, although LRU-k and LIRS can evict addresses that are not hit in the
cache faster than LRU can, they do not treat frequently referenced addresses as accurately
as WLRU. For LRU-2, cache lines hit three times have no advantage over cache lines hit
twice, and for LIRS, the number of times a cache line is hit does not matter once it is more
than one. This makes LRU-k and LIRS worse than WLRU.
The Least Recently/Frequently Used (LRFU) replacement algorithm [LCK+01] is also a replacement algorithm using weights. LRFU was proposed for database buffer caches. LRFU assigns a value called the CRF (Combined Recency and Frequency) to each cache block, and when replacing a block, the block with the minimal CRF value is chosen (see section 2.3). The calculation of CRF values involves multiplication of real numbers, and floating point calculations are much more expensive than integer calculations: an integer ALU requires around 600 to 700 transistors1, but a floating point unit requires 40K transistors2. The CRF value of a block needs to be recalculated for each reference in the history. In comparison, WLRU involves only a simple addition or a subtraction by one, calculated once per reference. LRFU requires storage of the whole reference history, while WLRU does not store any references. Although LRFU is proposed to subsume both LRU and LFU, it is based on totally different observations and theories than WLRU.
1 http://ieeexplore.ieee.org/iel5/8641/27381/01217869.pdf
2 http://www.intel.com/standards/floatingpoint.pdf
WLRU differs from the replacement algorithms that predict dead and alive cache lines [HKM02, LFF01, KS05] in that WLRU does not predict whether a cache line is dead or alive. WLRU predicts whether the distance to the next reference of a cache line is large or short. WLRU tries to evict addresses with large IRG gaps, but these addresses are not necessarily dead. In fact, a considerable portion of the victim cache lines of WLRU are still alive. Dead cache lines are good choices for replacement, but when the size of the cache is small compared with the size of the working set of the workload, the probability of finding a dead line also decreases. In this scenario, the best choice is the line with the largest IRG gap.
AIP/LvP [KS05] is a CPU cache replacement algorithm that predicts dead cache lines. AIP/LvP tries to evict a cache line referenced more than a specified threshold faster than LRU. When the L2 cache is small, for example 128KB, the time WLRU keeps a dead cache line is within a small constant of the optimal replacement algorithm. The maximum time WLRU keeps dead cache lines is equal to the upper limit of weights. In this case, WLRU can be viewed as a simplified version of AIP/LvP. AIP/LvP adapts the keep time of dead lines; WLRU uses a fixed setting and thus has lower costs.
5.7 Summary
The property of short lifetimes shows that only a small portion of addresses are worth being kept in the cache. However, LRU does not distinguish between addresses with high reference counts and addresses with low reference counts. LRU always keeps the just-referenced addresses as long as possible. For programs with poor locality, where a large number of addresses will not be referenced again, LRU is not optimal. WLRU addresses this shortcoming of LRU by differentiating addresses.
The IRG values of an address are found to be correlated with the reference count of the address. Addresses with high reference counts tend to have small IRG values, and large IRG values are likely to be associated with addresses of low reference counts. This finding is exploited by WLRU to identify addresses of high cache value. In WLRU, if an address is not hit within a short time after it is brought into the cache, the address is evicted quickly. For programs with many addresses that are never re-referenced, WLRU can keep the valuable cache contents longer than LRU.
WLRU uses integers called weights to denote the cache replacement priority of cache lines. When replacing, the cache line with the minimal weight is chosen. Weights change according to the reference history of cache lines. By controlling how weights change, WLRU can suit a wide range of localities and mimic other cache replacement algorithms such as LRU, LFU and FIFO. However, WLRU is intended mainly for programs with poor locality. Guidelines on how to set the WLRU parameters are provided.
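The policy summarized above can be sketched as a small behavioural model. This is a sketch under assumptions, not the hardware described in the next chapter: the parameter names i (increment on a hit), b (initial weight) and limit (upper limit of weights) follow the weight formula of this chapter, and a Python dict stands in for one cache set.

```python
# Behavioural sketch of one WLRU cache set (illustrative, not the hardware).
# i = weight increment on a hit, b = initial weight of a new line,
# limit = upper limit of weights (fixed by the weight storage width).

class WLRUSet:
    def __init__(self, ways=8, i=64, b=1, limit=255):
        self.ways, self.i, self.b, self.limit = ways, i, b, limit
        self.lines = {}  # tag -> weight

    def reference(self, tag):
        hit = tag in self.lines
        # Every reference deducts one from the weight of all other lines.
        for t in self.lines:
            if t != tag:
                self.lines[t] = max(0, self.lines[t] - 1)
        if hit:
            # Hit: increase the weight, saturating at the upper limit.
            self.lines[tag] = min(self.limit, self.lines[tag] + self.i)
        else:
            if len(self.lines) >= self.ways:
                # Evict the line with the minimal weight.
                victim = min(self.lines, key=self.lines.get)
                del self.lines[victim]
            self.lines[tag] = self.b  # new line starts at the initial weight
        return hit
```

An address that is never re-referenced stays near its low initial weight b and is evicted quickly, while a frequently hit address accumulates weight and survives, which is the behaviour this chapter describes.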
Chapter 6
Hardware Implementations of WLRU
In this chapter, an example implementation of WLRU replacement in a CPU cache is presented.
6.1 Space Requirements of WLRU
Besides space for the data, the address tag and the status bits of a cache line, each cache line needs circuitry to implement the replacement. LRU needs space to store the LRU order, and WLRU needs space to store the weight. Figure 6.1 shows an example storage arrangement of a cache set using WLRU replacement. Eight bits are used by each cache line to store the weight, which fixes the upper limit of the weight at 255. Simulations in Chapter 8 show that an upper limit of weights of 256 gives good hit rates for both SPEC CPU2000 benchmarks and web server memory traces. This example implementation of the circuits uses this upper limit only. Besides the space used by each cache line to store the weight, a CPU cache using WLRU replacement needs two registers to store the increment of the weight on a hit and the initial weight.
6.2 Overall Structure of WLRU CPU Cache
Figure 6.2 is the overall structure of a CPU cache implementing the WLRU replacement
algorithm. The WLRU CPU cache in the figure is an L1 cache. Without any internal changes, the WLRU CPU cache can also be used as an L2 cache.

[Figure 6.1: Storage arrangement of an eight-way associative cache set using WLRU replacement. Each line (line0 to line7) stores VALID <1>, DIRTY <1>, TAG <19>, WEIGHT <8> and DATA <256> bits.]

The CPU cache in Figure 6.2 is an eight-
way set associative cache, 256KB in size, using 32 byte cache lines. The cache lines are organized in 1024 cache sets, and each cache set has eight cache lines, identified as line0 to line7. The storage of the CPU cache is organized into eight ways, identified as way0 to way7. A way is the collective storage for the cache lines of the same number from all cache sets. Each way has five RAM memory arrays: the tag RAM, data RAM, weight RAM, dirty status RAM and valid status RAM. These RAM arrays provide storage for the tag, weight, data, dirty status and valid status parts of the cache lines. The dirty status bit is set when the cache line has been written. The valid status bit indicates that the cache line contains valid cache content, i.e., it is not empty and has not been invalidated explicitly.
Figure 6.3 shows the weight RAM, the data RAM, the tag RAM, the dirty status RAM and the valid status RAM of a way. Each RAM array has 1024 storage cells. A 10 bit wide address line is used by the RAM arrays to index a single storage cell among the 1024 cells. The number of storage cells in the RAM arrays corresponds to the number of cache sets. The weight RAM cells are eight bits wide, the tag RAM cells are 19 bits wide, and the dirty and valid status RAM cells are one bit wide. It is also possible to merge the dirty and valid status RAM arrays into a single RAM array with two bit wide cells. The data RAM supports two operation modes: line mode and word mode. In line mode, a 256 bit cache line is read or written; in word mode, a 32 bit word is accessed. The data RAM has a 13 bit wide address line, of which only the highest 10 bits are used in line mode. Cells in the data RAM are 256 bits wide in line mode and 32 bits wide otherwise.
[Figure 6.2: The structure of a CPU cache using WLRU replacement. The cache connects to the CPU and MMU over the internal bus and comprises, per way (associative0/line0 to associative7/line7), the Tag, Data, Weight, Valid Status and Dirty Status RAMs, together with the Address Latch, Cast-out Buffer, Cache Load Queue, Weight Increment Register, Initial Weight Register, and the hit/miss, read/write, weight control, replacement and line-fill/cast-out logic.]
[Figure 6.3: The RAM memory arrays used in one associative way of the WLRU CPU cache: Tag RAM (1024×19), Data RAM (1024×256, with 256/32 bit line/word access over a 10/13 bit address and a W/L mode signal), Weight RAM (1024×8), Dirty Status RAM (1024×1) and Valid Status RAM (1024×1), each with RD/WR, ADDR, CLOCK, DATA IN and DATA OUT signals.]
The WLRU CPU cache in Figure 6.2 supports cache lookup and line-fill operations. Cache lookups are initiated by the CPU when the CPU is reading or writing an address. Line-fills refer to the loading of cache lines from the main memory and are initiated by the Memory Management Unit (MMU) when the MMU has finished a main memory read. In cache lookup operations, the address of the word the CPU is currently referencing is stored in the Address Latch. The set index part of the address, bit 3 to bit 12, is used by the RAM arrays to index a single cache set. The eight tag RAM arrays are read and compared in parallel with the tag part of the Address Latch. If a match is found, it is a cache hit; otherwise it is a cache miss. A hit/miss signal is sent back to the CPU indicating the result of the comparison.
In the case of a cache hit, a 32 bit word is read from or written to the data RAM cell of the matching way. If it is a cache miss, the CPU sends the memory request to the MMU, and the MMU retrieves the data from the main memory, which incurs long delays. Cache lookup operations cause the weights of cache lines to change based on the hit or miss results of the lookups.
A line-fill operation is the result of the MMU visiting the main memory. A cache line, 256 bits in size, is read from the main memory and stored in the Cache Load Queue of the MMU. The MMU then initiates a line-fill operation, which replaces the cache line chosen by the replacement algorithm with the new cache line stored in the Cache Load Queue. If the replaced cache line is valid and marked dirty, there is a cast-out operation that first transfers the contents of the replaced cache line into the Cast-out Buffer and then writes the contents of the Cast-out Buffer to the main memory. If the replaced cache line is not valid or its dirty bit is not set, the selected cache line is simply overwritten. The address of the cache line containing the word the CPU is referencing is stored in the Address Latch. The highest 19 bits of the address are stored in the tag RAM array, and the 256 bit data is stored in the data RAM. The weight RAM and the status RAMs of the cache line are also initialized.
The WLRU cache has two software configurable registers: the weight increment register and the initial weight register. The value stored in the weight increment register is the value of i (see the weight formula in chapter 5), and the initial weight register stores the value of b. The upper limit of weights is implicitly implemented by the size of the weight storage. In this example, eight bits per cache line are used to store the weight, and accordingly the upper limit of weights is fixed at 255.
Figure 6.4 shows the data path, the address path and the control signals of the WLRU CPU cache. These include a bidirectional 256 bit wide data path, a bidirectional 32 bit wide address path, a read/write (RD/WR) signal, a lookup/line-fill signal, a hit/miss signal and a cast-out signal. The lookup/line-fill signal distinguishes the two types of cache operations: cache lookups and cache line-fills. In cache lookup operations, the RD/WR signal indicates read or write access to the cache, and the hit/miss signal indicates whether a cache line with the same address as the referenced address is found in the cache. For cache line-fill operations, the cast-out signal indicates whether the replaced cache line needs to be written to the main memory. In this example, the data part of a cache line is 256 bits long, and thus the data path is 256 bits wide. In cache lookup operations, only one word is transferred at a time, so only the first 32 bits of the data path are used; in cache line-fill operations, all 256 bits are used. Assuming a 32 bit CPU core, the address path is 32 bits wide. In cache line-fill operations, the lowest three bits of the address path are not used.
[Figure 6.4: The data path, address path and control signals of the WLRU CPU cache: a 256 bit data path and a 32 bit address path to the internal bus, plus the RD/WR, lookup/line-fill, hit/miss and cast-out signals.]
6.3 Hit/Miss Logic
Figure 6.5 graphically depicts the hit/miss control logic of the WLRU CPU cache. The hit/miss logic is involved in cache lookup operations and changes the weights of cache lines. The address of the word that the CPU is currently referencing is stored in the Address Latch. The middle 10 bits of the address are the set index bits. The set index bits are connected to all the RAM memory arrays of the eight ways except the data RAM memory arrays, which are connected to the lower 13 bits of the Address Latch.
The core of the hit/miss logic is a parallel comparator. The comparator simultaneously compares the stored address tags of the eight cache lines with the tag part of the Address Latch and outputs the hit/miss signal and the line_select signal. Only valid cache lines are involved in the tag comparison. If one of the eight tags stored in the tag RAMs is the same as that in the Address Latch and the valid bit of the cache line is set, there is a cache hit and the hit/miss signal is set high. The line_select signal indicates the number of the cache line that was hit.

If no matching address tag is found, it is a cache miss. In such a case, the hit/miss signal is set low, and the line_select signal is not used. There is no reading from or writing to cache lines. The line_select signal and the hit/miss signal are needed in the weight control logic to update the weights of all cache lines in the cache set, as illustrated in Figure 6.6.
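The parallel comparison can be sketched sequentially in software (in hardware the eight tag comparisons happen at once; the function name is illustrative):

```python
# Sketch of the hit/miss logic: compare the lookup tag against the
# stored tags of the eight ways, gated by the valid bits.

def hit_miss(tags, valid, lookup_tag):
    """Return (hit, line_select); line_select is undefined (None) on a miss."""
    for line, (t, v) in enumerate(zip(tags, valid)):
        if v and t == lookup_tag:  # only valid lines take part
            return True, line
    return False, None
```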
[Figure 6.5: The hit/miss logic of the WLRU CPU cache. A parallel comparator matches the tag part of the Address Latch against the eight tag RAMs, gated by the valid bits, and drives the hit/miss and line_select signals; a 3:8 decoder and a multiplexer steer word-level reads and writes to the hit way.]

When a cache hit happens, if it is a cache read, the line_select signal controls the output of a multiplexer and one word of the hit cache line is read out. In the case of a cache write, the line_select signal is decoded in a decoder. The output of the decoder activates only one cache line, the hit cache line, and one word of that cache line is overwritten by the value on the data input bus. When a write occurs, the dirty status bit of the hit cache line is set. Later, when the line is replaced, the content of the cache line, whose dirty status bit is set, will be written to the Cast-out Buffer, as illustrated in Figure 6.8. In cache lookup operations, only one word is read from or written to the cache line. The block offset part of the address, which is stored in the lowest 3 bits of the Address Latch, selects which of the eight words in the cache line data is accessed.
6.4 Weight Control Logic
Figure 6.6 graphically depicts weight control logic involved in cache lookup operations.
Weights are recalculated for every reference to the cache set. The set index bits of the Ad-
dress Latch are connected to the weight RAM memory arrays of each of the eight ways.
Only the weights of the selected cache set are changed. If there is a cache hit, as indicated
by the hit/miss signal, a decoder decodes the lineselect signal and selects the hit cache as-
sociative. The weight of the hit cache line is increased, butthe weights of all other cache
lines are deducted by one. In the case of cache misses, when the hit/miss signal is low, the
weights of all cache lines in the cache set are deducted by one. The weight arithmetic circuit
takes the old weight as input and outputs the new weight. Thisis illustrated in Figure 6.7.
Figure 6.7 shows an example weight arithmetic circuit. The core of the circuit is an eight bit wide full adder. The weight arithmetic circuit performs two operations: increasing the weight and deducting the weight by one. The input signal Inc/deduct controls the choice between the two operations. Input lines w0 to w7 are the bits of the old weight, and output lines o0 to o7 are the bits of the new weight.

If a cache line is hit, the Inc/deduct signal is set high, and its weight arithmetic circuit performs an unsigned addition of the old weight and the value of the weight increment register. The weight increment register is software configurable. The size of the register is eight bits, which supports weight formulas whose increments are at most 255. If the sum of the old weight and the number in the weight increment register overflows, which is indicated by the carry-out signal of the adder, the weight-out bits o0 to o7 are all set to one. The new weight becomes 255, the upper limit of weights.
71
In the case of a cache miss, the Inc/deduct signal is set low, and the weight arithmetic circuit of a cache line performs a signed deduction of the old weight by one. When the old weight is already zero, deducting it by one results in underflow, and in that case the new weight is zero. Underflow is detected when all bits of the old weight are zero. Thus, weights in this example are always greater than or equal to zero and less than or equal to the upper limit.
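The two clamped operations amount to a saturating add and a saturating decrement. A functional sketch, assuming the fixed eight bit limit of this example:

```python
# Sketch of the weight arithmetic circuit: an 8-bit saturating add on a
# hit (clamped on carry-out), or a deduction by one (clamped on underflow).

LIMIT = 255  # fixed by the eight-bit weight storage

def new_weight(old, inc_deduct, increment):
    if inc_deduct:                    # hit: unsigned add, clamp at the limit
        return min(LIMIT, old + increment)
    return max(0, old - 1)            # otherwise: deduct by one, clamp at zero
```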
6.5 Replacement and Line-Fill/Cast-Out Logic
Figure 6.8 is a block diagram illustrating the replacement and line-fill/cast-out logic circuits of a WLRU cache. Cache line-fill operations are initiated by the MMU. The Address Latch contains the address of the new cache line, and the 256 bit wide data input carries the data of the new cache line to be filled. All eight words of the new cache line are loaded simultaneously. The replacement circuit takes all eight weights and their valid status bits as inputs and outputs the number of the cache line to be replaced. The 3-bit wide victim line signal contains the number of the line to be replaced and is decoded at a decoder to enable writing to the chosen cache line. The words in the RAM memory arrays of the selected cache line are overwritten: the tag and the data receive new contents, the valid status bit is set, and the dirty status bit is cleared. The newly loaded cache line is assigned an initial weight from the initial weight register, which is software configurable.
If the cache line selected for replacement is valid and its dirty status bit is set, the data and the address tag of the cache line are written to the Cast-out Buffer before the cache line is overwritten. The Cast-out Buffer consists of the tag, the set index and the data. The tag and the data are read from the RAM memory arrays of the cache line to be replaced, but the set index is read from the Address Latch: although the Address Latch contains the address of the new cache line, its set index part is the same as the set index of the address of the to-be-replaced cache line. The castout signal is set high when the Cast-out Buffer is loaded. If the replaced cache line is invalid or not dirty, the existing cache line is simply overwritten, there is no cast-out, and the castout signal is set low. If the castout signal is high, the MMU reads the content of the Cast-out Buffer into its write buffer and will later write it to the main memory when the memory bus is not busy.
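The line-fill and cast-out decision can be sketched as follows; the dictionary fields are illustrative stand-ins for the RAM contents, not the hardware signal names:

```python
# Sketch of the line-fill decision: cast out only when the victim is
# both valid and dirty; otherwise the line is simply overwritten.

def line_fill(victim, new_tag, new_data, initial_weight):
    castout = victim["valid"] and victim["dirty"]
    evicted = (victim["tag"], victim["data"]) if castout else None
    victim.update(tag=new_tag, data=new_data, valid=True,
                  dirty=False, weight=initial_weight)
    return castout, evicted  # evicted content goes to the Cast-out Buffer
```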
Figure 6.9 shows an example of the replacement circuit used in Figure 6.8. Input signals w0 to w7, each eight bits wide, are the weights of cache line0 to cache line7, and input signals v0 to v7 are the valid status bits of the eight cache lines. Output signals o0, o1 and o2 encode the number of the cache line to be replaced, and the valid signal indicates whether the to-be-replaced cache line is valid. The core of the replacement circuit is a comparator. The comparator takes the weights of all cache lines in the cache set as inputs and outputs the number of the cache line with the minimum weight. If there is more than one cache line with the minimum weight, the cache line with the smallest number is chosen.
If there are invalid cache lines, the replacement circuit chooses an invalid cache line and ignores the output of the comparator. Invalid cache lines can be empty cache lines or cache lines that were invalidated explicitly by other processors in a multi-processor environment. In both cases, invalid cache lines are replaced first, and if there is more than one invalid cache line, the one with the smallest number is chosen. If all cache lines are valid, the output of the comparator is used, and the valid signal is set high. The valid signal is used in Figure 6.8 to determine whether the Cast-out Buffer should be loaded. If the to-be-replaced cache line is invalid, the valid signal is set low; in this case, there is no need to copy the content of the to-be-replaced cache line to the Cast-out Buffer.
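A behavioural sketch of the replacement circuit: invalid lines first, then the minimum weight, with ties going to the smallest line number (Python's min() already returns the first minimum):

```python
# Sketch of the replacement circuit. Returns (victim_line, valid_signal):
# valid_signal is low (False) when an invalid line was chosen, so no
# cast-out is needed.

def choose_victim(weights, valid):
    for line, v in enumerate(valid):
        if not v:                 # invalid lines are replaced first
            return line, False
    line = min(range(len(weights)), key=weights.__getitem__)
    return line, True
```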
6.6 Comparison of WLRU and LRU
Compared to LRU, the primary cost of WLRU is the space to store the weight: eight bits per cache line. Assuming an eight bit weight storage per cache line, for an n-way associative cache, WLRU requires 8 × n bits per cache set. In comparison, LRU needs n² bits to store the LRU order of the cache lines [Tan87]. For an eight-way associative cache set, WLRU replacement needs 8 × 8 = 64 bits per cache set, and LRU replacement also needs 8² = 64 bits. For higher associativity, WLRU requires less space than LRU; for example, for a 16-way associative cache set, LRU requires 16² = 256 bits, while WLRU requires 8 × 16 = 128 bits.
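The comparison above reduces to two one-line formulas; the break-even point is at eight ways:

```python
# Replacement-state storage per cache set, as derived above: WLRU stores
# an eight-bit weight per line; LRU stores the full order in n*n bits
# [Tan87]. The two meet at n = 8.

def wlru_bits(n):
    return 8 * n

def lru_bits(n):
    return n * n
```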
Pseudo-LRU replacement algorithms require much less space to store the pseudo-LRU orders. For an n-way set associative cache, pseudo-LRU-tree requires n − 1 bits, and pseudo-LRU-msb needs n bits. For an eight-way associative cache set, pseudo-LRU-tree requires seven bits of storage, and pseudo-LRU-msb requires eight bits to store the pseudo-LRU order (see section 2.3.1).
WLRU has different weight control logic than LRU. On each reference to the WLRU CPU cache, all of the weights of the affected cache set are changed: most of the weights are deducted by one, and the weight of the hit line is increased. There are n such weight control circuits for an n-way set associative cache. The core of the weight control logic is an adder. Since the weight change is simple, a full adder is not needed, and the circuit of the weight control logic can be simplified. LRU and the pseudo-LRU replacement algorithms have similar circuits too: on every reference, the LRU or pseudo-LRU order needs to be updated, and the circuit for maintaining that order is of the same order of complexity as the WLRU weight control logic.
The only difference between the replacement circuit of WLRU and the replacement circuits of LRU and the pseudo-LRU replacement algorithms is the comparator of WLRU weights. However, only one comparator is needed for the entire cache. The comparison of weights in the replacement circuit of WLRU is not expected to slow down the CPU cache, for two reasons. First, cache line replacements occur infrequently, especially at the L2 cache. Second, the replacement of the old cache line by a new cache line happens in parallel with the loading of the CPU execution units with the word from the new cache line, or with the loading of the L1 cache. When the memory word arrives at the cache load queue, the CPU execution unit loads the word and the stalled pipeline resumes immediately. The loading of the cache happens in parallel with the resumption of the pipeline, so no CPU cycles are wasted in replacing and loading the cache.
6.7 Summary
The example implementation of a WLRU CPU cache described in this chapter is very similar to a standard set associative CPU cache using LRU replacement. The cache lookup logic is exactly the same, and the other logic blocks are only slightly different. Compared with LRU and pseudo-LRU caches, a WLRU CPU cache requires a little more storage for weights: pseudo-LRU replacements need one bit per cache line of replacement information, while WLRU requires eight bits per cache line. However, the total storage of a cache line, including storage for the tag, the data, the weight and the status bits, is 285 bits. The eight bits of weight storage per cache line amount to less than 8/285 ≈ 2.8% of the 285 bits of storage of a cache line.
[Figure 6.6: The weight control logic of the WLRU CPU cache. The line_select signal is decoded by a 3:8 decoder and, together with the hit/miss and lookup/linefill signals, drives a Weight Arithmetic Circuit per way; each circuit reads the old eight-bit weight from its Weight RAM (1024×8) and writes back the new weight.]
[Figure 6.7: The weight arithmetic circuit of the weight control logic. An eight bit full adder takes the weight-in bits w0 to w7 and the Weight Increment Register as inputs; the Inc/deduct signal selects between addition and deduction by one, and the carry-out clamps the weight-out bits o0 to o7 at the upper limit.]
[Figure 6.8: The line-fill/cast-out logic of the WLRU CPU cache. The replacement circuit takes the eight weights w0 to w7 and valid bits v0 to v7 and outputs the victim line number, which a 3:8 decoder uses to enable writing of the new tag, data, initial weight (from the Initial Weight Register) and status bits into the per-way RAMs; the victim's tag and data are multiplexed into the Cast-out Buffer together with the set index from the Address Latch, and the castout signal is raised when the buffer is loaded.]
[Figure 6.9: The replacement logic of the WLRU CPU cache. A comparator takes the eight eight-bit weights w0 to w7; the valid bits v0 to v7 override its output so that invalid lines are chosen first; outputs o0, o1, o2 encode the victim line number, and the valid signal reports whether the victim is a valid line.]
Chapter 7
WLRU CPU Cache Simulator
Simulation is a commonly used technique for analyzing the performance of computer systems. A simulation using a memory trace as its input is called a trace based simulation [UM97]. A memory trace is the sequence of addresses of the memory references made during the execution of a program. Trace-based simulations have been a standard technique for studying CPU cache behavior since the 1960s [UM97]. There are a number of CPU cache simulators available, including Dinero IV1, Cheetah [Sug93], Tycho [Hil87] and Fast-Cache2. A detailed survey on trace based simulation of CPU caches can be found in [UM97]. This work did not use an existing CPU cache simulator for the following reasons. Existing CPU cache simulators focus on cache designs (e.g., the size, the line size, and the associativity of caches, and multiple levels of caches) with relatively little support for evaluating cache replacement algorithms. As a result, much of the analysis needed for comparing CPU cache replacement algorithms (e.g., victim analysis) is not included. This chapter describes the cache simulator designed and implemented for this work, which addresses these weaknesses of existing CPU cache simulators.

Trace based simulations are used in this work to compare the WLRU and LRU replacements. An object-oriented CPU cache simulator, written in Java, was developed that is capable of testing multiple cache configurations simultaneously. It generates miss rates and hit rates, performs victim analysis, and provides statistics. The cache simulator can be configured to mimic multi-threading environments, especially simultaneous multi-threading (SMT) [MCFT99]. SMT is a multi-threading technology used in multi-issue CPUs. Multi-issue CPUs can execute multiple instructions every cycle. In SMT, the CPU issues, in a single cycle, multiple instructions from multiple threads simultaneously.

1 http://www.cs.wisc.edu/~markhill/DineroIV
2 http://www.cs.duke.edu/~alvy/fast-cache/
7.1 Memory Trace Based CPU Cache Simulations
Trace based CPU cache studies consist of trace generation, trace reduction and trace processing stages [UM97]. In trace generation, the addresses of memory references and other information are dumped to files, referred to as memory trace files. Trace generation is done by hardware or by software simulation. The traces in the BYU trace archive [FNAG92] were generated using hardware. The challenge of trace generation is to find a representative workload. To study CPU caches, SPEC benchmarks are frequently used as a standard workload. In this work, web servers are used as workloads for network memory traces.
Memory traces are very large and need to be trimmed of extra information. This is called trace reduction. Good reductions may shrink the trace size by a factor of 10 to 100 [UM97]. In trace processing, a CPU cache simulator simulates one or more cache configurations. Addresses are read from the trace files and provided as input. The simulator generates results including miss rates, misses per instruction (MPI) and cycles per instruction (CPI) for the cache configurations.
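The trace processing stage reduces to a simple loop. A minimal sketch, assuming a trace file with one hexadecimal address per line and a direct-mapped cache model (real trace formats and the thesis' Java simulator differ):

```python
# Minimal direct-mapped cache model for a trace-driven simulation.

class DirectMapped:
    def __init__(self, sets=1024, line_bytes=32):
        self.sets, self.line_bytes = sets, line_bytes
        self.tags = [None] * sets

    def reference(self, addr):
        block = addr // self.line_bytes
        idx, tag = block % self.sets, block // self.sets
        hit = self.tags[idx] == tag
        self.tags[idx] = tag
        return hit

def simulate(trace_lines, cache):
    """Feed each traced address to the cache and return the miss rate."""
    hits = misses = 0
    for line in trace_lines:
        addr = int(line.strip(), 16)
        if cache.reference(addr):
            hits += 1
        else:
            misses += 1
    total = hits + misses
    return misses / total if total else 0.0
```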
Trace based CPU cache simulations are time consuming. There are many factors to consider in CPU cache design, including the cache size, the cache associativity, the cache line size, the replacement algorithm, the fetch policy, the write policy, split or unified caches, and multi-level caches. The number of combinations of these factors may exceed several million, so finding the best configuration for a workload may require hundreds of thousands of simulations. Each simulation may take tens of minutes to run, since memory traces usually contain millions of references. Depending on the size of the available main memory, several simulations can be run in parallel. In parallel simulations, several CPU caches are constructed and independently provided with the same inputs, and the miss rates and hit rates of the caches are gathered separately. The time spent reading the same memory trace is saved. Depending on the available memory, the simulator of this work can run up to eight simulations in parallel.
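Parallel simulation can be sketched as feeding the same address stream to several independently configured caches in one pass, so the trace is read only once (class and function names are illustrative):

```python
# Sketch of parallel simulation: several cache configurations consume
# the same reference stream in a single pass over the trace.

class DMCache:  # minimal direct-mapped cache model
    def __init__(self, sets, line_bytes=32):
        self.sets, self.line_bytes = sets, line_bytes
        self.tags = [None] * sets

    def reference(self, addr):
        block = addr // self.line_bytes
        idx, tag = block % self.sets, block // self.sets
        hit = self.tags[idx] == tag
        self.tags[idx] = tag
        return hit

def simulate_parallel(addresses, caches):
    misses = [0] * len(caches)
    for addr in addresses:
        for k, cache in enumerate(caches):  # same input to every cache
            if not cache.reference(addr):
                misses[k] += 1
    return [m / len(addresses) for m in misses]
```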
Multi-configuration simulation algorithms were developed to effectively compare a large number of cache configurations in one pass. Multi-configuration simulation is different from parallel simulation: in multi-configuration simulation, a single simulation run generates miss rate results for a set of cache configurations. Stack processing is the first and the most important multi-configuration simulation method [MGS+70]. Stack processing and its variants can achieve an overhead of less than 5% over single configuration simulations, but stack processing primarily assumes LRU replacement. For replacements other than LRU, multi-configuration simulations are not applicable, and this work does not use them.
7.2 Architecture of CPU Cache Simulator
This section describes the architecture of the CPU cache simulator developed in this work.
The architecture is graphically depicted in figure 7.1.
[Figure: the SimuEngine object reads trace files under a synthesizing policy and drives the CacheDevice objects (direct-mapped cache, set associative cache, two-level cache, multi-level cache), which produce the simulation results.]
Figure 7.1: The architecture of the CPU cache simulator.
7.2.1 SimuEngine Object and Trace Synthesizing
The simulator starts with the SimuEngine object opening a trace file. The SimuEngine then
carries out the trace synthesizing operation. A unique feature of the simulator is its trace
synthesizing ability. For most simulation tasks, there is only one trace file. However, since
multi-threading, especially Simultaneous Multi-threading (SMT) [MCFT99], is frequently used in
modern CPUs, there is a need for a simulator that supports multi-threading. The SimuEngine
object can mimic a multi-threading reference stream by merging multiple single-thread trace
files.
A true multi-threading memory trace is accurate but very difficult to generate. A compromise ap-
proach is trace synthesizing: to mimic a multi-threading memory trace, multiple single-
thread memory traces are merged into a single memory reference stream. A synthesized
memory trace reproduces the cache flushing effect of competing threads, but it cannot reflect the unseen inter-
dependencies among threads. A major shortcoming of synthesized memory traces is the ab-
sence of operating system activity.
To synthesize multi-threading traces, the SimuEngine uses a WorkLoad object to provide
the synthesizing policy guiding the merging of multi-thread traces. The simplest syn-
thesizing policy is to read in a round-robin manner from a set of single-thread memory trace
files. The SimuEngine also supports more complex reading policies that can model context switch
effects. For example, trace files are arranged into groups, where each group represents a set of
simultaneously running threads. The SimuEngine reads a specific number of references from a group
of trace files and then switches to another group. Inside a group,
the files are read in a round-robin manner. Figure 7.2 presents an example of a trace syn-
thesizing arrangement. It mimics an SMT CPU issuing two instructions from two
threads in a cycle. Traces A and B form a group, and traces C and D form another group. A
specific number of references are read from traces A and B. Context switching then occurs,
which results in reading a specific number of references from traces C and D.
[Figure: four single-thread traces A, B, C and D; A and B form one group, C and D another. The synthesized stream is A1 B1 A2 B2 A3 B3 C1 D1 C2 D2 ... Cn Dn.]

Figure 7.2: An example trace synthesizing scenario which includes context switching effects.
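The grouped round-robin policy above can be sketched as follows; the class and method names are illustrative, not the simulator's actual WorkLoad implementation, and the quantum parameter stands in for the "specific number of references" read before a context switch.

```java
import java.util.*;

public class TraceSynthesizer {
    // groups: each group is a list of single-thread traces; each trace is a
    // list of reference labels. `quantum` references are taken from a group
    // (round-robin across its traces) before switching to the next group.
    static List<String> synthesize(List<List<List<String>>> groups, int quantum) {
        List<String> merged = new ArrayList<>();
        int[][] cursor = new int[groups.size()][];
        for (int g = 0; g < groups.size(); g++)
            cursor[g] = new int[groups.get(g).size()];
        boolean progress = true;
        while (progress) {
            progress = false;
            for (int g = 0; g < groups.size(); g++) {   // context switch here
                List<List<String>> traces = groups.get(g);
                int taken = 0, t = 0;
                while (taken < quantum) {
                    int tried = 0;                       // skip exhausted traces
                    while (tried < traces.size()
                            && cursor[g][t] >= traces.get(t).size()) {
                        t = (t + 1) % traces.size();
                        tried++;
                    }
                    if (tried == traces.size()) break;   // whole group exhausted
                    merged.add(traces.get(t).get(cursor[g][t]++));
                    t = (t + 1) % traces.size();
                    taken++;
                    progress = true;
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<List<String>>> groups = List.of(
            List.of(List.of("A1", "A2", "A3"), List.of("B1", "B2", "B3")),
            List.of(List.of("C1", "C2", "C3"), List.of("D1", "D2", "D3")));
        // Reproduces the ordering of figure 7.2.
        System.out.println(String.join(" ", synthesize(groups, 6)));
    }
}
```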
A memory reference is encapsulated as an Addr object inside the SimuEngine. Addr objects
are sent to the CacheDevice objects together with an integer called seqTime, which represents the
time of the reference. It begins at zero and increases by one every time the SimuEngine
reads a memory reference record from a trace file. The integer seqTime loosely represents the
cycle time of the CPU. However, the exact execution timing of the CPU is unpredictable due to
branching, data dependencies, and cache misses. Thus, seqTime should not be interpreted
as the CPU execution time or the main memory access time.
7.2.2 CacheDevice Interface
CacheDevice is the interface of all CPU cache objects. CacheDevice is the abstraction of
all CPU caches, including direct-mapped and set associative caches, and single-level and
multi-level caches. The interface consists of the beginSimulation() method, the cyclePing()
method, the contextSwitching() method, and the endSimulation() method. When the SimuEngine
object opens the trace files and begins reading the first memory reference record, it calls the
beginSimulation() method of each CacheDevice object.
The most frequently called method of the CacheDevice interface is cyclePing(Addr a,
int seqTime). Each memory reference record read from a trace file is converted to an Addr
object. The SimuEngine then calls the cyclePing() method of each CacheDevice object. The
cyclePing() method represents a cache lookup operation on the CPU cache, during which the
CacheDevice object records whether there is a cache hit or miss and updates the content of
the cache if a replacement is needed.
When the SimuEngine object is configured to mimic multi-threading or multi-programming
environments, the contextSwitching() method is called when the SimuEngine switches to a new
thread. The default action of a CacheDevice in the contextSwitching() method is to report
the hit and miss rates of the thread or process being scheduled out.
At the end of the simulation, the SimuEngine calls the endSimulation() method of every CacheDe-
vice object. This signals the caches to report the simulation results, such as hit or miss rates,
and to process the bookkeeping records. The miss rates and bookkeeping information of each cache
are written to a text file. The bookkeeping information includes the stay time, idle time, and
hit count of every address. This is used in the victim analysis (see section 7.6).
For this work, direct-mapped and set associative caches, and one-level and multi-level
CPU caches were implemented. Figure 7.3 is the UML graph of the CacheDevice interface.
A two-level CPU cache object is implemented as two cache objects. Addr objects need
to be dispatched from the L1 to the L2 cache; the L2 cache is checked only when the L1
cache generates a miss.
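The interface described above might look roughly like this in Java; the signatures are paraphrased from the text, and the Addr and CountingCache classes are minimal stand-ins added to make the sketch runnable.

```java
public class CacheDeviceDemo {
    // Minimal stand-in for the simulator's Addr object.
    static final class Addr {
        final long tag;
        Addr(long t) { tag = t; }
    }

    // The CacheDevice interface as described in section 7.2.2.
    interface CacheDevice {
        void beginSimulation();
        // One cache lookup per memory reference; records hit/miss and
        // performs a replacement when needed.
        void cyclePing(Addr a, int seqTime);
        // Called when the SimuEngine switches threads; the default action is
        // to report the rates of the thread being scheduled out.
        void contextSwitching();
        void endSimulation();
    }

    // A trivial stub that just counts lookups, to show the call protocol.
    static final class CountingCache implements CacheDevice {
        int lookups = 0;
        public void beginSimulation() {}
        public void cyclePing(Addr a, int seqTime) { lookups++; }
        public void contextSwitching() {}
        public void endSimulation() {}
    }

    public static void main(String[] args) {
        CountingCache c = new CountingCache();
        c.beginSimulation();
        for (int t = 0; t < 5; t++) c.cyclePing(new Addr(t), t);
        c.endSimulation();
        System.out.println(c.lookups);
    }
}
```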
7.3 Cache Sets and Replacement Objects
Since the candidates for a replacement are limited to a single cache set, the replacement logic is
implemented in the cache set object. The abstract base class of all cache sets in set associa-
tive caches is the CacheSet class. The SetCache class implements the CacheDevice inter-
face and represents a set associative cache. The interaction with the SimuEngine is done in the
SetCache object. A CacheSet object is only concerned with the replacement of cache lines.

Figure 7.3: The UML graph of the CacheDevice interface.
A SetCache object constructs all cache sets and initializes the bookkeeping in the beginSimula-
tion() method, dispatches the Addr object to the appropriate cache set based on its address
in the cyclePing() method, and gathers the miss and hit results and other statistics, such as
the victim analysis data (see section 7.6), from each cache set in the endSimulation() method. Fig-
ure 7.4 is a flow chart of the cyclePing() method of the SetCache class. The replace() method
of the CacheSet class is called when the referenced() method returns false. This causes the
corresponding replacement logic of the CacheSet object to be invoked.
The CacheSet class is the base class of all cache sets, each of which may use a different
replacement algorithm. A CacheSet object consists of an array of CacheLine objects, with
each object encapsulating a CPU cache line. In trace based cache simulations, the data of a
CPU cache line is not of interest. Thus a CacheLine object stores only the address tag
and the bookkeeping information of a cache line.
The CacheSet interface has two methods related to replacement: referenced() and re-
place(). The referenced() method returns a Boolean value indicating the cache hit or miss
of a memory reference. The replace() method is called only when the referenced() method
returns false, indicating a cache miss. The replace() method returns a Victim object that is
[Flow chart: the index bits of the address map it to a single cache set; the referenced() method of that set is called with the Addr and seqTime arguments; on a hit (referenced() returns true) the weights of all cache lines of the set are updated; on a miss the weights are updated and the replace() method of the cache set is called.]

Figure 7.4: A flow chart of the cyclePing() method of class SetCache.
null if the cache set is not yet full. The Victim object returned by the replace() method con-
sists of the address tag of the cache line being replaced and a copy of the bookkeeping in-
formation of that line. Victim objects are gathered and put into a disk file for further
analysis. The separation of the referenced() and replace() methods is to support cache by-
passing [Smi82], where some cache misses do not cause replacements. In cache bypassing,
some main memory addresses are explicitly marked uncacheable. These addresses do not
replace existing cache contents.
Subclasses of the CacheSet class implement different replacement algorithms. Three LRU
replacements and a general WLRU replacement with a configurable weight formula are im-
plemented. Figure 7.5 is the UML graph of the CacheSet base class.
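The referenced()/replace() split can be illustrated with a toy LRU cache set; the LinkedList-based LRU order and the bare-bones Victim class here are simplifications, not the thesis code.

```java
import java.util.*;

// Sketch of the CacheSet contract: referenced() reports hit/miss only,
// and replace() is a separate step so that cache bypassing (a miss with
// no replacement) remains possible.
public class LruCacheSet {
    static final class Victim {
        final long tag;
        Victim(long t) { tag = t; }
    }

    private final int ways;
    private final LinkedList<Long> lines = new LinkedList<>(); // MRU at front
    private long pendingTag;

    LruCacheSet(int ways) { this.ways = ways; }

    // True on a hit (and refresh the LRU order); false on a miss.
    boolean referenced(long tag) {
        pendingTag = tag;
        if (lines.remove(tag)) { lines.addFirst(tag); return true; }
        return false;
    }

    // Called only after a miss; returns null while the set is not yet full.
    Victim replace() {
        Victim v = lines.size() < ways ? null : new Victim(lines.removeLast());
        lines.addFirst(pendingTag);
        return v;
    }

    public static void main(String[] args) {
        LruCacheSet set = new LruCacheSet(2);
        set.referenced(1); set.replace();
        set.referenced(2); set.replace();
        set.referenced(1);                 // hit: 1 becomes most recently used
        boolean hit = set.referenced(3);   // miss
        Victim v = set.replace();          // evicts the LRU line, tag 2
        System.out.println(hit + " " + v.tag);
    }
}
```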
7.4 WLRU Replacements
The WLRU subclass of CacheSet implements the WLRU replacement. The referenced()
method of the WLRU class recalculates the weights of all cache lines of the set:
the hit line has its weight increased, and the other lines have their weights deducted by one un-
less the weight is already zero. Figure 7.6 is the flow chart of the referenced() and replace()
methods of the WLRU class.
Each WLRU cache set is given a WeightFunction object, which controls how weights are
calculated. The increment of weights, the upper limit of weights, and the initial weight are
all held in the weight function object.
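A minimal sketch of the weight bookkeeping just described, assuming the decay/increment/limit behavior from the flow chart; the constructor parameters mirror the thesis' weight-formula notation (e.g. i64r256b4: increment 64, upper limit 256, initial weight 4), but the class itself is illustrative, not the simulator's WLRU code.

```java
public class WlruSet {
    final int increment, upperLimit, initialWeight;
    final long[] tags;
    final int[] weights;
    int used = 0;

    WlruSet(int ways, int increment, int upperLimit, int initialWeight) {
        this.increment = increment;
        this.upperLimit = upperLimit;
        this.initialWeight = initialWeight;
        tags = new long[ways];
        weights = new int[ways];
    }

    // On every reference the other lines decay by one (never below zero);
    // a hit line gains `increment`, capped at `upperLimit`.
    boolean referenced(long tag) {
        int hit = -1;
        for (int i = 0; i < used; i++) {
            if (tags[i] == tag) hit = i;
            else if (weights[i] > 0) weights[i]--;
        }
        if (hit >= 0) {
            weights[hit] = Math.min(upperLimit, weights[hit] + increment);
            return true;
        }
        return false;
    }

    // On a miss: fill an empty way, otherwise evict the minimal-weight line.
    // Returns the victim tag, or -1 if the set was not yet full.
    long replace(long tag) {
        int slot;
        long victim = -1;
        if (used < tags.length) {
            slot = used++;
        } else {
            slot = 0;
            for (int i = 1; i < tags.length; i++)
                if (weights[i] < weights[slot]) slot = i;
            victim = tags[slot];
        }
        tags[slot] = tag;
        weights[slot] = initialWeight;   // new lines start at the initial weight
        return victim;
    }

    public static void main(String[] args) {
        WlruSet set = new WlruSet(2, 4, 8, 2);  // 2 ways, inc 4, limit 8, init 2
        set.referenced(1); set.replace(1);
        set.referenced(2); set.replace(2);
        set.referenced(1);                      // hit: w(1) grows, w(2) decays
        set.referenced(3);                      // miss: everything decays
        System.out.println(set.replace(3));     // line 2 has the minimal weight
    }
}
```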
7.5 Optimal Replacement Algorithm
To meaningfully compare CPU cache replacement algorithms, the hit rates or miss rates
of the off-line optimal replacement algorithm are used for comparison. The optimal re-
placement algorithm (OPT) requires knowledge of future references. The miss rate of the optimal re-
placement algorithm is important in comparing cache replacements because it shows the potential for im-
provement. In Chapter 8, the miss rates of WLRU are compared with the miss rates of LRU,
and the miss rates of the optimal replacement algorithm are provided to show how much of
the possible improvement WLRU has achieved.
A naive implementation of the optimal replacement algorithm reads ahead over every future ref-
erence; it takes many days to finish a memory trace of ten million references. In this work,
two methods are used to reduce the simulation time of the optimal replacement algorithm.
First, the memory trace is split into smaller trace files by cache set before the simulation,
which significantly shortens each look-ahead. The references map-
ping to the same cache set are stored in a single trace file, and simulations are done on one cache
set at a time. The miss rates of all cache sets are averaged as the final miss rate. Second, a
Java database, jdbm, is used to store the entire reference history of every address. Instead of
looking ahead when making a replacement decision, the database is consulted for the time of
the next reference to the cache line. This significantly reduces the amount of disk ac-
cess. The use of a database is made practical by the splitting of the trace into cache
sets, which reduces each trace length by a factor of more than 1000; otherwise, no simple
database could easily hold the whole trace file or files. With these two optimizations, an op-
timal replacement simulation that would take days can be finished in half an hour.
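The database-backed OPT can be sketched as follows for a single cache set; a HashMap of per-address reference positions plays the role of the jdbm database here, and the fully associative set is a simplification for brevity.

```java
import java.util.*;

public class OptSim {
    static int countHits(long[] trace, int ways) {
        // Index the full reference history first: address -> its use positions.
        // (The thesis stores this index in a jdbm database.)
        Map<Long, ArrayDeque<Integer>> nextUse = new HashMap<>();
        for (int t = 0; t < trace.length; t++)
            nextUse.computeIfAbsent(trace[t], k -> new ArrayDeque<>()).add(t);

        Set<Long> lines = new HashSet<>();
        int hits = 0;
        for (long addr : trace) {
            nextUse.get(addr).pollFirst();        // consume the current use
            if (lines.contains(addr)) { hits++; continue; }
            if (lines.size() == ways) {           // evict the furthest next use
                long victim = -1;
                long furthest = -1;
                for (long l : lines) {
                    Integer n = nextUse.get(l).peekFirst();
                    long d = (n == null) ? Long.MAX_VALUE : n;
                    if (d > furthest) { furthest = d; victim = l; }
                }
                lines.remove(victim);
            }
            lines.add(addr);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(countHits(new long[]{1, 2, 3, 1, 2, 4, 1, 2}, 3));
    }
}
```

The point of the index is that a replacement decision is a lookup (peekFirst) rather than a forward scan of the remaining trace, which is what makes OPT tractable on long traces.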
7.6 Victim Analysis
Victims in CPU cache replacement refer to the cache lines being replaced. The CPU cache simulator
developed here performs statistical analysis on the victims of a replacement algorithm. These statistics are helpful
in evaluating the accuracy of replacements and may also provide hints for improving the
performance of a replacement algorithm.
In the CPU cache simulator, the bookkeeping information of a cache line, such as the hit count,
stay time, idle time, and the weight of the cache line when evicted, is recorded. This in-
formation is copied to a Victim object when the cache line is evicted from the cache. For
LRU replacement, which does not use weights, the weight information is empty. The statis-
tical analysis of the victim objects is called the victim analysis. Victim analysis currently in-
cludes the study of the distribution of the idle time, stay time, and hit counts of all cache lines.
Victim analysis is a unique feature of the simulator of this work. It is also one of the ma-
jor motivations for developing a CPU cache simulator from scratch. In section 8.5, we will
present victim analysis results on SPEC benchmark traces and network traces.
7.7 Validation of Simulator
The simulator in this work implements three cache replacement algorithms: LRU, WLRU,
and OPT (the optimal replacement algorithm). The simulator is validated in several ways.
The overall architecture of the simulator and the implementation of the LRU replacement
algorithm are validated by checking the hit rates against other cache simulators. The reading and
dispatching mechanism for memory trace records and the LRU cache simulation are compared
with a cache simulator written in C on Unix [3]. The two simulators generate exactly the same
results for the BYU traces.
For OPT, two versions of the optimal replacement algorithm were implemented in this work.
One version uses the traditional read-ahead approach, and the other uses a database.
The results of the two implementations are the same. The first version of OPT
reads ahead in the trace file to decide replacements, but reading ahead is very
slow (see section 7.5). The database version of OPT is much faster and is the one used in the sim-
ulations. The only purpose of the look-ahead version of OPT is to check the correctness of
the database version. The two versions of OPT do not share any common code.

3. http://traces.byu.edu/new/Tools/b2asrc/byucache.c
Since the WLRU replacement algorithm is new, there is no other simulator to check it against.
A set of test cases is applied to the WLRU simulator to ensure its correctness. The
test cases are manually generated and checked. Besides test cases, there are also a large
number of assertions in the code of WLRU, such as assertions on the calculation of the weight of a
cache line, to ensure that the WLRU cache always has the correct status.
7.8 Summary
Trace based simulation is the main tool for studying CPU caches. The main challenge is to gen-
erate accurate memory traces. The simulator developed in this work is written in Java and
can be easily extended. The simulator contains a set of unique features: a fast imple-
mentation of the optimal replacement algorithm and the victim analysis are two important
contributions of the simulator. The victim analysis is very useful in studying the detailed behav-
ior of CPU cache replacement algorithms.
Figure 7.5: The UML graph of the CacheSet class, which is the base class of all replacements.
START START
Compare the address tags of cache lines with the incoming address tag
Choose the cache line with the
minimal weight.
Increase the weight of the hit cache line and deduce the weights of other lines
by one.
Load the first empty line with
new address tag.
Found?
Empty lines?
Replace the cache line with new address tag.
Assign initial weight to the new
cache line.
END
Return true.
Deduce the weights of all lines by one.
Return false.
END
Y
Y
N
N
A B
Figure 7.6: The flow chart of thereferenced()andreplace()method ofWLRUclass.
Chapter 8
Simulation Results
This chapter presents the results of the simulations.
8.1 Experimental Design
The first goal of the experiments is to compare the WLRU, LRU and OPT (optimal) replace-
ment algorithms. The metric used for the comparison is the hit rate. The second goal of the
experiments is to gain an understanding of the behavior of each replacement algorithm through
victim analysis.
The factors of the experiments are divided into three groups: WLRU weight formula pa-
rameters, the workloads, and the CPU cache configurations. The workloads include web
servers, SPEC benchmarks, and the combination of the two through multi-threading. The full
combination of all these factors would result in a very large number of experiments, requiring
effort better spent elsewhere. However, WLRU is a new cache replace-
ment algorithm whose design space has not been investigated; if only a small fraction of the
combinations were tested, important discoveries might be missed. Due to these considerations,
the following approach is used in the experimental design of this work: for the WLRU weight
formula parameters and the web server workloads, a full combination of factors is tested.
The advantage of this experimental design is that it provides a nearly thorough examination
of WLRU while keeping the number of experiments manageable.
The factors associated with CPU cache configurations include the cache size, the associativity
of the cache, the cache line size, the number of cache levels, and the size and associativity of each level
(a complete list can be found in chapter 2). The number of combinations is high. However,
constraints such as circuit costs limit the choice of CPU cache configurations, so only a
small number of the most common CPU cache configurations are tested.
Since current CPUs usually use two-level CPU caches, the CPU cache configurations used
in the experiments are two-level caches. Due to the overwhelming locality exhibited in the
L1 memory reference stream, LRU is nearly optimal for the L1 cache if the L1 cache is small
and has low associativity (see chapter 3). WLRU is therefore used only in the L2 cache; the L1 cache
uses LRU. Since the number of pins on a CPU is a primary constraint [FH05], larger cache lines
require more pins and impose hard challenges on CPU designs. Almost all current CPUs use
32 byte cache lines [1]. All experiments assume 32 byte cache lines.
Three sizes of the L1 cache are tested: 8 KB, 16 KB and 32 KB. Since the L1 cache needs
to be accessed in one or two cycles, current CPUs seldom have L1 caches larger than 64 KB.
For higher frequencies, current CPUs tend to have L1 caches of smaller sizes and lower
associativities, such as two or four way. For example, the Intel Pentium 4 Prescott CPU has a
16 KB L1 data cache [2], and the Sun Sparc T1 has a 16 KB L1 instruction cache and an 8 KB
L1 data cache [3]. The L1 cache in the experiments is a two-way set associative cache. For
most workloads, four-way set associative L1 caches were
also experimented with. The results of these experiments show no differences in the
hit rates of the L2 cache between two-way and four-way set associative L1 caches.
These results can be found in appendix A.
Since the L2 cache has a longer delay, typically 10 CPU cycles, it can support higher associa-
tivities. 16-way set associative caches are frequently used in current L2 caches, and associativities higher than
16 are seldom seen. The L2 cache in the experiments is a 16-way
set associative cache. Two L2 cache sizes, 128 KB and 256 KB, are tested. Larger
L2 cache sizes are not tested due to the limited length of the memory traces: large L2
caches, such as 1 or 2 MB, require long memory traces of 100 million to 1 billion refer-
ences; otherwise, a large L2 cache may have empty entries. Memory traces of 1 billion
references would be more than 20 GB long, even in a compact format. The disk storage for
all 32 web server memory traces already exceeds 100 GB. Simulations also need storage for
data such as index files and trace splits (see section 7.5) and space for the databases of the
optimal replacement algorithm; a disk array of 1000 GB would be required. The simulation
1. http://www.geek.com/procspec/
2. http://www.geek.com/procspec/intel/prescott.htm
3. http://www.sun.com/processors/UltraSPARC-T1/specs.xml
time also increases sharply. At this stage, only 128 KB and 256 KB L2 cache sizes
are tested. In fact, L2 caches of 128 KB and 256 KB are common configurations
for current low end CPUs; for example, the Intel Pentium 4 Willamette CPUs have an L2
cache of 256 KB. Only recently have CPUs had L2 caches larger than 512 KB. The total
number of CPU cache configurations in the experiments is 3 × 2 = 6.
WLRU weight formulas have three configurable parameters: the initial weight,
the increment of weights, and the upper limit of weights. In the experiments, these param-
eters are set from 1 to 1024, increasing exponentially, so each parameter of the WLRU for-
mula has 11 settings. For an upper limit of weights of value n, where n = 2^i, the number of
combinations of the increment of weights and the initial weight is (i + 1)^2. The total num-
ber of WLRU weight formulas tested for each pair of workload and cache configuration is
Σ_{i=0}^{10} (i + 1)^2 = 506.
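The count can be checked mechanically; this throwaway snippet just sums (i+1)^2 over the eleven upper-limit settings.

```java
public class FormulaCount {
    // For an upper limit of 2^i there are i+1 power-of-two choices
    // (2^0 .. 2^i) for both the increment and the initial weight,
    // giving (i+1)^2 combinations per limit.
    static int count() {
        int total = 0;
        for (int i = 0; i <= 10; i++)   // upper limits 1, 2, 4, ..., 1024
            total += (i + 1) * (i + 1);
        return total;
    }

    public static void main(String[] args) {
        System.out.println(count());    // 506
    }
}
```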
The number of web server traces generated and tested is 32 (see chapter 4). The web server
traces use static web pages to eliminate the impact of disk I/O. Two kinds of web servers,
Apache and thttpd, are tested. The total number of simulation experiments on the web server
traces is 32 × 6 × 506 = 97,152.
For the SPEC benchmarks, due to time limitations, only a small number of WLRU weight for-
mulas are tested; in future work, all WLRU weight formulas will be tested against the SPEC
benchmark traces. The tested weight formulas are chosen because they result in good
improvements in hit rates for the web servers. Some other arbitrarily chosen weight formulas
are also tested to observe the trend of hit rate changes for different settings of the WLRU weight
formula parameters. For each SPEC benchmark trace, about 10 to 20 weight formulas are
tested. The cache configurations are the same as those used for the web server memory traces.
WLRU is designed to exploit the unique locality features of web server memory references.
The goal of the experiments using the SPEC benchmarks is to show that the speedup of WLRU
on web servers does not harm traditional applications, as represented by the SPEC bench-
marks. If WLRU does not show noticeably lower hit rates than LRU for a large number of
SPEC benchmarks, WLRU can be said to be a success in improving the CPU cache perfor-
mance of web servers.
It is possible that, in a multi-threaded environment, both networked applications and traditional
applications are running. Experiments are designed to test the performance of WLRU on
multi-threading computing platforms by synthesizing the SPEC benchmark traces and the
web server traces (see section 7.2.1). Each SPEC benchmark trace is synthesized with the
same web server memory trace. The synthesized multi-threading traces are simulated on
a CPU cache configuration of a two-way 32 KB L1 cache and a 16-way 128 KB L2 cache.
This CPU cache configuration is the same as one of those used in the web server and SPEC
benchmark experiments. The weight formulas that were tested on the SPEC benchmarks were
also tested in the multi-threading simulations. The hit rates of the WLRU, LRU and OPT replace-
ment algorithms on the multi-threading experiments are compared.
For every combination of CPU cache configuration and workload, both
web servers and SPEC benchmarks, the hit rate of the optimal replacement algorithm is gen-
erated. The number of optimal replacement experiments is 6 × (32 + 26) = 348. The hit rates of OPT
can be found in appendix A. The number of experiments on SPEC benchmarks and multi-
threaded workloads, including WLRU, LRU and OPT, is more than 1,500.
8.2 WLRU on Web Server Memory Traces
The experiments do not show that a single weight formula always has the best hit rate for
all cache configurations and web server memory traces. However, the experiments show
that the same small number, 8 to 12, of weight formulas are always among the best for all
combinations of cache configurations and web server traces. Of them, weight formula
i64r256b4 has the most consistent performance across all scenarios: it
has the best hit rates for nearly half of the combinations of cache configurations and web
server memory traces, and for the remaining half, the difference between i64r256b4 and the
weight formula with the best hit rate is within 1%. Weight formula i64r256b4 is therefore chosen for
presentation; the hit rates of the other weight formulas can be found in appendix A. The hit rate
of WLRU i64r256b4 is compared with the hit rates of the LRU and OPT replacement algorithms.
Figures 8.1 to 8.4 compare the miss rates of the WLRU i64r256b4 and LRU replacement
algorithms on the web server memory traces. The CPU cache configurations in the figures
are two-level caches with a two-way 32 KB L1 cache and 16-way 128 KB and 256 KB L2
caches.
The reason that WLRU i64r256b4 shows significant improvement over LRU for web servers
can be found in the IRG distributions of the web server memory traces (see chapter 4). Table 4.6
shows that web server memory traces have a smaller percentage of small IRG values
(less than 16) and a higher percentage of large IRG values (larger than 256) than the SPEC bench-
marks. There is also a higher percentage of IRG values of medium size, 32 to 256, in the web
server memory traces. The medium sized IRG gaps are likely to result in cache misses under
[Figure: four panels (Apache, 20K, L2 128KB; Apache, 20K, L2 256KB; Apache, 200K, L2 128KB; Apache, 200K, L2 256KB) plotting miss rate against request rates of 50, 90, 120 and 150 for LRU, WLRU and OPT.]

Figure 8.1: Comparison of miss rates of WLRU, LRU and Optimal on Apache traces.
LRU. Under WLRU i64r256b4, the small initial weight evicts addresses of low caching value
faster than LRU does. The result is that these medium sized IRG gaps are more likely to become
cache hits.
WLRU shows better improvements over LRU when the web page size is small. The dif-
ferences in the miss rates of WLRU and LRU are larger when the L2 cache size is 128 KB
instead of 256 KB. The figures show that WLRU has a higher improvement in hit rates for
the thttpd web server than for the Apache web server, especially when the web
page sizes are small: there are nearly 50% fewer L2 cache misses. For large web pages, the
improvement of WLRU for thttpd is 5%. The improvements of WLRU on the Apache server
are less sensitive to web page sizes. The different behaviors of the thttpd and Apache servers under
[Figure: four panels (Apache, mixed1-4, L2 128KB; Apache, mixed1-4, L2 256KB; Apache, mixed1-9, L2 128KB; Apache, mixed1-9, L2 256KB) plotting miss rate against request rate for LRU, WLRU and OPT.]

Figure 8.2: Comparison of miss rates of WLRU, LRU and Optimal on Apache traces with mixed web page sizes.
WLRU can be traced to their different locality features, as indicated by the distributions
of reference counts and IRG values (see chapter 4). The two ways of handling concurrent
connections in web servers have an impact on the temporal locality of the servers. The
results of this work show that the two kinds of web servers, thttpd and Apache, call for
different management strategies.
For the Apache web server, WLRU always has more than 15% fewer L2 cache misses than
LRU, except for trace a200kr50. Trace a200kr50 has good hit rates, almost optimal, for both
WLRU and LRU. This is because trace a200kr50 has very good temporal locality, as reflected
by its average reference count of more than 200; the other web server memory traces have
average reference counts of no more than 60.
[Figure: four panels (thttpd, 20K, L2 128KB; thttpd, 20K, L2 256KB; thttpd, 200K, L2 128KB; thttpd, 200K, L2 256KB) plotting miss rate against request rate for LRU, WLRU and OPT.]

Figure 8.3: Comparison of miss rates of WLRU, LRU and Optimal on thttpd traces.
8.3 WLRU on SPEC CPU2000 Benchmarks
WLRU weight formula i64r256b4, which shows significant improvement for web servers,
is tested on all SPEC CPU2000 benchmarks. Figure 8.5 compares the WLRU and
LRU miss rates of the SPEC integer programs, and figure 8.6 compares the WLRU and LRU
miss rates of the SPEC floating point programs. The simulation results in figures 8.5 and 8.6
are for a two-level cache configuration; WLRU replacement is used only in the L2 cache.
For eight SPEC benchmarks, WLRU has higher hit rates than LRU. For three SPEC bench-
marks, ammp, applu and art, WLRU and LRU are almost exactly the same. For the remain-
ing 15 SPEC benchmarks, LRU has slightly better hit rates than WLRU. For the SPEC INT
benchmarks, LRU has slightly better hit rates than WLRU, but both LRU and WLRU have
near optimal hit rates. WLRU is slightly better than LRU for most of the SPEC FLT benchmarks.

[Figure: four panels (thttpd, mixed1-4, L2 128KB; thttpd, mixed1-4, L2 256KB; thttpd, mixed1-9, L2 128KB; thttpd, mixed1-9, L2 256KB) plotting miss rate against request rate for LRU, WLRU and OPT.]

Figure 8.4: Comparison of miss rates of WLRU, LRU and Optimal on thttpd traces with mixed web page sizes.

It is believed that the SPEC FLT benchmarks have worse locality than the SPEC
INT benchmarks [CH01]. This is reflected in the hit rates. The miss rates of WLRU and
LRU for most of the SPEC INT benchmarks are well below 1%.
The reason that WLRU with weight formula i64r256b4 has slightly worse hit rates than
LRU for some SPEC benchmarks is that addresses with high reference counts represent a
larger portion of the memory references in these benchmarks. The initial weight of
weight formula i64r256b4 is set too low, which results in evicting new cache contents too
early. If provided with a larger initial weight, for example b = 32, WLRU has higher hit
rates. An initial weight of 32 keeps new cache contents a little longer, but still for much
shorter than LRU does.
[Figure: miss rates of the SPEC integer benchmarks bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex and vpr under LRU, WLRU and OPT.]

Figure 8.5: Comparison of miss rates of OPT, LRU and WLRU on SPEC integer benchmarks.
Figures 8.7 and 8.8 compare the miss rates for the SPEC CPU2000 benchmarks when using
WLRU i64r256b4 and WLRU i64r256b32 with LRU and OPT. The figures show that WLRU
i64r256b32 noticeably improves the hit rates of WLRU for the SPEC benchmarks when com-
pared with WLRU i64r256b4. Four more SPEC benchmarks, applu, art, mcf and perl, have
WLRU outperforming LRU. For two SPEC benchmarks, eon and ammp, the hit rates of
WLRU and LRU are almost exactly the same. With i64r256b32, WLRU outperforms LRU
on more than half of the SPEC CPU2000 benchmarks. For web servers, however, WLRU i64r256b32
is worse than i64r256b4 (see appendix A).
The SPEC benchmark swim shows a noticeable difference between the hit rates obtained using WLRU
and LRU. This is the result of swim having a large number of addresses referenced only
twice, with small IRG values. For example, in cache set 0 of swim,
the majority, 164 out of 211, of the IRG values of the addresses referenced twice are less
than 32, with the remaining 44 IRG values less than 64. These IRG values are smaller than
the IRG values of addresses with low reference counts in the web server traces and the other SPEC
benchmarks (see tables 5.1 and 5.2). The SPEC benchmark swim has a unique memory
reference pattern, and WLRU i64r256b4 keeps these addresses longer than LRU would.
Figure 8.6: Comparison of miss rates of OPT, LRU and WLRU on SPEC floating point benchmarks.
Table 8.1 shows the IRG values of some of the addresses of the SPEC benchmark swim that
map to cache set 0. There are only five address tags in set 0 of swim that have reference
counts of 20 or more. One address tag has a reference count as high as 1618.
8.4 WLRU Performance on Multi-threaded Workloads
WLRU shows significant improvements over LRU for web servers. For SPEC benchmarks,
the difference between WLRU and LRU is small, within 5%. Ideally WLRU would replace
LRU in CPUs used in machines where there is a mix of applications running. This section
describes results for experiments that used a mix of SPEC benchmarks and web server memory
traces.
In the simulations, one copy of a SPEC benchmark trace and one copy of a web server trace are
synthesized to simulate a simultaneous multi-threading CPU. There are 26 SPEC benchmarks
and 32 web server traces. The number of combinations of SPEC benchmarks and
web servers is 26 × 32 = 832. This work tested all SPEC benchmarks on a single web
address  ref count     1     2    <4    <8   <16   <32   <64  <128  <256  <512
351a00        1618   785   780    33     5     2    12     0     0     0     0
26a00           32     0     0     0    30     0     0     1     0     0     0
7e3400          32     2     7     6     2     2     2     0     1     0     9
398200          27     1     7     5     2     1     0     0     1     1     8
54f900          20     0     1     3     2     2     0     1     1     0     9
......
2aa300           2     0     1     0     0     0     0     0     0     0     0
2aa500           2     0     1     0     0     0     0     0     0     0     0
2d6a00           2     0     0     0     0     0     1     0     0     0     0
2d6b00           2     0     0     0     0     0     1     0     0     0     0
31a300           2     0     0     0     0     0     1     0     0     0     0
31a400           2     0     0     0     0     0     1     0     0     0     0
Table 8.1: The IRG values of address tags with reference count of two in set 0 of SPEC benchmark swim.
Figure 8.7: Comparison of miss rates of OPT, LRU and WLRU on SPEC integer benchmarks, where WLRU uses i64r256b32.
server trace t20kr90, since t20kr90 shows the best improvement of WLRU over LRU.
Figures 8.9 and 8.10 show the miss rates of the LRU, WLRU and OPT replacement algorithms
on a two-level cache. The weight formula used by WLRU in the figures is i64r256b4, which
shows significant improvement for web server traces (see section 8.2). The web server memory
trace used in the figures is t20kr90. The CPU cache has a two-way 32KB L1 cache
and a 16-way 128KB L2 cache. For most of the SPEC benchmarks, WLRU has more than
25% fewer L2 cache misses than LRU. The improvement of WLRU over LRU in multi-threading
simulations is consistent and not affected by the choice of SPEC benchmarks.
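The synthesized multi-threading traces above combine one SPEC trace with one web server trace. The merge policy is not specified in this chapter, so the sketch below assumes a simple round-robin interleave; the chunk parameter (how many references a thread issues before switching) is likewise an assumption.

```python
def interleave(trace_a, trace_b, chunk=1):
    """Round-robin merge of two memory reference traces, mimicking an SMT
    CPU that alternates between two threads. The interleaving policy is an
    assumption; the thesis does not state how the traces were combined."""
    out, i, j = [], 0, 0
    while i < len(trace_a) or j < len(trace_b):
        out.extend(trace_a[i:i + chunk]); i += chunk
        out.extend(trace_b[j:j + chunk]); j += chunk
    return out
```

With chunk = 1, interleave([1, 2, 3], [9, 8]) produces [1, 9, 2, 8, 3]; a larger chunk models coarser-grained switching between the two threads.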
8.5 Comparison of LRU and WLRU Using Victim Analysis
The victim analysis applied to SPEC benchmark and web server memory traces shows that
LRU and WLRU have different behaviors. As expected, victim analysis shows that WLRU
Figure 8.8: Comparison of miss rates of OPT, LRU and WLRU on SPEC floating point benchmarks, where WLRU uses i64r256b32.
Figure 8.9: The miss rates of LRU, WLRU and OPT replacements on multi-threading SPEC INT benchmarks.
Figure 8.10: The miss rates of LRU, WLRU and OPT replacements on multi-threading SPEC FLT benchmarks.
evicts addresses that will not be referenced again faster than LRU does. Victim analysis shows that
WLRU is better than LRU at keeping addresses with high cache value. Victim analysis is
helpful in understanding the decision-making process of a cache replacement algorithm
and in fine-tuning it. The differences
between the locality qualities of SPEC benchmarks and web servers are also reflected in the
victim analysis.
Victim analysis is the empirical analysis of the hit counts, stay time and idle time of victims.
The difference between LRU and WLRU is manifested in the distribution of victim
hit counts. Table 8.2 shows the distribution of the hit counts of victims of LRU and WLRU
replacement on web server trace t20kr50. The distributions of hit counts for other web server
traces are provided in appendix A. LRU replacement has more cache misses than WLRU
and thus more victims. The table shows that LRU evicts a large number, more than 20%,
of victims that are referenced more than once, and a significant number of victims are
evicted even though these addresses are referenced more than a hundred times. In comparison,
98% of the victims under WLRU replacement are not hit at all, and there are almost no victims
whose hit counts are more than a hundred. The distribution of hit counts of victims
makes the cache flushing effect of LRU replacement obvious. LRU is not good at
keeping high value cache contents.
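The power-of-two bucketing used in the victim hit-count tables can be reproduced with a short helper. The sample victim list in the usage note is made up for illustration, not taken from the thesis traces.

```python
from collections import Counter

def hit_count_bins(victim_hits, max_bin=1024):
    """Bucket victim hit counts the way tables 8.2 and 8.3 do:
    exact bins for 0, 1 and 2, then the smallest power of two
    >= the count (4, 8, ..., max_bin)."""
    bins = Counter()
    for h in victim_hits:
        if h <= 2:
            bins[h] += 1
        else:
            b = 4
            while b < h and b < max_bin:
                b *= 2  # climb to the smallest power of two >= h
            bins[b] += 1
    return bins
```

For example, hit_count_bins([0, 0, 1, 3, 9, 100]) puts the two never-hit victims in bin 0, the count of 3 in bin 4, 9 in bin 16, and 100 in bin 128, matching the column layout of the tables.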
Table 8.3 shows the distribution of victim hit counts of the SPEC benchmark crafty. The distributions
of hit counts of other SPEC benchmarks can be found in appendix A. Although WLRU
keeps high reference count addresses longer, table 8.3 shows that the distributions of victim
hit counts of WLRU and LRU are quite similar. This is because SPEC benchmarks such as
crafty have different temporal locality than web servers. Re-references to the same address
happen within a short time. This is reflected in the L2 IRG distributions of SPEC benchmarks:
SPEC benchmarks have fewer medium-sized L2 IRG gaps than the web
server traces. The IRG values are so small that both LRU and WLRU can result in cache
hits for these IRG values. There is no additional value in WLRU keeping a previously hit
cache line longer than LRU does. For some SPEC benchmarks, WLRU has slightly higher miss
rates. This is because the small initial weight of WLRU evicts new cache contents too early.
The idle time and cache stay time of victims are important indicators of the accuracy of a
replacement algorithm. The idle time is defined as the time span between the last hit or reference
to a cache line and its eviction. The stay time is the time between a cache line being
loaded and evicted. Using the distributions of idle time and stay time of the optimal replacement
as a target, better replacement algorithms should have smaller differences in idle
time and stay time from the optimal replacement. Figures 8.11 and 8.12 present the distributions
of idle time and stay time of LRU, WLRU and OPT on web server trace t20kr50.
Figures 8.13 and 8.14 show the distributions of idle time and stay time of the LRU, WLRU and
OPT replacement algorithms on web server trace t20kr90. The CPU cache configuration
in the figures is a two-level cache with a two-way 32KB L1 cache and a 16-way 128KB L2
cache. The weight formula of the WLRU replacement in the figures is i64r256b4. The idle
time and stay time statistics for other cache configurations and memory traces are provided
in appendix A.
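The two definitions above translate directly into bookkeeping over a cache simulator's event stream. The sketch below assumes the simulator can emit (time, tag, kind) events with time counted in references, as on the figures' axes; the event format itself is hypothetical.

```python
def victim_times(events):
    """Compute (stay_time, idle_time) per victim from cache events.

    events: (time, tag, kind) tuples, kind in {"load", "hit", "evict"},
    time counted in references. Per the definitions in the text:
    stay time = eviction - load; idle time = eviction - last reference.
    """
    loaded, last_ref, out = {}, {}, []
    for t, tag, kind in events:
        if kind == "load":
            loaded[tag] = last_ref[tag] = t  # loading counts as a reference
        elif kind == "hit":
            last_ref[tag] = t
        elif kind == "evict":
            out.append((t - loaded.pop(tag), t - last_ref.pop(tag)))
    return out
```

For a line loaded at reference 0, last hit at reference 3 and evicted at reference 10, this yields a stay time of 10 and an idle time of 7; histogramming these pairs gives distributions of the kind shown in figures 8.11 through 8.16.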
For web server traces, the idle time and stay time distributions of WLRU are much closer
to the distributions of the optimal replacement than the idle time and stay time distributions
of LRU are. The idle times of addresses under LRU replacement are packed around 32 references.
This is the result of LRU replacement keeping an address in the cache for at least a time
span equal to the associativity of the cache set. This minimal stay time of LRU is too long,
since a large number of addresses will not be re-referenced. As the associativity of modern
CPU caches tends to increase, LRU keeps the addresses that will not be re-referenced even
longer. The victim analysis shows that WLRU can limit the stay time of the not-to-be-re-referenced
addresses.
However, for SPEC benchmarks, the difference between the distributions of idle time and
stay time of victims of LRU and WLRU is not manifested as much as it is in the distributions
of web server memory traces. Figures 8.15 and 8.16 present the distributions of idle time
LRU (total victims: 388276)
Hit count        0      1      2      4     8    16    32    64   128   256   512  1024
#Victims    302502  47276  13282  11237  9359  3011  1051   357   131    50    17     3
%            77.91  12.18   3.42   2.89  2.41  0.78  0.27  0.09  0.03  0.01   0.0   0.0

WLRU (total victims: 231291)
Hit count        0      1      2      4     8    16    32    64   128   256   512  1024
#Victims    228430   1178    318    258   504   319   205    65    12     2     0     0
%            98.76   0.51   0.14   0.11  0.22  0.14  0.09  0.03  0.01     0     0     0
Table 8.2: The distribution of victim hit counts of the WLRU and LRU replacements on network trace t20kr50.
LRU (total victims: 60121)
Hit count       0     1     2     4     8    16    32    64   128   256   512
#Victims    47007  4364   803  1125  1176   824   357   313    40    12     4
%           78.19  7.26  1.34  1.87  1.96  1.37  0.59  0.52  0.07  0.02  0.01

WLRU (total victims: 63572)
Hit count       0     1     2     4     8    16    32    64   128   256   512
#Victims    51461  3942   952  1000   901   560   303   290    45    14     8
%           80.95  6.20  1.50  1.57  1.42  0.88  0.48  0.46  0.07  0.02  0.01
Table 8.3: The distribution of victim hit counts of the WLRU and LRU replacements on SPEC benchmark crafty.
Figure 8.11: The distributions of idle time of WLRU, LRU and OPT replacements of network trace t20kr50.
and stay time of SPEC benchmark crafty. The figures show that the curves of idle and stay
time for the OPT, WLRU and LRU replacement algorithms have nearly the same shape.
The OPT curve is leftmost, the LRU curve is rightmost, and the WLRU curve lies slightly
to the left of the LRU curve. This is the result of SPEC benchmark crafty having
good locality. The WLRU and LRU replacement algorithms are very similar to the optimal
replacement algorithm. Both WLRU and LRU have almost the same hit rates, which
are very close to the optimal hit rate.
The idle time and stay time distributions of the OPT replacement algorithm for the SPEC
benchmark crafty are very different from those of the web server memory traces t20kr50 and
t20kr90. For web server memory traces, the idle time and stay time distributions of the optimal
replacement are much flatter than those of the SPEC benchmark. This is because web server traces
have more addresses with medium-sized IRG values (see chapter 4).
Victim analysis is a very powerful tool for studying the details of a cache replacement algorithm.
It also reveals information about the temporal locality of programs. It
has been observed for a long time that the locality of network protocols and applications is
different from that of other computation intensive programs such as SPEC benchmarks (see chapter
2). However, no previous studies showed detailed differences. Victim analysis shows that the
curves of the distributions of victim idle and stay time are very different for web servers
and SPEC benchmarks. Victim analysis shows that LRU is near optimal for SPEC benchmarks.
This will affect the micro-architecture design philosophies of CPUs targeting SPEC
benchmarks.

Figure 8.12: The distributions of stay time of WLRU, LRU and OPT replacements of network trace t20kr50.
8.6 Summary
WLRU shows significant improvements over LRU for web server memory traces. It is found
that the best weight formulas for web server memory traces are those with a small initial
weight and large weight increments and upper limits of weights. An example of such weight
formulas is i64r256b4. In the best cases, WLRU i64r256b4 has more than 50% fewer L2
cache misses than LRU for web servers. For SPEC CPU2000 benchmarks, the difference
between WLRU i64r256b4 and LRU is small. If WLRU uses weight formulas with higher initial
weights, such as i64r256b32, the hit rates of WLRU on SPEC CPU2000 benchmarks
improve. The advantage of WLRU is that the speed-up of WLRU for web servers does not
harm traditional applications such as the SPEC CPU2000 benchmarks. Simulation on synthesized
multi-threading memory traces shows WLRU consistently has nearly 25% fewer L2
cache misses.
Victim analysis shows that the distributions of hit counts, stay time and idle time of victims
of WLRU are closer to the distributions of victims of OPT than those of LRU are. Victim analysis
Figure 8.13: The distributions of idle time of WLRU, LRU and OPT replacements of network trace t20kr90.
also shows that the distributions of hit counts, stay time and idle time of victims of the OPT
replacement algorithm are different for SPEC benchmarks and web server memory traces.
This implies that web servers have different temporal locality patterns than SPEC benchmarks.
SPEC benchmarks have better temporal locality than web servers. This is reflected in the
distributions of victim stay time and idle time of the OPT replacement. The distributions of the web
servers are flat, but the distributions of the SPEC benchmarks clearly show a peak.
WLRU does not show as large an improvement for SPEC CPU2000 benchmarks as it
does for web servers. The difference in the hit rates of WLRU and LRU for SPEC CPU2000
benchmarks is small, with WLRU slightly better than LRU for half of the SPEC CPU2000
benchmarks. Although the use of WLRU results in higher miss rates for some of the SPEC
CPU2000 benchmarks, it was observed that the miss rate for the optimal replacement algorithm
is not that much lower. It is observed that WLRU on SPEC CPU2000 benchmarks
results in better hit rates compared to other alternative
replacement algorithms. For example, the improvements in hit rates of WLRU on SPEC
CPU2000 benchmarks are better than the hit rates reported for the AIP/LvP algorithms [KS05]. The
experiments in [KS05] used L2 caches that were 512 KB and 1 MB in size. In this work the
results are based on using L2 caches of 128 KB and 256 KB. Since the primary shortcoming
of WLRU on SPEC CPU2000 benchmarks is the small initial weight, if given larger
L2 cache sizes, new cache lines can stay longer with small initial weights, and WLRU can
Figure 8.14: The distributions of stay time of WLRU, LRU and OPT replacements of network trace t20kr90.
show even more improvement.
Figure 8.15: The distributions of idle time of WLRU, LRU and OPT replacements of SPEC benchmark crafty.
Figure 8.16: The distributions of stay time of WLRU, LRU and OPT replacements of SPEC benchmark crafty.
Chapter 9
Conclusions and Future Research
9.1 Conclusions
In this work, empirical analysis is applied to the memory reference traces of SPEC CPU2000
benchmarks and web servers. The existence of the property of temporal locality
is supported by the distribution of per set IRG values. The majority of per set IRG values
are very small. This is especially true at the L1 cache, where 90% of per set IRG values are of
size one. At the L2 cache, per set IRG values are still small. This provides strong evidence
of temporal locality. However, the study presented in this work also shows that a large portion
of addresses have low reference counts. At the L2 cache, nearly 50% of addresses are
referenced only once, and nearly 90% are referenced under ten times. This pattern is called
the property of short lifetime. The property of short lifetime suggests that LRU cache replacement
is not optimal for programs that have a large number of addresses with low reference
counts. One class of such applications is network applications and protocols, such as web
servers. The distributions of reference counts and per set IRG values of web server traces
are found to be different from the distributions of SPEC benchmarks. The L2 IRG distributions
of web server memory traces have fewer small IRG values and more medium-sized
IRG values. This analysis led to the development of a new CPU cache replacement algorithm called
WLRU.
WLRU addresses the shortcoming of LRU by differentiating addresses. It is found that the
per set IRG values of an address are correlated with the reference count of the address. Addresses
that have higher reference counts also have more IRG gaps of small size, and large
IRG gaps are more likely to be associated with addresses that have low reference counts.
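This correlation can be checked mechanically from a trace. The sketch below gathers, per cache set, each address tag's reference count and its IRG values (gaps counted in references to the same set); the modulo set-index mapping and the parameter defaults are illustrative, not the thesis's exact cache configuration.

```python
from collections import defaultdict

def per_set_irg(trace, num_sets=64, line_size=64):
    """Per cache set, map each tag to [reference_count, irg_list, last_pos].
    An IRG is the number of intervening references to the same set since
    the tag's previous reference."""
    stats = defaultdict(dict)  # set index -> {tag: [count, irgs, last_pos]}
    pos = defaultdict(int)     # references seen so far, per set
    for addr in trace:
        line = addr // line_size
        s, tag = line % num_sets, line // num_sets
        entry = stats[s].setdefault(tag, [0, [], None])
        entry[0] += 1
        if entry[2] is not None:
            entry[1].append(pos[s] - entry[2])  # gap since previous reference
        entry[2] = pos[s]
        pos[s] += 1
    return stats
```

Plotting reference counts against the IRG lists this produces would reproduce the kind of correlation described above: high-count tags dominated by small gaps, low-count tags by large ones.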
The correlation of reference counts and IRG values suggests the basis of WLRU replacement.
Good cache value addresses show themselves quickly. Thus, if an address is not
referenced again within a short time after it is brought into the cache, it may never be re-referenced.
If an address is hit quickly after being brought into the cache, it is likely to be
hit repeatedly. WLRU distinguishes an address by its number of hits in a short time. Addresses
that are hit in a short time after being brought into the cache are kept in the cache,
and addresses that are not hit in a short time after being brought into the cache are evicted
fast.
Trace based simulations show that for SPEC benchmarks, WLRU has hit rates as good as
LRU. For web server traces, WLRU shows significant improvements over LRU in hit rates.
In the best scenario, WLRU has more than 50% fewer cache misses than LRU. The huge
speed up of WLRU on web servers does not harm traditional applications such as SPEC
benchmarks. Simulations on synthesized multi-threading memory traces show that WLRU
consistently has significant improvements over LRU for all SPEC benchmarks in multi-threading
environments.
WLRU is better than LRU for web server traces, but WLRU still has around 30% more
cache misses than the off-line optimal cache replacement. There is still room for better CPU
cache replacements and designs.
9.2 Future Research
The number of simulation experiments in Chapter 8 is not even near full-factorial. The best
WLRU weight formula for each application domain is related to the CPU cache configuration,
such as the cache associativity and cache size, especially the cache size. In future
research, we plan to do more experiments to reveal the relation between the WLRU weight
formula and CPU cache configurations. We will also compare the hit rates of WLRU with more
replacement algorithms, such as those mentioned in Chapter 5, for various workloads.
In future research, we also plan to investigate the locality characteristics of more application
domains, such as multimedia, graphics, and more network equipment and applications.
A hardware implementation of WLRU is also planned. More importantly, we will conduct
research to develop more cache-friendly software systems, such as operating systems and application
algorithms.
9.2.1 Hardware Prototype
Trace based simulation is unavoidably the first step to evaluate WLRU CPU cache replace-
ment. We are planning to implement WLRU CPU cache in FPGA. Current FPGA works
at frequency of 500 Mhz, and the cache miss penalty is around 50. Compared with cache
miss penalties of high end CPUs, which are several hundreds,the cache miss penalty of
FPGA system is small but still large enough to manifest the difference between the hit rates
of WLRU and LRU replacements. We plan to test the throughput and responsiveness of
web servers and other network systems on the FPGA system.
9.2.2 Locality Analysis for More Application Domains
The analysis of reference counts and per set IRG values reveals the difference in locality
between web servers and SPEC CPU2000 benchmarks. We plan to apply the analysis of locality
to more application domains and compare the hit rates of WLRU and LRU for these
domains. Application domains that will be further explored include database
servers, search engines, network file servers and DNS servers. The difficulty in studying
these domains is that trace based simulation is inappropriate. An FPGA prototype CPU will
be the primary means of investigation. We plan to run TPC-C (http://www.tpc.org/) benchmarks on the FPGA
prototype CPU. The hit rates of WLRU and LRU will be compared. We will test the hit
rates of different WLRU weight formulas and CPU cache configurations.
Besides network servers, network equipment, such as routers, and wireless and handset equipment
are interesting topics for locality analysis. Network equipment used to be built on
DSPs and ASICs. Increasing management requirements and security challenges are changing
the design philosophy of network equipment: there is more and more software in network
equipment. For wireless equipment, higher security demands are driving the use
of more complex and more computationally intensive encryption and management software.
On handset equipment, video and game applications are gaining popularity. Not
surprisingly, CPU caches will become the performance bottleneck. We plan to investigate
the CPU cache performance of this equipment, apply locality analysis to it,
and examine the possibility of using WLRU caches in it to improve the CPU
cache hit rates.
9.2.3 OS and Algorithm Design Issues
The importance of the CPU cache to the performance of applications is recognized in the design
of algorithms. Cache-oblivious algorithms [FLPR99] are one example of an effort to prevent
the bad impact of CPU caches. There is a good deal of other research with a goal of designing
better cached programs (e.g. [CHL99]). However, the cache models this research is based on
involve only LRU replacement. The findings in this work will help in understanding the CPU
cache requirements of operations and provide more accurate cache models and guidelines
for designing new algorithms.
The findings in this work and the WLRU replacement change one of the most basic constraints
of computer architecture. This will cause re-evaluations of CPU design trade-offs and approaches.
Because of WLRU's built-in mechanism against cache pollution [GC90],
more memory bandwidth is possible by using larger cache lines and more aggressive prefetching.
The advantage of WLRU replacement for multi-threading is also an interesting
research topic.
WLRU cache replacement will also have an impact on the design of operating systems. In
chapter 4, it is seen that web memory trace a200kr50 has by far the best temporal locality
and the best CPU cache hit rates. Web server trace a200kr50 shows that minor changes in
the scheduling of OS operations cause huge differences in the temporal locality and thus
the CPU cache performance of server applications. We plan to investigate the details of the
great difference in the temporal localities of OS scheduling policies. We will also investigate
the CPU cache cost of OS operations under the WLRU replacement. Under WLRU replacement,
the scheduling policy of the operating system may differ from that under
LRU replacement. CPU cache gains will be one of the most important considerations of
OS scheduling and management strategies. Web server memory trace a200kr50 shows there
is great potential in CPU cache aware OS designs.
The discovery of the property of short lifetime improves our understanding of the locality
of programs. WLRU exploits the property of short lifetime. Compared with other approaches
to improving CPU cache performance, such as re-writing the software for better
cache hit rates, a WLRU CPU cache is simpler and of much lower cost. Since the hit rates
of WLRU are close to those of the OPT replacement algorithm, changing the software seems unavoidable
for further improvement. Since the WLRU cache replacement algorithm has more parameters to fine-tune than LRU
has, WLRU can better support the re-writing of software for higher hit rates.
Appendix A
Analysis and Simulation Results
The attached CD contains the analysis and simulation results not described in the main body
of the dissertation. The materials on the CD are organized into three directories:
locality-analysis, hit-rates and victim-analysis. In each directory, there is a
README file describing the structure of the directory.
The directory locality-analysis contains the results of the statistical studies of the SPEC CPU2000
benchmarks and web server memory traces. The distributions of reference counts and per
set IRG values are provided in this directory. The distributions include both the L1 cache
and the L2 cache.
The directory hit-rates contains the simulation results of the SPEC CPU2000 benchmarks and web
server memory traces. Multi-threading simulation results are also provided. This directory
has three sub-directories for the SPEC CPU2000 hit rates, web server results and
multi-threading simulations. The hit rates of web servers are grouped by CPU cache configuration.
The last first-level directory is victim-analysis. The results of the victim analysis of the WLRU,
LRU and OPT replacement algorithms are provided in this directory. The subdirectories are
arranged by CPU cache configuration; each subdirectory represents one configuration of
CPU caches. The victim analysis results of the WLRU, LRU and OPT replacement algorithms
are stored under the same subdirectory.
References
[ADU71] Alfred V. Aho, Peter J. Denning, and Jeffrey D. Ullman. Principles of optimal page replacement. J. ACM, 18(1):80–93, 1971.

[ASW+93] Santosh G. Abraham, Rabin A. Sugumar, Daniel Windheiser, B. R. Rau, and Rajiv Gupta. Predictability of load/store instruction latencies. In Proceedings of the 26th Annual International Symposium on Microarchitecture, pages 139–152. IEEE Computer Society Press, 1993.

[AZMM04] Hussein Al-Zoubi, Aleksandar Milenkovic, and Milena Milenkovic. Performance evaluation of cache replacement policies for the SPEC CPU2000 benchmark suite. In Proceedings of the 42nd Annual Southeast Regional Conference, pages 267–272. ACM Press, 2004.

[BAYT01] Abdel-Hameed A. Badawy, Aneesh Aggarwal, Donald Yeung, and Chau-Wen Tseng. Evaluating the impact of memory system performance on software prefetching and locality optimizations. In Proceedings of the 15th International Conference on Supercomputing, pages 486–500. ACM Press, 2001.

[Bla96] Trevor Blackwell. Speeding up protocols for small messages. In Conference Proceedings on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 85–95. ACM Press, 1996.

[CDL99] Trishul M. Chilimbi, Bob Davidson, and James R. Larus. Cache-conscious structure definition. In Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, pages 13–24. ACM Press, 1999.

[CH01] Jason F. Cantin and Mark D. Hill. Cache performance for selected SPEC CPU2000 benchmarks. SIGARCH Comput. Archit. News, 29(4):13–18, 2001.

[CH02] Trishul M. Chilimbi and Martin Hirzel. Dynamic hot data stream prefetching for general-purpose programs. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, pages 199–209. ACM Press, 2002.
[CHL99] Trishul M. Chilimbi, Mark D. Hill, and James R. Larus. Cache-conscious structure layout. In Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, pages 1–12. ACM Press, 1999.

[CJRS89] David D. Clark, Van Jacobson, John Romkey, and Howard Salwen. An analysis of TCP processing overhead. IEEE Communications Magazine, 27(6), June 1989.

[CMT94] Steve Carr, Kathryn S. McKinley, and Chau-Wen Tseng. Compiler optimizations for improving data locality. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 252–262. ACM Press, 1994.

[Den68] Peter J. Denning. The working set model for program behavior. Commun. ACM, 11(5):323–333, 1968.

[dLJ03] P. J. de Langen and B. H. H. Juurlink. Reducing conflict misses in caches. In Proceedings of the 14th Annual Workshop on Circuits, Systems and Signal Processing, ProRisc 2003, pages 505–510, November 2003.

[EH84] Wolfgang Effelsberg and Theo Haerder. Principles of database buffer management. ACM Trans. Database Syst., 9(4):560–595, 1984.

[EK89] S. J. Eggers and R. H. Katz. The effect of sharing on the cache and bus performance of parallel programs. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 257–270. ACM Press, 1989.

[FH05] Michael J. Flynn and Patrick Hung. Microprocessor design issues: Thoughts on the road ahead. IEEE Micro, 25(3):16–31, 2005.

[FLPR99] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-oblivious algorithms. In FOCS, pages 285–298, 1999.

[FNAG92] J. Kelly Flanagan, Brent E. Nelson, James K. Archibald, and Knut Grimsrud. BACH: BYU Address Collection Hardware, the collection of complete traces. In Proc. of the 6th Int. Conf. on Modelling Techniques and Tools for Computer Performance Evaluation, pages 128–137, 1992.

[FTP94] M. Farrens, G. Tyson, and A. R. Pleszkun. A study of single-chip processor/cache organizations for large numbers of transistors. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 338–347. IEEE Computer Society Press, 1994.
[GC90] Rajiv Gupta and Chi-Hung Chi. Improving instruction cache behavior by reducing cache pollution. In Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, pages 82–91. IEEE Computer Society, 1990.

[Goo83] James R. Goodman. Using cache memory to reduce processor-memory traffic. In ISCA '83: Proceedings of the 10th Annual International Symposium on Computer Architecture, pages 124–131, Los Alamitos, CA, USA, 1983. IEEE Computer Society Press.

[Hen00] John L. Henning. SPEC CPU2000: Measuring CPU performance in the new millennium. Computer, 33(7):28–35, 2000.

[Hig90] Lee Higbie. Quick and easy cache performance analysis. SIGARCH Comput. Archit. News, 18(2):33–44, 1990.

[Hil87] Mark Donald Hill. Aspects of cache memory and instruction buffer performance. PhD thesis, University of California, Berkeley, 1987.

[Hil88] Mark D. Hill. A case for direct-mapped caches. Computer, 21(12):25–40, 1988.

[HKM02] Zhigang Hu, Stefanos Kaxiras, and Margaret Martonosi. Timekeeping in the memory system: predicting and optimizing memory behavior. In ISCA '02: Proceedings of the 29th Annual International Symposium on Computer Architecture, pages 209–220, Washington, DC, USA, 2002. IEEE Computer Society.

[HP96] John L. Hennessy and David A. Patterson. Computer Architecture (2nd ed.): A Quantitative Approach. Morgan Kaufmann Publishers Inc., 1996.

[HP02] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002.

[HR00] Erik G. Hallnor and Steven K. Reinhardt. A fully associative software-managed cache design. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 107–116. ACM Press, 2000.

[HS89] M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Trans. Comput., 38(12):1612–1630, 1989.

[Int04] Intel. The Microarchitecture of the Pentium 4 Processor, May 2004.

[Jac03] B. Jacob. A case for studying DRAM issues at the system level. IEEE Micro, 23(4):44–56, 2003.
[JmWH97] Teresa L. Johnson and Wen-mei W. Hwu. Run-time adaptive cache hierarchy management via reference analysis. In Proceedings of the 24th annual international symposium on Computer architecture, pages 315–326. ACM Press, 1997.

[Jou98] Norman P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In 25 years of the international symposia on Computer architecture (selected papers), pages 388–397. ACM Press, 1998.

[JS94] Theodore Johnson and Dennis Shasha. 2Q: A low overhead high performance buffer management replacement algorithm. In VLDB '94: Proceedings of the 20th International Conference on Very Large Data Bases, pages 439–450. Morgan Kaufmann Publishers Inc., 1994.

[JZ02] Song Jiang and Xiaodong Zhang. LIRS: an efficient low inter-reference recency set replacement policy to improve buffer cache performance. In Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 31–42. ACM Press, 2002.

[KBK02] Changkyu Kim, Doug Burger, and Stephen W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In Proceedings of the 10th international conference on Architectural support for programming languages and operating systems (ASPLOS-X), pages 211–222. ACM Press, 2002.

[KS02] Gokul B. Kandiraju and Anand Sivasubramaniam. Going the distance for TLB prefetching: an application-driven study. In Proceedings of the 29th annual international symposium on Computer architecture, pages 195–206. IEEE Computer Society, 2002.

[KS05] Mazen Kharbutli and Yan Solihin. Counter-based cache replacement algorithms. In ICCD '05: Proceedings of the 2005 International Conference on Computer Design, pages 61–68, Washington, DC, USA, 2005. IEEE Computer Society.

[LCK+01] D. Lee, J. Choi, J. H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim. LRFU: A spectrum of policies that subsumes the least recently used and least frequently used policies. IEEE Trans. Comput., 50(12):1352–1361, 2001.

[LFF01] An-Chow Lai, Cem Fide, and Babak Falsafi. Dead-block prediction & dead-block correlating prefetchers. In Proceedings of the 28th annual international symposium on Computer architecture, pages 144–154, 2001.
[LKW02] Jung-Hoon Lee, Shin-Dug Kim, and Charles Weems. Application-adaptive intelligent cache memory system. Trans. on Embedded Computing Sys., 1(1):56–78, 2002.

[LP98] Butler W. Lampson and Kenneth A. Pier. A processor for a high-performance personal computer. In 25 years of the international symposia on Computer architecture (selected papers), pages 180–194. ACM Press, 1998.

[MCE+02] Peter S. Magnusson, Magnus Christensson, Jesper Eskilson, Daniel Forsgren, Gustav Hållberg, Johan Högberg, Fredrik Larsson, Andreas Moestedt, and Bengt Werner. Simics: A full system simulation platform. Computer, 35(2):50–58, 2002.

[MCFT99] Nicholas Mitchell, Larry Carter, Jeanne Ferrante, and Dean Tullsen. ILP versus TLP on SMT. In Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM), page 37. ACM Press, 1999.

[MGS+70] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Systems Journal, 9(2):78–117, 1970.

[MPBO96] David Mosberger, Larry L. Peterson, Patrick G. Bridges, and Sean O'Malley. Analysis of techniques to improve protocol processing latency. In Conference proceedings on Applications, technologies, architectures, and protocols for computer communications, pages 73–84. ACM Press, 1996.

[NYKT97] Erich Nahum, David Yates, Jim Kurose, and Don Towsley. Cache behavior of network protocols. In Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 169–180. ACM Press, 1997.

[OOW93] Elizabeth J. O'Neil, Patrick E. O'Neil, and Gerhard Weikum. The LRU-K page replacement algorithm for database disk buffering. In Proceedings of the 1993 ACM SIGMOD international conference on Management of data, pages 297–306. ACM Press, 1993.

[PCD+01] P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandercappelle, and P. G. Kjeldsberg. Data and memory optimization techniques for embedded systems. ACM Trans. Des. Autom. Electron. Syst., 6(2):149–206, 2001.

[PG95] Vidyadhar Phalke and Bhaskarpillai Gopinath. An inter-reference gap model for temporal locality in program behavior. In Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems, pages 291–300. ACM Press, 1995.
[PH90a] Karl Pettis and Robert C. Hansen. Profile guided code positioning. In Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation, pages 16–27. ACM Press, 1990.

[PH90b] Dionisios N. Pnevmatikatos and Mark D. Hill. Cache performance of the integer SPEC benchmarks on a RISC. SIGARCH Comput. Archit. News, 18(2):53–68, 1990.

[PH05] David A. Patterson and John L. Hennessy. Computer organization & design: the hardware/software interface. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.

[PHH88] S. Przybylski, M. Horowitz, and J. Hennessy. Performance tradeoffs in cache design. In Proceedings of the 15th Annual International Symposium on Computer architecture, pages 290–298. IEEE Computer Society Press, 1988.

[PHS98] Jih-Kwon Peir, Windsor W. Hsu, and Alan J. Smith. Implementation issues in modern cache memory. Technical report, Berkeley, CA, USA, 1998.

[PK94] S. Palacharla and R. E. Kessler. Evaluating stream buffers as a secondary cache replacement. In Proceedings of the 21st annual international symposium on Computer architecture, pages 24–33. IEEE Computer Society Press, 1994.

[Prz90] Steven A. Przybylski. Cache and memory hierarchy design: a performance-directed approach. Morgan Kaufmann Publishers Inc., 1990.

[Quo94] R. W. Quong. Expected i-cache miss rates via the gap model. In Proceedings of the 21st annual international symposium on Computer architecture, pages 372–383. IEEE Computer Society Press, 1994.

[RBC02] Shai Rubin, Rastislav Bodík, and Trishul Chilimbi. An efficient profile-analysis framework for data-layout optimizations. In Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 140–153. ACM Press, 2002.

[RD90] John T. Robinson and Murthy V. Devarakonda. Data cache management using frequency-based replacement. In Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems, pages 134–142. ACM Press, 1990.

[RSG93] Edward Rothberg, Jaswinder Pal Singh, and Anoop Gupta. Working sets, cache sizes, and node granularity issues for large-scale multiprocessors. In Proceedings of the 20th annual international symposium on Computer architecture, pages 14–26. ACM Press, 1993.
[RTT+98] Jude A. Rivers, Edward S. Tam, Gary S. Tyson, Edward S. Davidson, and Matt Farrens. Utilizing reuse information in data cache management. In Proceedings of the 12th international conference on Supercomputing, pages 449–456. ACM Press, 1998.

[SA93] Rabin A. Sugumar and Santosh G. Abraham. Efficient simulation of caches under optimal replacement with applications to miss characterization. In Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems, pages 24–35. ACM Press, 1993.

[SKT96] James D. Salehi, James F. Kurose, and Don Towsley. The effectiveness of affinity-based scheduling in multiprocessor network protocol processing (extended version). IEEE/ACM Transactions on Networking, 4(4):516–530, August 1996.

[SKW99] Yannis Smaragdakis, Scott Kaplan, and Paul Wilson. EELRU: simple and effective adaptive page replacement. In SIGMETRICS '99: Proceedings of the 1999 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 122–133. ACM Press, 1999.

[Smi82] Alan Jay Smith. Cache memories. ACM Comput. Surv., 14(3):473–530, 1982.

[Smi87] A. J. Smith. Line (block) size choice for CPU cache memories. IEEE Trans. Comput., 36(9):1063–1076, 1987.

[SS01] Nathan T. Slingerland and Alan Jay Smith. Cache performance for multimedia applications. In Proceedings of the 15th international conference on Supercomputing, pages 204–217. ACM Press, 2001.

[ST85] Daniel D. Sleator and Robert E. Tarjan. Amortized efficiency of list update and paging rules. Commun. ACM, 28(2):202–208, 1985.

[Sug93] Rabin A. Sugumar. Multi-Configuration Simulation Algorithms for the Evaluation of Computer Architecture Designs. PhD thesis, University of Michigan, 1993. Technical Report CSE-TR-173-93, with Santosh G. Abraham.

[Tan87] Andrew S. Tanenbaum. Operating systems: design and implementation. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1987.

[TFMP95] Gary Tyson, Matthew Farrens, John Matthews, and Andrew R. Pleszkun. A modified approach to data cache management. In Proceedings of the 28th annual international symposium on Microarchitecture, pages 93–103. IEEE Computer Society Press, 1995.

[Tho03] Mark Thorson. Internet nuggets. SIGARCH Comput. Archit. News, 31(4):26–32, 2003.
[Tor94] J. Torrellas. False sharing and spatial locality in multiprocessor caches. IEEE Trans. Comput., 43(6):651–663, June 1994.

[UM97] Richard A. Uhlig and Trevor N. Mudge. Trace-driven memory simulation: a survey. ACM Comput. Surv., 29(2):128–170, 1997.

[VL00] Steven P. Vanderwiel and David J. Lilja. Data prefetch mechanisms. ACM Comput. Surv., 32(2):174–199, 2000.

[VTG+99] Alexander V. Veidenbaum, Weiyu Tang, Rajesh Gupta, Alexandru Nicolau, and Xiaomei Ji. Adapting cache line size to application behavior. In Proceedings of the 13th international conference on Supercomputing, pages 145–154. ACM Press, 1999.

[WM95] Wm. A. Wulf and Sally A. McKee. Hitting the memory wall: implications of the obvious. SIGARCH Comput. Archit. News, 23(1):20–24, 1995.

[ZCL04] Yuanyuan Zhou, Zhifeng Chen, and Kai Li. Second-level buffer cache management. IEEE Trans. Parallel Distrib. Syst., 15(6):505–519, 2004.