Address Translation Optimizations for Chip Multiprocessors
by
Misel-Myrto Papadopoulou
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering, University of Toronto
© Copyright 2017 by Misel-Myrto Papadopoulou
Abstract
Address Translation Optimizations for Chip Multiprocessors
Misel-Myrto Papadopoulou
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2017
Address translation is an essential part of current systems. Getting the virtual-to-physical
mapping of a page is a time-sensitive operation that precedes the vast majority of memory
accesses, be it for data or instructions. The growing memory footprints of current workloads,
as well as the proliferation of chip multiprocessor systems with a variety of shared on-chip
resources create both challenges and opportunities for address translation research. This thesis
presents an in-depth analysis of the TLB-related behaviour of a set of commercial and cloud
workloads. This analysis highlights workload nuances that can influence address translation’s
performance, as well as shortcomings of current designs. This thesis presents two architectural
proposals that both support our thesis that TLB designs and policies need not be rigid, but
should instead dynamically adapt to the workloads’ behaviour for a judicious use of the available
on-chip resources.
The Prediction-Based Superpage-Friendly TLB proposal leverages prediction to improve
energy and utilization of TLBs by allowing translations of different page sizes to coexist in a
set-associative (SA) structure. For example, a 256-entry 4-way SA TLBpred achieves better
coverage (7.7% fewer Misses Per Million Instructions) than a slower 128-entry fully-associative TLB. It also has the energy efficiency of a much smaller structure. This design uses
a highly accurate superpage predictor that achieves a 0.4% average misprediction rate with a
meager 32B of storage.
The Forget-Me-Not TLB (FMN) proposal utilizes existing cache capacity to store translation entries and thus reduce TLB-miss handling latency. A per-core private 1024-entry direct-mapped FMN reduces the average L1-TLB miss latency across all simulated workloads by 31.4% over a baseline with only L1-TLBs. In contrast, a dedicated 1024-entry 8-way SA L2-TLB reduces it by only 24.6% and, in some cases, degrades performance. We further
propose an L2-TLB bypassing mechanism to address this challenge.
Acknowledgements
This thesis is the culmination of many years of work and reflects a significant part of my Ph.D. journey, which would not have been possible to complete without the support, advice, and encouragement of many people.
First of all, I would like to thank my advisor, Professor Andreas Moshovos, for his support
throughout my Ph.D. studies, for his technical knowledge, his advice, and for giving me the
freedom to select my thesis topic. He has helped me develop and hone my research skills,
cultivate my critical thinking, and was instrumental in how I have matured during my Ph.D. I
sincerely appreciate how he wisely knew when it was time to actively support me and when it
was time to best support me by letting me stand on my own. I am also grateful he encouraged
me to teach, an experience that has deeply enriched the last few years of my Ph.D. studies.
I also owe many thanks to my Ph.D. committee members, Professors Natalie Enright Jerger,
Michael Stumm, Paul Chow, and Abhishek Bhattacharjee, for their input on this work, for the
care they took in reading my thesis and for their advice on how to improve it. I appreciate
them asking questions that challenged me and pushed me to think more about how the work we
do is relevant in the “big picture”. Natalie, especially, has been a significant source of support,
encouragement, and mentoring throughout my graduate studies. She would always be available
to listen, answer questions, and offer her advice. A simple thank you is not enough.
When I presented my Ph.D. proposal, Professor Greg Steffan was one of my original Ph.D.
committee members. I will always remember the excitement with which, in the discussion that
followed my Ph.D. proposal presentation years ago, he started thinking about all the research
opportunities TLBs offered in the CMP era. I wish he could be here today. This is but a small
candle lit in his memory.
I would also like to extend a thank you to all, current and former, graduate students from
the computer architecture groups with whom I have discussed research ideas, and shared the
ups and downs of my graduate studies. It was a pleasure to study and work alongside many of
you, and also see many of you grow in the process. From my early Ph.D. years, I would like
to extend special thanks to Ioana Baldini, for her unwavering support, advice, and friendship
since the beginning of my Ph.D., and to Jason Zebchuk, who was always generous with his advice and help with our simulation infrastructure, and who helped me get my hands dirty with the administration of our cluster. Many thanks to Henry Wong, Patrick
Judd, Jorge Albericio, Parisa Khadem Hamedani, Danyao Wang, and Elham Safi for all the
discussions, technical and otherwise, and for their support.
My thanks also to Andre Seznec and Xin Tong for their input on our HPCA paper.
I would also like to thank the administrative and technical staff in the ECE Department
for their help during my graduate studies, especially Ms. Kelly Chan, Ms. Darlene Gorzo and
everyone from the graduate office, as well as Ms. Jayne Leake from the undergraduate office.
My graduate research studies were further enriched by my internship in AMD Research,
and my teaching endeavours. I have very fond memories of my internship in AMD Research,
in Bellevue, WA, where I had the opportunity to work alongside not only great researchers
but also wonderful people. I am grateful to all of them for welcoming me and supporting me
during that time, especially to Lisa Hsu, my mentor, for her enthusiasm, support and insights
on my research work. During the last few years of my Ph.D. I was given the opportunity to
teach multiple sections of computer programming and computer organization courses in the
Computer Science Department at the University of Toronto. All faculty members and other instructors I have worked with during these past few years created a very welcoming and nurturing environment. I have learned a lot from them and have grown as an educator
beyond what would have been feasible otherwise.
Last but not least, this long journey would not have been possible without friends and
family, the community of people that make me feel at home, the kind of home you always carry
with you. To all my friends, my family, and all the people who have supported me: this journey
was possible and is more meaningful because of you.
It would be impossible to list everyone here, but I would like to extend my sincere gratitude
to Mrs. Vasso Mexis and her family who have warmly embraced me ever since I first came to
Toronto, to my friends Foteini, Irene, and Rena for their support during these past years, and
also, to my dear friends Debbie and Maria whose friendship dates back to our undergraduate
studies.
To my dad, Spyros, my aunt, Despoina, my grandmother, Maria, my sister, Maria, and her
husband, Panagiotis: a thank you will never be enough. I am forever grateful and indebted to you for your care, love, and encouragement, and for caring for my soul from an ocean and a continent away. You had faith in me and my abilities even in times when I would falter. You
are the ones who have enabled me to come so far. I always carry you with me, your love as
precious and dear as the almond trees that bloom in our garden in the midst of the winter.
Contents
1 Introduction 1
1.1 The Analysis of TLB-related Behaviour . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Prediction-Based Superpage-Friendly TLBs . . . . . . . . . . . . . . . . . . . . . 3
1.3 Forget-Me-Not TLB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Background and Related Work 6
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Background on Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Page Tables and Page Walks . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Translation Lookaside Buffers (TLBs) . . . . . . . . . . . . . . . . . . . . 10
2.2.3 The Cheetah-MMU in SPARC . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.3.1 MMU Registers and TLB Miss Handling . . . . . . . . . . . . . 13
2.2.3.2 TLB Organization and Replacement Policy . . . . . . . . . . . . 14
2.2.3.3 Special MMU Operations . . . . . . . . . . . . . . . . . . . . . . 14
2.2.4 Address Translation for I/O . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Literature Review of Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Techniques that Reduce TLB Misses . . . . . . . . . . . . . . . . . . . . . 16
2.3.1.1 TLB Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1.2 Shared or Distributed-Shared TLB Designs . . . . . . . . . . . . 19
2.3.1.3 Increasing TLB Reach . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 Techniques that Reduce TLB Miss Latency Overhead . . . . . . . . . . . 21
2.3.3 Techniques that Revisit Address Translation/Paging . . . . . . . . . . . . 22
2.3.4 Techniques that Reduce Address Translation Energy . . . . . . . . . . . . 23
2.3.5 Techniques that Address TLB Coherence Overheads . . . . . . . . . . . . 25
2.3.6 Techniques that Target I/O Address Translation . . . . . . . . . . . . . . 26
2.3.7 Architectural Optimizations that Take Advantage of Address Translation 26
2.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 TLB-related Behaviour Analysis 28
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.1 Characteristics Inherent to the Workload . . . . . . . . . . . . . . . . . . 29
3.1.2 Other Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Unique Translations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 Per-Core Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.2 CMP-Wide Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Contexts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.1 Context Count and Sharing Degree . . . . . . . . . . . . . . . . . . . . . . 35
3.4.2 Context Lifetimes (Within Execution Sample) . . . . . . . . . . . . . . . 38
3.4.3 Context Significance: Frequency and Reach . . . . . . . . . . . . . . . . . 40
3.4.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Translation Mappings Lifetime . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.1 Demap-Context Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.2 Demap-Page Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.3 TLB-Entry Modification Analysis . . . . . . . . . . . . . . . . . . . . . . . 45
3.5.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.6 TLB Capacity Sensitivity Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.6.1 Split L1-TLBs; One per Page-Size . . . . . . . . . . . . . . . . . . . . . . 47
3.6.2 Fully-Associative L1 TLB . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.6.3 Set-Associative L1-TLB for Small Pages and Fully-Associative L1-TLB
for Superpages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.6.4 L2-TLB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7 Compressibility and Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.8 The First Cache Block Access After A TLB-Miss . . . . . . . . . . . . . . . . . . 57
3.9 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4 Prediction-Based Superpage-Friendly TLB Designs 60
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Analysis of TLB-Related Workload Behavior . . . . . . . . . . . . . . . . . . . . 62
4.2.1 Unique Translations Analysis Recap . . . . . . . . . . . . . . . . . . . . . 62
4.2.2 TLB Miss Analysis and Access-Time/Energy Trade-Offs . . . . . . . . . . 63
4.2.3 Native x86 Runs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3 Page Size Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.1 Superpage Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.1.1 PC-based Predictor . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.1.2 Base Register-Value-Based (BRV-based) Predictor . . . . . . . . 67
4.4 Prediction-Guided Multigrain TLB . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.1 Supporting Other Page Size Usage Scenarios . . . . . . . . . . . . . . . . 71
4.4.1.1 Precise Page Size Prediction . . . . . . . . . . . . . . . . . . . . 71
4.4.1.2 Predicting Among Page Size Groups . . . . . . . . . . . . . . . . 72
4.4.2 Special TLB Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5 Skewed TLB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5.1 Prediction-Guided Skewed TLB . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.7.1 Superpage Prediction Accuracy . . . . . . . . . . . . . . . . . . . . . . . . 77
4.7.2 TLBpred Misses Per Million Instructions and Capacity Distribution . . . . 78
4.7.2.1 TLBpred Capacity Distribution . . . . . . . . . . . . . . . . . . . 79
4.7.3 Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.7.4 TLBskew and TLBpskew MPMI . . . . . . . . . . . . . . . . . . . . . . . . 81
4.7.5 Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.7.6 Sensitivity to the Page Size Access Distribution . . . . . . . . . . . . . . . 84
4.8 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.9 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5 The Forget-Me-Not TLB 90
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2 FMN’s Goal and Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2.1 FMN Operating Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3 FMN Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.1 Page Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.2 FMN’s Indexing Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.3 FMN’s Allocation and Replacement Policies . . . . . . . . . . . . . . . . . 96
5.4 Caching the FMN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.4.1 FMN Probes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.4.2 FMN Allocation Requests . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.5 Simulation Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.5.1 Simulation Challenges - Software-Managed TLBs in Simics . . . . . . . . 101
5.5.2 Timing Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.5.3 Page Walk Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.5.4 Discussion of Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.6 Reasoning about FMN’s Performance Potential . . . . . . . . . . . . . . . . . . . 108
5.7 Synthetic Memory Access Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.8 Baseline CMP Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.9 Sequential Page Access Patterns - A Case Study with Synthetic Traces . . . . . . 115
5.9.1 Impact of Workload’s Footprint on Baseline Configuration . . . . . . . . . 115
5.9.2 Effect of Per-Page Access Pattern on Baseline . . . . . . . . . . . . . . . . 118
5.9.3 Effect of Data Sharing on Baseline . . . . . . . . . . . . . . . . . . . . . . 119
5.9.4 Effect of Process Mix on Baseline . . . . . . . . . . . . . . . . . . . . . . . 120
5.9.5 Private FMNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.9.6 Private FMNs versus Private L2-TLBs . . . . . . . . . . . . . . . . . . . . 125
5.9.7 Private FMNs: Filtering Optimization . . . . . . . . . . . . . . . . . . . . 126
5.9.8 Private FMNs: Replacement Optimization . . . . . . . . . . . . . . . . . . 127
5.10 FMN’s Evaluation for Commercial Workloads . . . . . . . . . . . . . . . . . . . . 128
5.10.1 Impact of Address Translation on Baseline’s Performance . . . . . . . . . 129
5.10.2 FMN’s Impact on L1-TLB Miss Latency . . . . . . . . . . . . . . . . . . . 131
5.10.3 FMN’s Effect on Average Memory Latency . . . . . . . . . . . . . . . . . 132
5.10.4 FMN’s Effect on Performance . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.11 L2-TLB Bypassing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.11.1 Proposed Solution: Bypassing the L2-TLB . . . . . . . . . . . . . . . . . 136
5.11.2 L2-TLB Bypassing: Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 137
5.12 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6 Concluding Remarks 139
6.1 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Bibliography 142
List of Tables
2.1 Commercial D-TLB designs; all the L2-TLBs are unified except for the AMD
systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 List of characteristics/metrics inherent to the workload presented in this analysis
along with a brief explanation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Other Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Per-core unique translation characterization: 8KB pages. Footprint in MB is
listed in parentheses for the min., max. and avg. (arithmetic mean) columns.
SD is also expressed as a percentage of the average in parentheses. . . . . . . . . 33
3.5 Per-core unique translation characterization: Superpages (i.e., 64KB, 512KB
and 4MB pages). No 64KB pages were present. . . . . . . . . . . . . . . . . . . . 33
3.6 CMP-wide unique translation characterization: 8KB pages and Superpages. . . 35
3.7 Context 0: % TLB accesses and cumulative per core unique translation entries
across the entire CMP. See previous equations. . . . . . . . . . . . . . . . . . . . 41
3.8 Non-zero contexts: % TLB accesses and cumulative per core unique translation
entries across the entire CMP for PARSEC and Cloud workloads. . . . . . . . . 41
3.9 Translation Demap and Remap Operations (cumulative in the entire CMP). . . . 43
3.10 Unique characteristics of Demap-Page requests (per core). Values in parentheses
are for the entire CMP wherever different. . . . . . . . . . . . . . . . . . . . . . . 44
4.1 Commercial D-TLB Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 System Parameters for Native x86 Execution . . . . . . . . . . . . . . . . . . . . 65
4.3 Fraction of TLB Misses due to 2MB Superpages (x86) . . . . . . . . . . . . . . . 65
4.4 Primary TLBpred Lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.5 Secondary TLBpred Lookup Using a Binary Superpage Predictor. . . . . . . . . . 71
4.6 i-th TLB Lookup (1 < i ≤ N); N supported page sizes. . . . . . . . . . . . . . . . 72
4.7 Page Size Function described in Skewed TLB [69]. . . . . . . . . . . . . . . . . . 73
4.8 TLB Entry Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.9 Canneal Spin-Offs: Footprint Characterization . . . . . . . . . . . . . . . . . . . 85
5.1 TSB hit code in D-MMU Trap Handler (Solaris) . . . . . . . . . . . . . . . . . . 103
5.2 System Configuration Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3 L2-TLB Hit-Rate (%) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.4 L2-TLB Bypassing Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
List of Figures
2.1 x86-64 (or IA-32e) Page Walk for a 4KB page. Intel refers to a level-4 table as Page-Map Level-4 (PML4), and to a level-3 table as Page-Directory-Pointer Table (PDPT). . . . . . 9
2.2 SA TLB indexing for a TLB with 64 sets (x86 architecture). . . . . . . . . . . . . 11
2.3 Network I/O - System Snapshot . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Disk I/O - System Snapshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 Number of unique contexts observed in the CMP; the number is also listed on
the top of each column. Each column is colour-coded based on the number of
core-sharers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Number of unique per core contexts for three workload classes. Each column
corresponds to a different core in the range of [0, 15] in ascending order. . . . . . 37
3.3 Context lifetimes. The average context/core lifetime is listed in parentheses as a
percentage of the workload’s execution time sample. . . . . . . . . . . . . . . . . 39
3.4 L1 TLB MPMI and Hit-Rate over different TLB sizes. The x-axis lists the
number of TLB entries for the split TLB with 8KB page translations; the capacity
of each other split TLB structure is half that in size. Canneal saturates with this
y-axis scale; see detail in Figure 3.5. . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5 Canneal MPMI detail with larger y-axis scale. . . . . . . . . . . . . . . . . . . . . 48
3.6 L1 TLB MPMI and Hit-Rate over different FA TLB sizes. All TLBs model full-
LRU as replacement policy. Figure 3.7 shows canneal in detail as it saturated
with this y-axis scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.7 Canneal MPMI detail with larger y-axis scale. . . . . . . . . . . . . . . . . . . . . 49
3.8 L1 TLB MPMI and Hit-Rate over different TLB sizes for the 2-way SA TLB
that only hosts translations for 8KB pages. A fixed 16-entry FA TLB is modeled
for all superpages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.9 Canneal MPMI with larger y-axis scale. . . . . . . . . . . . . . . . . . . . . . . . 51
3.10 L1 TLB MPMI over different TLB sizes for the FA TLB that hosts translations
for all superpages. A fixed 2-way SA 512-entry TLB is modeled for 8KB pages. . 52
3.11 L2 TLB MPMI and Hit-Rate over different TLB sizes. The x-axis lists the
number of L2 TLB entries for an 8-way SA L2-TLB that only supports 8KB
pages. Canneal saturates with this y-axis scale; see detail in Figure 3.12. . . . . . 53
3.12 Canneal L2-TLB MPMI detail with larger y-axis scale. . . . . . . . . . . . . . . . 53
3.13 Per-Core L2-TLB Capacity classified percentage-wise in valid and invalid TLB
entries for different L2-TLB sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.14 Unique Bytes and Byte-Sets Nomenclature . . . . . . . . . . . . . . . . . . . . 55
3.15 Number of unique bytes and byte-sets in the virtual and physical addresses. . . . 56
3.16 Number of unique values for MSB 4 and Byte-Set 4 (both in virtual and physical
addresses). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.17 Percentage of all CMP D-TLB L1 Misses that access the same 64B cache block
as the last time that same translation-entry experienced a TLB miss. . . . . . . 58
4.1 D-TLB L1 MPMI for Different TLB Designs . . . . . . . . . . . . . . . . . . . . 63
4.2 Access Time and Dynamic Energy Trade-Offs . . . . . . . . . . . . . . . . . . . . 64
4.3 (a) PC-Based and (b) Base Register-Value Based Page Size Predictors . . . . . . 67
4.4 Multigrain Indexing with 4 supported page sizes, shown here for a 512-entry
8-way SA TLB (6 set-index bits). . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Multigrain Tag Comparison for Figure 4.4’s TLB on superpage prediction. Page
Size field (2 bits) included in every TLB entry. . . . . . . . . . . . . . . . . . . . 70
4.6 Skewed Indexing (512 entries, 8-way skewed associative TLB) with 4 supported
page sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.7 Prediction Table (PT) Entry Transition Diagram . . . . . . . . . . . . . . . . . . 77
4.8 Superpage-Prediction Misprediction Rate (%) . . . . . . . . . . . . . . . . . . . . 78
4.9 TLBpred MPMI relative to AMD-like 48-entry FA TLB . . . . . . . . . . . . . . . 79
4.10 TLBpred per core capacity distribution over translations of different page sizes. . 80
4.11 Dynamic Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.12 TLBskew, TLBpred, and TLBpskew: MPMI relative to AMD-like 48-entry FA TLB 82
4.13 CPI saved with TLBpred . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.14 Canneal Spin-Offs: Miss Distribution for 48-entry FA (AMD12h-like) TLB . . . . 85
4.15 Canneal Spin-Offs: MPMI relative to AMD-like TLB. Includes TLBpred with
precise page-size prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.1 FMN’s Best Case Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2 FMN Operation Timeline - Page Walk completes before FMN probe . . . . . . . 93
5.3 FMN Operation Timelines - FMN probe completes before page walk . . . . . . . 94
5.4 FMN’s effect on cache contents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.5 Virtualizing a small 8-entry 2-way SA FMN. . . . . . . . . . . . . . . . . . . . . . 99
5.6 Timing Model - Front End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.7 Page Walk Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.8 Projected ideal % performance improvement based on Equation (5.11) with
∆TLB miss = 0.75 and ∆mem = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.9 Projected % performance improvement based on Equation (5.10) with c = 0. . . 111
5.10 Effect of pool size on TLB hit rate. . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.11 TLB Miss Latency as percentage of execution time with varying Pool Size (PS)
and Block Count Per Page (BCPP) values. Figure 5.12 presents how the execu-
tion time changes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.12 Execution time with varying PS and BCPP values. . . . . . . . . . . . . . . . . . 117
5.13 Average Memory Request Latency in cycles. Note the logarithmic y-axis scale. . 118
5.14 Average L1-TLB Miss Latency in cycles. No TLB misses exist for the PS-64
series in these last 16M references, as explained earlier. . . . . . . . . . . . . . . . 118
5.15 Shared versus Private: Effect of data sharing on L1-TLB miss latency. . . . . . . 119
5.16 Shared versus Private: Effect of data sharing on average memory latency. . . . . 119
5.17 Private sharing pattern: Effect of process mix on baseline’s TLB miss latency. . . 120
5.18 Shared sharing pattern: Effect of process mix on baseline’s TLB miss latency. . . 121
5.19 Performance Impact: FMN versus Baseline. . . . . . . . . . . . . . . . . . . . . . 122
5.20 Average TLB Miss Latency in cycles. . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.21 Average Memory Latency in cycles; this is measured after translation is retrieved. . 124
5.22 Performance Impact: FMN compared to L2-TLB. . . . . . . . . . . . . . . . . . . 125
5.23 FMN Filtering and FMN vs. Baseline . . . . . . . . . . . . . . . . . . . . . . . . 127
5.24 Percentage of execution time spent in L1-TLB miss handling. . . . . . . . . . . . 129
5.25 Percentage of execution time reduction due to L2-TLB. . . . . . . . . . . . . . . 130
5.26 FMN or L2-TLB: Percentage L1-TLB Miss latency reduction over HB. . . . . . . 131
5.27 FMN or L2-TLB: Percentage of execution time spent handling L1-TLB misses. . 132
5.28 Characterization of FMN probes for a 1K-entry per core FMN with 8KB VPN
indexing scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.29 FMN or L2-TLB: Percentage memory latency increase over HB. . . . . . . . . . . 133
5.30 FMN or L2-TLB: Performance over HB. . . . . . . . . . . . . . . . . . . . . . . . 134
5.31 Percentage of execution time reduction with L2-TLB bypassing. . . . . . . . . . . 138
Acronyms
ASI Address Space Identifier. The acronym used in SPARC systems instead of ASID. 14
ASID Address Space Identifier. xv, 12, 13, 24, 29, 35, 75, 95
BCPP Block Count Per Page. xiv, 115–117, 120, 125
BRV Base-Register (src1) Value. vii, 67, 77, 88
CPI Cycles Per Instruction. 83
DM Direct-Mapped. 10, 123
DMA Direct Memory Access. 15
FA Fully-Associative. 10, 17, 24, 32, 49, 87, 101
FMN Forget-Me-Not. 4, 90, 91, 101, 140
FPGA Field-Programmable Gate Array. 28
GPU Graphics Processing Unit. 6
ILP Instruction Level Parallelism. 107
IOMMU I/O Memory Management Unit. 15
IPC Instructions Per Cycle. 107
IRQ Interrupt Request. 36
ISA Instruction Set Architecture. 12
L1 Level 1. 4
L2 Level 2. 4, 10
LLC Last-Level Cache. 25, 26, 99, 100
LRU Least Recently Used. 17
MLP Memory-Level Parallelism. 107
MMU Memory Management Unit. 1, 7, 13, 92
MPKI Misses Per Kilo Instructions. 24, 46
MPMI Misses Per Million Instructions. 3, 28, 30, 46, 58, 62, 63, 135, 140
OoO Out Of Order. 107
OS Operating System. 7, 47
PC Program Counter. 44
PCID Process Context Identifier. 13
PDE Page Directory Entry. 8, 21, 121
PDPT Page-Directory-Pointer Table. xii, 9, 21
PID Process ID. 7, 56
PIPT Physically-Indexed and Physically-Tagged. 25
PML4 Page-Map Level-4. xii, 9, 21
PPN Physical Page Number. 7, 45
PS Pool Size. xiv, 115–117, 120, 125
PT Prediction Table. xiii, 66, 77, 88
PTE Page Table (or Translation) Entry. 8, 17, 97, 98
RMM Redundant Memory Mappings. 23, 87
SA Set-Associative. 10, 47, 99, 101, 116
SCSI Small Computer System Interface. 15
SD Standard Deviation. x, 32, 33
SPARC Scalable Processor Architecture. xv, 7
STXA Store extended word into alternate space. 14
THP Transparent Huge Pages. 88
TLB Translation Lookaside Buffer. 1, 6, 10
TSB Translation Storage Buffer. 4, 12, 101, 102
TT Trap Type. 13
VIPT Virtually-Indexed and Physically-Tagged. 12, 23, 24
VIVT Virtually-Indexed and Virtually-Tagged. 24
VPN Virtual Page Number. 7, 20, 34, 35, 45, 57, 95
Chapter 1
Introduction
Address translation has been an integral part of computer systems for decades, since the concept
of virtual memory was introduced in the early 1960s [31]. Virtual memory support is considered
a de facto facility for current systems. Having a large contiguous address space for each process,
along with isolation and access control provisions across different processes and memory regions,
are characteristics of the virtual memory abstraction all programmers rely on. The operating
system and the hardware architecture must support these requirements, usually transparently
to and with no effort from the programmer. Beyond correctness, which is not negotiable,
there is also the fundamental expectation of efficiency: the architecture should support address
translation within strict performance, energy, and often area, envelopes. A brief introduction
to virtual memory is given in Chapter 2.
Getting the virtual-to-physical mapping of an address is on the processor’s critical path
because it precedes the vast majority of memory accesses, be it for data or instructions. Modern
Memory Management Units (MMUs) employ Translation Lookaside Buffers (TLBs) to avoid
walking the page tables on every memory access that needs a translation. This functionality
has a parallel to data and instruction caches: TLBs hide part of the page-walk latency, while
data and instruction caches hide part of the memory latency. However, despite its long history,
address translation still causes significant performance loss in many scenarios, as both system
architectures and workloads evolve. The percentage of execution time spent doing page walks
is as high as 16% for scale-out workloads [44] or 14% for a wide range of server and scientific
applications [15], and can even reach 89% under virtualization [15]. This trend is expected to
continue as the increased data footprints of emerging applications stress conventional TLBs and
their tight latency and energy constraints. As the virtual address space grows, the TLB miss
handling latency is also expected to increase as more levels are added to current multi-level
page tables. For example, Intel is currently working on introducing 5-level page tables [40].
The TLB organization and TLB miss handling need to accommodate these ever evolving needs.
The straightforward solution of making the structures in question larger has been tried in
the realm of caches, and it is now common knowledge that blindly dedicating resources to the
problem is not enough. In the realm of L1-TLBs, it is not even a viable option because of
their strict latency constraints. But even within the existing latency and power constraints,
the “one design fits all” paradigm rarely manages to capture the widely different memory
behaviour requirements, not only across different workloads, but also across cores running a
single multithreaded workload. For example, rigid a priori decisions about the likely page size
distributions of workloads, reflected in different TLB sizes for split TLBs that each support
one page size, can waste both energy and hardware resources, and can also hurt performance
when the observed behaviour deviates from the one expected. This thesis advocates for TLB
designs and policies that dynamically adapt to the workloads’ behaviour for a judicious use of
the available on-chip resources.
To understand which aspects of workload behaviour and system architecture influence
TLB usage, and thus which ones TLB designs and management policies should adapt
to, this thesis presents an in-depth exploration of TLB-related behaviour
for a set of commercial and cloud workloads. This analysis showed significant variation in the
use of superpages (i.e., pages with sizes greater than the smallest supported one) versus small
pages across workloads, with a strong bias for the largest supported superpage. It also showed
that most mainstream TLB structures (e.g., split TLBs) are either biased towards the smallest
page size or make an implicit assumption about the page size distribution of memory accesses.
These two observations have motivated our proposal for Prediction-Based Superpage-Friendly
TLBs that use superpage-prediction to allow translations of different page-sizes to coexist in a
single set-associative TLB, sharing its capacity at runtime as needed. Our analysis also showed
that translation modifications are rare, encouraging our second proposal, the Forget-Me-Not
TLB, a cacheable and speculative TLB that allows translations to dynamically share existing
on-chip capacity with regular data and reduces TLB miss handling latency.
The remainder of this chapter is organized as follows. Sections 1.1 to 1.3 introduce our
analysis and our two architectural proposals. Section 1.4 outlines the research contributions of
this dissertation and, lastly, Section 1.5 reviews the organization of this thesis.
1.1 The Analysis of TLB-related Behaviour
Our analysis had two goals: (i) to understand which aspects of workload behaviour
influence TLB usage, and (ii) to characterize the interplay between these aspects and the
existing TLB infrastructure. We thus classified our measurements according to the following
taxonomy: (i) characteristics inherent to the workloads, that is, characteristics or metrics
unaffected by translation caching structures like the TLBs, and (ii) other metrics that are
influenced by the architecture of these structures.
The measurements in the first category answer the following questions: What is the number
of unique translations, and thus the TLB sizing requirements, for these workloads? Do these
requirements vary from the perspective of each CMP core for a given workload? Which
page sizes are most prominently used? Is there any bias we can exploit? Is there translation
sharing across cores that would motivate non-private TLB designs? How often are translation
mappings modified? If we look at address translation via the abstraction of process IDs (i.e.,
contexts), how does this lens influence our view of translation sharing potential across cores?
What is the frequency, data reach and lifetime of these different contexts, and would filtering
them and their translations be appropriate? Are there opportunities for translation compression
or predictability of the cache block accessed after a TLB miss? Answers to these questions help
us better understand how and why specific TLB designs affect Misses Per Million Instructions
(MPMI) measurements, and also motivate our architectural proposals. The measurements in
the second category of the above taxonomy evaluate how MPMI changes across different
state-of-the-art TLB organizations, highlighting their trade-offs and shortcomings. For example, our
TLB capacity sensitivity study illustrates how rigid TLB designs that make a priori assumptions
about page size distribution in workloads poorly capture different superpage usage scenarios.
1.2 Prediction-Based Superpage-Friendly TLBs
Our analysis reveals different page size usage scenarios across workloads, with some workloads
heavily relying on superpages. It also shows that, when superpages are used, workloads tend
to favor the largest superpage size, while intermediate superpage sizes rarely appear. These
observations are not reflected in existing TLB designs that make an a priori decision about
workloads’ page size distribution, unnecessarily wasting energy and area.
To address this research gap, this thesis proposes a lightweight binary superpage prediction
mechanism that accurately guesses ahead of time if a memory access is to a superpage or
not. This predictor enables our proposed TLBpred design, an elastic set-associative TLB that
dynamically adapts its super- and regular page capacity to fit each application’s needs. That
is: (1) A workload using mostly a single page size can use all the available TLB capacity
and not waste any resources or be limited by predetermined assumptions on page size usage.
(2) A workload that uses multiple page sizes should have its translations transparently compete
for the available TLB entries. A set-associative TLB design will better scale to larger sizes
without the onerous access and power penalties of a large fully-associative TLB. For example,
a 256-entry 4-way SA configuration of the proposed TLBpred design achieves better coverage
(7.7% less MPMI) compared to a slower 128-entry fully-associative TLB. It is also significantly
more energy efficient; its energy efficiency is comparable to that of a much smaller 48-entry
fully-associative TLB that has much higher MPMI. This TLBpred design uses a highly accurate
superpage predictor; a small 128-entry predictor table with a meager 32B of storage has an
average misprediction rate of 0.4% across all simulated workloads. This work also provides the
first experimental evaluation of the previously proposed Skewed TLB design [69] that can also
support multiple page sizes in a single structure. We further augment the Skewed TLB with
page size prediction, a modified version of our superpage predictor that now predicts among
groups of page sizes, to increase its per page-size effective associativity.
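The prediction mechanism summarized above can be pictured as a small table of saturating counters. The following is a minimal sketch, assuming a PC-indexed table of 2-bit counters (128 entries at 2 bits each gives the 32B storage quoted in this section); the indexing function and update policy are illustrative assumptions, not the exact design of Chapter 4:

```python
# Hedged sketch of a binary superpage predictor: a 128-entry table of
# 2-bit saturating counters. The PC-based index and the update policy
# are illustrative assumptions, not the evaluated design.

class SuperpagePredictor:
    def __init__(self, entries=128):
        self.entries = entries
        self.table = [1] * entries  # start weakly biased to "not a superpage"

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop instruction-alignment bits

    def predict(self, pc):
        """True predicts that the access falls within a superpage."""
        return self.table[self._index(pc)] >= 2

    def update(self, pc, was_superpage):
        i = self._index(pc)
        if was_superpage:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

A binary prediction suffices here because the predictor only needs to steer the TLB index choice between superpage and regular-page indexing, not to pinpoint the exact page size.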
1.3 Forget-Me-Not TLB
Level 1 (L1) TLBs are inherently limited by strict latency constraints. Even if they are re-
designed to better adapt to the workloads’ behaviour and each architecture’s nuances, they
still cannot meet the growing demands of data footprints. Extending the TLB hierarchy, e.g.,
by adding a Level 2 (L2) TLB, is a solution many current systems follow. Even though the
L2-TLB access latency is no longer in the processor’s critical path, the benefits gained by allo-
cating hardware resources to such a design need to be scrutinized especially because the L2-TLB
probe happens before a page walk. Thus, in cases where the L2-TLB hit-rate is low because
the workload’s footprint is too large or the existing L2-TLB configuration does not cater to the
workload’s page-size distribution, preceding the page-walk latency with an L2-TLB probe may
not just waste energy, but also cause performance degradation.
This work proposes the Forget-Me-Not (FMN) TLB, a cacheable TLB design that can sig-
nificantly reduce TLB-miss handling latency without any dedicated on-chip translation storage.
FMN leverages the observation that large on-chip memory caches can be shared transparently
and on demand with properly engineered virtualized structures [25, 26]. On a TLB miss, this
virtualized TLB is accessed in parallel to the page table. However, unlike the page walk that
requires multiple memory accesses, only a single access is needed to retrieve the translation, if
the latter exists in this new cacheable structure. FMN’s translations are speculative because
the FMN is not kept coherent with the page tables. However, since our analysis shows that
translation modifications are rare, FMN misspeculation is also rare. As explained in Section 5.4.3,
when compared to a software-based translation cache, like the Translation Storage Buffer (TSB),
our design is different because it is hardware managed, its lookup does not precede the page
walk, and it is also, by design, not kept coherent with the page-tables. Further, FMN is not a
per process structure, but can be configured as a private (per-core) or a shared structure.
Contrary to an L2-TLB, FMN does not require any dedicated hardware storage and its
virtualized nature enables more flexible organizations (e.g., different indexing schemes, sizes).
A per core private 1024-entry direct-mapped FMN reduces the average L1-TLB miss latency
across all simulated workloads by 31.4% over a baseline with only L1-TLBs, while a dedicated
1024-entry 8-way set-associative L2-TLB reduces it by 24.6%. FMN’s L1-TLB miss latency
reduction results in up to 1.97% overall execution time reduction (performance). For systems
that already have an L2-TLB, this work also proposes an L2-TLB bypassing optimization. An
interval-based predictor enables L1-TLB misses to bypass the L2-TLB and immediately trigger
a page walk when it predicts that the L2-TLB lookup, which precedes the page walk, is likely
to degrade performance.
1.4 Thesis Contributions
This thesis makes the following contributions:
• It analyzes different aspects of workload behaviour relevant to address translation, and
highlights inefficiencies of existing TLB designs such as the poor support of multiple page
sizes (Chapter 3). This analysis also points to interesting directions for future research,
such as context-aware TLB-indexing schemes.
• It proposes a highly accurate superpage predictor that predicts if a memory access is
to a superpage or the smallest supported page size. It then leverages this predictor to
propose TLBpred that allows translations of different page sizes to co-exist in an elastic
set-associative TLB design (Chapter 4).
• It evaluates the previously proposed but not evaluated Skewed-TLB, TLBskew, and aug-
ments it with a prediction mechanism to improve the effective associativity of each page-
size (Chapter 4).
• It proposes FMN, a cacheable and speculative TLB design that reduces the TLB miss
handling latency by using the available on-chip cache hierarchy to transparently and on
demand host past translations (Chapter 5).
• It presents a suite of synthetic traces and their configuration parameters that can enable
exploration of TLB designs for a variety of workload characteristics which might not be
represented in existing workloads (Chapter 5).
• It highlights the circumstances under which L2-TLBs can hurt performance and proposes
an L2-TLB bypassing mechanism based on an interval-based predictor (Chapter 5).
The prediction-based superpage-friendly TLB designs presented in Chapter 4 were published
in the International Symposium on High Performance Computer Architecture (HPCA) in
2015 [55].
1.5 Thesis Organization
The remainder of this thesis is organized as follows. Chapter 2 first provides the necessary
background on address translation and the hardware or software facilities that support it (e.g.,
page tables, TLBs), and then reviews relevant past research. Chapter 3 presents our in-depth
exploration of TLB-related behaviour. Chapter 4 presents our superpage predictor and our
prediction-based superpage-friendly TLB designs TLBpred and TLBpskew. Chapter 5 presents
our Forget-Me-Not TLB, as well as our L2-TLB bypassing mechanism. Finally, Chapter 6
summarizes this work’s contributions and anticipates future research directions.
Chapter 2
Background and Related Work
2.1 Overview
This chapter: 1) provides the necessary background on address translation and the hardware
or software facilities that support it, such as page tables and TLBs (Section 2.2), and 2) reviews
address-translation-related research (Section 2.3). The latter is organized in different thematic
categories that together form the landscape of architectural optimizations targeting
address translation. This landscape changes as the workloads and the underlying architectures
do. The earliest research works targeted scientific applications in uniprocessor systems, while
starting from 2007 there was an emergence of works in multiprocessor systems targeting data
parallel applications; there is also research on address translation for heterogeneous systems
such as GPUs [60,82]. The literature presented in this chapter will focus on general-purpose
systems and will not include research for virtualization support. Research works that aim to
reduce the number of TLB misses, the TLB-miss latency overhead, or the energy spent on
address translation are the most pertinent to this work.
2.2 Background on Address Translation
The first system to implement a variant of virtual memory was the ATLAS computer in the
early 1960s. In ATLAS, “address is an identifier of a required piece of information but not
a description of where in main memory that piece of information is” [31]. As Denning later
said, referring to ATLAS, this concept of virtual memory “gives the programmer the illusion
that he has a very large main memory at his disposal, even though the computer actually has a
relatively small main memory” [29]. This indirection in the view of the address space from the
perspective of the process (i.e., the virtual address space, a linear address space exposed to the
programmer) and of the physical system (i.e., the physical address space), remains one of the
main and most crucial facilities provided by virtual memory.
The need to map addresses between virtual memory, the imaginary “large main memory”
mentioned earlier, and the actual physical memory gave birth to address translation. It is the
responsibility of the operating system and the hardware to implement address translation by
providing the necessary mapping from virtual to physical addresses. The MMU serves this
purpose, allowing the application (process) to be oblivious¹ to this indirection.
As application footprints grew, and with the advent of multiprogramming, virtual memory
became a necessity. For this reason, address translation has become an integral part of modern
computer systems. As memory virtualization matured over the years, it also became synony-
mous with providing isolation across multiple running processes and access control (protection)
for different parts of memory. Therefore, address translation is not limited to providing
a virtual-to-physical address mapping, but also incorporates additional information, such as
access permissions, as discussed in Section 2.2.1.
With virtual memory, only the data currently in use by a given process need to reside in
physical memory. The norm in today’s systems is that any data that exceeds the available
physical memory is stored on the disk (secondary storage). Depending on the implementation
of virtual memory used, paging [9] or segmentation [9], the data is organized in memory at
the granularity of pages or segments respectively. Usually pages have a significantly smaller
size than segments; a segment is a large contiguous memory region identified by a base address
and its size. It is also possible for a system to implement both segmentation and paging. This
work focuses on paging systems as paging is the most widespread memory management scheme.
Segmentation is not discussed further, as it can be applied on top of paging and does not
influence the specifics of this work.
The sections that follow present the structures the MMU probes to retrieve the needed
translation. Section 2.2.1 discusses the functionality and organization of page tables, while
Section 2.2.2 focuses on Translation Lookaside Buffers (TLBs). Section 2.2.3 details the Cheetah
MMU, a SPARC MMU, introducing terminology required by the methodology that will follow.
Lastly, Section 2.2.4 covers a different type of MMU, the I/O MMU, used in the path of I/O
accesses.
2.2.1 Page Tables and Page Walks
All the virtual-to-physical mappings, i.e., translations, for a given process are maintained in
a page table in memory. Contiguous regions of virtual addresses, called pages, are mapped to
contiguous regions of physical memory, called page frames. The size of a page, and by extension
the size of a page frame, is always a power-of-two bytes. The page size separates any virtual and
physical address into two fields: the page number and the page offset. The page offset consists
of the log2(page size) least significant bits of the address, while the page number consists of the
remaining higher-order bits. A translation maps a Virtual Page Number (VPN) to a Physical
Page Number (PPN), while the page-offset bits remain unchanged. The VPN and knowledge
of the Process ID (PID), a unique identifier provided to each process by the Operating System
¹Here the term oblivious highlights that the programmer need not do any additional work to facilitate the virtual-to-physical mapping. However, address translation can have a performance impact on a given process.
(OS), are the only two pieces of information needed to access the page tables and retrieve the
translation.
Even though page tables can be organized in different ways, each Page Table Entry (PTE)
usually contains the following fields: (a) the physical page number, (b) a present bit, set when
the page is resident in physical memory, (c) a referenced bit, (d) a modified bit, (e) a caching
disabled bit, and (f) an access permission (protection) field [80].
A cleared present bit in a PTE means that the page in question is not mapped in
physical memory. This triggers a page fault, a trap to the operating system. A replacement
candidate among all present pages is selected by the system's page replacement algorithm, if
needed. The referenced bit in the PTE can be used for this purpose. If the modified
(dirty) bit of the selected page is set, then the operating system must write the contents of
this victim page back (i.e., swap it) to the disk before it brings the requested page frame from
the disk into physical memory and updates the PTE as needed. A set caching-disabled
bit, usually relevant in case of memory-mapped I/O, indicates that the contents of a page frame
should not be cached. Finally, the protection field specifies the access rights to a given page
frame. In an Intel-64 processor, protection encompasses the R/W flag that controls read-only
versus read-write access to a page, the user/supervisor mode flag (U/S) that controls if user-
mode accesses to a page are permitted, and the execute-disabled flag that prevents instruction
fetches from such marked pages further protecting against malicious code [39].
How translation entries are organized in the page tables is especially important since it
determines how fast a translation can be retrieved. The most common types of page tables are:
(a) multi-level page tables, and (b) inverted page tables.
Multi-level page tables solve the problem of needing to keep really large page tables in
physical memory. Only the page tables which cover the address space used by a process are
kept in memory. This idea is realized as follows. The virtual page number of any virtual address
is split into x separate fields, where x is the number of page table levels. Each of these fields
is used as an index to the relevant page table in each level. The contents of each entry at that
index serve either as a pointer to the base address of the next (lower) level page-table or as the
final translation. This work refers to the first type of entries as Page Directory Entries (PDEs)
and to the latter as Page Translation Entries (PTEs). The page table traversal until a PTE
is found is commonly referred to as “walking the page tables” or “page walk”. Page walks are
by nature sequential and they will require more page-table levels as the virtual address space
grows.
Figure 2.1 shows a page walk in an x86-64 architecture; this example retrieves the translation
of a 4KB page. Currently 48-bit virtual addresses are mapped to 52-bit physical addresses. Each
page-table has 512 8B-entries and requires 4KB of storage.
[Figure 2.1 depicts the x86-64 page walk: the 48-bit virtual address is split into four 9-bit index fields (bits [47:39], [38:30], [29:21], and [20:12]) plus a 12-bit page offset; starting from control register CR3, three PDE lookups chain through the L4, L3, and L2 page tables (512 entries each) to the final PTE in the L1 table. For 2MB and 1GB pages, the offset instead spans bits [20:0] and [29:0] respectively.]
Figure 2.1: x86-64 (or IA-32e) Page Walk for a 4KB page. Intel refers to a level-4 table as Page-Map Level-4 (PML4), and to a level-3 table as Page-Directory-Pointer Table (PDPT).
The control register CR3 points to the base physical address of the page-table hierarchy. The
first highlighted VPN field (bits [47:39]) is used to index into the topmost L4 page-table. The
contents of this entry form the base address of the next level page-table. The second field of
the virtual address is used to index into the subsequent L3 table. These steps repeat until we
reach a page translation entry. This PTE can be either at the leaf of the tree, as in the provided
figure, or at a higher tree level, if this virtual address belongs to a larger page. For example, for
a 1GB page, the largest x86-64 supported page size, only the L4 and L3 tables will be accessed
thus requiring only two memory references for the page walk instead of the four needed for a
4KB page.
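The address decomposition in this walk can be sketched as follows; this is an illustration of the bit fields described above, not production MMU code:

```python
def walk_indices(vaddr):
    """Return the four 9-bit x86-64 page-table indices (L4..L1) and the 4KB page offset."""
    offset = vaddr & 0xFFF                     # bits [11:0]
    indices = [(vaddr >> shift) & 0x1FF        # 9 bits per level
               for shift in (39, 30, 21, 12)]  # L4, L3, L2, L1 fields
    return indices, offset

# Each index selects one of the 512 entries at its level; a walk for a
# 1GB page would stop after the L3 lookup (two memory references).
indices, offset = walk_indices((1 << 39) | (3 << 21) | 0xABC)
```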
Unlike the multi-level page tables that contain an entry per virtual page, the inverted
page tables contain an entry per physical page frame, and therefore they cannot be VPN-
indexed. Even though inverted page tables have reduced memory requirements, since a system’s
physical memory is much smaller than the sum of all virtual address spaces of its currently
running processes, they have increased complexity cost. The complexity stems from the need
to exhaustively search all entries of an inverted page table to find the one that corresponds
to the requested VPN and process. Hash tables are often used to speed up this search [9]. This
work considers only multi-level page tables, the most widespread page table format.
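To make the hash-assisted lookup concrete, here is a minimal sketch of an inverted page table; the structure and names are illustrative, not a specific OS implementation:

```python
class InvertedPageTable:
    """One entry per physical frame, plus a hash for fast (PID, VPN) lookup."""

    def __init__(self, num_frames):
        self.frames = [None] * num_frames  # frame number -> (pid, vpn) or None
        self.lookup = {}                   # hash: (pid, vpn) -> frame number

    def map(self, pid, vpn, frame):
        self.frames[frame] = (pid, vpn)
        self.lookup[(pid, vpn)] = frame

    def translate(self, pid, vpn):
        """Return the physical frame number, or None to signal a page fault."""
        return self.lookup.get((pid, vpn))
```

Without the hash, translate would have to scan the per-frame array for a matching (PID, VPN) pair, which is the complexity cost mentioned above.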
2.2.2 Translation Lookaside Buffers (TLBs)
To avoid walking the page-tables on every memory access that needs a translation, and incurring
a significant memory latency overhead, modern MMUs employ Translation Lookaside Buffers
(TLBs). TLBs act as caches for the paging hierarchy; TLBs solely cache translations, while
data and instruction caches cache memory blocks (e.g., data, instructions). To avoid ambiguity,
the term cache(s), in the remainder of this thesis, will refer to data and instruction cache(s)
and not TLBs. The temporal and spatial locality principles that caches rely upon also result
primarily in temporal page-table entry locality. This temporal locality is more pronounced
in the TLBs because they track memory accesses at a coarser granularity than caches (pages
versus cache lines). Spatial locality in the application may also result in adjacent page table
entries being accessed close in time.
TLBs are usually organized as cache-like, fully-associative or set-associative structures, ad-
dressable by a virtual address. Table 2.1 lists the data TLB (D-TLB) configurations of several
commercial processors. All these TLBs are private per-core structures. The acronyms FA,
SA, and DM stand for Fully-Associative, Set-Associative, and Direct-Mapped respectively;
these acronyms are used to describe the associativity of various structures (e.g., TLBs, caches)
throughout the thesis. Multiple levels of TLBs can exist in a system, similar to the multiple
levels of caches in the cache hierarchy. The first level usually has separate instruction and
data TLBs, as is the case with the split L1 instruction and data caches that are the norm in
today’s systems. The second level (L2) TLBs are usually unified, hosting translations for both
instructions and data. The TLBs for the AMD systems are the only exceptions in this table.
AMD 12h family [4, Section A.10]. L1 D-TLB: 48-entry FA TLB (all page sizes). L2 TLB: 4-way SA 1024-entry D-TLB (4KB); 2-way SA 128-entry D-TLB (2MB); 8-way SA 16-entry D-TLB (1GB).
AMD 15h family [5, Section 2.9]. L1 D-TLB: 64-entry FA TLB (all page sizes). L2 TLB: 8-way SA 1024-entry D-TLB (4KB, 2MB or 1GB).
ARM Cortex-A72 [8]. L1 D-TLB: 32-entry FA (4KB, 64KB and 1MB). L2 TLB: 4-way SA 1024-entry (4KB, 64KB, 1MB, 16MB).
Intel Haswell [38, Table 2.10], [34]. L1 D-TLB: 4-way SA split L1 TLBs: 64-entry (4KB), 32-entry (2MB) and 4-entry (1GB). L2 TLB: 8-way SA 1024-entry (4KB and 2MB).
Intel Broadwell [38, Table 2.11]. L1 D-TLB: same as Haswell. L2 TLB: 6-way SA 1536-entry (4KB and 2MB); 4-way SA 16-entry (1GB pages).
Intel Skylake [38, Table 2.5]. L1 D-TLB: same as Haswell. L2 TLB: 12-way SA 1536-entry (4KB and 2MB); 4-way SA 16-entry (1GB pages).
Intel Knights Landing [38, Table 16.3]. L1 D-TLB: uTLB: 8-way SA 64-entry (4KB fractured). L2 TLB: 8-way SA 256-entry (4KB); 8-way SA 128-entry (2MB); 16-entry FA (1GB).
Oracle Sparc T4 [72]. L1 D-TLB: 128-entry FA TLB (all page sizes).
Oracle Sparc M7 [59]. Same as Sparc T4.
Sun UltraSparc III [76]. D-TLBs: 2-way SA 512-entry TLB (8KB); 16-entry FA TLB (superpages and locked 8KB).
Table 2.1: Commercial D-TLB designs; all the L2-TLBs are unified except for the AMD systems.
The various TLB designs in Table 2.1 are also annotated with the page-size(s) they support.
In each system, the memory allocation algorithm and, in some cases user hints/requests, influ-
ence the page size a given virtual address will belong to. As mentioned earlier, only the virtual
address and the process ID are known before a translation is retrieved. Since the page size a
virtual address belongs to is unknown at translation time, special care is needed to avoid using
page-offset bits as the TLB index. Figure 2.2 illustrates the virtual address bits commonly
used to index a set-associative TLB for each of the supported x86 page sizes (i.e., 4KB,
2MB, and 1GB). This example assumes a TLB with 64 sets. The tag and set-index bits form
the page number, while the remaining low-order bits form the page offset. Unfortunately, set
index bits for one page size can be page offset bits for another page size. If page-offset bits
are used as a TLB-index, a translation for a single page could reside in multiple TLB-set(s)
depending on the part of the page being accessed (i.e., the page offset).
[Figure 2.2 shows, for 4KB, 2MB, and 1GB pages, how a 64-bit virtual address splits into page-offset bits (bits [11:0], [20:0], and [29:0] respectively), the six set-index bits directly above the offset, and the remaining tag bits.]
Figure 2.2: SA TLB indexing for a TLB with 64 sets (x86 architecture).
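The ambiguity in Figure 2.2 can be demonstrated with a short sketch (the addresses are illustrative assumptions): indexing with the 4KB scheme scatters accesses to a single 2MB page across multiple sets.

```python
SETS = 64  # 6 set-index bits, as in Figure 2.2

def tlb_set(vaddr, page_size):
    """Set index if the TLB is indexed assuming the given page size."""
    offset_bits = page_size.bit_length() - 1
    return (vaddr >> offset_bits) % SETS

base = 0x4000_0000                       # assume this lies within one 2MB page
a, b = base, base + 0x5000               # two accesses to that same 2MB page
assert tlb_set(a, 4096) != tlb_set(b, 4096)        # 4KB indexing: different sets
assert tlb_set(a, 2 << 20) == tlb_set(b, 2 << 20)  # 2MB indexing: same set
```

A single 2MB translation would thus have to be replicated across sets (or missed) under 4KB indexing, which is exactly why the page size must be known, or predicted, before indexing.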
Current systems usually mitigate this set-indexing issue either by implementing a fully-
associative TLB where only a single set exists (e.g., AMD 12h family, SPARC M7) or by
implementing split TLB designs where multiple TLB structures exist, one per page-size (e.g.,
Intel’s Haswell). These split TLBs are all probed in parallel adjusting the TLB index based on
the page size they support. Fully-associative designs have the shortcoming of a slower access
latency, in addition to being less energy efficient than less associative structures, while split
designs have the shortcoming of wasted energy as at most one of all the split-TLB lookups will
be useful. Chapter 3 presents an evaluation of these different design choices, while Chapter 4
further examines their shortcomings and proposes alternative TLB design choices that are
superpage friendly.
A few systems do support multiple page-sizes in a set-associative design presumably either
via multiple sequential lookups or by splitting a single superpage translation into multiple
translation entries of the supported page size. The fractured uTLB in Knights Landing takes
an alternative approach by fracturing the translation of any page greater than 4KB and holding
only the translation(s) for the 4KB parts of the page being accessed.
The TLB organization affects the TLB access latency which is time-critical because it pre-
cedes every cache lookup. In systems with a Virtually-Indexed and Physically-Tagged (VIPT)
L1 cache, the translated address is required before the L1 tag comparison. Therefore, the
L1 TLB organization where a hit is hopefully the common case should meet some tight timing
requirements. But constraints exist from the perspective of the paging infrastructure as well.
For example, the smallest supported page size limits the organization of any L1 VIPT cache;
the cache index bits should fall within the page-offset to avoid being translated, to ensure cor-
rectness. That is, the capacity of a VIPT cache way (i.e., the number of sets multiplied by the
cache block size) should not exceed the smallest page size. For x86-64, a 32KB SA L1 cache
with 64B cache blocks should be at least 8-way SA.
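The constraint above can be written down directly; this sketch computes the minimum associativity for a VIPT cache, given its capacity, block size, and the smallest page size:

```python
def min_vipt_ways(cache_bytes, block_bytes, min_page_bytes):
    """Minimum associativity so that one way's capacity <= the smallest page size."""
    max_sets = min_page_bytes // block_bytes  # index bits must stay within the page offset
    total_blocks = cache_bytes // block_bytes
    return -(-total_blocks // max_sets)       # ceiling division

# A 32KB cache with 64B blocks and a 4KB smallest page needs at least 8 ways.
assert min_vipt_ways(32 * 1024, 64, 4 * 1024) == 8
```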
TLBs can be classified into hardware-managed and software-managed. In architectures
with hardware-managed TLBs, like the x86 Instruction Set Architecture (ISA), a TLB miss is
serviced via a hardware state-machine that walks the page tables and delivers the translation
entry, if any is present, to the TLB. Hardware management is minimally intrusive; it does
not require flushing the processor’s pipeline and does not pollute the instruction and data
caches. However, this comes with the overhead of an inflexible page table organization since
the organization specifics need to be fixed for all hardware that supports the same ISA. On the
other hand, software-managed TLBs allow for a more flexible page table design.
In ISAs with software-managed TLBs, like SPARC, a TLB miss triggers an interrupt and
it is a software interrupt handler routine that walks the page table and refills the TLB.
Unfortunately, the use of precise interrupts requires flushing the core's pipeline. Therefore, the
flexibility of hardware-agnostic page tables comes at a cost. In some SPARC systems, namely
the UltraSPARC CPU family, the interrupt handler checks the TSB before walking the page
tables. The TSB is a direct-mapped, virtually-addressable data structure that caches transla-
tions; think of it as a software cache that logically lives between the TLBs and the page tables.
The TSB can be accessed with a single memory access and can thus avoid most of the page-walk latency.
Chapter 5 provides additional details on how the TSB is accessed and presents a TSB-inspired
cacheable TLB for systems with hardware-managed TLBs.
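As a rough sketch of how such a software cache behaves (the entry layout and indexing below are simplifying assumptions, not the exact UltraSPARC format):

```python
def tsb_probe(tsb, vpn, context):
    """Direct-mapped lookup: index with the low VPN bits; a single
    memory access either yields the cached translation or misses,
    in which case the handler falls back to the full page walk."""
    entry = tsb[vpn % len(tsb)]
    if entry is not None and entry["vpn"] == vpn and entry["context"] == context:
        return entry["ppn"]
    return None  # TSB miss: walk the page tables

tsb = [None] * 512
tsb[7] = {"vpn": 512 + 7, "context": 3, "ppn": 0xABC}
assert tsb_probe(tsb, 512 + 7, 3) == 0xABC
assert tsb_probe(tsb, 7, 3) is None  # conflict: a different VPN occupies the slot
```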
All types of TLBs, both hardware and software-managed, need to be kept coherent with the
page tables. For example, if any modifications happen to the access permissions of a page, the
relevant TLB entry should be updated or invalidated. This change should also be communicated
to other cores in the system (TLB shoot-down). Another scenario in which TLB entries need to
be invalidated is a context switch. Some systems have enhanced the TLB entries with an
Address Space Identifier (ASID), to allow translations from multiple processes to cohabit the
TLB. The absence of an ASID forces the system to flush the entire TLB. Finally, some systems
provide the option of invalidating entries within a given address range.
2.2.3 The Cheetah-MMU in SPARC
The MMU implementation varies per architecture and encompasses a wide range of features,
such as the TLB organization, the format of a TLB-entry, or any special MMU registers used
in handling TLB-misses. This work uses and extends Flexus from the SimFlex project [35], “a
family of component-based C++ computer architecture simulators”, based on Simics [52] that
models the SPARC v9 ISA [74] and supports full-system simulation. Therefore, this section
presents relevant details for the Cheetah-MMU [84], the MMU used in the UltraSparc-III [76]
processors, and also draws parallels, wherever applicable, with x86.
One might naturally ask how measuring behaviour in one system can be indicative of be-
haviour in other systems. For the purpose of this work, we used the existing MMU to collect
memory traces with address translation information, such as the virtual to physical mapping
of a memory access. Collecting memory accesses before the TLB, and not a TLB miss stream,
allowed us to simulate different TLB configurations. The traces also include non-translating
accesses.
As Section 2.2.2 discussed, the TLBs in Cheetah-MMU are software-managed. On a TLB
miss, a trap handler walks the page tables and refills the TLB. The UltraSparc-III processors
use the Trap Type (TT) register to track the most recent trap (multiple trap levels exist). A
D-TLB miss triggers a fast_data_access_MMU_miss trap with trap type 0x68. Section 5.5.1
reviews the Cheetah-MMU trap handler in more detail. The remainder of this section reviews:
(i) the MMU registers used in TLB miss handling, (ii) the TLB organization and replacement
policy, and (iii) the special MMU operations that keep the TLB coherent with the page tables.
2.2.3.1 MMU Registers and TLB Miss Handling
The Cheetah-MMU has various special registers. Of particular interest when a TLB miss occurs
are the TLB Tag Access register and the Data In register [76]. On a TLB miss, the Tag Access
register contains the only information known at that point: (a) the virtual
address bits [63:13] of the missing address (i.e., the 8KB VPN; since the page size is not yet known,
the smallest page size is assumed), and (b) a 13-bit context identifier.
The context identifier, also present in TLB entries, allows translations from multiple pro-
cesses, and thus address-spaces, to co-exist in the same structure. With a context ID, systems
can avoid invalidation of all TLB entries in case of a context switch. The terminology for iden-
tifiers similar to context ID that are used in address translation varies greatly. For example,
Address Space Identifier is another term commonly used in literature, while in x86-64 the term
is Process Context Identifier (PCID), a 12-bit value [39]. During a TLB lookup, the context ID
for the currently running process, usually kept in a separate register, is compared against the
context stored in a given TLB entry. The only exception is when the global bit of the corre-
sponding TLB entry is set. If this is the case, no context comparison takes place; the virtual
page comparison suffices.
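The hit condition just described can be sketched as follows (field names are illustrative):

```python
def tlb_entry_hits(entry, vpn, current_context):
    """A TLB entry hits if the virtual page matches and either the
    global bit is set (context comparison skipped) or the entry's
    context equals the running process's context ID."""
    if entry["vpn"] != vpn:
        return False
    return entry["global"] or entry["context"] == current_context

kernel = {"vpn": 0x10, "context": 0, "global": True}
user = {"vpn": 0x20, "context": 5, "global": False}
assert tlb_entry_hits(kernel, 0x10, 7)    # global: matches under any context
assert tlb_entry_hits(user, 0x20, 5)
assert not tlb_entry_hits(user, 0x20, 7)  # wrong context
```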
Once the translation is retrieved (e.g., after a page walk), it is loaded into the 64-bit Data In
register. This register has various fields but the following are the most relevant for this work:
(a) a 2-bit page-size field that distinguishes between the four supported page sizes of 8KB, 64KB,
512KB, and 4MB; (b) physical address bits [40:13] (the 8KB PPN; additional least-significant bits
are masked according to the page-size field); (c) a global bit, explained earlier; and (d) a locked
bit that indicates whether this translation can be a TLB replacement candidate; translations
with the locked bit set are called locked or pinned. The writeable flags, the privileged flags, and
the bits that determine cacheability are a few examples of other fields.
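The masking of the 8KB-granularity PPN by the page-size field can be sketched as follows (the 2-bit encodings are assumed here to index the sizes in ascending order):

```python
# Page sizes selected by the 2-bit field, assumed in ascending order.
PAGE_BYTES = {0: 8 << 10, 1: 64 << 10, 2: 512 << 10, 3: 4 << 20}

def effective_ppn(ppn_8k, size_field):
    """Mask the least-significant bits of the 8KB-granularity PPN:
    a 64KB page spans eight 8KB frames, so its low 3 PPN bits are
    ignored; 512KB masks 6 bits; 4MB masks 9 bits."""
    extra_bits = (PAGE_BYTES[size_field] // (8 << 10)).bit_length() - 1
    return ppn_8k & ~((1 << extra_bits) - 1)

assert effective_ppn(0b10111, 0) == 0b10111  # 8KB: nothing masked
assert effective_ppn(0b10111, 1) == 0b10000  # 64KB: low 3 bits masked
```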
2.2.3.2 TLB Organization and Replacement Policy
The Cheetah-MMU uses two separate D-TLBs to support translations of different page-sizes
as well as locked translations. As Table 2.1 showed, a 512-entry 2-way set-associative TLB
only hosts 8KB pages, while a smaller 16-entry fully-associative TLB holds translations for
superpages (i.e., non-8KB page sizes) and locked translations of any page-size. Both structures
are probed during a TLB-lookup; on a TLB miss the retrieved translation is installed in the
appropriate structure based on the page-size and locked fields discussed earlier.
Each TLB entry in the Cheetah-MMU TLBs also has a used bit associated with it; this bit
is set on a TLB hit, speculative or otherwise, and it is used to identify a replacement candidate
when a TLB set is full [76]. If no invalid entries exist in the current TLB set, the first unlocked
entry with a used-bit set to zero is selected. If no replacement candidate is identified, all used
bits are reset and the process repeats.
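The policy amounts to a not-recently-used scheme with pinning; a sketch follows (entry fields are illustrative):

```python
def pick_victim(tlb_set):
    """Select a replacement candidate in one TLB set: an invalid entry
    if one exists; otherwise the first unlocked entry whose used bit is
    clear; if none qualifies, reset all used bits and repeat."""
    for i, e in enumerate(tlb_set):
        if not e["valid"]:
            return i
    if all(e["locked"] for e in tlb_set):
        return None  # every entry is pinned: nothing can be evicted
    while True:
        for i, e in enumerate(tlb_set):
            if not e["locked"] and not e["used"]:
                return i
        for e in tlb_set:  # all unlocked entries were recently used
            e["used"] = False

tlb_set = [
    {"valid": True, "locked": True,  "used": True},
    {"valid": True, "locked": False, "used": True},
]
assert pick_victim(tlb_set) == 1  # used bits reset, then entry 1 is chosen
```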
2.2.3.3 Special MMU Operations
Until now, the main focus was on how to retrieve the translation information after a TLB
miss. However, it is crucial for the MMU to have the ability to modify the TLB contents
to ensure they are coherent with the page tables. Correctness is non-negotiable. There are
two types of such modifications: invalidations of TLB-entries, often referred to as demaps or
demappings, and modifications of the contents of TLB-entries, sometimes referred to in this
work as remappings.
In the Cheetah MMU, these operations are initiated by specialized store instructions (STXA
opcode). These instructions use specific Address Space Identifiers (ASIs); an ASI is an 8-bit value
that specifies the address space. In SPARC v9, bit 13 of the instruction (counting from zero) specifies the ASI's
location. If the bit is zero, the ASI is explicitly encoded in the instruction (bits 5-12 inclusive),
while if it is one, the ASI is held in the ASI register. An ASI with its most significant bit set to
zero corresponds to a restricted ASI, one only accessible by privileged software [74]. Accesses to
MMU registers usually involve one of these special ASIs explicitly encoded in the instruction.
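The encoding rule can be sketched as follows (the instruction word is simplified to just the fields named above):

```python
def decode_asi(instr_word, asi_register):
    """Bit 13 selects the ASI source: 0 means the ASI is encoded in
    instruction bits 12:5; 1 means it comes from the ASI register.
    An ASI with its MSB clear (< 0x80) is restricted (privileged)."""
    if (instr_word >> 13) & 1:
        asi = asi_register & 0xFF
    else:
        asi = (instr_word >> 5) & 0xFF
    return asi, (asi & 0x80) == 0

# A demap store with ASI 0x5F encoded immediately (bit 13 clear):
asi, restricted = decode_asi(0x5F << 5, asi_register=0)
assert asi == 0x5F and restricted  # MSB clear: privileged-only ASI
```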
Demaps: “stxa %g0, [%g1 + %g0] 0x5f # ASI_DMMU_DEMAP” is a disassembled demap
instruction; 0x5F is the D-MMU TLB DEMAP ASI. The virtual address denoted in square
brackets, here the sum of the contents of global registers g1 and g0, contains the demap type,
the virtual address bits, and a 2-bit field that indicates the register that contains the context
ID. The store value itself is ignored. The restricted ASI is explicitly encoded in the
instruction.
In UltraSparc-III, two types of demaps exist: (a) a demap page type that can invalidate
at most one TLB entry associated with an instruction-encoded VPN and context, and (b) a
demap context type that invalidates all TLB entries associated with a given context, if their
global bit is not set. Locked translation entries can also be demapped like all others.
The functionality described above is also provided in other non-SPARC architectures. For
example, x86-64 supports similar operations with dedicated instructions. Intel reports that
the INVLPG instruction can invalidate all translation entries for a given page number, while
the INVPCID instruction has four different operation modes that can invalidate mappings of
a specific address, all mappings of a specific context (similar to demap-context), or map-
pings of all contexts with the option to either include or exclude any translations marked
as global [39, Section 4.10.4.1].
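The two demap types can be modelled over a list of TLB entries as follows (a simplified sketch; entry fields are illustrative, and the real demap-page operates on at most one matching entry):

```python
def demap_page(tlb, vpn, context):
    """Invalidate the entry (if any) matching this VPN and context;
    locked entries are demappable like all others."""
    return [e for e in tlb if not (e["vpn"] == vpn and e["context"] == context)]

def demap_context(tlb, context):
    """Invalidate every entry of the given context whose global bit is clear."""
    return [e for e in tlb if e["global"] or e["context"] != context]

tlb = [
    {"vpn": 1, "context": 3, "global": False, "locked": True},
    {"vpn": 2, "context": 3, "global": True,  "locked": False},
    {"vpn": 3, "context": 4, "global": False, "locked": False},
]
assert len(demap_page(tlb, 1, 3)) == 2  # locked entry is still removed
assert len(demap_context(tlb, 3)) == 2  # the global entry survives
```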
TLB-Entry Modifications: TLB-entry modifications occur via instructions that directly
modify the MMU Data Access register (0x5d is the relevant ASI). Here is an example in-
struction: “stxa %o3, [%o1 + %g0] 0x5d # ASI_DTLB_DATA_ACCESS_REG”. These are OS-
directed writes to a specific TLB-entry. The virtual address of this write (store) operation
specifies the TLB-entry to be modified (overwritten), while the store value specifies the new
TLB data.
2.2.4 Address Translation for I/O
The previous sections discussed the basics of address translation from the perspective of the
core’s MMU. However, a different path to memory exists via Direct Memory Access (DMA) from
I/O devices. An I/O Memory Management Unit (IOMMU) provides address translation and
memory protection to I/O accesses that would previously directly access physical memory [2,37].
This added functionality provides a level of protection from misbehaving drivers, as well as the
necessary hardware support for I/O virtualization.
With an IOMMU, an address translation step is now introduced in the critical path of
every DMA access. I/O TLBs are used to avoid the costly walk of the I/O page tables. The
IOMMU is commonly located on an I/O hub or bridge (PCI bridge in our simulated system)
and can serve multiple devices. A PCI bridge is a system component that connects multiple
buses together. Figures 2.3 and 2.4 show two examples of a typical I/O architecture, focusing
on network and disk traffic respectively, the two dominant types of I/O traffic. For the
network traffic, DMA accesses initiated by the network adapter reach the memory of the server
system after crossing a PCI bus and a PCI bridge where the IOMMU resides. For disk I/O,
Figure 2.4 shows a snapshot of one SCSI disk array from a system that contained multiple
arrays of fiber channel SCSI disks connected to multiple PCI bridges via a hierarchy of PCI
Figure 2.3: Network I/O - System Snapshot
Figure 2.4: Disk I/O - System Snapshot
buses. DMA accesses initiated by the Fibre SCSI controller in this snapshot cross the PCI-to-
PCI bridge as well as the Host-to-PCI bridge before reaching memory. The IOMMU module is
now located at the Host-to-PCI bridge, the root node of this hierarchy of buses.
SPARC also contains an IOMMU. In the Serengeti server systems, the IOMMU is located
in Schizo, the host-to-PCI bridge [84]. It has a 16-entry fully-associative TLB and supports
8KB and 64KB page sizes [77]. Selective flushing of TLB entries is permitted via programmable
I/O operations. On a TLB miss, the IOMMU looks up the Translation Storage Buffer (TSB),
a software-managed, direct-mapped, in-memory data structure. The TSB serves as the page table
here. On a TSB miss, an error is returned to the device that initiated the DMA access.
2.3 Literature Review of Related Work
This section reviews past work that optimizes some aspect of the address translation process
either via hardware optimizations or hardware/software co-design techniques. Past research
has been grouped in the following thematic categories: (a) techniques that reduce the number
of TLB misses, (b) techniques that reduce the latency overhead of a TLB miss, (c) techniques
that revisit address translation/paging, (d) techniques that reduce address translation energy,
(e) techniques that address TLB coherence overheads, (f) techniques that target I/O address
translation, and lastly (g) techniques that leverage address translation facilities to optimize dif-
ferent system aspects (slightly deviating from the prior classification). Some optimizations cross
the boundaries of these categories. Also, many techniques are orthogonal to each other. As is
often the case in architecture research, no perfect design exists. As the applications and the
underlying hardware change, both new opportunities and challenges arise. Sections 2.3.1 - 2.3.7
review the aforementioned thematic categories and also discuss how this thesis relates to them.
2.3.1 Techniques that Reduce TLB Misses
TLBs, similar to caches, capture workloads' memory access behaviour, albeit at a coarser gran-
ularity. Increasing the TLB hit rate, which is already high due to the spatial and temporal
locality forces at work, is one of the main approaches to alleviate the address translation over-
head. This section classifies research works that reduce TLB misses into the following three
categories: (i) research works that employ TLB prefetching, (ii) research proposals for shared
or distributed TLB designs that exploit translation sharing across CMP cores, and (iii) research
techniques that extend the reach of each TLB-entry by revisiting the amount of information it
tracks; translation coalescing is one such example. The straightforward solution of increasing
the TLB capacity is not discussed; it is not sustainable under the same timing constraints that
bound L1 cache sizes.
2.3.1.1 TLB Prefetching
Prefetching is employed in caches to anticipate future data use based on previously seen memory
patterns. Prefetching has been proposed for TLBs too, first for uniprocessors in the early 2000s
and later for multiprocessors near the end of that decade.
Saulsbury et al. were the first to propose a hardware-based TLB prefetching mechanism [68].
Their recency-based prefetcher targets deterministic iterative TLB misses. These are charac-
teristic of applications which iteratively access data structures in the same order, but suffer
capacity TLB misses. The proposed scheme maintains a temporal ordering of virtual pages in
an LRU stack by adding a previous pointer and a next pointer to each page table entry. On
a TLB miss, when this ordering is updated, the entries adjacent to the requested PTE (those
referenced by the aforementioned pointers) are prefetched. All predicted translations are first placed into a
prefetch buffer, and are only promoted to the TLB on a hit, thus minimizing TLB pollution
from bad prefetches. For a set of five applications, the recency-based prefetcher correctly pre-
dicts between 12% and ∼59% of TLB misses for a 64-entry Fully-Associative (FA) TLB, assuming
8KB pages. Applications with a regular stride access pattern benefit the most, and in all cases,
the proposed scheme consistently outperforms a linear, next-page(s) prefetcher.
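A minimal sketch of the recency idea follows; an OrderedDict stands in for the prev/next pointers that the scheme embeds in the page table itself:

```python
from collections import OrderedDict

class RecencyPrefetcher:
    """On each page access, move the page to the top of an LRU stack;
    on a TLB miss, suggest the pages adjacent to the missing page in
    recency order as prefetch candidates."""
    def __init__(self):
        self.stack = OrderedDict()  # page -> None, most recent last

    def access(self, page):
        self.stack.pop(page, None)
        self.stack[page] = None

    def candidates(self, missing_page):
        pages = list(self.stack)
        if missing_page not in self.stack:
            return []
        i = pages.index(missing_page)
        return [pages[j] for j in (i - 1, i + 1) if 0 <= j < len(pages)]

p = RecencyPrefetcher()
for page in (10, 11, 12):
    p.access(page)
assert p.candidates(11) == [10, 12]  # neighbours in recency order
```

The candidates would be installed in a prefetch buffer, not the TLB, exactly to limit pollution from bad prefetches.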
Kandiraju et al. proposed a TLB prefetching mechanism that captures memory reference
patterns in the form of distances [42]. Distance is the difference in pages between two con-
secutive memory accesses in the TLB-miss stream. The proposed scheme stores previously
seen distances in a distance-indexed hardware table. For example, if the current TLB miss
is to page a and the current distance points to an entry with distances two and five, then
translations for pages a+2, and a+5 will be prefetched. One of the main benefits of the dis-
tance prefetcher is that it can perform well with limited hardware storage. For example, if
all TLB misses of a workload had the same stride (distance), then just a single-entry distance
table would provide full coverage. The recency-based prefetcher [68], discussed earlier, does
not have any on-chip storage constraints but at the expense of larger in-memory page tables.
The proposed distance prefetching scheme achieves the highest average prediction accuracy across
the simulated workloads when compared against a variety of stride- and history-based prefetching
schemes, some originally proposed for caches, as well as against the recency-based prefetcher. The latter, however,
has slightly higher average accuracy than distance prefetching when the accuracy is weighted
by each application’s TLB miss rate. The authors attribute this behaviour to a few applications
with high TLB miss rates that benefit from a long history.
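The mechanism can be sketched as a small distance-indexed table (a simplification: the real proposal also tags and size-bounds the table):

```python
class DistancePrefetcher:
    """Track the distance between consecutive missing pages; on a miss,
    replay the distances previously observed to follow the current one."""
    def __init__(self):
        self.table = {}            # distance -> set of distances seen next
        self.last_page = None
        self.last_distance = None

    def on_miss(self, page):
        prefetches = []
        if self.last_page is not None:
            distance = page - self.last_page
            if self.last_distance is not None:
                self.table.setdefault(self.last_distance, set()).add(distance)
            # e.g., if this entry holds distances 2 and 5, prefetch page+2, page+5
            prefetches = sorted(page + d for d in self.table.get(distance, ()))
            self.last_distance = distance
        self.last_page = page
        return prefetches

p = DistancePrefetcher()
assert p.on_miss(10) == []
assert p.on_miss(12) == []    # distance 2 seen for the first time
assert p.on_miss(14) == [16]  # distance 2 repeats: predict 14 + 2
```

A single table entry suffices for a perfectly strided miss stream, which is the storage advantage noted above.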
All aforementioned prefetching schemes were evaluated in a uniprocessor environment.
Bhattacharjee et al. were the first to characterize TLB misses of parallel workloads (PARSEC)
in a CMP environment and to propose prefetching schemes to address them [20, 21]. The first
class of misses they identified, inter-core shared, captures TLB misses to the same virtual page
across different cores, representative of multi-threaded workloads that access the same instruc-
tions and data. A leader-follower prefetching scheme is proposed, which pushes translations
into the prefetch buffers of other cores (the sharers) under the rationale they will miss on the
same page. A confidence mechanism filters useless or harmful prefetches.
The second class of TLB misses, inter-core predictable stride, captures misses to virtual
pages that are accessed by different cores within a fixed timeframe and are stride pages apart
from each other. The hypothesis is that if core i accesses V PNa, then it is possible that core j
will access V PNb that is V PNa plus some stride. Such behaviour is reflective of data-parallel
applications where threads running on different cores operate on different subsets of data but
follow the same access pattern. Because memory accesses to different pages across cores can
be reordered in time, the proposed distance-based cross-core prefetching scheme
keeps track of distances between consecutive per-core TLB misses. These distance pairs are
stored in the distance table, a hardware structure that is shared across all cores, and they drive
distance-based prefetches on other CMP cores.
Even though this thesis does not evaluate the use of prefetching, prefetching optimizations
are orthogonal to our work. Prefetching could be added both to the superpage friendly TLB
designs in Chapter 4 and the FMN design in Chapter 5. For FMN, a straightforward prefetching
implementation would trigger more FMN and page-table accesses, assuming the prefetching
candidate does not exist in a hardware structure on chip. By default, useless translation
prefetches can unnecessarily increase the memory bandwidth or displace useful data in the
cache hierarchy by bringing additional page table entries on chip. This behaviour could be
further exacerbated with an FMN that is also cacheable like the page tables. A judicious
feedback mechanism to throttle prefetches might be needed, if the existing FMN probe filtering
mechanism (Section 5.9.7) proves insufficient. Alternatively, prefetches could only probe either
the FMN or the page tables and not both. The former would result in a faster prefetch, if the
translation entry has been previously seen and is present in the FMN, while the latter would
avoid the risk of FMN displacing useful data. Both the page walk and FMN probes also perform
indirect prefetching: within each accessed 64B cache block, whether FMN or page table, multiple
translations co-reside, usually for spatially adjacent virtual pages. All these translations are
naturally moved higher in the cache hierarchy, i.e., closer to the cores, when another translation
in that cache line is accessed.
2.3.1.2 Shared or Distributed-Shared TLB Designs
In a CMP environment different applications can either share memory access patterns and data
(e.g., parallel applications) or have very different resource requirements (e.g., multiprogrammed
applications). Even though today’s cache hierarchies often include one or more shared cache
levels to facilitate data sharing, a per core private TLB hierarchy continues to be the norm.
Two proposals in 2010 - 2011, one for a shared TLB [19] and one for a distributed-shared TLB
design [75], were the first to target this research gap. In both cases, the proposed TLB designs
appear to only cater to the smallest supported page size.
A shared last-level TLB design [19] was proposed to better utilize the available TLB capacity
compared to private TLBs. Having a shared structure avoids translation replication across
private TLBs, expanding the effective on-chip TLB capacity, and thus the TLB hit-rate, for
parallel applications. It also allows multi-programmed applications to freely contend for the
entire shared TLB capacity without the constraints of per-core TLBs, which can be especially
beneficial for applications with unbalanced TLB requirements. Integration of stride prefetching
in a shared TLB design is orthogonal, yielding additional miss reduction. The proposed design
was however a monolithic structure with increased access time compared to private TLBs and
potentially poor scalability.
Synergistic TLBs is an alternative design that aims to combine the short access latency of
private TLBs with the better utilized capacity of the shared TLB paradigm [75]. The proposal
is for a distributed TLB design that allows evicted translation entries from borrower TLBs to
spill to remote TLBs that have been dynamically classified as donors. On top of the distributed
design, synergistic TLBs also permit heuristic-based translation replication and migration. The
former allows translations to be replicated across cores to avoid long access times to remote TLBs,
while the latter migrates translations to cores that are likely to access them to better utilize
the available TLB capacity.
Neither the shared nor the synergistic TLB designs were shown to support multiple page
sizes. Using any of our superpage-friendly TLB designs from Chapter 4 as a shared or dis-
tributed TLB could further benefit performance or energy as it would increase the TLB reach.
Also, it would be straightforward to probe our FMN design (Chapter 5) on a shared or a syn-
ergistic TLB miss. One of the shortcomings of a shared TLB is having a monolithic hardware
structure that might not scale well as the number of cores on a chip grows. The FMN design
can be easily configured to be a shared one; all TLB miss controllers need to share the same
FMN base address (see Section 5.4). A shared FMN straddles the ground between a shared
and a distributed structure. Multithreaded workloads that share data across cores can fully
utilize FMN’s capacity without any translation replication. FMN’s cacheable nature allows the
same FMN entries to simultaneously exist in multiple private caches, thus reducing TLB access
latency, albeit at the risk of more useful-data displacement. One could also envision different
FMN organizations where a subset of CMP cores share an FMN while other FMNs are private,
a potentially beneficial configuration for systems running virtual machines or multiprogrammed
workloads.
2.3.1.3 Increasing TLB Reach
Prefetching and shared/distributed structures that facilitate translation sharing both rely on
the traditional reach of each TLB translation entry. An alternative way to increase the TLB hit-
rate is extending the reach of each individual TLB entry. TLB entries that support superpages
already do that. However, these entries simply adhere to the decision made by the OS’s memory
allocation algorithm. This section reviews research that extends the reach of TLB entries beyond
what was decided at the OS level.
Talluri and Hill were the first to explore this research avenue in the mid-1990s with two
subblocking TLB designs [78]. The complete-subblock TLB allows a subblock-factor number of
contiguous virtual page numbers to share a single TLB tag, with separate data fields for the
physical frame numbers these VPNs map to. With this design, each TLB entry has a similar
reach as a superpage that is subblock-factor times greater than the smallest supported page size,
albeit at the cost of more hardware resources. A partial-subblock TLB, on the other hand, eases
the area overhead by requiring all PPNs to fall within an aligned memory region, share attribute
bits, and properly align with their VPNs. The partial-subblocking TLB entries are closer in size
to a superpage TLB entry and, contrary to it, do not require any OS support. However, because
not all virtual pages within a subblock might meet these requirements, multiple instances of
the same VPN tag could coexist in the TLB unless the valid bits for all of a subblock's VPNs are
combined with the tag.
Almost two decades later, Pham et al. proposed translation coalescing [58]. They observed
the presence of intermediate degrees of contiguity where a group of contiguous VPNs maps
to contiguous PPNs, but this contiguity does not suffice for the contiguous VPN region to be
promoted to a superpage. Their proposed design, CoLT, coalesces such VPN groups, maintaining
a single TLB entry for each. They report a 40% to 57% average TLB miss reduction on a set of
SPEC CPU 2006 and Biobench (bioinformatics) workloads while limiting the maximum number
of coalesced translations to four. CoLT modifies the TLB set-indexing scheme to support
larger coalesced entries, but at the expense of more conflict misses. Contrary to the subblocking
designs discussed earlier, CoLT does not have any alignment restrictions. However, CoLT’s
potential is inherently limited to the “contiguous spatial locality” available in a given system,
which can be scarce in the presence of fragmentation.
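Detecting the intermediate contiguity CoLT exploits can be sketched as follows (a hypothetical helper over a VPN-to-PPN map):

```python
def coalescable_run(mapping, vpn, max_coalesce=4):
    """Count how many consecutive VPNs starting at `vpn` map to
    consecutive PPNs; CoLT would keep one TLB entry for the run
    (capped at four coalesced translations, as in the evaluation)."""
    base_ppn = mapping[vpn]
    n = 1
    while n < max_coalesce and mapping.get(vpn + n) == base_ppn + n:
        n += 1
    return n

mapping = {100: 500, 101: 501, 102: 502, 103: 900}
assert coalescable_run(mapping, 100) == 3  # VPN 103 breaks the contiguity
```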
Pham et al. later relaxed CoLT's requirement for contiguous VPNs to map to contiguous PPNs
and proposed the use of clustering to extend TLB reach [57]. Similar to the partial subblock
TLBs, each TLB entry in this clustered TLB maps a set (cluster) of contiguous VPNs. In both
cases, this VPN cluster needs to be properly aligned; all VPNs in a cluster share the same
VPN bits except for the lower log2(cluster factor) bits. The same alignment requirement
applies to the PPN cluster too. However, unlike the partial subblock TLB design, these VPNs
can map anywhere within an equally sized and properly aligned cluster of PPNs. Holes, that is,
VPNs that do not map to any PPNs in that cluster, are also permitted. These two differences
allow the clustered TLB to capture more cluster locality than CoLT and without any OS
changes. But, as the authors observe, not all translation mappings exhibit such cluster locality;
having too many holes within a cluster would unnecessarily waste resources. Therefore, they
propose a multi-granular TLB design where a clustered TLB and a conventional TLB are both
probed in parallel. This design is further enhanced with a frequent value locality optimization
that will be discussed in Section 3.7. Bhattacharjee reports that “TLB coalescing schemes are
being adopted by industry (e.g., AMD’s Zen chip supports TLB coalescing today)” [17]. AMD’s
“Zen” microarchitecture supports “PTE coalescing [that] [c]ombines 4K page tables into 32K
page size” [27].
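A sketch of building one clustered-TLB entry under these rules (an illustrative representation, not the paper's exact format):

```python
def build_cluster_entry(mapping, vpn, cluster_factor=4):
    """The entry covers the aligned VPN cluster containing `vpn`.
    Each slot holds the PPN's offset inside an equally sized, aligned
    PPN cluster; holes (None) are allowed, but a PPN outside that
    cluster makes the mapping uncoverable by a single entry."""
    bits = cluster_factor.bit_length() - 1
    vbase = (vpn >> bits) << bits
    pbase = (mapping[vpn] >> bits) << bits
    slots = []
    for i in range(cluster_factor):
        ppn = mapping.get(vbase + i)
        if ppn is None:
            slots.append(None)             # hole: permitted
        elif (ppn >> bits) << bits == pbase:
            slots.append(ppn - pbase)      # may map anywhere in the cluster
        else:
            return None                    # falls outside the PPN cluster
    return {"vpn_base": vbase, "ppn_base": pbase, "slots": slots}

entry = build_cluster_entry({8: 20, 9: 23, 11: 21}, vpn=8)
assert entry == {"vpn_base": 8, "ppn_base": 20, "slots": [0, 3, None, 1]}
```

Unlike CoLT's contiguous runs, the PPN offsets here need not follow the VPN order, which is exactly the extra cluster locality this design captures.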
Our work relies on page contiguity identified by the OS and does not exploit any interme-
diate degrees of contiguity. The proposed coalescing schemes coalesce translations for only one
page size per set-associative structure. For example, CoLT proposed CoLT-SA coalescing for
a set-associative TLB that supports the smallest page size, while a separate design, CoLT-FA,
coalesced translations for the fully-associative TLB that supports superpages. Coalescing sup-
port for multigrain set-associative structures (i.e., structures that support multiple page sizes)
is far from straightforward, especially since coalescing already requires modification of the set-
indexing scheme. Configuring FMN to support CoLT-like or clustering contiguity might be
possible, if the risk of wasted resources due to holes in a cluster is mitigated. Having a separate
small cacheable FMN that tracks such groups of pages might be an alternative.
2.3.2 Techniques that Reduce TLB Miss Latency Overhead
Unlike the aforementioned TLB miss reduction works, this section reviews research that targets
the TLB miss latency overhead. That is, if one cannot reduce the number of TLB misses, is
it possible to make them less costly? MMU caches and speculative translations are two such
options.
MMU caches, employed by many current commercial designs, logically reside between the
TLB hierarchy and the page tables. By caching parts of the page walk, MMU caches reduce
the number of page-walk required memory accesses and thus the TLB-miss latency. The main
insight here is the presence of temporal locality in the high levels of a multi-level page table,
i.e., memory accesses that share the most significant virtual address bits. AMD64 processors
employ a Page Walk Cache (PWC) [3, 15], a “fully-associative, physically-tagged page entry
cache” that hosts page entries from all page table levels but the last one. This MMU cache
type is also referred to as page table cache [10] because it provides the physical address for
the next (lower) level page table. Intel’s processors employ paging structure caches [39], also
referred to as translation caches [10]. Contrary to the page table caches, translation caches are
virtually tagged and a single entry can skip more than one memory access. A PML4 cache
skips accessing the topmost page-table level, a PDPT-entry cache skips the top two levels,
while a PDE cache can skip the top three [39]. Barr et al. first explored the effect of these
types of MMU caches [10], including the newly proposed translation-path cache, while more
recently Bhattacharjee proposed coalescing and sharing modifications [16], grounded in the
same observations that guided the CoLT and shared TLB designs.
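The latency benefit of these translation caches can be illustrated with a simple access-count model (a sketch; tags here are raw virtual-address prefixes, ignoring the caches' real organization and capacity):

```python
def walk_accesses(va, pml4e_cache, pdpte_cache, pde_cache):
    """Memory accesses needed by a 4-level x86-64 walk, given which
    virtually tagged paging-structure caches hit: a PDE hit leaves
    only the leaf PTE access; a PDPTE hit leaves two; a PML4E hit
    leaves three; otherwise all four levels are walked.
    Cache contents are modelled as sets of VA prefixes."""
    if va >> 21 in pde_cache:    # skip the top three levels
        return 1
    if va >> 30 in pdpte_cache:  # skip the top two levels
        return 2
    if va >> 39 in pml4e_cache:  # skip the topmost level
        return 3
    return 4

va = 0x7F12_3456_7000
assert walk_accesses(va, set(), set(), {va >> 21}) == 1
assert walk_accesses(va, set(), set(), set()) == 4
```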
Another mechanism that attempts to hide the page walk latency is SpecTLB [11]. SpecTLB
speculates as to what the virtual to physical translation will be on a TLB miss, allowing for
memory accesses and other useful computation to proceed speculatively and in parallel with
the page table walk. Note that a TLB miss most probably denotes the presence of a cache
miss too. The proposed system takes advantage of the unique characteristics of a reservation-
based memory allocation system (FreeBSD). On a page fault, the OS might choose to reserve a
superpaged-size region (large page reservation) instead of the default small page, if it predicts
that the entire large page reservation is likely to be used. When this large page reservation
is filled, i.e., all small pages within it are accessed, it is promoted to a superpage. SpecTLB
takes advantage of this memory allocation algorithm. Whenever an address that misses in the
TLB falls within a partially filled large-page reservation, SpecTLB provides a speculative
translation based on the assumption that the reservation will eventually be promoted to a single
large page. Even with heuristic reservation detection, SpecTLB overlaps on average 57% of
the page table walks with successful speculative execution.
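The speculation itself is simple interpolation within the reservation; a sketch, under the assumption that reservations are tracked as a map from large-page-aligned virtual bases to physical bases:

```python
LARGE_PAGE = 2 << 20  # assume a 2MB superpage for illustration

def spec_translate(va, reservations):
    """If `va` falls inside a tracked large-page reservation, bet that
    the reservation will be promoted and interpolate the physical
    address; the real page walk later confirms or squashes this."""
    base = va & ~(LARGE_PAGE - 1)
    phys_base = reservations.get(base)
    if phys_base is None:
        return None  # no reservation covers this address: just walk
    return phys_base + (va - base)

reservations = {0x0020_0000: 0x4000_0000}
assert spec_translate(0x0020_1234, reservations) == 0x4000_1234
assert spec_translate(0x0040_0000, reservations) is None
```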
Our FMN proposal (Chapter 5) also targets TLB miss latency reduction. It leverages the
idea that retrieving a translation with a single memory access is faster than a multi-level page
walk. If MMU caches are present in the system, a configuration not evaluated in this thesis,
then the FMN should be probed either in parallel with or after the MMU caches. The MMU
caches’ location and access latency will likely influence this decision.
2.3.3 Techniques that Revisit Address Translation/Paging
Most research works reviewed earlier in this chapter attempt to optimize address translation
within the existing virtual memory paradigm of paging. Nonetheless, some have followed a
different direction, revisiting how virtual memory is supported.
Basu et al. proposed using direct segments to map large contiguous virtual memory regions,
associated with key data structures, to contiguous physical memory regions [12]. Virtual ad-
dresses that belong to a direct segment do not suffer from TLB misses; instead, they are mapped
to physical addresses via minimal hardware that co-exists with the TLBs. The main motivation
was that big-memory workloads not only pay a hefty penalty due to paging, but an unnecessary
one as they do not benefit from the facilities paged virtual memory provides. Specifically, Basu
et al. observe that “For the majority of their address space, big-memory workloads do not re-
quire swapping, fragmentation mitigation, or fine-grained protection afforded by current virtual
memory implementations. They allocate memory early and have stable memory usage.” [12].
As the authors point out, direct segments do not replace paging. The two mechanisms co-exist;
virtual memory addresses outside a direct segment are mapped via paging. Direct segments re-
quire significant software support; the programmer needs to identify a memory region amenable
to this optimization, and the OS needs to consider this in its memory allocation algorithm. The
OS is also responsible for managing the special hardware registers that support direct segment
mappings, e.g., by updating them on a context switch.
Even though direct segments can reap significant benefits for workloads with a single direct
segment that the programmer can easily identify (e.g., database workloads), they cannot
be extended to different application types, they are not transparent to the application, and they
are limited to one segment per application. These limitations are addressed in the Redundant
Memory Mappings (RMM) proposal [45]. In RMM, Karakostas et al. propose range mappings
for multiple “arbitrarily large” and contiguous virtual memory regions that, similar to a direct
segment, each map to a contiguous physical memory region. Translations for these mappings
are hosted in a fully-associative range-TLB, probed in parallel to the conventional L2-TLB, and
a range table, similar to a page table. Page table entries are augmented with a range bit to
specify that a page has a range-table entry. On a last-level TLB miss on both TLB types, a page
walk takes place first. Then, if the range bit is set, the range table is accessed in the background.
This access happens off the critical path and updates the range TLB with a range translation.
Next time an address within this range mapping misses in the L1-TLB, it will hit in the range-
TLB (unless evicted); the relevant page translation will then be installed in the conventional
L1 TLB. Beyond the required architectural support, RMM also requires explicit OS support to
manage range translations and update the range table. The authors also modified the memory
allocation algorithm to support eager paging. This algorithm generates more memory regions
amenable to range mappings during memory allocation, but might inadvertently fragment the
memory space.
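A range translation can be thought of as a (base, limit, offset) triple. The following is a minimal illustrative model of a range-TLB lookup, not RMM's exact implementation:

```python
# Minimal sketch of an RMM-style range translation (simplified; class
# and method names are ours). A range entry maps an arbitrarily large
# contiguous virtual region to a contiguous physical region.
class RangeTLB:
    def __init__(self):
        self.entries = []  # fully associative: list of (base, limit, offset)

    def insert(self, base_vpn, limit_vpn, base_pfn):
        self.entries.append((base_vpn, limit_vpn, base_pfn - base_vpn))

    def lookup(self, vpn):
        # One entry covers every page in [base, limit], so a single
        # range-TLB entry can replace thousands of conventional entries.
        for base, limit, offset in self.entries:
            if base <= vpn <= limit:
                return vpn + offset
        return None
```

The constant per-entry offset is what makes the mapping "redundant" with the page table: any page inside the range can be translated without consulting its individual page-table entry.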
None of the aforementioned research proposals abandons paging; instead, the proposed mechanisms
only use paging when necessary. Because they do not needlessly use multiple TLB
entries to track translations for virtual memory regions that could be tracked by direct seg-
ment(s), they improve TLB utilization. All our proposed TLB designs can be used for virtual
memory regions not amenable to direct segments or similar optimizations and alongside any
structures that might support them. For FMN, some co-design might be needed if the non-
paging optimization involves a separate page table walk as in RMM [45]. For example, the
FMN could trigger the range table walk on a hit before the page walk completes.
2.3.4 Techniques that Reduce Address Translation Energy
Address translation not only impacts performance, but also involves a significant energy cost;
translation caching structures like the D-TLBs are accessed on every memory operation that
needs a translation, as is the case with today's widespread VIPT caches. Some TLB designs exacerbate this
energy cost when multiple structures (e.g., split per page size L1 TLBs) are accessed in parallel.
Chapter 4 presents our Prediction-Based Superpage-Friendly TLB designs that can support
multiple page sizes within the same associative structure, significantly reducing TLB lookup
energy. Related work on supporting multiple page sizes is reviewed there. This section briefly
reviews other research that also targets energy.
Karakostas et al. proposed Lite [43], a mechanism that disables TLB ways at runtime to
reduce dynamic address-translation energy. Their motivation is twofold: (i) parallel lookups
of different per page-size TLB structures waste dynamic energy, and (ii) page-table walks can
also consume significant energy. The former echoes an observation we had made earlier,
which motivated our superpage-friendly set-associative designs [55] discussed in Chapter 4.
The authors’ approach to energy reduction is different than ours. They do not replace the
multiple per page-size associative L1 TLBs, but rather downsize them dynamically by disabling
a power-of-two subset of TLB ways, an idea originally proposed for caches [1]. Their operating
principle is that if hits for a specific page-size dominate, then reducing the sizes of the TLBs that
support other page-sizes will save energy, with minimal, if any, performance overhead. Using
an interval-based scheme, they dynamically identify the number of ways they can disable while
keeping Misses Per Kilo Instructions (MPKI) changes within an acceptable threshold. Their
decision algorithm occasionally re-enables all TLB ways, based on some random probability,
to avoid pathological cases. They also couple Lite with their Redundant Memory Mappings
proposal [45]; by adding an FA L1 range-TLB (recall the range-TLB was L2 in RMM [45]),
they can more aggressively trim down ways from the conventional L1 TLBs. The architectural
and explicit OS level support required for RMM, reviewed in Section 2.3.3, applies here too.
This paper also includes a comprehensive listing of research papers that target TLB energy
reduction, many at the circuit level.
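The interval-based decision can be sketched as follows; the threshold, the halving policy, and the function interface are illustrative assumptions rather than Lite's exact algorithm:

```python
# Hedged sketch of a Lite-style interval decision (the threshold and
# the halving policy are illustrative, not the paper's exact scheme).
def ways_to_enable(enabled_ways, mpki_now, mpki_baseline,
                   threshold=0.05, max_ways=4):
    """Shrink the enabled ways while the MPKI degradation stays within
    the threshold; otherwise restore all ways."""
    if mpki_baseline == 0:
        degradation = 0.0
    else:
        degradation = (mpki_now - mpki_baseline) / mpki_baseline
    if degradation > threshold:
        return max_ways            # performance suffered: re-enable everything
    if enabled_ways > 1:
        return enabled_ways // 2   # try a smaller power-of-two configuration
    return enabled_ways
```

Run once per interval for each per-page-size TLB, this keeps the structures serving the dominant page size at full capacity while the others shed ways and their lookup energy.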
An alternative approach to reducing address translation energy is to reduce the frequency
of TLB accesses, and thus their energy, via Virtually-Indexed and Virtually-Tagged (VIVT)
instead of Virtually-Indexed and Physically-Tagged (VIPT) L1 caches. With a VIVT L1 cache,
the TLB needs to be accessed only on a cache miss. Given the high hit-rates of L1 caches, this
design can result in significant energy savings. Furthermore, removing the TLB access from
the critical path of an L1 cache access frees the TLB design from strict latency constraints,
potentially allowing TLBs to grow in size.
Unfortunately, other challenges prevent VIVT L1 caches from becoming prevalent. For
example, unless ASID information is included in the cache tag entry, the entire cache would need
to be flushed on a context switch to correctly deal with homonyms, identical virtual addresses
that belong to different address spaces and map to different physical addresses. Synonyms
(different virtual pages mapping to the same physical page) are also harder to support in VIVT
caches and can complicate cache coherence; cache coherence would require a reverse translation
lookup to identify a specific cache line (or cache lines in the presence of synonyms) via a physical
address. Basu et al. proposed Opportunistic Virtual Caching (OVC), a hybrid L1 cache where
each block is cached either virtually (VIVT) or physically (VIPT) [13]. They rely on the OS to specify which
addresses are amenable to virtual caching, which is enabled “when it is safe (i.e., no read-write
synonyms) and efficient (i.e., few permission changes)” [13]. Yoon et al. dynamically detect
and remap synonyms, thus revisiting virtual L1 cache design [85], whereas Park et al. also use
synonym detection and advocate for virtual caches throughout the cache hierarchy [56]; caches
below L1 are prevalently Physically-Indexed and Physically-Tagged (PIPT).
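A toy model makes the homonym problem concrete: unless the ASID is part of the tag, two address spaces that reuse the same virtual address collide in a VIVT cache. All names and values below are made up for illustration:

```python
# Toy illustration of the homonym problem in a VIVT cache. Without the
# ASID in the tag, process B would hit on process A's line for the same
# virtual address and read the wrong data.
cache = {}  # tag -> data

def vivt_fill(va, asid, data, tag_has_asid):
    tag = (va, asid) if tag_has_asid else va
    cache[tag] = data

def vivt_lookup(va, asid, tag_has_asid):
    tag = (va, asid) if tag_has_asid else va
    return cache.get(tag)
```

With ASID-less tags, the only safe alternative is flushing the cache on every context switch; the synonym problem is harder still, since it involves multiple valid virtual tags for one physical line.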
Kaxiras et al. approach the problem from a different perspective: instead of redesigning
virtual caches to solve the synonym problem, they advocate for a cache coherence protocol re-
design [46]. They observe that “Virtual-cache coherence (supporting synonyms) without reverse
translations is possible with a protocol that does not have any request traffic directed towards vir-
tual L1s; in other words, a protocol without invalidations, downgrades, or forwardings, towards
the L1s.” [46]. Their previously proposed VIPS-M [65] protocol meets these requirements and
enables VIVT L1 caches. In this new design, the TLBs could either be private structures probed
after the L1, or a shared banked structure placed alongside the Last-Level Cache (LLC). The
latter requires page-colouring to ensure a memory request accesses the TLB and the LLC in the
same tile (bank) and does not incur network traffic overhead. It also removes the need for TLB
consistency while reaping the benefits of a shared TLB design (reviewed in Section 2.3.1.2).
2.3.5 Techniques that Address TLB Coherence Overheads
Besides all the overheads associated with walking the page tables, translation coherence has its
own challenges. TLB coherence describes the correctness requirement for the TLBs’ data to
be in sync (i.e., coherent) with the page tables. The term TLB consistency was originally
used [23, 81] to describe this requirement, but recent literature uses the terms TLB coherence
and consistency interchangeably, despite the nuanced but important distinction of the two in
the cache domain. This section also follows this nomenclature. In multiprocessor systems, TLB
consistency requires that any page table modifications made by one core (e.g., remappings,
invalidations) need to be propagated to the other cores, as their TLBs might host that stale
translation.
Early work in 1989-1990 [23,81] highlighted the translation consistency problem, proposing
different hardware or software solutions. Almost two decades later, two research papers [64,83]
highlighted the overhead that software-based TLB consistency - usually implemented via TLB
shootdown software routines that use inter-processor interrupts - incurs in multiprocessor sys-
tems. Romanescu et al. demonstrated that today’s software TLB shootdown mechanisms scale
poorly, latency wise, as the number of cores increases. They proposed a hardware coherence
mechanism in a scheme that unifies instruction, data, and translation coherence [64]. Villavieja
et al. also explored the impact of TLB shootdowns [83]. They identified two main issues.
First, TLB shootdowns are performed via the very costly and intrusive mechanism of precise
interrupts. Second, at the time of a TLB shootdown the OS does not know the exact set of
TLB sharers, and thus unnecessarily interrupts some processors. Their proposed scheme, DiDi,
avoids both these overheads by (a) keeping track of all the translation sharers in a dictionary
directory, and (b) by using a hardware invalidation mechanism instead of an interrupt.
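The sharer-dictionary idea behind DiDi can be sketched as follows (a deliberately minimal model; DiDi's actual directory is a hardware structure, not a per-VPN map):

```python
# Simplified sketch of DiDi-style targeted invalidation (structure and
# function names are ours). A dictionary directory records which cores
# cached a translation, so a shootdown touches only the actual sharers
# instead of interrupting every core.
sharers = {}  # VPN -> set of core ids that cached the translation

def record_fill(vpn, core):
    """Called when a core inserts a translation into its TLB."""
    sharers.setdefault(vpn, set()).add(core)

def shootdown_targets(vpn):
    """Only the recorded sharers need an invalidation; an empty list
    means no core other than the initiator holds the translation."""
    return sorted(sharers.get(vpn, set()))
```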
In this thesis, special instructions communicate any translation modifications/invalidations
to the TLBs, as Section 2.2.3.3 illustrated. Section 3.5 in the next chapter provides specifics
about the frequency of such operations. The superpage-friendly TLB designs from Chapter 4
can use the TLB consistency mechanism of any system, whereas the FMN design proposed in
Chapter 5 is configured as a speculative structure and is thus not kept coherent with the page tables.
2.3.6 Techniques that Target I/O Address Translation
There is limited work on the impact of I/O address translation on system performance. Yehuda
et al. were the first to explore the performance impact of an IOMMU on real hardware [14].
They measured throughput and CPU utilization with and without the IOMMU, when running
the FFSB and netperf workloads for disk and network I/O. In a system without a hypervisor, no
difference was seen in throughput, while CPU utilization increased by up to 60% with the
IOMMU enabled. There were two main sources for this overhead: (a) mapping and unmapping
entries in the page tables in memory, and (b) the system’s inability to selectively invalidate
IOTLB entries.
Amit et al. were the first to propose hardware and software optimizations to reduce IOTLB
miss rates [6]. They used a virtual IOMMU in order to collect I/O traces and ran netperf
and bonnie++ write tests to analyze network and disk I/O. They proposed page offsetting
for each device’s virtual I/O address space to avoid IOTLB hot spots for coherent mappings.
They also proposed a modification to Intel’s Address Locality Hints mechanism, which provides
hints as to whether prefetching of higher or lower adjacent pages should occur. Finally, in their
Mapping Prefetch (MPRE) scheme the OS provides the IOMMU with hints to prefetch the first
group of consistent mappings. Unfortunately, limited details are provided about the various
optimizations (e.g., the timeliness of their MPRE scheme).
2.3.7 Architectural Optimizations that Take Advantage of Address Translation
Address translation and its facilities not only present optimization challenges but also
offer opportunities. A different class of research does not aim at improving address translation
per se, but rather leverages existing address translation hardware or software for other
architectural optimizations. For example, R-NUCA [36] augmented page table entries with sharer
information to optimize placement of private versus shared data in a non-uniform cache setting.
Additional bits in the page tables and the TLBs were also used by a snooping mechanism in
virtualized systems [47] to identify private pages and pages shared across multiple virtual machines,
so as to filter snoops from cores not mapped to a given virtual machine. Most recently, Bhattacharjee
proposed TEMPO, “translation-enabled memory prefetching optimizations” [18]. He observed
that for big-data workloads 20-40% of DRAM accesses are due to page walks, and these DRAM
accesses are almost always followed by a DRAM access to retrieve the data. TEMPO identifies
these DRAM accesses and prefetches the data first into the DRAM row buffer and then into
the LLC. Once the memory instruction that missed in the TLB is replayed, it is expected to
hit in the LLC, or worst case in the row-buffer, saving both time and energy. TEMPO requires
modifications both in the page table walker and the memory controller.
The research works mentioned above represent only a small sample of related work. Sec-
tion 3.8 illustrates that there is predictability in the first cache block of a virtual page accessed
on a TLB miss. This observation could trigger additional optimizations. The benefit of looking
at the virtual address space is that it is more representative of the application behaviour at
a coarser granularity. The temporal ordering of the various data structures and application
accesses crossing page boundaries could be lost if one looks through the physical address lens.
Similarly, the spatial correlation between physical addresses might be harder to dynamically
extract/learn compared to virtual address correlation.
2.4 Concluding Remarks
This chapter (i) presented background information for address translation, and (ii) classified
and reviewed related work that targets different aspects of the address translation process
such as TLB miss reduction or energy (wherever relevant, additional related work information
will be provided in the following chapters). This review is not all-encompassing, but it high-
lights how design decisions for address translation structures, such as TLBs, permeate system
design affecting performance and energy. It also reflects the different approaches to address
translation optimizations: from purely micro-architectural to mostly OS or system-level or
hardware-software co-design, to name a few. This thesis opts for an architectural approach.
First, Chapter 3 analyzes TLB-related behaviour to better understand those application or
system-level characteristics that can influence the address translation cost. Then, Chapter 4
addresses the paucity of research in associative designs that can support multiple page sizes,
while Chapter 5 proposes a cacheable TLB to reduce TLB miss latency without hefty hardware
resources.
Chapter 3
TLB-related Behaviour Analysis
3.1 Overview
This chapter presents an exploration of the TLB-related workload behaviour for a set of appli-
cations emphasizing commercial and cloud workloads. Address translation caching structures
such as TLBs can improve performance but are not a prerequisite for correct execution. In a
design space that is on one end marked by a system without a TLB and on the other end
by a system with an ideal TLB - a utopian structure that has zero access time, zero stor-
age requirements, zero energy, and zero misses - there are multitudes of possible and realistic
designs.
As in most architectural designs, it would be possible to optimize the hardware design of ad-
dress translation structures (e.g., TLBs) to perform well for an individual workload, but such an
approach could be mainly applicable in reconfigurable architectures (e.g., Field-Programmable
Gate Arrays (FPGAs)). The core objective in system design is to first optimize for the common
case, and if possible have dynamic mechanisms to reduce the negative impact or further optimize
for the less common operating scenarios. This chapter’s goal is a comprehensive summary of
characteristics and metrics that we believe are of interest for anyone doing research in the area
of address-translation optimizations. The complex interactions between these characteristics
are mapped out, to the extent possible, and suggestions on design trade-offs are also described.
Sections 3.1.1 and 3.1.2 present a roadmap for the rest of this chapter. These sections list
the characteristics/metrics that will be reported along with a short justification as to why these
measurements were collected. The precise definition of each metric is given in later sections,
just before the measurements are presented. Wherever relevant, both the overall behaviour of
the characteristics as well as how they vary in time is presented. Some of the metrics shown,
e.g., MPMI, can be considered as proxy metrics for performance. It is key to understand that
no single metric is sufficient in isolation to decide the most appropriate TLB design or relevant
address translation optimization for a single workload. There is overlap in the information that
each measurement yields, and they all together formulate a complex set of trade-offs. This
problem becomes even more complex when trying to identify designs that work well for most
Chapter 3. TLB-related Behaviour Analysis 29
workloads, especially when these workloads have drastically different characteristics.
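For concreteness, the MPMI proxy metric used throughout this chapter is simply a per-million-instruction normalization:

```python
# Misses Per Million Instructions (MPMI), the proxy metric for
# performance used throughout this chapter, written out explicitly.
def mpmi(tlb_misses, dynamic_instructions):
    """MPMI = misses / (instructions / 10^6)."""
    return tlb_misses / (dynamic_instructions / 1_000_000)
```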
3.1.1 Characteristics Inherent to the Workload
Table 3.1 lists characteristics that are inherent to the workload and that are not influenced by the
hardware configuration of the TLBs or the organization of other address translation structures,
such as the page tables or the MMU caches. However, these characteristics, which can vary
greatly across workloads, can profoundly influence the impact of different TLB designs and
address translation optimizations on metrics such as performance, TLB miss handling latency,
etc. The main characteristics listed for each workload are: the number of unique translations,
the number of ASIDs (contexts), and the lifetime of the various translation mappings. The
memory allocation algorithm used by the operating system as well as the supported page sizes
can influence these characteristics, but modifying any of these parameters is beyond the scope
of this thesis.
Characteristics Measured / Brief Justification

Unique Per Core and Per Page-Size Translations (Section 3.3):
Sets an upper bound to the ideal TLB size. Useful when deciding among different TLB
sizes and private versus shared TLBs. The per page-size breakdown highlights issues
of current TLB structures.

Contexts (ASIDs): Count and Sharing Degree, Lifetimes, Frequency and Reach (Section 3.4):
The ASID count can help in TLB indexing scheme selection; e.g., a high count can
result in many TLB conflict misses for non context-aware indexing schemes. The
sharing degree can affect private versus shared TLB decisions. For example, if all
contexts have high sharing degrees, private TLBs would suffer from translation
replication and thus have less effective capacity.

Translation Mappings Lifetime (Section 3.5):
The lifetimes of translation mappings influence all TLB management schemes. For
example, large TLB capacities would be useless if all mappings had an extremely
short, one-time-access lifetime. History-based management schemes better cater to
long translation lifetimes.

Unique Bytes or Byte Sets in Translations (Section 3.7):
Hints at the compressibility of translation information. Can drive compression
schemes that reduce the size of structures needed to capture the translation
footprint.

Table 3.1: List of characteristics/metrics inherent to the workload presented in this analysis, along with a brief explanation.
To summarize, the goal of Sections 3.3 to 3.5 is to reveal behaviour that is inherent to the
program and thus aid in understanding any specific measurements obtained for specific TLB
configurations in the sections that follow.
3.1.2 Other Characteristics
Table 3.2 presents characteristics and metrics that, contrary to the ones depicted in Table 3.1,
are influenced by the structure of the TLBs and other address translation structures. Some
of these metrics, e.g., MPMI, can be used as proxy metrics for performance, evaluating the
effectiveness of a given TLB design, while other metrics outline what we consider interesting
opportunities for architectural optimizations.
Characteristics Measured / Brief Justification

MPMI and Hit-Rate for different TLB organizations (Section 3.6):
Effectiveness of different TLB structures. Influenced by metrics like unique
translations, context, page size, and translation lifetimes.

Cache Block Accessed after a TLB miss (Section 3.8):
Opportunity for Cache/TLB co-design.

Table 3.2: Other Measurements
3.2 Methodology
All graphs in this chapter were generated via functional trace-based full-system simulation.
Functional simulation allows for a longer execution sample and is appropriate for the aforemen-
tioned measurements. Even if accesses are reordered in a full-timing setting, it is unlikely that
this will affect metrics such as TLB MPMI, given that TLB misses are not extremely frequent
events and thus would not fall within that short time window.
The traces were collected using Flexus, from the SimFlex project [35], a full-system simulator
based on Simics [52]. Simics models the SPARC ISA and boots Solaris. We relied on Simics
API calls (e.g., probing TLBs/registers) to extract translation information. The traces include
both user and OS (privileged) memory accesses. The collected memory references correspond to
one billion dynamic instructions per core in a 16-core CMP, 16 billion in total. In our sample,
the progression of memory references implies the progression of time. For the remainder of this
chapter, the terms execution and execution time will interchangeably refer to the aforementioned
execution sample.
The memory traces were collected after the running workloads had reached a stable state,
that is, after they had passed their initialization phase, in order to obtain a representative
execution sample. No drastic changes were observed in the results throughout the aforemen-
tioned execution sample. We thus expect that similar trends will persist for longer executions,
until a workload enters a drastically different execution phase. Different execution phases are
likely to exacerbate some of the already observed trends. Unless a workload’s data footprint or
data use change (e.g., by allocating new data structures or accessing the existing ones with a
drastically different access pattern that spans page boundaries), it is unlikely that smaller-scale
phase changes would affect TLB-related trends.
The aforementioned execution sample is also within the same order of magnitude as what
is used in other research works that use architectural simulators [19, 43]. Simulating longer is
practically difficult given the slow simulation speeds. Research works that can be evaluated
via OS modifications and with measurements from hardware performance counters [12] can
be run for significantly lengthier evaluation intervals (e.g., several minutes). Even though it
would be feasible to collect some of this chapter’s results in a real system too, e.g., by using the
BadgerTrap [32] tool that instruments and collects x86-64 TLB misses, these results would be
naturally limited to the TLB organization of one hardware system. For example, a functional
simulator would still be needed to explore how different TLB organizations influence MPMI as
Section 3.6 does.
3.2.1 Workloads
Table 3.3 summarizes the set of eleven commercial, scale-out and scientific workloads used in this
work. These are standard state-of-the-art workloads, sensitive to modern TLB configurations,
that many other works have used [13,16,19,44,57].
Workload Class/Suite / Workload Name / Description

Online Transaction Processing (OLTP) - TPC-C:
TPC-C1: 100 warehouses (10GB), 16 clients, 1.4GB SGA
TPC-C2: 100 warehouses (10GB), 64 clients, 450MB buffer pool

Web Server (SpecWEB-99):
Apache: 16K connections, FastCGI, worker-threading

PARSEC [22] (native input-sets):
canneal: simulated annealing
ferret: content similarity search server
x264: H.264 video encoding

Cloud Suite [30]:
cassandra: Data Serving
classification: Data Analytics (MapReduce)
cloud9: SAT Solver
nutch: Web Search
streaming: Media Streaming

Table 3.3: Workloads
We opted for a variety of workloads versus an exhaustive representation of a single workload
suite. For workload suites such as PARSEC, a workload subset was selected based on each
workload’s data footprint and how much it stresses the TLBs. For example, canneal, ferret
and x264 have some of the highest weighted D-TLB misses per million instructions, as
presented in an earlier TLB characterization work [20].
3.3 Unique Translations
What is the number of TLB entries required to deliver the maximum possible hit-rate for each
workload given a conventional private or shared FA TLB? This is the main question that this
section addresses. If the TLB capacity was not limited by constraints like access-time or area,
having a fully-associative TLB with as many entries as the number of unique translations each
workload requires would result in no TLB capacity misses. A unique translation, from the
perspective of each core, is identified via the tuple {Virtual Page Number (VPN), Context,
Page Size}. Each such translation requires a separate TLB-entry. The physical frame this
tuple maps to can change over time, but this is not classified as a separate translation because
it would not occupy a separate TLB entry.
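The definition above can be stated as a few lines of measurement code; the trace record format is an assumption for illustration:

```python
# Sketch of the unique-translation count defined above: a translation is
# identified by the tuple {VPN, context, page size}, so a remapping to a
# different physical frame does not create a new translation.
def count_unique_translations(trace):
    """trace: iterable of (vpn, context, page_size, pfn) accesses."""
    seen = set()
    for vpn, context, page_size, _pfn in trace:
        seen.add((vpn, context, page_size))  # PFN deliberately excluded
    return len(seen)
```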
3.3.1 Per-Core Measurements
Tables 3.4 and 3.5 present a per page-size classification of the unique translations accessed per
workload, as seen from the perspective of each CMP core. The simulated system supports page
sizes of 8KB, 64KB, 512KB and 4MB. All pages with sizes other than the smallest one are
referred to as superpages. The total memory footprint these unique translations correspond to,
assuming all the data blocks in each page are accessed, is also shown.
Variations in the access pattern of each core result in different ideal TLB capacity requirements
for each private TLB. Therefore, the per-core tables below present the minimum, the
maximum, as well as the average number of unique translations per core along with the mea-
sured Standard Deviation (SD) to provide a more well-rounded picture. The maximum number
of unique translations would be the answer to the question posed in the beginning of this section,
assuming private FA TLBs, all with the same number of entries.
For the small 8KB pages, Table 3.4 shows that the average number of unique translations
accessed per core is one to two orders of magnitude larger than state-of-the-art L1 TLB
capacities. As the SD column hints (the detailed measurements are omitted), while for some
workloads all cores have similar TLB-capacity requirements (canneal) or there is only one
outlier core (apache), for other workloads (e.g., cloud9, x264) the variations across cores are
more pronounced, showing, in some cases, different clusters of cores in terms of their ideal TLB
requirements. Designing private TLBs with different space requirements could be meaningful
for workloads like these; a shared TLB where workloads could contend on-demand for TLB
capacity could be another alternative.
While the simulated system supports four page sizes, only the 8KB and 4MB page sizes
Workload Class / Workload / Min. Unique 8KB Translations (Footprint in MB) / Max. Unique 8KB Translations (Footprint in MB) / Avg. Unique 8KB Translations (Footprint in MB) / SD (σ) (σ as % of Avg.)
Commercial
apache 11,571 (90) 55,393 (433) 51,380 (401) 10,306 (20.06%)
TPC-C2 15,310 (120) 19,757 (154) 17,903 (140) 1,060 (5.92%)
TPC-C1 6,177 (48) 11,627 (91) 7,071 (55) 1,230 (17.39%)
canneal 68,047 (532) 68,471 (535) 68,132 (532) 112 (0.16%)
PARSEC ferret 3,537 (28) 9,299 (73) 7,304 (57) 2,152 (29.46%)
x264 22 (0.2) 9,862 (77) 2,513 (20) 2,754 (109.59%)
cassandra 11,392 (89) 18,819 (147) 14,727 (115) 2,316 (15.73%)
classification 411 (3) 1,336 (10) 918 (7) 276 (30.07%)
Cloud-Suite cloud9 9,145 (71) 74,033 (578) 28,014 (219) 19,278 (68.82%)
nutch 7,088 (55) 8,030 (63) 7,648 (60) 233 (3.05%)
streaming 20,295 (159) 56,449 (441) 53,110 (415) 8,506 (16.02%)
Table 3.4: Per-core unique translation characterization: 8KB pages. Footprint in MB is listed in parentheses for the min., max. and avg. (arithmetic mean) columns. SD is also expressed as a percentage of the average in parentheses.
Workload Class / Workload / Min. Unique Translations (512KB | 4MB) / Max. Unique Translations (512KB | 4MB) / Avg. Unique Translations (512KB | 4MB) / SD (σ) (512KB | 4MB) / Superpage Footprint in MB (Max. | Avg.)
Commercial
apache 28 4 54 7 50 4 6 1 55 41
TPC-C2 69 13,207 73 14,121 71 13,706 1 279 56,521 54,860
TPC-C1 4 410 7 748 6 544 1 95 2,996 2,179
canneal - 5 - 5 - 5 - 0 20 20
PARSEC ferret - 5 - 8 - 7 - 1 32 28
x264 - 1 - 8 - 4 - 3 32 16
cassandra - 1,280 - 1,534 - 1,469 - 90 6,136 5,876
classification - 121 - 832 - 549 - 192 3,328 2,196
Cloud-Suite cloud9 - 4 - 6 - 6 - 1 24 24
nutch - 174 - 190 - 185 - 4 760 740
streaming - 8 - 8 - 8 - 0 32 32
Table 3.5: Per-core unique translation characterization: superpages (i.e., 64KB, 512KB and 4MB pages). No 64KB pages were present.
were prominently used. No use of 64KB pages was observed, and very few, if any, 512KB
pages were used. There are multiple possible explanations for this page-size distribution. For
example, workloads that allocate and access large data structures likely exhibit sufficient
memory-address contiguity to use the largest supported superpage size (4MB) without risking
internal fragmentation. In such cases, the larger page size is preferable because it extends
the reach of each TLB entry. It is also possible, depending on the memory allocator used, that
smaller superpages were used during the initialization phase of these workloads and were later
promoted to 4MB pages once the workload reached a more stable
state. The intermediate page sizes might be reserved for specific purposes, e.g., I/O buffers;
we had observed usage of 64KB pages for I/O TLB accesses during an earlier research project.
They might also exist to support legacy devices. Even though the OS and its memory allocator
are responsible for page-size decisions, user-supplied hints can influence them.
For example, in Solaris, one can use the ppgsz utility to specify the preferred page size for
the heap or the stack. Lastly, it is also possible that the TLB configuration of the underlying
system influences the page size decisions of the memory allocation algorithm. For instance, if
there is little hardware support for a given page size, the OS might avoid using it.
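The reach argument above can be made concrete with a back-of-the-envelope calculation; the 64-entry TLB below is a hypothetical example, not a configuration evaluated in this thesis:

```python
# TLB reach = number of entries x page size. The four page sizes match
# those supported by the simulated system; the 64-entry TLB is a
# hypothetical example for illustration only.
ENTRIES = 64
KB, MB = 1024, 1024 * 1024

for name, page_size in [("8KB", 8 * KB), ("64KB", 64 * KB),
                        ("512KB", 512 * KB), ("4MB", 4 * MB)]:
    reach = ENTRIES * page_size
    print(f"{name:>6} pages: reach = {reach / MB:g} MB")
```

With 4MB pages, the same 64 entries cover 512 times more memory than with 8KB pages, which is the reach benefit described above.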
As Table 3.5 shows, the use of superpages varied drastically across workloads. The OLTP
workloads TPC-C1 and TPC-C2 (commercial database systems) and three scale-out applica-
tions from the Cloud benchmark suite (cassandra, classification, and nutch) use 4MB pages.
These workloads can easily thrash an unbalanced TLB design with limited superpage capac-
ity, while the 8KB-only workloads waste energy looking up any separate superpage-only TLB
structures. Chapter 4 presents our proposed prediction-based superpage-friendly TLB designs
that address this challenge.
3.3.2 CMP-Wide Measurements
The discussion so far has focused on per-core measurements of unique translations, as these can
better inform design decisions for the private TLB designs that are prevalent today. Table 3.6
next reports CMP-wide measurements. These CMP-wide unique translation counts reflect
the capacity (number of entries) of a fully-associative TLB, shared across all 16 cores, that
would yield no TLB capacity misses. These values are also shown normalized to the average
per-core unique translation counts from Table 3.4 for 8KB pages and from Table 3.5 for
superpages respectively. The range of the normalized values is [1, 16] inclusive. Workloads with low
degrees of data sharing have normalized values close to 16 (e.g., cloud9 for 8KB pages); each
core in these cases accesses an almost distinct part of the data footprint at the page granularity.
Conversely, workloads with high degrees of data sharing have normalized values close to 1 (e.g.,
ferret, canneal for 8KB pages); here most cores access the same data footprint, and thus the
CMP-wide unique measurements closely match the average per-core values. Translations for
superpages, when a workload only accesses a small number of them, also fall under this second
category with a normalized value close to 1. However, when a significant number of superpages
is accessed, as for example in cassandra and classification, these superpages are mostly private.
Whether or not different processes run on different cores also contributes to this
private-versus-shared distinction, as Section 3.4 will discuss.
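The normalized values reported in Table 3.6 are simple ratios over the per-core averages; the sketch below reproduces the cloud9 8KB entry from Tables 3.4 and 3.6 (the function name is ours, for illustration):

```python
# Normalized CMP-wide count = CMP-wide unique translations / per-core average.
# Values lie in [1, 16] for a 16-core CMP: near 16 means little sharing
# (mostly private pages); near 1 means most cores touch the same pages.
def normalized_unique(cmp_wide: int, per_core_avg: float) -> float:
    return cmp_wide / per_core_avg

# cloud9, 8KB pages: 447,302 CMP-wide unique translations (Table 3.6),
# 28,014 per-core average (Table 3.4).
print(round(normalized_unique(447_302, 28_014), 2))  # ~15.97: mostly private
```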
3.4 Contexts
Context IDs are not traditionally part of TLB indexing schemes. As Section 2.2.2 mentioned,
only parts of the VPN are commonly used for TLB indexing. However, the presence of multiple
Workload Class  Workload        Unique 8KB        Unique 512KB     Unique 4MB
                                Translations      Translations     Translations
                                (Norm. over       (Norm. over      (Norm. over
                                Per-Core Avg.)    Per-Core Avg.)   Per-Core Avg.)
Commercial      apache          138,919 (2.7)     56 (1.12)        7 (1.75)
                TPC-C2          34,575 (1.93)     85 (1.2)         31,847 (2.32)
                TPC-C1          31,455 (4.45)     36 (6)           4,807 (8.84)
PARSEC          canneal         69,287 (1.02)     -                5 (1)
                ferret          14,294 (1.96)     -                9 (1.29)
                x264            16,619 (6.61)     -                9 (2.25)
Cloud-Suite     cassandra       110,418 (7.5)     -                23,397 (15.93)
                classification  10,911 (11.89)    -                8,137 (14.82)
                cloud9          447,302 (15.97)   -                6 (1)
                nutch           28,226 (3.69)     -                2,871 (15.52)
                streaming       768,854 (14.48)   -                8 (1)
Table 3.6: CMP-wide unique translation characterization: 8KB pages and Superpages.
contexts (also known as ASIDs) can apply more pressure on specific TLB sets as different
processes may use identical VPNs to refer to otherwise different physical frames. Even if these
different VPNs refer to the same physical frame (synonyms), a different translation entry is
needed, as the previous section mentioned. This TLB-set pressure is anticipated to be more
pronounced in shared TLB structures. Thus, context-aware TLB management schemes could
be a compelling alternative as they could prevent translation entries with the same VPN, but
with different contexts, from mapping to the same TLB set. Beyond this, analyzing contexts
provides a coarser-grain lens through which to interpret measured TLB behaviour (e.g., TLB
MPMI).
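One hypothetical way a context-aware scheme could keep identical VPNs from different contexts out of the same set is to fold context bits into the set index; the sketch below is illustrative only, not a design evaluated in this thesis:

```python
# Hypothetical context-aware set index: XOR low context-ID bits into the
# low VPN bits that would normally form the index (Section 2.2.2).
NUM_SETS = 64  # e.g., a 256-entry, 4-way set-associative TLB

def plain_index(vpn: int) -> int:
    return vpn % NUM_SETS

def context_aware_index(vpn: int, context_id: int) -> int:
    return (vpn ^ context_id) % NUM_SETS

# Two processes touching the same VPN collide under plain indexing,
# but land in different sets once the context is hashed in:
vpn = 0x1A2B3
print(plain_index(vpn), plain_index(vpn))                        # same set twice
print(context_aware_index(vpn, 5), context_aware_index(vpn, 9))  # different sets
```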
3.4.1 Context Count and Sharing Degree
CMP-Wide Measurements: Figure 3.1 depicts the number of unique contexts (y-axis) ob-
served in the entire CMP for a set of workloads (x-axis). All bars are further colour-coded to
indicate the number of cores that issue memory requests for this context. Private is a context
that appeared only in a single core during the workload’s execution, while Shared is a context
present in all 16 cores. All workloads have at least one shared context, context zero. Four
additional categories represent contexts shared by 2 to 15 cores.
Two of the commercial workloads, apache and TPC-C2, have a considerable number of
contexts with varying degrees of sharing. TPC-C2 has an order of magnitude more contexts
than the other TPC-C variant, TPC-C1, that is running on a different database. Almost all
[Figure 3.1: bar chart of the number of unique contexts observed in the CMP per workload: apache 172, TPC-C2 205, TPC-C1 37, canneal 4, ferret 4, x264 4, cassandra 23, classification 133, cloud9 19, nutch 22, streaming 35. Bars are colour-coded by sharing degree: Private (1 core), 2 cores, [3,4] cores, [5,8] cores, [9,15] cores, Shared (all 16 cores).]
Figure 3.1: Number of unique contexts observed in the CMP; the number is also listed on the top of each column. Each column is colour-coded based on the number of core-sharers.
simulated Cloud-Suite workloads are dominated by private contexts. Ferret, from PARSEC, is
the only workload without a private context in any of the CMP cores; it has two contexts shared
by all 16 cores and two contexts in the [9, 15] category, shared by 10 and 13 cores respectively.
PARSEC workloads have two shared contexts, context 0 and the application's context, which are
responsible for nearly all TLB accesses, as Section 3.4.3 shows.
Per-Core Measurements: Figure 3.2 further details the number of unique contexts for each
CMP core (i.e., core 0 to core 15), following the same colour scheme as Figure 3.1. Each
of the three subfigures plots data for a different workload class. Overall, there is little
variation across cores; the standard deviation of the per-core context count is below one for
all workloads except classification (2.8), TPC-C2 (2.4), and apache (20.6); the last is due
solely to core #1.
We believe that the significantly smaller context count of apache's core #1, the sole
contributor to the aforementioned high standard deviation, is due to the system's Interrupt
Request (IRQ) affinity set-up. In some operating systems, such as Solaris here, only specific
CPUs can service IRQs from specific devices. We speculate that for apache, CPU #1 on the
server side is responsible for servicing interrupts triggered by the network adapter
(Figure 2.3 earlier showed a system snapshot of the server's network I/O). Unfortunately, the kernel debugger
in the Solaris version used in this Simics checkpoint did not have the capability to display the
interrupt affinity table. But our assumption is supported by the following two observations:
(i) running mpstat showed CPU1 had an order of magnitude more interrupts than the other
15 CPUs, and (ii) running vmstat -i showed two sources of interrupts: clock and hmec0, the
controller for the "cheerio-hme Network Adapter". The network adapter's interrupt rate from
vmstat is similar to the interrupt count reported by mpstat for CPU1.
As Figure 3.2’s results indicate, the context behaviour is inherently tied both to the type
[Figure 3.2: three bar charts of the number of unique per-core contexts, one column per core (cores 0 to 15), colour-coded by sharing degree as in Figure 3.1.]
(a) Commercial workloads (apache, TPC-C2, TPC-C1)
(b) PARSEC workloads (canneal, ferret, x264)
(c) Cloud workloads (cassandra, classification, cloud9, nutch, streaming)
Figure 3.2: Number of unique per-core contexts for three workload classes. Each column corresponds to a different core in the range [0, 15] in ascending order.
of the running workload and its specific configuration. For example, multi-threaded workloads
such as PARSEC, or even more traditional high-performance computing workloads, are likely to
have multiple threads of a single process running across all CMP cores, unless they are running
in a multi-programmed environment. For commercial server-type workloads, multiple processes
are expected to handle different tasks; their numbers can drastically vary as the example of the
two TPC-C instances running on different database systems demonstrates. Cloud (scale-out)
workloads, although they can occasionally be configured to run in a single process, are more
naturally suited to deploying multiple processes for scalability, to leverage the available
cores. It is likely that such behaviour will be more prevalent in future workloads. In all cases,
because of the small overall standard deviation, the context behaviour observed at a given
core, or known a priori due to workload profiling, could drive decisions about context-aware
TLB-indexing schemes and/or shared TLB structures. These decisions could be dynamically
adapted based on a system’s target workload, to reflect the trends described above.
3.4.2 Context Lifetimes (Within Execution Sample)
For the workloads that have multiple contexts, it is important to examine the lifetime of these
contexts; that is, whether they are persistent throughout the workload execution or they are
created and destroyed over time. It is contexts that overlap in time that could stress the TLBs
due to increased set conflicts. The inherent limitations of architectural simulation preclude us
from simulating workloads in their entirety. Therefore, in this section, the term context lifetime
denotes the portion of the execution sample during which a context (process) issues memory
accesses on each core. A context’s “lifetime” starts once the first memory request from that
context is issued and ends when the last memory request from that context is issued. It is
possible that this context is not destroyed, but issues more requests later in time. However, we
believe that observing the contexts' "lifetime" within this sample can provide useful information
about overlapping contexts and their use over time. Section 3.4.3 will later discuss contexts’
significance, e.g., the percentage of TLB accesses a context is responsible for.
Figure 3.3 depicts the lifetime of each context as seen from the perspective of each core for
workloads with more than 30 per core contexts. The x-axis represents the passage of time as
measured in a trace-driven functional simulator; one cycle corresponds to one memory request.
Each plotted horizontal line starts when the first memory request for a given (context, core)
was issued and ends when the last memory request was issued for that (context, core). For
example, if a context is accessed by all 16 cores, it will appear as 16 distinct lines on the graph.
Each figure label also lists the average lifetime duration as a percentage of the workload’s
execution time. If a core had two running contexts, with the first triggering requests for the
first quarter of the execution sample and the second triggering requests throughout the sample,
a context lifetime graph would include two horizontal lines both with the same starting time.
The first line would span the first 25% of the x-axis, while the second line would span the entire
x-axis (execution sample). The average context lifetime in this example, which is used solely
for illustration purposes, would be 62.5% of the execution time. In Figures 3.3a and 3.3b the
lifetimes are sorted by duration, with the longer lifetimes at the top, to highlight the
presence of contexts with short lifetimes, while in the remaining figures the lifetimes are
sorted by their initial start time.
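The bookkeeping behind the illustrative two-context example can be expressed directly; the interval endpoints below are hypothetical request counts, not measured data:

```python
# A context's "lifetime" on a core spans from its first to its last memory
# request; the average is taken over all (context, core) pairs and reported
# as a percentage of the execution sample.
def avg_lifetime_pct(intervals, sample_len):
    spans = [(end - start) / sample_len for start, end in intervals]
    return 100 * sum(spans) / len(spans)

# The two-context example from the text: one context active for the first
# quarter of the sample, the other active throughout.
sample = 1_000_000
print(avg_lifetime_pct([(0, 250_000), (0, 1_000_000)], sample))  # 62.5
```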
[Figure 3.3: per-(context, core) lifetime lines over functional-simulation time (one simulation cycle per memory request).]
(a) apache (61%); lifetimes sorted by duration, not start time
(b) TPC-C2 (84%); lifetimes sorted by duration, not start time
(c) TPC-C1 (65%)
(d) classification (36%)
(e) streaming (64%)
Figure 3.3: Context lifetimes. The average context/core lifetime is listed in parentheses as a percentage of the workload's execution time sample.
In all cases, a large number of contexts is actively issuing memory requests throughout
the entire workload execution sample. This observation is more pronounced in TPC-C2 where
on average each context/core is active for 84% of the workload’s execution sample. On the
other hand, classification has many contexts that are active for a short period of time with an
average lifetime of 36% of the workload’s execution sample. As the classification plot shows
(Figure 3.3d), many contexts often start when the lifetimes of other contexts end, most probably
reflecting the various map-reduce operations performed by this data-analytics workload. These
results indicate that contexts’ lifetimes usually overlap in time, which might encourage context-
aware TLB designs in the future. They also show that some contexts might be used only for a
short period within an execution sample; details such as the lifetime duration or their number
vary greatly across workloads. Even though longer execution times would be ideal, these
results, which are naturally based on the specific execution samples, provide a snapshot of
common context behaviour. Dynamic hardware policies usually adapt their behaviour based on
much smaller execution intervals.
3.4.3 Context Significance: Frequency and Reach
The previous sections explored context characteristics like count, sharing degree, and context
lifetimes. But are all contexts of equal importance when it comes to TLB behaviour? If all
contexts are not to be treated homogeneously, for example by filtering out translations for
some contexts to reduce TLB pollution, how can we quantify context significance? We use
the following two metrics: the number of TLB accesses initiated by each context (frequency),
and the number of unique translations accessed (reach). This section's measurements highlight
that, aside from one or two prominent contexts (e.g., context 0), most contexts have small
individual contributions, percentage-wise, which nonetheless add up cumulatively.
As Table 3.7 shows, context 0 is responsible for a significant number of TLB accesses as well
as unique translation entries accessed in many of the simulated workloads. For example, it is
responsible for almost half the total TLB CMP accesses and unique translations in apache. It
is the most prominent of all contexts in all workloads, with the exception of canneal, ferret, and
cloud9. The number of unique translations is measured from the perspective of each CMP core
(Section 3.3.1). The measurements presented in Table 3.7 are the sum of these per core unique
translation measurements according to Equation 3.1. Equation 3.2 shows the computations for
each context’s frequency of accesses.
\[
\%\,\text{Unique 8KB Translations} \;=\; \frac{\displaystyle\sum_{core=0}^{15} \text{Unique Translations}_{core,\;ctxt=0,\;page\ size=8KB}}{\displaystyle\sum_{page\ size}\;\sum_{ctxt}\;\sum_{core=0}^{15} \text{Unique Translations}_{core,\;ctxt,\;page\ size}} \tag{3.1}
\]

\[
\%\,\text{CMP TLB Accesses} \;=\; \frac{\displaystyle\sum_{core=0}^{15} \text{TLB Accesses}_{core,\;ctxt=0}}{\displaystyle\sum_{ctxt}\;\sum_{core=0}^{15} \text{TLB Accesses}_{core,\;ctxt}} \tag{3.2}
\]
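Equations 3.1 and 3.2 reduce to simple ratios over per-core counters; a sketch with hypothetical dictionaries keyed by (core, context, page size) and (core, context) respectively:

```python
# unique[(core, ctxt, page_size)] -> unique translation count per core/context
# accesses[(core, ctxt)]          -> TLB access count per core/context
def pct_unique_8kb(unique, ctxt=0):
    num = sum(v for (core, c, ps), v in unique.items() if c == ctxt and ps == "8KB")
    den = sum(unique.values())  # all cores, contexts, and page sizes (Eq. 3.1)
    return 100 * num / den

def pct_tlb_accesses(accesses, ctxt=0):
    num = sum(v for (core, c), v in accesses.items() if c == ctxt)
    den = sum(accesses.values())  # all cores and contexts (Eq. 3.2)
    return 100 * num / den

# Toy example with made-up counts for two contexts on one or two cores:
unique = {(0, 0, "8KB"): 30, (1, 0, "8KB"): 20, (0, 7, "8KB"): 40, (0, 0, "4MB"): 10}
print(pct_unique_8kb(unique))  # context 0 holds 50 of 100 entries -> 50.0
```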
Table 3.8 shows the significance of some non-zero contexts for the PARSEC and Cloud-Suite
workloads; for the Cloud-Suite workloads, contexts that contribute less than 0.1% in terms of
accesses or unique translations are classified as negligible. For canneal and ferret, a shared
non-zero context is responsible for the vast majority of TLB accesses and unique translations.
This behaviour is not observed in the Cloud-Suite workloads, where the average contribution to
TLB accesses and translations is below 7%. In cloud9, two contexts (476 and 580) out of the 16
mentioned in this table are responsible for ∼16% of unique 8KB translations even though they
Workload Class  Workload        % CMP TLB Accesses  % Unique 8KB Translations
Commercial      apache          48.6                46.9
                TPC-C2          20.2                19.3
                TPC-C1          30.1                77.4
PARSEC          canneal         13.6                0.3
                ferret          23.6                10.1
                x264            75.4                72.8
Cloud-Suite     cassandra       22.4                73.5
                classification  15.7                20.9
                cloud9          4.3                 0.4
                nutch           20.5                88.3
                streaming       44.1                12.7
Table 3.7: Context 0: % TLB accesses and cumulative per-core unique translation entries across the entire CMP. See Equations 3.1 and 3.2.
correspond to 0.8 and 16.2% of TLB accesses respectively. This observation indicates that a
large unique translation reach does not always go hand in hand with a large TLB access count.
Workload Class  Workload  Context  % CMP TLB Accesses  % Unique 8KB Translations
PARSEC          canneal   1400     86.4                99.7
                ferret    1400     76.4                89.7
                x264      1401     24.6                27

Workload Class  Workload        Context Count            % Range of CMP TLB        % Range of Unique 8KB
                                (negligible contexts     Accesses; Avg. %          Translations; Avg. %
                                on second row)
Cloud-Suite     cassandra       16 contexts              4-6 (each); avg. 4.9      0.5-1.5 (each); avg. 1.1
                                6 contexts negligible
                classification  74 contexts              0.1-4.9 (each); avg. 1.2  0.1-2.3 (each); avg. 0.5
                                58 contexts negligible
                cloud9          16 contexts              0.5-7 (each); avg. 6      2-16.5 (each); avg. 6.2
                                2 contexts negligible
                nutch           16 contexts              4.5-5.5 (each); avg. 5    0.4-0.7 (each); avg. 0.6
                                5 contexts negligible
                streaming       16 contexts              0.3-4 (each); avg. 3.5    1.9-5.8 (each); avg. 5.5
                                18 contexts negligible
Table 3.8: Non-zero contexts: % TLB accesses and cumulative per-core unique translation entries across the entire CMP for PARSEC and Cloud workloads.
3.4.4 Concluding Remarks
To summarize, Section 3.4 demonstrated the variations that exist across different sets of
workloads in terms of unique context counts, in how these contexts are shared across multiple
cores, in the active lifetime of each context, and in these contexts' contributions to TLB
accesses and their data reach.
The context count and sharing degree varied greatly across workload classes: the simulated
PARSEC and Cloud-Suite workloads mostly had only a few contexts, the former with a high
sharing degree and the latter with mostly Private contexts, whereas the commercial workloads,
with the exception of TPC-C1, had two orders of magnitude more contexts, most with a high
sharing degree. The vast majority of contexts had lengthy lifetimes that covered more than half
of the execution sample, with the exception of classification, which had many short-lived
contexts. Significant variation was also observed in the frequency and data reach of each
context. Context 0 was the most prominent context for most workloads, with only a few
exceptions, mostly involving other Shared contexts, e.g., in canneal. However, this section's
measurements indicate that these other contexts should not be ignored or filtered; their small
individual contributions are significant when accumulated.
3.5 Translation Mappings Lifetime
This section explores the frequency with which translation mappings are modified, either via
demap or translation modification requests, as Section 2.2.3.3 discussed. MMU TLBs are
designed with the fundamental expectation that translation mappings do not change often.
Otherwise, if mappings were single-use only (an extreme example), TLBs would experience no
reuse and would thus fail to hide part of the lengthy page-walk latency. The expectation that
translation mappings are persistent for appropriately lengthy intervals is also essential for any
address translation optimization techniques that rely on remembering past mappings. The
FMN TLB proposed in Chapter 5 exploits this observation.
Table 3.9 presents the absolute number of demappings and remappings that took place
during workload execution. Demappings are further classified into demap-page or demap-
context as explained in Section 2.2.3.3. The results show that translation mappings persist
over time since translation modifications, of any kind, are quite rare. Demap-context operations
appear only for a single workload (TPC-C2), while remap operations are more pronounced in the
three commercial workloads modeled, as are the demap-page operations. While the simulated
traces contain 2.6 to 4.8 billion memory accesses, translation-mapping modifications are 3 to 6
orders of magnitude fewer.
The remainder of this section provides additional analysis and observations for the three
types of translation mapping modifications: the demap-context, the demap-page, and the TLB-
entry modification scenarios. For the workloads where translation invalidations/modifications
were observed, this analysis examines how frequent these operations are, how many cores and
Workload        Number of Demappings        Number of
                Demap-Context  Demap-Page   Remappings
apache          0              567,059      127,385
TPC-C2          59             611,685      585,201
TPC-C1          0              896,844      176,713
canneal         0              90           9,296
ferret          0              60,122       5,243
x264            0              11,315       5,694
cassandra       0              513          6,747
classification  0              168          1,025
cloud9          0              1,212        697
nutch           0              66           1,537
streaming       0              11,914       15,379
Table 3.9: Translation Demap and Remap Operations (cumulative in the entire CMP).
instructions were involved, and how many unique pages or translation-entries were affected.
This analysis can be helpful for predicting these operations or for building a caching
structure such as the one in Chapter 5.
3.5.1 Demap-Context Analysis
A demap-context operation invalidates all translations involving the context in question, a
potentially faster way to tear down mappings involving large multi-page memory buffers than
individually invalidating each virtual page of the buffer with separate demap-page requests.
The only workload that experienced demap-context operations was TPC-C2. Four of its contexts,
which experienced primarily 8KB-page accesses, were involved in multiple demap-context
operations; two contexts were shared across all 16 cores, whereas the other two were shared by
13 and 14 cores respectively, hence the 59 demap-context operations reported in Table 3.9.
As anticipated, and as our measurements confirm, once a demap-context operation takes
place for one core, all cores that had at some point accessed translations from this context should
also execute a demap-context operation. Two PCs were involved in these demap operations:
one PC (0x10156830) initiated the first demap operation for each context, while another PC
(0x101568d8) triggered all the subsequent demap operations (i.e., the demaps that took place in
the other cores). Once all these demap operations for a given context have completed across all
sharer cores, the corresponding process may issue new memory accesses (e.g., accessing a new
memory buffer). All four demapped contexts were involved in multiple subsequent accesses,
shortly after all the relevant demap-context operations for that context had completed.
The rarity of demap-context operations is welcome, as invalidating all translation entries
associated with a given context can be a costly operation. The impact of such operations is
not only gauged by their frequency, but also by each context’s data reach. The more unique
translations a demapped context has accessed, the greater the impact of the demapping. In
TPC-C2, each of these four contexts has accessed on average 170 to 205 unique translations per
core, the largest data reach after context 0, which dominates TPC-C2 accesses. The time elapsed
between the last access to a context and its demapping determines whether these translations
will persist in the TLBs and other paging structures. In the worst case, this interval was
∼890K functional-simulation cycles. Notably, all four involved contexts were short-lived, with
a maximum lifetime of 6.9% of execution time, significantly below the average TPC-C2 context
lifetime of 84% of execution time.
3.5.2 Demap-Page Analysis
A demap-page operation, discussed in Section 2.2.3.3, invalidates a single (context, virtual page)
and will thus affect at most one translation entry at a time. As Table 3.9 showed, all workloads
experienced numerous demap-page operations. Table 3.10 presents the maximum number of
unique contexts, PCs (demap-page instructions), VPNs, and (VPN, context) tuples involved
in demap-page requests per core across all our workloads. The values for the entire CMP are
shown in parentheses when different.
Overall, the number of demap-page operations varied across cores; in four workloads (canneal,
cassandra, cloud9, and x264) demap-page operations took place in half or fewer of the CMP
cores, whereas in the remaining workloads such operations took place in all 16 cores. Multiple
contexts were involved in demap-page operations, with context 0 usually being one of them. In
all workloads only a handful of PCs triggered these demap-page requests, an anticipated
behaviour given that these operations are handled by system code. It is common for the same
(VPN, context) to receive a demap-page request multiple times during the workload's execution.
Workload Class  Workload        Max. Context  Max. PC  Max. VPN          Max. (VPN, context)
                                Count         Count    Count             Count
Commercial      apache          90 (132)      7        17,286 (18,926)   17,752 (27,297)
                TPC-C2          26 (58)       8        559 (765)         1,336 (3,115)
                TPC-C1          3 (18)        6        14,868 (14,930)   14,868 (14,974)
PARSEC          canneal         2             3        43 (50)           43 (50)
                ferret          2             6        358 (1,796)       358 (1,796)
                x264            2             3        2,417 (5,673)     2,417 (5,673)
Cloud-Suite     cassandra       2             2        2                 2
                classification  3 (31)        4 (5)    12 (15)           14 (73)
                cloud9          2 (5)         2        355 (611)         355 (611)
                nutch           1 (16)        4 (5)    2                 2 (32)
                streaming       2 (17)        5        698 (1,858)       698 (9,643)
Table 3.10: Unique characteristics of demap-page requests (per core). Values in parentheses are for the entire CMP wherever different.
The same VPNs are demapped multiple times during workload execution with the PPNs
these pages map to either changing or remaining the same over time. In cassandra, a single
8KB virtual page was repeatedly demapped (512 times total), and the physical page number it
mapped to was incremented by one on each remapping. We have also observed that in some cases
the same VPNs are demapped from different contexts over time. For example, in TPC-C2 one
context sees demaps to a group of VPNs that are later also demapped from another context, and
so on. These VPNs mapped to the same PPNs for this small group of contexts. The
vast majority of demap-page requests were to 8KB pages only. The only exceptions were the
commercial workloads where most demap-page requests affected 512KB pages. TPC-C1 was
the only workload with demap-page requests to 4MB pages.
3.5.3 TLB-Entry Modification Analysis
This section presents observations about TLB-entry modification operations (discussed in
Section 2.2.3.3) to help better understand the characteristics of such operations and the
translations they involve. All translations installed in the TLB via such operations had both
their privileged and locked bits set. All 16 CMP cores executed modification operations, with
the exception of cloud9 and x264, where a few cores did not. For the commercial workloads, the
translations installed by these modifications all belonged to 512KB pages. For the PARSEC
workloads they belonged to 8KB pages, with the exception of canneal, where the vast majority
belonged to 4MB pages. Lastly, for the Cloud-Suite workloads, 4MB pages prevailed for
cassandra and streaming, 8KB pages for cloud9, while nutch and classification saw equal
participation of 8KB and 4MB pages. All translation entries installed via a TLB-entry
modification operation involved context 0.
Only a few PCs trigger modification operations: two for the PARSEC workloads, two for the
commercial workloads, and three for the Cloud-Suite workloads. Having shared PCs within each
of the three workload classes is anticipated. The OS code that initiates these modification
operations can vary across kernel versions, and these three workload classes were set up on
different systems (e.g., the commercial workloads were running on Solaris 8, while the
Cloud-Suite workloads ran on Solaris 10). For the commercial workloads, both PCs of these
modification operations immediately succeed PCs that trigger demap-page operations.
Lastly, we have observed many cases where the entry being modified does not share a VPN
with the entry being allocated. Only in classification and cloud9 were there a few cases where
these entries shared the same VPN; in both workloads this was a single VPN whose mappings
alternated between two different PPNs (per workload) across all cores.
It is an open question whether the translation remapping trends will remain the same in
emerging systems with heterogeneous memory architectures, or if more remappings could help
take better advantage of the different memory devices, and their characteristics, in such systems.
3.5.4 Concluding Remarks
To conclude, the analysis presented in Section 3.5 has shown that MMU operations such as
demap-page or TLB-entry modifications are rare and are triggered by a few specific instructions
(PCs) that belong to system rather than application code. For each of the three workload
classes used in this work, the number of these instructions was always less than a dozen. This
observation suggests it is possible to predict these special PCs, and thus anticipate such
operations; a simple table-based predictor would suffice, but such an approach is beyond the
scope of this work. The rarity of the aforementioned operations encourages design choices that
rely on remembering past translation mappings, thus optimizing for the common operating
scenario without harming correctness.
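Such a table-based predictor could be as simple as a small table of previously seen trigger PCs; the sketch below is hypothetical and was not implemented or evaluated in this work:

```python
# Hypothetical predictor: remember the PCs that triggered past demap or
# TLB-entry modification operations; fetching a remembered PC predicts
# that such an operation is about to occur.
class DemapPCPredictor:
    def __init__(self, capacity=16):    # under a dozen PCs sufficed per class
        self.capacity = capacity
        self.pcs = []                   # small FIFO table of trigger PCs

    def train(self, pc):
        if pc not in self.pcs:
            if len(self.pcs) == self.capacity:
                self.pcs.pop(0)         # evict the oldest PC
            self.pcs.append(pc)

    def predict(self, pc):
        return pc in self.pcs

pred = DemapPCPredictor()
pred.train(0x10156830)                  # demap-context PC observed in TPC-C2
print(pred.predict(0x10156830), pred.predict(0x12345678))  # True False
```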
3.6 TLB Capacity Sensitivity Study
The previous sections presented an analysis of workload characteristics that, although indepen-
dent of the organization of TLBs and other address translation caching structures, can influence
the address translation cost as expressed by metrics like TLB hit-rate, TLB miss-handling la-
tency, etc. As this chapter's introduction discussed, it would be possible to optimize a specific
architectural design to perform well - for one or more of these metrics - for an individual
workload; such an approach would be applicable to reconfigurable architectures but is not
appropriate for a general-purpose system. The goal here is to design for the common case and,
where possible, provide dynamic mechanisms to alleviate the negative impact on the less common cases.
One could naturally anticipate that different TLB organizations would better cater to dif-
ferent workloads. For example, the workloads measured to have a large superpage footprint
would benefit from TLB organizations that are not biased in their space allocation against such
translations. To that end, this section examines the effectiveness of different TLB organizations
as the TLB capacity scales. Effectiveness is measured via TLB MPMI (Misses Per Million
Instructions1) and TLB hit-rate. First, Sections 3.6.1 to 3.6.3 focus on L1-TLBs by modeling
the following three configurations, all representative of commercial L1-TLB designs: (a) a
configuration with split L1-TLB structures, one per supported page-size, (b) a fully-associative
TLB, and (c) a two-table TLB configuration that uses a set-associative TLB for the smallest
and most prominent page-size and a fully-associative TLB for all other page sizes. Section 3.6.4
then evaluates the impact of an L2-TLB using the same MPMI and hit-rate metrics.
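For concreteness, the MPMI metric used throughout this section relates to the more familiar MPKI used for caches as follows; a minimal sketch with illustrative function names.

```python
# Minimal sketch relating MPMI, the metric used in this section, to the
# MPKI metric conventionally used for caches.
def mpmi(misses, instructions):
    """TLB Misses Per Million Instructions."""
    return misses * 1_000_000 / instructions

def mpki(misses, instructions):
    """Misses Per Kilo (thousand) Instructions, the usual cache metric."""
    return misses * 1_000 / instructions

# A miss count that yields MPMI 2324 corresponds to only ~2.3 MPKI, which
# is why the coarser-grained TLB is reported with the per-million metric:
assert mpmi(2324, 1_000_000) == 2324.0
assert abs(mpki(2324, 1_000_000) - 2.324) < 1e-9
```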
All results were collected from functional simulation in a 16-core system. All TLB designs
listed above are private (i.e., per core) and use an LRU replacement policy. The locked bit values
of all translation entries were ignored to allow us to simulate any associativity. As Section 2.2.3
discussed earlier, translations for locked pages must not be evicted from the TLB. If all entries
in a set are locked, it becomes impossible to cache additional entries in that set, thus
1 MPKI is usually used for caches, but since TLBs track memory accesses at a much coarser granularity, and thus have fewer misses, MPMI is used instead.
hurting TLB hit-rate. On a given configuration, the OS can control how many locked entries
are allowed per set, potentially denying lock requests where appropriate. Since we use traces,
these decisions are embedded in the trace and cannot be altered during simulation.
3.6.1 Split L1-TLBs; One per Page-Size
Figure 3.4 depicts the variation of L1 TLB MPMI as the TLB capacity increases. The x-axis
depicts the number of entries for a Set-Associative (SA) TLB that supports the smallest page-
size (8KB). The capacity of each of the remaining three split TLBs (e.g., the TLBs for 4MB
pages, etc.) is always half of that, to closely reflect Haswell-based TLB configurations (Table 2.1,
[38]). All TLBs have an associativity of four for the same reason. Each MPMI bar is further
broken down into the MPMI contributions of each supported page-size.
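The capacity-scaling rule just described can be made concrete with a small helper; this sketch (names illustrative) computes per-structure entry and set counts under the stated Haswell-like rule.

```python
# Hedged sketch of the split-TLB sizing rule modeled here: one 4-way SA
# structure per page size, with each superpage TLB holding half the
# entries of the 8KB TLB.
PAGE_SIZES = [8 << 10, 64 << 10, 512 << 10, 4 << 20]   # 8KB .. 4MB

def split_tlb_config(n_8kb_entries, assoc=4):
    """Per-structure entry and set counts for the split-TLB organization."""
    config = {}
    for size in PAGE_SIZES:
        entries = n_8kb_entries if size == PAGE_SIZES[0] else n_8kb_entries // 2
        config[size] = {"entries": entries, "sets": entries // assoc}
    return config

cfg = split_tlb_config(64)             # today's norm for the 8KB structure
assert cfg[8 << 10]["entries"] == 64
assert cfg[4 << 20]["entries"] == 32   # half the 8KB TLB's capacity
assert cfg[8 << 10]["sets"] == 16      # 64 entries / 4 ways
```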
Figure 3.4: L1 TLB MPMI and Hit-Rate over different TLB sizes. The x-axis lists the number of TLB entries for the split TLB with 8KB page translations; the capacity of each other split TLB structure is half that in size. Canneal saturates with this y-axis scale; see detail in Figure 3.5.
MPMI is expected to decrease, hence improve, as capacity increases. The measurements in
Figure 3.4 follow this trend but diminishing returns are observed as capacity increases beyond
512 entries. Today’s norm in split L1-TLB designs is a 64-entry TLB for the smallest supported
page size (the second column in this graph), indicating significant potential for improvement if
increasing the L1-TLB capacity were not an issue. However, doing so would be detrimental to
performance given the strict access-time constraints of L1 TLBs that need to be accessed in
parallel with virtually-indexed and physically-tagged L1 caches.
Figure 3.5: Canneal MPMI detail with larger y-axis scale.
For the commercial workloads, the Haswell-like baseline has an
average MPMI of 2324, and only by quadrupling the available TLB
capacity does it drop to roughly half that value, i.e., 1111. Canneal
has the most problematic behaviour of all the simulated workloads.
As the detail in Figure 3.5 shows, even a 2K-entry SA TLB for 8KB
pages yields an MPMI of 9032, just a 59.3% MPMI decrease over the
smallest 32-entry TLB.
Beyond the slope of MPMI reduction, the MPMI contribution of
each page size is also important as anticipated by the unique transla-
tion observations from Section 3.3. Cassandra and classification from
Cloud-suite are the workloads that most benefit from the presence of
larger superpage TLBs. For almost all other workloads, with the ex-
ceptions of TPC-C1 and to a lesser extent TPC-C2, a larger TLB for
4MB pages has little to no benefit. TPC-C2, the workload with the largest 4MB footprint, con-
tinues to see an MPMI of 100 even with the largest 1K-entry split L1-TLB for 4MB pages, while
other workloads like apache or the PARSEC workloads see almost zero 4MB MPMI. Our work
on Prediction-Based Superpage-Friendly TLB Designs [55], presented in Chapter 4, addresses
this inconsistency and the resulting wasted energy due to the many unnecessary split-TLB
lookups.
Figure 3.4 also includes the TLB hit-rate as a separate line (secondary y-axis). The TLB
hit-rate for private TLBs is the ratio of all CMP L1-TLB hits - across all four split structures
and cores - over the total number of CMP L1-TLB accesses. The split L1-TLBs are probed in
parallel but these parallel probes are counted as one TLB access. The measured TLB hit-rates
are over 97% for all workloads except for canneal and they increase alongside the TLB capacity.
Such high L1 TLB hit-rates, even higher than L1-D cache hit-rates which are also commonly
above 90%, were to be expected due to the coarse-grain tracking granularity of TLB-entries.
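The CMP-wide hit-rate defined above can be sketched as follows; an illustrative helper, not simulator code.

```python
# Illustrative helper for the CMP-wide L1-TLB hit-rate: all hits across the
# four split structures and all cores, divided by the total accesses, where
# the four parallel probes of the split TLBs count as one access.
def cmp_l1_hit_rate(hits_per_core, accesses_per_core):
    """hits_per_core: per core, a list of hit counts (one per split TLB)."""
    total_hits = sum(sum(core_hits) for core_hits in hits_per_core)
    return total_hits / sum(accesses_per_core)

# Two hypothetical cores, four split structures each:
hits = [[900, 50, 10, 5], [880, 60, 20, 10]]
accesses = [1000, 1000]
assert abs(cmp_l1_hit_rate(hits, accesses) - 0.9675) < 1e-12
```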
All remaining graphs in Section 3.6 will plot both MPMI and hit-rate. These two metrics
complement each other. Hit-rate is the most straightforward metric to grasp and characterizes
TLB efficiency. Especially for L2-TLBs (evaluated in Section 3.6.4), the hit-rate is a
first indication that these structures do not reach their full potential, a conclusion not as eas-
ily reached via the MPMI measurements. But looking at the hit-rate alone does not suffice.
Hit-rate provides no notion of time, whereas the MPMI metric implicitly incorporates time by
showing the frequency of misses within a fixed instruction sample. Therefore, even though the
high L1-TLB hit-rates might suggest that L1-TLBs have no further room for optimization, the
MPMI measurements indicate that TLB misses are still frequent and can thus impact both
performance and energy.
3.6.2 Fully-Associative L1 TLB
As mentioned earlier in this thesis, an alternative to split-TLBs is a Fully-Associative (FA)
TLB design, similar to the AMD-12h family, which can naturally support any page-size without
requiring multiple lookups. Figure 3.6 depicts how MPMI and hit-rate vary as the per core
TLB capacity increases. All modeled TLB configurations have a power-of-two entry count
for consistency with the previous section, even though this is no longer a requirement; the
AMD-12h L1-TLB has 48 entries.
Figure 3.6: L1 TLB MPMI and Hit-Rate over different FA TLB sizes. All TLBs model full-LRU as the replacement policy. Figure 3.7 shows canneal in detail as it saturated with this y-axis scale.
Figure 3.7: Canneal MPMI detail with larger y-axis scale.
A steeper slope (reduction) of MPMI is observed as FA capac-
ity increases when compared to the split-TLB configurations of Fig-
ure 3.4. When comparing with the results of that figure, configura-
tions with the same x-axis label correspond to different TLB capac-
ities. For example, a 32-entry label corresponds to a 32-entry FA
TLB and to a total split-TLB capacity of 80-entries.2 Nevertheless,
while smaller, the corresponding FA TLB is on average 14.7% better
in terms of MPMI across all workloads and configurations. Having a
fully-associative structure allows translation entries of different page-
sizes to coexist and can also reduce conflicts due to translations that
share the same VPN but belong to different processes (contexts). The
full-LRU replacement policy is also beneficial.
2 80 entries = 32 entries for 8KB pages + 16 entries for each of 64KB, 512KB and 4MB pages.
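A fully-associative TLB with the true full-LRU replacement policy modeled in this section can be sketched with an ordered map; this is an illustrative model, not the simulator's implementation.

```python
from collections import OrderedDict

# Illustrative model of a fully-associative TLB with true full-LRU
# replacement. Tagging entries by (context, VPN) lets same-VPN
# translations from different processes coexist, one of the
# conflict-reduction benefits noted above.
class FullyAssociativeTLB:
    def __init__(self, entries):
        self.entries = entries
        self.lru = OrderedDict()          # (context, vpn) -> ppn, LRU order

    def lookup(self, context, vpn):
        key = (context, vpn)
        if key in self.lru:
            self.lru.move_to_end(key)     # promote to MRU on a hit
            return self.lru[key]
        return None                       # TLB miss

    def install(self, context, vpn, ppn):
        if len(self.lru) >= self.entries:
            self.lru.popitem(last=False)  # evict the true-LRU entry
        self.lru[(context, vpn)] = ppn

tlb = FullyAssociativeTLB(entries=2)
tlb.install(1, 0x10, 0xA)
tlb.install(1, 0x20, 0xB)
assert tlb.lookup(1, 0x10) == 0xA         # 0x10 becomes MRU
tlb.install(1, 0x30, 0xC)                 # evicts 0x20, now the LRU entry
assert tlb.lookup(1, 0x20) is None
```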
It would be amiss though not to briefly touch upon some shortcomings of a fully-associative
TLB design. FA structures are generally power-hungry and slow to access; energy and access
time measurements will be presented in Chapter 4. Also, as capacity scales, a full-LRU replace-
ment policy is an unrealistic design choice and thus the MPMI is going to be different (likely
higher) with the most commonly employed pseudo-LRU or random replacement policies.
Even when comparing the total MPMI for each vertical column in Figure 3.6 with its corre-
sponding column in Figure 3.4, there are some instances where the split L1-TLB configurations
perform better in terms of MPMI (even though on average they are worse). For example, in
cassandra, the FA configurations with entries in the 64 to 1K range (inclusive) are from 4% to
20% worse in terms of MPMI compared to their split-TLB counterparts. Cassandra accesses
slightly over 1K superpages and its MPMI really benefits from having a separate superpage
TLB structure. The FA TLB is better only when it becomes sufficiently large to
host some working set of these pages, or when the split superpage TLB is too small (16 entries)
to make a difference for that footprint. For TPC-C2, which has a significant superpage foot-
print, the FA TLB is better when it has 128 or more entries. The difference is negligible for
smaller TLB sizes, mostly because the superpage footprint is too large for the split superpage
TLBs to make a significant difference; even increasing the split superpage TLBs from 16 to 1K
entries only reduces the 4MB MPMI by 46%.
The FA hit-rates depicted in Figure 3.6 continue to be high, most of the time slightly higher
than for split-L1 TLBs. Canneal continues to have the lowest hit-rate of all workloads;
even a fully-associative TLB cannot accommodate the pseudo-random access pattern [22] of
this simulated-annealing workload, which also has the largest footprint of all the workloads.
3.6.3 Set-Associative L1-TLB for Small Pages and Fully-Associative L1-TLB
for Superpages
The last L1 TLB design simulated and presented in this chapter is based on the TLBs from
the UltraSparc-III processors and involves a set-associative design for the smallest and most
prominent page-size and a fully-associative design for translations from all the other supported
page-sizes. Figure 3.8 depicts the variation in MPMI and hit-rate when we vary the number
of entries in the 2-way set-associative TLB that only hosts translations for 8KB pages. The
fully-associative TLB that hosts translations for superpages remains fixed at 16 entries.
The MPMI shown in Figure 3.8 is consistently worse than its split-TLB counterpart from
Figure 3.4. Due to the difference in scale, Canneal is shown separately in Figure 3.9. When
focusing on the 8KB MPMI stacked bars, since these are the only ones affected by the SA
capacity increase, one can observe that the smaller associativity of two in the SA TLB hurts
MPMI.
Figure 3.8: L1 TLB MPMI and Hit-Rate over different TLB sizes for the 2-way SA TLB that only hosts translations for 8KB pages. A fixed 16-entry FA TLB is modeled for all superpages.
Figure 3.9: Canneal MPMI detail with larger y-axis scale.
The only case where the UltraSparc-based configuration
is better in terms of overall MPMI is for cassandra where
having a fully-associative 16-entry TLB is considerably better
than the 16-entry 4-way SA split-TLB. The hit-rate follows
similar trends. Classification is a noteworthy exception where
increasing the capacity of the SA TLB has little to no impact
on MPMI and hit-rate, since for this workload it is superpage
capacity that matters.
Figure 3.10 depicts how MPMI changes when increasing
the capacity of the FA superpage TLB. For brevity, only work-
loads whose MPMI changes when the FA-size increases beyond 16-entries are depicted. Simply
doubling the FA capacity from 16 to 32 entries greatly reduces MPMI for classification (80%
MPMI reduction). Contrasting this behaviour with the useless SA capacity increase for this
workload illustrates why blindly increasing the capacity of a TLB that statically caters to a
specific page-size can be a poor and wasteful design decision for some workloads. Chapter 4
will present our work on “Prediction-Based Superpage-Friendly TLB-Designs” that addresses
such concerns.
Figure 3.10: L1 TLB MPMI over different TLB sizes for the FA TLB that hosts translations for all superpages. A fixed 2-way SA 512-entry TLB is modeled for 8KB pages.
3.6.4 L2-TLB
This section complements the L1-TLB capacity sensitivity study by exploring the effect of
L2-TLB capacity on MPMI and hit-rate. L2 TLBs are, like any other hierarchical structure,
larger and slower to access than their L1 counterparts. Commercial L2-TLBs are usually set-
associative, with 4-way and 8-way SA TLBs being the most common. Traditionally, L2-TLBs
support either a single page-size or multiple page-sizes, the latter presumably via multiple
sequential lookups or by splitting a single superpage translation into multiple translation
entries of the supported page size.
To prune the vast design space, this section uses a single L1-TLB configuration while mod-
ifying L2-TLB properties. The state-of-the-art L1-TLB configuration used is a 64-entry 4-way
SA split TLB for 8KB pages along with three 32-entry 4-way SA TLBs, one for each remaining
page size, and corresponds to the second column in Figure 3.4. An 8-way SA L2-TLB with a
cache-like indexing scheme is modeled that only hosts translations for 8KB pages. This asso-
ciativity matches the one in Haswell’s L2-TLB. Upon an L1-TLB miss, the missing translation
is installed in both TLB levels. Figure 3.11 depicts MPMI and hit-rate as the capacity of each
private L2-TLB increases from 512 entries to 64K entries.
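The fill policy just described - probe the L1, then the L2, and install the missing translation in both levels - can be sketched as follows. This is an illustrative model with an unbounded stand-in TLB; capacity and replacement are deliberately elided.

```python
# Illustrative sketch of the two-level fill policy: probe L1, then L2,
# walk the page table on a full miss, and install in both TLB levels.
class TinyTLB:
    """Unbounded stand-in for a TLB level (capacity modeling elided)."""
    def __init__(self):
        self.map = {}

    def lookup(self, context, vpn):
        return self.map.get((context, vpn))

    def install(self, context, vpn, ppn):
        self.map[(context, vpn)] = ppn

def translate(l1, l2, context, vpn, page_table_walk):
    ppn = l1.lookup(context, vpn)
    if ppn is not None:
        return ppn                              # L1 hit
    ppn = l2.lookup(context, vpn)               # counted toward L2 hit-rate
    if ppn is None:
        ppn = page_table_walk(context, vpn)     # full miss-handling path
        l2.install(context, vpn, ppn)
    l1.install(context, vpn, ppn)               # fill both levels
    return ppn

l1, l2 = TinyTLB(), TinyTLB()
walks = []
def walk(context, vpn):
    walks.append(vpn)
    return vpn + 0x100                          # hypothetical mapping

assert translate(l1, l2, 7, 0x5, walk) == 0x105   # cold miss: one walk
assert translate(l1, l2, 7, 0x5, walk) == 0x105   # now an L1 hit
assert walks == [0x5]                             # the walk ran only once
```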
Even adding the smallest 512-entry L2-TLB reduces the MPMI over a baseline without an
L2-TLB by 56.5% on average (amean); the L1-MPMI for each workload is shown in parenthe-
ses for reference in Figure 3.11. For workloads such as classification that are dominated by
superpage misses, increasing the L2-TLB beyond some size (here 1K entries) has no impact,
as also shown by the hit-rate, which remains unchanged at 6.8%.
[Figure 3.11 plot data omitted; per-workload L1-TLB MPMI baselines from the x-axis labels: apache (2534), TPC-C2 (2592), TPC-C1 (1846), canneal (14758), ferret (340), x264 (35), cassandra (546), classification (842), cloud9 (2101), nutch (751), streaming (4541).]
Figure 3.11: L2 TLB MPMI and Hit-Rate over different TLB sizes. The x-axis lists the number of L2 TLB entries for an 8-way SA L2-TLB that only supports 8KB pages. Canneal saturates with this y-axis scale; see detail in Figure 3.12.
Figure 3.12: Canneal L2-TLB MPMI detail with larger y-axis scale.
Overall, the L2-TLB hit-rate (secondary y-axis) is sig-
nificantly lower than the one measured for the L1-TLBs, an
anticipated behaviour given that most spatial and temporal
locality has been filtered by the first TLB level. As the ca-
pacity moves beyond the maximum number of unique trans-
lations for each workload (reported in Section 3.3 in the be-
ginning of this chapter), there are diminishing returns; the
only benefits are from reducing conflict misses.
To better quantify the usefulness of L2-TLBs, Figure 3.13
classifies - at the end of each trace execution - all L2 TLB
entries as either invalid or valid and presents these numbers
as a percentage of the overall L2 TLB capacity (y-axis). The
x-axis depicts the number of L2 TLB entries (lower label); for each TLB size, there are 16
adjacent vertical columns, one per core, as indicated by the 0 and 15 upper labels for the
512-entry L2-TLB configuration. The remaining upper x-axis labels are omitted for brevity.
The graphs that follow illustrate not only that for many workloads a significant percentage
of the L2-TLB capacity is wasted as TLB capacity increases, but also highlight the occasional
differences that might exist across cores. For example, x264 and cloud9 are two workloads
where L2-TLBs of different cores see drastically different occupancies. These workloads had a
high standard deviation for the unique per core translation as listed in Table 3.4 earlier in this
chapter.
Figure 3.13: Per-Core L2-TLB Capacity classified percentage-wise into valid and invalid TLB entries for different L2-TLB sizes.
[Per-core occupancy bar charts: (a) Commercial Workloads (apache, TPC-C2, TPC-C1), (b) PARSEC Workloads (canneal, ferret, x264), (c) Cloud-Suite Workloads (cassandra, classification, cloud9, nutch, streaming); for each L2-TLB size from 512 to 64K entries, 16 adjacent columns correspond to cores 0 through 15.]
3.7 Compressibility and Compression
This section explores the compressibility of the translation information held in the TLB and
other paging structures. The motivation for this analysis was the idea of a larger L2-TLB
that could use fewer hardware resources if some hardware compression was employed. The
less involved approach, in terms of hardware complexity, is to compress the information within
each conventional translation entry. This approach would be particularly useful if one were to
support a cacheable TLB structure (see Chapter 5) as more translations could be packed in
the same cache block. The parts of the translation entry expected to be more amenable to
compression are the virtual page number (part of the TLB tag) and the physical page number
(part of the TLB data block). One can anticipate that the upper parts of the VPN and PPN
could have higher degrees of compression as a smaller part of the address space is expected to
be accessed closer in time. The sensitivity study presented in this section demonstrates both
the compressibility potential and some optimization techniques that could harvest it.
To identify viable compression techniques, the variation of unique values in the VPN and
PPN fields was first explored for a set of workloads. Unique values were measured at a per
byte or a per byte-set granularity as indicated in Figure 3.14. The two least-significant bytes of
the VPN and PPN fields are ignored as all but two bits of the second least-significant byte fall
under page-offset bits. Bytes and byte-sets are numbered starting from the most-significant byte (MSB).
Byte-set i contains bytes MSB 0 through MSB i, i.e., the i+1 most significant bytes of the relevant field.
Bits [63:56] = MSB 0, [55:48] = MSB 1, [47:40] = MSB 2, [39:32] = MSB 3, [31:24] = MSB 4, [23:16] = MSB 5; bits [15:0] are excluded. Byte-Set 0 covers MSB 0, Byte-Set 1 covers MSB 0-1, and so on up to Byte-Set 5, which covers MSB 0-5.
Figure 3.14: Unique Bytes and Byte-Sets Nomenclature
The working assumption is that if only a limited number of unique bytes or byte-sets exist,
then one could trade storing accurate byte information for an index to a table that stores these
unique bytes or byte-sets. Figure 3.15 plots the unique number of bytes and byte-sets for both
virtual and physical addresses across our workloads. The smaller this number is, the more
potential compression has. For each unique byte the maximum possible number of
values is 256, while for byte-set i the maximum number of values is 2^(8·(i+1)). The ALL x-axis label
in the figure reflects the unique values measured across all workloads, a number relevant
for systems running multi-programmed workloads.
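The measurement underlying Figure 3.15 can be sketched as follows; this illustrative helper counts unique byte and byte-set values over a list of 64-bit addresses, with byte 0 being the MSB ([63:56]).

```python
# Illustrative sketch of the unique-value measurement: count distinct byte
# and byte-set values over the upper bytes of a set of 64-bit addresses.
# Byte-set i spans bytes MSB 0 through MSB i, per this section's nomenclature.
def unique_bytes_and_sets(addresses, num_upper_bytes=4):
    bytes_seen = [set() for _ in range(num_upper_bytes)]
    sets_seen = [set() for _ in range(num_upper_bytes)]
    for addr in addresses:
        for i in range(num_upper_bytes):
            shift = 56 - 8 * i
            bytes_seen[i].add((addr >> shift) & 0xFF)
            sets_seen[i].add(addr >> shift)   # all bytes from the MSB down
    return [len(s) for s in bytes_seen], [len(s) for s in sets_seen]

# Heap-side (MSB 0x00) and stack-side (MSB 0xFF) addresses, as in pmap:
addrs = [0x0000000100110000, 0x0000000100120000, 0xFFFFFFFF7FFEA000]
per_byte, per_set = unique_bytes_and_sets(addrs)
assert per_byte[0] == 2    # MSB 0 takes the values 0x00 and 0xFF
assert per_set[3] == 2     # two distinct upper-32-bit patterns
```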
The results indicate that the first four MSBs see less than ten unique values each, compared
to the maximum of 256, making it possible to reduce the TLB storage for these bytes to at
least half. The unique values are significantly fewer for the upper bytes; this is a side-effect of
Figure 3.15: Number of unique bytes and byte-sets in the virtual and physical addresses.
the maximum supported virtual and physical addresses. The Sparc-V9 architecture supports a
64-bit virtual address space in all cases. Because of the address space layout of each process,
the first most significant byte, MSB 0, was observed to have up to four unique values across all
workloads for data memory accesses, as illustrated by the VPN Bytes series in Figure 3.15 and
the ALL x-axis label. Two of the unique MSB 0 values, 0x00 and 0xff were due to the location
of the heap and the stack in the address space respectively. TPC-C2 was the only workload
with one observed unique MSB-0, because this was the only 32-bit application running on the
64-bit capable operating system. A snippet of the pmap output for one of the processes (PID
937) running for TPC-C1 is shown below:
$pmap -x 937
Address Kbytes Resident Shared Private Permissions Mapped File
[..]
000000010011A000 1384 1384 680 704 read/write/exec [ heap ]
7FFFFFFF7BB04000 8 8 - 8 read/write [ anon ]
[..]
FFFFFFFF7FFEA000 88 88 - 88 read/write [ stack ]
[..]
Figure 3.15 also shows that byte-set 3 has fewer than 16 unique values, orders of magnitude
less than the maximum number of 2^32. Therefore, these upper bytes and/or byte-sets are
great compression candidates.
The compression potential degrades as we move towards lower-order bytes and byte-sets.
For example, Figure 3.16 depicts the number of unique values for the fifth MSB (MSB4). Here
the potential is more limited as one could save at most 1 or 2 bits for the fifth VPN byte and
that only in specific workloads like apache or canneal. Therefore, bits below the 32-bit line are
not good unique-value compression candidates for the simulated workloads.
Pham et al. employed a compression mechanism in their clustered TLB design [57]. In
Section 5.3 “Frequent Value locality in the Address bits” they demonstrated the entropy, i.e.,
Figure 3.16: Number of unique values for MSB 4 and Byte-Set 4 (both in virtual and physical addresses).
average number of unique values, in the upper bits of VPN and PPN for their workloads. They
then employed two auxiliary tables (VUBT and PUBT) to keep track of the most common
virtual and physical upper bits respectively. These two tables were limited to 8 and 4 entries
respectively. Whenever the unique values did not fit in the two aforementioned tables due to
space constraints, the translations were limited to the unencoded ways of the baseline TLB.
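A minimal sketch of such upper-bits-table compression follows, in the spirit of (but not a faithful reproduction of) the VUBT/PUBT scheme described above; the class name and linear-search encoding are illustrative.

```python
# Hedged sketch of upper-bits-table compression: a translation entry
# stores a small index into a table of common upper-bit patterns instead
# of the bits themselves; entries whose pattern does not fit in the table
# remain unencoded (as in the cited design's unencoded TLB ways).
class UpperBitsTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.patterns = []                  # index -> upper-bits pattern

    def encode(self, upper_bits):
        """Return a table index, or None when full (leave unencoded)."""
        if upper_bits in self.patterns:
            return self.patterns.index(upper_bits)
        if len(self.patterns) < self.capacity:
            self.patterns.append(upper_bits)
            return len(self.patterns) - 1
        return None

    def decode(self, index):
        return self.patterns[index]

vubt = UpperBitsTable(capacity=8)           # 8 entries, as in the cited VUBT
idx = vubt.encode(0x00000001)               # common heap-side upper bits
assert vubt.decode(idx) == 0x00000001
assert vubt.encode(0x00000001) == idx       # reused, no new entry allocated
```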
3.8 The First Cache Block Access After A TLB-Miss
Virtual addresses are by default a closer representation of an application’s behaviour than
physical addresses, especially in an overloaded system where fragmentation might be present.
The physical addresses that the virtual addresses have been translated into can depart from,
and thus muddle, locality patterns that exist in the application’s accesses to its more common
data structures. It is possible to envision using TLB-filtered observations to guide cache opti-
mizations, such as TLB-guided cache prefetching. Even though this thesis does not evaluate
such a mechanism, it provides a useful observation for future research. This section examines
how often the cache block (memory address) that triggered a TLB miss matches the cache
block accessed the last time a TLB miss for that same page had occurred. If the two cache
block addresses match (matching refers to their memory addresses and not their contents),
then TLB-prefetching mechanisms could also prefetch the corresponding cache blocks, on top
of translations, thus potentially reducing the memory latency associated with these requests.
For example, if the last time a process missed on VPN A it had accessed the second 64B cache
block of the corresponding physical page, then the next time this process misses on VPN A
this mechanism would predict that it will again access the second 64B cache block. Because
the TLB tracks memory at a coarser granularity than caches, a TLB miss usually indicates
that the data in that page has not been accessed in a while and will thus likely exist either in
lower level caches or off-chip. Therefore, prefetching them in advance, as soon as the virtual to
physical translation is known, could improve performance.
Figure 3.17 depicts the percentage of L1 D-TLB misses that access the same 64B cache block
as the most recent TLB-miss to the same virtual page. On average, 51% of all TLB misses
across all workloads access the same 64B cache block as the most recent, from the same core,
TLB-miss to that translation-entry. This number goes up to 78% for the streaming workload.
[Figure 3.17 data, percent per workload: apache 51.64, TPC-C2 57.93, TPC-C1 56.41, canneal 24.42, ferret 16.10, x264 62.93, cassandra 54.54, classification 40.55, cloud9 50.77, nutch 72.44, streaming 77.80.]
Figure 3.17: Percentage of all CMP D-TLB L1 Misses that access the same 64B cache block as the last time that same translation-entry experienced a TLB miss.
These results indicate that there is a high predictability of cache accesses that miss in the L1
D-TLBs. As mentioned earlier in this thesis, the vast majority of TLB misses are to 8KB pages;
therefore, only seven bits (log2(8192/64)) per translation-entry would be required to keep track
of the cache block likely to be accessed on a TLB-miss.
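The seven-bit last-block predictor suggested by these numbers can be sketched as follows; names are illustrative, and a real design would store the offset alongside each TLB entry rather than in a separate table.

```python
# Illustrative sketch of the last-block predictor: each translation
# remembers the 64B block offset touched on its last TLB miss, needing
# log2(8192/64) = 7 bits per 8KB-page entry. The dict is a stand-in for
# per-entry storage.
PAGE_SIZE, BLOCK_SIZE = 8192, 64
OFFSET_BITS = (PAGE_SIZE // BLOCK_SIZE - 1).bit_length()   # 7 bits

last_block = {}                       # (context, vpn) -> last block offset

def on_tlb_miss(context, vpn, vaddr):
    """Return the predicted block offset (or None), then train."""
    offset = (vaddr % PAGE_SIZE) // BLOCK_SIZE
    predicted = last_block.get((context, vpn))
    last_block[(context, vpn)] = offset
    return predicted

assert OFFSET_BITS == 7
vaddr = (0xA << 13) + 2 * BLOCK_SIZE                # third block of page 0xA
assert on_tlb_miss(1, 0xA, vaddr) is None           # first miss: no history
assert on_tlb_miss(1, 0xA, vaddr) == 2              # repeat miss predicted
```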
3.9 Concluding Remarks
This chapter presented an analysis of TLB-related behaviour for a set of state-of-the-art ap-
plications, emphasizing commercial and cloud workloads. Our analysis involved the following
taxonomy: (i) characteristics inherent to the workloads, that is, characteristics or metrics unaf-
fected by translation caching structures like the TLBs, and (ii) other metrics (e.g., MPMI) that
are influenced by the architecture of these structures. The workloads’ data footprint in terms
of unique translations for page size, the presence of multiple processes and the CMP cores they
run on, as well as the lifetimes of translation mappings are all examples of characteristics that
are inherent to the workload. We believe that the characteristics and metrics presented here
should be of interest to anyone doing research in the area of address-translation optimizations.
Knowing the nuances of each workload can help both understand program behaviour and also
guide design decisions at the architectural level.
As anticipated, our results show that there is no single TLB model to match all workloads’
needs. Even within the same class of workloads, there is up to an order of magnitude variation
in the number of unique translations. Variations in TLB size requirements (high standard de-
viation) can exist across cores within a workload too. Workloads also exhibit different degrees
of translation sharing across cores, as well as different superpage usage, both behaviours that
rigid TLB hierarchies would poorly capture. Our TLB capacity sensitivity study further illus-
trates how most mainstream TLB structures (e.g., split TLBs) are either biased towards the
smallest page size or make an implicit assumption about the page size distribution of memory
accesses. Chapter 4 demonstrates how these assumptions can waste energy and space and pro-
poses Prediction-Based Superpage-Friendly TLB Designs that can allow translations of different
page-sizes to coexist in a single set-associative TLB, sharing its capacity as needed.
Each unique translation, and the TLB entry it might occupy, incorporates by default infor-
mation for the process it belongs to. These contexts provide us with another abstraction level to
observe TLB-related workload behaviour such as translation sharing across cores or per process
footprint. Even though there are variations in the frequency, data reach, and occasionally the
lifetime of each context, one should not filter or ignore them. Context-aware TLB indexing
schemes might warrant future research.
Despite the fluidity of so many characteristics, translation invalidations and modifications
are rare for the evaluated workloads. This observation was made in other research works
as well. The persistence of translation mappings encouraged researchers to propose changes
to the memory allocation algorithm to bypass paging for select large memory regions, e.g.,
direct segments [12], redundant memory mappings [45]. On our end, we believe that persistent
translation mappings can motivate history-based TLB schemes. Chapter 5 presents our history-
based cacheable TLB, a speculative (by configuration) design not kept coherent with the page
tables.
The last contributions of this chapter are the observations on (i) the compressibility of
translation entries, and (ii) the predictability of the cache block accessed within a page on a
TLB miss. Although not used in this work, we hope these results can motivate future research.
As this chapter illustrated, the landscape of TLB-related workload behaviour is vast. The
results presented here have charted different, often overlapping, facets of this landscape.
Chapter 4
Prediction-Based
Superpage-Friendly TLB Designs1
4.1 Overview
Several technology trends compound to make TLB performance and energy critical in today's
systems. Physical memory sizes and application footprints have been increasing without
a commensurate increase in TLB size and thus coverage. As a result, while TLBs still reap the
benefits of spatial and temporal locality due to their entries’ coarse tracking granularity, they
now fall short of the growing workload footprints. The use of superpages (i.e., large contiguous
virtual memory regions which map to contiguous physical frames) can extend TLB coverage.
Unfortunately, there is a “chicken and egg” problem: some workloads do not use superpages
due to the poor hardware support, and no additional support is added as workloads tend not
to use them.
The number of page sizes supported in each architecture varies. For example, x86-64 sup-
ports three page sizes: 4KB, 2MB and 1GB. UltraSparc III supports four page sizes: 8KB,
64KB, 512KB and 4MB, while the MMUs in newer generation SPARC processors (e.g., Sparc
T4) support 8KB, 64KB, 4MB, 256MB and 2GB page sizes [72]. Itanium and Power also
support multiple page sizes. For example, POWER8 supports 4KB, 64KB, 16MB and 16GB
pages [73]. Larger page sizes extend TLB reach, reduce the TLB miss handling penalty (assum-
ing multi-level page tables), and could even enable further data prefetching without crossing
smaller-page boundaries. But using larger page sizes risks fragmentation. The use cases of the
various page sizes vary across systems. For example, Power systems running Linux use 64KB
as their default page size; however, this choice can be harmful, e.g., if “an application uses
many small files, which can mean that each file is loaded into a 64KB page” [33]. In these
systems, 16MB pages require specific support (e.g., the Linux libhugetlbfs package); these
pages are “typically used for databases, Java engines, and high-performance computing (HPC)
1A modified version of this chapter has been previously published in the Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), February 2015 [55].
Chapter 4. Prediction-Based Superpage-Friendly TLB Designs 61
applications” [33].
While variety in page sizes may cater to each application’s memory needs/patterns [54],
it may burden the OS with selecting and managing multiple page sizes. It also makes TLB
design more challenging since the page size of an address is not known at TLB lookup. This is
a problem for set-associative designs as page offset bits cannot be used in the set index. Thus,
modern systems support multiple page sizes by implementing multiple TLB structures, one per
size, or alternatively resort to a fully-associative TLB structure.
Each design has its own trade-offs. Fully-associative (FA) TLBs seamlessly support all
page sizes, but are much more power hungry and slower than their set-associative counterparts.
Such slow access times are better tolerated in heavily multithreaded systems, such as Sparc T4,
where individual instruction latency does not matter as much. Separate per page-size TLBs
(e.g., SandyBridge, Haswell) are sized a priori according to anticipated page size usage. These
structures are all checked in parallel, each using an indexing scheme appropriate for the page
size they cache. If a workload does not use some page sizes, the extra lookups waste energy
and underutilize the allocated TLB area. Haswell's [34] and Skylake's [38] L2 TLBs are rare
examples of commercial set-associative designs that support two page sizes; unfortunately,
their indexing method has not been publicly disclosed. Finally, UltraSparc III is representative
of designs that distinguish only between 8KB pages and superpages, storing the latter in a small
FA structure. Workloads that heavily use superpages thrash the small FA TLB.
The goal of this work is to allow translations of different page sizes to co-exist in a single
set-associative (SA) TLB, even at the L1 level, while: (1) achieving a miss rate comparable to
that of an FA TLB, and (2) maintaining the energy and access time of an SA TLB. The target
TLB design should allow elastic allocation of entries to page sizes. That is: (1) A workload
using mostly a single page size should be able to use all the available TLB capacity so that it
does not waste any resources or be limited by predetermined assumptions on page size usage.
(2) A workload that uses multiple page sizes should have its translations transparently compete
for TLB entries. An SA TLB will better scale to larger sizes without the onerous access and
power penalties of a large FA TLB.
Our analysis of the TLB behaviour of a set of commercial and scale-out workloads that
heavily exercise existing TLBs has indicated that: (i) some workloads do use superpages heavily,
and (ii) workloads tend to favor the largest superpage size, while intermediate page sizes rarely
appear. Motivated by these results, we propose a lightweight binary superpage prediction
mechanism that accurately guesses ahead of time if a memory access is to a superpage or not.
This prediction enables an elastic TLBpred design that dynamically adapts its super- and regular
page capacity to fit the application’s needs.
The rest of this chapter is organized as follows. Section 4.2 extends Chapter 3’s TLB
behaviour analysis for a set of commercial and scale-out applications with energy, access-time,
and x86 native execution results, demonstrating the need for adaptive superpage translation
capacity in the TLB. Section 4.3 discusses our binary superpage prediction mechanism, and
Section 4.4 describes how we incorporate it in the proposed TLBpred. Section 4.5 presents a
summary of the previously proposed but not evaluated Skewed TLB (TLBskew) [69] which allows
translations of different page sizes to coexist. This section also presents our enhanced TLBpskew
proposal that uses dominant page-size prediction to boost TLBpskew’s effective associativity.
Sections 4.6 and 4.7 present our methodology and evaluation results respectively, followed by
an overview of the related work in Section 4.8 and concluding remarks in Section 4.9.
4.2 Analysis of TLB-Related Workload Behavior
This section extends the analysis of TLB behaviour presented in Chapter 3. All results presented
in this chapter, except for Section 4.2.3, use full-system emulation of a SPARC system running
Solaris (Section 4.6 details the experimental methodology used). Section 4.2.1 summarizes
previously reported statistics characterizing the workload footprints. Section 4.2.2 presents
how this set of workloads behaves under different TLB designs and also quantifies their access
time/energy trade-offs. We target the data TLB, as its performance is much worse than that of
the instruction TLB. Finally, Section 4.2.3 presents results from native runs on an x86 system
running Linux. These results illustrate that some key observations that motivate this work also
hold true in an x86 system and on a different operating system.
4.2.1 Unique Translations Analysis Recap
Section 3.3.1 from Chapter 3 presented the number of per-core unique translations for 8KB
pages and superpages accessed on average during the execution of a 16 billion instruction
sample on a 16-core CMP system. This system supports four page sizes: 8KB, 64KB, 512KB,
and 4MB pages. Based on the measurements from Tables 3.4 and 3.5, we make the following
empirical observations:
• The average number of unique pages accessed per core is one or two orders of magnitude
more than the mainstream L1 TLB capacity.
• Even though four page sizes are supported, only the 8KB and 4MB page sizes were
prominently used. No use of 64KB pages was recorded, while there were very few, if any,
512KB pages.
• Superpage use varied drastically across workloads. The OLTP workloads TPC-C1
and TPC-C2 (commercial database systems), and three scale-out applications from the
Cloud Suite, cassandra, classification, and nutch, use 4MB pages. These workloads can
easily thrash an unbalanced TLB design with limited superpage capacity.
The TLB sensitivity study presented in Section 3.6 explored how different TLB organizations
impact D-TLB MPMI. The following section focuses on four specific TLB designs and quantifies
their MPMI, as well as their dynamic energy and access-time trade-offs.
4.2.2 TLB Miss Analysis and Access-Time/Energy Trade-Offs
Table 2.1 in Chapter 2 listed current commercial D-TLB designs. This section will focus on four
TLB designs whose original L1 TLB configurations are reiterated in Table 4.1 below for
convenience. These configurations are adapted for this chapter as follows. The AMD-12h-like
configuration models a 48-entry FA TLB and the Sparc-T4-like a 128-entry FA TLB, both
with LRU replacement. The Haswell-like TLB design has been tuned for our system's supported
page sizes: it includes four distinct 4-way SA TLBs, a 64-entry TLB for 8KB pages and three
32-entry TLBs for the 64KB, 512KB and 4MB page sizes. Lastly, the UltraSparc-III-like TLB
has a 4-way SA 512-entry TLB for
8KB pages and a 16-entry FA TLB for superpages.
Processor Microarchitecture | L1 D-TLB Configuration
AMD 12h family [4] | 48-entry FA TLB (all page sizes)
Sparc T4 [72] | 128-entry FA TLB (all page sizes)
Intel Haswell [38], [34] | 4-way SA split L1 TLBs: 64-entry (4KB), 32-entry (2MB) and 4-entry (1GB)
UltraSparc III [76] | 2-way SA 512-entry TLB (8KB); 16-entry FA TLB (superpages and locked 8KB)

Table 4.1: Commercial D-TLB Designs
Figure 4.1 shows the L1 D-TLB MPMI for the aforementioned adapted TLB designs. Lower
MPMI is better. The series are sorted from left to right in ascending L1 TLB capacity.
[Figure: grouped bar chart of L1 D-TLB MPMI (y-axis, 0 to 16000) for each workload (Commercial: apache, TPC-C2, TPC-C1; PARSEC: canneal, ferret, x264; Cloud-Suite: cassandra, classification, cloud9, nutch, streaming) under the four designs: 48-entry FA (AMD 12h-like), 128-entry FA (Sparc T4-like), split-L1 (Haswell-like), and UltraSparc III-like.]

Figure 4.1: D-TLB L1 MPMI for Different TLB Designs
The FA TLBs often have fewer misses than their SA counterparts. Increasing FA TLB size
from 48 to 128 entries further reduces MPMI, following the trend shown in Section 3.6.2. The
UltraSparc-III-like TLB, with its larger capacity for 8KB pages, performs best for workloads
that mostly use 8KB pages, such as canneal. On the other hand, this design suffers when its
small 16-entry FA TLB gets thrashed by the many 4MB pages of a workload like classification
whose majority of TLB misses are due to superpages. The split Haswell-based L1 TLBs, with
their smaller overall capacity, perform much better for classification but fall short on most
others.
To summarize, the analysis shows that:
1. FA TLBs have a lower miss rate, more so given a larger number of entries.
2. Split-TLB designs are the least preferable choice for these workloads.
3. Capacity can be more important than associativity (e.g., canneal).
Figure 4.2 plots these TLB designs on a “dynamic energy per read access” versus “access
time” plane, using estimates from McPAT's CACTI [49]. The preferred TLB design would have
the MPMI of the Sparc-T4-like design (Figure 4.1), the fast access time of the Haswell-like split
L1 TLBs, and the dynamic read energy per access of the AMD-12h-like 48-entry FA TLB. We
approach this goal with an elastic set-associative TLB design that uses superpage prediction
as its key ingredient.
[Figure: scatter plot of the four TLB designs (48-entry FA AMD-12h-like, 128-entry FA Sparc-T4-like, split-L1 Haswell-like, UltraSparc-III-like) on a dynamic-read-energy-per-access (nJ, 0.004 to 0.01) versus access-time (ns, 0.05 to 0.3) plane; closer to the origin is better.]

Figure 4.2: Access Time and Dynamic Energy Trade-Offs
4.2.3 Native x86 Runs
To further demonstrate that superpages are frequent, thus requiring enhanced TLB support,
we measure,2 using performance counters, the portion of TLB misses due to superpages during
2The native x86 results [55] presented here in Section 4.2.3 were collected by Xin Tong; they are included in this thesis for completeness.
native execution on an x86 system. Table 4.2 lists the parameters of the x86 system used for
these native runs. The workloads were run for 120 seconds, with measurements taken every
2M TLB misses with oprofile. Only a subset of the workloads was available due to software
package conflicts. Only 4KB and 2MB pages were detected in this system.
Processor | Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
OS | Linux with Transparent Huge Page support enabled
L1 D-TLBs | 32-entry for 2MB pages & 64-entry for 4KB pages, all 4-way SA
L2 TLB | 512-entry, 4-way SA, shared across instructions and data
Table 4.2: System Parameters for Native x86 Execution
The results in Table 4.3 show that superpages (i.e., 2MB pages) can be responsible for a signifi-
cant portion of TLB misses in an x86 system too. For example, slightly more than half of all
L1 D-TLB misses for cassandra and classification are due to 2MB page accesses. This system
not only supports different page sizes, but also runs a different operating system and mem-
ory allocator algorithm. These results further support our other empirical observations that
superpages can be an important contributor to TLB misses.
Workload | % L1 D-TLB Misses | % L2 TLB Misses
canneal | 16.4 | 2.6
cassandra | 51.8 | 14.8
classification | 54.5 | 56.2
cloud9 | 21.4 | 33.3
Table 4.3: Fraction of TLB Misses due to 2MB Superpages (x86)
4.3 Page Size Prediction
The page size of a memory access is unknown at TLB lookup time. This is a challenge
for a set-associative TLB caching translations of all page sizes: without knowing the page size,
we cannot decide which address bits to use for the TLB index. This section explains how a
page-size predictor can be used to overcome this challenge.
For simplicity, let us assume a system with only two page sizes. A binary predictor, similar
to those used for branch direction prediction, would be sufficient here. Using an index available
at least a cycle before the TLB access (e.g., PC), the predictor would guess the page size and
then the TLB would be accessed accordingly. A TLB entry match could occur only if the
predicted size is correct. If this first, primary, lookup results in a TLB miss, then either the
prediction was incorrect or the entry is not in the TLB. In this case, another secondary TLB
Chapter 4. Prediction-Based Superpage-Friendly TLB Designs 66
lookup is needed with the alternate page size. If this lookup also results in a miss, then a page
walk ensues.
Most architectures support multiple (N) page sizes. Thus, a binary predictor does not suf-
fice to predict the exact page size of an access. In such a system, a page size predictor would
have to predict among multiple page sizes [24]. It could do so by using wider (log2(2N) bits)
or multiple saturating counters to predict among the N possible page sizes. Besides the addi-
tional hardware and energy costs of this predictor, which may be modest, mispredictions and
misses would become more expensive: on a misprediction, up to N − 1 additional lookups may
be needed if the translation is present in the TLB. These serial lookups hurt performance and energy. We do
evaluate such designs in Section 4.7.6. The rest of this section discusses superpage predictors.
Section 4.4 presents the complete TLBpred design.
4.3.1 Superpage Prediction
To avoid multiple sequential lookups, we take advantage of the observed application behavior
and opt for a binary approach distinguishing between 8KB pages and superpages. Our predictor
guesses whether the page is a superpage but it does not guess its exact size. We manage all our
superpages homogeneously, as Section 4.4 will show. The proposed predictor uses a Prediction
Table (PT) with 2-bit saturating counters. The PT is a direct-mapped, untagged structure,
similar to bimodal branch predictor tables. Each entry has four possible states. Each state is
represented as A_B, where A specifies the prediction in the current state, and B the prediction
in the next state after a misprediction. Both A and B can take only two values: P signifies a
small (8KB) page and SP a superpage. Thus, the four states are the following: (i) strongly
predicted 8KB page (P_P), (ii) weakly predicted 8KB page (P_SP), (iii) weakly predicted
superpage (SP_P), and (iv) strongly predicted superpage (SP_SP). All entries are initialized
to the weakly predicted 8KB state.
For prediction to be possible, the index must be available early in the pipeline. The instruc-
tion’s address (PC) is a natural choice and intuitively should work well as an instruction would
probably be accessing the same data structure for sufficiently long periods of time, if not for the
duration of the application. However, libraries and other utility code may behave differently.
Another option is the base register value, which is used during the virtual address calculation
stage and thus is available some time before the TLB access takes place. Figure 4.3 presents the
two predictors that use the PC or the base register value as the PT index respectively. In all
predictors, the prediction occurs only for memory instructions. In the SPARC v9 architecture,
memory instructions have the two most significant bits set to one as shown in the same figure.
PT entries are updated only after the page size becomes known: on a TLB hit or after the
page walk completes in case of a TLB miss. The predictor tables are never probed or updated
during demap or remap operations.
Sections 4.3.1.1 and 4.3.1.2 next detail the two PT index types. In both cases, the least
significant log2(#PT entries) bits from the selected field are used as the index, discarding any
high-order bits.

[Figure: the two predictor organizations. (a) A PC-indexed Prediction Table, indexed with low-order bits of the PC. (b) A base-register-value-indexed Prediction Table, indexed with bits of the rs1 value (instruction bits 18-14) read from the register file. Memory instructions are identified by their two most significant opcode bits.]

Figure 4.3: (a) PC-Based and (b) Base Register-Value Based Page Size Predictors
4.3.1.1 PC-based Predictor
The first predictor uses the low-order PC bits. This information is available early in the
pipeline, as soon as we have identified that this is a memory instruction. A concern with PC-
based prediction is that the page size of a given page will be “learned” separately for different
instructions. For example, a program that processes different fields of a data structure would
do so via different instructions. However, most likely these data fields will all fall within the
same type of page (i.e., a superpage or an 8KB page). Having a PC-based index unnecessarily
duplicates this information, resulting in slower learning times and more aliasing. Commercial
and scale-out workloads often have large instruction footprints, thus putting pressure on PC-
based structures.
4.3.1.2 Base Register-Value-Based (BRV-based) Predictor
Address computation in the SPARC ISA uses either two source registers (src1 and src2) or a
source register (src1) and a 13-bit immediate. The value of register src1 dominates the result
of the virtual address calculation in the immediate case, and more often than not in the
two-source-register case as well, since it typically holds a data structure's base address.
Therefore, we use the value of source register src1 as an index, after omitting the lower 22 bits
to ignore any potential page-offset bits (this offset corresponds to the 4MB superpage size).
To demonstrate how src1 dominates the memory address calculation we provide below a
typical compiler-generated assembly of two small loops. The first loop initializes an array,
while the second sums its elements. The generated assembly of these loops, compiled with g++
with -O3 optimization on a SPARC machine, is listed below. In SPARC assembly [74], the
destination register of instructions is listed last. For example, add %o2, 0x3, %o0 adds the
contents of register o2 with the immediate value 0x3 and stores the result in register o0.
//=============================================================================
// Loop 1
//=============================================================================
for (i=0; i< cnt; i++) {
a[i] = i + 3;
}
/* In the beginning of each loop iteration:
reg. o2 holds i, reg. o3 holds &a[0], and reg. o4 holds cnt.
Registers o1 and o0 are used as temporaries. */
main+0x30: 93 2a a0 02 sll %o2, 0x2, %o1
main+0x34: 90 02 a0 03 add %o2, 0x3, %o0
main+0x38: 94 02 a0 01 add %o2, 0x1, %o2
main+0x3c: 80 a2 80 0c cmp %o2, %o4
main+0x40: 06 bf ff fc bl -0x10 <main+0x30>
main+0x44: d0 22 c0 09 st %o0, [%o3 + %o1]
//=============================================================================
// Loop 2
//=============================================================================
for (i = 0; i< cnt; i++) {
sum += a[i];
}
/* Reg. o2 holds i, reg. o3 holds &a[0], and reg. o4 holds cnt, as before.
Reg. i0 holds sum, while o0 and o1 are used as temporaries. */
main+0x64: 91 2a a0 02 sll %o2, 0x2, %o0
main+0x68: d2 02 c0 08 ld [%o3 + %o0], %o1
main+0x6c: 94 02 a0 01 add %o2, 0x1, %o2
main+0x70: 80 a2 80 0c cmp %o2, %o4
main+0x74: 06 bf ff fc bl -0x10 <main+0x64>
main+0x78: b0 06 00 09 add %i0, %o1, %i0
Both loops have one memory instruction that either modifies an array element (st, store
instruction in Loop 1) or reads an array element (ld, load instruction in Loop 2). The store
instruction in the first loop is in the branch delay slot, and is executed irrespective of the
branch outcome. In both these store and load instructions, the src1 register (bits 18-14 of the
instruction in SPARC-V9) is o3, which holds the base address of the array (i.e., &a[0]). The
src2 register (instruction bits 4-0) is register o1 for the store and register o0 for the load. Thus,
as expected, the value of o3 (array’s base address) will dominate.
By using only the base register value, and not the entire virtual address, prediction can
proceed in parallel with the address calculation. Accordingly, there should be ample time to
access the tiny 32-byte prediction table that Section 4.7 shows is sufficient.
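Both index derivations keep only the least significant log2(#PT entries) bits of the selected field. A minimal sketch follows (Python; the helper names are ours, and dropping the two always-zero PC alignment bits is our assumption):

```python
# Sketch of the two PT index derivations. A 32-byte PT of 2-bit
# counters holds 128 entries, so 7 index bits are kept.

PT_ENTRIES = 128
INDEX_MASK = PT_ENTRIES - 1

def pc_index(pc):
    """PC-based index: low-order instruction-address bits. Dropping the
    two always-zero alignment bits (SPARC instructions are 4-byte
    aligned) is our assumption, not spelled out in the text."""
    return (pc >> 2) & INDEX_MASK

def brv_index(src1_value):
    """Base-register-value index: omit the lower 22 bits so any potential
    page-offset bits (up to the 4MB superpage) are ignored, then keep
    the least significant log2(PT_ENTRIES) bits."""
    return (src1_value >> 22) & INDEX_MASK
```

Because `brv_index` needs only the raw register value, it can be computed in parallel with the address addition itself.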
4.4 Prediction-Guided Multigrain TLB
The proposed multi-grain TLB, TLBpred, is a single set-associative structure that uses two
distinct indices: an 8KB-based and a superpage-based index. This binary distinction mirrors
the observation that there are two prominent page sizes used in the analyzed system (8KB and
4MB). The multi-grain TLB can host translations of any page size, as its tags are wide
enough for the smallest supported page size (8KB).
Figure 4.4 shows the indexing scheme used for a given TLB size. All superpages, irrespective
of their size, share the same indexing bits. Also, Figure 4.5 shows a potential implementation
of the tag comparison for a predicted superpage access. With this indexing scheme all page
sizes are free to use all sets. Consecutive 8KB pages and 4MB pages, the two prominent page
sizes, map to consecutive sets. Consecutive 64KB or 512KB pages may map to the same set as
they use the same index bits as 4MB pages and thus may suffer from increased pressure on the
TLB. As these pages are relatively infrequent, this proves not to be a problem.
[Figure: per-page-size breakdown of the virtual address into page-offset, set-index and tag bits. Page offsets end at bit 12 (8KB), 15 (64KB), 18 (512KB) and 21 (4MB). 8KB pages take their set index from the bits just above the page offset, while 64KB, 512KB and 4MB pages all share the 4MB set-index bits.]

Figure 4.4: Multigrain Indexing with 4 supported page sizes, shown here for a 512-entry 8-way SA TLB (6 set-index bits).
[Figure: composition of the tag used for comparison on a superpage-predicted lookup. Each translation entry stores a valid bit, virtual-address bits 63-28 (tag), 27-22 (set index) and 21-13, plus a 2-bit Page Size field, Context and G bits. On a superpage lookup, the stored bits 21-13 are masked according to the entry's page size: 0x000 for 4MB, 0x1c0 for 512KB, 0x1f8 for 64KB.]

Figure 4.5: Multigrain Tag Comparison for Figure 4.4's TLB on superpage prediction. Page Size field (2 bits) included in every TLB entry.
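Under this layout, the set-index selection and the superpage-path tag comparison can be sketched as follows (a behavioural Python model for the 512-entry 8-way configuration; helper names are ours, not hardware):

```python
# Behavioural model of the multigrain indexing (Figure 4.4) and the
# superpage-path tag comparison (Figure 4.5) for a 512-entry 8-way SA
# TLB: 64 sets, 6 set-index bits. Page offsets: 8KB=13, 64KB=16,
# 512KB=19, 4MB=22 bits.

SET_MASK = 0x3F  # 6 set-index bits

def set_index(vaddr, predicted_superpage):
    if predicted_superpage:
        return (vaddr >> 22) & SET_MASK   # all superpages share the 4MB index
    return (vaddr >> 13) & SET_MASK       # 8KB index

# Masks for the nine "middle" bits VA[21:13]: on a superpage lookup,
# bits that fall inside the stored entry's page are ignored.
MIDDLE_MASK = {22: 0x000,   # 4MB: no middle bit is a tag bit
               19: 0x1C0,   # 512KB: VA[21:19] are tag bits
               16: 0x1F8}   # 64KB: VA[21:16] are tag bits

def superpage_tag_match(vaddr, entry_vaddr, entry_offset_bits):
    """Compare an incoming address against a stored entry on a superpage
    lookup; the entry keeps VA[63:28], VA[27:22] and VA[21:13]."""
    hi  = lambda v: v >> 28
    idx = lambda v: (v >> 22) & SET_MASK
    mid = lambda v: (v >> 13) & 0x1FF
    m = MIDDLE_MASK[entry_offset_bits]
    return (hi(vaddr) == hi(entry_vaddr) and
            idx(vaddr) == idx(entry_vaddr) and
            (mid(vaddr) & m) == (mid(entry_vaddr) & m))
```

Because the mask comes from the entry's stored Page Size field, a single comparator circuit serves all superpage sizes.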
While superpage prediction proves highly accurate, correctness must be preserved on mis-
predictions. Table 4.4 details all possible scenarios. The common case given the high TLB hit
rate and prediction accuracy is to have a TLB hit and a correct prediction. A TLB hit during
the primary TLB lookup implies a correct page size class (superpage or not) prediction, as each
entry’s page size information is used for the tag comparison. On a TLB miss, however, there is
a degree of uncertainty. A secondary TLB lookup is necessary, this time using the complement
page size class. For example, if the prediction was for an 8KB page, the secondary lookup uses
the superpage based index. In total, at most two lookups are necessary.
TLB Lookup Outcome (w/ Predicted Page Size) | Page Size Prediction | Effect
Hit | Correct | Expected common case. No further TLB lookups are required.
Hit | Incorrect | Only possible if both the primary and the secondary lookup probe the same TLB set (i.e., the set-index bits for 8KB and 4MB pages are the same) and the hardware supports it.
Miss | X (Unknown) | This could either be a misprediction (i.e., an incorrect TLB index was used) or a TLB miss.
Table 4.4: Primary TLBpred Lookup
Table 4.5 shows the two possible outcomes for this secondary TLBpred lookup (occurring
only on a primary TLBpred lookup miss) given a binary superpage predictor. A secondary
TLBpred hit implies an incorrect page-size prediction and doubles the TLB lookup latency.
However, this event is rare with an accurate predictor. Conversely, a secondary TLBpred miss
triggers a page-walk, making the latency of the secondary TLB lookup negligible compared to
the lengthy page walk latency.
Secondary TLB Lookup Outcome | Original Page Size Prediction | Effect
Hit | Incorrect | Misprediction. The second TLB lookup is successful.
Miss | X (Irrelevant) | True TLB miss. The page size is still unknown and a page walk is needed.
Table 4.5: Secondary TLBpred Lookup Using a Binary Superpage Predictor.
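Combining Tables 4.4 and 4.5, the overall TLBpred lookup amounts to at most two probes before a page walk. A minimal sketch (Python; probe and walk are hypothetical callables standing in for the TLB and the page-table walker):

```python
# Sketch of the two-phase TLBpred lookup (Tables 4.4 and 4.5).
# probe(vaddr, superpage) returns a translation or None;
# walk(vaddr) performs the page walk and returns
# (translation, is_superpage). stats counts the rare events.

def tlbpred_translate(vaddr, pred_superpage, probe, walk, stats):
    t = probe(vaddr, pred_superpage)            # primary lookup
    if t is not None:
        return t, pred_superpage                # common case: hit, correct prediction
    t = probe(vaddr, not pred_superpage)        # secondary lookup, complement class
    if t is not None:
        stats["mispredictions"] += 1            # hit, but the prediction was wrong
        return t, not pred_superpage
    stats["walks"] += 1                         # true TLB miss
    return walk(vaddr)

# The caller trains the predictor with the returned page-size class.
```

With an accurate predictor the secondary probe is rare, so the average lookup cost stays close to that of a plain SA TLB.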
Sections 4.4.1 and 4.4.2 next discuss how to (a) extend TLBpred to different page-size usage
scenarios, and (b) handle special TLB operations.
4.4.1 Supporting Other Page Size Usage Scenarios
In our analysis we have observed a bimodal page size distribution (i.e., two prominent page
sizes) both in SPARC and x86, which motivated our superpage predictor. This distribution
was expected for x86-64 which supports 4KB, 2MB and 1GB pages. The 1GB page size, when
enabled, is judiciously used so as not to unnecessarily pin such large memory regions. In all
cases, the proposed TLBpred works correctly for any page size distribution, possibly experiencing
increased set pressure for the non dominant page sizes (see Figure 4.15). We anticipate that the
observation that some page sizes dominate will hold in different architectures that also support
multiple page sizes.
4.4.1.1 Precise Page Size Prediction
Assuming that multiple page sizes may be actively used, one solution to avoid conflict misses
would be to use a predictor that predicts the exact page size [24]. Thus, contiguous pages of
all page sizes would map to subsequent sets. The downside is that all mispredicted TLB hits
and all TLB misses would pay the penalty of multiple sequential lookups, which could be hefty
in systems with a large number of supported page sizes. Section 4.7.6 touches upon such page
size usage scenarios further.
Table 4.6 summarizes the possible outcomes of the secondary TLB lookups for a page size
predictor predicting among N possible page sizes. A non-primary TLB lookup that hits in the
TLB signals a page-size misprediction. This misprediction overhead is high, making the TLB
hit latency anywhere from 2 to N times that of a primary TLB hit, and
may result in having to replay any dependent instructions that were speculatively scheduled
assuming a cache hit. The more page sizes are supported in a system with precise page size
prediction, the higher the misprediction overhead in case of multiple secondary lookups or a
TLB miss.
i-th TLB Lookup Outcome | Original Page Size Prediction | Effect
Hit | Incorrect | i-times TLB lookup latency.
Miss (i < N) | X (Irrelevant) | Repeat lookup with the (i+1)-th page size.
Miss (i = N) | X (Irrelevant) | True TLB miss. A page walk is in order, thus the increase in latency is, in proportion, small.
Table 4.6: i-th TLB Lookup (1 < i ≤ N); N supported page sizes.
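The serial probing this table describes can be sketched as a loop that tries the predicted size first and the remaining N − 1 sizes afterwards (Python; probe and walk are hypothetical callables):

```python
# Sketch of serial lookups with a precise page-size predictor: try the
# predicted size first, then the remaining N-1 sizes, then walk.

def precise_translate(vaddr, predicted_size, page_sizes, probe, walk):
    order = [predicted_size] + [s for s in page_sizes if s != predicted_size]
    for i, size in enumerate(order, start=1):
        t = probe(vaddr, size)
        if t is not None:
            return t, i                 # i > 1 signals a page-size misprediction
    return walk(vaddr), len(order)      # true miss after all N lookups
```

The returned lookup count makes the cost structure of Table 4.6 explicit: a correct prediction costs one probe, a mispredicted hit up to N, and a true miss N probes plus the walk.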
4.4.1.2 Predicting Among Page Size Groups
In SPARC, the two prominent page sizes were the smallest and the largest supported, and the
difference between the page sizes was not stark. However, this might not be the case in other
systems. In x86-64, the largest page size is 1GB. Having 2MB pages share the same TLB set-
index as the 1GB pages could result in 512 contiguous 2MB pages competing for the same set,
whereas in our system at most 64 contiguous 64KB entries would map to the same set.
A preferable option to precise page-size prediction, one that lowers its worst-case penalty, would be to
have TLBpred predict among groups of page sizes, following the same principle as superpage pre-
diction. These architecture-specific groups should be judiciously selected to minimize potential
set pressure due to common indexing. For example, instead of predicting across five page-sizes
in Sparc T4, one could predict among the following three groups: (i) 8KB, (ii) 64KB and 4MB,
(iii) 256MB and 2GB. Within a group, the index of the largest page-size would be used by the
smaller pages. In all cases the TLB entries will have sufficient bits to host the translation of
the smallest page size, including each translation’s page size information. Finally, for very large
page sizes (GB range), that are by default sparsely used and may be limited to mapping special
areas of memory (e.g., the memory of graphics co-processors), it might be worthwhile exploring
the use of a small bloom filter as a TLBpred addition or use Direct Segments instead [12].
4.4.2 Special TLB Operations
MMUs can directly modify specific TLB entries via special instructions. In the Cheetah-MMU
that our emulated Ultrasparc-III system uses, it is possible to modify a specific entry in the
FA superpage TLB, for example to modify locked entries or to implement demap operations.
In TLBpred, it is possible that the virtual address of the original TLB entry and the virtual
address of the modified TLB entry map to different sets, requiring some additional steps. In
general, any TLB coherence operation can be handled similarly to a regular TLB operation,
potentially requiring multiple lookups only in the rare cases the exact page size is relevant to
the operation but is not known. A demap-context or a demap-page operation would not fall
under this category.
Chapter 4. Prediction-Based Superpage-Friendly TLB Designs 73
4.5 Skewed TLB
A design that supports multiple page sizes in a single structure is the Skewed TLB, TLBskew [69].
Unfortunately, no experimental evaluation of its performance exists to date. This section re-
views the TLBskew design and explains how we applied it in our evaluated system. In TLBskew,
similarly to the skewed associative caches [70], the blocks of a set no longer share the same
index. Instead, each way has its own index function. However, unlike skewed-associative caches
where all addresses see the same associativity, in TLBskew a page maps only to a subset of ways
depending on its actual page size and its address.
The TLBskew hash functions are designed in such a way that a given virtual address can
only reside in a subset of the TLB’s ways, resulting in a per-page-size effective associativity.
The page size of this address’s virtual page determines this subset. At lookup time, when the
page size is yet unknown, log2(TLB associativity) bits of the virtual address are used by a
page size function, which determines that this address can reside in way-subset X as page
size Y. This expected size Y information is incorporated in each way’s set index, ensuring that
both the page offset bits and the page size function bits for this way are discarded.
Table 4.7 shows the page size function mapping proposed by Seznec for an 8-way skewed-
associative TLB [69] for the Alpha ISA. Our system supports the same page sizes. With this
mapping, a translation for a given virtual page (which has a specific page size) can only reside
in two out of the eight TLB ways. For this page size function and TLB organization, virtual
address A with bits 23-21 zero can map (i) to ways 0 and 4 if part of an 8KB page, (ii) to ways
1 and 5 if part of a 64KB page, (iii) to ways 2 and 6 if part of a 512KB page, or (iv) to ways
3 and 7 if part of a 4MB page. As Table 4.7 shows, bit VA[23] does not matter for mapping
8KB and 64KB pages (i.e., it is a don’t-care value), while bit VA[21] is a don’t-care value for mapping 512KB and 4MB pages.
Virtual Addr. Bits 23-21 8KB 64KB 512KB 4MB
000 ways 0 & 4 ways 1 & 5 ways 2 & 6 ways 3 & 7
001 ways 1 & 5 ways 0 & 4 ways 2 & 6 ways 3 & 7
010 ways 2 & 6 ways 3 & 7 ways 0 & 4 ways 1 & 5
011 ways 3 & 7 ways 2 & 6 ways 0 & 4 ways 1 & 5
100 ways 0 & 4 ways 1 & 5 ways 3 & 7 ways 2 & 6
101 ways 1 & 5 ways 0 & 4 ways 3 & 7 ways 2 & 6
110 ways 2 & 6 ways 3 & 7 ways 1 & 5 ways 0 & 4
111 ways 3 & 7 ways 2 & 6 ways 1 & 5 ways 0 & 4
Table 4.7: Page Size Function described in Skewed TLB [69].
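The mapping of Table 4.7 can be expressed directly as a lookup table. A minimal sketch follows; the function name and encoding are ours (the hardware computes this combinationally):

```python
# Table 4.7 as a lookup: for each page size, the base way selected by
# VA[23:21]; an entry can then live in ways (base, base + 4) of the 8-way TLB.
PAGE_SIZE_FUNC = {
    "8KB":   [0, 1, 2, 3, 0, 1, 2, 3],   # VA[23] is a don't care
    "64KB":  [1, 0, 3, 2, 1, 0, 3, 2],   # VA[23] is a don't care
    "512KB": [2, 2, 0, 0, 3, 3, 1, 1],   # VA[21] is a don't care
    "4MB":   [3, 3, 1, 1, 2, 2, 0, 0],   # VA[21] is a don't care
}

def way_subset(va: int, page_size: str) -> tuple:
    """Return the pair of ways where this (address, page size) may reside."""
    bits = (va >> 21) & 0x7              # VA[23:21]
    base = PAGE_SIZE_FUNC[page_size][bits]
    return (base, base + 4)
```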
Figure 4.6 shows the set-index selection bits of the virtual address for the four page sizes of
our baseline system. The set index and page size function bits are based on a 512-entry, 8-way
TLBskew. The set index has six bits, discarding any page size selection bits. These are the
indexing functions presented in the original skewed TLB paper [69], adjusted here for a smaller
TLB size. During lookup, these eight indices are computed, one per way. In our previous
example with virtual address A where VA[23:21] was 0, a hit in way 0 signifies an 8KB page.
[Figure: for each page size (8KB, 64KB, 512KB, 4MB), the virtual-address bit fields: page-offset bits (below bits 12, 15, 18, and 21 respectively), the bits used for the page-size function, the set-index bits for ways 0-3, and the XOR-derived set-index bits for ways 4-7.]
Figure 4.6: Skewed Indexing (512 entries, 8-way skewed associative TLB) with 4 supported page sizes.
The number of supported page sizes is hard-wired into the hash indexing functions, and all page sizes have the same effective associativity (two in our example). When an entry needs to be allocated and a replacement is in order, only the ways available to that page size are searched for an eviction candidate. In our prior example, if virtual address A belongs to an
8KB page (page size is known during allocation), then only ways 0 and 4 are searched. Since
the potential victims will reside in different sets, unless all set-index bits for ways 0-3 or 4-7 are
zero, an LRU replacement policy could be quite expensive.
Section 4.7.4 will evaluate both an LRU and an easier to implement “Random-Young”
replacement policy.
4.5.1 Prediction-Guided Skewed TLB
TLBskew allows a workload to utilize the entire TLB capacity even if it only uses a single page
size. However, the effective associativity limits the replacement candidates, causing translation
contention. The default skewed indexing scheme better caters to a uniform use of page sizes,
but Section 4.2 and also Chapter 3 showed that this is not the common case. There are three
considerations:
1. Because superpages cover coarser memory regions, there are far fewer of them than 8KB pages. Thus a uniform distribution might not be the best fit.
2. Some workloads, e.g., scientific workloads, mainly use 8KB pages. For them the effective
associativity is an unnecessary limitation.
3. For TLBs which use contexts or ASIDs to avoid TLB shootdowns on a context-switch,
the same virtual page could be used by different processes, with all those entries mapping
onto the same set. This mapping can apply more pressure due to the imposed effective
associativity limit.
We propose enhancing TLBskew with page size prediction with the goal of extending the
effective associativity per page size. Specifically, one way of increasing effective associativity
would be to perform two lookups in series. In the first we could check for 8KB or 4MB hits
and in the second for 64KB or 512KB hits. This way we can use more ways on each lookup
per size as there are only two possible sizes each time. The downside of this approach is that
it would prolong TLB latency for TLB misses and for 64KB/512KB pages.
We can avoid serial lookups for TLB hits while still increasing effective associativity by using
a prediction mechanism. Specifically, we adapt the superpage prediction mechanism so we do
not predict between 8KB pages and superpages, but between pairs of page sizes. We group the
most used page sizes (i.e., 8KB and 4MB) together and the less used page sizes (i.e., 64KB and
512KB) into a separate group. Our binary base-register value based page size predictor, with
the same structure as before, now predicts between these two pairs of pages.
The TLBskew hash functions are updated accordingly so that now only bit 22 (counting from zero) of the address is used for the page size function. For example, if this bit is
zero and we have predicted the 8KB-4MB pair, then this virtual address can reside in ways 2,
3, 6 or 7 as a 4MB page and in ways 0, 1, 4 and 5 as an 8KB page. Similar to TLBpred, if we
do not hit during this primary lookup, we use the inverse prediction of a page size pair and do
a secondary TLB lookup. Choosing which page sizes to pair together is crucial; in our case the
design choice was obvious, as our workloads’ page size usage was strongly biased.
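The paired primary/secondary lookup can be sketched as follows. This is a toy model: the "TLB" is a plain dictionary keyed by (page-size pair, VPN), abstracting away the skewed set/way indexing, and all names are ours.

```python
# Toy TLBpskew lookup over the two page-size pairs used in this section.
def other_pair(pair: str) -> str:
    """Inverse of the predicted page-size pair."""
    return "64KB/512KB" if pair == "8KB/4MB" else "8KB/4MB"

def lookup(tlb: dict, predicted_pair: str, vpn: int):
    """Primary probe with the predicted pair; on a miss, a secondary probe
    with the inverse pair. Returns (translation or None, number of lookups)."""
    hit = tlb.get((predicted_pair, vpn))
    if hit is not None:
        return hit, 1
    return tlb.get((other_pair(predicted_pair), vpn)), 2
```

A correct prediction costs a single probe; a misprediction or a true miss costs two, which is the latency trade-off quantified in the performance model of Section 4.7.5.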
4.6 Methodology
This work uses SimFlex [35], a full-system simulator based on Simics [52]. Simics models the
SPARC ISA and boots Solaris. All experiments are run on a 16-core CMP system, for 1 billion instructions per core, for a total of 16 billion executed instructions. As Section 3.2 stated,
to achieve reasonable simulation time, we collected D-TLB access traces for all our workloads
during functional simulation. We relied on Simics API calls (e.g., probing TLBs/registers) to
extract translation information.
In Simics each core models the Cheetah-MMU, the memory management unit for the
UltraSPARC-III processors with the D-TLB sizes shown in Table 4.1. Table 4.8 summarizes
the relevant fields of each TLB entry. In our system, the TLBs are software-managed. There-
fore, on a TLB miss a software trap handler walks the page tables and refills the TLB. This
is contrary to x86 systems where the TLBs are hardware-managed. Software-managed TLBs
allow for a more flexible page table organization, but at the cost of flushing the core’s pipeline
and potentially polluting hardware structures such as caches. In the simulated system, the trap
handler checks the Translation Storage Buffer (TSB) before walking the page tables. The TSB
is a direct-mapped, virtually-addressable data structure, which is faster to access than the page
tables. Most TLB misses hit in the TSB, requiring only 10-20 instructions in the TLB handler and a single quad load to access the TSB. All accesses are included in our trace. However, the traces should be representative even of systems with hardware page walkers, as the number of references due to the TSB is very small compared to the overall number of references and to those needed on page walks.
TLB Field (size in bits)   Description
VPN                        Virtual Page Number
Context (13)               The equivalent of the Address Space Identifier (ASI) in x86; prevents TLB flushing on a context-switch. The same VPN could map to different page frames based on its context3.
Global Bit (1)             Global translations are shared across all processes; the context field is ignored.
Page Size (2)              Specifies the page size in ascending order: 8KB, 64KB, 512KB and 4MB. Superpages are only allocated in the fully-associative TLB.
PPN                        Physical Page (Frame) Number
Table 4.8: TLB Entry Fields
Workloads: This chapter uses the set of eleven commercial, scale-out and scientific workloads
summarized in Table 3.3. These workloads were selected as they are sensitive to modern TLB
configurations.
3 The context that should be used for a given translation is extracted from a set of context MMU registers. The correct register is identified via the current address space identifier (i.e., ASI PRIMARY, ASI SECONDARY, or ASI NUCLEUS). For a given machine, the latter depends on the instruction type (i.e., fetch versus load/store) and the trap-level (SPARC supports nested traps).
4.7 Evaluation
This section presents the results of an experimental evaluation of various multi-grain designs.
Section 4.7.1 shows how accurate our superpage predictors are. Section 4.7.2 demonstrates
that TLBpred reduces TLB misses for the applications that access superpages and that it is
robust, not hurting TLB performance for the other applications. Section 4.7.3 contrasts the
energy of different TLB designs, including our TLBpred. Section 4.7.4 evaluates the TLBskew
and TLBpskew skewed TLB designs, while Section 4.7.5 models the resulting overall system
performance. Finally, Section 4.7.6 investigates how TLBpred performs under hypothetical,
worst case page usage scenarios.
4.7.1 Superpage Prediction Accuracy
To evaluate the effectiveness of the superpage predictor we use its misprediction rate, i.e., the
number of mispredictions over the total number of TLB accesses. The superpage predictor
described in Section 4.3.1 is used, with the transitions summarized in Figure 4.7.
[Figure: four predictor states P_P, P_SP (predict 8KB page) and SP_P, SP_SP (predict superpage); observing a superpage moves the state toward SP_SP, observing an 8KB page moves it toward P_P.]
Figure 4.7: PT Entry Transition Diagram
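Assuming the four states behave as a standard 2-bit saturating counter (P_P = 0 through SP_SP = 3, with the upper two states predicting a superpage), the predictor can be sketched as follows; the class and method names are ours:

```python
class SuperpagePredictor:
    """Binary superpage predictor: one 2-bit saturating counter per prediction
    table entry (states P_P=0, P_SP=1, SP_P=2, SP_SP=3)."""
    def __init__(self, entries: int = 128):   # 128 x 2 bits = 32B of state
        self.table = [0] * entries

    def predict(self, index: int) -> bool:
        """True => predict superpage (states SP_P and SP_SP)."""
        return self.table[index % len(self.table)] >= 2

    def update(self, index: int, was_superpage: bool) -> None:
        """Move toward SP_SP on a superpage, toward P_P on an 8KB page."""
        i = index % len(self.table)
        if was_superpage:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```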
Figure 4.8 shows how the misprediction rate varies over different PT indexing schemes
(x-axis labels) and different PT sizes (series). The misprediction rate is independent of the TLB
organization. A lower misprediction rate is better as it will reduce the number of secondary
TLB lookups. Three predictor indexing schemes are shown: (1) the PC, (2) the Base-Register (src1) Value, and
(3) the 4MB-page granularity of the actual virtual address. The last scheme is impractical since
it places the prediction table in the critical path between the address calculation and the TLB
access. However, it serves to demonstrate that the base register value scheme comes close to
what would be possible even if the actual address was known.
All prediction schemes perform well. The PC-based index is the worst due to aliasing and
information replication. These phenomena are less pronounced for the scientific workloads (e.g.,
canneal) that have smaller code size. Using the register-value based index performs consistently
better than the PC-index. The BRV-based predictor is almost as accurate as the exact address-
[Figure: page-size misprediction rate (%), 0-20, predicting 8KB vs. superpages; x-axis: PT index type (PC, 4MB VPN, BRV) per workload (Commercial: apache, TPC-C2, TPC-C1; PARSEC: canneal, ferret, x264; Cloud-Suite: cassandra, classification, cloud9, nutch, streaming); series: 32, 128, 512, and 1024 PT entries.]
Figure 4.8: Superpage-Prediction Misprediction Rate (%)
based predictor (4MB VPN ), which demonstrates that the source register src1 dominates the
address calculation outcome as expected.
The different series per PT index explore how the size of the prediction table influences
the misprediction rate. The bigger the table, the lower the risk of destructive aliasing. With
a minuscule 128-entry PT, which requires a meager 32B of storage, the average misprediction
rate across the workloads is 0.4% for the base register-value based PT index. Canneal exhibits
the worst misprediction rate of just 1.2%. Unless otherwise noted, the rest of this evaluation
uses this 32B superpage predictor.
4.7.2 TLBpred Misses Per Million Instructions and Capacity Distribution
Our goal was an elastic set-associative TLB design that would have the low MPMI of a Sparc-T4-like 128-entry FA TLB, the fast access time of Haswell-like split L1 TLBs, and the dynamic read-energy per access of an AMD-like 48-entry FA TLB, within a reasonable hardware budget. Figure 4.9
compares the MPMI of different TLBpred configurations to the MPMI of commercial-based TLB
designs. We vary the TLBpred associativity to ensure a power-of-two number of TLB sets. The results are
normalized over the AMD12h-like TLB. Numbers below one correspond to MPMI reduction;
the lower the better.
The 128-entry FA TLB (SPARC-T4-like), targeted for its low MPMI, is consistently better
[Figure: TLB MPMI relative to the AMD-like 48-entry FA TLB, per workload (Commercial: apache, TPC-C2, TPC-C1; PARSEC: canneal, ferret, x264; Cloud-Suite: cassandra, classific., cloud9, nutch, streaming); series: AMD12h-like, SPARC-T4-like, Haswell-like, Ultrasparc-III-like, and TLBpred at 128-entry 4-way SA, 160-entry 5-way SA, 256-entry 4-way, and 512-entry 4-way SA; y-axis clipped at 1.6, with out-of-range bars labeled 3.8 and 10.6.]
Figure 4.9: TLBpred MPMI relative to AMD-like 48-entry FA TLB
than the smaller 48-entry FA TLB (baseline); its MPMI ranges from 10.9% better for ferret to
97.5% better for classification. Our 256-entry set-associative TLB with its small 32B binary
predictor is the TLBpred configuration which meets that goal. Its MPMI ranges from 12.4%
to 82.5% better than the 48-entry FA baseline, and its AMEAN MPMI across all workloads is
7.7% better than the SPARC-T4-like. While this configuration uses twice as many entries as
the corresponding SPARC-T4-like configuration, it is set-associative and, as will be shown, faster and more energy efficient.
Compared to the Haswell-like TLB configuration, even the smallest 128-entry TLBpred is
considerably better. The 128-entry FA TLB has lower MPMI than the 256-entry TLBpred
for classification. This workload has the highest number of private per-core contexts of all the workloads, resulting in many pages (from different processes) with the same virtual address conflicting in the set-associative TLBpred. Even so, TLBpred still achieves a lower MPMI than the baseline, and is considerably better than even larger set-associative designs like the UltraSparc-III-like, whose relative MPMI is 10.6 for that workload.
4.7.2.1 TLBpred Capacity Distribution
TLBpred’s goal was to allow translations of multiple page-sizes to co-exist in a single set-
associative structure. Figure 4.10 shows a snapshot of the TLB capacity distribution for the
256-entry 4-way SA TLBpred, for all 16 cores, at the end of our simulations for a subset of our
workloads; the remaining workloads exhibit similar behaviour. Contrary to split-TLB designs
that have a fixed hardware distribution of the available L1-TLB capacity to different page sizes,
TLBpred’s capacity is dynamically shared across translations of different page sizes as needed.
Thus, for workloads like cassandra and classification, which heavily use superpages, 30-40%
of the available TLB capacity is occupied by translations for 4MB pages, whereas workloads
like canneal or cloud9 use almost 98-99% of their capacity for 8KB page translations. The
TLBpred capacity distribution also varies across CMP cores. For example, in TPC-C1, 53% of core #6's TLBpred capacity holds 4MB page translations, versus 17% on average for the other cores.
[Figure: per-core (cores 0-15) capacity distribution of the 256-entry 4-way SA TLBpred for apache, TPC-C2, TPC-C1, canneal, cassandra, classification, and cloud9; stacked segments: 8KB, 64KB, 512KB, and 4MB page translations, plus unoccupied/invalid entries; y-axis 0-100%.]
Figure 4.10: TLBpred per core capacity distribution over translations of different page sizes.
4.7.3 Energy
Figure 4.11 presents the total dynamic energy (in mJ) for a set of TLB designs. Using McPAT's CACTI [49], we collected the following three measurements for every TLB configuration for a 22nm
technology: (i) read energy per access (nJ), (ii) dynamic associative search energy per access
(nJ), added to the read energy (i) in case of fully-associative structures, and (iii) write energy
per access (nJ). For TLB organizations with multiple hardware structures (e.g., Haswell) these
measurements were per TLB structure. In all cases, in Cacti, we used the high performance
itrs-hp transistors and the cache configuration option that includes a tag array, and specified
the appropriate TLB configurations (e.g., number of sets, ways, etc.). The total dynamic energy
of the system was then computed based on Cacti’s measurements along with the measured, via
simulation, TLB accesses (hits/misses) of each structure and workload.
In principle, every TLB access (probe) uses read energy, whereas only TLB misses (alloca-
tions) consume write energy. For fully-associative structures (e.g., AMD12-like, Sparc-T4-like
designs), the read energy is the sum of components (i) and (ii). For TLB designs with distinct
TLBs per page-size (e.g., Haswell), the read energy per probe is the sum of each TLB’s read
energy as the page size is yet unknown. However, TLB misses only pay the write energy of
a single TLB structure, the one corresponding to the missing page’s size. The read energy of
TLBpred’s secondary TLB lookups was also accounted for, along with the read energy of the
128-entry superpage predictor.
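The accounting above can be summarized in a short sketch; the function and parameter names are ours, with the per-access energies coming from Cacti and the event counts from simulation:

```python
def dynamic_energy_mj(probes: int, misses: int, secondary_lookups: int,
                      read_nj: float, write_nj: float,
                      predictor_read_nj: float) -> float:
    """Every probe pays the TLB read energy plus a predictor read; each
    secondary lookup pays one extra TLB read; each miss pays one write."""
    total_nj = (probes * (read_nj + predictor_read_nj)
                + secondary_lookups * read_nj
                + misses * write_nj)
    return total_nj * 1e-6   # nJ -> mJ
```

For fully-associative designs, read_nj would be the sum of the read and associative-search components described above; for split-TLB designs, it would be the sum over all the probed structures.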
[Figure: dynamic TLB energy (mJ, 0-50) per workload; series (with access latency): AMD-like (0.19ns), Sparc-T4-like (0.25ns), Haswell-like (0.08ns), UltraSparc-III-like (0.18ns), TLBpred 256-entry 4-way SA (0.09ns), and TLBpred 512-entry 4-way SA (0.10ns).]
Figure 4.11: Dynamic Energy
As Figure 4.11 shows, the UltraSparc-III design has the highest energy. It probes its SA
and FA TLBs in parallel and the FA TLB access dominates. The Sparc-T4-like FA TLB, with
the lowest MPMI of all the designs, also has significantly high energy mostly due to its costly
fully-associative lookup. The Haswell-like TLB design incurs comparable dynamic energy costs
due to the multiple useless TLB probes of its distinct per-page-size structures. Our binary page-size
prediction mechanism could be employed to avoid this energy waste, serializing these lookups
on mispredictions and misses. Finally, the 256-entry TLBpred TLB is the nearest to the target
energy of the 48-entry FA TLB, which however has a significantly higher MPMI. The 256-entry
TLBpred is the smallest TLBpred design (with lower energy and latency) that meets our MPMI
target. Alternatively, the 512-entry TLBpred can yield lower MPMI but at a somewhat higher
energy/latency cost.
4.7.4 TLBskew and TLBpskew MPMI
This section evaluates different skewed TLB configurations. Figure 4.12 shows the MPMI
achieved by TLBskew and TLBpskew relative to the AMD-like baseline for a 256-entry 8-way
skewed-associative TLB. In the interest of space, we limit our attention to 256-entry TLB
designs. The first graph series shows the original TLBskew with a random-young replacement
policy. We use the hashing functions described in Section 4.5 where the effective associativity
for each page size is two [69]. “Random-Young” is a low-overhead replacement policy based
on “Not-Recently Used” [71]. A single (young) bit is set when an entry is accessed (on a hit or
on an allocation). All young bits are reset when half the translation entries are young. Upon
replacement, the policy randomly chooses among the non-young victim candidates. If no such
candidate exists, then it randomly selects among the young entries.
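A minimal sketch of this "Random-Young" policy follows; the class and method names are ours, and we assume the reset triggers once half the entries are young, which is one reading of the policy:

```python
import random

class RandomYoung:
    """NRU-style replacement: one 'young' bit per TLB entry [71]."""
    def __init__(self, num_entries: int):
        self.young = [False] * num_entries

    def touch(self, idx: int) -> None:
        """Called on a hit or an allocation."""
        self.young[idx] = True
        # Reset all young bits once half the entries are young (assumption).
        if sum(self.young) >= len(self.young) // 2:
            self.young = [False] * len(self.young)

    def pick_victim(self, candidates: list) -> int:
        """Randomly choose among non-young candidates, else among young ones."""
        old = [i for i in candidates if not self.young[i]]
        return random.choice(old if old else candidates)
```

Unlike LRU, this needs only one bit per entry and no per-set ordering, which matters for skewed designs where the candidates of one page size live in different sets.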
The second series in Figure 4.12 reports the relative MPMI of the predictor assisted TLBpskew
of Section 4.5.1. In any primary or secondary TLB lookup only two page sizes are possible.
Therefore the effective associativity for each page size is now four, which proves beneficial in
reducing conflict misses. By coupling the 8KB with the 4MB pages during prediction, the
predictor achieves a nearly zero misprediction rate (0.07% maximum).
Figure 4.12 also explores the impact of the replacement policy. The third and fourth columns
in the graph depict the TLBskew and TLBpskew design with LRU replacement. Due to the
deconstructed notion of a set, LRU would be expensive to implement in hardware [71] compared
to the more realistic “random-young”. The graph nevertheless reports it as a useful reference
point.
Finally, the last column in Figure 4.12 is our multigrain 256-entry TLBpred design. Our
TLBpred, with half the associativity of the skewed designs, reduces AMEAN MPMI (computed over all our workloads) by 45.7% over the AMD-like baseline, whereas TLBskew and TLBpskew
with the “random-young” replacement policy reduce it by 35.6% and 38.7% respectively. The
TLBpskew with the harder to implement LRU policy reduces MPMI by 48.2% on average.
[Figure: TLB MPMI relative to the AMD12h-like TLB (y-axis 0-1) per workload; TLBskew and TLBpskew are 256-entry 8-way SA, TLBpred is 256-entry 4-way SA; series: TLBskew w/ random-young replacement, TLBpskew w/ random-young, TLBskew w/ LRU, TLBpskew w/ LRU, and TLBpred.]
Figure 4.12: TLBskew, TLBpred, and TLBpskew: MPMI relative to AMD-like 48-entry FA TLB
4.7.5 Performance Model
This section uses an analytical model, as in prior work [19, 68], to gauge the performance impact of the best-performing design, TLBpred, as the use of software-managed TLBs, with the overhead of a software trap handler and the presence of the TSB, hindered detailed timing simulation. Saulsbury et al. were the first to use such a model [68], modeling performance speedup as:
Speedup = (CPIcore + CPITLBNoOptimization) / (CPIcore + CPITLBWithOptimization)    (4.1)
where
• CPIcore is the Cycles Per Instruction (CPI) of all architectural components but the TLB,
• CPITLBNoOptimization is the TLB CPI contribution of the baseline, and
• CPITLBWithOptimization is the TLB CPI contribution under their proposed TLB prefetch-
ing mechanism.
Bhattacharjee et al. quantify performance impact as “Cycles per Instruction (CPI) Saved” over
baseline [19]. This metric is valid irrespective of the application’s baseline CPI.
Following Equation 4.1, CPITLBNoOptimization can be computed as MPMI * 10^-6 * TLBMissPenalty. Compared to our baseline, the TLBpred has two additional CPI contributors:
1. All page size mispredictions that hit in the TLB pay an extra TLB lookup penalty.
2. All misses also pay an extra TLB lookup penalty to confirm they were not mispredictions.
Therefore:
CPIMultigrain = MPMI * 10^-6 * (TLBMissPenalty + TLBLookupTime)
                + (MispredictedTLBHits * TLBLookupTime) / TotalInstructions    (4.2)
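Equations 4.1 and 4.2 translate into a small model; a sketch with our own function names, where MPMI is misses per million instructions and times are in cycles:

```python
def cpi_tlb_baseline(mpmi: float, miss_penalty: float) -> float:
    """Baseline TLB CPI term: MPMI * 1e-6 * TLB miss penalty."""
    return mpmi * 1e-6 * miss_penalty

def cpi_tlb_multigrain(mpmi: float, miss_penalty: float, lookup_time: float,
                       mispredicted_hits: int, instructions: int) -> float:
    """Equation 4.2: every miss pays an extra lookup to confirm it was not a
    misprediction, and every mispredicted hit pays one extra lookup."""
    return (mpmi * 1e-6 * (miss_penalty + lookup_time)
            + mispredicted_hits * lookup_time / instructions)
```

The quantity plotted below is then the difference of the two TLB CPI terms between the compared designs.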
Figure 4.13 plots the cycles saved by the 256-entry 4-way SA TLBpred compared to the 128-entry FA Sparc-T4-like TLB, i.e., (CPISparcT4-like-TLB − CPITLBpred). We assume
a 2-cycle TLB lookup latency for both designs, even though the FA TLB has a much higher
access latency. The x-axis shows different TLB miss penalties. In our system that models a
software-managed TLB, most TLB misses hit in the TSB as discussed in Section 4.6. The TLB
miss penalties we have observed in our system range from 20 cycles assuming the TSB hits in
the local L1 cache, to 60 cycles when the TSB hits in a remote L2, to over 100 cycles when
the TSB is not cached. As the TLB miss penalty increases, so does the TLB CPI contribution
of each design. Because the two contributions increase at different rates, according to the
aforementioned equations, the plotted difference does not always change monotonically in the
same direction. As Figure 4.13 shows, the two designs are comparable in terms of the CPI
component due to the TLB. Canneal experiences minimal CPI increase, within acceptable error
margins, mostly due to its slightly higher misprediction rate, since TLBpred’s MPMI is less than
that of the Sparc-T4-like TLB. Classification also experiences minimal CPI increase compared
to the FA TLB due to TLBpred’s higher MPMI, as Figure 4.9 showed. This is a workload that
benefits a lot from a large FA TLB, a benefit most likely reduced if a replacement policy other
than full-LRU is used. Conversely, Classification performs extremely poorly for a split-TLB
baseline compared to TLBpred, which would thus reap significant CPI benefits. Overall, the
results in Figure 4.13 indicate that TLBpred with its highly accurate superpage predictor meets
the performance target of a fully-associative design, despite the additional TLB lookup in case
of a misprediction or a TLB miss.
[Figure: CPI saved by TLBpred relative to Sparc-T4-like (2-cycle TLB lookup), y-axis -0.04 to 0.1; x-axis: TLB miss penalties of 12, 25, 50, 100, and 150 cycles for each workload (apache, TPC-C2, TPC-C1, canneal, ferret, x264, cassandra, classific., cloud9, nutch, streaming).]
Figure 4.13: CPI saved with TLBpred
4.7.6 Sensitivity to the Page Size Access Distribution
In our experiments we have seen that there are two prominent page sizes that dominate all
TLB accesses. However, there are various factors that can influence the observed page size
distribution. For example, (a) the OS’s memory allocation algorithm, (b) how fragmented the
system is (i.e., there might not be sufficient memory contiguity to allocate large pages), and
(c) whether transparent superpage support is enabled or the user requested a specific page size.
For completeness, this section explores how TLBpred performs under hypothetical, worst case
scenarios.
We chose canneal, the workload with the largest memory footprint, to explore how our
proposed TLB design would behave under a different page-size distribution. We used the ppgsz
utility to set the desired page size for the heap and the stack. Most of the memory footprint is
due to the heap. First, we created four canneal spin-offs each with a different preferred heap
page size. Each of these configurations has a different prominent page size as Table 4.9 shows.
Secondly, we created a composite workload with a larger footprint by running two canneal
instances, each with a different heap page size (64KB and 4MB). We purposefully selected
64KB and not 8KB to put extra pressure on our TLBpred where consecutive 64KB pages map
to the same set potentially resulting in more conflict misses. For the last spin-off we dynamically
changed the heap page size throughout execution. This change resulted in the highest page size
diversity. In all cases, we set the page size for the stack to 64KB; the stack footprint is small.
Table 4.9 reports the resulting distribution of page sizes, while Figure 4.14 shows the TLB
miss contribution of each page-size for all our canneal spin-offs for the AMD-like baseline.
Canneal Spin-Offs      Avg. Per-Core   Avg. Per-Core   Avg. Per-Core   Avg. Per-Core
(Heap Page-Size)       8KB Pages       64KB Pages      512KB Pages     4MB Pages
8KB heap               68087           1               0               5
64KB heap              901             9272            0               6
512KB heap             658             1               1160            6
4MB heap               837             1               0               151
4MB and 64KB heap      843             9258            0               152
dynamic heap           39682           8962            62              153
Table 4.9: Canneal Spin-Offs: Footprint Characterization
Unlike the original canneal workload whose misses were solely to 8KB pages, we now observe
a different miss distribution. Most of the misses are due to the page size selected for the heap
via ppgsz as that memory dominates the workload’s footprint.
[Figure: TLB miss distribution (%) per page size (8KB, 64KB, 512KB, 4MB) for each heap page-size spin-off: 8KB, 64KB, 512KB, 4MB, 4MB-and-64KB (2 instances), and dynamic.]
Figure 4.14: Canneal Spin-Offs: Miss Distribution for 48-entry FA (AMD12h-like) TLB
Figure 4.15 compares the MPMI of our TLBpred against the AMD-like baseline. A lower relative MPMI value is better. TLBpred is not as good as the Sparc-T4 TLB for the spin-offs where the 64KB and 512KB page sizes dominate, but the differences are small. We also
modeled a precise page size predictor with larger saturating counters, similar to [24]. Values
0 to 1 correspond to strongly-predicted 8KB page and weakly-predicted 8KB pages, values
2-3 to strongly and weakly predicted 64KB pages, and so on. A correct prediction with an
even counter (i.e., strong prediction) results in no updates, while for an odd counter value
the state is decremented by one. On mispredictions the counters are incremented by one if
the page size is greater than the predicted one or decremented if it is smaller. The last bar in
Figure 4.15 corresponds to this predictor and uses the least significant bits of the predicted VPN
for the TLB set index. This precise TLBpred design is consistently better than the Sparc-T4-like
configuration. As expected, for workloads that use 8KB or 4MB page sizes it performs similarly to the superpage-prediction-based TLBpred in terms of MPMI. For these cases, however, which were the observed page size distributions, the precise TLBpred will yield worse latency/energy
than our superpage predictor based TLBpred design.
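A sketch of this precise predictor's counter update rule follows, using one 3-bit saturating counter per entry for our system's four page sizes; the class and method names are ours:

```python
SIZES = ["8KB", "64KB", "512KB", "4MB"]

class PrecisePredictor:
    """Precise page-size predictor: counter values 2k (strong) and 2k+1
    (weak) both predict SIZES[k], similar in spirit to [24]."""
    def __init__(self, entries: int = 128):
        self.ctr = [0] * entries

    def predict(self, index: int) -> str:
        return SIZES[self.ctr[index % len(self.ctr)] // 2]

    def update(self, index: int, actual: str) -> None:
        i = index % len(self.ctr)
        c = self.ctr[i]
        if SIZES[c // 2] == actual:
            if c % 2 == 1:                  # weak correct: strengthen
                self.ctr[i] = c - 1
        elif SIZES.index(actual) > c // 2:  # actual size larger: increment
            self.ctr[i] = min(7, c + 1)
        else:                               # actual size smaller: decrement
            self.ctr[i] = max(0, c - 1)
```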
[Figure: TLB MPMI relative to the AMD12h-like TLB (y-axis 0-1) per heap page-size spin-off (8KB, 64KB, 512KB, 4MB, 4MB-and-64KB (2 instances), dynamic); series: AMD12h-like, SPARC-T4-like, TLBpred (256-entry 4-way SA), and TLBpred Precise (256-entry 4-way SA).]
Figure 4.15: Canneal Spin-Offs: MPMI relative to AMD-like TLB. Includes TLBpred with precise page-size prediction.
4.8 Related Work
The work most closely related to ours is by Bradford et al., which proposes but does not evaluate a precise page-size prediction mechanism [24]. The patent lists a variety of potential page-size prediction indexing mechanisms, based on the PC, register values, and register names, and targets exact size prediction. In case of a misprediction, this approach would require as many sequential
TLB lookups as the number of supported page sizes. Our binary superpage prediction mecha-
nism balances prediction accuracy with misprediction overhead, taking advantage of observed
application behavior. Binary prediction and common indexing for different page sizes are the
key differences, yielding lower latency/energy. Our TLBpred design seamlessly supports mul-
tiple superpage sizes without having to predict their exact size. Moreover, we experimentally
evaluate the performance of prediction-guided TLB designs including one that is representative
of this design (precise TLBpred).
Talluri et al. were the first to research the “tradeoffs in supporting two page sizes” [79]. Even
though they target 4KB and 32KB pages in a uniprocessor setting, their design observations
remain relevant: a fully-associative TLB would be expensive; hosting all translations in a set-
associative TLB would require either parallel/serialized accesses with all possible indices or the
presence of split TLB structures. The latter is today’s design of choice. They also explored the
impact of always indexing the set-associative TLB with one of the two supported page numbers
showing that indexing with the 32KB page number is slightly worse but generally comparable
to “exact” indexing. Our work approximates exact indexing with the use of a binary page size
predictor.
An orthogonal approach to superpages is to pack translations for multiple pages within
the same TLB entry, as Section 2.3.1.3 more extensively reviewed. Talluri and Hill proposed
the “complete-subblock” and the “partial-subblock” TLB designs [78]. Pham et al. proposed
CoLT which takes advantage of relatively small scale page contiguity [58]. CoLT’s requirement
that contiguous virtual pages are mapped to contiguous physical frames is later relaxed [57],
allowing the clustering of a broader sample of translations. CoLT coalesces a small number of
small pages which cannot be promoted to superpages; it uses a separate fully-associative TLB
for superpages. Our TLBpred proposal is orthogonal as it can eliminate the superpage TLB.
Basu et al. revisited the use of paging for key large data structures of big-memory workloads,
introducing Direct Segments i.e., untranslated memory regions [12]. In their workloads’ analysis
they also observed inefficiency due to limited TLB capacity when large page sizes were used.
Our work addresses this inefficiency. TLBpred can (a) complement Direct Segments for those
regions that use paging and (b) do so without OS changes.
More recently, Karakostas et al. proposed RMMLite [43]. As we reviewed in Section 2.3.4,
they dynamically downsize TLB ways of split structures adapting to different page-size distri-
butions, while also including a small FA range TLB that holds translation mappings. Their
evaluation includes TLBpp, a TLBpred implementation with a perfect superpage predictor. Their
results for a set of TLB-intensive workloads show that TLBpp reduces “dynamic energy by 43%
and the cycles spent in TLB misses by 67%” compared to a system with split TLBs where Transparent
Huge Pages (THP) support is enabled. We believe that TLBpred is likely to capture a
significant portion of this idealized potential, given the fairly low misprediction rates of our
binary-superpage predictor demonstrated in Section 4.7.1.
Subsequent to our work [55], Cox and Bhattacharjee [28] proposed MIX TLBs targeting
energy-efficient address translation in the presence of multiple page sizes. Their MIX TLBs are
also set-associative designs that can host translations of different page sizes in a single struc-
ture. However, unlike our TLBpred designs that use either a small-page or superpage based
indexing scheme according to a binary superpage predictor, MIX TLBs use a single small-page
set-indexing scheme for all pages irrespective of their size. This design choice decouples the
TLB lookup from the burden of the unknown page size and eliminates the need for a page-size
predictor. MIX TLBs can retrieve a translation with a single lookup; they do not need the
secondary lookup which TLBpred designs require in the case of TLB misses and superpage
mispredictions. However, using the small-page size index for a virtual address belonging to a superpage
causes superpage translation “mirroring”. That is, translations for the same superpage are
replicated in multiple TLB sets because page-offset bits are part of the TLB set index. MIX
TLBs counterbalance this mirroring challenge by coalescing contiguous superpages in a single
TLB entry. These superpage coalescing candidates are identified during a page-walk; their
translations all exist in the same 64B cache line that holds eight translations on x86 systems.
The authors explain that as long as the number of superpages they coalesce closely matches the
number of superpage mirrors, no TLB capacity is wasted and they can achieve “energy-efficient
performance”.
Allocating mirror translations on multiple (or even all) of the TLB sets after a superpage TLB
miss does not appear trivial in terms of energy/latency and also raises scalability questions.
Furthermore, because replacement decisions across sets are independent, there is a high likeli-
hood of duplicates even within a set. For example, a superpage might miss in one set (because
it was previously evicted) while most of its mirrors are present in other sets. The proposed
design will allocate mirror entries of the missed superpage, after the page-walk completes, on all
sets regardless. Duplicates within a set will eventually be identified and eliminated during
subsequent set lookups. Unfortunately, many potentially useful TLB entries across sets might
have been unnecessarily evicted in this process. Despite the significant challenges of mirroring,
the high hit rates of MIX TLBs, and the resulting reduction in page walks, counteract the energy
overheads of mirroring. MIX TLBs achieve up to 80% energy improvement and 55% performance
improvement over area-equivalent split-TLB designs.
They also evaluate our prediction-enhanced TLBpred and TLBpskew designs, presumably
with our 128-entry BRV-based PT configuration, and show TLBpred can achieve up to ∼44%
energy improvement and ∼42% performance improvement over area-equivalent split-TLB de-
signs for various workloads on native and virtualized CPUs, as well as GPUs. The TLBpred
designs are consistently more energy-efficient than the TLBpskew designs. In a few cases, their
results show up to 5% performance degradation for our designs, likely an effect of either inaccu-
rate superpage prediction for some workloads or TLB thrashing. Overall, these results further
demonstrate that TLBpred designs can achieve significant energy and performance benefits when
coupled with an accurate superpage predictor. We hope that research in supporting multiple
page-sizes will continue, and be actively adopted in commercial TLB designs.
4.9 Concluding Remarks
In this work we proposed and evaluated two prediction-based superpage-friendly TLB designs.
Our analysis of the data TLB behavior of a set of commercial and scale-out workloads demon-
strated a significant use of superpages which is at odds with the limited superpage TLB capacity.
Thus, we considered elastic TLB designs where translations of all page sizes can coexist with-
out any a priori quota on the capacity they can use. We proposed the TLBpred, a multi-grain
set-associative TLB design which uses superpage prediction to determine the TLB set index of
a given access. Using only a meager 32B prediction table, TLBpred achieves better coverage
and energy efficiency compared to a slower 128-entry FA TLB. In addition, we evaluated the
previously proposed Skewed TLB, TLBskew, and augmented it with page size prediction to
increase the effective associativity of each page size. TLBpskew proved comparable to TLBpred.
Finally, we showed that TLBpred remains effective even when multiple page sizes are actively
used and also evaluated an exact page size predictor guided TLB.
Chapter 5
The Forget-Me-Not TLB
5.1 Overview
Even though TLB capacities have increased over the past decade, this capacity growth has not
been commensurate with the ever-increasing memory footprints of today’s “big-data” applications.
The need for a fast TLB access time, to avoid a negative impact on the processor’s critical
path, is the main inhibiting factor. Since increasing the L1-TLB hit-rate would conventionally
require increasing the L1-TLB capacity, thus hitting the critical latency barrier, an alternative
approach is to implement a secondary translation storage such as an L2-TLB. Some of today’s
systems have added an L2-TLB; e.g., Intel’s Haswell has a 1024-entry 8-way set-associative
L2-TLB. The access latency of an L2-TLB is not on the critical path of every memory access
because it is only accessed on an L1-TLB miss. L1-TLBs, like L1-D caches, have high tem-
poral locality that translates to a high hit-rate, usually well over 90%. However, even though
a longer L2-TLB access latency could be accommodated, the benefits gained from allocating
TLB capacity to such a design need to be scrutinized.
Our measurements indicate that adding an L2-TLB can yield negligible performance benefits
or even cause minor performance degradation, in some cases, when compared to a one-level
TLB hierarchy. Workloads that have very large footprints, and thus a low L2-TLB hit-rate,
or workloads heavily relying on superpages, are usually the culprits. In these cases, the extra
latency overhead of probing the L2-TLB before initiating a page walk is not counterbalanced
by the latency reduction achieved via L2-TLB hits.
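The trade-off can be illustrated with a simple latency model; the latencies below are assumed round numbers (in cycles), not measured values from our evaluation.

```python
# A back-of-the-envelope model of the serial L2-TLB trade-off discussed above.
# Latencies are illustrative assumptions, not measured values.
def avg_l1_miss_latency(l2_hit_rate, l2_latency, walk_latency):
    """Average L1-TLB miss handling latency when an L2-TLB is probed before
    the page walk: every L2-TLB miss pays the probe on top of the full walk."""
    hit_cost = l2_latency
    miss_cost = l2_latency + walk_latency  # wasted probe, then full walk
    return l2_hit_rate * hit_cost + (1 - l2_hit_rate) * miss_cost
```

With an assumed 10-cycle L2-TLB and a 50-cycle walk, the break-even hit rate is 10/50 = 20%: below that, the L2-TLB makes L1-TLB misses slower on average than walking immediately, matching the behaviour described above for low-hit-rate workloads.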
This chapter presents the FMN TLB, a cacheable TLB design that aims to reduce the TLB-
miss handling latency using the existing cache capacity, and not a dedicated hardware storage,
to store translation entries. The design choices, benefits and trade-offs of this virtualized TLB
design are explored. At its core, this work harnesses the observations that (a) the vast majority
of TLB misses are to previously seen translation-entries, and that (b) translation modifications
(e.g., invalidations) are rare. Therefore, if a CMP system had sufficient capacity to store, and
not to forget, all previously seen translations, page-walks would be a rarity.
FMN can be used to back up the traditional one- or two-level TLB hierarchy of current
systems, or can be used as an alternative to a dedicated L2-TLB. The proposed FMN does
not require any additional dedicated capacity to store translation entries but instead utilizes
the existing cache capacity to store translations transparently and on-demand. This cached
translation-storage is probed with regular memory requests (memory loads) on a D-TLB miss.
If the running application’s translations can seamlessly fit in the existing hardware TLBs, no
extra storage is wasted, as would be the case for a dedicated L2-TLB. Further, the cacheable
nature of the FMN is fertile ground for more flexible TLB organizations. For example, a shared
TLB across all cores, different TLB indexing schemes, or different TLB sizes are a few of the
optimizations that can be easily applied.
A per core private 1024-entry direct-mapped FMN reduces the average L1-TLB miss latency
across all simulated workloads by 31.4% over a baseline with only L1-TLBs, while a dedicated
1024-entry 8-way set-associative L2-TLB reduces it by 24.6%. FMN’s L1-TLB miss latency
reduction results in up to 1.97% overall execution-time reduction (i.e., performance improvement).
Overall, however, the L1-TLB miss latency reduction does not translate into commensurate
performance benefits. This behaviour is also observed with the dedicated L2-TLB, which can,
in some cases, cause performance degradation of up to 1.6%. This chapter also presents an L2-TLB bypassing
mechanism as a potential first-step solution to mitigate such cases.
The remainder of this chapter is organized as follows. Section 5.2 first describes the idea
behind FMN and its operating scenarios. Section 5.3 then introduces the FMN organization and
discusses different design choices from FMN indexing schemes to allocation and replacement
policies, and Section 5.4 describes how the FMN is cached. Section 5.5 details our simulation
methodology including the timing model used and its limitations. Section 5.6 presents an
analytical model to estimate FMN’s performance potential, followed by a description of a set of
synthetic traces (Section 5.7) and the baseline configuration (Section 5.8). Section 5.9 showcases
the results of a case study using synthetic traces, while Section 5.10 evaluates FMN using
commercial workloads. Section 5.11 presents our L2-TLB bypassing optimization. Finally,
Section 5.12 concludes this chapter.
5.2 FMN’s Goal and Operation
Imagine a system where on a TLB miss a virtualized TLB is accessed in parallel to the page
table. However, unlike the page walk which usually requires multiple memory requests (up
to four for x86 systems, potentially fewer if MMU caches are used [10, 15, 16]), only a single
cache request is now needed to retrieve the translation, if the latter exists in this new cacheable
structure. If this single cache access completes faster than the page walk, and the retrieved
translation is valid, you have just avoided a significant percentage (up to 75% if we assume four
memory accesses per page-walk, all with the same access latency as the FMN) of the TLB miss
penalty and have improved your system’s performance. Figure 5.1 illustrates this best case
scenario that the proposed hardware-managed Forget-Me-Not (FMN) TLB scheme aims for.
Figure 5.1: FMN’s Best Case Scenario. (Timeline: a TLB miss (A) triggers both a page walk and an FMN probe; the FMN probe returns (C) before the page walk completes (B).)
As Figure 5.1 illustrates, when a TLB miss A occurs, both a page walk and an FMN
probe are initiated. If the FMN probe C returns before the page walk B does, and with
a correct translation, then the processor can make forward progress and save execution time
between events C and B (dashed region), which would not be otherwise possible. In today’s
systems, the processor would execute any instructions dependent on the memory request that
triggered the TLB miss after the page walk completed (event B ). Instead, in the scenario
described above, the processor will be at that time executing instructions further ahead in the
instruction stream. The greater the timeframe between events B and C , the better.
FMN is a cacheable and speculative TLB which significantly reduces TLB miss handling
latency without requiring any changes to the operating system or large dedicated on-chip re-
sources. It leverages the observation from prior work that large on-chip memory caches can be
shared transparently and on demand with properly engineered virtualized structures [25, 26].
The proposed design also investigates the use of speculation in providing highly accurate address
translation without keeping the FMN coherent with the page tables. For example, if a page is
demapped, the FMN is not immediately updated. Because such translation modifications are
rare, the design decision not to update the FMN does not reduce the potential performance
improvement. The FMN can be configured either as a per-core private table or as a single table
shared across all cores, thus adapting to the different requirements and memory behavior of
applications.
The FMN TLB scheme extends the reach of conventional private TLBs and has the following
main characteristics:
• It provides the MMU with a fast yet speculative translation based on recent translation
history.
• It utilizes part of the cache hierarchy, transparently and on demand, to store its speculative
translations.
Section 5.2.1 next presents the common operating scenarios of an FMN probe.
5.2.1 FMN Operating Scenarios
This section explains how an FMN-capable system handles a TLB miss. Traditionally, on a
last-level TLB miss, the MMU initiates a hardware page walk (assuming hardware-managed
TLBs). In the proposed system, the MMU also initiates an FMN probe in parallel to the
page walk. Both the page walk and the FMN probe share the same objective: retrieving the
translation. As with any scenario where two operations proceed in parallel, the order in which
the two operations complete is important. Only two possibilities exist timeliness-wise: (a) the
page walk completes before the FMN probe, or (b) the FMN probe completes before the page
walk.
Figure 5.2 shows the timeline for the first scenario. Once the page walk completes, the MMU
observes its precedence over the still pending FMN probe. Program execution then continues
with the page-walk retrieved translation, which is guaranteed to be correct. The FMN probe
reply, which will arrive later in time, is treated as useless by the MMU. Whether the reply
had the correct translation or not is irrelevant in terms of performance.1
Figure 5.2: FMN Operation Timeline - Page Walk completes before FMN probe. (The page walk completes first; the later FMN probe reply is useless.)
Figure 5.3 depicts the three possible timelines for the second scenario in which the FMN
probe completes first. As with any tagged structure lookup, a miss or a hit are the two possible
outcomes. In case of a hit, the FMN retrieved translation needs to be eventually checked against
the one retrieved via the page walk to ensure correctness because the proposed FMN is not
kept coherent with the page-tables.
Figure 5.3a shows the timeline for an FMN miss. The FMN probe reply indicates to the
MMU that an FMN miss took place. Waiting for the page walk to complete, as if no FMN
support existed, is the only option. Figures 5.3b and 5.3c show the timeline for an FMN hit.
If an FMN hit occurs, the processor enters speculative execution while taking a checkpoint
of its architectural state. No stores are propagated to the memory hierarchy and no I/O
operations are permitted. Once the page walk completes, the MMU compares the speculative
FMN translation with the one retrieved from the page table.
If the two translations match (Figure 5.3b), any speculative changes are committed. The
functionality required to commit speculative state (i.e., make it architectural state) already
exists in today’s processors. A common example is branch prediction; speculative instructions
- potentially in the wrong code path - are allowed to execute but cannot retire until the predicted
branch has been resolved. After this commit, the program continues its execution from an
instruction that is further ahead in the dynamic instruction stream from the instruction which
1 Depending on the FMN allocation policy, discussed later, we could avoid sending an FMN allocation request if the retrieved translation was correct.
initially triggered the TLB miss. This scenario is the expected common case and results in a
reduced TLB miss handling latency.
If the two translations mismatch (Figure 5.3c), the speculative execution using the FMN-
retrieved translation was useless. Any changes made during speculative execution are discarded
and the dynamic instruction stream starting from the offending instruction gets re-executed.
This misspeculation scenario is expected to be rare as translation mappings are usually persis-
tent.
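The four outcomes of this race can be summarized in a small decision function; the names and tuple-based outcome encoding are illustrative, since real hardware implements this with checkpoints and a translation comparator rather than software.

```python
# A sketch of how the MMU resolves the race between the page walk and the
# FMN probe, summarizing the scenarios above (illustrative encoding only).
def resolve(fmn_replied_first, fmn_hit, fmn_translation, walk_translation):
    if not fmn_replied_first:
        # Figure 5.2: the walk won the race; the FMN reply is useless.
        return ("use_walk", walk_translation)
    if not fmn_hit:
        # Figure 5.3a: the probe was faster but missed; wait for the walk.
        return ("wait_for_walk", walk_translation)
    if fmn_translation == walk_translation:
        # Figure 5.3b: speculation was correct; commit speculative state.
        return ("commit_speculation", fmn_translation)
    # Figure 5.3c: mismatch; roll back to the checkpoint and re-execute.
    return ("rollback_and_reexecute", walk_translation)
```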
Figure 5.3: FMN Operation Timelines - FMN probe completes before page walk. (a) FMN miss: the probe was useless; wait for the page walk. (b) FMN hit, translation correct: commit the speculative state. (c) FMN hit, incorrect translation: roll back to the checkpoint and re-execute.
5.3 FMN Organization
The FMN design shares many traits of how TLBs are commonly organized. However, some
design requirements can be relaxed in the FMN due to its speculative nature. FMN’s cacheable
nature further influences some design choices. This section presents the FMN design require-
ments and discusses various design considerations. In this section, FMN is treated as a stan-
dalone structure; Section 5.4 presents how this structure is cached (i.e., virtualized).
FMN is a tagged structure, similar to a regular TLB; the presence of tags is necessary to
ensure we do not use translations from other virtual pages or processes, which - barring any
synonyms - would always lead to misspeculation. An FMN probe can only result in misspecula-
tion when it returns an older, but no longer valid, translation mapping for a given virtual page
and process. This scenario happens rarely and is due to the design choice to lazily propagate
any translation modifications to the FMN.
Each FMN entry can be thought of as a replica of a TLB entry. The following set of
conditions triggers an FMN hit:
1. The VPN of the missing page matches the VPN in a valid FMN entry.
2. The ASID of the process with the TLB miss, also referred to as a context in this work,
should also match the context in the FMN entry. The only exception occurs if the Global
bit in the FMN entry is set. The context comparison is then skipped and a VPN match
suffices.
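The two hit conditions above can be sketched as follows; the entry fields mirror a TLB entry, and the field names are illustrative.

```python
# A sketch of the FMN hit test; field names are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class FMNEntry:
    valid: bool
    vpn: int          # virtual page number tag
    context: int      # ASID of the owning process
    global_bit: bool  # translation valid for all contexts

def fmn_hit(entry, vpn, context):
    # Condition 1: the missing VPN matches the VPN of a valid entry.
    if not entry.valid or entry.vpn != vpn:
        return False
    # Condition 2: the contexts (ASIDs) match, unless the Global bit
    # is set, in which case a VPN match suffices.
    return entry.global_bit or entry.context == context
```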
FMN can be organized as either a private or a shared, potentially associative structure,
similar to a regular TLB. Different trade-offs exist for each design choice. A shared structure
can avoid translation replication, thus making a more efficient use of the overall capacity com-
pared to private structures, especially for workloads that either share data across cores or have
drastically different per core capacity requirements. However, as Section 2.3.1.2 discussed, these
potential benefits come at the cost of a slower access time. The trade-offs discussed earlier, and
in the literature review section in Chapter 2, apply to the FMN too. For example, replicas of
the same translations can exist in multiple private FMNs. However, FMN’s cacheability adds
another dimension to these trade-offs. Section 5.4.3 discusses the effect of this added dimension.
But before Section 5.4 presents how the FMN is cached, Sections 5.3.1 to 5.3.3 examine
important aspects of the FMN’s organization. Namely, (i) how the presence of multiple page
sizes can be handled, (ii) how the FMN can be indexed, and finally (iii) what are some possible
FMN allocation and replacement policies.
5.3.1 Page Size
The page size is not known during a TLB lookup, and by extension during an FMN lookup
as well. One design choice would be to have the FMN only support the most common page
size; the smallest and most prevalent page-size is 4KB in x86 and 8KB in SPARC. In the
few cases when other page sizes are used, this FMN lookup will be wasteful. Wasteful TLB
lookups happen in conventional systems too. For example, in systems with split L1-TLBs at
most one of the multiple parallel split lookups will result in a hit, while the other lookups will
waste energy, as Chapter 4 demonstrated. Unlike an L2-TLB - commonly accessed before the
page-tables - where any unnecessary access for an unsupported page-size adds to the TLB miss
handling latency, FMN requests, which always proceed in parallel with the page-walk, can affect
performance only indirectly as a result of increased memory pressure.
It is possible for a superpage FMN lookup to not be wasteful, if on a superpage FMN miss
the translation for the 8KB page of that superpage is allocated in the FMN. The challenge
with such a design is that the FMN capacity can be unnecessarily wasted when almost all 8KB
pages of a superpage are used. However, if entries are only allocated in the FMN in case of an
L1-TLB miss, and assuming the L1-TLB(s) are not thrashed by superpage accesses, then likely
only the first 8KB page of a superpage that triggers a miss will be allocated in the FMN; the
rest will hit on the superpage translation in the L1-TLB(s).
FMN could support multiple page-sizes but at the cost of multiple sequential FMN lookups,
one per page-size. However, this choice is anticipated to have diminishing benefits the later
in this sequence the successful probe happens. Page-size prediction, similar to the proposal in
Chapter 4, could be a compelling design choice, doing the lookup with the predicted page-size
first. On a miss, any subsequent sequential lookups could be dropped to avoid wasting energy.
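A prediction-guided multi-size lookup along these lines could be sketched as below; `probe_fn` is an assumed callback that performs one fixed-size FMN lookup, and the SPARC page sizes are assumptions for illustration.

```python
# Sketch of a page-size-prediction-guided multi-size FMN lookup: probe the
# predicted size first, then either fall back to the remaining sizes in
# order or drop them to avoid wasting energy. probe_fn(size) is an assumed
# callback returning a translation or None.
PAGE_SIZES = [8 << 10, 64 << 10, 512 << 10, 4 << 20]  # assumed SPARC sizes

def multi_size_lookup(probe_fn, predicted_size, fall_back=False):
    order = [predicted_size] + [s for s in PAGE_SIZES if s != predicted_size]
    for size in order:
        translation = probe_fn(size)
        if translation is not None:
            return translation
        if not fall_back:
            break  # misprediction: skip the remaining sequential lookups
    return None
```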
5.3.2 FMN’s Indexing Scheme
Like any set-associative structure, the FMN index requires log2(FMN sets) bits. One design
option, on par with conventional cache indexing schemes, is to use the bits of the virtual address
immediately after the page-offset bits. But solely using the log2(FMN sets) bits after the page-
offset could result in conflict-misses in a CMP environment with multiple running processes, as
different processes can have the same virtual pages contend for the same FMN set.
The aforementioned behavior stems from the fact that each process has its own address
space, and therefore the start of different address space segments (e.g., heap, kernel address
space) would coincide. These translations are differentiated in TLB-entries via an ASID (con-
text) field. Incorporating context information in the FMN indexing scheme, e.g., via xor-ing the
original cache-like index with the context bits, could reduce contention for VPNs shared across
different processes. Including context information along with the VPN in the FMN index will
map the same VPN to different sets when it is used by different processes.
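The xor-based index sketched above could look as follows; the 13-bit page offset (8KB base pages) and the use of the low-order context bits are assumptions for illustration.

```python
# Sketch of an FMN set-index computation: the cache-like index (VA bits just
# above the page offset) xor-ed with low-order context (ASID) bits.
PAGE_OFFSET_BITS = 13  # assumed 8KB base pages (SPARC)

def fmn_set_index(vaddr, context, num_sets):
    """num_sets must be a power of two; log2(num_sets) index bits are used."""
    mask = num_sets - 1
    base_index = (vaddr >> PAGE_OFFSET_BITS) & mask  # cache-like VPN index
    return base_index ^ (context & mask)             # fold in ASID bits
```

The same virtual page now maps to different sets under different contexts, reducing conflict misses between processes whose address-space layouts coincide.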
5.3.3 FMN’s Allocation and Replacement Policies
Different FMN allocation and replacement policies can be implemented. One possible policy is
to allocate a translation entry in the FMN upon its eviction from the TLB, thus having FMN
act as a victim TLB. Another policy - the one used in this work - would be to allocate in the
FMN the translation entry that just missed in the TLB. The latter would facilitate sharing
across cores in case of a shared FMN because as soon as a core misses in the FMN other cores
accessing the same data would benefit from that FMN entry. Choosing - on occasion during
runtime - not to allocate a translation entry in the FMN is another viable option, especially for
cases when a translation entry is predicted to get little to no reuse, or when there is contention
in the memory system.
FMN entries are not kept coherent with the page tables, reducing synchronization overheads.
Not propagating any page-table modifications to the FMN does not affect correctness because
the FMN-retrieved translation is always compared with the correct translation that is retrieved
via the page-walk. Inconsistent FMN entries will be eventually updated once the discrepancy
is identified.
Within an FMN set, an LRU replacement policy can be implemented. PTEs traditionally
have some unreserved bits, often used for diagnostics; these bits can be used to store replacement
information in case of an associative FMN. Other variations such as pseudo-LRU or random
replacement policies could also be used. Section 5.4 provides more details.
5.4 Caching the FMN
FMN is a hardware-managed cacheable and virtual2 structure that uses on-chip cache capacity
upon demand, without requiring its own hardware budget. Caches implicitly store address
translations under different scenarios. For example, SPARC’s Translation Storage Buffer (TSB)
is a per-process software data structure which holds recent translations. Page table entries are
also cacheable both in hardware and software-managed TLB schemes.
Figure 5.4 illustrates how FMN affects cache contents when compared with a system with
no FMN. The cache depicted in this example is set-associative, with each row corresponding to
a cache set. In the baseline system where no FMN exists, the cache contains only demand data
(e.g., data, instructions, etc.) and page-table data, whereas in the system where the FMN is
enabled some cache blocks are now occupied by FMN data. In this example, FMN has displaced
demand data from the cache, but it could have displaced page-table data or a combination of
both types, or even no data, if the FMN entries did not survive in the cache.
The key take-away is that the existing cache capacity is not partitioned in any way. Instead,
all types of cache blocks (demand, page-table, and FMN) freely contend for the entire cache
capacity via the existing cache replacement and allocation policies, similar to how demand cache
blocks compete with each other in regular caches. Choosing to treat FMN data differently, e.g.,
by employing a different replacement policy for the FMN, could be an interesting option, but
it is not explored in this work.
In order to access the FMN data in an FMN-capable system, an FMN probe (lookup) is
needed, which requires a load (read) request to be sent to the cache hierarchy, starting from the
L1 cache. Equation 5.1 shows the physical-address calculation for the FMN-probe address.
2 The term virtual does not refer to the type of addresses used to access the FMN.
Figure 5.4: FMN’s effect on cache contents. (a) With FMN disabled, the set-associative cache holds only demand data and page-table data. (b) With FMN enabled, some cache blocks also hold FMN data.
The FMN is probed with physical addresses, the same way the page tables are accessed in
memory. FMN_base is the starting physical address of each FMN structure, if private, while
FMN_set is computed via the current FMN indexing scheme. Both FMN_entry_size and
FMN_associativity are powers of two to avoid expensive multiplication costs.
FMN_probe_address = FMN_base + (FMN_set × FMN_associativity × FMN_entry_size)    (5.1)
The FMN_base address is page-aligned and fixed in each system.
physical address space occupied by the FMN should be reserved. As we will explain shortly,
the size of each FMN entry is 16 bytes, and hence, for the simulated FMN sizes, a single 4MB
superpage would be more than sufficient, supporting up to 16K FMN entries per core in a
16-core CMP. The address computation for the probe address can be performed fast enough
so that an FMN probe request can be issued the cycle immediately after a TLB miss. For
the direct-mapped FMN configuration modeled in this work, this address computation involves
only a left shift for the multiplication and an addition.
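Equation 5.1 and the shift-based computation can be sketched as follows; FMN_BASE is an assumed reserved physical address, while the 16B entry size and the direct-mapped configuration follow the text.

```python
# Sketch of the Equation 5.1 probe-address computation for the modeled
# direct-mapped FMN. FMN_BASE is an assumed, page-aligned reserved address.
FMN_BASE = 0x4000_0000  # assumption: reserved, page-aligned physical base
FMN_ENTRY_SIZE = 16     # 8B tag + 8B data, as modeled
FMN_ASSOCIATIVITY = 1   # direct-mapped configuration

# Equation 5.2 sanity check: an FMN set must not span 64B cache lines.
assert 64 % (FMN_ASSOCIATIVITY * FMN_ENTRY_SIZE) == 0

# With a power-of-two set size, the Equation 5.1 multiply is a left shift.
SET_SHIFT = (FMN_ASSOCIATIVITY * FMN_ENTRY_SIZE).bit_length() - 1

def fmn_probe_address(fmn_set):
    """One shift plus one add, as described for the direct-mapped case."""
    return FMN_BASE + (fmn_set << SET_SHIFT)
```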
To limit the FMN-lookup latency, an FMN-set should not span multiple cache lines. A cache
line can contain one or multiple FMN sets, with the cache line size (CLS) being a multiple of
the FMN set size as the equation below formally describes.
CLS mod (FMN_associativity × FMN_entry_size) = 0    (5.2)
In the modeled SPARC ISA (see Section 4.6 for details), the translation-entry tag and data fields
require 16B in total (8B each) without any compression. Thus, one could conservatively pack
four FMN entries in a single 64B cache line, the common cache line size for current processors,
and the one used for all cache levels in the simulated CMP system.
In a nutshell, the FMN’s associativity should be determined by the cache line size along
with the size of a page-table (translation) entry in the native machine. In some cases, satisfying
Equation 5.2 might require increasing FMN_entry_size by a few bits beyond what is
absolutely necessary. Otherwise, some FMN sets would span multiple lines, or extra padding
would be needed to avoid this, at the cost of a more complicated FMN indexing scheme and
wasted space.
Figure 5.5 illustrates how four sets of a 2-way SA FMN are mapped to the physical address
space and, by extension, to cache lines, in accordance with the above equation.
(a) Standalone FMN view: four 2-way FMN sets (FMNset 0-3), with each way holding a (tag, data) pair.
(b) The virtualized FMN: sets 0 and 1 packed into one cache line and sets 2 and 3 into the next, with FMNentry size spanning one tag and one data field.
Figure 5.5: Virtualizing a small 8-entry 2-way SA FMN.
Until now, this section has presented how, i.e., with what memory address, one can access the
FMN. Given a TLB miss, an FMN set is determined, as in any cache-like structure, and the
FMN probe address is formed. To complete the FMN lookup, two types of memory
requests are needed: (i) FMN probes, and (ii) FMN allocations. The former are memory reads
issued to the cache hierarchy to retrieve the translation, while the latter are memory writes
issued to the cache hierarchy to modify FMN’s contents. The two subsequent sections detail
both the functionality of these two request types and how they interact with the existing cache
controllers, thus concluding the necessary architectural support for caching the FMN.
5.4.1 FMN Probes
Effectively, two lookups take place on every FMN probe: a regular cache lookup and
a secondary lookup (search) within the data contents of the returned cache block. The first
lookup, like any cache access, uses the existing cache tags to determine if the current FMN set
is present and valid in the cache or not (i.e., FMN-set hit versus FMN-set miss). The cache is
not aware that this lookup is targeting the FMN. The second lookup takes place on the cache
block contents, once they have been returned to the TLB controller. It is this second lookup
that determines if the required translation is present in the cached FMN-set (FMN translation
hit). The timelines depicted in Figures 5.3b and 5.3c both occur only on an FMN translation
hit.
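The secondary lookup can be sketched as a linear search over the (tag, data) pairs of the returned FMN set; the field layout and names below are illustrative:

```python
# Sketch of the secondary lookup on an FMN probe: once the cache returns the
# block holding the FMN set, the TLB controller searches the set's (tag, data)
# pairs for the missing VPN. Layout and names are illustrative.
from typing import List, Optional, Tuple

def search_fmn_set(block: List[Tuple[int, int]], vpn_tag: int) -> Optional[int]:
    """Return the translation data on an FMN translation hit, else None."""
    for tag, data in block:        # at most `associativity` comparisons
        if tag == vpn_tag:
            return data            # FMN translation hit
    return None                    # set was cached, but the translation is absent

cached_set = [(0x111, 0xAAAA), (0x222, 0xBBBB)]   # a 2-way FMN set
print(search_fmn_set(cached_set, 0x222))
```

The cache itself stays oblivious to this search; only the TLB controller interprets the block's contents.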
On an FMN-set miss in the LLC, the probe request is currently dropped and an empty
cache block is returned to the requesting core. FMN contents evicted from the LLC do not
currently spill to off-chip memory because, given the long off-chip latency, it is unlikely
there would be any TLB miss latency reduction compared to the page-walk, and the additional
off-chip traffic would be wasteful. The LLC controller would need to know the FMN’s address
range, in some way, to properly handle FMN probes that miss, as well as LLC evicted FMN data.
LLC FMN evictions can simply be dropped; alternatively, a writeback cache could
never set the dirty bit for any LLC FMN block. With this design, when a cache line holding
FMN-data gets evicted from the LLC, the associated information is lost. This information will
be recreated whenever an FMN allocation takes place for the same FMN data. Deciding not
to spill to memory is a design choice; future work may evaluate an alternative.
5.4.2 FMN Allocation Requests
Unlike regular memory requests, which are filled from memory, the cached FMN retrieves its
data via FMN allocation requests. Depending on the employed FMN allocation policy, these
write requests occur either when a TLB entry is evicted or when the page walk for the missing
virtual address that triggered the FMN probe completes, at which point the correct translation
has been retrieved. In the second scenario, one could choose not to issue an allocation request on an
FMN-set-hit with a translation hit because the translation is already present in the FMN. In all
cases, FMN allocations are not in the critical path of a TLB miss and can thus proceed lazily.
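A minimal sketch of this allocation decision, with hypothetical policy names standing in for the two triggers described above:

```python
# Sketch of the allocation decision in Section 5.4.2. Policy names are
# hypothetical labels for the two allocation triggers described in the text.
def should_issue_allocation(policy: str, fmn_set_hit: bool, translation_hit: bool) -> bool:
    if policy == "on_tlb_evict":
        return True          # allocate whenever a TLB entry is evicted
    if policy == "on_walk_complete":
        # skip the redundant write: the translation is already in the cached set
        return not (fmn_set_hit and translation_hit)
    raise ValueError(f"unknown policy: {policy}")

print(should_issue_allocation("on_walk_complete", True, True))   # redundant write skipped
```

Because allocations are off the critical path, this decision can be made lazily, after the miss has been serviced.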
5.4.3 Discussion
As Section 5.3 mentioned, caching the FMN adds another dimension to some FMN design
decisions. In the shared versus private FMN domain, the latency trade-off in a shared FMN
is slightly different than with a shared TLB. The same FMN translation entry can now be
replicated across the private upper-level (e.g., L1 or L2) caches of different cores. Even though
this replication wastes no FMN capacity, it could waste some cache capacity displacing more
regular demand data. Latency-wise, the replicated data might be faster to access, compared to
a centralized FMN structure, but coherence among FMN entries could add additional latency.
FMN’s cacheability should also guide FMN’s associativity, even beyond the constraint of a
cache line size to be a multiple of the FMN set size (Equation 5.2). Given a 64B cache line and
a 16B FMN entry (accounting for both FMN tag and data), the possible associativity options
are one, two, or four. In caches, limited associativity enables faster access times and less
time/space spent on replacement selection, at the cost of additional conflict misses. A
direct-mapped FMN is an attractive choice because it allows multiple consecutive FMN sets
(four in the previous example) to map to the same cache line. This organization is effectively
equivalent to a next-line prefetcher. Here the translations for four consecutive VPNs (assuming
a cache-like indexing scheme) will map to the same cache line. Once one of these four pages
is accessed via an FMN hit, and the cache line with that FMN data is brought closer to the
processor (L1 cache), the remaining three pages will experience shorter FMN access times, and
thus reap more benefits, if the workload accesses consecutive pages (high spatial locality at the
page granularity).
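This effect can be illustrated with a quick sketch, under the assumed 16B-entry/64B-line geometry (the constants are illustrative): the FMN entries of four consecutive VPNs fall in the same cache line, so one demand fill covers all four.

```python
# Sketch: with a direct-mapped FMN, 16B entries, and 64B lines, four
# consecutive VPNs map to the same cache line (next-line-prefetcher effect).
# ENTRY_SIZE / CLS / FMN_BASE are illustrative example values.
ENTRY_SIZE, CLS, FMN_BASE = 16, 64, 0x4000_0000

def line_of(vpn: int, num_entries: int = 4096) -> int:
    """Cache-line index touched by the FMN entry for this VPN."""
    return (FMN_BASE + (vpn % num_entries) * ENTRY_SIZE) // CLS

print([line_of(v) for v in range(8)])   # two groups of four identical indices
```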
Other designs that also use caches to store translations are SPARC’s Translation Storage
Buffer (TSB) and the Part-Of-Memory TLB (POM-TLB) [67]. Contrary to SPARC’s TSB,
which is a software translation cache managed by the operating system, FMN is a hardware-
managed structure. The TSB is accessed as part of the TLB miss software trap handler before
the page walk commences, whereas FMN’s lookups are initiated by the hardware and occur in
parallel with the hardware page walk. FMN is also, by design, not kept coherent with the page
tables. Further, as discussed earlier, FMN is not a per-process structure, but instead can be
configured as a per-core private structure (the configuration evaluated in this work) or a shared
structure, with the potential for various in-between configurations.
Concurrently with this work, Ryoo et al. proposed their “Part-of-Memory TLB (POM-
TLB)” [67]. Their proposal targets virtualized environments where the page walk latency is
significantly longer due to the two-dimensional page walk. POM-TLB is a large structure,
acting as a shared L3-TLB, that is stored in DRAM. Because POM-TLB is part of memory, as
FMN is, POM-TLB’s translations can also be cached in on-chip data caches. POM TLB entries
are cached in the L2 and L3 caches, but not in the L1 cache (by design choice). Contrary to
FMN, POM-TLB is probed before the page-walk commences, as a TSB would be, and it is
reported to eliminate most page-walk accesses due to its large size (e.g., 16MB).
5.5 Simulation Methodology
This section presents the methodology used to evaluate the proposed Forget-Me-Not design.
Section 5.5.1 explains the simulation challenges we encountered, Section 5.5.2 details the timing
front-end model we developed to address them, and Section 5.5.3 describes how we simulate
page walks. Finally, Section 5.5.4 discusses the limitations and trade-offs of this methodology.
5.5.1 Simulation Challenges - Software-Managed TLBs in Simics
This work uses a full-system simulator based on Simics [52] that models the SPARC ISA and
boots Solaris. The TLBs in this system are software-managed. Unfortunately, the presence of
software-managed TLBs complicates the modeling of any architectural optimization that would
affect either the TLB hit ratio or how TLB misses and page walks are managed.
The TLB configurations present in the existing simulated system dictate whether a TLB miss
will be triggered or not. The employed Simics simulator models the Cheetah-MMU memory
management unit, which includes two per-core private TLBs: (i) a 512-entry 2-way SA TLB
for 8KB pages and (ii) a 16-entry FA TLB for superpages and locked translations.
If a memory access misses in the aforementioned Simics TLBs, which are different from
the TLB designs explored in this work, the operating system traps to the appropriate MMU
handler. Before walking the page-tables (in software), the MMU software trap handler probes
the Translation Storage Buffer (TSB). A single 128-bit atomic memory load is required for this
purpose. On a TSB hit, which is the anticipated fast common case, the trap handler updates
the TLB with the retrieved translation and retries the memory instruction which had triggered
the TLB miss in the first place.
Table 5.1 lists the D-MMU trap handler code for a TLB miss that resulted
in a TSB hit. This information was retrieved from Simics via the disassemble command. Added
comments explain the purpose of the various assembly instructions in that code snippet. The
key part of the TSB probe in the MMU trap handler is the 128-bit atomic load ldda which loads
the translation table entry from the TSB, both tags and data, into a set of global registers. See
instruction #9 in Table 5.1.
Unfortunately, in this simulation environment, it is challenging to evaluate FMN’s impact
on an x86-like baseline. Due to the presence of a TSB and the high frequency of TSB hits,
any results comparing the TLB miss handling cost achieved with the FMN design against that
baseline would be skewed. The objective is to compare the TLB miss handling overhead using
FMN with the overhead observed in an x86-like system, i.e., a system without a TSB that
has hardware-managed TLBs. Even disabling the TSB, e.g., by forcing TSB misses in Simics,3
would still compare FMN lookups with the latency overhead of a software TLB miss handler
walking the page tables, and would be directly influenced by how these page tables are organized
in this architecture.
To avoid the aforementioned challenges and negative side-effects of a system with software-
managed TLBs, we created a trace-driven timing simulator without Simics. In this new sim-
ulator, TLB misses do not probe the TSB, but they instead initiate a page-walk as in an x86
system. The page table walk now follows the x86 format and is thus not constrained by how the
page tables are organized in Solaris. The next section details our timing model and discusses
its trade-offs.
3 One could effectively disable the TSB by storing zeros to the destination registers of the ldda instruction (#9 in Table 5.1). Doing so would trigger a TSB miss by making the comparison (instruction #10) fail; a Solaris page-walk would thus ensue. In the sun4u architecture, to which UltraSPARC-III processors belong, the page tables are organized as Hashed Page Tables (HPTs): “HPTs use a hash of the virtual address to index into a page table. The resulting hash bucket points to the head of a list of data nodes containing table entries that are searched for a matching virtual address and context” [53].
Instr.  SPARC v9 Assembly                              Explanation

1   ldxa [%g0 + %g0] 0x58, %g2   # ASI_DMMU
    Read the contents of the D-TSB Tag Target Register. This MMU register holds information
    about the virtual address and context that missed in the D-TLB. SPARC uses the special
    "load extended word from alternate address space" (ldxa) instruction to access special
    MMU registers. Global register g2 is the destination register.

2   ldxa [%g0 + %g0] 0x59, %g1   # ASI_DMMU_TSB_8KB_PTR_REG
    Read the contents of the D-TLB 8KB Pointer MMU register. MMU hardware support forms this
    TSB pointer to speed up the TSB lookup.

3   srlx %g2, 48, %g3
    Global register g3 now holds the context in bits [12:0].

4   brz,pn %g3, 0x10000d38
    Branch if register g3 contains 0 (i.e., this is a global page and no context comparison
    should take place). The branch is predicted not-taken (pn), and indeed it is not taken.
    SPARC v9 has branch delay slots: the instruction after a branch is commonly executed
    unless annulled by the branch.

5   sll %g3, 4, %g5
    Destination register g5 holds the context in bits [16:4].

6   sra %g2, 11, %g6
    Register g6 contains virtual address bits [53:33] of the missing address in its least
    significant bits.

7   brgz,pt %g6, 0x10008840
    Branch if the contents of register g6 are greater than zero. This branch is incorrectly
    predicted as taken (pt).

8   xor %g5, %g1, %g1
    XOR the context with the TSB 8KB pointer to form the TSB address.

9   ldda [%g1 + %g0] 0x24, %g4   # ASI_NUCLEUS_QUAD_LDD
    This is the only memory request that goes to the cache hierarchy (quad load). It is a
    128-bit atomic load which loads the TTE (Translation Table Entry) tag into register g4
    and the TTE data into register g5.

10  cmp %g2, %g4
    Compare the TSB entry (retrieved by the previous load) with the TSB Tag Target (i.e.,
    virtual address and context comparison).

11  bne,pn %xcc, 0x100088c0
    Branch on a TSB miss. Predicted not-taken, as TSB hits are the common case.

12  sethi %hi(0xffff8000), %g4

13  stxa %g5, [%g0 + %g0] 0x5c   # ASI_DTLB_DATA_IN_REG
    Write the contents of g5 to the D-TLB Data In register. Register g5 holds the TTE data
    after the ldda instruction executed (i.e., the translation (physical address),
    protection bits, etc.).

14  retry
    Retry the offending instruction (i.e., the memory request that had missed in the D-TLB).

Table 5.1: TSB hit code in D-MMU Trap Handler (Solaris)
5.5.2 Timing Model
This work uses a trace-driven timing simulator that follows a blocking in-order core model for
all memory requests in a 16-core CMP. Figure 5.6 depicts a high-level model of our simulator’s
front-end. We do model a detailed memory system in our simulator (back-end), including
full-timing for caches, TLBs, on-chip network and memory (DRAMSim [66]). A trace parsing
component parses the collected memory TLB traces, or the synthetically generated ones, and
feeds them to per-core memory FIFO queues. There are also separate queues to keep track
of page-walks and FMN probes/allocations, as will be explained later. Please note that this
is the high-level software implementation for simulation purposes and not the architectural
implementation. For example, one could think of the Memory FIFO of Figure 5.6 as the Load-
Store Queue equivalent.
[Figure 5.6 depicts the front end: a trace-parsing component feeds the 16-core memory trace from Simics into per-core Trace-to-Timing engines (C1-C16, each with its TLBs and L1-D), which drive the Memory FIFO, the Page-Walk Request FIFO, and the FMN Probe/Allocation FIFOs in front of the remaining cache hierarchy, network, and memory.]
Figure 5.6: Timing Model - Front End
In our system, each FIFO entry is tagged with a state. Initially, all requests are inserted in
the Memory FIFO queue in the Unprocessed state, except for the request at the head of the
queue, which is TLB ready. From that point, the life cycle of a regular memory request is the
following:
1. The request at the head of the FIFO (TLB ready) is sent to the TLB hierarchy. The
entry then transitions to a TLB stalled state, whose duration depends on whether the
access was a TLB hit or a miss and on the associated TLB latencies. Once all TLB-associated
latencies have elapsed, the request is either ready to be sent to the memory hierarchy if its
translation is known, or a page-walk is in order.
2. TLB Miss: On a TLB miss both a page-walk and an FMN-probe, if the FMN is enabled,
are initiated in parallel for that request. Entries are allocated into the per-core Page-Walk
and FMN-Probe FIFOs. The page-walk involves multiple memory requests to walk the
multi-level page tables; the Page-Walk FIFO state describes which part of the page-walk
each request corresponds to. The page-table format is presented in Section 5.5.3. De-
pending on which returns first (i.e., page-walk or FMN-probe), the next steps are:
(2.a) Page-Walk returned first (Figure 5.2): The FMN probe will be useless once it
comes back as it failed to speed up the page-walk process. The corresponding FMN FIFO
entry is thus marked as useless. Now that we have retrieved the translation for the mem-
ory request (physical address is known), the request can be sent to the memory hierarchy.
Depending on the FMN allocation policy, we might also send the FMN allocation request
to the cache hierarchy to fill the FMN. Modeling a direct-mapped FMN allows us to send
the FMN allocation request in advance of the FMN probe reply. For a set-associative
FMN, the system should wait for the FMN probe reply, do the LRU stack update there,
and then send the FMN allocation request that will update the entire FMN set if cached.
(2.b) FMN-probe returned first (Figure 5.3): In case of an FMN miss (either
due to a miss in the set, or a non-cached FMN-set), we fall back to the previous case
and wait for the page walk to complete. On an FMN hit, we do not issue a memory
request after an FMN-probe reply unless we are certain this will not be a misspeculation.
In other words, in the scenario illustrated in Figure 5.3c, we pay the latency penalty of
waiting for the page walk to complete, as in an actual system, but we do not issue mem-
ory requests to the memory hierarchy. The only side-effect of using this oracle knowledge
is that fewer requests are sent to the memory hierarchy, which is negligible given that
translation modifications are rare and latency is always properly modeled.
3. Once the page-walk returns, irrespective of whether this was earlier or later in time than
the FMN probe, we know that we have the most up-to-date translation for the given
virtual address. An FMN allocation message, i.e., a write request, is sent to the memory
hierarchy with the correct translation. This serves a dual purpose: it keeps the FMN
information up-to-date and it helps the cached FMN blocks survive in the cache.
4. Once the memory reply from the back-end reaches the memory FIFO, the subsequent
queued request is ready to be sent to the TLB. This process repeats.
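The steps above can be condensed into a simple state-transition sketch; the state names are illustrative, not the simulator's actual identifiers:

```python
# Minimal sketch of the per-request life cycle from steps 1-4 above.
# State names are illustrative, not the simulator's actual identifiers.
from enum import Enum, auto

class ReqState(Enum):
    UNPROCESSED = auto()
    TLB_READY = auto()
    TLB_STALLED = auto()
    WALK_AND_PROBE = auto()   # page walk + FMN probe outstanding in parallel
    MEM_READY = auto()        # translation known; issue to memory hierarchy
    DONE = auto()

def next_state(state: ReqState, tlb_hit: bool = False) -> ReqState:
    if state is ReqState.TLB_READY:
        return ReqState.TLB_STALLED
    if state is ReqState.TLB_STALLED:
        return ReqState.MEM_READY if tlb_hit else ReqState.WALK_AND_PROBE
    if state is ReqState.WALK_AND_PROBE:
        return ReqState.MEM_READY      # walk or FMN hit supplied the translation
    if state is ReqState.MEM_READY:
        return ReqState.DONE           # reply returned; the next FIFO entry advances
    return state

s = ReqState.TLB_READY
for _ in range(4):                     # TLB-miss path, end to end
    s = next_state(s, tlb_hit=False)
print(s)
```

On the TLB-hit path, the request skips WALK_AND_PROBE and goes straight from TLB_STALLED to MEM_READY.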
5.5.3 Page Walk Modeling
To make the page-walk representative of a real, x86-like, system, the required multi-level page-
tables are modeled. To make them compatible with our SPARC ISA traces, some modifications
were needed. Each simulation starts with a known pool of free 8KB pages (physical frames), the
smallest page-size in SPARC. These are pages that are not accessed within our trace and can
thus be freely allocated to the page-tables or the FMN without interfering with the application’s
access pattern. We populated this per-application pool of free 8KB pages during a preprocessing
step. Figure 5.7 shows how we model the page-walk in our infrastructure.
[Figure 5.7 shows the modeled page walk: virtual-address bits [51:42], [41:32], and [31:22] index the 1024-entry L4, L3, and L2 page tables (PDEs); bits [21:13] index the 512-entry L1 page table, which holds the 8KB-page translation (PTE); and bits [12:0] form the 8KB page offset (bits [21:0] for a 4MB page). Bits [63:52], together with process information (e.g., context), select the CR3 (x86) equivalent.]
Figure 5.7: Page Walk Model
Usually in architectures with multi-level page tables, there is a specialized register that
points to the beginning of the first-level page-table for the currently running process. In x86,
the CR3 register contains this physical memory address. Since we do not have an x86 system,
we dynamically emulate this behavior by keeping a map of a (process ID, virtual address bits
[63:52]) tuple to a physical address that marks the beginning of the upper-level page-table
for the process in question. No latency or hardware cost is modeled for this lookup, as this
information is readily available to the hardware page-walkers of real systems.
From this point onwards we inject page-walk requests into the memory hierarchy, similar
to the hardware page-walker. We maintain 4-level page-tables. Bits from the TLB missing
virtual address are used to index into each page table level as shown in Figure 5.7. Each table
entry is 8B and is either a page directory entry, PDE, (i.e., a pointer to the beginning of a
lower page-table) or a page translation entry, PTE, (i.e., the final translation, present in the
lowest page table level or in a higher level if this address belongs to a superpage). Each of
the smaller tables has either 512 entries (the lowest level, which holds the page translation entry)
or 1024 entries (the upper three levels, which usually hold page directory entries). Page walks for
8KB pages require 4 page-table accesses to retrieve the translation, whereas 4MB pages require
3 page-table accesses. In all cases, each small table fits within an 8KB page. Any time the
contents of a page table entry are invalid, we grab a new 8KB page from the free list. We
also specify whether an entry needs to be a PTE rather than a PDE, as indicated by the
page-size information retrieved from Simics.
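The index extraction of Figure 5.7 amounts to plain bit slicing of the virtual address; a sketch (field names are ours):

```python
# Sketch of the index extraction for the 4-level walk of Figure 5.7
# (8KB pages: 13 offset bits; 512-entry lowest table: 9 bits;
# 1024-entry upper tables: 10 bits each; bits [63:52] select the root).
def walk_indices(vaddr: int) -> dict:
    return {
        "root_tag": (vaddr >> 52) & 0xFFF,  # combined with a process ID -> CR3 equivalent
        "l4": (vaddr >> 42) & 0x3FF,        # 1024-entry table -> 10 bits
        "l3": (vaddr >> 32) & 0x3FF,
        "l2": (vaddr >> 22) & 0x3FF,
        "l1": (vaddr >> 13) & 0x1FF,        # 512-entry table -> 9 bits
        "offset": vaddr & 0x1FFF,           # 13-bit offset within an 8KB page
    }

print(walk_indices((3 << 42) | (5 << 13) | 0x7FF))
```

A 4MB superpage would stop one level earlier, treating bits [21:0] as the page offset.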
5.5.4 Discussion of Limitations
This section discusses the limitations of our methodology.
• Blocking In-Order Core Model: This work uses a trace-driven timing simulator that
follows a blocking in-order core model for all memory requests in a 16-core CMP. In this
blocking in-order core model, a core C1 cannot issue another memory request to its L1
cache unless its previously pending memory request has completed. This constraint does
not apply to memory requests issued by the MMU: page-walks and FMN probes/allocations.
This simple front-end reflects recent trends toward simpler core microarchitectures [51].
POWER6 [48], ARM Cortex-A53 [7], and Intel Xeon Phi [63] are a few examples of commercial
in-order machines.
In an Out Of Order (OoO) core, part of the TLB miss latency will likely be hidden
by the extracted Instruction Level Parallelism (ILP). However, there is a limit on how
much ILP can hide; some of the TLB miss handling time (in systems with hardware-managed
TLBs) will be non-overlapping, similar to our in-order core model. Many of the
simulated commercial workloads tend to have low Memory-Level Parallelism (MLP) and
hence low Instructions Per Cycle (IPC), thus making an in-order core model a reasonable
approximation. For example, Ferdman et al. report that the MLP for the scale-out
workloads (Cloud-Suite) ranges from 1.4 to 2.3 even in the presence of an aggressive 4-wide
issue OoO core with a 48-entry load/store buffer and a 128-entry instruction window [30].
The MLP numbers are even lower for “traditional server workloads” like TPC-C and
SPECweb09 [30]. They also report application IPC numbers “in the range of 0.6 (Media
Streaming) to 1.1 (Web Frontend)” for the scale-out workloads, whereas workloads like
TPC-C exhibit even lower IPC.
• Memory-Only Traces: The traces contain only memory accesses, an approximation of
a memory-bound core which is expected for the commercial and Cloud-suite workloads
(Section 3.2) used in this work. Because the baseline system has software-managed TLBs
(as Section 5.5.1 discussed), the memory traces also include memory requests that cor-
respond to TSB probes and page walks from the original full-system simulation. These
requests are not distinguished from all the other memory requests, since neither the TSB
probes nor the page-walks would coincide with TLB misses in the simulated TLB base-
lines, and the page-table format in x86 is different. Further, because the TLB in the
original system was relatively large (512-entry SA TLB for 8KB pages and 16-way FA
TLB for superpages), when compared to split L1-TLBs, the MPMI was much lower.
Therefore, these requests are a very small portion of the memory traces, with page-walk
requests being even fewer as TSB hits are the common case. Our infrastructure treats
these requests as accesses to yet another software hash-based data-structure.
• Synchronization: No synchronization is modeled across the memory requests of multiple
cores, other than the “synchronization” happening implicitly due to coherence. It is
possible that a request X from core C2 completes before a request Y from core C1 that
was stored earlier in the trace. The timing order of requests in the trace represents
one possible memory ordering in functional mode. Different permutations/orderings are
possible. In all cases, we do not anticipate this lack of synchronization/ordering to affect
the observed trends. First and foremost, any such variation would equally affect a
system with an FMN as well as all the baselines. Second, the lack of synchronization could
underestimate the page-walk impact on performance, and thus the potential FMN benefit.
For example, if a memory access Y in one core should follow, time-wise, a memory access
X from another core due to a synchronization barrier, and X happens to miss in the TLB,
the performance benefit of reducing that page-walk overhead would be more significant
in reality than in the simulated system where access Y can proceed before access X. In
this context, the reported FMN benefits could be underestimated.
• Page-Walk Modeling: The physical addresses the page tables map to were determined
according to the process described in Section 5.5.3. These addresses and their spatial
vicinity in the memory address space would be OS and system dependent. Even for the
same OS and architecture, the system’s load would determine the free list from which page
frames will be allocated upon request. In our methodology, the four page-tables (one per
level) that map a virtual address for an 8KB page to its corresponding physical frame
would all be allocated contiguously in physical memory if they had no prior accesses.
This approach might favour the page-walk latency. It is thus possible that FMN’s benefit
might be greater had another scheme been followed. Regardless, the employed scheme
meets the following two requirements: (i) it reflects a multi-level page table walk, and (ii)
it is consistently used on all configurations including the baseline.
5.6 Reasoning about FMN’s Performance Potential
This section presents an analytical model to estimate the potential performance improvement
(i.e., execution time reduction) for the proposed FMN technique. Given such a model, measure-
ments from actual applications can be then plugged in to estimate what performance benefits,
if any, should be expected. FMN aims at reducing the TLB miss handling time. It will thus
achieve the following performance improvement:
Performance Improvement = (Tbaseline − TFMN) / Tbaseline        (5.3)
where Tbaseline is the execution time in cycles of a given workload on the baseline system, while
TFMN is the execution time of that same workload on a system with the proposed FMN design.
The execution time for the baseline system can be approximated as:
Tbaseline = TTLB Misses + TMemory + TOther (5.4)
which is the sum of the time spent servicing TLB misses (TTLB Misses), the time spent servicing
memory requests once their translation is known (TMemory), and the time spent on computation
(TOther).
FMN’s goal is to reduce the amount of time spent servicing TLB misses by trading the
latency of lengthy page-walks for hits in the proposed cached TLB. However, since FMN
introduces more memory requests compared to the baseline along with a new cached structure,
it could slightly increase the time spent servicing memory requests. FMN does not affect the
computation time (TOther) which will be the same both for the baseline and the FMN system.
The FMN execution time can thus be expressed as:
TFMN = (1 + ∆mem) ∗ TMemory + (1−∆TLB miss) ∗ TTLB Misses + TOther (5.5)
It is extremely unlikely for FMN to increase the TLB miss penalty (negative ∆TLB miss) as this
would imply a page walk latency increase not counterbalanced by any reduction due to FMN
hits. No such scenarios were encountered in simulation. Both delta values were measured to be
in a positive [0, 1] range.
For convenience, two ratios r and c are defined as:
r = TMemory / TTLB Misses        (5.6)

c = TOther / TTLB Misses        (5.7)
The execution times for the two systems can now be rewritten as:
Tbaseline = (1 + r + c) ∗ TTLB Misses (5.8)
and
TFMN = r ∗ (1 + ∆mem) ∗ TTLB Misses + (1−∆TLB miss) ∗ TTLB Misses + c ∗ TTLB Misses (5.9)
Therefore, the performance improvement (Equation 5.3) can be rewritten as:
Performance Improvement = (Tbaseline − TFMN) / Tbaseline = (∆TLB miss − r ∗ ∆mem) / (1 + r + c)        (5.10)
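Equation 5.10 is straightforward to evaluate directly; the sketch below plugs in illustrative pairings drawn from the measured ranges reported later in this section, not actual per-workload data:

```python
# Sketch evaluating Equation 5.10 from the measured deltas and ratios.
def perf_improvement(d_tlb: float, d_mem: float, r: float, c: float) -> float:
    """Fractional execution-time reduction; negative means a slowdown."""
    return (d_tlb - r * d_mem) / (1 + r + c)

# Illustrative pairings of the measured ranges (Section 5.6), c assumed 0:
print(100 * perf_improvement(0.33, 0.005, 6, 0))    # TLB-intensive, low r
print(100 * perf_improvement(0.12, 0.015, 27, 0))   # high r: can go negative
```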
Upper Bound Projection: Figure 5.8 plots a possible upper bound on the projected performance
improvement (%) achieved with FMN. The computations assume (i) no increase in memory
latency (∆mem = 0), and (ii) a 75% decrease in TLB miss handling latency (∆TLB miss = 0.75).
The latter assumes four memory accesses per page walk, all with the same latency as an FMN
probe, that are substituted by a single FMN memory request. Thus, the following figure plots
the equation:
Performance Improvement (%) = 100 ∗ (∆TLB miss − r ∗ ∆mem) / (1 + r + c) = 75 / (1 + r ∗ (1 + TOther/TMemory))        (5.11)
[Figure 5.8 plots execution time reduction (%) on the y-axis (0 to 40%) against r = TMemory/TTLB Misses on the x-axis (1 to 30), with one series per TOther/TMemory ratio (0, 0.25, 0.5, 1, 2, 4, 8).]
Figure 5.8: Projected ideal % performance improvement based on Equation (5.11) with ∆TLB miss = 0.75 and ∆mem = 0.
The x-axis lists different values of r, while each series corresponds to a different value of the
TOther/TMemory fraction. Memory-bound workloads, which this work targets, will have a ratio less
than one. As anticipated, the lower the value for this ratio, i.e., the more memory-bound a
workload is (top few series in this figure), the higher the potential performance improvement
for a given r value. Large values of r indicate that the proposed FMN scheme will have very
little, if any, performance benefit. Workloads with r in the [4, 16] range are projected to achieve
performance improvement in the [2%, 38%] range for ratios of one or lower. Even though the
projected performance benefits are negligible for higher values of r, the proposed FMN scheme
is still projected to not harm performance while not requiring any dedicated on-chip resources.
We believe this can be a compelling design choice for systems where chip real estate is at a
premium.
Figure 5.8’s projections assumed FMN’s data did not increase the memory latency for
demand requests. In this context, a non-zero ∆mem would shift this figure's series vertically
towards the x-axis, slightly reducing FMN's projected performance improvement. Next,
Figure 5.9 plots the projected performance improvement for various values of ∆TLB miss, ∆mem
and r based on Equation (5.10), assuming c = 0, i.e., TOther is negligible compared to
TTLB Misses. The ranges of values for the three other parameters of that equation reflect simulation
measurements. For the workload traces used in this work, ∆TLB miss was measured in the range
of 0.12 to 0.33, while ∆mem from 0.005 to 0.015. For ∆TLB miss the figure plots the entire
spectrum of possible valid values, starting from zero which stands for no reduction in TLB miss
cycles. It is highly unlikely for ∆TLB miss to be greater than 0.75, given that the four memory
requests of the page walk will be substituted with a single memory request in case of an FMN
hit. The figure also plots four ∆mem configurations. A zero ∆mem value means FMN has no
negative influence in the execution time of memory requests.
[Figure 5.9 has four panels, one per ∆mem value (0, 0.005, 0.01, 0.02), each plotting performance improvement (%) against ∆TLB misses (the reduction in time spent servicing TLB misses, 0.0 to 0.8) for r = 5, 10, 15, 20, 25, 30.]
Figure 5.9: Projected % performance improvement based on Equation (5.10) with c = 0.
The ratio r solely depends on the workload’s access pattern and the baseline configuration
(e.g., TLBs, cache hierarchy, etc.). The lower the value of r, the more page-walks’ latency
Chapter 5. The Forget-Me-Not TLB 112
dominates execution time. For the traces used in this work, r was measured in the [6, 27]
range. It was six for canneal (i.e., ∼14% of baseline execution time was spent servicing TLB
misses), while r was 27 for TPC-C1 (i.e., ∼4% of the baseline execution time was spent servicing
TLB misses). Figure 5.9 plots r values from 5 to 30 in increments of five. The smallest r value
of five represents a workload that spends 20% of its execution time servicing TLB misses.
As Figure 5.9 shows, for workloads that have high values of r, ∆TLB miss needs to be quite
high, e.g., more than 0.5, for the workload to experience even a small, one or two percent,
performance improvement. Otherwise, FMN can result in a slowdown. This behaviour is
accentuated as ∆mem increases. Disabling the FMN for workloads that have been
profiled to have a high r value could be one possible solution. For TLB intensive workloads,
characterized by low r values, a ∆TLB miss above 0.3 is beneficial. The lower the r is, the higher
the performance benefit of FMN will be.
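The profiling-based gating idea above can be sketched in a few lines. The counter names and the cut-off value of 15 below are illustrative assumptions, not part of the proposal; the only grounded fact is the definition of r as the ratio of total execution cycles to TLB-miss handling cycles (r = 5 corresponds to 20% of execution time spent servicing TLB misses).

```python
def measured_r(total_cycles: int, tlb_miss_cycles: int) -> float:
    """r = total execution cycles / cycles spent servicing TLB misses;
    r = 5 corresponds to 20% of execution time handling TLB misses."""
    return total_cycles / tlb_miss_cycles

def fmn_enabled(total_cycles: int, tlb_miss_cycles: int,
                r_threshold: float = 15.0) -> bool:
    """Keep FMN on only for TLB-intensive (low-r) workloads.
    The threshold value is illustrative."""
    return measured_r(total_cycles, tlb_miss_cycles) < r_threshold
```

Under this illustrative threshold, a canneal-like profile (r = 6) keeps FMN enabled, while a TPC-C1-like profile (r = 27) disables it.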
5.7 Synthetic Memory Access Patterns
As Section 5.6 explained, FMN’s potential relies on the memory access patterns of the simulated
workloads. We created a suite of synthetic memory traces to explore how the proposed FMN
design behaves under different loads. We believe such an exploration is valuable. Even though
we simulated commercial and cloud workloads, one could anticipate that future workloads will
stress TLBs and memory hierarchies even more or in different ways.
The following design knobs, detailed below, are of interest: memory footprint size, memory
access patterns, presence of data sharing, and number of processes.
(1) Memory Footprint: Memory footprint refers to the memory space, i.e., number of bytes,
a workload accesses. In memory hierarchy research, this metric often relates to the required
cache capacity. For the synthetic traces in this work, a workload’s memory footprint comprises
(a) the number of unique 8KB pages this workload accesses during its execution, and (b) the
number of 64B cache blocks accessed within each page. Both these configuration parameters
affect the ratio r presented in Equation (5.6).
The first parameter, number of unique pages, directly relates to the number of translation
entries required for the address translation structures (e.g., TLB and FMN) to avoid a page
walk for non-cold accesses. Scenarios where all unique pages fit in the TLB, thrash the TLB
but fit in the FMN, as well as scenarios where both the TLBs and FMN are thrashed, are all
modeled, as they directly affect the number of TLB misses and the time spent servicing them.
The second characteristic of the footprint, the number of cache blocks accessed, is also
important. Accessing only one cache block per page puts the TLB miss handling latency in
the forefront, whereas accessing many cache blocks from each page, blocks that might miss in
the caches, can reduce the performance impact of a potential TLB miss. Accessing more cache
blocks from each page can also result in more contention for the existing L1-D cache capacity,
as FMN and page tables might need to contend more with the workload’s memory footprint.
As Section 5.4 mentioned, both FMN probes and page-walk requests percolate through the
memory hierarchy starting from the L1 cache.
(2) Access Patterns: This work explores various memory access pattern combinations, both
at the granularity of pages as well as at the granularity of cache blocks. These two categories
of access patterns are orthogonal, and together they reflect how different algorithms access
memory. The memory footprint configuration option, mentioned earlier, controls the size of a
pool that contains only unique pages. The memory access pattern dictates how this pool is
populated.
The page-level access pattern controls the page number relationship among consec-
utively accessed pages. For example, contiguous page numbers reflect a streaming (i.e., se-
quential) pattern, while a fixed stride between consecutively accessed page numbers reflects a
striding pattern. A random permutation of page numbers is also modeled. All pages in the
pool are accessed in a round-robin fashion.
The different memory patterns influence the intensity of accesses each TLB or FMN set sees.
For example, a stride pattern of two which accesses only even pages (e.g., page numbers: 2, 4,
6, etc.) would cause contention for a few TLB sets (all sets with an even index), leaving half
of the TLB (or FMN) sets underutilized. On the other hand, a streaming pattern (e.g., page
numbers: 1, 2, 3, etc.) would uniformly stress all sets.
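The page-level patterns described above can be expressed as a small generator. This is a sketch, not the thesis's trace generator: the function and parameter names are illustrative, and the random pattern uses a fixed seed for reproducibility.

```python
import random

def page_sequence(pool_size, pattern="stream", stride=2, passes=1, seed=0):
    """Yield virtual page numbers for one or more round-robin pool passes.

    stream : contiguous page numbers (0, 1, 2, ...)
    stride : a fixed stride between consecutive page numbers
             (stride=2 touches only even pages, hence only even TLB sets)
    random : a fixed random permutation of the pool, replayed each pass
    """
    if pattern == "stream":
        order = list(range(pool_size))
    elif pattern == "stride":
        order = [i * stride for i in range(pool_size)]
    elif pattern == "random":
        order = list(range(pool_size))
        random.Random(seed).shuffle(order)
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    for _ in range(passes):
        yield from order
```

For example, a stride of two yields pages 0, 2, 4, 6, ..., all of which map to even-indexed TLB sets, while the stream pattern touches every set uniformly.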
The block access pattern, i.e., how cache blocks are accessed within a page, does not
influence TLB misses directly. However, it can affect the significance of TLB misses for the
baseline, as discussed earlier. We also explore how our system would behave if all cores followed
the same access pattern or different permutations of it.
(3) Data Sharing: This option controls the amount of data sharing present across cores.
Per-core pools of unique pages cover multiprogrammed workload scenarios where no sharing is
present. Prepending a unique identifier (e.g., core ID) to each unique page number achieves this
purpose without interfering with the TLB (or FMN) indexing or skewing the measurements
on different cores. For multi-threaded workloads, all per core pools contain the same unique
pages.
Different degrees of data sharing can influence whether a shared FMN would be a beneficial
design choice. Shared footprints could also reduce the page walk latency for the baseline,
compared to private per-core footprints, as the various page table entries will most likely already
be cached.
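A minimal sketch of this pool construction follows. Prepending the core ID in high-order bits keeps per-core pages globally unique without disturbing the low-order bits used for set indexing; the bit position (40) and function name are arbitrary illustrative choices.

```python
def pool_page_number(core_id: int, page_number: int, shared: bool = False) -> int:
    # Shared (multi-threaded) pools: all cores use the same page numbers.
    # Private (multiprogrammed) pools: prepend the core ID in high-order
    # bits so pages are globally unique while the low-order bits used for
    # TLB/FMN set indexing stay untouched. Bit position 40 is arbitrary.
    return page_number if shared else (core_id << 40) | page_number
```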
(4) Processes: TLB entries contain an ASID that identifies the process a given translation
belongs to. In SPARC terminology, different contexts denote different processes, while the
global context specifies data shared across all processes. Systems that have a single context
running on all cores could take advantage of a shared FMN, whereas systems where every core
has its own private context would not. This is similar to the effect of different degrees of data
sharing discussed earlier.
Scenarios with multiple processes running per core are also modeled. The lack of ASID-
aware TLB indexing schemes can result in increased contention and more TLB misses for the
baseline system. This also opens an interesting avenue for FMN indexing scheme exploration.
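To illustrate the indexing avenue mentioned above, the sketch below contrasts the conventional VPN-only index with a hypothetical ASID-aware index that folds the process identifier into set selection. Both functions and the XOR-fold hash are illustrative assumptions, not a proposal from this work.

```python
SETS = 16  # e.g., a 64-entry 4-way SA TLB

def vpn_index(vpn: int, asid: int = 0) -> int:
    # Baseline: the ASID is stored as a tag but ignored for indexing, so
    # two processes with identical VPNs contend for the same sets.
    return vpn % SETS

def asid_aware_index(vpn: int, asid: int) -> int:
    # Hypothetical: XOR-fold a scrambled ASID into the index so identical
    # VPNs from different processes spread across sets.
    return (vpn ^ (asid * 0x9E3779B1)) % SETS
```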
Section 5.9 will next present a case study with synthetic traces that follow the sequential page
access pattern, using the baseline CMP configuration described in Section 5.8. An evaluation
with commercial workloads will be presented later in this chapter.
5.8 Baseline CMP Configuration
All simulated configurations involve a 16-core CMP with a 4x4 mesh interconnect. There is a
3-level cache hierarchy with private L1 and L2 caches, and a distributed shared L3 cache. All
caches have a 64B cache block size and an LRU replacement policy. There are four memory
controllers. Table 5.2 presents the main parameters of the baseline configuration (i.e., caches
and TLBs) that are of interest. For the L1-TLBs, a split Haswell-like configuration is modeled.
An L2-TLB is not included in the baseline unless explicitly noted (e.g., B + L2).
Caches
L1-D Caches: private, 4-way SA, 32KB, 2-cycle latency
L2 Caches: private, 8-way SA, 256KB, 3-cycle tag / 9-cycle data latency
L3 Cache: shared, 16-way SA, 16MB (1MB per tile), 4-cycle tag / 10-cycle data latency
TLBs
L1-TLBs: 4-way SA, 64-entry (8KB pages); three 4-way SA 32-entry TLBs for 64KB, 512KB and 4MB pages respectively
L2-TLB: 8-way SA, 1024-entry, 8-cycle latency

Table 5.2: System Configuration Parameters
5.9 Sequential Page Access Patterns - A Case Study with Synthetic Traces
This section presents an analysis of the synthetic trace results for a sequential (i.e., streaming)
page access pattern and demonstrates how different design knob values affect performance via
their interaction with the baseline TLB hierarchy. With this specific pattern, a core accesses
a pool of PS (Pool Size) contiguous 8KB pages in a round-robin manner. The pool can be
either private (per core) or shared (replicated across cores). In total, 32 million memory accesses
(reads) are modeled (two million per core). The first half of the execution time warms up the
memory hierarchy; results are presented for the second half (i.e., the last 16 million requests).
Because of the pool sizes modeled and the round-robin manner in which the pools are accessed,
the simulated number of memory accesses completely captures the behaviour of this synthetic
trace.
In every pool pass, a fixed number of cache-block reads is performed for each 8KB page; we
refer to this count as BCPP (Block Count Per Page). These accesses are cache-block (64B)
aligned. A per-page offset, a multiple of 64 bytes, is applied to these accesses to avoid L1-D cache contention. Without
this offset the first cache-block of all 8KB pages would map to the first of the 128 cache sets of
the modeled 4-way SA 32KB L1-D cache. Applying this offset is similar to compiler padding
optimizations that reduce cache conflict misses. For example, assume a unique pool of 80 8KB
pages per core with a streaming page access pattern which accesses two cache blocks per page.
That is, PS is 80 and BCPP is 2. Core 0 accesses the first two cache blocks of the first 8KB
page (e.g., virtual addresses 0x0, 0x40), and then accesses two blocks from the subsequent 8KB
page (i.e., virtual addresses 0x2080, 0x20c0). The blocks accessed in that second 8KB page are
the third and fourth blocks of that page and not the first two. Having this padding prevents
these accesses from mapping to the same L1-D cache set.
The padding can be formally computed as:
((pool index ∗ BCPP ) + block index) ∗ 64 Modulo 8192
where 64 is the cache line size and 8192 is the page size. The pool index is in the range [0, PS-1]
and is incremented every BCPP number of accesses, while the block index, with values in the
range of [0, BCPP - 1], is incremented on every access and it is reset to zero once equal to
BCPP. BCPP’s value should never exceed (page size / cache line size).
The remaining section is organized as follows. First, Sections 5.9.1 to 5.9.4 demonstrate
the impact the various design knobs have on the baseline configuration. Then, the subsequent
sections measure FMN’s effectiveness via metrics such as performance and TLB miss handling
latency.
5.9.1 Impact of Workload’s Footprint on Baseline Configuration
Because of the sequential page access pattern, three distinct groups of pool sizes (PS) exist
with respect to TLB hit-rate for the baseline L1-TLB configuration (Table 5.2). The groups
shown below apply to any set-associative TLB with an LRU replacement policy; wherever
relevant, specific numbers are provided for the baseline 64-entry 4-way SA L1-TLB for 8KB
pages.
• PS <= # TLB entries (i.e., 64): For a pool with at most 64 contiguous pages, only
cold TLB misses occur in the baseline SA TLB that uses an LRU replacement policy.
Thus, as Figure 5.10 shows, the TLB hit-rate during the last 16M requests is consistently
at 100%, irrespective of the per-page accesses (i.e., BCPP, the number of 64B cache blocks
accessed within each page).
• PS >= (1 + 1/TLB associativity) ∗ # TLB entries (i.e., 80): Any pool size above, or equal
to, the 80-entry boundary will thrash the L1-TLB, with 0% TLB hit-rate when only one
cache block is accessed per page (BCPP = 1). As Figure 5.10 shows in its second series,
the 0% measured TLB hit-rate, for a block count of one, becomes 50% if two cache blocks
are accessed per page, and 99.2% if all 128 cache blocks in each page are accessed. All hits
for this series are due to multiple accesses (BCPP > 1) to the same page. The hit-rates
listed in the figure (data labels) apply to any PS >= 80. In all these cases, having a single
TLB-entry that keeps the most recently used translation would have the same behaviour
as the baseline TLB.
• 64 < PS < 80: For any pool size between the two aforementioned boundaries, the
hit-rate curve would lie in the area between the two plotted series. As PS grows towards
80, and fewer translations persist from the first 16 pages in the pool (L1 TLB’s LRU way),
the corresponding curve moves towards the PS>=80 series.
[Figure 5.10 plot: L1-TLB hit rate (%) versus BCPP (# 64B blocks accessed per page, 1 to 128). The PS <= 64 series stays at 100% throughout; the PS >= 80 series has hit rates of 0, 50, 75, 87.5, 93.75, 96.88, 98.44 and 99.22% for BCPP = 1, 2, 4, 8, 16, 32, 64 and 128 respectively.]
Figure 5.10: Effect of pool size on TLB hit rate.
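These three regimes can be reproduced with a few lines of simulation. The sketch below models the baseline 64-entry 4-way SA L1-TLB with true LRU replacement and measures the hit rate over the second half of the passes (i.e., after warm-up); class and function names are illustrative.

```python
from collections import OrderedDict

class SetAssocTLB:
    """Set-associative TLB with true LRU replacement."""
    def __init__(self, entries=64, ways=4):
        self.ways = ways
        self.sets = entries // ways
        self.lru = [OrderedDict() for _ in range(self.sets)]

    def access(self, page):
        """Return True on a hit; insert with LRU eviction on a miss."""
        s = self.lru[page % self.sets]
        if page in s:
            s.move_to_end(page)
            return True
        if len(s) >= self.ways:
            s.popitem(last=False)   # evict the least recently used entry
        s[page] = True
        return False

def hit_rate(ps, bcpp, passes=20):
    """Sequential pattern: pages 0..ps-1 round-robin, bcpp accesses each.
    The hit rate is measured over the second half of the passes."""
    tlb, hits, total = SetAssocTLB(), 0, 0
    for p in range(passes):
        for page in range(ps):
            for _ in range(bcpp):
                hit = tlb.access(page)
                if p >= passes // 2:
                    hits += hit
                    total += 1
    return hits / total
```

This reproduces the figure's data points: hit_rate(64, 1) is 100%, hit_rate(80, 1) is 0%, hit_rate(80, 2) is 50%, and hit_rate(80, 128) is 99.2%.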
For workloads with pool sizes that exceed the current TLB capacity and thrash the TLB,
the more spatial locality exists within a given page (i.e., the more cache blocks are accessed from
that page), the less significant the total TLB miss handling time becomes for the workload’s
performance. Figure 5.11 depicts the percentage of execution time spent servicing TLB misses
for different pool sizes (graph series) and different degrees of per page spatial locality (x-axis).
If 16 cache blocks are accessed per page, TLB miss handling will account for less than 1.4% of
execution time. This percentage falls below 1% if more than 25% of a page’s cache blocks are
accessed.
[Figure 5.11 plot: % execution time spent servicing TLB misses versus BCPP, with one series per pool size (PS-64 to PS-2048).]
Figure 5.11: TLB Miss Latency as percentage of execution time with varying PS and BCPP values. Figure 5.12 presents how the execution time changes.
As anticipated, the execution times for these synthetic traces vary as PS and BCPP change
(Figure 5.12). But one should not draw the simplistic conclusion that “TLB miss latency is a
larger fraction of execution time when the latter is shorter” as this would be misleading. It is
the memory footprint as determined by the pool sizes and the number of cache block accesses
within a page that affects both the execution time and the TLB miss latency. When more cache
blocks are accessed, either due to larger pool sizes (PS) or a combination of large PS and BCPP
values, the memory latency dominates: the requested cache blocks no longer fit in the L1-cache
but eventually spill in the L2 and L3 caches resulting in the drastic increases of the execution
time shown in Figure 5.12. The average memory request latency, depicted in Figure 5.13, also
reflects the same trends. For the PS-64 series, the average memory latency is 3-cycles for the
BCPP-1 to BCPP-8 configurations since all the memory data fit in the L1-D cache.
[Figure 5.12 plot: execution time (million cycles, 0 to 250) versus BCPP, one series per pool size (PS-64 to PS-2048).]
Figure 5.12: Execution time with varying PS and BCPP values.
[Figure 5.13 plot: average memory request latency (cycles, logarithmic scale from 1 to 1000) versus BCPP, one series per pool size (PS-64 to PS-2048).]
Figure 5.13: Average Memory Request Latency in cycles. Note the logarithmic y-axis scale.
For similar reasons, the average TLB miss handling latency (page-walk latency) also in-
creases alongside PS and BCPP (Figure 5.14). While all pools with more than 80 pages have
the same number of TLB misses, for a given BCPP, these misses become, on average, more
costly with larger pool sizes. The increased data footprint no longer allows the page-table
entries to survive in the upper-level caches (e.g., the L1 cache). These results illustrate how
interconnected the TLB miss latency is with the workload’s footprint and access pattern. Even
though anticipated, they remind us that we should not look at the TLB miss latency in a
vacuum.
[Figure 5.14 plot: average L1-TLB miss latency (cycles, 0 to 60) versus BCPP, one series per pool size (PS-64 to PS-2048).]
Figure 5.14: Average L1-TLB Miss Latency in cycles. No TLB misses exist for the PS-64 series in these last 16M references, as explained earlier.
5.9.2 Effect of Per-Page Access Pattern on Baseline
The block access pattern, i.e., how cache blocks are accessed within each page (e.g., sequential,
stride, random, etc.), can also affect execution time, even though it has no bearing on the
number of TLB misses. As page tables contend with the workload’s data for cache capacity,
the latency of a page walk as well as the latency of each memory request can vary.
5.9.3 Effect of Data Sharing on Baseline
Figures 5.15 and 5.16 demonstrate the impact of data sharing on the average TLB miss latency
and memory request latency respectively. Results are presented as the percentage cycle reduc-
tion achieved with a shared data pattern compared to a private pattern. For the shared pool
pattern the same pool contents are replicated for each core. For the 2K pool size, a shared
pattern has a significant positive impact both on the TLB miss penalty and on the average
latency of each memory request, up to 38% and 62% decrease respectively. This impact is more
pronounced for large BCPP values where the data footprint and the execution time are much
higher. The two main reasons are: (i) the overall CMP data footprint is now 1/16th of the
private pattern footprint and can thus fit in upper level caches, and (ii) all cores will access
the same page-table entries which will thus occupy fewer cache resources, and which can also
survive in upper caches due to the smaller memory footprint discussed earlier. For small pool
sizes (e.g., up to PS-128), a shared pattern can, in some instances, increase execution time
(negative percentages); coherence overheads and remote private cache accesses are the most
likely causes.
[Figure 5.15 plot: % L1-TLB miss latency reduction over the private pattern (-40 to 60) versus BCPP, one series per pool size (PS-64 to PS-2048).]
Figure 5.15: Shared versus Private: Effect of data sharing on L1-TLB miss latency.
[Figure 5.16 plot: % memory request latency reduction over the private pattern (-40 to 80) versus BCPP, one series per pool size (PS-64 to PS-2048).]
Figure 5.16: Shared versus Private: Effect of data sharing on average memory latency.
5.9.4 Effect of Process Mix on Baseline
The process mix running on the CMP can also impact the TLB miss handling latency. The
following three configurations were modeled: (i) global : a single process is running across all 16
CMP cores (this is the configuration for the previous sections), (ii) private: a separate process
is running on each core, and lastly (iii) ctxt 2 : two processes are running on each core with no
process overlap across cores.
For this experiment, when multiple processes are running per core (private, ctxt 2 configu-
rations), the number of unique virtual page numbers in the pool is the pool size divided by the
number of contexts. This ensures that for the same pool sizes the same number of TLB entries
is required irrespective of the process mix. A 64-page pool with the ctxt 2 configuration means
32 unique contiguous pages are first accessed under context a and then the same 32 unique
contiguous pages are accessed under context b.
TLB miss count remains the same for private and ctxt 2 configurations compared to global,
except for ctxt 2 with PS-80, which incurs 40% fewer misses than global or private for the same
BCPP. Because all processes share the same virtual page numbers, some translations will persist
in the TLB across pool passes. For a 4-way SA 64-entry TLB this behaviour is observed when
the pool size is in the [80, 94] range, i.e., when the pages of each context require between 2.5
and just under 3 TLB ways.
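The 40% reduction can be verified with a tiny model of the baseline 64-entry 4-way SA L1-TLB in which entries are tagged with a context but the index ignores it; this is a sketch consistent with the explanation above, with illustrative helper names.

```python
from collections import OrderedDict

SETS, WAYS = 16, 4   # the baseline 64-entry 4-way SA L1-TLB

def last_pass_misses(accesses, passes=10):
    """accesses: one pool pass as a list of (context, vpn) pairs.
    Indexing uses only the VPN (no ASID-aware indexing); entries are
    tagged with the context. Misses are counted in the final pass."""
    sets = [OrderedDict() for _ in range(SETS)]
    misses = 0
    for p in range(passes):
        for key in accesses:
            _ctx, vpn = key
            s = sets[vpn % SETS]
            if key in s:
                s.move_to_end(key)
            else:
                if p == passes - 1:
                    misses += 1
                if len(s) >= WAYS:
                    s.popitem(last=False)   # evict the LRU entry
                s[key] = True
    return misses

# PS-80, BCPP-1: global versus ctxt_2 (40 shared VPNs per context)
glob = [("g", v) for v in range(80)]
ctx2 = [("a", v) for v in range(40)] + [("b", v) for v in range(40)]
```

In steady state, the global pool misses on all 80 accesses per pass, while the ctxt_2 pool misses on only 48: the TLB sets that hold at most four (context, VPN) pairs retain their translations across passes.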
Figure 5.17 reports the number of cycles spent servicing a TLB miss on average for different
process mixes (graph series) and different PS and BCPP values (x-axis labels) for a private
sharing pattern. Figure 5.18 does the same but for a shared sharing pattern.
[Figure 5.17 plot, "Data Sharing: Private": L1-TLB miss latency (cycles, 0 to 60) for each BCPP (lower x-axis) and PS (upper x-axis) combination, with one series per process mix (global, private, ctxt_2).]
Figure 5.17: Private sharing pattern: Effect of process mix on baseline’s TLB miss latency.
For the ctxt 2 process mix, the pages in the pool no longer share the same first-level page-
table causing an increased page-walk latency. This increase becomes more pronounced for
larger footprints when the extra cached page-table entries push the workload’s footprint to a
lower cache level (e.g., L3). For BCPP >= 32, the footprint no longer fits in the L3 cache
due to conflicts even for PS-80 and PS-128, causing a drastic increase in the average memory
latency from 70 to 230 cycles. The private process mix also sees a TLB-miss latency increase
compared to global, but consistently less than ctxt 2. Even though there is no change from the
perspective of a core, having a separate per-core context rather than a global one means that
there can be no sharing of the PDEs for the upper page-walk levels across cores.
[Figure 5.18 plot, "Data Sharing: Shared": L1-TLB miss latency (cycles, 0 to 60) for each BCPP (lower x-axis) and PS (upper x-axis) combination, with one series per process mix (global, private, ctxt_2).]
Figure 5.18: Shared sharing pattern: Effect of process mix on baseline’s TLB miss latency.
Similar conclusions hold for the shared access pattern as well (Figure 5.18) but with a greater
percentage TLB miss latency increase compared to the private sharing pattern. This behaviour
is expected as the shared access pattern with the global process mix configuration had all cores
sharing not only all their page-translations but also all their memory accesses.
5.9.5 Private FMNs
Having explored the impact the various design knobs have on the baseline, this section will
now evaluate the proposed FMN design. This section’s results are for the private data sharing
pattern and the global process mix. Figure 5.19 depicts the percentage decrease in execution
time two FMN configurations - explained shortly - achieve over the baseline (denoted as B).
The higher the number the better; negative numbers signal performance degradation.
The ideal-FMN series is an unrealistic 1K-entry FMN configuration that assumes all FMN
probes always hit in the L1-D cache (2-cycle latency) without causing any interference with
the cache data. While neither of the two aforementioned conditions can hold in a real system,
the ideal FMN offers an absolute upper bound on FMN’s performance benefits, reflecting the
benefits of a standalone, non-cached FMN. This configuration is different from a similarly sized
L2-TLB because, contrary to FMN, L2-TLB probing precedes a page walk. The ideal-FMN
results show that good performance benefits can be achieved when PS <= FMNentries and
PS ∗ BCPP <= 1024, with higher speedup for smaller BCPP values. An explanation for
the significance of 1024 will follow later in this section. Although rare, it is possible for ideal-
FMN to cause negligible performance degradation compared to the baseline; the worst case
measured here is -1.16% for PS-256 and BCPP-32. The cause is the different time-ordering
and interaction of memory requests and page-walk requests in the caches. Since the ideal-FMN
has perfect hit-rate and a very fast constant access time, speculative memory requests can be
issued well before their corresponding page walk completes.
[Figure 5.19 plot: % execution time reduction (-80 to 80) for each PS (lower x-axis) and BCPP (upper x-axis) combination, for the B+FMN (private, 1K-entries) and B+ideal FMN configurations.]
Figure 5.19: Performance Impact: FMN versus Baseline.
The FMN series in Figure 5.19 achieves significant performance improvement for PS-80 and
PS-128 when one or two cache blocks are accessed per page. The measured execution time
reduction is around 58% for BCPP-1 and 49% for BCPP-2 for the two aforementioned pool
sizes that fit in the simulated private 1K-entry FMN. For BCPP values greater than eight,
FMN has negligible performance impact similar to the ideal-FMN, since TLB miss penalties do
not account for a significant portion of the execution time. However, Figure 5.19 shows a few
outliers where FMN causes significant performance degradation. For example, the PS-512 and
BCPP-2 run sees a 22.26% execution time increase under FMN, while the PS-1024 and BCPP-1
run sees a 77.77% increase. This behaviour at first appears to be counter-intuitive because in
both cases the required number of translation entries can fit in each per core FMN. Examining
FMN’s effect on L1-TLB miss latency and memory access time, the two components described
earlier in Section 5.6, will help us understand this behaviour.
Figure 5.20 illustrates the average TLB miss latency in cycles. Only memory accesses that
miss in the baseline TLB experience this latency. For the baseline configuration, this is the
average latency of the page walk, while when FMN is enabled it is min(FMN latency, page
walk latency). As anticipated, FMN significantly reduces TLB miss latency for BCPP <= 8
when PS <= FMNentries. For example, for PS-1024, FMN reduces TLB miss latency by
60.13% for BCPP-1 and by 56.88% for BCPP-8. FMN has no impact for PS-2048 that thrashes
the 1K-entry Direct-Mapped (DM) FMN. Therefore, FMN achieves its original goal of reducing
the TLB-miss latency without using any additional hardware resources.
[Figure 5.20 plot: average L1-TLB miss latency (cycles, 0 to 60) for each PS (lower x-axis) and BCPP (upper x-axis) combination, for Baseline (no FMN) and FMN.]
Figure 5.20: Average TLB Miss Latency in cycles.
Figure 5.21 depicts the average latency for each memory access, after its translation has been
retrieved either via the page walk or an FMN probe. As the pool size and accessed block count
increases, the memory latency also naturally increases because the required cache blocks start
spilling to lower level caches. The cases where the memory latency for an FMN-enabled system
is visibly greater than that of the baseline are the cases which suffer a performance degradation
in Figure 5.19. For example, for PS-512 and BCPP-2 the memory latency increases by 47.5%
from 24.98 to 36.85 cycles, while for PS-1024 and BCPP-1 the latency increases by 146.49%
from 30.92 to 76.21 cycles.
[Figure 5.21 plot: average memory latency (cycles, 0 to 250) for each PS (lower x-axis) and BCPP (upper x-axis) combination, for Baseline (no FMN) and FMN.]
Figure 5.21: Average Memory Latency in cycles; this is measured after translation is retrieved.
What causes such a drastic memory latency increase? To a large extent, it is a side-effect
of this sequential synthetic memory access pattern. For all PS and BCPP values, accesses that
are consecutive in time, from a core's perspective,
map to consecutive L1-D sets. However, these same accesses - assuming they exceed the L1
capacity of 512 cache blocks - will only occupy 25% of the L2 capacity, i.e., 1024 of the 4096 L2
cache blocks. This L2 mapping becomes relevant only when the L1 capacity of 512 cache
blocks is exceeded and lower-level caches are accessed. As a result, when data just barely fits
in the L2 cache in the baseline, adding an FMN will spill some of this data to the L3 cache,
causing a non-negligible performance degradation.
Because the FMN is a software structure that linearly maps to the physical address space,
it will map to contiguous cache sets. It will thus compete with, and displace, some regular
(demand) memory accesses even though a significant portion of the L2 cache is unoccupied. The
configurations “PS-1024 with BCPP-1” and “PS-512 with BCPP-2” are two such examples. In
these cases, FMN drastically increases the average memory latency because it causes a portion
of memory requests to access the L3 cache.
The aforementioned behavior might be one isolated pathological scenario but it is an inter-
esting exercise to identify the potential shortcomings of the proposed FMN. One can anticipate
such behaviour to manifest in boundary cases where the workload’s footprint nicely fits in a
subset of the cache hierarchy, and FMN causes data spills to lower cache levels. Note that
"PS-512 with BCPP-4", which is not such a boundary case since its data does not fit in the L2
cache in the first place, does not experience any performance degradation under FMN.
The following mechanisms, which warrant future exploration, could address these problematic
configurations. Compressing the size of each FMN entry would
allow more translations to fit in a cache line, thus reducing the contention between FMN en-
tries and memory requests for the same FMN size. This approach could take advantage of
FMN’s speculative nature to achieve this area reduction. A more ambitious approach would
be to depart from the sequential structure paradigm and have FMN steal cache blocks that are
invalid. Identifying the FMN entries which cause significant performance degradation by evict-
ing useful data could be another design option that would trade FMN hit rate to preserve the
baseline’s memory latency. Sections 5.9.7 and 5.9.8 explore two simple optimization examples.
The first uses probe filtering, while the second targets FMN allocation/replacement. But first,
Section 5.9.6 contrasts a design with FMN versus one with a L2-TLB.
5.9.6 Private FMNs versus Private L2-TLBs
Whereas Figure 5.19 compared the performance of FMN (B + FMN series) with a baseline
system with private L1-TLBs (B), Figure 5.22 also examines the performance of a system
where an L2-TLB has been added to the baseline (B + L2). The L2-TLB has the same
number of entries as the FMN (1K entries), but, unlike the direct-mapped FMN, the L2-
TLB is 8-way set-associative to match commercial state-of-the-art L2-TLB configurations (e.g.,
Haswell’s L2-TLB [38]). This section assumes an 8-cycle penalty for L1-TLB misses that hit in
the L2-TLB; Intel reports seven cycles for a TLB with half the entries and half the associativity
(SandyBridge’s 512-entry 4-way SA L2-TLB) [38]. The B+L2 series with half that penalty
is simulated strictly to demonstrate how sensitive performance can be to the L2-TLB access
latency, and it is not a realistic configuration.
[Figure 5.22 plot: % execution time reduction (-80 to 80) for each PS (lower x-axis) and BCPP (upper x-axis) combination, for B+L2-TLB (4 cycles), B+L2-TLB (8 cycles), and B+FMN (private, 1K-entries).]
Figure 5.22: Performance Impact: FMN compared to L2-TLB.
The FMN series significantly outperforms the “B + L2 (8 cycles)” configurations in all cases
where PS × BCPP < 1024. Latency is the sole reason: in all these cases, the needed translations
fit in both the FMN and the L2-TLB, so, assuming warmed-up structures, a page walk is
technically never needed. The FMN then provides a translation in 3 cycles (the L1 cache
latency), while the L2-TLB requires 8 cycles. The unrealistic 4-cycle L2-TLB closes the
performance gap, though most of the time the comparison still favours the FMN.
Chapter 5. The Forget-Me-Not TLB 126

When some of FMN's pathological scenarios discussed in earlier sections occur as the workload
footprint increases, the L2-TLB continues to reap performance benefits while the FMN suffers
performance degradation. “PS-1024 with BCPP-1” is the most characteristic
example. Once the needed translations exceed 1024 entries and no longer fit in the L2-TLB, the
two design options are comparable. In some of these cases, e.g., for “PS-2048 with BCPP-1”,
even the L2-TLB, which does not interfere with the on-chip cache data the way the FMN does,
can cause minor performance degradation, as all L2-TLB probes, which will invariably result in
a miss, precede the page walk.
To summarize, Figure 5.22 illustrated a number of cases where the FMN, which needs no
dedicated hardware storage for translation entries, can perform comparably to or better than a
dedicated hardware L2-TLB of the same size. Also, in cases where the L2-TLB yields no or
limited performance benefits, the FMN can offer comparable performance. Thus, if one can
tolerate the potential performance loss in the boundary cases, which is determined by the
workload's access pattern in conjunction with the underlying cache hierarchy, the FMN can be
a compelling design choice. The FMN can also be a compelling alternative for systems where
chip real-estate is at a premium and large dedicated second-level TLB structures cannot be
accommodated; sacrificing performance for area and power is a common design choice for such
architectures. Optimizing the FMN to further limit the extent of this potential performance loss
can make FMN designs even more appealing. Two simple optimizations are presented next.
5.9.7 Private FMNs: Filtering Optimization
Figure 5.23 illustrates the effect of FMN-probe filtering on the problematic scenarios discussed
earlier in this chapter. Since the average memory latency of a request, with and without FMN,
cannot be easily measured at runtime (offline profiling would be an option), FMN-filtering
targets FMN probes that are either too slow or useless. By avoiding FMN probes and allocation
requests that provide no TLB-miss latency reduction, we reduce the unnecessary destructive
interference of FMN with demand memory accesses in the cache hierarchy, thus minimizing
data spills to lower cache levels that would otherwise incur a hefty memory-latency increase.
The proposed FMN-filtering mechanism operates as follows. A 4-bit saturating counter
is used for every four FMN entries (here also four FMN sets, since the FMN is direct-mapped);
these four FMN entries fit in the same 64B cache line. Tracking FMN's usefulness at this
granularity is natural, as it is the granularity at which FMN data is allocated in the caches.
For a 1K-entry FMN, only 128B of additional storage are needed (1024 entries / 4 entries per
counter = 256 counters × 4 bits = 128B), the equivalent of just two extra cache lines. This
extra storage is not virtualized but resides at the TLB-miss controller.
Initially, all saturating counters are set to 15. The counter values define two operating
regions [15, 8] and [7, 0] as follows:
• Values [15, 8]: FMN-probes are always issued on a TLB miss. If the FMN-probe returns
after the page-walk or returns before but without valid information, the corresponding
saturating counter is decremented by one. If the FMN-probe returns first and is useful,
the counter is incremented by two. Timely FMN-hits should be more beneficial than
FMN-misses.
• Values [7, 0]: When the saturating counter reaches 7 (half of its initial value), no FMN
probes are issued. This continues while the counter value is in the [7, 0] range. The counter
is decremented by one on every TLB miss that would have triggered an FMN probe in the
four FMN sets represented by this counter. Once zero is reached, the counter is reset to 12.
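The two operating regions can be summarized as a small state machine. The sketch below is an illustrative software model of the mechanism (the class and method names are hypothetical), not the hardware implementation:

```python
class FMNFilterCounter:
    """4-bit saturating counter guarding FMN probes for four FMN sets
    (the four entries that share one 64B cache line)."""

    def __init__(self):
        self.value = 15  # initial value; [15, 8] = probing, [7, 0] = filtering

    def probes_enabled(self):
        return self.value >= 8

    def on_tlb_miss(self, probe_timely_and_useful=False):
        """Update the counter on an L1-TLB miss mapping to these four sets."""
        if self.probes_enabled():
            if probe_timely_and_useful:
                # Timely FMN hit: reward it twice as much as a miss costs.
                self.value = min(15, self.value + 2)
            else:
                # Probe returned after the page walk or without a translation.
                self.value -= 1
        else:
            # Filtering region: no probe issued, but keep counting misses.
            self.value -= 1
            if self.value == 0:
                self.value = 12  # reset: probes are re-enabled
```

Eight consecutive unhelpful probes move the counter from 15 into the filtering region; seven further filtered misses then reset it to 12, re-enabling probes.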
Even with the saturating counters properly reset, the benefits are minimal, if any. For the
“PS-1024 with BCPP-1” configuration, filtering decreases FMN's execution time by 3%, slightly
limiting the original performance degradation; 1.93% fewer FMN probes are issued in this case.
The results indicate that a more aggressive filtering mechanism is needed. One shortcoming
of this filtering approach is how it decides which probes to filter. For example, in the PS-128
with BCPP-4 case, no probes are filtered since the vast majority of probes are both useful and
timely; FMN still degrades performance because FMN probes, and their corresponding allocation
requests, push useful demand data down the cache hierarchy. The filtering mechanism proposed
here fails to capture and act upon this behaviour.
[Figure: % Execution Time Reduction (y-axis, -80 to 80) over PS (lower axis: 80 to 2048) and BCPP (upper axis: 1 to 128); series: FMN, FMN-filtered.]
Figure 5.23: FMN Filtering and FMN vs. Baseline
5.9.8 Private FMNs: Replacement Optimization
The PS-2048 with BCPP-1 configuration does not benefit from the 1K-entry FMN because the
sequential access pattern thrashes the cached FMN. Filtering FMN probes would prevent many
probes from being issued, since none of them would be FMN hits; however, the FMN structure
would remain, in principle, useless. This section explores a replacement optimization that
strives to keep part of the required translations in the FMN. The approach is motivated by
adaptive insertion and replacement policies previously proposed for caches [41, 61]. The idea is
to withhold allocation requests that would replace FMN entries, based on the value of a
saturating counter. This optimization does not affect FMN probes, only FMN allocation
requests and the data they might replace.
Chapter 5. The Forget-Me-Not TLB 128
Similar to the filtering optimization, 4-bit saturating counters are used, one for every four
FMN-sets. The counters, located at the TLB miss controller, are initialized to 15 and they
operate as follows:
• Values [15, 8]: Allocation and replacement requests are always sent for counter values in
that range (inclusive). If the FMN-probe returns after the page-walk, or returns before
but without valid information, the corresponding saturating counter is decremented by
one.
• Values [7, 0]: No allocation requests are sent. The counters are decremented by one, as
above, for useless or slow FMN-probe replies. However, when the FMN probe returns first
and with correct information, the saturating counter is incremented by 2, capped at 8.
The rationale is that if hits take place while we have chosen not to issue allocation
requests, then this design choice is working and part of the working footprint survives in
the FMN; it is therefore preferable to keep the saturating counter in that range and
perpetuate the decision. However, if the counter reaches 0, it is reset to its initial value
of 15. Reaching zero means there were 8 FMN accesses to these four FMN-sets without
any hits, so no benefit arises from withholding allocations and replacements of existing
FMN data. It is possible that such clusters of misses are due to a working-set change, or
that the access pattern is so irregular that no part of it survives in the FMN for the
current saturating-counter value range.
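As an illustration, the allocation-gating counter can be modeled analogously to the filtering counter. This is a hypothetical software sketch with the increment capped at 8 as described above; the text does not specify increments while allocations are enabled, so none are modeled there:

```python
class FMNAllocCounter:
    """4-bit saturating counter gating FMN allocation/replacement requests
    for four FMN sets: [15, 8] = allocate, [7, 0] = do not allocate."""

    def __init__(self):
        self.value = 15

    def allocations_enabled(self):
        return self.value >= 8

    def on_probe_reply(self, timely_hit):
        if timely_hit:
            if not self.allocations_enabled():
                # Hits while not allocating: the choice is working,
                # so nudge the counter up, capped at 8.
                self.value = min(8, self.value + 2)
        else:
            # Useless or slow probe reply: decrement in either region.
            self.value -= 1
            if self.value == 0:
                # 8 accesses without a hit: re-enable allocations.
                self.value = 15
```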
Using this replacement optimization for the PS-2048 with BCPP-1 configuration raised the
original 0% FMN hit-rate to 43%. It also yielded a small 1.39% performance improvement.
5.10 FMN’s Evaluation for Commercial Workloads
This section evaluates FMN using the commercial and cloud workloads described in Section 3.3,
plus canneal from PARSEC. All timing runs in this section executed until 1600 million memory
references were committed across the entire CMP (excluding page walks and FMN requests).
The first 400 million requests were used to warm up the TLBs and cache structures; the
measurements presented here are for the remaining 1200 million. This is not a complete
execution of the traces used in the previous two chapters; the time overheads of timing
simulation made that impossible within a reasonable timeframe (a week per simulation).
The remainder of this section is organized as follows. Section 5.10.1 first quantifies the
impact of address translation on the baselines, including the performance impact of adding
an L2-TLB. Section 5.10.2 reports the L1-TLB miss latency reduction FMN achieves, and
Section 5.10.3 presents FMN’s impact on the average memory request latency. Finally, Sec-
tion 5.10.4 presents the performance measurements.
5.10.1 Impact of Address Translation on Baseline’s Performance
Figure 5.24 depicts the percentage of execution time spent servicing L1-TLB misses in a system
without FMN support with and without the presence of an L2-TLB. The higher this value
is, the more potential for improvement from address translation optimizations exists. Two
baseline systems are listed with respect to their L1-TLB configurations: (i) B denotes the
baseline L1-TLBs from Section 5.8, while (ii) HB denotes a Half-Baseline where the L1-TLBs
are half in size (same associativity as B). The HB configuration is included as a potential proxy
for future systems. Specifically, the intention is to study what would be the effect of growing
data footprints. Since scaling the data footprint of existing applications is not practical due to
excessive simulation times, we instead scale down the size of the TLB to half. While this is only
an approximation, we believe that in lieu of an actual future application with a large footprint,
this is a relevant and thus valuable measurement that is well defined and feasible today.
[Figure: % execution time servicing L1-TLB misses (y-axis, 0 to 20%), per workload (apache, TPC-C2, TPC-C1, canneal, cassandra, classification, cloud9, nutch, streaming); series: B, B+L2, HB, HB+L2.]
Figure 5.24: Percentage of execution time spent in L1-TLB miss handling.
Figure 5.25 depicts the execution time reduction achieved by adding the aforementioned
L2-TLB to the respective B and HB baselines. These results complement those in Figure 5.24,
translating them into performance; negative values signify performance degradation. This
L2-TLB only hosts translations for 8KB pages. Table 5.3 reports the L2-TLB hit-rates. The
performance measurements show that adding a per-core private L2-TLB can in some cases
yield negligible performance benefits (cloud9, cassandra) or even cause performance degradation
(canneal, classification).
[Figure: % execution time reduction from adding the L2-TLB to B and HB respectively, per workload; series: B+L2, HB+L2.]
Figure 5.25: Percentage of execution time reduction due to L2-TLB.

Workload         B + L2   HB + L2
apache           67.9     79.2
TPC-C2           74.4     80.9
TPC-C1           87.2     80.6
canneal          31.0     54.1
cassandra        63.8     31.1
classification   4.2      1.8
cloud9           59.8     80.6
nutch            88.2     87.3
streaming        93.7     95.7

Table 5.3: L2-TLB Hit-Rate (%)

These observations were surprising given that many systems employ such a structure and it is
usually considerable in size (e.g., 1K entries or more). Unfortunately, to the best of our
knowledge, there are no published results on the performance benefit of an L2-TLB other than
the MPMI reduction it can achieve. There are two factors - pertinent to the L2-TLB design -
that set the stage for its poor performance:
• The L2-TLB access latency following an L1-TLB miss is considerable. Intel reports seven
cycles for a 512-entry 4-way SA L2-TLB (SandyBridge) [38]. We model an eight-cycle
L2-TLB access latency because our TLB has double the associativity and size (i.e., it is
1024-entry 8-way SA). This latency is comparable to a page walk in which all four accesses
hit in a 2-cycle L1 cache.
• The L2-TLB is probed in series with the page tables. Thus, on an L2-TLB miss, an extra
8 cycles of latency are added to the page-walk latency.
Section 5.11 discusses this issue and proposes an L2-TLB bypassing mechanism.
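The latency comparison above amounts to simple arithmetic: a four-level walk whose page-table accesses all hit in a 2-cycle L1 cache costs 4 × 2 = 8 cycles, matching the modeled L2-TLB lookup latency. A minimal sketch follows; only the 2-cycle L1 figure comes from the text, while the L2 and L3 latencies are placeholder assumptions:

```python
# Best-case page-walk latency when every page-table access hits in the
# same cache level. Only the 2-cycle L1 latency comes from the text;
# the other levels are illustrative placeholders.
CACHE_LATENCY = {"L1": 2, "L2": 8, "L3": 30}
WALK_LEVELS = 4  # four sequential page-table accesses per walk

def page_walk_latency(hit_level="L1"):
    return WALK_LEVELS * CACHE_LATENCY[hit_level]
```

`page_walk_latency("L1")` evaluates to 8 cycles, the same as the modeled L2-TLB lookup, which is why a serialized L2-TLB miss can double the cost of such short walks.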
5.10.2 FMN’s Impact on L1-TLB Miss Latency
FMN’s goal was to reduce the time spent servicing L1-TLB misses. Figure 5.26 shows the
average L1-TLB miss latency reduction over HB. The first column corresponds to a dedicated
L2-TLB. The remaining columns report several FMN configurations labeled as (FMN indexing
scheme, FMN entries). FMN is configured as either a 1K-entry or 8K-entry direct-mapped per
core virtualized structure. The precise indexing scheme (precise) uses the least significant bits
of the virtual page number assuming oracle page size knowledge. The 8KB_VPN scheme assumes
accesses are to 8KB pages. If during allocation the page is found to be a superpage, then only
the 8KB page that triggered the miss is allocated in the FMN.
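The difference between the two indexing schemes can be sketched as follows (a hypothetical model assuming SPARC-style page sizes with 8KB base pages; the function names are illustrative):

```python
FMN_ENTRIES = 1024  # 1K-entry direct-mapped FMN
# Assumed SPARC-style page sizes; only 8KB and 4MB are named in the text.
PAGE_SHIFT = {"8KB": 13, "64KB": 16, "512KB": 19, "4MB": 22}

def fmn_index_precise(vaddr, page_size):
    """Precise scheme: index with the low bits of the actual VPN
    (assumes oracle knowledge of the page size)."""
    return (vaddr >> PAGE_SHIFT[page_size]) % FMN_ENTRIES

def fmn_index_8kb_vpn(vaddr):
    """8KB_VPN scheme: always index as if the access targets an 8KB page."""
    return (vaddr >> PAGE_SHIFT["8KB"]) % FMN_ENTRIES
```

Under 8KB_VPN, two addresses inside the same 4MB superpage map to different FMN entries, splintering the superpage's translation, which is why this scheme trails precise indexing for superpage-heavy workloads such as classification.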
As anticipated, the two schemes perform the same for workloads that mostly rely on 8KB
pages. The L2-TLB hurts the average L1-TLB miss latency for classification, and to a lesser
extent for canneal, as explained earlier. For workloads like cassandra or classification, the
8KB_VPN scheme does not reduce the TLB-miss latency by as much as the precise indexing.
A 1K-entry FMN with 8KB_VPN indexing improves the average L1-TLB miss latency across
all workloads by 31.4%, while a dedicated 1K-entry 8-way SA L2-TLB reduces it by 24.6%.
Figure 5.27 shows how this L1-TLB miss latency reduction is reflected in the percentage of
execution time spent servicing L1-TLB misses under FMN.
[Figure: % average L1-TLB miss latency reduction over HB (y-axis, -20 to 80; one bar is off-scale at -73.96), per workload plus AMEAN; series: HB+L2, HB+FMN(precise, 1K), HB+FMN(8KB_VPN, 1K), HB+FMN(precise, 8K), HB+FMN(8KB_VPN, 8K).]
Figure 5.26: FMN or L2-TLB: Percentage L1-TLB Miss latency reduction over HB.
The average L1-TLB miss latency is 28.2 cycles for HB, 21.3 cycles for HB+L2 and 19.3
cycles for HB + FMN(8KB VPN, 1K). This latency greatly depends on where in the cache
hierarchy (assuming an L2-TLB miss) the required page-table entries are cached. If we were to
look only at the page-walk latency, we would see that adding an L2-TLB (HB+L2 configuration)
increases it. There might be fewer TLB-Misses in the HB+L2 configuration, but they are
more costly because now the required page-table entries are cached in lower-levels of the cache
hierarchy as they are accessed further apart in time. For instance, streaming has a 162.4-cycle
L2-TLB miss latency under HB+L2, whereas its average L1-TLB miss latency is only 15 cycles due to
its high L2-TLB hit-rate.

[Figure: % execution time spent handling L1-TLB misses (y-axis, 0 to 16%), per workload; series: HB, HB+L2, HB+FMN(precise, 1K), HB+FMN(8KB_VPN, 1K), HB+FMN(precise, 8K), HB+FMN(8KB_VPN, 8K).]
Figure 5.27: FMN or L2-TLB: Percentage of execution time spent handling L1-TLB misses.
FMN’s benefit greatly relies on its operating scenarios, described in Section 5.2.1. Not
only should an FMN probe return before a page walk, but it should also (a) find its FMN
set cached and (b) find a translation within that set. Figure 5.28 presents a breakdown of
FMN probes for the FMN(8KB VPN, 1K) configuration. The lower portion of the stacked
column represents the FMN probes that not only returned before the page-walk but also had
a correct translation. The second stacked part corresponds to FMN probes that returned first
but without a translation, due to FMN’s size and associativity constraints, as is the case with
any tagged structure. The “FMN cache misses” portion of each bar, which is too small to be
visible, covers cases where the FMN set a probe mapped to was not cached anywhere in the
three-level on-chip cache hierarchy. Finally, the upper portion of each column corresponds to
FMN probes that were useless because the page walk returned first. In this latter case, FMN
data do not remain as hot in the caches as the corresponding page-table entries. Because FMN
is configured here as a private per-core structure (an evaluation of a shared FMN structure is
left for future work), there is no potential for translation sharing across cores. The page-table
entries, on the other hand, can be transparently shared across cores. Entries for the upper
page-table levels are more likely to remain hot in the caches, so superpage accesses see more
benefits.
5.10.3 FMN’s Effect on Average Memory Latency
As discussed earlier in this chapter, FMN targets a TLB-miss-handling latency reduction but
at the potential cost of higher memory latency. In this context, memory latency represents
the time needed to retrieve the data for a load instruction, or to perform a store, once the
virtual address translation is known. Since the FMN injects more memory references to the
cache hierarchy, the FMN data will now compete both with (a) the demand data needed by
the application and (b) the page-table entries.

[Figure: breakdown of FMN probes (0 to 100%), per workload; categories: FMN Hits (Useful), FMN Misses in cached FMN set, FMN Cache Misses, Useless (page walk returned first).]
Figure 5.28: Characterization of FMN probes for a 1K-entry per core FMN with the 8KB_VPN indexing scheme.

Figure 5.29 depicts the average memory
latency increase for the HB, HB+L2, and the HB+FMN configurations.
[Figure: % average memory latency increase over HB (y-axis, -2 to 2), per workload; series: HB+L2, HB+FMN(precise, 1K), HB+FMN(8KB_VPN, 1K).]
Figure 5.29: FMN or L2-TLB: Percentage memory latency increase over HB.
The 1K-FMN configuration incurs an increase in the average memory request latency in the
range of 0.3% to 1.72% for the simulated workloads. This increase is due to increased contention
for the same cache blocks, as Figure 5.4 showed. As our analytical model showed (Figure 5.9),
this increase can be amortized by a high TLB miss latency reduction. Alternatively, filtering
techniques could be used to throttle the number of issued FMN probes. Figure 5.29 also shows
that, contrary to FMN, the L2-TLB can slightly reduce memory latency, by up to 1.62%, most
likely due to reduced contention between demand requests and page-table entries in the private
L1 and L2 caches, a positive side-effect of the TLB-miss reduction.
5.10.4 FMN’s Effect on Performance
Figure 5.30 contrasts the performance benefit of a dedicated private L2-TLB with that of an
FMN. FMN is on average less effective than an L2-TLB, achieving at most a 1.9% performance
improvement over HB (for the streaming workload). However, FMN achieves this without any
dedicated hardware for translation storage. In a couple of cases, FMN performs better than the
dedicated L2-TLB; for instance, for canneal, FMN does not degrade performance. The Ideal-
FMN series in this graph is a utopian upper bound for FMN's performance: it assumes an
FMN with a fixed 2-cycle latency (the ideal L1-cache hit case) and no memory contention or
allocation requests. Its purpose is to show that if we could somehow reduce how often certain
less-than-ideal scenarios occur (e.g., contention with data requests), then FMN could approach
this performance. There is significant upside motivating further work on optimizing the FMN;
FMN filtering/replacement is one such optimization and is examined next.
[Figure: % execution time reduction over HB (y-axis, -4 to 14), per workload; series: HB+L2, HB+FMN(precise, 1K), HB+FMN(8KB_VPN, 1K), HB+Ideal-FMN.]
Figure 5.30: FMN or L2-TLB: Performance over HB.
FMN-Filtering Mechanism: Since FMN can in some cases cause negative interference with
other requests, we considered the FMN filtering and replacement optimizations discussed in
Sections 5.9.7 and 5.9.8, respectively. Unfortunately, neither results in a significant performance
speedup compared to FMN. The only exception is the application of filtering to TPC-C2: it
yields a 1.18% performance improvement over the FMN(precise, 1K) configuration. A more
robust mechanism is thus needed.
5.11 L2-TLB Bypassing
Our results in Figure 5.25 indicate that adding a per-core private L2-TLB can in some cases
yield negligible performance benefits (cloud9, cassandra) or even cause performance degradation
(canneal, classification). These observations were surprising given that many systems employ
such a structure and it is usually considerable in size (e.g., ∼1K entries). Unfortunately, to the
best of our knowledge, there are no published results on the performance benefit of an L2-TLB
other than the MPMI reduction it can achieve.
We identified two factors - pertinent to the L2-TLB design - that are responsible for its poor
performance: (i) the high L2-TLB access latency and (ii) the serialization of L2-TLB lookups
with page walks. The workload's access pattern, and thus its TLB behaviour, is the determining
factor. If page walks are rare because the L2-TLB hit-rate is high, then the extra L2-TLB
lookup latency added to the infrequent page walks is negligible. The page-walk latency matters
too: adding seven or eight cycles on top of a 50-cycle page walk is a small penalty by comparison.
However, if most page-walk requests hit in the upper two cache levels and the page walk takes
around 15 cycles, the overhead is considerable.
Two scenarios in which the L2-TLB is likely to hurt performance are: (i) when the workload's
translation footprint is too large and no working set fits in the L2-TLB. For example, in canneal
almost half the L2-TLB accesses are misses, even after warm-up. In these cases, the extra over-
head of the L2-TLB lookup is not counterbalanced by the page-walk latency it saves. (ii) When
most of a workload’s L1-TLB misses are to superpages, and the set-associative L2-TLB only
supports the smallest page size. For example, in classification most TLB misses are to 4MB
pages, and the L2-TLB only supports translations for 8KB pages. Thus, the L2-TLB hit-rate
is less than 10% (see Figure 3.6.4).
Redesigning the L2-TLB to support multiple page-sizes (e.g., by employing our superpage-
friendly TLB design TLBpred or splitting superpage translations into their smaller 8KB pages)
could potentially mitigate the second scenario’s overhead to some extent. However, L2-TLBs
with low hit-rates due to large workload footprints and/or poor TLB capacity utilization from
translations of different page sizes would still be an issue. Furthermore, the proposed bypassing
mechanism is a low-cost alternative to a drastic L2-TLB redesign.
Probing the L2-TLB in parallel with the page walk would, in theory, resolve this issue. The
L2-TLB access would no longer be serialized with the page walk, and the translation would be
retrieved with either the L2-TLB lookup latency (on an L2-TLB hit) or the page-walk latency
(on an L2-TLB miss). Unfortunately, this approach has several shortcomings. It unnecessarily
initiates page walks even on L2-TLB hits; since most of the simulated workloads have L2-TLB
hit-rates well above 60%, more than half of these page walks would be wasteful. Such page
walks would waste energy on extra cache lookups, increase on-chip memory traffic, and likely
displace useful demand data from the upper-level caches in the process. The approach would
also require significant modifications to the existing hardware page walkers. For example, on
an L2-TLB hit, the in-progress page walk should be canceled to limit some of the negative
side-effects; the most appropriate point to do so would likely be once the current page-walk
request, for one of the multiple page-table levels, returns. This cancellation would also increase
energy.
5.11.1 Proposed Solution: Bypassing the L2-TLB
We propose using an interval-based predictor to decide when to commence the page walk:
(a) immediately after an L1-TLB miss (L2-TLB bypassing) or (b) on an L2-TLB miss (the
system’s default option). A bypass condition triggers the transition from the default to the
L2-TLB bypassing option when:
(# L2-TLB hits <= M * # L2-TLB misses) and (# L2-TLB misses > threshold_misses)
The decision is made on a per-core basis. The first part of the expression ensures we start a
page walk on an L1-TLB miss only when the L2-TLB hit-rate is low; 50% was the empirical
hit-rate threshold value used for our workloads (M = 1). The second part ensures there is a
non-negligible number of L2-TLB misses within the selected interval; otherwise, the hit-rate
value would be meaningless. The threshold_misses value depends on the interval length: a
disproportionately large threshold_misses value for a small interval would never enable
L2-TLB bypassing. We found 1024 misses to be a good threshold for the evaluated intervals.
Table 5.4 lists all the possible scenarios for consecutive intervals. But first, the
following terms are defined:
• Interval: An interval determines the granularity at which bypassing decisions are made.
This evaluation defines an interval as a fixed number of memory replies, excluding page-walk
replies, received per core. Using a timestamp-based interval would be an alternative.
• Bypass Interval: An interval during which a page walk is initiated on an L1-TLB miss.
• Fallback Interval: An interval during which the system falls back to no L2-TLB bypassing.
A fallback interval occurs trigger intervals after a positive bypassing decision was
made. Having fallback intervals ensures that bypassing decisions adapt to the workloads'
changing behaviour and that potentially harmful decisions are rectified.
As Table 5.4 indicates, once a decision to bypass the L2-TLB has been made, all subsequent
intervals conform to it until a fallback interval is reached. At that time, bypassing is disabled
for one interval and the decision is then re-evaluated. Translations are allocated in the L2-TLB
even when it is bypassed to ensure it remains warmed-up for the next fallback interval.
Interval n      Interval n+1   Trigger / Explanation
No Bypassing    No Bypassing   The bypassing condition was not met.
No Bypassing    Bypassing      The bypass condition was met; starting the countdown for the next fallback interval.
Bypassing       No Bypassing   Interval n+1 is a fallback interval.
Bypassing       Bypassing      Default, unless a fallback interval was reached.

Table 5.4: L2-TLB Bypassing Scenarios
Fallback intervals allow the technique to adapt to the workloads' changing patterns and
also safeguard against poor decisions. However, when there are no such changes, the fallback
intervals can be unnecessary. We propose to further adapt the frequency of these fallback
intervals based on how stable (i.e., repetitious) the application's behaviour proves to be. That
is, if two consecutive fallback intervals4 reach the same decision, we double the trigger value to
make fallbacks less frequent. If two consecutive fallback intervals contradict each other, we
halve the trigger value, unless the minimum trigger value has been reached. We use an
empirically determined initial trigger value of 10 for this evaluation.
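Putting the bypass condition, the fallback intervals, and the adaptive trigger together, the per-core decision logic can be sketched as below. This is an illustrative model: M = 1, threshold_misses = 1024, and the initial trigger of 10 are the values used in this evaluation, while the class and method names are hypothetical:

```python
class L2TLBBypassPredictor:
    """Per-core interval-based L2-TLB bypass predictor (illustrative sketch)."""

    M = 1                    # bypass when hits <= M * misses
    THRESHOLD_MISSES = 1024  # minimum misses for the hit-rate to be meaningful
    MIN_TRIGGER = 10         # initial and minimum fallback trigger

    def __init__(self):
        self.state = "normal"  # "normal" | "bypass" | "fallback"
        self.countdown = 0
        self.trigger = self.MIN_TRIGGER
        self.prev_fallback_decision = None

    def _condition(self, hits, misses):
        return hits <= self.M * misses and misses > self.THRESHOLD_MISSES

    def end_of_interval(self, l2_hits, l2_misses):
        """Advance one interval; returns True if the next interval bypasses."""
        if self.state == "bypass":
            self.countdown -= 1
            if self.countdown == 0:
                self.state = "fallback"  # next interval probes the L2-TLB again
                return False
            return True
        decision = self._condition(l2_hits, l2_misses)
        if self.state == "fallback":
            # Consecutive fallback intervals agreeing -> rarer fallbacks;
            # contradicting -> more frequent fallbacks (floor at MIN_TRIGGER).
            if decision == self.prev_fallback_decision:
                self.trigger *= 2
            else:
                self.trigger = max(self.MIN_TRIGGER, self.trigger // 2)
            self.prev_fallback_decision = decision
        if decision:
            self.state = "bypass"
            self.countdown = self.trigger
            return True
        self.state = "normal"
        return False
```

Note that translations are still allocated into the L2-TLB during bypass intervals, as described above, so the structure stays warm for the next fallback interval; the model only captures the probe/bypass decision.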
The proposed technique strikes a balance between alleviating the latency overhead of an
L2-TLB lookup that is likely to miss and blindly initiating all page walks in parallel with
L2-TLB lookups. It does not save L2-TLB lookup energy, as translations are allocated in the
second-level TLB at all times to keep the structure warm. No changes to the hardware page
walker are required: no page walks need to be aborted because an L2-TLB lookup returned
earlier, as in the scenario above; page walks simply start earlier during bypassing intervals.
The proposed scheme treats the L2-TLB as a black box and decides whether it is likely
that a translation, any translation, will hit in it. But one can also envision different types of
L2-TLB bypassing techniques. For example, predicting whether a specific translation would
hit in the L2-TLB might be an interesting future direction, albeit at a higher area cost. The
L2-TLB bypassing predictor design could also be reminiscent of predictors for large stacked
DRAM caches, which determine whether the tag lookup should proceed in parallel with the
data lookup, or whether the DRAM cache should be bypassed and main memory accessed
directly when a miss is predicted [50, 62].
5.11.2 L2-TLB Bypassing: Evaluation
This section evaluates the proposed L2-TLB bypassing predictor. Figure 5.31 presents the
percentage of execution time reduction over HB when using L2-TLB bypassing with different
(interval size, misses threshold) configurations. L2-TLB bypassing benefits classification and
cloud9, reducing to close to zero the performance degradation the L2-TLB causes for
classification. For the remaining workloads, it either has no impact, as expected, or slightly
decreases the L2-TLB's benefits. Canneal, the other workload targeted, sees no benefit due to
the selected bypassing condition; it needs a slightly higher hit-rate threshold.

4Consecutive fallback intervals are not consecutive intervals.
[Figure: % execution time reduction over HB via bypassing (interval, misses threshold) (y-axis, -2 to 6), per workload; series: HB+L2, Bypassing(100K, 0), Bypassing(100K, 1024), Bypassing(500K, 0), Bypassing(500K, 1024).]
Figure 5.31: Percentage of execution time reduction with L2-TLB bypassing.
5.12 Concluding Remarks
This chapter presented our Forget-Me-Not TLB, a cacheable TLB design. FMN reduces
TLB-miss latency without using any dedicated on-chip translation storage; instead, it uses the
existing on-chip cache capacity to transparently store translation entries on demand. A private
per-core 1K-entry FMN configuration reduces L1-TLB miss latency by up to 45% on a set of
commercial and cloud workloads. However, it also increases the average memory request latency
by up to 1.72%, yielding at most a 1.9% performance improvement. This chapter also presented
dynamic selective L2-TLB bypassing, a technique that results in more robust performance when
using an L2-TLB. We were motivated by the observation that an L2-TLB, which contrary to
the FMN has a fixed access latency, can hurt performance in some cases. Our technique
dynamically determines when the page walk should commence immediately after an L1-TLB
miss, thus bypassing the L2-TLB; it reduces the performance degradation in classification over
the baseline from 1.61% to almost zero (0.02%).
Chapter 6
Concluding Remarks
Address translation is and will continue to be an intrinsic facility of computer systems, at
least in the foreseeable future. However, as both the software (applications, programming
paradigms) and hardware architectures continue to evolve, we need to continuously revisit the
hardware and/or software facilities that support it. As is often the case in computer architecture
research, computer architects opt for making the common case fast. But the current diversity
of workloads makes identifying this common case more nuanced than in the past, challenging
rigid hardware designs that are biased by designing towards this common case. This thesis
advocates for TLB designs and policies that dynamically adapt to the workloads’ behaviour for
a judicious use of the available on-chip resources.
To understand which aspects of workload behaviour and system architecture influence TLB
usage in a chip multiprocessor system, this thesis analyzed the TLB-related
behaviour for a set of commercial and cloud workloads (Chapter 3). These workloads stress
the memory subsystem, and thus the existing address translation infrastructure. We classify
our measurements according to the following taxonomy: (i) characteristics inherent to the
workloads, that is, characteristics or metrics not influenced by the architecture of translation
caching structures like the TLBs, and (ii) other metrics that are influenced by these structures’
architecture. The former helps us relate the application requirements to system design choices and
identify opportunities for optimization. The latter helps us identify shortcomings of existing
state-of-the-art TLB hierarchies. The analysis covered a broad spectrum of questions: from
the sizing requirements of translation caching structures and their per-core variations, to the
reach and lifetime of different contexts and the frequency of translation modifications, to
more nuanced observations about the compressibility of translation entries and the
predictability of the cache block address within a page that triggers a TLB miss. A key
result of our analysis was quantifying the drastically different page size usage distributions across
workloads, with a bias for one superpage size when superpages are used. Our TLB capacity
sensitivity study shows this characteristic is at odds with current split L1-TLB designs that
make rigid sizing decisions about the translation capacity allocated to each page size.
The Prediction-Based Superpage-Friendly TLB designs proposed in Chapter 4 target this
discrepancy. Their key ingredient is a highly accurate superpage predictor that predicts, ahead
of time, whether the next access is to a small page or to a superpage. A small 128-entry predictor
table with a meager 32B of storage has an average misprediction rate of 0.4% across all our
workloads. This predictor enables TLBpred, a set-associative TLB where translations of multiple
page sizes can co-exist as needed. A 256-entry 4-way SA TLBpred has comparable energy to a
much smaller 48-entry FA TLB, which has significantly higher MPMI and cannot scale
as well. Chapter 4 also presented an evaluation of the previously proposed Skewed-TLB [69]
and augmented it with a predictor that extends its per-page-size effective associativity.
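As an illustrative sketch of the mechanism, the predictor can be a small table of two-bit saturating counters whose prediction selects which address bits index the set-associative TLB. The table size matches the 32B predictor above, but the PC-based hash and the page sizes shown are assumptions made for this example, not the exact design of Chapter 4.

```python
SMALL_PAGE_BITS = 13   # assumed 8KB base pages
SUPERPAGE_BITS = 22    # assumed 4MB superpages
NUM_SETS = 64          # e.g., a 256-entry, 4-way set-associative TLB


class SuperpagePredictor:
    """128 two-bit saturating counters: 32B of predictor state in total."""

    def __init__(self, entries=128):
        self.entries = entries
        self.table = [0] * entries  # counters start biased to "small page"

    def predict(self, pc):
        # True => predict the access touches a superpage.
        return self.table[pc % self.entries] >= 2

    def train(self, pc, was_superpage):
        # Update the counter with the actual page size after translation.
        i = pc % self.entries
        if was_superpage:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def tlb_set_index(vaddr, superpage):
    # Index with bits above the predicted page offset, so translations of
    # different page sizes can coexist in one set-associative structure.
    page_bits = SUPERPAGE_BITS if superpage else SMALL_PAGE_BITS
    return (vaddr >> page_bits) % NUM_SETS
```

On a misprediction the lookup is simply retried with the other page size's index bits, which is why a low misprediction rate is essential to the design.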
Finally, Chapter 5 presented our Forget-Me-Not (FMN) design, a cacheable TLB that uses
the on-chip caches to host translations as needed. A per core private 1K-entry direct-mapped
FMN with 8KB VPN indexing improves the arithmetic mean of the L1-TLB miss latency across
all workloads by 31.4%, over a baseline with only L1-TLBs, while a dedicated private 1K-entry
8-way SA L2-TLB improves it by 24.6%. Nevertheless, the overall performance impact
was relatively small. Even with a dedicated L2-TLB, our evaluation shows small
performance benefits of at most 5.6%, and in some cases performance degradation of up to 1.6%.
We also proposed an L2-TLB bypassing mechanism as a potential first-step solution to limit
the latter. One of the key takeaways of this work is the observation that TLBs, memory
accesses, and cached translation entries, be it in page-tables or the FMN, are all parts of a
highly interrelated ecosystem. Optimizing one aspect of this ecosystem (e.g., by reducing the
frequency of TLB misses, or reducing the latency of a TLB miss) has ramifications, often
unwelcome, for another aspect of the ecosystem. For example, reducing the percentage of
execution time spent servicing TLB misses does not necessarily imply a performance speedup.
Thus, it is imperative that we do not think of these address translation components in a vacuum,
even though it is significantly easier to do so.
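To make the FMN's operation concrete, the following sketch shows how a direct-mapped FMN could map a virtual page number to the physical address of the cache-resident entry that may hold its translation. The base address of the reserved region and the per-entry size are assumptions for illustration.

```python
FMN_BASE = 0x8000_0000   # assumed base of a reserved physical region
FMN_ENTRIES = 1024       # 1K-entry, direct-mapped
ENTRY_SIZE = 16          # assumed bytes per cached translation entry
PAGE_BITS = 13           # 8KB pages: the VPN is vaddr >> 13


def fmn_entry_address(vaddr):
    """Physical address of the FMN slot that may hold vaddr's translation."""
    vpn = vaddr >> PAGE_BITS
    index = vpn % FMN_ENTRIES   # direct-mapped: the VPN selects one slot
    return FMN_BASE + index * ENTRY_SIZE

# On an L1-TLB miss, this address is fetched through the ordinary cache
# hierarchy: a cache hit services the miss quickly, while a miss falls back
# to the page walk. Because distinct VPNs can map to the same slot, the
# fetched entry's tag (full VPN plus context) must be verified before use.
```

The modular indexing also shows why the FMN competes with application data for specific cache sets, motivating the more flexible indexing schemes discussed as future work below.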
6.1 Future Research Directions
There are a multitude of ways to extend not only this research, but also
address translation support in general. For example, one of the FMN's design challenges is to ensure
the TLB-miss latency reduction does not come at a significant expense of the memory latency
of regular application data. This work relies on the default allocation and replacement policies
of the caches for this purpose without distinguishing between FMN data and other cached data.
Ideally, the FMN would occupy cache space that is not utilized, in the short-term, by other
data. A dead block inspired predictor could predict which cache blocks, at each cache-level, are
least likely to be used in the near future. However, the current FMN implementation might not
be able to take advantage of these cache blocks because of how the FMN maps to the various
cache sets. Reserving a larger physical memory region for FMN, similar to the ballooning
memory allocators used in virtualization, along with a more flexible indexing scheme could
potentially take advantage of this additional space. Dynamic FMN resizing, compressing more
FMN entries within a cache line, or prefetching could further extend the FMN's coverage.
This work focused solely on data TLBs in terms of analysis and evaluation. However,
instruction TLBs can also suffer from growing instruction footprints, especially for
OLTP-style workloads such as TPC-C. They could thus be an interesting research avenue, especially since
front-end processor stalls due to I-TLB misses would likely make any such optimizations quite
impactful. Address translation optimizations in virtualized systems may also be much more
impactful, given the higher overhead of two-dimensional page-walks.
Having the OS or the programmer provide hints to the underlying architecture about the
criticality of different memory regions (e.g., via ISA extensions) or the Quality-of-Service re-
quirements of a process could help the dynamic hardware policies make more informed deci-
sions. For example, when two processes stress a TLB but only one of them is critical to the
user, the TLB controller could either filter out translations for the low-importance process or
use a context-aware TLB indexing scheme to limit its negative interference with the translation
caching structure. Instead of relying on the hardware to dynamically relearn what the pro-
grammer already knows about a workload’s important data structures, this information could
be explicitly communicated to the hardware.
Bibliography
[1] D. H. Albonesi, “Selective cache ways: On-demand cache resource allocation,”
in Proceedings of the 32nd Annual ACM/IEEE International Symposium on
Microarchitecture, ser. MICRO 32. Washington, DC, USA: IEEE Computer Society,
1999, pp. 248–259. [Online]. Available: http://dl.acm.org/citation.cfm?id=320080.320119
[2] AMD, “AMD I/O Virtualization Technology (IOMMU) Specification.” [Online].
Available: http://developer.amd.com/wordpress/media/2012/10/34434-IOMMU-Rev_1.26_2-11-09.pdf
[3] AMD, “AMD-V nested paging,” 2008, [White Paper; accessed May-2017].
[Online]. Available: http://developer.amd.com/wordpress/media/2012/10/NPT-WP-1%201-final-TM.pdf
[4] AMD, “Software Optimization Guide for AMD Family 10h and 12h Processors,” 2011,
[Online; accessed August-2014]. [Online]. Available: http://support.amd.com/TechDocs/
40546.pdf
[5] AMD, “Software Optimization Guide for AMD Family 15h Processors,” 2014, [Online;
accessed February-2017]. [Online]. Available: https://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf
[6] N. Amit, M. Ben-Yehuda, and B.-A. Yassour, “IOMMU: Strategies for mitigating the
IOTLB bottleneck,” in WIOSCA 2010: Sixth Annual Workshop on the Interaction between
Operating Systems and Computer Architecture, 2010.
[7] ARM, “ARM Cortex-A53 MPCore Processor, Technical Reference Manual,” [PDF accessed
June-2017]. [Online]. Available: https://static.docs.arm.com/ddi0500/f/DDI0500.pdf
[8] ARM, “ARM Cortex-A72 MPCore Processor, Technical Reference Manual,” [PDF
accessed February-2017]. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.100095_0003_06_en/cortex_a72_mpcore_trm_100095_0003_06_en.pdf
[9] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau, Operating Systems: Three Easy Pieces,
0th ed. Arpaci-Dusseau Books, May 2015.
[10] T. W. Barr, A. L. Cox, and S. Rixner, “Translation caching: Skip, don’t walk (the
page table),” in Proceedings of the 37th Annual International Symposium on Computer
Architecture, ser. ISCA ’10. New York, NY, USA: ACM, 2010, pp. 48–59. [Online].
Available: http://doi.acm.org/10.1145/1815961.1815970
[11] T. W. Barr, A. L. Cox, and S. Rixner, “SpecTLB: A mechanism for speculative address
translation,” in Proceedings of the 38th Annual International Symposium on Computer
Architecture, ser. ISCA ’11. New York, NY, USA: ACM, 2011, pp. 307–318. [Online].
Available: http://doi.acm.org/10.1145/2000064.2000101
[12] A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift, “Efficient virtual memory
for big memory servers,” in Proceedings of the 40th Annual International Symposium on
Computer Architecture, ser. ISCA ’13. New York, NY, USA: ACM, 2013, pp. 237–248.
[Online]. Available: http://doi.acm.org/10.1145/2485922.2485943
[13] A. Basu, M. D. Hill, and M. M. Swift, “Reducing memory reference energy with
opportunistic virtual caching,” in Proceedings of the 39th Annual International Symposium
on Computer Architecture, ser. ISCA ’12. Washington, DC, USA: IEEE Computer
Society, 2012, pp. 297–308. [Online]. Available: http://dl.acm.org/citation.cfm?id=
2337159.2337194
[14] M. Ben-Yehuda, J. Xenidis, M. Ostrowski, K. Rister, A. Bruemmer, and L. van Doorn,
“The price of safety: Evaluating IOMMU performance,” in OLS ’07: The 2007 Ottawa
Linux Symposium, July 2007, pp. 9–20.
[15] R. Bhargava, B. Serebrin, F. Spadini, and S. Manne, “Accelerating two-dimensional
page walks for virtualized systems,” in Proceedings of the 13th International Conference
on Architectural Support for Programming Languages and Operating Systems, ser.
ASPLOS XIII. New York, NY, USA: ACM, 2008, pp. 26–35. [Online]. Available:
http://doi.acm.org/10.1145/1346281.1346286
[16] A. Bhattacharjee, “Large-reach memory management unit caches,” in Proceedings
of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, ser.
MICRO-46. New York, NY, USA: ACM, 2013, pp. 383–394. [Online]. Available:
http://doi.acm.org/10.1145/2540708.2540741
[17] A. Bhattacharjee, “Preserving the Virtual Memory Abstraction,” May 2017, [ACM
Sigarch Blog; accessed September-2017]. [Online]. Available: https://www.sigarch.org/
preserving-the-virtual-memory-abstraction/
[18] A. Bhattacharjee, “Translation-triggered prefetching,” in Proceedings of the Twenty-
Second International Conference on Architectural Support for Programming Languages
and Operating Systems, ser. ASPLOS ’17. New York, NY, USA: ACM, 2017, pp. 63–76.
[Online]. Available: http://doi.acm.org/10.1145/3037697.3037705
[19] A. Bhattacharjee, D. Lustig, and M. Martonosi, “Shared last-level TLBs for chip
multiprocessors,” in Proceedings of the 2011 IEEE 17th International Symposium
on High Performance Computer Architecture, ser. HPCA ’11. Washington, DC,
USA: IEEE Computer Society, 2011, pp. 62–63. [Online]. Available: https:
//doi.org/10.1109/HPCA.2011.5749717
[20] A. Bhattacharjee and M. Martonosi, “Characterizing the TLB behavior of emerging
parallel workloads on chip multiprocessors,” in Proceedings of the 2009 18th International
Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’09.
Washington, DC, USA: IEEE Computer Society, 2009, pp. 29–40. [Online]. Available:
http://dx.doi.org/10.1109/PACT.2009.26
[21] A. Bhattacharjee and M. Martonosi, “Inter-core cooperative TLB for chip multiprocessors,”
in Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for
Programming Languages and Operating Systems, ser. ASPLOS XV. New York, NY,
USA: ACM, 2010, pp. 359–370. [Online]. Available: http://doi.acm.org/10.1145/1736020.
1736060
[22] C. Bienia, “Benchmarking modern multiprocessors,” Ph.D. dissertation, Princeton Uni-
versity, January 2011.
[23] D. L. Black, R. F. Rashid, D. B. Golub, and C. R. Hill, “Translation lookaside
buffer consistency: A software approach,” in Proceedings of the Third International
Conference on Architectural Support for Programming Languages and Operating Systems,
ser. ASPLOS III. New York, NY, USA: ACM, 1989, pp. 113–122. [Online]. Available:
http://doi.acm.org/10.1145/70082.68193
[24] J. Bradford, J. Dale, K. Fernsler, T. Heil, and J. Rose, “Multiple page size address
translation incorporating page size prediction,” Jun. 15, 2010, U.S. Patent 7,739,477.
[Online]. Available: https://www.google.com/patents/US7739477
[25] I. Burcea and A. Moshovos, “Phantom-BTB: a virtualized branch target buffer design,” in
Proceedings of the 14th International Conference on Architectural Support for Programming
Languages and Operating Systems, ser. ASPLOS ’09. New York, NY, USA: ACM, 2009,
pp. 313–324. [Online]. Available: http://doi.acm.org/10.1145/1508244.1508281
[26] I. Burcea, S. Somogyi, A. Moshovos, and B. Falsafi, “Predictor virtualization,” in
Proceedings of the 13th International Conference on Architectural Support for Programming
Languages and Operating Systems, ser. ASPLOS XIII. New York, NY, USA: ACM, 2008,
pp. 157–167. [Online]. Available: http://doi.acm.org/10.1145/1346281.1346301
[27] M. Clark, “A new X86 core architecture for the next generation of computing,”
August 2016, [Presentation in Hot Chips Symposium, accessed September-2017].
[Online]. Available: https://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.23-Tuesday-Epub/HC28.23.90-High-Perform-Epub/HC28.23.930-X86-core-MikeClark-AMD-final_v2-28.pdf
[28] G. Cox and A. Bhattacharjee, “Efficient address translation for architectures with
multiple page sizes,” in Proceedings of the Twenty-Second International Conference
on Architectural Support for Programming Languages and Operating Systems, ser.
ASPLOS ’17. New York, NY, USA: ACM, 2017, pp. 435–448. [Online]. Available:
http://doi.acm.org/10.1145/3037697.3037704
[29] P. J. Denning, “Virtual memory,” ACM Comput. Surv., vol. 2, no. 3, pp. 153–189, Sep.
1970. [Online]. Available: http://doi.acm.org/10.1145/356571.356573
[30] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak,
A. D. Popescu, A. Ailamaki, and B. Falsafi, “Clearing the clouds: a study of emerging
scale-out workloads on modern hardware,” in Proceedings of the seventeenth International
Conference on Architectural Support for Programming Languages and Operating Systems,
ser. ASPLOS ’12. New York, NY, USA: ACM, 2012, pp. 37–48. [Online]. Available:
http://doi.acm.org/10.1145/2150976.2150982
[31] J. Fotheringham, “Dynamic storage allocation in the Atlas computer, including an
automatic use of a backing store,” Commun. ACM, vol. 4, no. 10, pp. 435–436, Oct. 1961.
[Online]. Available: http://doi.acm.org/10.1145/366786.366800
[32] J. Gandhi, A. Basu, M. D. Hill, and M. M. Swift, “BadgerTrap: A tool to instrument
x86-64 TLB misses,” SIGARCH Comput. Archit. News, vol. 42, no. 2, pp. 20–23, Sep.
2014. [Online]. Available: http://doi.acm.org/10.1145/2669594.2669599
[33] B. Hall, P. Bergner, A. Housfater, M. Kandasamy, T. Magno, A. Mericas, S. Munroe,
M. Oliveira, B. Schmidt, W. Schmidt et al., Performance Optimization and Tuning
Techniques for IBM Power Systems Processors Including IBM POWER8. IBM Redbooks,
2017. [Online]. Available: https://books.google.ca/books?id=7ph0CgAAQBAJ
[34] P. Hammarlund, “4th Generation Intel Core Processor, codenamed Haswell,”
August 2013, [Presentation in Hot Chips Symposium, accessed August-2014].
[Online]. Available: http://www.hotchips.org/wp-content/uploads/hc_archives/hc25/HC25.80-Processors2-epub/HC25.27.820-Haswell-Hammarlund-Intel.pdf
[35] N. Hardavellas, S. Somogyi, T. F. Wenisch, R. E. Wunderlich, S. Chen, J. Kim,
B. Falsafi, J. C. Hoe, and A. G. Nowatzyk, “SimFlex: a fast, accurate, flexible
full-system simulation framework for performance evaluation of server architecture,”
SIGMETRICS Perform. Eval. Rev., vol. 31, no. 4, pp. 31–34, Mar. 2004. [Online].
Available: http://doi.acm.org/10.1145/1054907.1054914
[36] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, “Reactive NUCA:
Near-optimal block placement and replication in distributed caches,” in Proceedings
of the 36th Annual International Symposium on Computer Architecture, ser.
ISCA ’09. New York, NY, USA: ACM, 2009, pp. 184–195. [Online]. Available:
http://doi.acm.org/10.1145/1555754.1555779
[37] Intel, “Intel Virtualization Technology for Directed I/O, Architecture Specification.”
[Online]. Available: https://www.intel.com/content/dam/www/public/us/en/documents/
product-specifications/vt-directed-io-spec.pdf
[38] Intel, “Intel 64 and IA-32 Architectures Optimization Reference Manual,” June 2016,
[PDF accessed February-2017]. [Online]. Available: http://www.intel.com/content/dam/
www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
[39] Intel, “Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A:
System Programming Guide, Part 1,” April 2016, [PDF accessed June-2016]. [On-
line]. Available: http://www.intel.com/content/www/us/en/architecture-and-technology/
64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
[40] Intel, “5-Level Paging and 5-Level EPT,” 2017, [White Paper; revision 1.1;
accessed September-2017]. [Online]. Available: https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf
[41] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer, “High performance
cache replacement using re-reference interval prediction (RRIP),” in Proceedings
of the 37th Annual International Symposium on Computer Architecture, ser.
ISCA ’10. New York, NY, USA: ACM, 2010, pp. 60–71. [Online]. Available:
http://doi.acm.org/10.1145/1815961.1815971
[42] G. B. Kandiraju and A. Sivasubramaniam, “Going the distance for TLB prefetching: An
application-driven study,” in Proceedings of the 29th Annual International Symposium on
Computer Architecture, ser. ISCA ’02. Washington, DC, USA: IEEE Computer Society,
2002, pp. 195–206. [Online]. Available: http://dl.acm.org/citation.cfm?id=545215.545237
[43] V. Karakostas, J. Gandhi, A. Cristal, M. D. Hill, K. S. McKinley, M. Nemirovsky,
M. M. Swift, and O. S. Unsal, “Energy-efficient address translation,” in 2016 IEEE
International Symposium on High Performance Computer Architecture (HPCA), March
2016, pp. 631–643. [Online]. Available: https://doi.org/10.1109/HPCA.2016.7446100
[44] V. Karakostas, O. S. Unsal, M. Nemirovsky, A. Cristal, and M. Swift, “Performance
analysis of the memory management unit under scale-out workloads,” in 2014 IEEE
International Symposium on Workload Characterization (IISWC), Oct 2014, pp. 1–12.
[Online]. Available: https://doi.org/10.1109/IISWC.2014.6983034
[45] V. Karakostas, J. Gandhi, F. Ayar, A. Cristal, M. D. Hill, K. S. McKinley, M. Nemirovsky,
M. M. Swift, and O. Unsal, “Redundant memory mappings for fast access to large
memories,” in Proceedings of the 42nd Annual International Symposium on Computer
Architecture, ser. ISCA ’15. New York, NY, USA: ACM, 2015, pp. 66–78. [Online].
Available: http://doi.acm.org/10.1145/2749469.2749471
[46] S. Kaxiras and A. Ros, “A new perspective for efficient virtual-cache coherence,” in
Proceedings of the 40th Annual International Symposium on Computer Architecture,
ser. ISCA ’13. New York, NY, USA: ACM, 2013, pp. 535–546. [Online]. Available:
http://doi.acm.org/10.1145/2485922.2485968
[47] D. Kim, H. Kim, and J. Huh, “Virtual snooping: Filtering snoops in virtualized
multi-cores,” in Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium
on Microarchitecture, ser. MICRO ’43. Washington, DC, USA: IEEE Computer Society,
2010, pp. 459–470. [Online]. Available: http://dx.doi.org/10.1109/MICRO.2010.16
[48] H. Q. Le, W. J. Starke, J. S. Fields, F. P. O’Connell, D. Q. Nguyen, B. J. Ronchetti,
W. M. Sauer, E. M. Schwarz, and M. T. Vaden, “IBM POWER6 microarchitecture,” IBM
Journal of Research and Development, vol. 51, no. 6, pp. 639–662, Nov 2007.
[49] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT:
An integrated power, area, and timing modeling framework for multicore and manycore
architectures,” in Proceedings of the 42nd Annual IEEE/ACM International Symposium
on Microarchitecture, ser. MICRO 42. New York, NY, USA: ACM, 2009, pp. 469–480.
[Online]. Available: http://doi.acm.org/10.1145/1669112.1669172
[50] G. H. Loh and M. D. Hill, “Efficiently enabling conventional block sizes for very large
die-stacked DRAM caches,” in Proceedings of the 44th Annual IEEE/ACM International
Symposium on Microarchitecture, ser. MICRO-44. New York, NY, USA: ACM, 2011, pp.
454–464. [Online]. Available: http://doi.acm.org/10.1145/2155620.2155673
[51] P. Lotfi-Kamran, B. Grot, M. Ferdman, S. Volos, O. Kocberber, J. Picorel, A. Adileh,
D. Jevdjic, S. Idgunji, E. Ozer, and B. Falsafi, “Scale-out processors,” in Proceedings
of the 39th Annual International Symposium on Computer Architecture, ser. ISCA ’12.
Washington, DC, USA: IEEE Computer Society, 2012, pp. 500–511. [Online]. Available:
http://dl.acm.org/citation.cfm?id=2337159.2337217
[52] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg,
F. Larsson, A. Moestedt, and B. Werner, “Simics: A full system simulation
platform,” Computer, vol. 35, no. 2, pp. 50–58, Feb. 2002. [Online]. Available:
http://dx.doi.org/10.1109/2.982916
[53] R. McDougall and J. Mauro, Solaris Internals: Solaris 10 and OpenSolaris Kernel Archi-
tecture (Second Edition). Upper Saddle River, NJ, USA: Prentice Hall PTR, 2007.
[54] J. Navarro, S. Iyer, P. Druschel, and A. Cox, “Practical, transparent operating system
support for superpages,” SIGOPS Oper. Syst. Rev., vol. 36, no. SI, pp. 89–104, Dec. 2002.
[Online]. Available: http://doi.acm.org/10.1145/844128.844138
[55] M. Papadopoulou, X. Tong, A. Seznec, and A. Moshovos, “Prediction-based
superpage-friendly TLB designs,” in 2015 IEEE 21st International Symposium on High
Performance Computer Architecture (HPCA), Feb 2015, pp. 210–222. [Online]. Available:
https://doi.org/10.1109/HPCA.2015.7056034
[56] C. H. Park, T. Heo, and J. Huh, “Efficient synonym filtering and scalable delayed
translation for hybrid virtual caching,” in 2016 ACM/IEEE 43rd Annual International
Symposium on Computer Architecture (ISCA), June 2016, pp. 90–102. [Online]. Available:
https://doi.org/10.1109/ISCA.2016.18
[57] B. Pham, A. Bhattacharjee, Y. Eckert, and G. H. Loh, “Increasing TLB reach by
exploiting clustering in page translations,” in Proceedings of the 2014 IEEE 20th
International Symposium on High Performance Computer Architecture, ser. HPCA ’14,
February 2014. [Online]. Available: https://doi.org/10.1109/HPCA.2014.6835964
[58] B. Pham, V. Vaidyanathan, A. Jaleel, and A. Bhattacharjee, “CoLT: Coalesced large-reach
TLBs,” in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium
on Microarchitecture, ser. MICRO-45. Washington, DC, USA: IEEE Computer Society,
2012, pp. 258–269. [Online]. Available: http://dx.doi.org/10.1109/MICRO.2012.32
[59] S. Phillips, “M7: Next Generation SPARC,” August 2014, [Presenta-
tion in Hot Chips Symposium, accessed February-2017]. [Online]. Avail-
able: http://www.oracle.com/us/products/servers-storage/servers/sparc-enterprise/
migration/m7-next-gen-sparc-presentation-2326292.html
[60] B. Pichai, L. Hsu, and A. Bhattacharjee, “Architectural support for address
translation on GPUs: Designing memory management units for CPU/GPUs with
unified address spaces,” in Proceedings of the 19th International Conference on
Architectural Support for Programming Languages and Operating Systems, ser.
ASPLOS ’14. New York, NY, USA: ACM, 2014, pp. 743–758. [Online]. Available:
http://doi.acm.org/10.1145/2541940.2541942
[61] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, “Adaptive insertion
policies for high performance caching,” in Proceedings of the 34th Annual International
Symposium on Computer Architecture, ser. ISCA ’07. New York, NY, USA: ACM, 2007,
pp. 381–391. [Online]. Available: http://doi.acm.org/10.1145/1250662.1250709
[62] M. K. Qureshi and G. H. Loh, “Fundamental latency trade-off in architecting DRAM
caches: Outperforming impractical SRAM-tags with a simple and practical design,”
in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on
Microarchitecture, ser. MICRO-45. Washington, DC, USA: IEEE Computer Society,
2012, pp. 235–246. [Online]. Available: http://dx.doi.org/10.1109/MICRO.2012.30
[63] R. Rahman, “Intel Xeon Phi Core Micro-architecture,” May 2013, [accessed August-
2014]. [Online]. Available: https://software.intel.com/sites/default/files/article/393195/
intel-xeon-phi-core-micro-architecture.pdf
[64] B. Romanescu, A. Lebeck, D. Sorin, and A. Bracy, “UNified Instruction/Translation/Data
(UNITD) coherence: One protocol to rule them all,” in High Performance Computer
Architecture (HPCA), 2010 IEEE 16th International Symposium on, Jan 2010, pp. 1–12.
[Online]. Available: https://doi.org/10.1109/HPCA.2010.5416643
[65] A. Ros and S. Kaxiras, “Complexity-effective multicore coherence,” in Proceedings of
the 21st International Conference on Parallel Architectures and Compilation Techniques,
ser. PACT ’12. New York, NY, USA: ACM, 2012, pp. 241–252. [Online]. Available:
http://doi.acm.org/10.1145/2370816.2370853
[66] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “DRAMSim2: A cycle accurate memory
system simulator,” IEEE Comput. Archit. Lett., vol. 10, no. 1, pp. 16–19, Jan. 2011.
[Online]. Available: http://dx.doi.org/10.1109/L-CA.2011.4
[67] J. H. Ryoo, N. Gulur, S. Song, and L. K. John, “Rethinking TLB designs
in virtualized environments: A very large part-of-memory TLB,” in Proceedings
of the 44th Annual International Symposium on Computer Architecture, ser.
ISCA ’17. New York, NY, USA: ACM, 2017, pp. 469–480. [Online]. Available:
http://doi.acm.org/10.1145/3079856.3080210
[68] A. Saulsbury, F. Dahlgren, and P. Stenstrom, “Recency-based TLB preloading,” in
Proceedings of the 27th Annual International Symposium on Computer Architecture,
ser. ISCA ’00. New York, NY, USA: ACM, 2000, pp. 117–127. [Online]. Available:
http://doi.acm.org/10.1145/339647.339666
[69] A. Seznec, “Concurrent support of multiple page sizes on a skewed associative TLB,”
IEEE Trans. Comput., vol. 53, no. 7, pp. 924–927, Jul. 2004. [Online]. Available:
http://dx.doi.org/10.1109/TC.2004.21
[70] A. Seznec, “A case for two-way skewed-associative caches,” in Proceedings of
the 20th Annual International Symposium on Computer Architecture, ser. ISCA
’93. New York, NY, USA: ACM, 1993, pp. 169–178. [Online]. Available: http:
//doi.acm.org/10.1145/165123.165152
[71] A. Seznec, “A new case for skewed-associativity,” Internal Publication No 1114, IRISA-
INRIA, Tech. Rep., 1997.
[72] M. Shah, R. Golla, G. Grohoski, P. Jordan, J. Barreh, J. Brooks, M. Greenberg,
G. Levinsky, M. Luttrell, C. Olson, Z. Samoail, M. Smittle, and T. Ziaja, “Sparc T4: A
dynamically threaded server-on-a-chip,” IEEE Micro, vol. 32, no. 2, pp. 8–19, Mar. 2012.
[Online]. Available: http://dx.doi.org/10.1109/MM.2012.1
[73] B. Sinharoy, J. A. V. Norstrand, R. J. Eickemeyer, H. Q. Le, J. Leenstra, D. Q. Nguyen,
B. Konigsburg, K. Ward, M. D. Brown, J. E. Moreira, D. Levitan, S. Tung, D. Hrusecky,
J. W. Bishop, M. Gschwind, M. Boersma, M. Kroener, M. Kaltenbach, T. Karkhanis,
and K. M. Fernsler, “IBM POWER8 processor core microarchitecture,” IBM Journal of
Research and Development, vol. 59, no. 1, pp. 2:1–2:21, Jan 2015.
[74] SPARC International, Inc., The SPARC Architecture Manual (Version 9).
Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1994.
[75] S. Srikantaiah and M. Kandemir, “Synergistic TLBs for high performance address
translation in chip multiprocessors,” in Proceedings of the 2010 43rd Annual IEEE/ACM
International Symposium on Microarchitecture, ser. MICRO ’43. Washington, DC,
USA: IEEE Computer Society, 2010, pp. 313–324. [Online]. Available: http:
//dx.doi.org/10.1109/MICRO.2010.26
[76] Sun Microsystems, “SPARC Joint Programming Specification 1 Implementation
Supplement: Sun UltraSPARC III,” 2002, [Online; accessed May-2017]. [Online].
Available: http://www.oracle.com/technetwork/server-storage/sun-sparc-enterprise/
documentation/sparc-3-usersmanual-2516678.pdf
[77] Sun Microsystems, “Schizo Programmer’s Reference Manual,” 2007.
[78] M. Talluri and M. D. Hill, “Surpassing the TLB performance of superpages with
less operating system support,” in Proceedings of the Sixth International Conference
on Architectural Support for Programming Languages and Operating Systems, ser.
ASPLOS VI. New York, NY, USA: ACM, 1994, pp. 171–182. [Online]. Available:
http://doi.acm.org/10.1145/195473.195531
[79] M. Talluri, S. Kong, M. D. Hill, and D. A. Patterson, “Tradeoffs in supporting two
page sizes,” in Proceedings of the 19th Annual International Symposium on Computer
Architecture, ser. ISCA ’92. New York, NY, USA: ACM, 1992, pp. 415–424. [Online].
Available: http://doi.acm.org/10.1145/139669.140406
[80] A. S. Tanenbaum, Modern Operating Systems, 2nd ed. Prentice Hall Press, 2002.
[81] P. J. Teller, “Translation-lookaside buffer consistency,” Computer, vol. 23, no. 6, pp.
26–36, Jun. 1990. [Online]. Available: http://dx.doi.org/10.1109/2.55498
[82] J. Vesely, A. Basu, M. Oskin, G. H. Loh, and A. Bhattacharjee, “Observations
and opportunities in architecting shared virtual memory for heterogeneous systems,”
in 2016 IEEE International Symposium on Performance Analysis of Systems
and Software (ISPASS), April 2016, pp. 161–171. [Online]. Available: https:
//doi.org/10.1109/ISPASS.2016.7482091
[83] C. Villavieja, V. Karakostas, L. Vilanova, Y. Etsion, A. Ramirez, A. Mendelson,
N. Navarro, A. Cristal, and O. S. Unsal, “DiDi: Mitigating the performance impact of
TLB shootdowns using a shared TLB directory,” in Proceedings of the 2011 International
Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’11.
Washington, DC, USA: IEEE Computer Society, 2011, pp. 340–349. [Online]. Available:
http://dx.doi.org/10.1109/PACT.2011.65
[84] Virtutech, “Simics Reference Manual (Simics Version 3.0),” 2007.
[85] H. Yoon and G. S. Sohi, “Revisiting virtual L1 caches: A practical design using
dynamic synonym remapping,” in 2016 IEEE International Symposium on High
Performance Computer Architecture (HPCA), March 2016, pp. 212–224. [Online].
Available: https://doi.org/10.1109/HPCA.2016.7446066