Address Translation Optimizations for Chip Multiprocessors
by
Misel-Myrto Papadopoulou
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering, University of Toronto
© Copyright 2017 by Misel-Myrto Papadopoulou
Abstract
Address Translation Optimizations for Chip Multiprocessors
Misel-Myrto Papadopoulou
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2017
Address translation is an essential part of current systems. Getting the virtual-to-physical
mapping of a page is a time-sensitive operation that precedes the vast majority of memory
accesses, be it for data or instructions. The growing memory footprints of current workloads,
as well as the proliferation of chip multiprocessor systems with a variety of shared on-chip
resources create both challenges and opportunities for address translation research. This thesis
presents an in-depth analysis of the TLB-related behaviour of a set of commercial and cloud
workloads. This analysis highlights workload nuances that can influence address translation’s
performance, as well as shortcomings of current designs. This thesis presents two architectural
proposals that both support our thesis that TLB designs and policies need not be rigid, but
should instead dynamically adapt to the workloads’ behaviour for a judicious use of the available
on-chip resources.
The Prediction-Based Superpage-Friendly TLB proposal leverages prediction to improve
energy and utilization of TLBs by allowing translations of different page sizes to coexist in a
set-associative (SA) structure. For example, a 256-entry 4-way SA TLBpred achieves better
coverage (7.7% fewer Misses Per Million Instructions) than a slower 128-entry fully-associative TLB. It also has the energy efficiency of a much smaller structure. This design uses
a highly accurate superpage predictor that achieves a 0.4% average misprediction rate with a
meager 32B of storage.
The Forget-Me-Not TLB (FMN) proposal utilizes existing cache capacity to store translation entries and thus reduce TLB-miss handling latency. A per-core private 1024-entry direct-mapped FMN reduces the average L1-TLB miss latency across all simulated workloads by 31.4% over a baseline with only L1-TLBs. In contrast, a dedicated 1024-entry 8-way SA L2-TLB reduces it by only 24.6% and, in some cases, degrades performance. We further
propose an L2-TLB bypassing mechanism to address this challenge.
Acknowledgements
This thesis is the culmination of many years of work and reflects a significant part of my Ph.D. journey, which would not have been possible to complete without the support, advice, and encouragement of many people.
First of all, I would like to thank my advisor, Professor Andreas Moshovos, for his support
throughout my Ph.D. studies, for his technical knowledge, his advice, and for giving me the
freedom to select my thesis topic. He has helped me develop and hone my research skills,
cultivate my critical thinking, and was instrumental in how I have matured during my Ph.D. I
sincerely appreciate how he wisely knew when it was time to actively support me and when it
was time to best support me by letting me stand on my own. I am also grateful he encouraged
me to teach, an experience that has deeply enriched the last few years of my Ph.D. studies.
I also owe many thanks to my Ph.D. committee members, Professors Natalie Enright Jerger,
Michael Stumm, Paul Chow, and Abhishek Bhattacharjee, for their input on this work, for the
care they took in reading my thesis and for their advice on how to improve it. I appreciate
them asking questions that challenged me and pushed me to think more about how the work we
do is relevant in the “big picture”. Natalie, especially, has been a significant source of support,
encouragement, and mentoring throughout my graduate studies. She would always be available
to listen, answer questions, and offer her advice. A simple thank you is not enough.
When I presented my Ph.D. proposal, Professor Greg Steffan was one of my original Ph.D.
committee members. I will always remember the excitement with which, in the discussion that
followed my Ph.D. proposal presentation years ago, he started thinking about all the research
opportunities TLBs offered in the CMP era. I wish he could be here today. This is but a small
candle lit in his memory.
I would also like to extend a thank you to all, current and former, graduate students from
the computer architecture groups with whom I have discussed research ideas, and shared the
ups and downs of my graduate studies. It was a pleasure to study and work alongside many of
you, and also see many of you grow in the process. From my early Ph.D. years, I would like
to extend special thanks to Ioana Baldini, for her unwavering support, advice, and friendship
since the beginning of my Ph.D., and to Jason Zebchuk, who was always generous with his advice and help with our simulation infrastructure, and who helped me get my hands dirty with the administration of our cluster. Many thanks to Henry Wong, Patrick
Judd, Jorge Albericio, Parisa Khadem Hamedani, Danyao Wang, and Elham Safi for all the
discussions, technical and otherwise, and for their support.
My thanks also to Andre Seznec and Xin Tong for their input on our HPCA paper.
I would also like to thank the administrative and technical staff in the ECE Department
for their help during my graduate studies, especially Ms. Kelly Chan, Ms. Darlene Gorzo and
everyone from the graduate office, as well as Ms. Jayne Leake from the undergraduate office.
My graduate research studies were further enriched by my internship in AMD Research,
and my teaching endeavours. I have very fond memories of my internship in AMD Research,
in Bellevue, WA, where I had the opportunity to work alongside not only great researchers
but also wonderful people. I am grateful to all of them for welcoming me and supporting me
during that time, especially to Lisa Hsu, my mentor, for her enthusiasm, support and insights
on my research work. During the last few years of my Ph.D. I was given the opportunity to
teach multiple sections of computer programming and computer organization courses in the
Computer Science Department at the University of Toronto. All faculty members and other instructors I have worked with during these past few years created a very welcoming and nurturing environment. I have learned a lot from them and have grown as an educator
beyond what would have been feasible otherwise.
Last but not least, this long journey would not have been possible without friends and
family, the community of people that make me feel at home, the kind of home you always carry
with you. To all my friends, my family, and all the people who have supported me: this journey
was possible and is more meaningful because of you.
It would be impossible to list everyone here, but I would like to extend my sincere gratitude
to Mrs. Vasso Mexis and her family who have warmly embraced me ever since I first came to
Toronto, to my friends Foteini, Irene, and Rena for their support during these past years, and
also, to my dear friends Debbie and Maria whose friendship dates back to our undergraduate
studies.
To my dad, Spyros, my aunt, Despoina, my grandmother, Maria, my sister, Maria, and her
husband, Panagiotis: a thank you will never be enough. I am forever grateful and indebted to you for your care, love, and encouragement, and for caring for my soul from an ocean and a continent away. You had faith in me and my abilities even in times when I would falter. You
are the ones who have enabled me to come so far. I always carry you with me, your love as
precious and dear as the almond trees that bloom in our garden in the midst of the winter.
Contents
1 Introduction 1
1.1 The Analysis of TLB-related Behaviour . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Prediction-Based Superpage-Friendly TLBs . . . . . . . . . . . . . . . . . . . . . 3
1.3 Forget-Me-Not TLB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Background and Related Work 6
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Background on Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Page Tables and Page Walks . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Translation Lookaside Buffers (TLBs) . . . . . . . . . . . . . . . . . . . . 10
2.2.3 The Cheetah-MMU in SPARC . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.3.1 MMU Registers and TLB Miss Handling . . . . . . . . . . . . . 13
2.2.3.2 TLB Organization and Replacement Policy . . . . . . . . . . . . 14
2.2.3.3 Special MMU Operations . . . . . . . . . . . . . . . . . . . . . . 14
2.2.4 Address Translation for I/O . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Literature Review of Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Techniques that Reduce TLB Misses . . . . . . . . . . . . . . . . . . . . . 16
2.3.1.1 TLB Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1.2 Shared or Distributed-Shared TLB Designs . . . . . . . . . . . . 19
2.3.1.3 Increasing TLB Reach . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 Techniques that Reduce TLB Miss Latency Overhead . . . . . . . . . . . 21
2.3.3 Techniques that Revisit Address Translation/Paging . . . . . . . . . . . . 22
2.3.4 Techniques that Reduce Address Translation Energy . . . . . . . . . . . . 23
2.3.5 Techniques that Address TLB Coherence Overheads . . . . . . . . . . . . 25
2.3.6 Techniques that Target I/O Address Translation . . . . . . . . . . . . . . 26
2.3.7 Architectural Optimizations that Take Advantage of Address Translation 26
2.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 TLB-related Behaviour Analysis 28
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.1 Characteristics Inherent to the Workload . . . . . . . . . . . . . . . . . . 29
3.1.2 Other Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Unique Translations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 Per-Core Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.2 CMP-Wide Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Contexts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.1 Context Count and Sharing Degree . . . . . . . . . . . . . . . . . . . . . . 35
3.4.2 Context Lifetimes (Within Execution Sample) . . . . . . . . . . . . . . . 38
3.4.3 Context Significance: Frequency and Reach . . . . . . . . . . . . . . . . . 40
3.4.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Translation Mappings Lifetime . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.1 Demap-Context Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.2 Demap-Page Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.3 TLB-Entry Modification Analysis . . . . . . . . . . . . . . . . . . . . . . . 45
3.5.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.6 TLB Capacity Sensitivity Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.6.1 Split L1-TLBs; One per Page-Size . . . . . . . . . . . . . . . . . . . . . . 47
3.6.2 Fully-Associative L1 TLB . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.6.3 Set-Associative L1-TLB for Small Pages and Fully-Associative L1-TLB
for Superpages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.6.4 L2-TLB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7 Compressibility and Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.8 The First Cache Block Access After A TLB-Miss . . . . . . . . . . . . . . . . . . 57
3.9 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4 Prediction-Based Superpage-Friendly TLB Designs 60
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Analysis of TLB-Related Workload Behavior . . . . . . . . . . . . . . . . . . . . 62
4.2.1 Unique Translations Analysis Recap . . . . . . . . . . . . . . . . . . . . . 62
4.2.2 TLB Miss Analysis and Access-Time/Energy Trade-Offs . . . . . . . . . . 63
4.2.3 Native x86 Runs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3 Page Size Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.1 Superpage Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.1.1 PC-based Predictor . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.1.2 Base Register-Value-Based (BRV-based) Predictor . . . . . . . . 67
4.4 Prediction-Guided Multigrain TLB . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.1 Supporting Other Page Size Usage Scenarios . . . . . . . . . . . . . . . . 71
4.4.1.1 Precise Page Size Prediction . . . . . . . . . . . . . . . . . . . . 71
4.4.1.2 Predicting Among Page Size Groups . . . . . . . . . . . . . . . . 72
4.4.2 Special TLB Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5 Skewed TLB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5.1 Prediction-Guided Skewed TLB . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.7.1 Superpage Prediction Accuracy . . . . . . . . . . . . . . . . . . . . . . . . 77
4.7.2 TLBpred Misses Per Million Instructions and Capacity Distribution . . . . 78
4.7.2.1 TLBpred Capacity Distribution . . . . . . . . . . . . . . . . . . . 79
4.7.3 Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.7.4 TLBskew and TLBpskew MPMI . . . . . . . . . . . . . . . . . . . . . . . . 81
4.7.5 Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.7.6 Sensitivity to the Page Size Access Distribution . . . . . . . . . . . . . . . 84
4.8 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.9 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5 The Forget-Me-Not TLB 90
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2 FMN’s Goal and Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2.1 FMN Operating Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3 FMN Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.1 Page Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.2 FMN’s Indexing Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.3 FMN’s Allocation and Replacement Policies . . . . . . . . . . . . . . . . . 96
5.4 Caching the FMN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.4.1 FMN Probes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.4.2 FMN Allocation Requests . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.5 Simulation Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.5.1 Simulation Challenges - Software-Managed TLBs in Simics . . . . . . . . 101
5.5.2 Timing Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.5.3 Page Walk Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.5.4 Discussion of Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.6 Reasoning about FMN’s Performance Potential . . . . . . . . . . . . . . . . . . . 108
5.7 Synthetic Memory Access Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.8 Baseline CMP Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.9 Sequential Page Access Patterns - A Case Study with Synthetic Traces . . . . . . 115
5.9.1 Impact of Workload’s Footprint on Baseline Configuration . . . . . . . . . 115
5.9.2 Effect of Per-Page Access Pattern on Baseline . . . . . . . . . . . . . . . . 118
5.9.3 Effect of Data Sharing on Baseline . . . . . . . . . . . . . . . . . . . . . . 119
5.9.4 Effect of Process Mix on Baseline . . . . . . . . . . . . . . . . . . . . . . . 120
5.9.5 Private FMNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.9.6 Private FMNs versus Private L2-TLBs . . . . . . . . . . . . . . . . . . . . 125
5.9.7 Private FMNs: Filtering Optimization . . . . . . . . . . . . . . . . . . . . 126
5.9.8 Private FMNs: Replacement Optimization . . . . . . . . . . . . . . . . . . 127
5.10 FMN’s Evaluation for Commercial Workloads . . . . . . . . . . . . . . . . . . . . 128
5.10.1 Impact of Address Translation on Baseline’s Performance . . . . . . . . . 129
5.10.2 FMN’s Impact on L1-TLB Miss Latency . . . . . . . . . . . . . . . . . . . 131
5.10.3 FMN’s Effect on Average Memory Latency . . . . . . . . . . . . . . . . . 132
5.10.4 FMN’s Effect on Performance . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.11 L2-TLB Bypassing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.11.1 Proposed Solution: Bypassing the L2-TLB . . . . . . . . . . . . . . . . . 136
5.11.2 L2-TLB Bypassing: Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 137
5.12 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6 Concluding Remarks 139
6.1 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Bibliography 142
List of Tables
2.1 Commercial D-TLB designs; all the L2-TLBs are unified except for the AMD
systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 List of characteristics/metrics inherent to the workload presented in this analysis
along with a brief explanation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Other Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Per-core unique translation characterization: 8KB pages. Footprint in MB is
listed in parentheses for the min., max. and avg. (arithmetic mean) columns.
SD is also expressed as a percentage of the average in parentheses. . . . . . . . . 33
3.5 Per-core unique translation characterization: Superpages (i.e., 64KB, 512KB
and 4MB pages). No 64KB pages were present. . . . . . . . . . . . . . . . . . . . 33
3.6 CMP-wide unique translation characterization: 8KB pages and Superpages. . . 35
3.7 Context 0: % TLB accesses and cumulative per core unique translation entries
across the entire CMP. See previous equations. . . . . . . . . . . . . . . . . . . . 41
3.8 Non-zero contexts: % TLB accesses and cumulative per core unique translation
entries across the entire CMP for PARSEC and Cloud workloads. . . . . . . . . 41
3.9 Translation Demap and Remap Operations (cumulative in the entire CMP). . . . 43
3.10 Unique characteristics of Demap-Page requests (per core). Values in parentheses
are for the entire CMP wherever different. . . . . . . . . . . . . . . . . . . . . . . 44
4.1 Commercial D-TLB Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 System Parameters for Native x86 Execution . . . . . . . . . . . . . . . . . . . . 65
4.3 Fraction of TLB Misses due to 2MB Superpages (x86) . . . . . . . . . . . . . . . 65
4.4 Primary TLBpred Lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.5 Secondary TLBpred Lookup Using a Binary Superpage Predictor. . . . . . . . . . 71
4.6 i-th TLB Lookup (1 < i ≤ N); N supported page sizes. . . . . . . . . . . . . . . . 72
4.7 Page Size Function described in Skewed TLB [69]. . . . . . . . . . . . . . . . . . 73
4.8 TLB Entry Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.9 Canneal Spin-Offs: Footprint Characterization . . . . . . . . . . . . . . . . . . . 85
5.1 TSB hit code in D-MMU Trap Handler (Solaris) . . . . . . . . . . . . . . . . . . 103
5.2 System Configuration Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3 L2-TLB Hit-Rate (%) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.4 L2-TLB Bypassing Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
List of Figures
2.1 x86-64 (or IA-32e) Page Walk for a 4KB page. Intel refers to a level-4 table as Page-Map Level-4 (PML4), and to a level-3 table as Page-Directory-Pointer Table (PDPT). . . . . . 9
2.2 SA TLB indexing for a TLB with 64 sets (x86 architecture). . . . . . . . . . . . . 11
2.3 Network I/O - System Snapshot . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Disk I/O - System Snapshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 Number of unique contexts observed in the CMP; the number is also listed on
the top of each column. Each column is colour-coded based on the number of
core-sharers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Number of unique per core contexts for three workload classes. Each column
corresponds to a different core in the range of [0, 15] in ascending order. . . . . . 37
3.3 Context lifetimes. The average context/core lifetime is listed in parentheses as a
percentage of the workload’s execution time sample. . . . . . . . . . . . . . . . . 39
3.4 L1 TLB MPMI and Hit-Rate over different TLB sizes. The x-axis lists the
number of TLB entries for the split TLB with 8KB page translations; the capacity
of each other split TLB structure is half that in size. Canneal saturates with this
y-axis scale; see detail in Figure 3.5. . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5 Canneal MPMI detail with larger y-axis scale. . . . . . . . . . . . . . . . . . . . . 48
3.6 L1 TLB MPMI and Hit-Rate over different FA TLB sizes. All TLBs model full-
LRU as replacement policy. Figure 3.7 shows canneal in detail as it saturated
with this y-axis scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.7 Canneal MPMI detail with larger y-axis scale. . . . . . . . . . . . . . . . . . . . . 49
3.8 L1 TLB MPMI and Hit-Rate over different TLB sizes for the 2-way SA TLB
that only hosts translations for 8KB pages. A fixed 16-entry FA TLB is modeled
for all superpages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.9 Canneal MPMI with larger y-axis scale. . . . . . . . . . . . . . . . . . . . . . . . 51
3.10 L1 TLB MPMI over different TLB sizes for the FA TLB that hosts translations
for all superpages. A fixed 2-way SA 512-entry TLB is modeled for 8KB pages. . 52
3.11 L2 TLB MPMI and Hit-Rate over different TLB sizes. The x-axis lists the
number of L2 TLB entries for an 8-way SA L2-TLB that only supports 8KB
pages. Canneal saturates with this y-axis scale; see detail in Figure 3.12. . . . . . 53
3.12 Canneal L2-TLB MPMI detail with larger y-axis scale. . . . . . . . . . . . . . . . 53
3.13 Per-Core L2-TLB Capacity classified percentage-wise in valid and invalid TLB
entries for different L2-TLB sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.14 Unique Bytes and Byte-Sets Nomenclature . . . . . . . . . . . . . . . . . . . . 55
3.15 Number of unique bytes and byte-sets in the virtual and physical addresses. . . . 56
3.16 Number of unique values for MSB 4 and Byte-Set 4 (both in virtual and physical
addresses). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.17 Percentage of all CMP D-TLB L1 Misses that access the same 64B cache block
as the last time that same translation-entry experienced a TLB miss. . . . . . . 58
4.1 D-TLB L1 MPMI for Different TLB Designs . . . . . . . . . . . . . . . . . . . . 63
4.2 Access Time and Dynamic Energy Trade-Offs . . . . . . . . . . . . . . . . . . . . 64
4.3 (a) PC-Based and (b) Base Register-Value Based Page Size Predictors . . . . . . 67
4.4 Multigrain Indexing with 4 supported page sizes, shown here for a 512-entry
8-way SA TLB (6 set-index bits). . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Multigrain Tag Comparison for Figure 4.4’s TLB on superpage prediction. Page
Size field (2 bits) included in every TLB entry. . . . . . . . . . . . . . . . . . . . 70
4.6 Skewed Indexing (512 entries, 8-way skewed associative TLB) with 4 supported
page sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.7 Prediction Table (PT) Entry Transition Diagram . . . . . . . . . . . . . . . . . . 77
4.8 Superpage-Prediction Misprediction Rate (%) . . . . . . . . . . . . . . . . . . . . 78
4.9 TLBpred MPMI relative to AMD-like 48-entry FA TLB . . . . . . . . . . . . . . . 79
4.10 TLBpred per core capacity distribution over translations of different page sizes. . 80
4.11 Dynamic Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.12 TLBskew, TLBpred, and TLBpskew: MPMI relative to AMD-like 48-entry FA TLB 82
4.13 CPI saved with TLBpred . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.14 Canneal Spin-Offs: Miss Distribution for 48-entry FA (AMD12h-like) TLB . . . . 85
4.15 Canneal Spin-Offs: MPMI relative to AMD-like TLB. Includes TLBpred with
precise page-size prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.1 FMN’s Best Case Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2 FMN Operation Timeline - Page Walk completes before FMN probe . . . . . . . 93
5.3 FMN Operation Timelines - FMN probe completes before page walk . . . . . . . 94
5.4 FMN’s effect on cache contents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.5 Virtualizing a small 8-entry 2-way SA FMN. . . . . . . . . . . . . . . . . . . . . . 99
5.6 Timing Model - Front End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.7 Page Walk Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.8 Projected ideal % performance improvement based on Equation (5.11) with
∆TLB miss = 0.75 and ∆mem = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.9 Projected % performance improvement based on Equation (5.10) with c = 0. . . 111
5.10 Effect of pool size on TLB hit rate. . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.11 TLB Miss Latency as percentage of execution time with varying Pool Size (PS)
and Block Count Per Page (BCPP) values. Figure 5.12 presents how the execu-
tion time changes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.12 Execution time with varying PS and BCPP values. . . . . . . . . . . . . . . . . . 117
5.13 Average Memory Request Latency in cycles. Note the logarithmic y-axis scale. . 118
5.14 Average L1-TLB Miss Latency in cycles. No TLB misses exist for the PS-64
series in these last 16M references, as explained earlier. . . . . . . . . . . . . . . . 118
5.15 Shared versus Private: Effect of data sharing on L1-TLB miss latency. . . . . . . 119
5.16 Shared versus Private: Effect of data sharing on average memory latency. . . . . 119
5.17 Private sharing pattern: Effect of process mix on baseline’s TLB miss latency. . . 120
5.18 Shared sharing pattern: Effect of process mix on baseline’s TLB miss latency. . . 121
5.19 Performance Impact: FMN versus Baseline. . . . . . . . . . . . . . . . . . . . . . 122
5.20 Average TLB Miss Latency in cycles. . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.21 Average Memory Latency in cycles; this is measured after translation is retrieved. . 124
5.22 Performance Impact: FMN compared to L2-TLB. . . . . . . . . . . . . . . . . . . 125
5.23 FMN Filtering and FMN vs. Baseline . . . . . . . . . . . . . . . . . . . . . . . . 127
5.24 Percentage of execution time spent in L1-TLB miss handling. . . . . . . . . . . . 129
5.25 Percentage of execution time reduction due to L2-TLB. . . . . . . . . . . . . . . 130
5.26 FMN or L2-TLB: Percentage L1-TLB Miss latency reduction over HB. . . . . . . 131
5.27 FMN or L2-TLB: Percentage of execution time spent handling L1-TLB misses. . 132
5.28 Characterization of FMN probes for a 1K-entry per core FMN with 8KB VPN
indexing scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.29 FMN or L2-TLB: Percentage memory latency increase over HB. . . . . . . . . . . 133
5.30 FMN or L2-TLB: Performance over HB. . . . . . . . . . . . . . . . . . . . . . . . 134
5.31 Percentage of execution time reduction with L2-TLB bypassing. . . . . . . . . . . 138
Acronyms
ASI Address Space Identifier. The acronym used in SPARC systems instead of ASID. 14
ASID Address Space Identifier. xv, 12, 13, 24, 29, 35, 75, 95
BCPP Block Count Per Page. xiv, 115–117, 120, 125
BRV Base-Register (src1) Value. vii, 67, 77, 88
CPI Cycles Per Instruction. 83
DM Direct-Mapped. 10, 123
DMA Direct Memory Access. 15
FA Fully-Associative. 10, 17, 24, 32, 49, 87, 101
FMN Forget-Me-Not. 4, 90, 91, 101, 140
FPGA Field-Programmable Gate Array. 28
GPU Graphics Processing Unit. 6
ILP Instruction Level Parallelism. 107
IOMMU I/O Memory Management Unit. 15
IPC Instructions Per Cycle. 107
IRQ Interrupt Request. 36
ISA Instruction Set Architecture. 12
L1 Level 1. 4
L2 Level 2. 4, 10
LLC Last-Level Cache. 25, 26, 99, 100
LRU Least Recently Used. 17
MLP Memory-Level Parallelism. 107
MMU Memory Management Unit. 1, 7, 13, 92
MPKI Misses Per Kilo Instructions. 24, 46
MPMI Misses Per Million Instructions. 3, 28, 30, 46, 58, 62, 63, 135, 140
OoO Out Of Order. 107
OS Operating System. 7, 47
PC Program Counter. 44
PCID Process Context Identifier. 13
PDE Page Directory Entry. 8, 21, 121
PDPT Page-Directory-Pointer Table. xii, 9, 21
PID Process ID. 7, 56
PIPT Physically-Indexed and Physically-Tagged. 25
PML4 Page-Map Level-4. xii, 9, 21
PPN Physical Page Number. 7, 45
PS Pool Size. xiv, 115–117, 120, 125
PT Prediction Table. xiii, 66, 77, 88
PTE Page Table (or Translation) Entry. 8, 17, 97, 98
RMM Redundant Memory Mappings. 23, 87
SA Set-Associative. 10, 47, 99, 101, 116
SCSI Small Computer System Interface. 15
SD Standard Deviation. x, 32, 33
SPARC Scalable Processor Architecture. xv, 7
STXA Store extended word into alternate space. 14
THP Transparent Huge Pages. 88
TLB Translation Lookaside Buffer. 1, 6, 10
TSB Translation Storage Buffer. 4, 12, 101, 102
TT Trap Type. 13
VIPT Virtually-Indexed and Physically-Tagged. 12, 23, 24
VIVT Virtually-Indexed and Virtually-Tagged. 24
VPN Virtual Page Number. 7, 20, 34, 35, 45, 57, 95
Chapter 1
Introduction
Address translation has been an integral part of computer systems for decades, since the concept
of virtual memory was introduced in the early 1960s [31]. Virtual memory support is considered
a de facto facility for current systems. Having a large contiguous address space for each process,
along with isolation and access control provisions across different processes and memory regions,
are characteristics of the virtual memory abstraction all programmers rely on. The operating
system and the hardware architecture must support these requirements, usually transparently
to and with no effort from the programmer. Beyond correctness, which is not negotiable,
there is also the fundamental expectation of efficiency: the architecture should support address
translation within strict performance, energy, and often area, envelopes. A brief introduction
to virtual memory is given in Chapter 2.
Getting the virtual-to-physical mapping of an address is on the processor’s critical path
because it precedes the vast majority of memory accesses, be it for data or instructions. Modern
Memory Management Units (MMUs) employ Translation Lookaside Buffers (TLBs) to avoid
walking the page tables on every memory access that needs a translation. This functionality
has a parallel to data and instruction caches: TLBs hide part of the page-walk latency, while
data and instruction caches hide part of the memory latency. However, despite its long history,
address translation still causes significant performance loss in many scenarios, as both system
architectures and workloads evolve. The percentage of execution time spent doing page walks
is as high as 16% for scale-out workloads [44] or 14% for a wide range of server and scientific
applications [15], and can even reach 89% under virtualization [15]. This trend is expected to
continue as the increased data footprints of emerging applications stress conventional TLBs and
their tight latency and energy constraints. As the virtual address space grows, the TLB miss
handling latency is also expected to increase as more levels are added to current multi-level
page tables. For example, Intel is currently working on introducing 5-level page tables [40].
The TLB organization and TLB miss handling need to accommodate these ever evolving needs.
The straightforward solution of making the structures in question larger has been tried in
the realm of caches, and it is now common knowledge that blindly dedicating resources to the
problem is not enough. In the realm of L1-TLBs, it is not even a viable option because of
their strict latency constraints. But even within the existing latency and power constraints,
the “one design fits all” paradigm rarely manages to capture the widely different memory
behaviour requirements, not only across different workloads, but also across cores running a
single multithreaded workload. For example, rigid a priori decisions about the likely page size
distributions of workloads, reflected in different TLB sizes for split TLBs that each support
one page size, can waste both energy and hardware resources, and can also hurt performance
when the observed behaviour deviates from the one expected. This thesis advocates for TLB
designs and policies that dynamically adapt to the workloads’ behaviour for a judicious use of
the available on-chip resources.
To understand which aspects of workload behaviour and system architecture influence
TLB usage, and thus which ones TLB designs and management policies should adapt
to, this thesis presents an in-depth exploration of TLB-related behaviour
for a set of commercial and cloud workloads. This analysis showed significant variation in the
use of superpages (i.e., pages with sizes greater than the smallest supported one) versus small
pages across workloads, with a strong bias for the largest supported superpage. It also showed
that most mainstream TLB structures (e.g., split TLBs) are either biased towards the smallest
page size or make an implicit assumption about the page size distribution of memory accesses.
These two observations have motivated our proposal for Prediction-Based Superpage-Friendly
TLBs that use superpage-prediction to allow translations of different page-sizes to coexist in a
single set-associative TLB, sharing its capacity at runtime as needed. Our analysis also showed
that translation modifications are rare, encouraging our second proposal, the Forget-Me-Not
TLB, a cacheable and speculative TLB that allows translations to dynamically share existing
on-chip capacity with regular data and reduces TLB miss handling latency.
The remainder of this chapter is organized as follows. Sections 1.1 to 1.3 introduce our
analysis and our two architectural proposals. Section 1.4 outlines the research contributions of
this dissertation and, lastly, Section 1.5 reviews the organization of this thesis.
1.1 The Analysis of TLB-related Behaviour
Our analysis had two goals: (i) to understand which aspects of workload behaviour
influence TLB usage, and (ii) to characterize the interplay between these aspects and the
existing TLB infrastructure. We thus classified our measurements according to the following
taxonomy: (i) characteristics inherent to the workloads, that is, characteristics or metrics
unaffected by translation caching structures like the TLBs, and (ii) other metrics that are
influenced by the architecture of these structures.
The measurements in the first category answer the following questions: What is the number
of unique translations, and thus the TLB sizing requirements, for these workloads? Do these
requirements vary from the perspective of each CMP core for a given workload? Which
page sizes are most prominently used? Is there any bias we can exploit? Is there translation
sharing across cores that would motivate non-private TLB designs? How often are translation
mappings modified? If we look at address translation via the abstraction of process IDs (i.e.,
contexts), how does this lens influence our view of translation sharing potential across cores?
What is the frequency, data reach and lifetime of these different contexts, and would filtering
them and their translations be appropriate? Are there opportunities for translation compression
or predictability of the cache block accessed after a TLB miss? Answers to these questions help
us better understand how and why specific TLB designs affect Misses Per Million Instructions
(MPMI) measurements, and also motivate our architectural proposals. The measurements in
the second category of the above taxonomy evaluate how MPMI changes across different
state-of-the-art TLB organizations, highlighting their trade-offs and shortcomings. For example, our
TLB capacity sensitivity study illustrates how rigid TLB designs that make a priori assumptions
about page size distribution in workloads poorly capture different superpage usage scenarios.
1.2 Prediction-Based Superpage-Friendly TLBs
Our analysis reveals different page size usage scenarios across workloads, with some workloads
heavily relying on superpages. It also shows that, when superpages are used, workloads tend
to favor the largest superpage size, while intermediate superpage sizes rarely appear. These
observations are not reflected in existing TLB designs that make an a priori decision about
workloads’ page size distribution, unnecessarily wasting energy and area.
To address this research gap, this thesis proposes a lightweight binary superpage prediction
mechanism that accurately guesses ahead of time if a memory access is to a superpage or
not. This predictor enables our proposed TLBpred design, an elastic set-associative TLB that
dynamically adapts its super- and regular page capacity to fit each application’s needs. That
is: (1) A workload using mostly a single page size can use all the available TLB capacity
and not waste any resources or be limited by predetermined assumptions on page size usage.
(2) A workload that uses multiple page sizes should have its translations transparently compete
for the available TLB entries. A set-associative TLB design will better scale to larger sizes
without the onerous access and power penalties of a large fully-associative TLB. For example,
a 256-entry 4-way SA configuration of the proposed TLBpred design achieves better coverage
(7.7% less MPMI) compared to a slower 128-entry fully-associative TLB. It is also significantly
more energy efficient; its energy efficiency is comparable to that of a much smaller 48-entry
fully-associative TLB that has much higher MPMI. This TLBpred design uses a highly accurate
superpage predictor; a small 128-entry predictor table with a meager 32B of storage has an
average misprediction rate of 0.4% across all simulated workloads. This work also provides the
first experimental evaluation of the previously proposed Skewed TLB design [69] that can also
support multiple page sizes in a single structure. We further augment the Skewed TLB with
page size prediction, a modified version of our superpage predictor that now predicts among
groups of page sizes, to increase its per page-size effective associativity.
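The prediction mechanism summarized above can be pictured as a small table of saturating counters. The following is a minimal sketch, assuming a PC-indexed table of 2-bit counters (128 entries at 2 bits each gives the 32B storage quoted in this section); the indexing function and update policy are illustrative assumptions, not the exact design of Chapter 4:

```python
# Hedged sketch of a binary superpage predictor: a 128-entry table of
# 2-bit saturating counters. The PC-based index and the update policy
# are illustrative assumptions, not the evaluated design.

class SuperpagePredictor:
    def __init__(self, entries=128):
        self.entries = entries
        self.table = [1] * entries  # start weakly biased to "not a superpage"

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop instruction-alignment bits

    def predict(self, pc):
        """True predicts that the access falls within a superpage."""
        return self.table[self._index(pc)] >= 2

    def update(self, pc, was_superpage):
        i = self._index(pc)
        if was_superpage:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

A binary prediction suffices here because the predictor only needs to steer the TLB index choice between superpage and regular-page indexing, not to pinpoint the exact page size.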
1.3 Forget-Me-Not TLB
Level 1 (L1) TLBs are inherently limited by strict latency constraints. Even if they are re-
designed to better adapt to the workloads’ behaviour and each architecture’s nuances, they
still cannot meet the growing demands of data footprints. Extending the TLB hierarchy, e.g.,
by adding a Level 2 (L2) TLB, is a solution many current systems follow. Even though the
L2-TLB access latency is no longer in the processor’s critical path, the benefits gained by allo-
cating hardware resources to such a design need to be scrutinized especially because the L2-TLB
probe happens before a page walk. Thus, in cases where the L2-TLB hit-rate is low because
the workload’s footprint is too large or the existing L2-TLB configuration does not cater to the
workload’s page-size distribution, preceding the page-walk latency with an L2-TLB probe may
not just waste energy, but also cause performance degradation.
This work proposes the Forget-Me-Not (FMN) TLB, a cacheable TLB design that can sig-
nificantly reduce TLB-miss handling latency without any dedicated on-chip translation storage.
FMN leverages the observation that large on-chip memory caches can be shared transparently
and on demand with properly engineered virtualized structures [25, 26]. On a TLB miss, this
virtualized TLB is accessed in parallel to the page table. However, unlike the page walk that
requires multiple memory accesses, only a single access is needed to retrieve the translation, if
the latter exists in this new cacheable structure. FMN’s translations are speculative because
the FMN is not kept coherent with the page tables. However, since our analysis shows that
translation modifications are rare, FMN misspeculation is also rare. As explained in Section 5.4.3,
when compared to a software-based translation cache, like the Translation Storage Buffer (TSB),
our design is different because it is hardware managed, its lookup does not precede the page
walk, and it is also, by design, not kept coherent with the page-tables. Further, FMN is not a
per process structure, but can be configured as a private (per-core) or a shared structure.
Contrary to an L2-TLB, FMN does not require any dedicated hardware storage and its
virtualized nature enables more flexible organizations (e.g., different indexing schemes, sizes).
A per core private 1024-entry direct-mapped FMN reduces the average L1-TLB miss latency
across all simulated workloads by 31.4% over a baseline with only L1-TLBs, while a dedicated
1024-entry 8-way set-associative L2-TLB reduces it by 24.6%. FMN’s L1-TLB miss latency
reduction results in up to 1.97% overall execution time reduction (performance). For systems
that already have an L2-TLB, this work also proposes an L2-TLB bypassing optimization. An
interval-based predictor enables L1-TLB misses to bypass the L2-TLB and immediately trigger
a page walk when it predicts that the L2-TLB lookup, which precedes the page walk, is likely
to degrade performance.
1.4 Thesis Contributions
This thesis makes the following contributions:
• It analyzes different aspects of workload behaviour relevant to address translation, and
highlights inefficiencies of existing TLB designs such as the poor support of multiple page
sizes (Chapter 3). This analysis also points to interesting directions for future research,
such as context-aware TLB-indexing schemes.
• It proposes a highly accurate superpage predictor that predicts if a memory access is
to a superpage or the smallest supported page size. It then leverages this predictor to
propose TLBpred that allows translations of different page sizes to co-exist in an elastic
set-associative TLB design (Chapter 4).
• It evaluates the previously proposed but not evaluated Skewed-TLB, TLBskew, and aug-
ments it with a prediction mechanism to improve the effective associativity of each page-
size (Chapter 4).
• It proposes FMN, a cacheable and speculative TLB design that reduces the TLB miss
handling latency by using the available on-chip cache hierarchy to transparently and on
demand host past translations (Chapter 5).
• It presents a suite of synthetic traces and their configuration parameters that can enable
exploration of TLB designs for a variety of workload characteristics which might not be
represented in existing workloads (Chapter 5).
• It highlights the circumstances under which L2-TLBs can hurt performance and proposes
an L2-TLB bypassing mechanism based on an interval-based predictor (Chapter 5).
The prediction-based superpage-friendly TLB designs presented in Chapter 4 were published
in the International Symposium on High Performance Computer Architecture (HPCA) in
2015 [55].
1.5 Thesis Organization
The remainder of this thesis is organized as follows. Chapter 2 first provides the necessary
background on address translation and the hardware or software facilities that support it (e.g.,
page tables, TLBs), and then reviews relevant past research. Chapter 3 presents our in-depth
exploration of TLB-related behaviour. Chapter 4 presents our superpage predictor and our
prediction-based superpage-friendly TLB designs TLBpred and TLBpskew. Chapter 5 presents
our Forget-Me-Not TLB, as well as our L2-TLB bypassing mechanism. Finally, Chapter 6
summarizes this work’s contributions and anticipates future research directions.
Chapter 2
Background and Related Work
2.1 Overview
This chapter: 1) provides the necessary background on address translation and the hardware
or software facilities that support it, such as page tables and TLBs (Section 2.2), and 2) reviews
address-translation-related research (Section 2.3). The latter is organized in different thematic
categories that together form the landscape of architectural optimizations targeting
address translation. This landscape changes as the workloads and the underlying architectures
do. The earliest research works targeted scientific applications in uniprocessor systems, while
starting from 2007 there was an emergence of works in multiprocessor systems targeting data
parallel applications; there is also research on address translation for heterogeneous systems
such as GPUs [60,82]. The literature presented in this chapter will focus on general-purpose
systems and will not include research for virtualization support. Research works that aim to
reduce the number of TLB misses, the TLB-miss latency overhead, or the energy spent on
address translation are the most pertinent to this work.
2.2 Background on Address Translation
The first system to implement a variant of virtual memory was the ATLAS computer in the
early 1960s. In ATLAS, “address is an identifier of a required piece of information but not
a description of where in main memory that piece of information is” [31]. As Denning later
said, referring to ATLAS, this concept of virtual memory “gives the programmer the illusion
that he has a very large main memory at his disposal, even though the computer actually has a
relatively small main memory” [29]. This indirection in the view of the address space from the
perspective of the process (i.e., the virtual address space, a linear address space exposed to the
programmer) and of the physical system (i.e., the physical address space), remains one of the
main and most crucial facilities provided by virtual memory.
The need to map addresses between virtual memory, the imaginary “large main memory”
mentioned earlier, and the actual physical memory gave birth to address translation. It is the
responsibility of the operating system and the hardware to implement address translation by
providing the necessary mapping from virtual to physical addresses. The MMU serves this
purpose, allowing the application (process) to be oblivious¹ to this indirection.
As application footprints grew, and with the advent of multiprogramming, virtual memory
became a necessity. For this reason, address translation has become an integral part of modern
computer systems. As memory virtualization matured over the years, it also became synony-
mous with providing isolation across multiple running processes and access control (protection)
for different parts of memory. Therefore, address translation is not limited to providing
a virtual-to-physical address mapping, but also incorporates additional information, such as
access permissions, as discussed in Section 2.2.1.
With virtual memory, only the data currently in use by a given process need to reside in
physical memory. The norm in today’s systems is that any data that exceeds the available
physical memory is stored on the disk (secondary storage). Depending on the implementation
of virtual memory used, paging [9] or segmentation [9], the data is organized in memory at
the granularity of pages or segments respectively. Usually pages have a significantly smaller
size than segments; a segment is a large contiguous memory region identified by a base address
and its size. It is also possible for a system to implement both segmentation and paging. This
work focuses on paging systems as paging is the most widespread memory management scheme.
Segmentation is not discussed further, as it can be applied on top of paging and does not
influence the specifics of this work.
The sections that follow present the structures the MMU probes to retrieve the needed
translation. Section 2.2.1 discusses the functionality and organization of page tables, while
Section 2.2.2 focuses on Translation Lookaside Buffers (TLBs). Section 2.2.3 details the Cheetah
MMU, a SPARC MMU, introducing terminology required by the methodology that will follow.
Lastly, Section 2.2.4 covers a different type of MMU, the I/O MMU, used in the path of I/O
accesses.
2.2.1 Page Tables and Page Walks
All the virtual-to-physical mappings, i.e., translations, for a given process are maintained in
a page table in memory. Contiguous regions of virtual addresses, called pages, are mapped to
contiguous regions of physical memory, called page frames. The size of a page, and by extension
the size of a page frame, is always a power-of-two bytes. The page size separates any virtual and
physical address into two fields: the page number and the page offset. The page offset consists
of the log2(page size) least significant bits of the address, while the page number consists of the
remaining higher-order bits. A translation maps a Virtual Page Number (VPN) to a Physical
Page Number (PPN), while the page-offset bits remain unchanged. The VPN and knowledge
of the Process ID (PID), a unique identifier provided to each process by the Operating System
¹Here the term oblivious highlights that the programmer need not do any additional work to facilitate the virtual-to-physical mapping. However, address translation can have a performance impact on a given process.
(OS), are the only two pieces of information needed to access the page tables and retrieve the
translation.
Even though page tables can be organized in different ways, each Page Table Entry (PTE)
usually contains the following fields: (a) the physical page number, (b) a present bit, set when
the page is resident in physical memory, (c) a referenced bit, (d) a modified bit, (e) a caching
disabled bit, and (f) an access permission (protection) field [80].
A cleared present bit in a PTE means that the page in question is not mapped in
physical memory. This triggers a page fault, a trap to the operating system. A replacement
candidate among all present pages is selected by the system's page replacement algorithm, if
needed. The referenced bit in the PTE can be used for this purpose. If the modified
(dirty) bit of the selected page is set, then the operating system must write the contents of
this victim page back (i.e., swap it) to the disk before it brings the requested page frame from
the disk into physical memory and updates the PTE as needed. A set caching-disabled
bit, usually relevant in case of memory-mapped I/O, indicates that the contents of a page frame
should not be cached. Finally, the protection field specifies the access rights to a given page
frame. In an Intel-64 processor, protection encompasses the R/W flag that controls read-only
versus read-write access to a page, the user/supervisor mode flag (U/S) that controls if user-
mode accesses to a page are permitted, and the execute-disabled flag that prevents instruction
fetches from such marked pages further protecting against malicious code [39].
How translation entries are organized in the page tables is especially important since it
determines how fast a translation can be retrieved. The most common types of page tables are:
(a) multi-level page tables, and (b) inverted page tables.
Multi-level page tables solve the problem of needing to keep really large page tables in
physical memory. Only the page tables which cover the address space used by a process are
kept in memory. This idea is realized as follows. The virtual page number of any virtual address
is split into x separate fields, where x is the number of page table levels. Each of these fields
is used as an index to the relevant page table in each level. The contents of each entry at that
index serve either as a pointer to the base address of the next (lower) level page-table or as the
final translation. This work refers to the first type of entries as Page Directory Entries (PDEs)
and to the latter as Page Translation Entries (PTEs). The page table traversal until a PTE
is found is commonly referred to as “walking the page tables” or “page walk”. Page walks are
by nature sequential and they will require more page-table levels as the virtual address space
grows.
Figure 2.1 shows a page walk in an x86-64 architecture; this example retrieves the translation
of a 4KB page. Currently 48-bit virtual addresses are mapped to 52-bit physical addresses. Each
page-table has 512 8B-entries and requires 4KB of storage.
[Figure 2.1 depicts the x86-64 page walk: the 48-bit virtual address is split into four 9-bit index fields (bits [47:39], [38:30], [29:21], and [20:12]) plus a 12-bit page offset; starting from control register CR3, three PDE lookups chain through the L4, L3, and L2 page tables (512 entries each) to the final PTE in the L1 table. For 2MB and 1GB pages, the offset instead spans bits [20:0] and [29:0] respectively.]
Figure 2.1: x86-64 (or IA-32e) Page Walk for a 4KB page. Intel refers to a level-4 table as Page-Map Level-4 (PML4), and to a level-3 table as Page-Directory-Pointer Table (PDPT).
The control register CR3 points to the base physical address of the page-table hierarchy. The
first highlighted VPN field (bits [47:39]) is used to index into the topmost L4 page-table. The
contents of this entry form the base address of the next level page-table. The second field of
the virtual address is used to index into the subsequent L3 table. These steps repeat until we
reach a page translation entry. This PTE can be either at the leaf of the tree, as in the provided
figure, or at a higher tree level, if this virtual address belongs to a larger page. For example, for
a 1GB page, the largest x86-64 supported page size, only the L4 and L3 tables will be accessed
thus requiring only two memory references for the page walk instead of the four needed for a
4KB page.
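The address decomposition in this walk can be sketched as follows; this is an illustration of the bit fields described above, not production MMU code:

```python
def walk_indices(vaddr):
    """Return the four 9-bit x86-64 page-table indices (L4..L1) and the 4KB page offset."""
    offset = vaddr & 0xFFF                     # bits [11:0]
    indices = [(vaddr >> shift) & 0x1FF        # 9 bits per level
               for shift in (39, 30, 21, 12)]  # L4, L3, L2, L1 fields
    return indices, offset

# Each index selects one of the 512 entries at its level; a walk for a
# 1GB page would stop after the L3 lookup (two memory references).
indices, offset = walk_indices((1 << 39) | (3 << 21) | 0xABC)
```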
Unlike the multi-level page tables that contain an entry per virtual page, the inverted
page tables contain an entry per physical page frame, and therefore they cannot be VPN-
indexed. Even though inverted page tables have reduced memory requirements, since a system’s
physical memory is much smaller than the sum of all virtual address spaces of its currently
running processes, they have increased complexity cost. The complexity stems from the need
to exhaustively search all entries of an inverted page table to find the one that corresponds
to the requested VPN and process. Hash tables are often used to speed up this search [9]. This
work considers only multi-level page tables, the most widespread page table format.
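To make the hash-assisted lookup concrete, here is a minimal sketch of an inverted page table; the structure and names are illustrative, not a specific OS implementation:

```python
class InvertedPageTable:
    """One entry per physical frame, plus a hash for fast (PID, VPN) lookup."""

    def __init__(self, num_frames):
        self.frames = [None] * num_frames  # frame number -> (pid, vpn) or None
        self.lookup = {}                   # hash: (pid, vpn) -> frame number

    def map(self, pid, vpn, frame):
        self.frames[frame] = (pid, vpn)
        self.lookup[(pid, vpn)] = frame

    def translate(self, pid, vpn):
        """Return the physical frame number, or None to signal a page fault."""
        return self.lookup.get((pid, vpn))
```

Without the hash, translate would have to scan the per-frame array for a matching (PID, VPN) pair, which is the complexity cost mentioned above.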
2.2.2 Translation Lookaside Buffers (TLBs)
To avoid walking the page-tables on every memory access that needs a translation, and incurring
a significant memory latency overhead, modern MMUs employ Translation Lookaside Buffers
(TLBs). TLBs act as caches for the paging hierarchy; TLBs solely cache translations, while
data and instruction caches cache memory blocks (e.g., data, instructions). To avoid ambiguity,
the term cache(s), in the remainder of this thesis, will refer to data and instruction cache(s)
and not TLBs. The temporal and spatial locality principles that caches rely upon also result
primarily in temporal page-table entry locality. This temporal locality is more pronounced
in the TLBs because they track memory accesses at a coarser granularity than caches (pages
versus cache lines). Spatial locality in the application may also result in adjacent page table
entries being accessed close in time.
TLBs are usually organized as cache-like, fully-associative or set-associative structures, ad-
dressable by a virtual address. Table 2.1 lists the data TLB (D-TLB) configurations of several
commercial processors. All these TLBs are private per-core structures. The acronyms FA,
SA, and DM stand for Fully-Associative, Set-Associative, and Direct-Mapped respectively;
these acronyms are used to describe the associativity of various structures (e.g., TLBs, caches)
throughout the thesis. Multiple levels of TLBs can exist in a system, similar to the multiple
levels of caches in the cache hierarchy. The first level usually has separate instruction and
data TLBs, as is the case with the split L1 instruction and data caches that are the norm in
today’s systems. The second level (L2) TLBs are usually unified, hosting translations for both
instructions and data. The TLBs for the AMD systems are the only exceptions in this table.
AMD 12h family [4, Section A.10]. L1 D-TLB: 48-entry FA TLB (all page sizes). L2 TLB: 4-way SA 1024-entry D-TLB (4KB); 2-way SA 128-entry D-TLB (2MB); 8-way SA 16-entry D-TLB (1GB).
AMD 15h family [5, Section 2.9]. L1 D-TLB: 64-entry FA TLB (all page sizes). L2 TLB: 8-way SA 1024-entry D-TLB (4KB, 2MB or 1GB).
ARM Cortex-A72 [8]. L1 D-TLB: 32-entry FA (4KB, 64KB and 1MB). L2 TLB: 4-way SA 1024-entry (4KB, 64KB, 1MB, 16MB).
Intel Haswell [38, Table 2.10], [34]. L1 D-TLB: 4-way SA split L1 TLBs: 64-entry (4KB), 32-entry (2MB) and 4-entry (1GB). L2 TLB: 8-way SA 1024-entry (4KB and 2MB).
Intel Broadwell [38, Table 2.11]. L1 D-TLB: same as Haswell. L2 TLB: 6-way SA 1536-entry (4KB and 2MB); 4-way SA 16-entry (1GB pages).
Intel Skylake [38, Table 2.5]. L1 D-TLB: same as Haswell. L2 TLB: 12-way SA 1536-entry (4KB and 2MB); 4-way SA 16-entry (1GB pages).
Intel Knights Landing [38, Table 16.3]. L1 D-TLB: uTLB: 8-way SA 64-entry (4KB fractured). L2 TLB: 8-way SA 256-entry (4KB); 8-way SA 128-entry (2MB); 16-entry FA (1GB).
Oracle Sparc T4 [72]. L1 D-TLB: 128-entry FA TLB (all page sizes).
Oracle Sparc M7 [59]. Same as Sparc T4.
Sun UltraSparc III [76]. D-TLBs: 2-way SA 512-entry TLB (8KB); 16-entry FA TLB (superpages and locked 8KB).
Table 2.1: Commercial D-TLB designs; all the L2-TLBs are unified except for the AMD systems.
The various TLB designs in Table 2.1 are also annotated with the page-size(s) they support.
In each system, the memory allocation algorithm and, in some cases user hints/requests, influ-
ence the page size a given virtual address will belong to. As mentioned earlier, only the virtual
address and the process ID are known before a translation is retrieved. Since the page size a
virtual address belongs to is unknown at translation time, special care is needed to avoid using
page-offset bits as the TLB index. Figure 2.2 illustrates the virtual address bits commonly
used to index a set-associative TLB for each of the supported x86 page sizes (i.e., 4KB,
2MB, and 1GB). This example assumes a TLB with 64 sets. The tag and set-index bits form
the page number, while the remaining low-order bits form the page offset. Unfortunately, set
index bits for one page size can be page offset bits for another page size. If page-offset bits
are used as a TLB-index, a translation for a single page could reside in multiple TLB-set(s)
depending on the part of the page being accessed (i.e., the page offset).
[Figure 2.2 shows, for 4KB, 2MB, and 1GB pages, how a 64-bit virtual address splits into page-offset bits (bits [11:0], [20:0], and [29:0] respectively), the six set-index bits directly above the offset, and the remaining tag bits.]
Figure 2.2: SA TLB indexing for a TLB with 64 sets (x86 architecture).
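The ambiguity in Figure 2.2 can be demonstrated with a short sketch (the addresses are illustrative assumptions): indexing with the 4KB scheme scatters accesses to a single 2MB page across multiple sets.

```python
SETS = 64  # 6 set-index bits, as in Figure 2.2

def tlb_set(vaddr, page_size):
    """Set index if the TLB is indexed assuming the given page size."""
    offset_bits = page_size.bit_length() - 1
    return (vaddr >> offset_bits) % SETS

base = 0x4000_0000                       # assume this lies within one 2MB page
a, b = base, base + 0x5000               # two accesses to that same 2MB page
assert tlb_set(a, 4096) != tlb_set(b, 4096)        # 4KB indexing: different sets
assert tlb_set(a, 2 << 20) == tlb_set(b, 2 << 20)  # 2MB indexing: same set
```

A single 2MB translation would thus have to be replicated across sets (or missed) under 4KB indexing, which is exactly why the page size must be known, or predicted, before indexing.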
Current systems usually mitigate this set-indexing issue either by implementing a fully-
associative TLB where only a single set exists (e.g., AMD 12h family, SPARC M7) or by
implementing split TLB designs where multiple TLB structures exist, one per page-size (e.g.,
Intel’s Haswell). These split TLBs are all probed in parallel adjusting the TLB index based on
the page size they support. Fully-associative designs have the shortcoming of a slower access
latency, in addition to being less energy efficient than less associative structures, while split
designs have the shortcoming of wasted energy as at most one of all the split-TLB lookups will
be useful. Chapter 3 presents an evaluation of these different design choices, while Chapter 4
further examines their shortcomings and proposes alternative TLB design choices that are
superpage friendly.
A few systems do support multiple page-sizes in a set-associative design presumably either
via multiple sequential lookups or by splitting a single superpage translation into multiple
translation entries of the supported page size. The fractured uTLB in Knights Landing takes
an alternative approach by fracturing the translation of any page greater than 4KB and holding
only the translation(s) for the 4KB parts of the page being accessed.
The TLB organization affects the TLB access latency which is time-critical because it pre-
cedes every cache lookup. In systems with a Virtually-Indexed and Physically-Tagged (VIPT)
L1 cache, the translated address is required before the L1 tag comparison. Therefore, the
L1 TLB organization where a hit is hopefully the common case should meet some tight timing
requirements. But constraints exist from the perspective of the paging infrastructure as well.
For example, the smallest supported page size limits the organization of any L1 VIPT cache;
the cache index bits should fall within the page-offset to avoid being translated, to ensure cor-
rectness. That is, the capacity of a VIPT cache way (i.e., the number of sets multiplied by the
cache block size) should not exceed the smallest page size. For x86-64, a 32KB SA L1 cache
with 64B cache blocks should be at least 8-way SA.
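The constraint above can be written down directly; this sketch computes the minimum associativity for a VIPT cache, given its capacity, block size, and the smallest page size:

```python
def min_vipt_ways(cache_bytes, block_bytes, min_page_bytes):
    """Minimum associativity so that one way's capacity <= the smallest page size."""
    max_sets = min_page_bytes // block_bytes  # index bits must stay within the page offset
    total_blocks = cache_bytes // block_bytes
    return -(-total_blocks // max_sets)       # ceiling division

# A 32KB cache with 64B blocks and a 4KB smallest page needs at least 8 ways.
assert min_vipt_ways(32 * 1024, 64, 4 * 1024) == 8
```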
TLBs can be classified into hardware-managed and software-managed. In architectures
with hardware-managed TLBs, like the x86 Instruction Set Architecture (ISA), a TLB miss is
serviced via a hardware state-machine that walks the page tables and delivers the translation
entry, if any is present, to the TLB. Hardware management is minimally intrusive; it does
not require flushing the processor’s pipeline and does not pollute the instruction and data
caches. However, this comes with the overhead of an inflexible page table organization since
the organization specifics need to be fixed for all hardware that supports the same ISA. On the
other hand, software-managed TLBs allow for a more flexible page table design.
In ISAs with software-managed TLBs, like SPARC, a TLB miss triggers an interrupt and
it is a software interrupt handler routine that walks the page table and refills the TLB.
Unfortunately, the use of precise interrupts requires flushing the core's pipeline. Therefore, the
flexibility of hardware-agnostic page tables comes at a cost. In some SPARC systems, namely
the UltraSPARC CPU family, the interrupt handler checks the TSB before walking the page
tables. The TSB is a direct-mapped, virtually-addressable data structure that caches transla-
tions; think of it as a software cache that logically lives between the TLBs and the page tables.
The TSB can be accessed with a single memory access and can thus avoid most of the page-walk latency.
Chapter 5 provides additional details on how the TSB is accessed and presents a TSB-inspired
cacheable TLB for systems with hardware-managed TLBs.
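As a rough sketch of how such a software cache behaves (the entry layout and indexing below are simplifying assumptions, not the exact UltraSPARC format):

```python
def tsb_probe(tsb, vpn, context):
    """Direct-mapped lookup: index with the low VPN bits; a single
    memory access either yields the cached translation or misses,
    in which case the handler falls back to the full page walk."""
    entry = tsb[vpn % len(tsb)]
    if entry is not None and entry["vpn"] == vpn and entry["context"] == context:
        return entry["ppn"]
    return None  # TSB miss: walk the page tables

tsb = [None] * 512
tsb[7] = {"vpn": 512 + 7, "context": 3, "ppn": 0xABC}
assert tsb_probe(tsb, 512 + 7, 3) == 0xABC
assert tsb_probe(tsb, 7, 3) is None  # conflict: a different VPN occupies the slot
```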
All types of TLBs, both hardware and software-managed, need to be kept coherent with the
page tables. For example, if any modifications happen to the access permissions of a page, the
relevant TLB entry should be updated or invalidated. This change should also be communicated
to other cores in the system (TLB shoot-down). Another scenario in which TLB entries need to
be invalidated is a context switch. Some systems have enhanced the TLB entries with an
Address Space Identifier (ASID), to allow translations from multiple processes to cohabit the
TLB. The absence of an ASID forces the system to flush the entire TLB. Finally, some systems
provide the option of invalidating entries within a given address range.
2.2.3 The Cheetah-MMU in SPARC
The MMU implementation varies per architecture and encompasses a wide range of features,
such as the TLB organization, the format of a TLB-entry, or any special MMU registers used
in handling TLB-misses. This work uses and extends Flexus from the SimFlex project [35], “a
family of component-based C++ computer architecture simulators”, based on Simics [52] that
models the SPARC v9 ISA [74] and supports full-system simulation. Therefore, this section
presents relevant details for the Cheetah-MMU [84], the MMU used in the UltraSparc-III [76]
processors, and also draws parallels, wherever applicable, with x86.
One might naturally ask how measuring behaviour in one system can be indicative of be-
haviour in other systems. For the purpose of this work, we used the existing MMU to collect
memory traces with address translation information, such as the virtual to physical mapping
of a memory access. Collecting memory accesses before the TLB, and not a TLB miss stream,
allowed us to simulate different TLB configurations. The traces also include non-translating
accesses.
As Section 2.2.2 discussed, the TLBs in Cheetah-MMU are software-managed. On a TLB
miss, a trap handler walks the page tables and refills the TLB. The UltraSparc-III processors
use the Trap Type (TT) register to track the most recent trap (multiple trap levels exist). A
D-TLB miss triggers a fast_data_access_MMU_miss trap with trap type 0x68. Section 5.5.1
reviews the Cheetah-MMU trap handler in more detail. The remainder of this section reviews:
(i) the MMU registers used in TLB miss handling, (ii) the TLB organization and replacement
policy, and (iii) the special MMU operations that keep the TLB coherent with the page tables.
2.2.3.1 MMU Registers and TLB Miss Handling
The Cheetah-MMU has various special registers. Of particular interest when a TLB miss occurs
are the TLB Tag Access register and the Data In register [76]. On a TLB miss, the Tag Access
register contains the only information known at that point: (a) the virtual
address bits [63:13] of the missing address (i.e., the 8KB VPN; since the page size is not yet known,
the smallest page size is assumed), and (b) a 13-bit context identifier.
The context identifier, also present in TLB entries, allows translations from multiple pro-
cesses, and thus address-spaces, to co-exist in the same structure. With a context ID, systems
can avoid invalidation of all TLB entries in case of a context switch. The terminology for iden-
tifiers similar to context ID that are used in address translation varies greatly. For example,
Address Space Identifier is another term commonly used in literature, while in x86-64 the term
is Process Context Identifier (PCID), a 12-bit value [39]. During a TLB lookup, the context ID
for the currently running process, usually kept in a separate register, is compared against the
context stored in a given TLB entry. The only exception is when the global bit of the corre-
sponding TLB entry is set. If this is the case, no context comparison takes place; the virtual
page comparison suffices.
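The hit condition just described can be sketched as follows (field names are illustrative):

```python
def tlb_entry_hits(entry, vpn, current_context):
    """A TLB entry hits if the virtual page matches and either the
    global bit is set (context comparison skipped) or the entry's
    context equals the running process's context ID."""
    if entry["vpn"] != vpn:
        return False
    return entry["global"] or entry["context"] == current_context

kernel = {"vpn": 0x10, "context": 0, "global": True}
user = {"vpn": 0x20, "context": 5, "global": False}
assert tlb_entry_hits(kernel, 0x10, 7)    # global: matches under any context
assert tlb_entry_hits(user, 0x20, 5)
assert not tlb_entry_hits(user, 0x20, 7)  # wrong context
```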
Once the translation is retrieved (e.g., after a page walk), it is loaded into the 64-bit Data In
register. This register has various fields but the following are the most relevant for this work:
(a) a 2-bit page-size field that distinguishes between the four supported page sizes of 8KB, 64KB,
512KB, and 4MB; (b) physical address bits [40:13] (the 8KB PPN; additional least-significant bits
are masked according to the page-size field); (c) a global bit, explained earlier; and (d) a locked
bit that indicates whether this translation can be a TLB replacement candidate; translations
with the locked bit set are called locked or pinned. The writeable flags, the privileged flags, and
the bits that determine cacheability are a few examples of other fields.
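The masking of the 8KB-granularity PPN by the page-size field can be sketched as follows (the 2-bit encodings are assumed here to index the sizes in ascending order):

```python
# Page sizes selected by the 2-bit field, assumed in ascending order.
PAGE_BYTES = {0: 8 << 10, 1: 64 << 10, 2: 512 << 10, 3: 4 << 20}

def effective_ppn(ppn_8k, size_field):
    """Mask the least-significant bits of the 8KB-granularity PPN:
    a 64KB page spans eight 8KB frames, so its low 3 PPN bits are
    ignored; 512KB masks 6 bits; 4MB masks 9 bits."""
    extra_bits = (PAGE_BYTES[size_field] // (8 << 10)).bit_length() - 1
    return ppn_8k & ~((1 << extra_bits) - 1)

assert effective_ppn(0b10111, 0) == 0b10111  # 8KB: nothing masked
assert effective_ppn(0b10111, 1) == 0b10000  # 64KB: low 3 bits masked
```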
2.2.3.2 TLB Organization and Replacement Policy
The Cheetah-MMU uses two separate D-TLBs to support translations of different page-sizes
as well as locked translations. As Table 2.1 showed, a 512-entry 2-way set-associative TLB
only hosts 8KB pages, while a smaller 16-entry fully-associative TLB holds translations for
superpages (i.e., non-8KB page sizes) and locked translations of any page-size. Both structures
are probed during a TLB-lookup; on a TLB miss the retrieved translation is installed in the
appropriate structure based on the page-size and locked fields discussed earlier.
Each TLB entry in the Cheetah-MMU TLBs also has a used bit associated with it; this bit
is set on a TLB hit, speculative or otherwise, and it is used to identify a replacement candidate
when a TLB set is full [76]. If no invalid entries exist in the current TLB set, the first unlocked
entry with a used-bit set to zero is selected. If no replacement candidate is identified, all used
bits are reset and the process repeats.
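The policy amounts to a not-recently-used scheme with pinning; a sketch follows (entry fields are illustrative):

```python
def pick_victim(tlb_set):
    """Select a replacement candidate in one TLB set: an invalid entry
    if one exists; otherwise the first unlocked entry whose used bit is
    clear; if none qualifies, reset all used bits and repeat."""
    for i, e in enumerate(tlb_set):
        if not e["valid"]:
            return i
    if all(e["locked"] for e in tlb_set):
        return None  # every entry is pinned: nothing can be evicted
    while True:
        for i, e in enumerate(tlb_set):
            if not e["locked"] and not e["used"]:
                return i
        for e in tlb_set:  # all unlocked entries were recently used
            e["used"] = False

tlb_set = [
    {"valid": True, "locked": True,  "used": True},
    {"valid": True, "locked": False, "used": True},
]
assert pick_victim(tlb_set) == 1  # used bits reset, then entry 1 is chosen
```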
2.2.3.3 Special MMU Operations
Until now, the main focus was on how to retrieve the translation information after a TLB
miss. However, it is crucial for the MMU to have the ability to modify the TLB contents
to ensure they are coherent with the page tables. Correctness is non-negotiable. There are
two types of such modifications: invalidations of TLB-entries, often referred to as demaps or
demappings, and modifications of the contents of TLB-entries, sometimes referred to in this
work as remappings.
In the Cheetah MMU, these operations are initiated by specialized store instructions (STXA
opcode). These instructions use specific Address Space Identifiers (ASIs); an ASI is an 8-bit value
that specifies the address space. In SPARC v9, bit 13 of the instruction (counting from zero) specifies the ASI's
location. If the bit is zero, the ASI is explicitly encoded in the instruction (bits 5-12 inclusive),
while if it is one, the ASI is held in the ASI register. An ASI with its most significant bit set to
zero corresponds to a restricted ASI, one only accessible by privileged software [74]. Accesses to
MMU registers usually involve one of these special ASIs explicitly encoded in the instruction.
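The encoding rule can be sketched as follows (the instruction word is simplified to just the fields named above):

```python
def decode_asi(instr_word, asi_register):
    """Bit 13 selects the ASI source: 0 means the ASI is encoded in
    instruction bits 12:5; 1 means it comes from the ASI register.
    An ASI with its MSB clear (< 0x80) is restricted (privileged)."""
    if (instr_word >> 13) & 1:
        asi = asi_register & 0xFF
    else:
        asi = (instr_word >> 5) & 0xFF
    return asi, (asi & 0x80) == 0

# A demap store with ASI 0x5F encoded immediately (bit 13 clear):
asi, restricted = decode_asi(0x5F << 5, asi_register=0)
assert asi == 0x5F and restricted  # MSB clear: privileged-only ASI
```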
Demaps: “stxa %g0, [%g1 + %g0] 0x5f # ASI_DMMU_DEMAP” is a disassembled demap
instruction; 0x5F is the D-MMU TLB DEMAP ASI. The virtual address denoted in square
brackets, here the sum of the contents of global registers g1 and g0, contains the demap type,
the virtual address bits, and a 2-bit field that indicates the register that contains the context
ID. The store value itself is ignored. The restricted ASI is explicitly encoded in the
instruction.
In UltraSparc-III, two types of demaps exist: (a) a demap page type that can invalidate
at most one TLB entry associated with an instruction-encoded VPN and context, and (b) a
demap context type that invalidates all TLB entries associated with a given context, if their
global bit is not set. Locked translation entries can also be demapped like all others.
The functionality described above is also provided in other non-SPARC architectures. For
example, x86-64 supports similar operations with dedicated instructions. Intel reports that
the INVLPG instruction can invalidate all translation entries for a given page number, while
the INVPCID instruction has four different operation modes that can invalidate mappings of
a specific address, all mappings of a specific context (similar to demap-context), or map-
pings of all contexts with the option to either include or exclude any translations marked
as global [39, Section 4.10.4.1].
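The two demap types can be modelled over a list of TLB entries as follows (a simplified sketch; entry fields are illustrative, and the real demap-page operates on at most one matching entry):

```python
def demap_page(tlb, vpn, context):
    """Invalidate the entry (if any) matching this VPN and context;
    locked entries are demappable like all others."""
    return [e for e in tlb if not (e["vpn"] == vpn and e["context"] == context)]

def demap_context(tlb, context):
    """Invalidate every entry of the given context whose global bit is clear."""
    return [e for e in tlb if e["global"] or e["context"] != context]

tlb = [
    {"vpn": 1, "context": 3, "global": False, "locked": True},
    {"vpn": 2, "context": 3, "global": True,  "locked": False},
    {"vpn": 3, "context": 4, "global": False, "locked": False},
]
assert len(demap_page(tlb, 1, 3)) == 2  # locked entry is still removed
assert len(demap_context(tlb, 3)) == 2  # the global entry survives
```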
TLB-Entry Modifications: TLB-entry modifications occur via instructions that directly
modify the MMU Data Access register (0x5d is the relevant ASI). Here is an example in-
struction: “stxa %o3, [%o1 + %g0] 0x5d # ASI_DTLB_DATA_ACCESS_REG”. These are OS-
directed writes to a specific TLB-entry. The virtual address of this write (store) operation
specifies the TLB-entry to be modified (overwritten), while the store value specifies the new
TLB data.
2.2.4 Address Translation for I/O
The previous sections discussed the basics of address translation from the perspective of the
core’s MMU. However, a different path to memory exists via Direct Memory Access (DMA) from
I/O devices. An I/O Memory Management Unit (IOMMU) provides address translation and
memory protection to I/O accesses that would previously directly access physical memory [2,37].
This added functionality provides a level of protection from misbehaving drivers, as well as the
necessary hardware support for I/O virtualization.
With an IOMMU, an address translation step is now introduced in the critical path of
every DMA access. I/O TLBs are used to avoid the costly walk of the I/O page tables. The
IOMMU is commonly located on an I/O hub or bridge (PCI bridge in our simulated system)
and can serve multiple devices. A PCI bridge is a system component that connects multiple
buses together. Figures 2.3 and 2.4 show two examples of a typical I/O architecture, focusing
on network and disk traffic respectively, the two dominant types of I/O traffic. For the
network traffic, DMA accesses initiated by the network adapter reach the memory of the server
system after crossing a PCI bus and a PCI bridge where the IOMMU resides. For disk I/O,
Figure 2.4 shows a snapshot of one SCSI disk array from a system that contained multiple
arrays of fiber channel SCSI disks connected to multiple PCI bridges via a hierarchy of PCI
Figure 2.3: Network I/O - System Snapshot
Figure 2.4: Disk I/O - System Snapshot
buses. DMA accesses initiated by the Fibre SCSI controller in this snapshot cross the PCI-to-
PCI bridge as well as the Host-to-PCI bridge before reaching memory. The IOMMU module is
now located at the Host-to-PCI bridge, the root node of this hierarchy of buses.
SPARC also contains an IOMMU. In the Serengeti server systems, the IOMMU is located
in Schizo, the host-to-PCI bridge [84]. It has a 16-entry fully-associative TLB and supports
8KB and 64KB page sizes [77]. Selective flushing of TLB entries is permitted via programmable
I/O operations. On a TLB miss, the IOMMU looks up the Translation Storage Buffer (TSB),
a software-managed, direct-mapped, in-memory data structure. The TSB serves as the page table
here. On a TSB miss, an error is returned to the device that initiated the DMA access.
2.3 Literature Review of Related Work
This section reviews past work that optimizes some aspect of the address translation process
either via hardware optimizations or hardware/software co-design techniques. Past research
has been grouped in the following thematic categories: (a) techniques that reduce the number
of TLB misses, (b) techniques that reduce the latency overhead of a TLB miss, (c) techniques
that revisit address translation/paging, (d) techniques that reduce address translation energy,
(e) techniques that address TLB coherence overheads, (f) techniques that target I/O address
translation, and lastly (g) techniques that leverage address translation facilities to optimize dif-
ferent system aspects (slightly deviating from the prior classification). Some optimizations cross
the boundaries of these categories. Also, many techniques are orthogonal to each other. As is
often the case in architecture research, no perfect design exists. As the applications and the
underlying hardware change, both new opportunities and challenges arise. Sections 2.3.1 - 2.3.7
review the aforementioned thematic categories and also discuss how this thesis relates to them.
2.3.1 Techniques that Reduce TLB Misses
TLBs, similar to caches, capture workloads' memory access behaviour, albeit at a coarser gran-
ularity. Increasing the TLB hit rate, which is already high due to the spatial and temporal
locality forces at work, is one of the main approaches to alleviate the address translation over-
head. This section classifies research works that reduce TLB misses into the following three
categories: (i) research works that employ TLB prefetching, (ii) research proposals for shared
or distributed TLB designs that exploit translation sharing across CMP cores, and (iii) research
techniques that extend the reach of each TLB-entry by revisiting the amount of information it
tracks; translation coalescing is one such example. The straightforward solution of increasing
the TLB capacity is not discussed; it is not sustainable under the same timing constraints that
bound L1 cache sizes.
2.3.1.1 TLB Prefetching
Prefetching is employed in caches to anticipate future data use based on previously seen memory
patterns. Prefetching has been proposed for TLBs too, first for uniprocessors in the early 2000s
and later for multiprocessors near the end of that decade.
Saulsbury et al. were the first to propose a hardware-based TLB prefetching mechanism [68].
Their recency-based prefetcher targets deterministic iterative TLB misses. These are charac-
teristic of applications which iteratively access data structures in the same order, but suffer
capacity TLB misses. The proposed scheme maintains a temporal ordering of virtual pages in
an LRU stack by adding a previous pointer and a next pointer to each page table entry. On
a TLB miss, when this ordering is updated, the entries adjacent to the requested PTE (those
referenced by the aforementioned pointers) are prefetched. All predicted translations are first placed into a
prefetch buffer, and are only promoted to the TLB on a hit, thus minimizing TLB pollution
from bad prefetches. For a set of five applications, the recency-based prefetcher correctly pre-
dicts between 12% and ∼59% of TLB misses for a 64-entry Fully-Associative (FA) TLB, assuming
8KB pages. Applications with a regular stride access pattern benefit the most, and in all cases,
the proposed scheme consistently outperforms a linear, next-page(s) prefetcher.
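A minimal sketch of the recency idea follows; an OrderedDict stands in for the prev/next pointers that the scheme embeds in the page table itself:

```python
from collections import OrderedDict

class RecencyPrefetcher:
    """On each page access, move the page to the top of an LRU stack;
    on a TLB miss, suggest the pages adjacent to the missing page in
    recency order as prefetch candidates."""
    def __init__(self):
        self.stack = OrderedDict()  # page -> None, most recent last

    def access(self, page):
        self.stack.pop(page, None)
        self.stack[page] = None

    def candidates(self, missing_page):
        pages = list(self.stack)
        if missing_page not in self.stack:
            return []
        i = pages.index(missing_page)
        return [pages[j] for j in (i - 1, i + 1) if 0 <= j < len(pages)]

p = RecencyPrefetcher()
for page in (10, 11, 12):
    p.access(page)
assert p.candidates(11) == [10, 12]  # neighbours in recency order
```

The candidates would be installed in a prefetch buffer, not the TLB, exactly to limit pollution from bad prefetches.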
Kandiraju et al. proposed a TLB prefetching mechanism that captures memory reference
patterns in the form of distances [42]. Distance is the difference in pages between two con-
secutive memory accesses in the TLB-miss stream. The proposed scheme stores previously
seen distances in a distance-indexed hardware table. For example, if the current TLB miss
is to page a and the current distance points to an entry with distances two and five, then
translations for pages a+2, and a+5 will be prefetched. One of the main benefits of the dis-
tance prefetcher is that it can perform well with limited hardware storage. For example, if
all TLB misses of a workload had the same stride (distance), then just a single-entry distance
table would provide full coverage. The recency-based prefetcher [68], discussed earlier, does
not have any on-chip storage constraints but at the expense of larger in-memory page tables.
The proposed distance prefetching scheme achieves the highest average prediction accuracy across
the simulated workloads when compared against a variety of stride- and history-based prefetching
schemes, some originally proposed for caches, as well as against the recency-based prefetcher. The latter, however,
has slightly higher average accuracy than distance prefetching when the accuracy is weighted
by each application’s TLB miss rate. The authors attribute this behaviour to a few applications
with high TLB miss rates that benefit from a long history.
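The mechanism can be sketched as a small distance-indexed table (a simplification: the real proposal also tags and size-bounds the table):

```python
class DistancePrefetcher:
    """Track the distance between consecutive missing pages; on a miss,
    replay the distances previously observed to follow the current one."""
    def __init__(self):
        self.table = {}            # distance -> set of distances seen next
        self.last_page = None
        self.last_distance = None

    def on_miss(self, page):
        prefetches = []
        if self.last_page is not None:
            distance = page - self.last_page
            if self.last_distance is not None:
                self.table.setdefault(self.last_distance, set()).add(distance)
            # e.g., if this entry holds distances 2 and 5, prefetch page+2, page+5
            prefetches = sorted(page + d for d in self.table.get(distance, ()))
            self.last_distance = distance
        self.last_page = page
        return prefetches

p = DistancePrefetcher()
assert p.on_miss(10) == []
assert p.on_miss(12) == []    # distance 2 seen for the first time
assert p.on_miss(14) == [16]  # distance 2 repeats: predict 14 + 2
```

A single table entry suffices for a perfectly strided miss stream, which is the storage advantage noted above.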
All aforementioned prefetching schemes were evaluated in a uniprocessor environment.
Bhattacharjee et al. were the first to characterize TLB misses of parallel workloads (PARSEC)
in a CMP environment and to propose prefetching schemes to address them [20, 21]. The first
class of misses they identified, inter-core shared, captures TLB misses to the same virtual page
across different cores, representative of multi-threaded workloads that access the same instruc-
tions and data. A leader-follower prefetching scheme is proposed, which pushes translations
into the prefetch buffers of other cores (the sharers) under the rationale they will miss on the
same page. A confidence mechanism filters useless or harmful prefetches.
The second class of TLB misses, inter-core predictable stride, captures misses to virtual
pages that are accessed by different cores within a fixed timeframe and are stride pages apart
from each other. The hypothesis is that if core i accesses V PNa, then it is possible that core j
will access V PNb that is V PNa plus some stride. Such behaviour is reflective of data-parallel
applications where threads running on different cores operate on different subsets of data but
follow the same access pattern. Because memory accesses to different pages across cores can
be reordered in time, the proposed distance-based cross-core prefetching scheme
keeps track of distances between consecutive per-core TLB misses. These distance pairs are
stored in the distance table, a hardware structure that is shared across all cores, and they drive
distance-based prefetches on other CMP cores.
Even though this thesis does not evaluate the use of prefetching, prefetching optimizations
are orthogonal to our work. Prefetching could be added both to the superpage friendly TLB
designs in Chapter 4 and the FMN design in Chapter 5. For FMN, a straightforward prefetching
implementation would trigger more FMN and page-table accesses, assuming the prefetching
candidate does not exist in a hardware structure on chip. By default, useless translation
prefetches can unnecessarily increase the memory bandwidth or displace useful data in the
cache hierarchy by bringing additional page table entries on chip. This behaviour could be
further exacerbated with an FMN that is also cacheable like the page tables. A judicious
feedback mechanism to throttle prefetches might be needed, if the existing FMN probe filtering
mechanism (Section 5.9.7) proves insufficient. Alternatively, prefetches could only probe either
the FMN or the page tables and not both. The former would result in a faster prefetch, if the
translation entry has been previously seen and is present in the FMN, while the latter would
avoid the risk of FMN displacing useful data. Both the page walk and FMN probes also perform
indirect prefetching: within each accessed 64B cache block, whether FMN or page table, multiple
translations co-reside, usually for spatially adjacent virtual pages. All these translations are
naturally moved higher in the cache hierarchy, i.e., closer to the cores, when another translation
in that cache line is accessed.
2.3.1.2 Shared or Distributed-Shared TLB Designs
In a CMP environment different applications can either share memory access patterns and data
(e.g., parallel applications) or have very different resource requirements (e.g., multiprogrammed
applications). Even though today’s cache hierarchies often include one or more shared cache
levels to facilitate data sharing, a per core private TLB hierarchy continues to be the norm.
Two proposals in 2010 - 2011, one for a shared TLB [19] and one for a distributed-shared TLB
design [75], were the first to target this research gap. In both cases, the proposed TLB designs
appear to only cater to the smallest supported page size.
A shared last-level TLB design [19] was proposed to better utilize the available TLB capacity
compared to private TLBs. Having a shared structure avoids translation replication across
private TLBs, expanding the effective on-chip TLB capacity, and thus the TLB hit-rate, for
parallel applications. It also allows multi-programmed applications to freely contend for the
entire shared TLB capacity without the constraints of per-core TLBs, which can be especially
beneficial for applications with unbalanced TLB requirements. Integration of stride prefetching
in a shared TLB design is orthogonal, yielding additional miss reduction. The proposed design
was however a monolithic structure with increased access time compared to private TLBs and
potentially poor scalability.
Synergistic TLBs is an alternative design that aims to combine the short access latency of
private TLBs with the better utilized capacity of the shared TLB paradigm [75]. The proposal
is for a distributed TLB design that allows evicted translation entries from borrower TLBs to
spill to remote TLBs that have been dynamically classified as donors. On top of the distributed
design, synergistic TLBs also permit heuristic-based translation replication and migration. The
former allows translations to be replicated across cores to avoid long access times to remote TLBs,
while the latter migrates translations to cores that are likely to access them to better utilize
the available TLB capacity.
Neither the shared nor the synergistic TLB designs were shown to support multiple page
sizes. Using any of our superpage-friendly TLB designs from Chapter 4 as a shared or dis-
tributed TLB could further benefit performance or energy as it would increase the TLB reach.
Also, it would be straightforward to probe our FMN design (Chapter 5) on a shared or a syn-
ergistic TLB miss. One of the shortcomings of a shared TLB is having a monolithic hardware
structure that might not scale well as the number of cores on a chip grows. The FMN design
can be easily configured to be a shared one; all TLB miss controllers need to share the same
FMN base address (see Section 5.4). A shared FMN straddles the ground between a shared
and a distributed structure. Multithreaded workloads that share data across cores can fully
utilize FMN’s capacity without any translation replication. FMN’s cacheable nature allows the
same FMN entries to simultaneously exist in multiple private caches, thus reducing TLB access
latency, albeit at the risk of more useful-data displacement. One could also envision different
FMN organizations where a subset of CMP cores share an FMN while other FMNs are private,
a potentially beneficial configuration for systems running virtual machines or multiprogrammed
workloads.
2.3.1.3 Increasing TLB Reach
Prefetching and shared/distributed structures that facilitate translation sharing both rely on
the traditional reach of each TLB translation entry. An alternative way to increase the TLB hit-
rate is extending the reach of each individual TLB entry. TLB entries that support superpages
already do that. However, these entries simply adhere to the decision made by the OS’s memory
allocation algorithm. This section reviews research that extends the reach of TLB entries beyond
what was decided at the OS level.
Talluri and Hill were the first to explore this research avenue in the mid-1990s with two
subblocking TLB designs [78]. The complete-subblock TLB allows a subblock-factor number of
contiguous virtual page numbers to share a single TLB tag, with separate data fields for the
physical frame numbers these VPNs map to. With this design, each TLB entry has a similar
reach as a superpage that is subblock-factor times greater than the smallest supported page size,
albeit at the cost of more hardware resources. A partial-subblock TLB, on the other hand, eases
the area overhead by requiring all PPNs to fall within an aligned memory region, share attribute
bits, and properly align with their VPNs. The partial-subblocking TLB entries are closer in size
to a superpage TLB entry and, contrary to it, do not require any OS support. However, because
not all virtual pages within a subblock might meet these requirements, multiple instances of
the same VPN tag could coexist in the TLB unless the valid bits for all of a subblock's VPNs are
combined with the tag.
Almost two decades later, Pham et al. proposed translation coalescing [58]. They observed
the presence of intermediate degrees of contiguity where a group of contiguous VPNs maps
to contiguous PPNs, but this contiguity does not suffice for the contiguous VPN region to be
promoted to a superpage. Their proposed design, CoLT, coalesces such VPN groups, maintaining
a single TLB entry for each. They report a 40% to 57% average TLB miss reduction on a set of
SPEC CPU 2006 and Biobench (bioinformatics) workloads while limiting the maximum number
of coalesced translations to four. CoLT modifies the TLB set-indexing scheme to support
larger coalesced entries, but at the expense of more conflict misses. Contrary to the subblocking
designs discussed earlier, CoLT does not have any alignment restrictions. However, CoLT’s
potential is inherently limited to the “contiguous spatial locality” available in a given system,
which can be scarce in the presence of fragmentation.
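Detecting the intermediate contiguity CoLT exploits can be sketched as follows (a hypothetical helper over a VPN-to-PPN map):

```python
def coalescable_run(mapping, vpn, max_coalesce=4):
    """Count how many consecutive VPNs starting at `vpn` map to
    consecutive PPNs; CoLT would keep one TLB entry for the run
    (capped at four coalesced translations, as in the evaluation)."""
    base_ppn = mapping[vpn]
    n = 1
    while n < max_coalesce and mapping.get(vpn + n) == base_ppn + n:
        n += 1
    return n

mapping = {100: 500, 101: 501, 102: 502, 103: 900}
assert coalescable_run(mapping, 100) == 3  # VPN 103 breaks the contiguity
```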
Pham et al. later relaxed CoLT's requirement for contiguous VPNs to map to contiguous PPNs
and proposed the use of clustering to extend TLB reach [57]. Similar to the partial subblock
TLBs, each TLB entry in this clustered TLB maps a set (cluster) of contiguous VPNs. In both
cases, this VPN cluster needs to be properly aligned; all VPNs in a cluster share the same
VPN bits except for the lower log2(cluster factor) bits. The same alignment requirement
applies to the PPN cluster too. However, unlike the partial subblock TLB design, these VPNs
can map anywhere within an equally sized and properly aligned cluster of PPNs. Holes, that is,
VPNs that do not map to any PPNs in that cluster, are also permitted. These two differences
allow the clustered TLB to capture more cluster locality than CoLT and without any OS
changes. But, as the authors observe, not all translation mappings exhibit such cluster locality;
having too many holes within a cluster would unnecessarily waste resources. Therefore, they
propose a multi-granular TLB design where a clustered TLB and a conventional TLB are both
probed in parallel. This design is further enhanced with a frequent value locality optimization
that will be discussed in Section 3.7. Bhattacharjee reports that “TLB coalescing schemes are
being adopted by industry (e.g., AMD’s Zen chip supports TLB coalescing today)” [17]. AMD’s
“Zen” microarchitecture supports “PTE coalescing [that] [c]ombines 4K page tables into 32K
page size” [27].
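A sketch of building one clustered-TLB entry under these rules (an illustrative representation, not the paper's exact format):

```python
def build_cluster_entry(mapping, vpn, cluster_factor=4):
    """The entry covers the aligned VPN cluster containing `vpn`.
    Each slot holds the PPN's offset inside an equally sized, aligned
    PPN cluster; holes (None) are allowed, but a PPN outside that
    cluster makes the mapping uncoverable by a single entry."""
    bits = cluster_factor.bit_length() - 1
    vbase = (vpn >> bits) << bits
    pbase = (mapping[vpn] >> bits) << bits
    slots = []
    for i in range(cluster_factor):
        ppn = mapping.get(vbase + i)
        if ppn is None:
            slots.append(None)             # hole: permitted
        elif (ppn >> bits) << bits == pbase:
            slots.append(ppn - pbase)      # may map anywhere in the cluster
        else:
            return None                    # falls outside the PPN cluster
    return {"vpn_base": vbase, "ppn_base": pbase, "slots": slots}

entry = build_cluster_entry({8: 20, 9: 23, 11: 21}, vpn=8)
assert entry == {"vpn_base": 8, "ppn_base": 20, "slots": [0, 3, None, 1]}
```

Unlike CoLT's contiguous runs, the PPN offsets here need not follow the VPN order, which is exactly the extra cluster locality this design captures.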
Our work relies on page contiguity identified by the OS and does not exploit any interme-
diate degrees of contiguity. The proposed coalescing schemes coalesce translations for only one
page size per set-associative structure. For example, CoLT proposed CoLT-SA coalescing for
a set-associative TLB that supports the smallest page size, while a separate design, CoLT-FA,
coalesced translations for the fully-associative TLB that supports superpages. Coalescing sup-
port for multigrain set-associative structures (i.e., structures that support multiple page sizes)
is far from straightforward, especially since coalescing already requires modification of the set-
indexing scheme. Configuring FMN to support CoLT-like or clustering contiguity might be
possible, if the risk of wasted resources due to holes in a cluster is mitigated. Having a separate
small cacheable FMN that tracks such groups of pages might be an alternative.
2.3.2 Techniques that Reduce TLB Miss Latency Overhead
Unlike the aforementioned TLB miss reduction works, this section reviews research that targets
the TLB miss latency overhead. That is, if one cannot reduce the number of TLB misses, is
it possible to make them less costly? MMU caches and speculative translations are two such
options.
MMU caches, employed by many current commercial designs, logically reside between the
TLB hierarchy and the page tables. By caching parts of the page walk, MMU caches reduce
the number of page-walk required memory accesses and thus the TLB-miss latency. The main
insight here is the presence of temporal locality in the high levels of a multi-level page table,
i.e., memory accesses that share the most significant virtual address bits. AMD64 processors
employ a Page Walk Cache (PWC) [3, 15], a “fully-associative, physically-tagged page entry
cache” that hosts page entries from all page table levels but the last one. This MMU cache
type is also referred to as page table cache [10] because it provides the physical address for
the next (lower) level page table. Intel’s processors employ paging structure caches [39], also
referred to as translation caches [10]. Contrary to the page table caches, translation caches are
virtually tagged and a single entry can skip more than one memory access. A PML4 cache
skips accessing the topmost page-table level, a PDPT-entry cache skips the top two levels,
while a PDE cache can skip the top three [39]. Barr et al. first explored the effect of these
types of MMU caches [10], including the newly proposed translation-path cache, while more
recently Bhattacharjee proposed coalescing and sharing modifications [16], grounded in the
same observations that guided the CoLT and shared TLB designs.
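The latency benefit of these translation caches can be illustrated with a simple access-count model (a sketch; tags here are raw virtual-address prefixes, ignoring the caches' real organization and capacity):

```python
def walk_accesses(va, pml4e_cache, pdpte_cache, pde_cache):
    """Memory accesses needed by a 4-level x86-64 walk, given which
    virtually tagged paging-structure caches hit: a PDE hit leaves
    only the leaf PTE access; a PDPTE hit leaves two; a PML4E hit
    leaves three; otherwise all four levels are walked.
    Cache contents are modelled as sets of VA prefixes."""
    if va >> 21 in pde_cache:    # skip the top three levels
        return 1
    if va >> 30 in pdpte_cache:  # skip the top two levels
        return 2
    if va >> 39 in pml4e_cache:  # skip the topmost level
        return 3
    return 4

va = 0x7F12_3456_7000
assert walk_accesses(va, set(), set(), {va >> 21}) == 1
assert walk_accesses(va, set(), set(), set()) == 4
```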
Another mechanism that attempts to hide the page walk latency is SpecTLB [11]. SpecTLB
speculates as to what the virtual to physical translation will be on a TLB miss, allowing for
memory accesses and other useful computation to proceed speculatively and in parallel with
the page table walk. Note that a TLB miss most probably denotes the presence of a cache
miss too. The proposed system takes advantage of the unique characteristics of a reservation-
based memory allocation system (FreeBSD). On a page fault, the OS might choose to reserve a
superpaged-size region (large page reservation) instead of the default small page, if it predicts
that the entire large page reservation is likely to be used. When this large page reservation
is filled, i.e., all small pages within it are accessed, it is promoted to a superpage. SpecTLB
takes advantage of this memory allocation algorithm. Whenever an address that misses in the
TLB falls within a partially filled large-page reservation, SpecTLB provides a speculative
translation based on the assumption that the reservation will eventually be promoted to a single
large page. Even with heuristic reservation detection, SpecTLB overlaps on average 57% of
the page table walks with successful speculative execution.
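The speculation itself is simple interpolation within the reservation; a sketch, under the assumption that reservations are tracked as a map from large-page-aligned virtual bases to physical bases:

```python
LARGE_PAGE = 2 << 20  # assume a 2MB superpage for illustration

def spec_translate(va, reservations):
    """If `va` falls inside a tracked large-page reservation, bet that
    the reservation will be promoted and interpolate the physical
    address; the real page walk later confirms or squashes this."""
    base = va & ~(LARGE_PAGE - 1)
    phys_base = reservations.get(base)
    if phys_base is None:
        return None  # no reservation covers this address: just walk
    return phys_base + (va - base)

reservations = {0x0020_0000: 0x4000_0000}
assert spec_translate(0x0020_1234, reservations) == 0x4000_1234
assert spec_translate(0x0040_0000, reservations) is None
```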
Our FMN proposal (Chapter 5) also targets TLB miss latency reduction. It leverages the
idea that retrieving a translation with a single memory access is faster than a multi-level page
walk. If MMU caches are present in the system, a configuration not evaluated in this thesis,
then the FMN should be probed either in parallel with or after the MMU caches. The MMU
caches’ location and access latency will likely influence this decision.
2.3.3 Techniques that Revisit Address Translation/Paging
Most research works reviewed earlier in this chapter attempt to optimize address translation
within the existing virtual memory paradigm of paging. Nonetheless, some have followed a
different direction, revisiting how virtual memory is supported.
Basu et al. proposed using direct segments to map large contiguous virtual memory regions,
associated with key data structures, to contiguous physical memory regions [12]. Virtual ad-
dresses that belong to a direct segment do not suffer from TLB misses; instead, they are mapped
to physical addresses via minimal hardware that co-exists with the TLBs. The main motivation
was that big-memory workloads not only pay a hefty penalty due to paging, but an unnecessary
one as they do not benefit from the facilities paged virtual memory provides. Specifically, Basu
et al. observe that “For the majority of their address space, big-memory workloads do not re-
quire swapping, fragmentation mitigation, or fine-grained protection afforded by current virtual
memory implementations. They allocate memory early and have stable memory usage.” [12].
As the authors point out, direct segments do not replace paging. The two mechanisms co-exist;
virtual memory addresses outside a direct segment are mapped via paging. Direct segments re-
quire significant software support; the programmer needs to identify a memory region amenable
to this optimization, and the OS needs to consider this in its memory allocation algorithm. The
OS is also responsible for managing the special hardware registers that support direct segment
mappings, e.g., by updating them on a context switch.
Even though direct segments can reap significant benefits for workloads with a single direct
segment that the programmer can easily identify (e.g., database workloads), they cannot
be extended to different application types, they are not transparent to the application, and they
are limited to one segment per application. These limitations are addressed in the Redundant
Memory Mappings (RMM) proposal [45]. In RMM, Karakostas et al. propose range mappings
for multiple “arbitrarily large” and contiguous virtual memory regions that, similar to a direct
segment, each map to a contiguous physical memory region. Translations for these mappings
are hosted in a fully-associative range-TLB, probed in parallel to the conventional L2-TLB, and
a range table, similar to a page table. Page table entries are augmented with a range bit to
specify that a page has a range-table entry. On a last-level TLB miss on both TLB types, a page
walk takes place first. Then, if the range bit is set, the range table is accessed in the background.
This access happens off the critical path and updates the range TLB with a range translation.
Next time an address within this range mapping misses in the L1-TLB, it will hit in the range-
TLB (unless evicted); the relevant page translation will then be installed in the conventional
L1 TLB. Beyond the required architectural support, RMM also requires explicit OS support to
manage range translations and update the range table. The authors also modified the memory
allocation algorithm to support eager paging. This algorithm generates more memory regions
amenable to range mappings during memory allocation, but might inadvertently fragment the
memory space.
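A range translation can be thought of as a (base, limit, offset) triple. The following is a minimal illustrative model of a range-TLB lookup, not RMM's exact implementation:

```python
# Minimal sketch of an RMM-style range translation (simplified; class
# and method names are ours). A range entry maps an arbitrarily large
# contiguous virtual region to a contiguous physical region.
class RangeTLB:
    def __init__(self):
        self.entries = []  # fully associative: list of (base, limit, offset)

    def insert(self, base_vpn, limit_vpn, base_pfn):
        self.entries.append((base_vpn, limit_vpn, base_pfn - base_vpn))

    def lookup(self, vpn):
        # One entry covers every page in [base, limit], so a single
        # range-TLB entry can replace thousands of conventional entries.
        for base, limit, offset in self.entries:
            if base <= vpn <= limit:
                return vpn + offset
        return None
```

The constant per-entry offset is what makes the mapping "redundant" with the page table: any page inside the range can be translated without consulting its individual page-table entry.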
None of the aforementioned research proposals abandons paging; instead, the proposed mechanisms
only use paging when necessary. Because they do not needlessly use multiple TLB
entries to track translations for virtual memory regions that could be tracked by direct seg-
ment(s), they improve TLB utilization. All our proposed TLB designs can be used for virtual
memory regions not amenable to direct segments or similar optimizations and alongside any
structures that might support them. For FMN, some co-design might be needed if the non-
paging optimization involves a separate page table walk as in RMM [45]. For example, the
FMN could trigger the range table walk on a hit before the page walk completes.
2.3.4 Techniques that Reduce Address Translation Energy
Address translation not only impacts performance, but also involves a significant energy cost;
translation caching structures like the D-TLBs are accessed on every memory operation that
needs a translation, as is the case with today's widespread VIPT caches. Some TLB designs exacerbate this
energy cost when multiple structures (e.g., split per page size L1 TLBs) are accessed in parallel.
Chapter 4 presents our Prediction-Based Superpage-Friendly TLB designs that can support
multiple page sizes within the same associative structure, significantly reducing TLB lookup
energy. Related work on supporting multiple page sizes is reviewed there. This section briefly
reviews other research that also targets energy.
Karakostas et al. proposed Lite [43], a mechanism that disables TLB ways at runtime to
reduce dynamic address-translation energy. Their motivation is twofold: (i) parallel lookups
of different per page-size TLB structures waste dynamic energy, and (ii) page-table walks can
also consume significant energy. The former echoes an observation we had made earlier,
which motivated our superpage-friendly set-associative designs [55] discussed in Chapter 4.
The authors’ approach to energy reduction is different than ours. They do not replace the
multiple per page-size associative L1 TLBs, but rather downsize them dynamically by disabling
a power-of-two subset of TLB ways, an idea originally proposed for caches [1]. Their operating
principle is that if hits for a specific page-size dominate, then reducing the sizes of the TLBs that
support other page-sizes will save energy, with minimal, if any, performance overhead. Using
an interval-based scheme, they dynamically identify the number of ways they can disable while
keeping Misses Per Kilo Instructions (MPKI) changes within an acceptable threshold. Their
decision algorithm occasionally re-enables all TLB ways, based on some random probability,
to avoid pathological cases. They also couple Lite with their Redundant Memory Mappings
proposal [45]; by adding an FA L1 range-TLB (recall the range-TLB was L2 in RMM [45]),
they can more aggressively trim down ways from the conventional L1 TLBs. The architectural
and explicit OS level support required for RMM, reviewed in Section 2.3.3, applies here too.
This paper also includes a comprehensive listing of research papers that target TLB energy
reduction, many at the circuit level.
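The interval-based decision can be sketched as follows; the threshold, the halving policy, and the function interface are illustrative assumptions rather than Lite's exact algorithm:

```python
# Hedged sketch of a Lite-style interval decision (the threshold and
# the halving policy are illustrative, not the paper's exact scheme).
def ways_to_enable(enabled_ways, mpki_now, mpki_baseline,
                   threshold=0.05, max_ways=4):
    """Shrink the enabled ways while the MPKI degradation stays within
    the threshold; otherwise restore all ways."""
    if mpki_baseline == 0:
        degradation = 0.0
    else:
        degradation = (mpki_now - mpki_baseline) / mpki_baseline
    if degradation > threshold:
        return max_ways            # performance suffered: re-enable everything
    if enabled_ways > 1:
        return enabled_ways // 2   # try a smaller power-of-two configuration
    return enabled_ways
```

Run once per interval for each per-page-size TLB, this keeps the structures serving the dominant page size at full capacity while the others shed ways and their lookup energy.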
An alternative approach to reducing address translation energy is to reduce the frequency
of TLB accesses, and thus their energy, via Virtually-Indexed and Virtually-Tagged (VIVT)
instead of Virtually-Indexed and Physically-Tagged (VIPT) L1 caches. With a VIVT L1 cache,
the TLB needs to be accessed only on a cache miss. Given the high hit-rates of L1 caches, this
design can result in significant energy savings. Furthermore, removing the TLB access from
the critical path of an L1 cache access frees the TLB design from strict latency constraints,
potentially allowing TLBs to grow in size.
Unfortunately, other challenges prevent VIVT L1 caches from becoming prevalent. For
example, unless ASID information is included in the cache tag entry, the entire cache would need
to be flushed on a context switch to correctly deal with homonyms, identical virtual addresses
that belong to different address spaces and map to different physical addresses. Synonyms
(different virtual pages mapping to the same physical page) are also harder to support in VIVT
caches and can complicate cache coherence; cache coherence would require a reverse translation
lookup to identify a specific cache line (or cache lines in the presence of synonyms) via a physical
address. Basu et al. proposed Opportunistic Virtual Caching (OVC), a hybrid L1 cache where
each block is cached either virtually (VIVT) or physically (VIPT) [13]. They rely on the OS to specify which
addresses are amenable to virtual caching, which is enabled “when it is safe (i.e., no read-write
synonyms) and efficient (i.e., few permission changes)” [13]. Yoon et al. dynamically detect
and remap synonyms, thus revisiting virtual L1 cache design [85], whereas Park et al. also use
synonym detection and advocate for virtual caches throughout the cache hierarchy [56]; caches
below L1 are prevalently Physically-Indexed and Physically-Tagged (PIPT).
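A toy model makes the homonym problem concrete: unless the ASID is part of the tag, two address spaces that reuse the same virtual address collide in a VIVT cache. All names and values below are made up for illustration:

```python
# Toy illustration of the homonym problem in a VIVT cache. Without the
# ASID in the tag, process B would hit on process A's line for the same
# virtual address and read the wrong data.
cache = {}  # tag -> data

def vivt_fill(va, asid, data, tag_has_asid):
    tag = (va, asid) if tag_has_asid else va
    cache[tag] = data

def vivt_lookup(va, asid, tag_has_asid):
    tag = (va, asid) if tag_has_asid else va
    return cache.get(tag)
```

With ASID-less tags, the only safe alternative is flushing the cache on every context switch; the synonym problem is harder still, since it involves multiple valid virtual tags for one physical line.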
Kaxiras et al. approach the problem from a different perspective: instead of redesigning
virtual caches to solve the synonym problem, they advocate for a cache coherence protocol re-
design [46]. They observe that “Virtual-cache coherence (supporting synonyms) without reverse
translations is possible with a protocol that does not have any request traffic directed towards vir-
tual L1s; in other words, a protocol without invalidations, downgrades, or forwardings, towards
the L1s.” [46]. Their previously proposed VIPS-M [65] protocol meets these requirements and
enables VIVT L1 caches. In this new design, the TLBs could either be private structures probed
after the L1, or a shared banked structure placed alongside the Last-Level Cache (LLC). The
latter requires page-colouring to ensure a memory request accesses the TLB and the LLC in the
same tile (bank) and does not incur network traffic overhead. It also removes the need for TLB
consistency while reaping the benefits of a shared TLB design (reviewed in Section 2.3.1.2).
2.3.5 Techniques that Address TLB Coherence Overheads
Besides all the overheads associated with walking the page tables, translation coherence has its
own challenges. TLB coherence describes the correctness requirement for the TLBs’ data to
be in sync (i.e., coherent) with the page tables. The term TLB consistency was originally
used [23, 81] to describe this requirement, but recent literature uses the terms TLB coherence
and consistency interchangeably, despite the nuanced but important distinction of the two in
the cache domain. This section also follows this nomenclature. In multiprocessor systems, TLB
consistency requires that any page table modifications made by one core (e.g., remappings,
invalidations) need to be propagated to the other cores, as their TLBs might host that stale
translation.
Early work in 1989-1990 [23,81] highlighted the translation consistency problem, proposing
different hardware or software solutions. Almost two decades later, two research papers [64,83]
highlighted the overhead that software-based TLB consistency - usually implemented via TLB
shootdown software routines that use inter-processor interrupts - incurs in multiprocessor sys-
tems. Romanescu et al. demonstrated that today’s software TLB shootdown mechanisms scale
poorly, latency wise, as the number of cores increases. They proposed a hardware coherence
mechanism in a scheme that unifies instruction, data, and translation coherence [64]. Villavieja
et al. also explored the impact of TLB shootdowns [83]. They identified two main issues.
First, TLB shootdowns are performed via the very costly and intrusive mechanism of precise
interrupts. Second, at the time of a TLB shootdown the OS does not know the exact set of
TLB sharers, and thus unnecessarily interrupts some processors. Their proposed scheme, DiDi,
avoids both these overheads by (a) keeping track of all the translation sharers in a dictionary
directory, and (b) by using a hardware invalidation mechanism instead of an interrupt.
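The sharer-dictionary idea behind DiDi can be sketched as follows (a deliberately minimal model; DiDi's actual directory is a hardware structure, not a per-VPN map):

```python
# Simplified sketch of DiDi-style targeted invalidation (structure and
# function names are ours). A dictionary directory records which cores
# cached a translation, so a shootdown touches only the actual sharers
# instead of interrupting every core.
sharers = {}  # VPN -> set of core ids that cached the translation

def record_fill(vpn, core):
    """Called when a core inserts a translation into its TLB."""
    sharers.setdefault(vpn, set()).add(core)

def shootdown_targets(vpn):
    """Only the recorded sharers need an invalidation; an empty list
    means no core other than the initiator holds the translation."""
    return sorted(sharers.get(vpn, set()))
```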
In this thesis, special instructions communicate any translation modifications/invalidations
to the TLBs, as Section 2.2.3.3 illustrated. Section 3.5 in the next chapter provides specifics
about the frequency of such operations. The superpage-friendly TLB designs from Chapter 4
can use the TLB consistency mechanism of any system, whereas the FMN design proposed in
Chapter 5 is configured as a speculative structure and is thus not kept coherent with the page tables.
2.3.6 Techniques that Target I/O Address Translation
There is limited work on the impact of I/O address translation on system performance. Yehuda
et al. were the first to explore the performance impact of an IOMMU on real hardware [14].
They measured throughput and CPU utilization with and without the IOMMU, when running
the FFSB and netperf workloads for disk and network I/O. In a system without a hypervisor, no
difference was seen in throughput, while CPU utilization increased by up to 60% with the
IOMMU enabled. There were two main sources for this overhead: (a) mapping and unmapping
entries in the page tables in memory, and (b) the system’s inability to selectively invalidate
IOTLB entries.
Amit et al. were the first to propose hardware and software optimizations to reduce IOTLB
miss rates [6]. They used a virtual IOMMU in order to collect I/O traces and ran netperf
and bonnie++ write tests to analyze network and disk I/O. They proposed page offsetting
for each device’s virtual I/O address space to avoid IOTLB hot spots for coherent mappings.
They also proposed a modification to Intel’s Address Locality Hints mechanism, which provides
hints as to whether prefetching of higher or lower adjacent pages should occur. Finally, in their
Mapping Prefetch (MPRE) scheme the OS provides the IOMMU with hints to prefetch the first
group of consistent mappings. Unfortunately, limited details are provided about the various
optimizations (e.g., the timeliness of their MPRE scheme).
2.3.7 Architectural Optimizations that Take Advantage of Address Translation
Address translation and its facilities not only present optimization challenges but also
offer opportunities. A different class of research does not aim at improving address translation
per se, but rather leverages existing address translation hardware or software for other
architectural optimizations. For example, R-NUCA [36] augmented page table entries with sharer
information to optimize placement of private versus shared data in a non-uniform cache setting.
Additional bits in the page tables and the TLBs were also used by a snooping mechanism in
virtualized systems [47] to identify private pages and pages shared across multiple virtual machines,
so as to filter snoops from cores not mapped to a given virtual machine. Most recently, Bhattacharjee
proposed TEMPO, “translation-enabled memory prefetching optimizations” [18]. He observed
that for big-data workloads 20-40% of DRAM accesses are due to page walks, and these DRAM
accesses are almost always followed by a DRAM access to retrieve the data. TEMPO identifies
these DRAM accesses and prefetches the data first into the DRAM row buffer and then into
the LLC. Once the memory instruction that missed in the TLB is replayed, it is expected to
hit in the LLC, or worst case in the row-buffer, saving both time and energy. TEMPO requires
modifications both in the page table walker and the memory controller.
The research works mentioned above represent only a small sample of related work. Sec-
tion 3.8 illustrates that there is predictability in the first cache block of a virtual page accessed
on a TLB miss. This observation could trigger additional optimizations. The benefit of looking
at the virtual address space is that it is more representative of the application behaviour at
a coarser granularity. The temporal ordering of the various data structures and application
accesses crossing page boundaries could be lost if one looks through the physical address lens.
Similarly, the spatial correlation between physical addresses might be harder to dynamically
extract/learn compared to virtual address correlation.
2.4 Concluding Remarks
This chapter (i) presented background information for address translation, and (ii) classified
and reviewed related work that targets different aspects of the address translation process
such as TLB miss reduction or energy (wherever relevant, additional related work information
will be provided in the following chapters). This review is not all-encompassing, but it high-
lights how design decisions for address translation structures, such as TLBs, permeate system
design affecting performance and energy. It also reflects the different approaches to address
translation optimizations: from purely micro-architectural to mostly OS or system-level or
hardware-software co-design, to name a few. This thesis opts for an architectural approach.
First, Chapter 3 analyzes TLB-related behaviour to better understand those application or
system-level characteristics that can influence the address translation cost. Then, Chapter 4
addresses the paucity of research in associative designs that can support multiple page sizes,
while Chapter 5 proposes a cacheable TLB to reduce TLB miss latency without hefty hardware
resources.
Chapter 3
TLB-related Behaviour Analysis
3.1 Overview
This chapter presents an exploration of the TLB-related workload behaviour for a set of appli-
cations emphasizing commercial and cloud workloads. Address translation caching structures
such as TLBs can improve performance but are not a prerequisite for correct execution. In a
design space that is on one end marked by a system without a TLB and on the other end
by a system with an ideal TLB - a utopian structure that has zero access time, zero stor-
age requirements, zero energy, and zero misses - there are multitudes of possible and realistic
designs.
As in most architectural designs, it would be possible to optimize the hardware design of ad-
dress translation structures (e.g., TLBs) to perform well for an individual workload, but such an
approach could be mainly applicable in reconfigurable architectures (e.g., Field-Programmable
Gate Arrays (FPGAs)). The core objective in system design is to first optimize for the common
case, and if possible have dynamic mechanisms to reduce the negative impact or further optimize
for the less common operating scenarios. This chapter’s goal is a comprehensive summary of
characteristics and metrics that we believe are of interest for anyone doing research in the area
of address-translation optimizations. The complex interactions between these characteristics
are mapped out, to the extent possible, and suggestions on design trade-offs are also described.
Sections 3.1.1 and 3.1.2 present a roadmap for the rest of this chapter. These sections list
the characteristics/metrics that will be reported along with a short justification as to why these
measurements were collected. The precise definition of each metric is given in later sections,
just before the measurements are presented. Wherever relevant, both the overall behaviour of
the characteristics as well as how they vary in time is presented. Some of the metrics shown,
e.g., MPMI, can be considered as proxy metrics for performance. It is key to understand that
no single metric is sufficient in isolation to decide the most appropriate TLB design or relevant
address translation optimization for a single workload. There is overlap in the information that
each measurement yields, and they all together formulate a complex set of trade-offs. This
problem becomes even more complex when trying to identify designs that work well for most
Chapter 3. TLB-related Behaviour Analysis 29
workloads, especially when these workloads have drastically different characteristics.
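For concreteness, the MPMI proxy metric used throughout this chapter is simply a per-million-instruction normalization:

```python
# Misses Per Million Instructions (MPMI), the proxy metric for
# performance used throughout this chapter, written out explicitly.
def mpmi(tlb_misses, dynamic_instructions):
    """MPMI = misses / (instructions / 10^6)."""
    return tlb_misses / (dynamic_instructions / 1_000_000)
```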
3.1.1 Characteristics Inherent to the Workload
Table 3.1 lists characteristics that are inherent to the workload and that are not influenced by the
hardware configuration of the TLBs or the organization of other address translation structures,
such as the page tables or the MMU caches. However, these characteristics, which can vary
greatly across workloads, can profoundly influence the impact of different TLB designs and
address translation optimizations on metrics such as performance, TLB miss handling latency,
etc. The main characteristics listed for each workload are: the number of unique translations,
the number of ASIDs (contexts), and the lifetime of the various translation mappings. The
memory allocation algorithm used by the operating system as well as the supported page sizes
can influence these characteristics, but modifying any of these parameters is beyond the scope
of this thesis.
Characteristics Measured / Brief Justification

Unique Per Core and Per Page-Size Translations (Section 3.3):
Sets an upper bound to the ideal TLB size. Useful when deciding among different TLB
sizes and private versus shared TLBs. The per page-size breakdown highlights issues
of current TLB structures.

Contexts (ASIDs): Count and Sharing Degree, Lifetimes, Frequency and Reach (Section 3.4):
The ASID count can help in TLB indexing scheme selection; e.g., a high count can
result in many TLB conflict misses for non context-aware indexing schemes. The
sharing degree can affect private versus shared TLB decisions. For example, if all
contexts have high sharing degrees, private TLBs would suffer from translation
replication and thus have less effective capacity.

Translation Mappings Lifetime (Section 3.5):
The lifetimes of translation mappings influence all TLB management schemes. For
example, large TLB capacities would be useless if all mappings had an extremely
short, one-time-access lifetime. History-based management schemes better cater to
long translation lifetimes.

Unique Bytes or Byte Sets in Translations (Section 3.7):
Hints at the compressibility of translation information. Can drive compression
schemes that reduce the size of structures needed to capture the translation
footprint.

Table 3.1: List of characteristics/metrics inherent to the workload presented in this analysis, along with a brief explanation.
To summarize, the goal of Sections 3.3 to 3.5 is to reveal behaviour that is inherent to the
program and thus aid in understanding any specific measurements obtained for specific TLB
configurations in the sections that follow.
3.1.2 Other Characteristics
Table 3.2 presents characteristics and metrics that, contrary to the ones depicted in Table 3.1,
are influenced by the structure of the TLBs and other address translation structures. Some
of these metrics, e.g., MPMI, can be used as proxy metrics for performance, evaluating the
effectiveness of a given TLB design, while other metrics outline what we consider interesting
opportunities for architectural optimizations.
Characteristics Measured / Brief Justification

MPMI and Hit-Rate for different TLB organizations (Section 3.6):
Effectiveness of different TLB structures. Influenced by metrics like unique
translations, context, page size, and translation lifetimes.

Cache Block Accessed after a TLB miss (Section 3.8):
Opportunity for Cache/TLB co-design.

Table 3.2: Other Measurements
3.2 Methodology
All graphs in this chapter were generated via functional trace-based full-system simulation.
Functional simulation allows for a longer execution sample and is appropriate for the aforemen-
tioned measurements. Even if accesses are reordered in a full-timing setting, it is unlikely that
this will affect metrics such as TLB MPMI, given that TLB misses are not extremely frequent
events and thus would not fall within that short time window.
The traces were collected using Flexus, from the SimFlex project [35], a full-system simulator
based on Simics [52]. Simics models the SPARC ISA and boots Solaris. We relied on Simics
API calls (e.g., probing TLBs/registers) to extract translation information. The traces include
both user and OS (privileged) memory accesses. The collected memory references correspond to
one billion dynamic instructions per core in a 16-core CMP, 16 billion in total. In our sample,
the progression of memory references implies the progression of time. For the remainder of this
chapter, the terms execution and execution time will interchangeably refer to the aforementioned
execution sample.
The memory traces were collected after the running workloads had reached a stable state,
that is, after they had passed their initialization phase, in order to obtain a representative
execution sample. No drastic changes were observed in the results throughout the aforemen-
tioned execution sample. We thus expect that similar trends will persist for longer executions,
until a workload enters a drastically different execution phase. Different execution phases are
likely to exacerbate some of the already observed trends. Unless a workload’s data footprint or
data use change (e.g., by allocating new data structures or accessing the existing ones with a
drastically different access pattern that spans page boundaries), it is unlikely that smaller-scale
phase changes would affect TLB-related trends.
The aforementioned execution sample is also within the same order of magnitude as what
is used in other research works that use architectural simulators [19, 43]. Simulating longer is
practically difficult given the slow simulation speeds. Research works that can be evaluated
via OS modifications and with measurements from hardware performance counters [12] can
be run for significantly lengthier evaluation intervals (e.g., several minutes). Even though it
would be feasible to collect some of this chapter’s results in a real system too, e.g., by using the
BadgerTrap [32] tool that instruments and collects x86-64 TLB misses, these results would be
naturally limited to the TLB organization of one hardware system. For example, a functional
simulator would still be needed to explore how different TLB organizations influence MPMI as
Section 3.6 does.
3.2.1 Workloads
Table 3.3 summarizes the set of eleven commercial, scale-out and scientific workloads used in this
work. These are standard state-of-the-art workloads, sensitive to modern TLB configurations,
that many other works have used [13,16,19,44,57].
Workload Class/Suite / Workload Name / Description

Online Transaction Processing (OLTP) - TPC-C:
TPC-C1: 100 warehouses (10GB), 16 clients, 1.4GB SGA
TPC-C2: 100 warehouses (10GB), 64 clients, 450MB buffer pool

Web Server (SpecWEB-99):
Apache: 16K connections, FastCGI, worker-threading

PARSEC [22] (native input-sets):
canneal: simulated annealing
ferret: content similarity search server
x264: H.264 video encoding

Cloud Suite [30]:
cassandra: Data Serving
classification: Data Analytics (MapReduce)
cloud9: SAT Solver
nutch: Web Search
streaming: Media Streaming

Table 3.3: Workloads
We opted for a variety of workloads versus an exhaustive representation of a single workload
suite. For workload suites such as PARSEC, a workload subset was selected based on each
workload’s data footprint and how much it stresses the TLBs. For example, canneal, ferret
and x264 have some of the highest weighted D-TLB misses per million instructions, as
presented in an earlier TLB characterization work [20].
3.3 Unique Translations
What is the number of TLB entries required to deliver the maximum possible hit-rate for each
workload given a conventional private or shared FA TLB? This is the main question that this
section addresses. If the TLB capacity was not limited by constraints like access-time or area,
having a fully-associative TLB with as many entries as the number of unique translations each
workload requires would result in no TLB capacity misses. A unique translation, from the
perspective of each core, is identified via the tuple {Virtual Page Number (VPN), Context,
Page Size}. Each such translation requires a separate TLB-entry. The physical frame this
tuple maps to can change over time, but this is not classified as a separate translation because
it would not occupy a separate TLB entry.
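The definition above can be stated as a few lines of measurement code; the trace record format is an assumption for illustration:

```python
# Sketch of the unique-translation count defined above: a translation is
# identified by the tuple {VPN, context, page size}, so a remapping to a
# different physical frame does not create a new translation.
def count_unique_translations(trace):
    """trace: iterable of (vpn, context, page_size, pfn) accesses."""
    seen = set()
    for vpn, context, page_size, _pfn in trace:
        seen.add((vpn, context, page_size))  # PFN deliberately excluded
    return len(seen)
```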
3.3.1 Per-Core Measurements
Tables 3.4 and 3.5 present a per page-size classification of the unique translations accessed per
workload, as seen from the perspective of each CMP core. The simulated system supports page
sizes of 8KB, 64KB, 512KB and 4MB. All pages with sizes other than the smallest one are
referred to as superpages. The total memory footprint these unique translations correspond to,
assuming all the data blocks in each page are accessed, is also shown.
Variations in the access pattern of each core result in different ideal TLB capacity requirements
for each private TLB. Therefore, the per-core tables below present the minimum, the
maximum, as well as the average number of unique translations per core along with the mea-
sured Standard Deviation (SD) to provide a more well-rounded picture. The maximum number
of unique translations would be the answer to the question posed in the beginning of this section,
assuming private FA TLBs, all with the same number of entries.
For the small 8KB pages, Table 3.4 shows that the average number of unique translations
accessed per core is one to two orders of magnitude larger than state-of-the-art L1 TLB
capacities. As the SD column hints (the detailed measurements are omitted), while for some
workloads all cores have similar TLB-capacity requirements (canneal) or there is only one
outlier core (apache), for other workloads (e.g., cloud9, x264) the variations across cores are
more pronounced, showing, in some cases, different clusters of cores in terms of their ideal TLB
requirements. Designing private TLBs with different space requirements could be meaningful
for workloads like these; a shared TLB where workloads could contend on-demand for TLB
capacity could be another alternative.
While the simulated system supports four page sizes, only the 8KB and 4MB page sizes
Workload Class / Workload / Min. Unique 8KB Translations (Footprint in MB) / Max. Unique 8KB Translations (Footprint in MB) / Avg. Unique 8KB Translations (Footprint in MB) / SD (σ) (σ as % of Avg.)
Commercial
apache 11,571 (90) 55,393 (433) 51,380 (401) 10,306 (20.06%)
TPC-C2 15,310 (120) 19,757 (154) 17,903 (140) 1,060 (5.92%)
TPC-C1 6,177 (48) 11,627 (91) 7,071 (55) 1,230 (17.39%)
canneal 68,047 (532) 68,471 (535) 68,132 (532) 112 (0.16%)
PARSEC ferret 3,537 (28) 9,299 (73) 7,304 (57) 2,152 (29.46%)
x264 22 (0.2) 9,862 (77) 2,513 (20) 2,754 (109.59%)
cassandra 11,392 (89) 18,819 (147) 14,727 (115) 2,316 (15.73%)
classification 411 (3) 1,336 (10) 918 (7) 276 (30.07%)
Cloud-Suite cloud9 9,145 (71) 74,033 (578) 28,014 (219) 19,278 (68.82%)
nutch 7,088 (55) 8,030 (63) 7,648 (60) 233 (3.05%)
streaming 20,295 (159) 56,449 (441) 53,110 (415) 8,506 (16.02%)
Table 3.4: Per-core unique translation characterization: 8KB pages. Footprint in MB is listed in parentheses for the min., max. and avg. (arithmetic mean) columns. SD is also expressed as a percentage of the average in parentheses.
Workload Class / Workload / Min. Unique Translations (512KB | 4MB) / Max. Unique Translations (512KB | 4MB) / Avg. Unique Translations (512KB | 4MB) / SD (σ) (512KB | 4MB) / Superpage Footprint in MB (Max. | Avg.)
Commercial
apache 28 4 54 7 50 4 6 1 55 41
TPC-C2 69 13,207 73 14,121 71 13,706 1 279 56,521 54,860
TPC-C1 4 410 7 748 6 544 1 95 2,996 2,179
canneal - 5 - 5 - 5 - 0 20 20
PARSEC ferret - 5 - 8 - 7 - 1 32 28
x264 - 1 - 8 - 4 - 3 32 16
cassandra - 1,280 - 1,534 - 1,469 - 90 6,136 5,876
classification - 121 - 832 - 549 - 192 3,328 2,196
Cloud-Suite cloud9 - 4 - 6 - 6 - 1 24 24
nutch - 174 - 190 - 185 - 4 760 740
streaming - 8 - 8 - 8 - 0 32 32
Table 3.5: Per-core unique translation characterization: superpages (i.e., 64KB, 512KB and 4MB pages). No 64KB pages were present.
were prominently used. No use of 64KB pages was observed, and very few, if any, 512KB
pages were used. There are multiple possible explanations for this page-size distribution. For
example, workloads that allocate and access large data structures likely exhibit sufficient
memory-address contiguity to use the largest supported superpage size (4MB) without risking
internal fragmentation. In such cases, the larger page size is preferable because it extends
the reach of each TLB entry. It is also possible, depending on the memory allocator used, that
smaller superpages were used during the initialization phase of these workloads and were later
promoted to 4MB pages once the workload reached a more stable
state. The intermediate page sizes might be reserved for specific purposes, e.g., I/O buffers;
we had observed usage of 64KB pages for I/O TLB accesses during an earlier research project.
They might also exist to support legacy devices. Even though the OS and its memory allocator
are responsible for page-size decisions, user-supplied hints can influence them.
For example, in Solaris, one can use the ppgsz utility to specify the preferred page size for
the heap or the stack. Lastly, it is also possible that the TLB configuration of the underlying
system influences the page size decisions of the memory allocation algorithm. For instance, if
there is little hardware support for a given page size, the OS might avoid using it.
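The reach argument above can be made concrete with a back-of-the-envelope calculation; the 64-entry TLB below is a hypothetical example, not a configuration evaluated in this thesis:

```python
# TLB reach = number of entries x page size. The four page sizes match
# those supported by the simulated system; the 64-entry TLB is a
# hypothetical example for illustration only.
ENTRIES = 64
KB, MB = 1024, 1024 * 1024

for name, page_size in [("8KB", 8 * KB), ("64KB", 64 * KB),
                        ("512KB", 512 * KB), ("4MB", 4 * MB)]:
    reach = ENTRIES * page_size
    print(f"{name:>6} pages: reach = {reach / MB:g} MB")
```

With 4MB pages, the same 64 entries cover 512 times more memory than with 8KB pages, which is the reach benefit described above.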
As Table 3.5 shows, the use of superpages varied drastically across workloads. The OLTP
workloads TPC-C1 and TPC-C2 (commercial database systems) and three scale-out applica-
tions from the Cloud benchmark suite (cassandra, classification, and nutch) use 4MB pages.
These workloads can easily thrash an unbalanced TLB design with limited superpage capac-
ity, while the 8KB-only workloads waste energy looking up any separate superpage-only TLB
structures. Chapter 4 presents our proposed prediction-based superpage-friendly TLB designs
that address this challenge.
3.3.2 CMP-Wide Measurements
The discussion so far has focused on per-core measurements of unique translations, as these can
better inform design decisions for the private TLB designs that are prevalent today. Table 3.6
next reports CMP-wide measurements. These CMP-wide unique translation counts reflect
the capacity (number of entries) of a fully-associative TLB, shared across all 16 cores, that
would yield no TLB capacity misses. These values are also shown normalized to the average
per-core unique translation counts from Table 3.4 for 8KB pages and from Table 3.5 for
superpages respectively. The range of the normalized values is [1, 16] inclusive. Workloads with low
degrees of data sharing have normalized values close to 16 (e.g., cloud9 for 8KB pages); each
core in these cases accesses an almost distinct part of the data footprint at the page granularity.
Conversely, workloads with high degrees of data sharing have normalized values close to 1 (e.g.,
ferret, canneal for 8KB pages); here most cores access the same data footprint, and thus the
CMP-wide unique measurements closely match the average per-core values. Translations for
superpages, when a workload only accesses a small number of them, also fall under this second
category with a normalized value close to 1. However, when a significant number of superpages
is accessed, as for example in cassandra and classification, these superpages are mostly private.
Whether or not different processes run on different cores also contributes to this
private-versus-shared distinction, as Section 3.4 will discuss.
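The normalized values reported in Table 3.6 are simple ratios over the per-core averages; the sketch below reproduces the cloud9 8KB entry from Tables 3.4 and 3.6 (the function name is ours, for illustration):

```python
# Normalized CMP-wide count = CMP-wide unique translations / per-core average.
# Values lie in [1, 16] for a 16-core CMP: near 16 means little sharing
# (mostly private pages); near 1 means most cores touch the same pages.
def normalized_unique(cmp_wide: int, per_core_avg: float) -> float:
    return cmp_wide / per_core_avg

# cloud9, 8KB pages: 447,302 CMP-wide unique translations (Table 3.6),
# 28,014 per-core average (Table 3.4).
print(round(normalized_unique(447_302, 28_014), 2))  # ~15.97: mostly private
```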
3.4 Contexts
Context IDs are not traditionally part of TLB indexing schemes. As Section 2.2.2 mentioned,
only parts of the VPN are commonly used for TLB indexing. However, the presence of multiple
Workload Class  Workload        Unique 8KB        Unique 512KB     Unique 4MB
                                Translations      Translations     Translations
                                (Norm. over       (Norm. over      (Norm. over
                                Per-Core Avg.)    Per-Core Avg.)   Per-Core Avg.)
Commercial      apache          138,919 (2.7)     56 (1.12)        7 (1.75)
                TPC-C2          34,575 (1.93)     85 (1.2)         31,847 (2.32)
                TPC-C1          31,455 (4.45)     36 (6)           4,807 (8.84)
PARSEC          canneal         69,287 (1.02)     -                5 (1)
                ferret          14,294 (1.96)     -                9 (1.29)
                x264            16,619 (6.61)     -                9 (2.25)
Cloud-Suite     cassandra       110,418 (7.5)     -                23,397 (15.93)
                classification  10,911 (11.89)    -                8,137 (14.82)
                cloud9          447,302 (15.97)   -                6 (1)
                nutch           28,226 (3.69)     -                2,871 (15.52)
                streaming       768,854 (14.48)   -                8 (1)
Table 3.6: CMP-wide unique translation characterization: 8KB pages and Superpages.
contexts (also known as ASIDs) can apply more pressure on specific TLB sets as different
processes may use identical VPNs to refer to otherwise different physical frames. Even if these
different VPNs refer to the same physical frame (synonyms), a different translation entry is
needed, as the previous section mentioned. This TLB-set pressure is anticipated to be more
pronounced in shared TLB structures. Thus, context-aware TLB management schemes could
be a compelling alternative as they could prevent translation entries with the same VPN, but
with different contexts, from mapping to the same TLB set. Beyond this, analyzing contexts
provides a coarser-grain lens through which to interpret measured TLB behaviour (e.g., TLB
MPMI).
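One hypothetical way a context-aware scheme could keep identical VPNs from different contexts out of the same set is to fold context bits into the set index; the sketch below is illustrative only, not a design evaluated in this thesis:

```python
# Hypothetical context-aware set index: XOR low context-ID bits into the
# low VPN bits that would normally form the index (Section 2.2.2).
NUM_SETS = 64  # e.g., a 256-entry, 4-way set-associative TLB

def plain_index(vpn: int) -> int:
    return vpn % NUM_SETS

def context_aware_index(vpn: int, context_id: int) -> int:
    return (vpn ^ context_id) % NUM_SETS

# Two processes touching the same VPN collide under plain indexing,
# but land in different sets once the context is hashed in:
vpn = 0x1A2B3
print(plain_index(vpn), plain_index(vpn))                        # same set twice
print(context_aware_index(vpn, 5), context_aware_index(vpn, 9))  # different sets
```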
3.4.1 Context Count and Sharing Degree
CMP-Wide Measurements: Figure 3.1 depicts the number of unique contexts (y-axis) ob-
served in the entire CMP for a set of workloads (x-axis). All bars are further colour-coded to
indicate the number of cores that issue memory requests for this context. Private is a context
that appeared only in a single core during the workload’s execution, while Shared is a context
present in all 16 cores. All workloads have at least one shared context, context zero. Four
additional categories represent contexts shared by 2 to 15 cores.
Two of the commercial workloads, apache and TPC-C2, have a considerable number of
contexts with varying degrees of sharing. TPC-C2 has an order of magnitude more contexts
than the other TPC-C variant, TPC-C1, that is running on a different database. Almost all
[Figure 3.1: bar chart of the number of unique contexts observed in the CMP per workload: apache 172, TPC-C2 205, TPC-C1 37, canneal 4, ferret 4, x264 4, cassandra 23, classification 133, cloud9 19, nutch 22, streaming 35. Bars are colour-coded by sharing degree: Private (1 core), 2 cores, [3,4] cores, [5,8] cores, [9,15] cores, Shared (all 16 cores).]
Figure 3.1: Number of unique contexts observed in the CMP; the number is also listed on the top of each column. Each column is colour-coded based on the number of core-sharers.
simulated Cloud-Suite workloads are dominated by private contexts. Ferret, from PARSEC, is
the only workload without a private context in any of the CMP cores; it has two contexts shared
by all 16 cores and two contexts in the [9, 15] category, shared by 10 and 13 cores respectively.
PARSEC workloads have two shared contexts, context 0 and the application's context, which are
responsible for nearly all TLB accesses, as Section 3.4.3 shows.
Per-Core Measurements: Figure 3.2 further details the number of unique contexts for each
CMP core (i.e., core 0 to core 15), following the same colour scheme as Figure 3.1. Each
of the three subfigures plots data for a different workload class. Overall, there is little
variation across cores; the standard deviation of the per-core context count is below one for
all workloads except classification (2.8), TPC-C2 (2.4), and apache (20.6); the last is due
solely to core #1.
We believe that the significantly smaller context count of apache's core #1, the sole
contributor to the aforementioned high standard deviation, is due to the system's Interrupt
Request (IRQ) affinity set-up. In some operating systems, such as Solaris here, only specific
CPUs can service IRQs from specific devices. We speculate that for apache, CPU #1 on the
server side is responsible for servicing interrupts triggered by the network adapter
(Figure 2.3 earlier showed a system snapshot of the server's network I/O). Unfortunately, the kernel debugger
in the Solaris version used in this Simics checkpoint did not have the capability to display the
interrupt affinity table. But our assumption is supported by the following two observations:
(i) running mpstat showed CPU1 had an order of magnitude more interrupts than the other
15 CPUs, and (ii) running vmstat -i showed two sources of interrupts: clock and hmec0, the
controller for the "cheerio-hme Network Adapter". The network adapter's interrupt rate from
vmstat is similar to the interrupt count reported by mpstat for CPU1.
As Figure 3.2’s results indicate, the context behaviour is inherently tied both to the type
[Figure 3.2: three bar charts of the number of unique per-core contexts, one column per core (cores 0 to 15), colour-coded by sharing degree as in Figure 3.1.]
(a) Commercial workloads (apache, TPC-C2, TPC-C1)
(b) PARSEC workloads (canneal, ferret, x264)
(c) Cloud workloads (cassandra, classification, cloud9, nutch, streaming)
Figure 3.2: Number of unique per-core contexts for three workload classes. Each column corresponds to a different core in the range [0, 15] in ascending order.
of the running workload and its specific configuration. For example, multi-threaded workloads
such as PARSEC, or even more traditional high-performance computing workloads, are likely to
have multiple threads of a single process running across all CMP cores, unless they are running
in a multi-programmed environment. For commercial server-type workloads, multiple processes
are expected to handle different tasks; their numbers can drastically vary as the example of the
two TPC-C instances running on different database systems demonstrates. Cloud (scale-out)
workloads, although they can occasionally be configured to run in a single process, are more
naturally suited to deploying multiple processes for scalability, to leverage the available
cores. It is likely that such behaviour will be more prevalent in future workloads. In all cases,
because of the small overall standard deviation, the context behaviour observed at a given
core, or known a priori due to workload profiling, could drive decisions about context-aware
TLB-indexing schemes and/or shared TLB structures. These decisions could be dynamically
adapted based on a system’s target workload, to reflect the trends described above.
3.4.2 Context Lifetimes (Within Execution Sample)
For the workloads that have multiple contexts, it is important to examine the lifetime of these
contexts; that is, whether they are persistent throughout the workload execution or they are
created and destroyed over time. It is contexts that overlap in time that could stress the TLBs
due to increased set conflicts. The inherent limitations of architectural simulation preclude us
from simulating workloads in their entirety. Therefore, in this section, the term context lifetime
denotes the portion of the execution sample during which a context (process) issues memory
accesses on each core. A context’s “lifetime” starts once the first memory request from that
context is issued and ends when the last memory request from that context is issued. It is
possible that this context is not destroyed, but issues more requests later in time. However, we
believe that observing the contexts' "lifetime" within this sample can provide useful information
about overlapping contexts and their use over time. Section 3.4.3 will later discuss contexts’
significance, e.g., the percentage of TLB accesses a context is responsible for.
Figure 3.3 depicts the lifetime of each context as seen from the perspective of each core for
workloads with more than 30 per core contexts. The x-axis represents the passage of time as
measured in a trace-driven functional simulator; one cycle corresponds to one memory request.
Each plotted horizontal line starts when the first memory request for a given (context, core)
was issued and ends when the last memory request was issued for that (context, core). For
example, if a context is accessed by all 16 cores, it will appear as 16 distinct lines on the graph.
Each figure label also lists the average lifetime duration as a percentage of the workload’s
execution time. If a core had two running contexts, with the first triggering requests for the
first quarter of the execution sample and the second triggering requests throughout the sample,
a context lifetime graph would include two horizontal lines both with the same starting time.
The first line would span the first 25% of the x-axis, while the second line would span the entire
x-axis (execution sample). The average context lifetime in this example, which is used solely
for illustration purposes, would be 62.5% of the execution time. In Figures 3.3a and 3.3b the
lifetimes are sorted by duration, with the longer lifetimes at the top, to highlight the
presence of contexts with short lifetimes, while in the remaining figures the lifetimes are
sorted by their initial start time.
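The bookkeeping behind the illustrative two-context example can be expressed directly; the interval endpoints below are hypothetical request counts, not measured data:

```python
# A context's "lifetime" on a core spans from its first to its last memory
# request; the average is taken over all (context, core) pairs and reported
# as a percentage of the execution sample.
def avg_lifetime_pct(intervals, sample_len):
    spans = [(end - start) / sample_len for start, end in intervals]
    return 100 * sum(spans) / len(spans)

# The two-context example from the text: one context active for the first
# quarter of the sample, the other active throughout.
sample = 1_000_000
print(avg_lifetime_pct([(0, 250_000), (0, 1_000_000)], sample))  # 62.5
```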
[Figure 3.3: per-(context, core) lifetime lines over functional-simulation time (one simulation cycle per memory request).]
(a) apache (61%); lifetimes sorted by duration, not start time
(b) TPC-C2 (84%); lifetimes sorted by duration, not start time
(c) TPC-C1 (65%)
(d) classification (36%)
(e) streaming (64%)
Figure 3.3: Context lifetimes. The average context/core lifetime is listed in parentheses as a percentage of the workload's execution time sample.
In all cases, a large number of contexts is actively issuing memory requests throughout
the entire workload execution sample. This observation is more pronounced in TPC-C2 where
on average each context/core is active for 84% of the workload’s execution sample. On the
other hand, classification has many contexts that are active for a short period of time with an
average lifetime of 36% of the workload’s execution sample. As the classification plot shows
(Figure 3.3d), many contexts often start when the lifetimes of other contexts end, most probably
reflecting the various map-reduce operations performed by this data-analytics workload. These
results indicate that contexts’ lifetimes usually overlap in time, which might encourage context-
aware TLB designs in the future. They also show that some contexts might be used only for a
short period within an execution sample; details such as the lifetime duration or their number
vary greatly across workloads. Even though longer execution times would be ideal, these
results, which are naturally based on the specific execution samples, provide a snapshot of
common context behaviour. Dynamic hardware policies usually adapt their behaviour based on
much smaller execution intervals.
3.4.3 Context Significance: Frequency and Reach
The previous sections explored context characteristics like count, sharing degree, and context
lifetimes. But are all contexts of equal importance when it comes to TLB behaviour? If all
contexts are not to be treated homogeneously, for example by filtering out translations for
some contexts to reduce TLB pollution, how can we quantify context significance? We use
the following two metrics: the number of TLB accesses initiated by each context (frequency),
and the number of unique translations accessed (reach). This section's measurements highlight
that, aside from one or two prominent contexts (e.g., context 0), most contexts have small
individual contributions, percentage-wise, which nonetheless add up cumulatively.
As Table 3.7 shows, context 0 is responsible for a significant number of TLB accesses as well
as unique translation entries accessed in many of the simulated workloads. For example, it is
responsible for almost half the total TLB CMP accesses and unique translations in apache. It
is the most prominent of all contexts in all workloads, with the exception of canneal, ferret, and
cloud9. The number of unique translations is measured from the perspective of each CMP core
(Section 3.3.1). The measurements presented in Table 3.7 are the sum of these per core unique
translation measurements according to Equation 3.1. Equation 3.2 shows the computations for
each context’s frequency of accesses.
\[
\%\,\text{Unique 8KB Translations} \;=\; \frac{\displaystyle\sum_{core=0}^{15} \text{Unique Translations}_{core,\;ctxt=0,\;page\ size=8KB}}{\displaystyle\sum_{page\ size}\;\sum_{ctxt}\;\sum_{core=0}^{15} \text{Unique Translations}_{core,\;ctxt,\;page\ size}} \tag{3.1}
\]

\[
\%\,\text{CMP TLB Accesses} \;=\; \frac{\displaystyle\sum_{core=0}^{15} \text{TLB Accesses}_{core,\;ctxt=0}}{\displaystyle\sum_{ctxt}\;\sum_{core=0}^{15} \text{TLB Accesses}_{core,\;ctxt}} \tag{3.2}
\]
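Equations 3.1 and 3.2 reduce to simple ratios over per-core counters; a sketch with hypothetical dictionaries keyed by (core, context, page size) and (core, context) respectively:

```python
# unique[(core, ctxt, page_size)] -> unique translation count per core/context
# accesses[(core, ctxt)]          -> TLB access count per core/context
def pct_unique_8kb(unique, ctxt=0):
    num = sum(v for (core, c, ps), v in unique.items() if c == ctxt and ps == "8KB")
    den = sum(unique.values())  # all cores, contexts, and page sizes (Eq. 3.1)
    return 100 * num / den

def pct_tlb_accesses(accesses, ctxt=0):
    num = sum(v for (core, c), v in accesses.items() if c == ctxt)
    den = sum(accesses.values())  # all cores and contexts (Eq. 3.2)
    return 100 * num / den

# Toy example with made-up counts for two contexts on one or two cores:
unique = {(0, 0, "8KB"): 30, (1, 0, "8KB"): 20, (0, 7, "8KB"): 40, (0, 0, "4MB"): 10}
print(pct_unique_8kb(unique))  # context 0 holds 50 of 100 entries -> 50.0
```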
Table 3.8 shows the significance of some non-zero contexts for the PARSEC and Cloud-Suite
workloads; for the Cloud-Suite workloads, contexts that contribute less than 0.1% in terms of
accesses or unique translations are classified as negligible. For canneal and ferret, a shared
non-zero context is responsible for the vast majority of TLB accesses and unique translations.
This behaviour is not observed in the Cloud-Suite workloads, where the average contribution to
TLB accesses and translations is below 7%. In cloud9, two contexts (476 and 580) out of the 16
mentioned in this table are responsible for ∼16% of unique 8KB translations even though they
Workload Class  Workload        % CMP TLB Accesses  % Unique 8KB Translations
Commercial      apache          48.6                46.9
                TPC-C2          20.2                19.3
                TPC-C1          30.1                77.4
PARSEC          canneal         13.6                0.3
                ferret          23.6                10.1
                x264            75.4                72.8
Cloud-Suite     cassandra       22.4                73.5
                classification  15.7                20.9
                cloud9          4.3                 0.4
                nutch           20.5                88.3
                streaming       44.1                12.7
Table 3.7: Context 0: % TLB accesses and cumulative per-core unique translation entries across the entire CMP. See Equations 3.1 and 3.2.
correspond to 0.8 and 16.2% of TLB accesses respectively. This observation indicates that a
large unique translation reach does not always go hand in hand with a large TLB access count.
Workload Class  Workload  Context  % CMP TLB Accesses  % Unique 8KB Translations
PARSEC          canneal   1400     86.4                99.7
                ferret    1400     76.4                89.7
                x264      1401     24.6                27

Workload Class  Workload        Context Count            % Range of CMP TLB        % Range of Unique 8KB
                                (negligible contexts     Accesses; Avg. %          Translations; Avg. %
                                on second row)
Cloud-Suite     cassandra       16 contexts              4-6 (each); avg. 4.9      0.5-1.5 (each); avg. 1.1
                                6 contexts negligible
                classification  74 contexts              0.1-4.9 (each); avg. 1.2  0.1-2.3 (each); avg. 0.5
                                58 contexts negligible
                cloud9          16 contexts              0.5-7 (each); avg. 6      2-16.5 (each); avg. 6.2
                                2 contexts negligible
                nutch           16 contexts              4.5-5.5 (each); avg. 5    0.4-0.7 (each); avg. 0.6
                                5 contexts negligible
                streaming       16 contexts              0.3-4 (each); avg. 3.5    1.9-5.8 (each); avg. 5.5
                                18 contexts negligible
Table 3.8: Non-zero contexts: % TLB accesses and cumulative per-core unique translation entries across the entire CMP for PARSEC and Cloud workloads.
3.4.4 Concluding Remarks
To summarize, Section 3.4 demonstrated the variations that exist across different sets of
workloads in terms of unique context counts, in how these contexts are shared across multiple
cores, in the active lifetime of each context, and in these contexts' contributions to TLB
accesses and their data reach.
The context count and sharing degree varied greatly across workload classes: the simulated
PARSEC and Cloud-Suite workloads mostly had only a few contexts, the former with a high
sharing degree and the latter with mostly Private contexts, whereas the commercial workloads,
with the exception of TPC-C1, had two orders of magnitude more contexts, most with a high
sharing degree. The vast majority of contexts had lengthy lifetimes that covered more than half
of the execution sample, with the exception of classification, which had many short-lived
contexts. Significant variation was also observed in the frequency and data reach of each
context. Context 0 was the most prominent context for most workloads, with only a few
exceptions, mostly involving other Shared contexts, e.g., in canneal. However, this section's
measurements indicate that these other contexts should not be ignored or filtered; their small
individual contributions are significant when accumulated.
3.5 Translation Mappings Lifetime
This section explores the frequency with which translation mappings are modified, either via
demap or translation modification requests, as Section 2.2.3.3 discussed. MMU TLBs are
designed with the fundamental expectation that translation mappings do not change often.
Otherwise, if mappings were single-use only (an extreme example), TLBs would experience no
reuse and would thus fail to hide part of the lengthy page-walk latency. The expectation that
translation mappings are persistent for appropriately lengthy intervals is also essential for any
address translation optimization techniques that rely on remembering past mappings. The
FMN TLB proposed in Chapter 5 exploits this observation.
Table 3.9 presents the absolute number of demappings and remappings that took place
during workload execution. Demappings are further classified into demap-page or demap-
context as explained in Section 2.2.3.3. The results show that translation mappings persist
over time since translation modifications, of any kind, are quite rare. Demap-context operations
appear only for a single workload (TPC-C2), while remap operations are more pronounced in the
three commercial workloads modeled, as are the demap-page operations. While the simulated
traces contain 2.6 to 4.8 billion memory accesses, translation-mapping modifications are 3 to 6
orders of magnitude fewer.
The remainder of this section provides additional analysis and observations for the three
types of translation mapping modifications: the demap-context, the demap-page, and the TLB-
entry modification scenarios. For the workloads where translation invalidations/modifications
were observed, this analysis examines how frequent these operations are, how many cores and
Workload        Number of Demappings        Number of
                Demap-Context  Demap-Page   Remappings
apache          0              567,059      127,385
TPC-C2          59             611,685      585,201
TPC-C1          0              896,844      176,713
canneal         0              90           9,296
ferret          0              60,122       5,243
x264            0              11,315       5,694
cassandra       0              513          6,747
classification  0              168          1,025
cloud9          0              1,212        697
nutch           0              66           1,537
streaming       0              11,914       15,379
Table 3.9: Translation Demap and Remap Operations (cumulative in the entire CMP).
instructions were involved, and how many unique pages or translation-entries were affected.
This analysis can be helpful for predicting these operations or for building a caching
structure such as the one in Chapter 5.
3.5.1 Demap-Context Analysis
A demap-context operation invalidates all translations involving the context in question, a
potentially faster way to tear down mappings involving large multi-page memory buffers than
individually invalidating each virtual page of the buffer with separate demap-page requests.
The only workload that experienced demap-context operations was TPC-C2. Four of its contexts,
which experienced primarily 8KB-page accesses, were involved in multiple demap-context
operations; two contexts were shared across all 16 cores, whereas the other two were shared by
13 and 14 cores respectively, hence the 59 demap-context operations reported in Table 3.9.
As anticipated, and as our measurements confirm, once a demap-context operation takes
place for one core, all cores that had at some point accessed translations from this context should
also execute a demap-context operation. Two PCs were involved in these demap operations:
one PC (0x10156830) initiated the first demap operation for each context, while another PC
(0x101568d8) triggered all the subsequent demap operations (i.e., the demaps that took place in
the other cores). Once all these demap operations for a given context have completed across all
sharer cores, the corresponding process may issue new memory accesses (e.g., accessing a new
memory buffer). All four demapped contexts were involved in multiple subsequent accesses,
shortly after all the relevant demap-context operations for that context had completed.
The rarity of demap-context operations is welcome, as invalidating all translation entries
associated with a given context can be a costly operation. The impact of such operations is
not only gauged by their frequency, but also by each context’s data reach. The more unique
translations a demapped context has accessed, the greater the impact of the demapping. In
TPC-C2, each of these four contexts has accessed on average 170 to 205 unique translations per
core, the largest data reach after context 0, which dominates TPC-C2 accesses. The time elapsed
between the last access to a context and its demapping determines whether these translations
will persist in the TLBs and other paging structures. In the worst case, this interval was
∼890K functional-simulation cycles. Notably, all four involved contexts were short-lived, with
a maximum lifetime of 6.9% of execution time, significantly below the average TPC-C2 context
lifetime of 84% of execution time.
3.5.2 Demap-Page Analysis
A demap-page operation, discussed in Section 2.2.3.3, invalidates a single (context, virtual page)
and will thus affect at most one translation entry at a time. As Table 3.9 showed, all workloads
experienced numerous demap-page operations. Table 3.10 presents the maximum number of
unique contexts, PCs (demap-page instructions), VPNs, and (VPN, context) tuples involved
in demap-page requests per core across all our workloads. The values for the entire CMP are
shown in parentheses when different.
Overall, the number of demap-page operations varied across cores; in four workloads (canneal,
cassandra, cloud9, and x264) demap-page operations took place in half or fewer of the CMP
cores, whereas in the remaining workloads such operations took place in all 16 cores. Multiple
contexts were involved in demap-page operations, with context 0 usually being one of them. In
all workloads only a handful of PCs triggered these demap-page requests, an anticipated
behaviour given that these operations are handled by system code. It is common for the same
(VPN, context) to receive a demap-page request multiple times during the workload's execution.
Workload Class  Workload        Max. Context  Max. PC  Max. VPN          Max. (VPN, context)
                                Count         Count    Count             Count
Commercial      apache          90 (132)      7        17,286 (18,926)   17,752 (27,297)
                TPC-C2          26 (58)       8        559 (765)         1,336 (3,115)
                TPC-C1          3 (18)        6        14,868 (14,930)   14,868 (14,974)
PARSEC          canneal         2             3        43 (50)           43 (50)
                ferret          2             6        358 (1,796)       358 (1,796)
                x264            2             3        2,417 (5,673)     2,417 (5,673)
Cloud-Suite     cassandra       2             2        2                 2
                classification  3 (31)        4 (5)    12 (15)           14 (73)
                cloud9          2 (5)         2        355 (611)         355 (611)
                nutch           1 (16)        4 (5)    2                 2 (32)
                streaming       2 (17)        5        698 (1,858)       698 (9,643)
Table 3.10: Unique characteristics of demap-page requests (per core). Values in parentheses are for the entire CMP wherever different.
The same VPNs are demapped multiple times during workload execution with the PPNs
these pages map to either changing or remaining the same over time. In cassandra, a single
8KB virtual page was repeatedly demapped (512 times total), and the physical page number it
mapped to was incremented by one on each remapping. We have also observed that in some cases
the same VPNs are demapped from different contexts over time. For example, in TPC-C2 one
context sees demaps to a group of VPNs that are later also demapped from another context, and
so on. These VPNs mapped to the same PPNs for this small group of contexts. The
vast majority of demap-page requests were to 8KB pages only. The only exceptions were the
commercial workloads where most demap-page requests affected 512KB pages. TPC-C1 was
the only workload with demap-page requests to 4MB pages.
3.5.3 TLB-Entry Modification Analysis
This section presents observations about TLB-entry modification operations (discussed in
Section 2.2.3.3) to help better understand the characteristics of such operations and the
translations they involve. All translations installed in the TLB via such operations had both
their privileged and locked bits set. All 16 CMP cores executed modification operations, with
the exception of cloud9 and x264, where a few cores did not. For the commercial workloads, the
translations installed by these modifications all belonged to 512KB pages. For the PARSEC
workloads they belonged to 8KB pages, with the exception of canneal, where the vast majority
belonged to 4MB pages. Lastly, for the Cloud-Suite workloads, 4MB pages prevailed for
cassandra and streaming, 8KB pages for cloud9, while nutch and classification saw equal
participation of 8KB and 4MB pages. All translation entries installed via a TLB-entry
modification operation involved context 0.
Only a few PCs trigger modification operations: two for the PARSEC workloads, two for the
commercial workloads, and three for the Cloud-Suite workloads. Having shared PCs within each
of the three workload classes is anticipated. The OS code that initiates these modification
operations can vary across kernel versions, and these three workload classes were set up on
different systems (e.g., the commercial workloads were running on Solaris 8, while the
Cloud-Suite workloads ran on Solaris 10). For the commercial workloads, both PCs of these
modification operations immediately succeed PCs that trigger demap-page operations.
Lastly, we have observed many cases where the entry being modified does not share a VPN
with the entry being allocated. Only in classification and cloud9 were there a few cases where
these entries shared the same VPN; in both workloads this was a single VPN whose mappings
alternated between two different PPNs (per workload) across all cores.
It is an open question whether the translation remapping trends will remain the same in
emerging systems with heterogeneous memory architectures, or if more remappings could help
take better advantage of the different memory devices, and their characteristics, in such systems.
3.5.4 Concluding Remarks
To conclude, the analysis presented in Section 3.5 has shown that MMU operations such as
demap-page or TLB-entry modifications are rare and are triggered by a few specific instructions
(PCs) that belong to system rather than application code. For each of the three workload
classes used in this work, the number of these instructions was always less than a dozen. This
observation suggests it is possible to predict these special PCs, and thus anticipate such
operations; a simple table-based predictor would suffice, but such an approach is beyond the
scope of this work. The rarity of the aforementioned operations encourages design choices that
rely on remembering past translation mappings, thus optimizing for the common operating
scenario without harming correctness.
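Such a table-based predictor could be as simple as a small table of previously seen trigger PCs; the sketch below is hypothetical and was not implemented or evaluated in this work:

```python
# Hypothetical predictor: remember the PCs that triggered past demap or
# TLB-entry modification operations; fetching a remembered PC predicts
# that such an operation is about to occur.
class DemapPCPredictor:
    def __init__(self, capacity=16):    # under a dozen PCs sufficed per class
        self.capacity = capacity
        self.pcs = []                   # small FIFO table of trigger PCs

    def train(self, pc):
        if pc not in self.pcs:
            if len(self.pcs) == self.capacity:
                self.pcs.pop(0)         # evict the oldest PC
            self.pcs.append(pc)

    def predict(self, pc):
        return pc in self.pcs

pred = DemapPCPredictor()
pred.train(0x10156830)                  # demap-context PC observed in TPC-C2
print(pred.predict(0x10156830), pred.predict(0x12345678))  # True False
```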
3.6 TLB Capacity Sensitivity Study
The previous sections presented an analysis of workload characteristics that, although indepen-
dent of the organization of TLBs and other address translation caching structures, can influence
the address translation cost as expressed by metrics like TLB hit-rate, TLB miss-handling la-
tency, etc. As this chapter's introduction discussed, it would be possible to optimize a specific
architectural design to perform well - for one or more of these metrics - for an individual
workload; such an approach would be applicable to reconfigurable architectures but is not
appropriate for a general-purpose system. The goal here is to design for the common case and,
where possible, provide dynamic mechanisms to alleviate the negative impact on the less common cases.
One could naturally anticipate that different TLB organizations would better cater to dif-
ferent workloads. For example, the workloads measured to have a large superpage footprint
would benefit from TLB organizations that are not biased in their space allocation against such
translations. To that end, this section examines the effectiveness of different TLB organizations
as the TLB capacity scales. Effectiveness is measured via TLB MPMI (Misses Per Million
Instructions1) and TLB hit-rate. First, Sections 3.6.1 to 3.6.3 focus on L1-TLBs by modeling
the following three configurations, all representative of commercial L1-TLB designs: (a) a
configuration with split L1-TLB structures, one per supported page-size, (b) a fully-associative
TLB, and (c) a two-table TLB configuration that uses a set-associative TLB for the smallest
and most prominent page-size and a fully-associative TLB for all other page sizes. Section 3.6.4
then evaluates the impact of an L2-TLB using the same MPMI and hit-rate metrics.
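For concreteness, the MPMI metric used throughout this section relates to the more familiar MPKI used for caches as follows; a minimal sketch with illustrative function names.

```python
# Minimal sketch relating MPMI, the metric used in this section, to the
# MPKI metric conventionally used for caches.
def mpmi(misses, instructions):
    """TLB Misses Per Million Instructions."""
    return misses * 1_000_000 / instructions

def mpki(misses, instructions):
    """Misses Per Kilo (thousand) Instructions, the usual cache metric."""
    return misses * 1_000 / instructions

# A miss count that yields MPMI 2324 corresponds to only ~2.3 MPKI, which
# is why the coarser-grained TLB is reported with the per-million metric:
assert mpmi(2324, 1_000_000) == 2324.0
assert abs(mpki(2324, 1_000_000) - 2.324) < 1e-9
```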
All results were collected from functional simulation in a 16-core system. All TLB designs
listed above are private (i.e., per core) and use an LRU replacement policy. The locked bit values
of all translation entries were ignored to allow us to simulate any associativity. As Section 2.2.3
discussed earlier, translations for locked pages must not be evicted from the TLB. If all entries
in a set are locked, it becomes impossible to cache additional entries in that set, thus
1 MPKI is usually used for caches, but since TLBs track memory accesses at a much coarser granularity, and thus have fewer misses, MPMI is used instead.
hurting TLB hit-rate. On a given configuration, the OS can control how many locked entries
are allowed per set, potentially denying lock requests where appropriate. Since we use traces,
these decisions are embedded in the trace and cannot be altered during simulation.
3.6.1 Split L1-TLBs; One per Page-Size
Figure 3.4 depicts the variation of L1 TLB MPMI as the TLB capacity increases. The x-axis
depicts the number of entries for a Set-Associative (SA) TLB that supports the smallest page-
size (8KB). The capacity of each of the remaining three split TLBs (e.g., the TLBs for 4MB
pages, etc.) is always half of that, to closely reflect Haswell-based TLB configurations (Table 2.1,
[38]). All TLBs have an associativity of four for the same reason. Each MPMI bar is further
broken down into the MPMI contributions of each supported page-size.
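The capacity-scaling rule just described can be made concrete with a small helper; this sketch (names illustrative) computes per-structure entry and set counts under the stated Haswell-like rule.

```python
# Hedged sketch of the split-TLB sizing rule modeled here: one 4-way SA
# structure per page size, with each superpage TLB holding half the
# entries of the 8KB TLB.
PAGE_SIZES = [8 << 10, 64 << 10, 512 << 10, 4 << 20]   # 8KB .. 4MB

def split_tlb_config(n_8kb_entries, assoc=4):
    """Per-structure entry and set counts for the split-TLB organization."""
    config = {}
    for size in PAGE_SIZES:
        entries = n_8kb_entries if size == PAGE_SIZES[0] else n_8kb_entries // 2
        config[size] = {"entries": entries, "sets": entries // assoc}
    return config

cfg = split_tlb_config(64)             # today's norm for the 8KB structure
assert cfg[8 << 10]["entries"] == 64
assert cfg[4 << 20]["entries"] == 32   # half the 8KB TLB's capacity
assert cfg[8 << 10]["sets"] == 16      # 64 entries / 4 ways
```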
Figure 3.4: L1 TLB MPMI and Hit-Rate over different TLB sizes. The x-axis lists the number of TLB entries for the split TLB with 8KB page translations; the capacity of each other split TLB structure is half that in size. Canneal saturates with this y-axis scale; see detail in Figure 3.5.
MPMI is expected to decrease, hence improve, as capacity increases. The measurements in
Figure 3.4 follow this trend but diminishing returns are observed as capacity increases beyond
512 entries. Today’s norm in split L1-TLB designs is a 64-entry TLB for the smallest supported
page size (the second column in this graph), indicating significant potential for improvement if
increasing the L1-TLB capacity were not an issue. However, doing so would be detrimental to
performance given the strict access-time constraints of L1 TLBs that need to be accessed in
parallel with virtually-indexed and physically-tagged L1 caches.
Figure 3.5: Canneal MPMI detail with larger y-axis scale.
For the commercial workloads, the Haswell-like baseline has an
average MPMI of 2324, and only by quadrupling the available TLB
capacity does it drop to roughly half that value, i.e., 1111. Canneal
has the most problematic behaviour of all the simulated workloads.
As the detail in Figure 3.5 shows, even a 2K-entry SA TLB for 8KB
pages yields an MPMI of 9032, just a 59.3% MPMI decrease over the
smallest 32-entry TLB.
Beyond the slope of MPMI reduction, the MPMI contribution of
each page size is also important as anticipated by the unique transla-
tion observations from Section 3.3. Cassandra and classification from
Cloud-suite are the workloads that most benefit from the presence of
larger superpage TLBs. For almost all other workloads, with the ex-
ceptions of TPC-C1 and to a lesser extent TPC-C2, a larger TLB for
4MB pages has little to no benefit. TPC-C2, the workload with the largest 4MB footprint, con-
tinues to see an MPMI of 100 even with the largest 1K-entry split L1-TLB for 4MB pages, while
other workloads like apache or the PARSEC workloads see almost zero 4MB MPMI. Our work
on Prediction-Based Superpage-Friendly TLB Designs [55], presented in Chapter 4, addresses
this inconsistency and the resulting wasted energy due to the many unnecessary split-TLB
lookups.
Figure 3.4 also includes the TLB hit-rate as a separate line (secondary y-axis). The TLB
hit-rate for private TLBs is the ratio of all CMP L1-TLB hits - across all four split structures
and cores - over the total number of CMP L1-TLB accesses. The split L1-TLBs are probed in
parallel but these parallel probes are counted as one TLB access. The measured TLB hit-rates
are over 97% for all workloads except for canneal and they increase alongside the TLB capacity.
Such high L1 TLB hit-rates, even higher than L1-D cache hit-rates which are also commonly
above 90%, were to be expected due to the coarse-grain tracking granularity of TLB-entries.
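The CMP-wide hit-rate defined above can be sketched as follows; an illustrative helper, not simulator code.

```python
# Illustrative helper for the CMP-wide L1-TLB hit-rate: all hits across the
# four split structures and all cores, divided by the total accesses, where
# the four parallel probes of the split TLBs count as one access.
def cmp_l1_hit_rate(hits_per_core, accesses_per_core):
    """hits_per_core: per core, a list of hit counts (one per split TLB)."""
    total_hits = sum(sum(core_hits) for core_hits in hits_per_core)
    return total_hits / sum(accesses_per_core)

# Two hypothetical cores, four split structures each:
hits = [[900, 50, 10, 5], [880, 60, 20, 10]]
accesses = [1000, 1000]
assert abs(cmp_l1_hit_rate(hits, accesses) - 0.9675) < 1e-12
```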
All remaining graphs in Section 3.6 will plot both MPMI and hit-rate. These two metrics
complement each other. Hit-rate is the most straightforward metric to grasp and characterizes
TLB efficiency. Especially for L2-TLBs (evaluated in Section 3.6.4), the hit-rate is a
first indication that these structures do not reach their full potential, a conclusion not as eas-
ily reached via the MPMI measurements. But looking at the hit-rate alone does not suffice.
Hit-rate provides no notion of time, whereas the MPMI metric implicitly incorporates time by
showing the frequency of misses within a fixed instruction sample. Therefore, even though the
high L1-TLB hit-rates might suggest that L1-TLBs have no further room for optimization, the
MPMI measurements indicate that TLB misses are still frequent and can thus impact both
performance and energy.
3.6.2 Fully-Associative L1 TLB
As mentioned earlier in this thesis, an alternative to split-TLBs is a Fully-Associative (FA)
TLB design, similar to the AMD-12h family, which can naturally support any page-size without
requiring multiple lookups. Figure 3.6 depicts how MPMI and hit-rate vary as the per core
TLB capacity increases. All modeled TLB configurations have a power-of-two entry count
for consistency with the previous section, even though this is no longer a requirement; the
AMD-12h L1-TLB has 48 entries.
Figure 3.6: L1 TLB MPMI and Hit-Rate over different FA TLB sizes. All TLBs model full-LRU as the replacement policy. Figure 3.7 shows canneal in detail as it saturated with this y-axis scale.
Figure 3.7: Canneal MPMI detail with larger y-axis scale.
A steeper slope (reduction) of MPMI is observed as FA capac-
ity increases when compared to the split-TLB configurations of Fig-
ure 3.4. When comparing with the results of that figure, configura-
tions with the same x-axis label correspond to different TLB capac-
ities. For example, a 32-entry label corresponds to a 32-entry FA
TLB and to a total split-TLB capacity of 80-entries.2 Nevertheless,
while smaller, the corresponding FA TLB is on average 14.7% better
in terms of MPMI across all workloads and configurations. Having a
fully-associative structure allows translation entries of different page-
sizes to coexist and can also reduce conflicts due to translations that
share the same VPN but belong to different processes (contexts). The
full-LRU replacement policy is also beneficial.
2 80 entries = 32 entries for 8KB pages + 16 entries for each of 64KB, 512KB and 4MB pages.
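A fully-associative TLB with the true full-LRU replacement policy modeled in this section can be sketched with an ordered map; this is an illustrative model, not the simulator's implementation.

```python
from collections import OrderedDict

# Illustrative model of a fully-associative TLB with true full-LRU
# replacement. Tagging entries by (context, VPN) lets same-VPN
# translations from different processes coexist, one of the
# conflict-reduction benefits noted above.
class FullyAssociativeTLB:
    def __init__(self, entries):
        self.entries = entries
        self.lru = OrderedDict()          # (context, vpn) -> ppn, LRU order

    def lookup(self, context, vpn):
        key = (context, vpn)
        if key in self.lru:
            self.lru.move_to_end(key)     # promote to MRU on a hit
            return self.lru[key]
        return None                       # TLB miss

    def install(self, context, vpn, ppn):
        if len(self.lru) >= self.entries:
            self.lru.popitem(last=False)  # evict the true-LRU entry
        self.lru[(context, vpn)] = ppn

tlb = FullyAssociativeTLB(entries=2)
tlb.install(1, 0x10, 0xA)
tlb.install(1, 0x20, 0xB)
assert tlb.lookup(1, 0x10) == 0xA         # 0x10 becomes MRU
tlb.install(1, 0x30, 0xC)                 # evicts 0x20, now the LRU entry
assert tlb.lookup(1, 0x20) is None
```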
It would be amiss though not to briefly touch upon some shortcomings of a fully-associative
TLB design. FA structures are generally power-hungry and slow to access; energy and access
time measurements will be presented in Chapter 4. Also, as capacity scales, a full-LRU replace-
ment policy is an unrealistic design choice and thus the MPMI is going to be different (likely
higher) with the most commonly employed pseudo-LRU or random replacement policies.
Even when comparing the total MPMI for each vertical column in Figure 3.6 with its corre-
sponding column in Figure 3.4, there are some instances where the split L1-TLB configurations
perform better in terms of MPMI (even though on average they are worse). For example, in
cassandra, the FA configurations with entries in the 64 to 1K range (inclusive) are from 4% to
20% worse in terms of MPMI compared to their split-TLB counterparts. Cassandra accesses
slightly over 1K superpages and its MPMI really benefits from having a separate superpage
TLB structure. The FA TLB is better only when it becomes sufficiently large to
host some working set of these pages, or when the split superpage TLB is too small (16 entries)
to make a difference for that footprint. For TPC-C2, which has a significant superpage foot-
print, the FA TLB is better when it has 128 or more entries. The difference is negligible for
smaller TLB sizes, mostly because the superpage footprint is too large for the split superpage
TLBs to make a significant difference; even increasing the split superpage TLBs from 16 to 1K
entries only reduces the 4MB MPMI by 46%.
The FA hit-rates depicted in Figure 3.6 continue to be high, most of the time slightly higher
than for split-L1 TLBs. Canneal continues to have the lowest hit-rate of all workloads;
even a fully-associative TLB cannot accommodate the pseudo-random access pattern [22] of
this simulated-annealing workload, which also has the largest footprint of all the workloads.
3.6.3 Set-Associative L1-TLB for Small Pages and Fully-Associative L1-TLB
for Superpages
The last L1 TLB design simulated and presented in this chapter is based on the TLBs from
the UltraSparc-III processors and involves a set-associative design for the smallest and most
prominent page-size and a fully-associative design for translations from all the other supported
page-sizes. Figure 3.8 depicts the variation in MPMI and hit-rate when we vary the number
of entries in the 2-way set-associative TLB that only hosts translations for 8KB pages. The
fully-associative TLB that hosts translations for superpages remains fixed at 16 entries.
The MPMI shown in Figure 3.8 is consistently worse than its split-TLB counterpart from
Figure 3.4. Due to the difference in scale, Canneal is shown separately in Figure 3.9. When
focusing on the 8KB MPMI stacked bars, since these are the only ones affected by the SA
capacity increase, one can observe that the smaller associativity of two in the SA TLB hurts
MPMI.
Figure 3.8: L1 TLB MPMI and Hit-Rate over different TLB sizes for the 2-way SA TLB that only hosts translations for 8KB pages. A fixed 16-entry FA TLB is modeled for all superpages.
Figure 3.9: Canneal MPMI detail with larger y-axis scale.
The only case where the UltraSparc-based configuration
is better in terms of overall MPMI is for cassandra where
having a fully-associative 16-entry TLB is considerably better
than the 16-entry 4-way SA split-TLB. The hit-rate follows
similar trends. Classification is a noteworthy exception where
increasing the capacity of the SA TLB has little to no impact
on MPMI and hit-rate, since for this workload it is superpage
capacity that matters.
Figure 3.10 depicts how MPMI changes when increasing
the capacity of the FA superpage TLB. For brevity, only work-
loads whose MPMI changes when the FA-size increases beyond 16-entries are depicted. Simply
doubling the FA capacity from 16 to 32 entries greatly reduces MPMI for classification (80%
MPMI reduction). Contrasting this behaviour with the useless SA capacity increase for this
workload illustrates why blindly increasing the capacity of a TLB that statically caters to a
specific page-size can be a poor and wasteful design decision for some workloads. Chapter 4
will present our work on “Prediction-Based Superpage-Friendly TLB-Designs” that addresses
such concerns.
Figure 3.10: L1 TLB MPMI over different TLB sizes for the FA TLB that hosts translations for all superpages. A fixed 2-way SA 512-entry TLB is modeled for 8KB pages.
3.6.4 L2-TLB
This section complements the L1-TLB capacity sensitivity study by exploring the effect of
L2-TLB capacity on MPMI and hit-rate. L2 TLBs are, like any other hierarchical structure,
larger and slower to access than their L1 counterparts. Commercial L2-TLBs are usually set-
associative, with 4-way and 8-way SA TLBs being the most common. Traditionally, L2-TLBs
support either a single page-size or multiple page-sizes, the latter presumably via multiple
sequential lookups or by splitting a single superpage translation into multiple translation
entries of the supported page size.
To prune the vast design space, this section uses a single L1-TLB configuration while mod-
ifying L2-TLB properties. The state-of-the-art L1-TLB configuration used is a 64-entry 4-way
SA split TLB for 8KB pages along with three 32-entry 4-way SA TLBs, one for each remaining
page size, and corresponds to the second column in Figure 3.4. An 8-way SA L2-TLB with a
cache-like indexing scheme is modeled that only hosts translations for 8KB pages. This asso-
ciativity matches the one in Haswell’s L2-TLB. Upon an L1-TLB miss, the missing translation
is installed in both TLB levels. Figure 3.11 depicts MPMI and hit-rate as the capacity of each
private L2-TLB increases from 512 entries to 64K entries.
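The fill policy just described - probe the L1, then the L2, and install the missing translation in both levels - can be sketched as follows. This is an illustrative model with an unbounded stand-in TLB; capacity and replacement are deliberately elided.

```python
# Illustrative sketch of the two-level fill policy: probe L1, then L2,
# walk the page table on a full miss, and install in both TLB levels.
class TinyTLB:
    """Unbounded stand-in for a TLB level (capacity modeling elided)."""
    def __init__(self):
        self.map = {}

    def lookup(self, context, vpn):
        return self.map.get((context, vpn))

    def install(self, context, vpn, ppn):
        self.map[(context, vpn)] = ppn

def translate(l1, l2, context, vpn, page_table_walk):
    ppn = l1.lookup(context, vpn)
    if ppn is not None:
        return ppn                              # L1 hit
    ppn = l2.lookup(context, vpn)               # counted toward L2 hit-rate
    if ppn is None:
        ppn = page_table_walk(context, vpn)     # full miss-handling path
        l2.install(context, vpn, ppn)
    l1.install(context, vpn, ppn)               # fill both levels
    return ppn

l1, l2 = TinyTLB(), TinyTLB()
walks = []
def walk(context, vpn):
    walks.append(vpn)
    return vpn + 0x100                          # hypothetical mapping

assert translate(l1, l2, 7, 0x5, walk) == 0x105   # cold miss: one walk
assert translate(l1, l2, 7, 0x5, walk) == 0x105   # now an L1 hit
assert walks == [0x5]                             # the walk ran only once
```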
Even adding the smallest 512-entry L2-TLB reduces the MPMI over a baseline without an
L2-TLB by 56.5% on average (amean); the L1-MPMI for each workload is shown in parenthe-
ses for reference in Figure 3.11. For workloads such as classification that are dominated by
superpage misses, increasing the L2-TLB beyond some size (here 1K entries) has no impact,
as also shown by the hit-rate, which remains unchanged at 6.8%.
[Figure 3.11 plot data omitted; per-workload L1-TLB MPMI baselines from the x-axis labels: apache (2534), TPC-C2 (2592), TPC-C1 (1846), canneal (14758), ferret (340), x264 (35), cassandra (546), classification (842), cloud9 (2101), nutch (751), streaming (4541).]
Figure 3.11: L2 TLB MPMI and Hit-Rate over different TLB sizes. The x-axis lists the number of L2 TLB entries for an 8-way SA L2-TLB that only supports 8KB pages. Canneal saturates with this y-axis scale; see detail in Figure 3.12.
Figure 3.12: Canneal L2-TLB MPMI detail with larger y-axis scale.
Overall, the L2-TLB hit-rate (secondary y-axis) is sig-
nificantly lower than the one measured for the L1-TLBs, an
anticipated behaviour given that most spatial and temporal
locality has been filtered by the first TLB level. As the ca-
pacity moves beyond the maximum number of unique trans-
lations for each workload (reported in Section 3.3 in the be-
ginning of this chapter), there are diminishing returns; the
only benefits are from reducing conflict misses.
To better quantify the usefulness of L2-TLBs, Figure 3.13
classifies - at the end of each trace execution - all L2 TLB
entries as either invalid or valid and presents these numbers
as a percentage of the overall L2 TLB capacity (y-axis). The
x-axis depicts the number of L2 TLB entries (lower label); for each TLB size, there are 16
adjacent vertical columns, one per core, as indicated by the 0 and 15 upper labels for the
512-entry L2-TLB configuration. The remaining upper x-axis labels are omitted for brevity.
The graphs that follow illustrate not only that for many workloads a significant percentage
of the L2-TLB capacity is wasted as TLB capacity increases, but also highlight the occasional
differences that might exist across cores. For example, x264 and cloud9 are two workloads
where L2-TLBs of different cores see drastically different occupancies. These workloads had a
high standard deviation for the unique per core translation as listed in Table 3.4 earlier in this
chapter.
Figure 3.13: Per-Core L2-TLB Capacity classified percentage-wise into valid and invalid TLB entries for different L2-TLB sizes.
[Per-core occupancy bar charts: (a) Commercial Workloads (apache, TPC-C2, TPC-C1), (b) PARSEC Workloads (canneal, ferret, x264), (c) Cloud-Suite Workloads (cassandra, classification, cloud9, nutch, streaming); for each L2-TLB size from 512 to 64K entries, 16 adjacent columns correspond to cores 0 through 15.]
3.7 Compressibility and Compression
This section explores the compressibility of the translation information held in the TLB and
other paging structures. The motivation for this analysis was the idea of a larger L2-TLB
that could use fewer hardware resources if some hardware compression was employed. The
less involved approach, in terms of hardware complexity, is to compress the information within
each conventional translation entry. This approach would be particularly useful if one were to
support a cacheable TLB structure (see Chapter 5) as more translations could be packed in
the same cache block. The parts of the translation entry expected to be more amenable to
compression are the virtual page number (part of the TLB tag) and the physical page number
(part of the TLB data block). One can anticipate that the upper parts of the VPN and PPN
could have higher degrees of compression as a smaller part of the address space is expected to
be accessed closer in time. The sensitivity study presented in this section demonstrates both
the compressibility potential and some optimization techniques that could harvest it.
To identify viable compression techniques, the variation of unique values in the VPN and
PPN fields was first explored for a set of workloads. Unique values were measured at a per
byte or a per byte-set granularity as indicated in Figure 3.14. The two least-significant bytes of
the VPN and PPN fields are ignored as all but two bits of the second least-significant byte fall
under page-offset bits. Bytes and byte-sets are numbered starting from the most-significant byte (MSB).
Byte-set i contains bytes MSB 0 through MSB i, i.e., the i+1 most significant bytes of the relevant field.
Bits [63:56] = MSB 0, [55:48] = MSB 1, [47:40] = MSB 2, [39:32] = MSB 3, [31:24] = MSB 4, [23:16] = MSB 5; bits [15:0] are excluded. Byte-Set 0 covers MSB 0, Byte-Set 1 covers MSB 0-1, and so on up to Byte-Set 5, which covers MSB 0-5.
Figure 3.14: Unique Bytes and Byte-Sets Nomenclature
The working assumption is that if only a limited number of unique bytes or byte-sets exist,
then one could trade storing accurate byte information for an index to a table that stores these
unique bytes or byte-sets. Figure 3.15 plots the unique number of bytes and byte-sets for both
virtual and physical addresses across our workloads. The smaller this number is, the more
potential compression has. For each unique byte the maximum possible number of
values is 256, while for byte-set i the maximum number of values is 2^(8·(i+1)). The ALL x-axis label
in the figure reflects the unique values measured across all workloads, a number relevant
for systems running multi-programmed workloads.
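The measurement underlying Figure 3.15 can be sketched as follows; this illustrative helper counts unique byte and byte-set values over a list of 64-bit addresses, with byte 0 being the MSB ([63:56]).

```python
# Illustrative sketch of the unique-value measurement: count distinct byte
# and byte-set values over the upper bytes of a set of 64-bit addresses.
# Byte-set i spans bytes MSB 0 through MSB i, per this section's nomenclature.
def unique_bytes_and_sets(addresses, num_upper_bytes=4):
    bytes_seen = [set() for _ in range(num_upper_bytes)]
    sets_seen = [set() for _ in range(num_upper_bytes)]
    for addr in addresses:
        for i in range(num_upper_bytes):
            shift = 56 - 8 * i
            bytes_seen[i].add((addr >> shift) & 0xFF)
            sets_seen[i].add(addr >> shift)   # all bytes from the MSB down
    return [len(s) for s in bytes_seen], [len(s) for s in sets_seen]

# Heap-side (MSB 0x00) and stack-side (MSB 0xFF) addresses, as in pmap:
addrs = [0x0000000100110000, 0x0000000100120000, 0xFFFFFFFF7FFEA000]
per_byte, per_set = unique_bytes_and_sets(addrs)
assert per_byte[0] == 2    # MSB 0 takes the values 0x00 and 0xFF
assert per_set[3] == 2     # two distinct upper-32-bit patterns
```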
The results indicate that the first four MSBs see less than ten unique values each, compared
to the maximum of 256, making it possible to reduce the TLB storage for these bytes to at
least half. The unique values are significantly fewer for the upper bytes; this is a side-effect of
Figure 3.15: Number of unique bytes and byte-sets in the virtual and physical addresses.
the maximum supported virtual and physical addresses. The Sparc-V9 architecture supports a
64-bit virtual address space in all cases. Because of the address space layout of each process,
the first most significant byte, MSB 0, was observed to have up to four unique values across all
workloads for data memory accesses, as illustrated by the VPN Bytes series in Figure 3.15 and
the ALL x-axis label. Two of the unique MSB 0 values, 0x00 and 0xff were due to the location
of the heap and the stack in the address space respectively. TPC-C2 was the only workload
with one observed unique MSB-0, because this was the only 32-bit application running on the
64-bit capable operating system. A snippet of the pmap output for one of the processes (PID
937) running for TPC-C1 is shown below:
$pmap -x 937
Address Kbytes Resident Shared Private Permissions Mapped File
[..]
000000010011A000 1384 1384 680 704 read/write/exec [ heap ]
7FFFFFFF7BB04000 8 8 - 8 read/write [ anon ]
[..]
FFFFFFFF7FFEA000 88 88 - 88 read/write [ stack ]
[..]
Figure 3.15 also shows that byte-set 3 has fewer than 16 unique values, orders of magnitude
less than the maximum number of 2^32. Therefore, these upper bytes and/or byte-sets are
great compression candidates.
The compression potential degrades as we move towards lower-order bytes and byte-sets.
For example, Figure 3.16 depicts the number of unique values for the fifth MSB (MSB4). Here
the potential is more limited as one could save at most 1 or 2 bits for the fifth VPN byte and
that only in specific workloads like apache or canneal. Therefore, bits below the 32-bit line are
not good unique-value compression candidates for the simulated workloads.
Pham et al. employed a compression mechanism in their clustered TLB design [57]. In
Section 5.3 “Frequent Value locality in the Address bits” they demonstrated the entropy, i.e.,
Figure 3.16: Number of unique values for MSB 4 and Byte-Set 4 (both in virtual and physical addresses).
average number of unique values, in the upper bits of VPN and PPN for their workloads. They
then employed two auxiliary tables (VUBT and PUBT) to keep track of the most common
virtual and physical upper bits respectively. These two tables were limited to 8 and 4 entries
respectively. Whenever the unique values did not fit in the two aforementioned tables due to
space constraints, the translations were limited to the unencoded ways of the baseline TLB.
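A minimal sketch of such upper-bits-table compression follows, in the spirit of (but not a faithful reproduction of) the VUBT/PUBT scheme described above; the class name and linear-search encoding are illustrative.

```python
# Hedged sketch of upper-bits-table compression: a translation entry
# stores a small index into a table of common upper-bit patterns instead
# of the bits themselves; entries whose pattern does not fit in the table
# remain unencoded (as in the cited design's unencoded TLB ways).
class UpperBitsTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.patterns = []                  # index -> upper-bits pattern

    def encode(self, upper_bits):
        """Return a table index, or None when full (leave unencoded)."""
        if upper_bits in self.patterns:
            return self.patterns.index(upper_bits)
        if len(self.patterns) < self.capacity:
            self.patterns.append(upper_bits)
            return len(self.patterns) - 1
        return None

    def decode(self, index):
        return self.patterns[index]

vubt = UpperBitsTable(capacity=8)           # 8 entries, as in the cited VUBT
idx = vubt.encode(0x00000001)               # common heap-side upper bits
assert vubt.decode(idx) == 0x00000001
assert vubt.encode(0x00000001) == idx       # reused, no new entry allocated
```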
3.8 The First Cache Block Access After A TLB-Miss
Virtual addresses are by default a closer representation of an application’s behaviour than
physical addresses, especially in an overloaded system where fragmentation might be present.
The physical addresses that the virtual addresses have been translated into can depart from,
and thus muddle, locality patterns that exist in the application’s accesses to its more common
data structures. It is possible to envision using TLB-filtered observations to guide cache opti-
mizations, such as TLB-guided cache prefetching. Even though this thesis does not evaluate
such a mechanism, it provides a useful observation for future research. This section examines
how often the cache block (memory address) that triggered a TLB miss matches the cache
block accessed the last time a TLB miss for that same page had occurred. If the two cache
block addresses match (matching refers to their memory addresses and not their contents),
then TLB-prefetching mechanisms could also prefetch the corresponding cache blocks, on top
of translations, thus potentially reducing the memory latency associated with these requests.
For example, if the last time a process missed on VPN A it had accessed the second 64B cache
block of the corresponding physical page, then the next time this process misses on VPN A
this mechanism would predict that it will again access the second 64B cache block. Because
the TLB tracks memory at a coarser granularity than caches, a TLB miss usually indicates
that the data in that page has not been accessed in a while and will thus likely exist either in
lower level caches or off-chip. Therefore, prefetching them in advance, as soon as the virtual to
physical translation is known, could improve performance.
Figure 3.17 depicts the percentage of L1 D-TLB misses that access the same 64B cache block
as the most recent TLB-miss to the same virtual page. On average, 51% of all TLB misses
across all workloads access the same 64B cache block as the most recent, from the same core,
TLB-miss to that translation-entry. This number goes up to 78% for the streaming workload.
[Figure 3.17 data, percent per workload: apache 51.64, TPC-C2 57.93, TPC-C1 56.41, canneal 24.42, ferret 16.10, x264 62.93, cassandra 54.54, classification 40.55, cloud9 50.77, nutch 72.44, streaming 77.80.]
Figure 3.17: Percentage of all CMP D-TLB L1 Misses that access the same 64B cache block as the last time that same translation-entry experienced a TLB miss.
These results indicate that there is a high predictability of cache accesses that miss in the L1
D-TLBs. As mentioned earlier in this thesis, the vast majority of TLB misses are to 8KB pages;
therefore, only seven bits (log2(8192/64)) per translation-entry would be required to keep track
of the cache block likely to be accessed on a TLB-miss.
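The seven-bit last-block predictor suggested by these numbers can be sketched as follows; names are illustrative, and a real design would store the offset alongside each TLB entry rather than in a separate table.

```python
# Illustrative sketch of the last-block predictor: each translation
# remembers the 64B block offset touched on its last TLB miss, needing
# log2(8192/64) = 7 bits per 8KB-page entry. The dict is a stand-in for
# per-entry storage.
PAGE_SIZE, BLOCK_SIZE = 8192, 64
OFFSET_BITS = (PAGE_SIZE // BLOCK_SIZE - 1).bit_length()   # 7 bits

last_block = {}                       # (context, vpn) -> last block offset

def on_tlb_miss(context, vpn, vaddr):
    """Return the predicted block offset (or None), then train."""
    offset = (vaddr % PAGE_SIZE) // BLOCK_SIZE
    predicted = last_block.get((context, vpn))
    last_block[(context, vpn)] = offset
    return predicted

assert OFFSET_BITS == 7
vaddr = (0xA << 13) + 2 * BLOCK_SIZE                # third block of page 0xA
assert on_tlb_miss(1, 0xA, vaddr) is None           # first miss: no history
assert on_tlb_miss(1, 0xA, vaddr) == 2              # repeat miss predicted
```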
3.9 Concluding Remarks
This chapter presented an analysis of TLB-related behaviour for a set of state-of-the-art ap-
plications, emphasizing commercial and cloud workloads. Our analysis involved the following
taxonomy: (i) characteristics inherent to the workloads, that is, characteristics or metrics unaf-
fected by translation caching structures like the TLBs, and (ii) other metrics (e.g., MPMI) that
are influenced by the architecture of these structures. The workloads’ data footprint in terms
of unique translations for page size, the presence of multiple processes and the CMP cores they
run on, as well as the lifetimes of translation mappings are all examples of characteristics that
are inherent to the workload. We believe that the characteristics and metrics presented here
should be of interest to anyone doing research in the area of address-translation optimizations.
Knowing the nuances of each workload can help both understand program behaviour and also
guide design decisions at the architectural level.
As anticipated, our results show that there is no single TLB model to match all workloads’
needs. Even within the same class of workloads, there is up to an order of magnitude variation
in the number of unique translations. Variations in TLB size requirements (high standard de-
viation) can exist across cores within a workload too. Workloads also exhibit different degrees
of translation sharing across cores, as well as different superpage usage, both behaviours that
rigid TLB hierarchies would poorly capture. Our TLB capacity sensitivity study further illus-
trates how most mainstream TLB structures (e.g., split TLBs) are either biased towards the
smallest page size or make an implicit assumption about the page size distribution of memory
accesses. Chapter 4 demonstrates how these assumptions can waste energy and space and pro-
poses Prediction-Based Superpage-Friendly TLB Designs that can allow translations of different
page-sizes to coexist in a single set-associative TLB, sharing its capacity as needed.
Each unique translation, and the TLB entry it might occupy, incorporates by default infor-
mation for the process it belongs to. These contexts provide us with another abstraction level to
observe TLB-related workload behaviour such as translation sharing across cores or per process
footprint. Even though there are variations in the frequency, data reach, and occasionally the
lifetime of each context, one should not filter or ignore them. Context-aware TLB indexing
schemes might warrant future research.
Despite the fluidity of so many characteristics, translation invalidations and modifications
are rare for the evaluated workloads. This observation was made in other research works
as well. The persistence of translation mappings encouraged researchers to propose changes
to the memory allocation algorithm to bypass paging for select large memory regions, e.g.,
direct segments [12], redundant memory mappings [45]. On our end, we believe that persistent
translation mappings can motivate history-based TLB schemes. Chapter 5 presents our history-
based cacheable TLB, a speculative (by configuration) design not kept coherent with the page
tables.
The last contributions of this chapter are the observations on (i) the compressibility of
translation entries, and (ii) the predictability of the cache block accessed within a page on a
TLB miss. Although not used in this work, we hope these results can motivate future research.
As this chapter illustrated, the landscape of TLB-related workload behaviour is vast. The
results presented here have charted different, often overlapping, facets of this landscape.
Chapter 4
Prediction-Based
Superpage-Friendly TLB Designs1
4.1 Overview
Several technology trends compound to make TLB performance and energy critical in today's
systems. Physical memory sizes and application footprints have been increasing without
a commensurate increase in TLB size and thus coverage. As a result, while TLBs still reap the
benefits of spatial and temporal locality due to their entries’ coarse tracking granularity, they
now fall short of the growing workload footprints. The use of superpages (i.e., large contiguous
virtual memory regions which map to contiguous physical frames) can extend TLB coverage.
Unfortunately, there is a “chicken and egg” problem: some workloads do not use superpages
due to the poor hardware support, and no additional support is added as workloads tend not
to use them.
The number of page sizes supported in each architecture varies. For example, x86-64 sup-
ports three page sizes: 4KB, 2MB and 1GB. UltraSparc III supports four page sizes: 8KB,
64KB, 512KB and 4MB, while the MMUs in newer generation SPARC processors (e.g., Sparc
T4) support 8KB, 64KB, 4MB, 256MB and 2GB page sizes [72]. Itanium and Power also
support multiple page sizes. For example, POWER8 supports 4KB, 64KB, 16MB and 16GB
pages [73]. Larger page sizes extend TLB reach, reduce the TLB miss handling penalty (assum-
ing multi-level page tables), and could even enable further data prefetching without crossing
smaller-page boundaries. But using larger page sizes risks fragmentation. The use cases of the
various page sizes vary across systems. For example, Power systems running Linux use 64KB
as their default page size; however, this choice can be harmful, e.g., if “an application uses
many small files, which can mean that each file is loaded into a 64KB page” [33]. In these
systems, 16MB pages require specific support (e.g., the Linux libhugetlbfs package); these
pages are “typically used for databases, Java engines, and high-performance computing (HPC)
1A modified version of this chapter has been previously published in the Proceedings of the IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), February 2015 [55].
Chapter 4. Prediction-Based Superpage-Friendly TLB Designs 61
applications” [33].
While variety in page sizes may cater to each application’s memory needs/patterns [54],
it may burden the OS with selecting and managing multiple page sizes. It also makes TLB
design more challenging since the page size of an address is not known at TLB lookup. This is
a problem for set-associative designs as page offset bits cannot be used in the set index. Thus,
modern systems support multiple page sizes by implementing multiple TLB structures, one per
size, or alternatively resort to a fully-associative TLB structure.
Each design has its own trade-offs. Fully-associative (FA) TLBs seamlessly support all
page sizes, but are much more power hungry and slower than their set-associative counterparts.
Such slow access times are better tolerated in heavily multithreaded systems, such as Sparc T4,
where individual instruction latency does not matter as much. Separate per page-size TLBs
(e.g., SandyBridge, Haswell) are sized a priori according to anticipated page size usage. These
structures are all checked in parallel, each using an indexing scheme appropriate for the page
size they cache. If a workload does not use some page sizes, the extra lookups waste energy
and underutilize the allocated TLB area. Haswell's [34] and Skylake's [38] L2 TLBs are rare
examples of commercial set-associative designs that support two page sizes; unfortunately,
their indexing method has not been publicly disclosed. Finally, UltraSparc III is representative
of designs that distinguish only between 8KB pages and superpages, storing the latter in a small
FA structure. Workloads that heavily use superpages thrash the small FA TLB.
The goal of this work is to allow translations of different page sizes to co-exist in a single
set-associative (SA) TLB, even at the L1 level, while: (1) achieving a miss rate comparable to
that of an FA TLB, and (2) maintaining the energy and access time of an SA TLB. The target
TLB design should allow elastic allocation of entries to page sizes. That is: (1) A workload
using mostly a single page size should be able to use all the available TLB capacity so that it
does not waste any resources or be limited by predetermined assumptions on page size usage.
(2) A workload that uses multiple page sizes should have its translations transparently compete
for TLB entries. An SA TLB will better scale to larger sizes without the onerous access and
power penalties of a large FA TLB.
Our analysis of the TLB behaviour of a set of commercial and scale-out workloads that
heavily exercise existing TLBs has indicated that: (i) some workloads do use superpages heavily,
and (ii) workloads tend to favor the largest superpage size, while intermediate page sizes rarely
appear. Motivated by these results, we propose a lightweight binary superpage prediction
mechanism that accurately guesses ahead of time if a memory access is to a superpage or not.
This prediction enables an elastic TLBpred design that dynamically adapts its super- and regular
page capacity to fit the application’s needs.
The rest of this chapter is organized as follows. Section 4.2 extends Chapter 3’s TLB
behaviour analysis for a set of commercial and scale-out applications with energy, access-time,
and x86 native execution results, demonstrating the need for adaptive superpage translation
capacity in the TLB. Section 4.3 discusses our binary superpage prediction mechanism, and
Section 4.4 describes how we incorporate it in the proposed TLBpred. Section 4.5 presents a
summary of the previously proposed but not evaluated Skewed TLB (TLBskew) [69] which allows
translations of different page sizes to coexist. This section also presents our enhanced TLBpskew
proposal that uses dominant page-size prediction to boost TLBpskew’s effective associativity.
Sections 4.6 and 4.7 present our methodology and evaluation results respectively, followed by
an overview of the related work in Section 4.8 and concluding remarks in Section 4.9.
4.2 Analysis of TLB-Related Workload Behavior
This section extends the analysis of TLB behaviour presented in Chapter 3. All results presented
in this chapter, except for Section 4.2.3, use full-system emulation of a SPARC system running
Solaris (Section 4.6 details the experimental methodology used). Section 4.2.1 summarizes
previously reported statistics characterizing the workload footprints. Section 4.2.2 presents
how this set of workloads behaves under different TLB designs and also quantifies their access
time/energy trade-offs. We target the data TLB, as its performance is much worse than that of
the instruction TLB. Finally, Section 4.2.3 presents results from native runs on an x86 system
running Linux. These results illustrate that some key observations that motivate this work also
hold true in an x86 system and on a different operating system.
4.2.1 Unique Translations Analysis Recap
Section 3.3.1 from Chapter 3 presented the number of per-core unique translations for 8KB
pages and superpages accessed on average during the execution of a 16 billion instruction
sample on a 16-core CMP system. This system supports four page sizes: 8KB, 64KB, 512KB,
and 4MB pages. Based on the measurements from Tables 3.4 and 3.5, we make the following
empirical observations:
• The average number of unique pages accessed per core is one or two orders of magnitude
more than the mainstream L1 TLB capacity.
• Even though four page sizes are supported, only the 8KB and 4MB page sizes were
prominently used. No use of 64KB pages was recorded, while there were very few, if any,
512KB pages.
• Superpage use varied drastically across workloads. The OLTP workloads TPC-C1
and TPC-C2 (commercial database systems), and three scale-out applications from the
Cloud Suite, cassandra, classification, and nutch, use 4MB pages. These workloads can
easily thrash an unbalanced TLB design with limited superpage capacity.
The TLB sensitivity study presented in Section 3.6 explored how different TLB organizations
impact D-TLB MPMI. The following section focuses on four specific TLB designs and quantifies
their MPMI, as well as their dynamic energy and access-time trade-offs.
4.2.2 TLB Miss Analysis and Access-Time/Energy Trade-Offs
Table 2.1 in Chapter 2 listed current commercial D-TLB designs. This section will focus on four
TLB designs whose original L1 TLB configurations are reiterated in Table 4.1 below for
convenience. These configurations are adapted for this chapter as follows. The AMD-12h-like
configuration models a 48-entry FA TLB and the Sparc-T4-like a 128-entry FA TLB, both
with LRU replacement. The Haswell-like TLB design has been tuned for our system's supported
page sizes: it includes four distinct 4-way SA TLBs, a 64-entry TLB for 8KB pages and three
32-entry TLBs for the 64KB, 512KB and 4MB page sizes. Lastly, the UltraSparc-III-like TLB
has a 4-way SA 512-entry TLB for
8KB pages and a 16-entry FA TLB for superpages.
Processor Microarchitecture | L1 D-TLB Configuration
AMD 12h family [4] | 48-entry FA TLB (all page sizes)
Sparc T4 [72] | 128-entry FA TLB (all page sizes)
Intel Haswell [38], [34] | 4-way SA split L1 TLBs: 64-entry (4KB), 32-entry (2MB) and 4-entry (1GB)
UltraSparc III [76] | 2-way SA 512-entry TLB (8KB); 16-entry FA TLB (superpages and locked 8KB)

Table 4.1: Commercial D-TLB Designs
Figure 4.1 shows the L1 D-TLB MPMI for the aforementioned adapted TLB designs. Lower
MPMI is better. The series are sorted from left to right in ascending L1 TLB capacity.
[Figure: grouped bar chart of L1 D-TLB MPMI (y-axis, 0 to 16000) for each workload (Commercial: apache, TPC-C2, TPC-C1; PARSEC: canneal, ferret, x264; Cloud-Suite: cassandra, classification, cloud9, nutch, streaming) under the four designs: 48-entry FA (AMD 12h-like), 128-entry FA (Sparc T4-like), split-L1 (Haswell-like), and UltraSparc III-like.]

Figure 4.1: D-TLB L1 MPMI for Different TLB Designs
The FA TLBs often have fewer misses than their SA counterparts. Increasing FA TLB size
from 48 to 128 entries further reduces MPMI, following the trend shown in Section 3.6.2. The
UltraSparc-III-like TLB, with its larger capacity for 8KB pages, performs best for workloads
that mostly use 8KB pages, such as canneal. On the other hand, this design suffers when its
small 16-entry FA TLB gets thrashed by the many 4MB pages of a workload like classification
whose majority of TLB misses are due to superpages. The split Haswell-based L1 TLBs, with
their smaller overall capacity, perform much better for classification but fall short on most
others.
To summarize, the analysis shows that:
1. FA TLBs have a lower miss rate, more so given a larger number of entries.
2. Split-TLB designs are the least preferable choice for these workloads.
3. Capacity can be more important than associativity (e.g., canneal).
Figure 4.2 plots these TLB designs on a “dynamic energy per read access” versus “access
time” plane, using estimates from McPAT's CACTI [49]. The preferred TLB design would have
the MPMI of the Sparc-T4-like design (Figure 4.1), the fast access time of the Haswell-like split
L1 TLBs, and the dynamic read energy per access of the AMD-12h-like 48-entry FA TLB. We
approach this goal with an elastic set-associative TLB design that uses superpage prediction
as its key ingredient.
[Figure: scatter plot of the four TLB designs (48-entry FA AMD-12h-like, 128-entry FA Sparc-T4-like, split-L1 Haswell-like, UltraSparc-III-like) on a dynamic-read-energy-per-access (nJ, 0.004 to 0.01) versus access-time (ns, 0.05 to 0.3) plane; closer to the origin is better.]

Figure 4.2: Access Time and Dynamic Energy Trade-Offs
4.2.3 Native x86 Runs
To further demonstrate that superpages are frequent, thus requiring enhanced TLB support,
we measure,2 using performance counters, the portion of TLB misses due to superpages during
2The native x86 results [55] presented here in Section 4.2.3 were collected by Xin Tong; they are included in this thesis for completeness.
native execution on an x86 system. Table 4.2 lists the parameters of the x86 system used for
these native runs. The workloads were run for 120 seconds, with measurements taken every
2M TLB misses with oprofile. Only a subset of the workloads was available due to software
package conflicts. Only 4KB and 2MB pages were detected in this system.
Processor | Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
OS | Linux with Transparent Huge Page support enabled
L1 D-TLBs | 32-entry for 2MB pages & 64-entry for 4KB pages, all 4-way SA
L2 TLB | 512-entry, 4-way SA, shared across instructions and data
Table 4.2: System Parameters for Native x86 Execution
The results in Table 4.3 show that superpages (i.e., 2MB pages) can be responsible for a signifi-
cant portion of TLB misses in an x86 system too. For example, slightly more than half of all
L1 D-TLB misses for cassandra and classification are due to 2MB page accesses. This system
not only supports different page sizes, but also runs a different operating system and mem-
ory allocator algorithm. These results further support our other empirical observations that
superpages can be an important contributor to TLB misses.
Workload | % L1 D-TLB Misses | % L2 TLB Misses
canneal | 16.4 | 2.6
cassandra | 51.8 | 14.8
classification | 54.5 | 56.2
cloud9 | 21.4 | 33.3
Table 4.3: Fraction of TLB Misses due to 2MB Superpages (x86)
4.3 Page Size Prediction
The page size of a memory access is unknown at TLB lookup time. This is a challenge
for a set-associative TLB caching translations of all page sizes: without knowing the page size,
we cannot decide which address bits to use for the TLB index. This section explains how a
page-size predictor can be used to overcome this challenge.
For simplicity, let us assume a system with only two page sizes. A binary predictor, similar
to those used for branch direction prediction, would be sufficient here. Using an index available
at least a cycle before the TLB access (e.g., PC), the predictor would guess the page size and
then the TLB would be accessed accordingly. A TLB entry match could occur only if the
predicted size is correct. If this first, primary, lookup results in a TLB miss, then either the
prediction was incorrect or the entry is not in the TLB. In this case, another secondary TLB
Chapter 4. Prediction-Based Superpage-Friendly TLB Designs 66
lookup is needed with the alternate page size. If this lookup also results in a miss, then a page
walk ensues.
Most architectures support multiple (N) page sizes. Thus, a binary predictor does not suf-
fice to predict the exact page size of an access. In such a system, a page size predictor would
have to predict among multiple page sizes [24]. It could do so by using wider (log2(2N) bits)
or multiple saturating counters to predict among the N possible page sizes. Besides the addi-
tional hardware and energy costs of this predictor, which may be modest, mispredictions and
misses would become more expensive: on a misprediction, up to N − 1 additional lookups may
be needed if the translation is present in the TLB. These serial lookups hurt performance and energy. We do
evaluate such designs in Section 4.7.6. The rest of this section discusses superpage predictors.
Section 4.4 presents the complete TLBpred design.
4.3.1 Superpage Prediction
To avoid multiple sequential lookups, we take advantage of the observed application behavior
and opt for a binary approach distinguishing between 8KB pages and superpages. Our predictor
guesses whether the page is a superpage but it does not guess its exact size. We manage all our
superpages homogeneously, as Section 4.4 will show. The proposed predictor uses a Prediction
Table (PT) with 2-bit saturating counters. The PT is a direct-mapped, untagged structure,
similar to bimodal branch predictor tables. Each entry has four possible states. Each state is
represented as A_B, where A specifies the prediction in the current state, and B the prediction
in the next state after a misprediction. Both A and B can take only two values: P signifies a
small (8KB) page and SP a superpage. Thus, the four states are the following: (i) strongly
predicted 8KB page (P_P), (ii) weakly predicted 8KB page (P_SP), (iii) weakly predicted
superpage (SP_P), and (iv) strongly predicted superpage (SP_SP). All entries are initialized
to the weakly predicted 8KB state.
For prediction to be possible, the index must be available early in the pipeline. The instruc-
tion’s address (PC) is a natural choice and intuitively should work well as an instruction would
probably be accessing the same data structure for sufficiently long periods of time, if not for the
duration of the application. However, libraries and other utility code may behave differently.
Another option is the base register value, which is used during the virtual address calculation
stage and thus is available some time before the TLB access takes place. Figure 4.3 presents the
two predictors that use the PC or the base register value as the PT index respectively. In all
predictors, the prediction occurs only for memory instructions. In the SPARC v9 architecture,
memory instructions have the two most significant bits set to one as shown in the same figure.
PT entries are updated only after the page size becomes known: on a TLB hit or after the
page walk completes in case of a TLB miss. The predictor tables are never probed or updated
during demap or remap operations.
Sections 4.3.1.1 and 4.3.1.2 next detail the two PT index types. In both cases, the least
significant log2(#PT entries) bits from the selected field are used as the index, discarding any
high-order bits.

[Figure: the two predictor organizations. (a) A PC-indexed Prediction Table, indexed with low-order bits of the PC. (b) A base-register-value-indexed Prediction Table, indexed with bits of the rs1 value (instruction bits 18-14) read from the register file. Memory instructions are identified by their two most significant opcode bits.]

Figure 4.3: (a) PC-Based and (b) Base Register-Value Based Page Size Predictors
4.3.1.1 PC-based Predictor
The first predictor uses the low-order PC bits. This information is available early in the
pipeline, as soon as we have identified that this is a memory instruction. A concern with PC-
based prediction is that the page size of a given page will be “learned” separately for different
instructions. For example, a program that processes different fields of a data structure would
do so via different instructions. However, most likely these data fields will all fall within the
same type of page (i.e., a superpage or an 8KB page). Having a PC-based index unnecessarily
duplicates this information, resulting in slower learning times and more aliasing. Commercial
and scale-out workloads often have large instruction footprints, thus putting pressure on PC-
based structures.
4.3.1.2 Base Register-Value-Based (BRV-based) Predictor
Address computation in the SPARC ISA uses either two source registers (src1 and src2) or a
source register (src1) and a 13-bit immediate. The value of register src1 dominates the result
of the virtual address calculation in the immediate case, and more often than not in the
two-source-register case as well, since it typically holds a data structure's base address.
Therefore, we use the value of source register src1 as an index, after omitting the lower 22 bits
to ignore any potential page-offset bits (this offset corresponds to the 4MB superpage size).
To demonstrate how src1 dominates the memory address calculation we provide below a
typical compiler-generated assembly of two small loops. The first loop initializes an array,
while the second sums its elements. The generated assembly of these loops, compiled with g++
with -O3 optimization on a SPARC machine, is listed below. In SPARC assembly [74], the
destination register of instructions is listed last. For example, add %o2, 0x3, %o0 adds the
contents of register o2 with the immediate value 0x3 and stores the result in register o0.
//=============================================================================
// Loop 1
//=============================================================================
for (i=0; i< cnt; i++) {
a[i] = i + 3;
}
/* In the beginning of each loop iteration:
reg. o2 holds i, reg. o3 holds &a[0], and reg. o4 holds cnt.
Registers o1 and o0 are used as temporaries. */
main+0x30: 93 2a a0 02 sll %o2, 0x2, %o1
main+0x34: 90 02 a0 03 add %o2, 0x3, %o0
main+0x38: 94 02 a0 01 add %o2, 0x1, %o2
main+0x3c: 80 a2 80 0c cmp %o2, %o4
main+0x40: 06 bf ff fc bl -0x10 <main+0x30>
main+0x44: d0 22 c0 09 st %o0, [%o3 + %o1]
//=============================================================================
// Loop 2
//=============================================================================
for (i = 0; i< cnt; i++) {
sum += a[i];
}
/* Reg. o2 holds i, reg. o3 holds &a[0], and reg. o4 holds cnt, as before.
Reg. i0 holds sum, while o0 and o1 are used as temporaries. */
main+0x64: 91 2a a0 02 sll %o2, 0x2, %o0
main+0x68: d2 02 c0 08 ld [%o3 + %o0], %o1
main+0x6c: 94 02 a0 01 add %o2, 0x1, %o2
main+0x70: 80 a2 80 0c cmp %o2, %o4
main+0x74: 06 bf ff fc bl -0x10 <main+0x64>
main+0x78: b0 06 00 09 add %i0, %o1, %i0
Both loops have one memory instruction that either modifies an array element (st, store
instruction in Loop 1) or reads an array element (ld, load instruction in Loop 2). The store
instruction in the first loop is in the branch delay slot, and is executed irrespective of the
branch outcome. In both these store and load instructions, the src1 register (bits 18-14 of the
instruction in SPARC-V9) is o3, which holds the base address of the array (i.e., &a[0]). The
src2 register (instruction bits 4-0) is register o1 for the store and register o0 for the load. Thus,
as expected, the value of o3 (array’s base address) will dominate.
By using only the base register value, and not the entire virtual address, prediction can
proceed in parallel with the address calculation. Accordingly, there should be ample time to
access the tiny 32-byte prediction table that Section 4.7 shows is sufficient.
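Both index derivations keep only the least significant log2(#PT entries) bits of the selected field. A minimal sketch follows (Python; the helper names are ours, and dropping the two always-zero PC alignment bits is our assumption):

```python
# Sketch of the two PT index derivations. A 32-byte PT of 2-bit
# counters holds 128 entries, so 7 index bits are kept.

PT_ENTRIES = 128
INDEX_MASK = PT_ENTRIES - 1

def pc_index(pc):
    """PC-based index: low-order instruction-address bits. Dropping the
    two always-zero alignment bits (SPARC instructions are 4-byte
    aligned) is our assumption, not spelled out in the text."""
    return (pc >> 2) & INDEX_MASK

def brv_index(src1_value):
    """Base-register-value index: omit the lower 22 bits so any potential
    page-offset bits (up to the 4MB superpage) are ignored, then keep
    the least significant log2(PT_ENTRIES) bits."""
    return (src1_value >> 22) & INDEX_MASK
```

Because `brv_index` needs only the raw register value, it can be computed in parallel with the address addition itself.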
4.4 Prediction-Guided Multigrain TLB
The proposed multi-grain TLB, TLBpred, is a single set-associative structure that uses two
distinct indices: an 8KB-based and a superpage-based index. This binary distinction mirrors
the observation that there are two prominent page sizes used in the analyzed system (8KB and
4MB). The multi-grain TLB can host translations of any page size, as its tags are wide
enough for the smallest supported page size (8KB).
Figure 4.4 shows the indexing scheme used for a given TLB size. All superpages, irrespective
of their size, share the same indexing bits. Also, Figure 4.5 shows a potential implementation
of the tag comparison for a predicted superpage access. With this indexing scheme all page
sizes are free to use all sets. Consecutive 8KB pages and 4MB pages, the two prominent page
sizes, map to consecutive sets. Consecutive 64KB or 512KB pages may map to the same set as
they use the same index bits as 4MB pages and thus may suffer from increased pressure on the
TLB. As these pages are relatively infrequent, this proves not to be a problem.
[Figure: per-page-size breakdown of the virtual address into page-offset, set-index and tag bits. Page offsets end at bit 12 (8KB), 15 (64KB), 18 (512KB) and 21 (4MB). 8KB pages take their set index from the bits just above the page offset, while 64KB, 512KB and 4MB pages all share the 4MB set-index bits.]

Figure 4.4: Multigrain Indexing with 4 supported page sizes, shown here for a 512-entry 8-way SA TLB (6 set-index bits).
[Figure: composition of the tag used for comparison on a superpage-predicted lookup. Each translation entry stores a valid bit, virtual-address bits 63-28 (tag), 27-22 (set index) and 21-13, plus a 2-bit Page Size field, Context and G bits. On a superpage lookup, the stored bits 21-13 are masked according to the entry's page size: 0x000 for 4MB, 0x1c0 for 512KB, 0x1f8 for 64KB.]

Figure 4.5: Multigrain Tag Comparison for Figure 4.4's TLB on superpage prediction. Page Size field (2 bits) included in every TLB entry.
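Under this layout, the set-index selection and the superpage-path tag comparison can be sketched as follows (a behavioural Python model for the 512-entry 8-way configuration; helper names are ours, not hardware):

```python
# Behavioural model of the multigrain indexing (Figure 4.4) and the
# superpage-path tag comparison (Figure 4.5) for a 512-entry 8-way SA
# TLB: 64 sets, 6 set-index bits. Page offsets: 8KB=13, 64KB=16,
# 512KB=19, 4MB=22 bits.

SET_MASK = 0x3F  # 6 set-index bits

def set_index(vaddr, predicted_superpage):
    if predicted_superpage:
        return (vaddr >> 22) & SET_MASK   # all superpages share the 4MB index
    return (vaddr >> 13) & SET_MASK       # 8KB index

# Masks for the nine "middle" bits VA[21:13]: on a superpage lookup,
# bits that fall inside the stored entry's page are ignored.
MIDDLE_MASK = {22: 0x000,   # 4MB: no middle bit is a tag bit
               19: 0x1C0,   # 512KB: VA[21:19] are tag bits
               16: 0x1F8}   # 64KB: VA[21:16] are tag bits

def superpage_tag_match(vaddr, entry_vaddr, entry_offset_bits):
    """Compare an incoming address against a stored entry on a superpage
    lookup; the entry keeps VA[63:28], VA[27:22] and VA[21:13]."""
    hi  = lambda v: v >> 28
    idx = lambda v: (v >> 22) & SET_MASK
    mid = lambda v: (v >> 13) & 0x1FF
    m = MIDDLE_MASK[entry_offset_bits]
    return (hi(vaddr) == hi(entry_vaddr) and
            idx(vaddr) == idx(entry_vaddr) and
            (mid(vaddr) & m) == (mid(entry_vaddr) & m))
```

Because the mask comes from the entry's stored Page Size field, a single comparator circuit serves all superpage sizes.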
While superpage prediction proves highly accurate, correctness must be preserved on mis-
predictions. Table 4.4 details all possible scenarios. The common case given the high TLB hit
rate and prediction accuracy is to have a TLB hit and a correct prediction. A TLB hit during
the primary TLB lookup implies a correct page size class (superpage or not) prediction, as each
entry’s page size information is used for the tag comparison. On a TLB miss, however, there is
a degree of uncertainty. A secondary TLB lookup is necessary, this time using the complement
page size class. For example, if the prediction was for an 8KB page, the secondary lookup uses
the superpage based index. In total, at most two lookups are necessary.
TLB Lookup Outcome (w/ Predicted Page Size) | Page Size Prediction | Effect
Hit | Correct | Expected common case. No further TLB lookups are required.
Hit | Incorrect | Only possible if both the primary and the secondary lookup probe the same TLB set (i.e., the set-index bits for 8KB and 4MB pages are the same) and the hardware supports it.
Miss | X (Unknown) | This could either be a misprediction (i.e., an incorrect TLB index was used) or a TLB miss.
Table 4.4: Primary TLBpred Lookup
Table 4.5 shows the two possible outcomes for this secondary TLBpred lookup (occurring
only on a primary TLBpred lookup miss) given a binary superpage predictor. A secondary
TLBpred hit implies an incorrect page-size prediction and doubles the TLB lookup latency.
However, this event is rare with an accurate predictor. Conversely, a secondary TLBpred miss
triggers a page-walk, making the latency of the secondary TLB lookup negligible compared to
the lengthy page walk latency.
Secondary TLB Lookup Outcome | Original Page Size Prediction | Effect
Hit | Incorrect | Misprediction. The second TLB lookup is successful.
Miss | X (Irrelevant) | True TLB miss. The page size is still unknown and a page walk is needed.
Table 4.5: Secondary TLBpred Lookup Using a Binary Superpage Predictor.
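Combining Tables 4.4 and 4.5, the overall TLBpred lookup amounts to at most two probes before a page walk. A minimal sketch (Python; probe and walk are hypothetical callables standing in for the TLB and the page-table walker):

```python
# Sketch of the two-phase TLBpred lookup (Tables 4.4 and 4.5).
# probe(vaddr, superpage) returns a translation or None;
# walk(vaddr) performs the page walk and returns
# (translation, is_superpage). stats counts the rare events.

def tlbpred_translate(vaddr, pred_superpage, probe, walk, stats):
    t = probe(vaddr, pred_superpage)            # primary lookup
    if t is not None:
        return t, pred_superpage                # common case: hit, correct prediction
    t = probe(vaddr, not pred_superpage)        # secondary lookup, complement class
    if t is not None:
        stats["mispredictions"] += 1            # hit, but the prediction was wrong
        return t, not pred_superpage
    stats["walks"] += 1                         # true TLB miss
    return walk(vaddr)

# The caller trains the predictor with the returned page-size class.
```

With an accurate predictor the secondary probe is rare, so the average lookup cost stays close to that of a plain SA TLB.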
Sections 4.4.1 and 4.4.2 next discuss how to (a) extend TLBpred to different page-size usage
scenarios, and (b) handle special TLB operations.
4.4.1 Supporting Other Page Size Usage Scenarios
In our analysis we have observed a bimodal page size distribution (i.e., two prominent page
sizes) both in SPARC and x86, which motivated our superpage predictor. This distribution
was expected for x86-64 which supports 4KB, 2MB and 1GB pages. The 1GB page size, when
enabled, is judiciously used so as not to unnecessarily pin such large memory regions. In all
cases, the proposed TLBpred works correctly for any page size distribution, possibly experiencing
increased set pressure for the non dominant page sizes (see Figure 4.15). We anticipate that the
observation that some page sizes dominate will hold in different architectures that also support
multiple page sizes.
4.4.1.1 Precise Page Size Prediction
Assuming that multiple page sizes may be actively used, one solution to avoid conflict misses
would be to use a predictor that predicts the exact page size [24]. Thus, contiguous pages of
all page sizes would map to subsequent sets. The downside is that all mispredicted TLB hits
and all TLB misses would pay the penalty of multiple sequential lookups, which could be hefty
in systems with a large number of supported page sizes. Section 4.7.6 touches upon such page
size usage scenarios further.
Table 4.6 summarizes the possible outcomes of the secondary TLB lookups for a page size
predictor predicting among N possible page sizes. A non-primary TLB lookup that hits in the
TLB signals a page-size misprediction. This misprediction overhead is high, making the TLB
hit latency anywhere from 2 to N times that of a primary TLB hit, and
may result in having to replay any dependent instructions that were speculatively scheduled
assuming a cache hit. The more page sizes are supported in a system with precise page size
prediction, the higher the misprediction overhead in case of multiple secondary lookups or a
TLB miss.
i-th TLB Lookup Outcome | Original Page Size Prediction | Effect
Hit | Incorrect | i-times TLB lookup latency.
Miss (i < N) | X (Irrelevant) | Repeat lookup with the (i+1)-th page size.
Miss (i = N) | X (Irrelevant) | True TLB miss. A page walk is in order, thus the increase in latency is, in proportion, small.
Table 4.6: i-th TLB Lookup (1 < i ≤ N); N supported page sizes.
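The serial probing this table describes can be sketched as a loop that tries the predicted size first and the remaining N − 1 sizes afterwards (Python; probe and walk are hypothetical callables):

```python
# Sketch of serial lookups with a precise page-size predictor: try the
# predicted size first, then the remaining N-1 sizes, then walk.

def precise_translate(vaddr, predicted_size, page_sizes, probe, walk):
    order = [predicted_size] + [s for s in page_sizes if s != predicted_size]
    for i, size in enumerate(order, start=1):
        t = probe(vaddr, size)
        if t is not None:
            return t, i                 # i > 1 signals a page-size misprediction
    return walk(vaddr), len(order)      # true miss after all N lookups
```

The returned lookup count makes the cost structure of Table 4.6 explicit: a correct prediction costs one probe, a mispredicted hit up to N, and a true miss N probes plus the walk.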
4.4.1.2 Predicting Among Page Size Groups
In SPARC, the two prominent page sizes were the smallest and the largest supported, and the
difference between the page sizes was not stark. However, this might not be the case in other
systems. In x86-64, the largest page size is 1GB. Having 2MB pages share the same TLB set-
index as the 1GB pages could result in 512 contiguous 2MB pages competing for the same set,
whereas in our system at most 64 contiguous 64KB entries would map to the same set.
A preferable option to precise page-size prediction, one that lowers its worst-case penalty, would be to
have TLBpred predict among groups of page sizes, following the same principle as superpage pre-
diction. These architecture-specific groups should be judiciously selected to minimize potential
set pressure due to common indexing. For example, instead of predicting across five page-sizes
in Sparc T4, one could predict among the following three groups: (i) 8KB, (ii) 64KB and 4MB,
(iii) 256MB and 2GB. Within a group, the index of the largest page-size would be used by the
smaller pages. In all cases the TLB entries will have sufficient bits to host the translation of
the smallest page size, including each translation’s page size information. Finally, for very large
page sizes (GB range), that are by default sparsely used and may be limited to mapping special
areas of memory (e.g., the memory of graphics co-processors), it might be worthwhile exploring
the use of a small bloom filter as a TLBpred addition or use Direct Segments instead [12].
4.4.2 Special TLB Operations
MMUs can directly modify specific TLB entries via special instructions. In the Cheetah-MMU
that our emulated Ultrasparc-III system uses, it is possible to modify a specific entry in the
FA superpage TLB, for example to modify locked entries or to implement demap operations.
In TLBpred, it is possible that the virtual address of the original TLB entry and the virtual
address of the modified TLB entry map to different sets, requiring some additional steps. In
general, any TLB coherence operation can be handled similarly to a regular TLB operation,
potentially requiring multiple lookups only in the rare cases the exact page size is relevant to
the operation but is not known. A demap-context or a demap-page operation would not fall
under this category.
Chapter 4. Prediction-Based Superpage-Friendly TLB Designs 73
4.5 Skewed TLB
A design that supports multiple page sizes in a single structure is the Skewed TLB, TLBskew [69].
Unfortunately, no experimental evaluation of its performance exists to date. This section re-
views the TLBskew design and explains how we applied it in our evaluated system. In TLBskew,
similarly to the skewed associative caches [70], the blocks of a set no longer share the same
index. Instead, each way has its own index function. However, unlike skewed-associative caches
where all addresses see the same associativity, in TLBskew a page maps only to a subset of ways
depending on its actual page size and its address.
The TLBskew hash functions are designed in such a way that a given virtual address can
only reside in a subset of the TLB’s ways, resulting in a per-page-size effective associativity.
The page size of this address’s virtual page determines this subset. At lookup time, when the
page size is yet unknown, log2(TLB associativity) bits of the virtual address are used by a
page size function, which determines that this address can reside in way-subset X as page
size Y. This expected size Y information is incorporated in each way’s set index, ensuring that
both the page offset bits and the page size function bits for this way are discarded.
Table 4.7 shows the page size function mapping proposed by Seznec for an 8-way skewed-
associative TLB [69] for the Alpha ISA. Our system supports the same page sizes. With this
mapping, a translation for a given virtual page (which has a specific page size) can only reside
in two out of the eight TLB ways. For this page size function and TLB organization, virtual
address A with bits 23-21 zero can map (i) to ways 0 and 4 if part of an 8KB page, (ii) to ways
1 and 5 if part of a 64KB page, (iii) to ways 2 and 6 if part of a 512KB page, or (iv) to ways
3 and 7 if part of a 4MB page. As Table 4.7 shows, bit VA[23] does not matter for mapping
8KB and 64KB pages (i.e., it is a don’t-care value), while bit VA[21] is a don’t-care value for mapping 512KB and 4MB pages.
Virtual Addr. Bits 23-21 8KB 64KB 512KB 4MB
000 ways 0 & 4 ways 1 & 5 ways 2 & 6 ways 3 & 7
001 ways 1 & 5 ways 0 & 4 ways 2 & 6 ways 3 & 7
010 ways 2 & 6 ways 3 & 7 ways 0 & 4 ways 1 & 5
011 ways 3 & 7 ways 2 & 6 ways 0 & 4 ways 1 & 5
100 ways 0 & 4 ways 1 & 5 ways 3 & 7 ways 2 & 6
101 ways 1 & 5 ways 0 & 4 ways 3 & 7 ways 2 & 6
110 ways 2 & 6 ways 3 & 7 ways 1 & 5 ways 0 & 4
111 ways 3 & 7 ways 2 & 6 ways 1 & 5 ways 0 & 4
Table 4.7: Page Size Function described in Skewed TLB [69].
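The mapping of Table 4.7 can be expressed directly as a lookup table. A minimal sketch follows; the function name and encoding are ours (the hardware computes this combinationally):

```python
# Table 4.7 as a lookup: for each page size, the base way selected by
# VA[23:21]; an entry can then live in ways (base, base + 4) of the 8-way TLB.
PAGE_SIZE_FUNC = {
    "8KB":   [0, 1, 2, 3, 0, 1, 2, 3],   # VA[23] is a don't care
    "64KB":  [1, 0, 3, 2, 1, 0, 3, 2],   # VA[23] is a don't care
    "512KB": [2, 2, 0, 0, 3, 3, 1, 1],   # VA[21] is a don't care
    "4MB":   [3, 3, 1, 1, 2, 2, 0, 0],   # VA[21] is a don't care
}

def way_subset(va: int, page_size: str) -> tuple:
    """Return the pair of ways where this (address, page size) may reside."""
    bits = (va >> 21) & 0x7              # VA[23:21]
    base = PAGE_SIZE_FUNC[page_size][bits]
    return (base, base + 4)
```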
Figure 4.6 shows the set-index selection bits of the virtual address for the four page sizes of
our baseline system. The set index and page size function bits are based on a 512-entry, 8-way
TLBskew. The set index has six bits, discarding any page size selection bits. These are the
indexing functions presented in the original skewed TLB paper [69], adjusted here for a smaller
TLB size. During lookup, these eight indices are computed, one per way. In our previous
example with virtual address A where VA[23:21] was 0, a hit in way 0 signifies an 8KB page.
[Figure: for each page size (8KB, 64KB, 512KB, 4MB), the virtual-address bit fields: page-offset bits (below bits 12, 15, 18, and 21 respectively), the bits used for the page-size function, the set-index bits for ways 0-3, and the XOR-derived set-index bits for ways 4-7.]
Figure 4.6: Skewed Indexing (512 entries, 8-way skewed associative TLB) with 4 supported page sizes.
The number of supported page sizes is hard-wired into the hash indexing functions, and all page sizes have the same effective associativity (two in our example). When an entry needs to be allocated and a replacement is in order, only the ways available to that page size are searched for an eviction candidate. In our prior example, if virtual address A belongs to an
8KB page (page size is known during allocation), then only ways 0 and 4 are searched. Since
the potential victims will reside in different sets, unless all set-index bits for ways 0-3 or 4-7 are
zero, an LRU replacement policy could be quite expensive.
Section 4.7.4 will evaluate both an LRU and an easier to implement “Random-Young”
replacement policy.
4.5.1 Prediction-Guided Skewed TLB
TLBskew allows a workload to utilize the entire TLB capacity even if it only uses a single page
size. However, the effective associativity limits the replacement candidates, causing translation
contention. The default skewed indexing scheme better caters to a uniform use of page sizes,
but Section 4.2 and also Chapter 3 showed that this is not the common case. There are three
considerations:
1. Because superpages cover coarser memory regions, there are far fewer of them than 8KB pages. Thus a uniform distribution might not be the best fit.
2. Some workloads, e.g., scientific workloads, mainly use 8KB pages. For them the effective
associativity is an unnecessary limitation.
3. For TLBs which use contexts or ASIDs to avoid TLB shootdowns on a context-switch,
the same virtual page could be used by different processes, with all those entries mapping
onto the same set. This mapping can apply more pressure due to the imposed effective
associativity limit.
We propose enhancing TLBskew with page size prediction with the goal of extending the
effective associativity per page size. Specifically, one way of increasing effective associativity
would be to perform two lookups in series. In the first we could check for 8KB or 4MB hits
and in the second for 64KB or 512KB hits. This way we can use more ways on each lookup
per size as there are only two possible sizes each time. The downside of this approach is that
it would prolong TLB latency for TLB misses and for 64KB/512KB pages.
We can avoid serial lookups for TLB hits while still increasing effective associativity by using
a prediction mechanism. Specifically, we adapt the superpage prediction mechanism so we do
not predict between 8KB pages and superpages, but between pairs of page sizes. We group the
most used page sizes (i.e., 8KB and 4MB) together and the less used page sizes (i.e., 64KB and
512KB) into a separate group. Our binary base-register value based page size predictor, with
the same structure as before, now predicts between these two pairs of pages.
The TLBskew hash functions are updated accordingly so that now only bit 22 (counting from zero) of the address is used for the page size function. For example, if this bit is
zero and we have predicted the 8KB-4MB pair, then this virtual address can reside in ways 2,
3, 6 or 7 as a 4MB page and in ways 0, 1, 4 and 5 as an 8KB page. Similar to TLBpred, if we
do not hit during this primary lookup, we use the inverse prediction of a page size pair and do
a secondary TLB lookup. Choosing which page sizes to pair together is crucial; in our case the
design choice was obvious, as our workloads’ page size usage was strongly biased.
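The paired primary/secondary lookup can be sketched as follows. This is a toy model: the "TLB" is a plain dictionary keyed by (page-size pair, VPN), abstracting away the skewed set/way indexing, and all names are ours.

```python
# Toy TLBpskew lookup over the two page-size pairs used in this section.
def other_pair(pair: str) -> str:
    """Inverse of the predicted page-size pair."""
    return "64KB/512KB" if pair == "8KB/4MB" else "8KB/4MB"

def lookup(tlb: dict, predicted_pair: str, vpn: int):
    """Primary probe with the predicted pair; on a miss, a secondary probe
    with the inverse pair. Returns (translation or None, number of lookups)."""
    hit = tlb.get((predicted_pair, vpn))
    if hit is not None:
        return hit, 1
    return tlb.get((other_pair(predicted_pair), vpn)), 2
```

A correct prediction costs a single probe; a misprediction or a true miss costs two, which is the latency trade-off quantified in the performance model of Section 4.7.5.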
4.6 Methodology
This work uses SimFlex [35], a full-system simulator based on Simics [52]. Simics models the
SPARC ISA and boots Solaris. All experiments are run on a 16-core CMP system, for 1 billion instructions per core, for a total of 16 billion executed instructions. As Section 3.2 stated,
to achieve reasonable simulation time, we collected D-TLB access traces for all our workloads
during functional simulation. We relied on Simics API calls (e.g., probing TLBs/registers) to
extract translation information.
In Simics each core models the Cheetah-MMU, the memory management unit for the
UltraSPARC-III processors with the D-TLB sizes shown in Table 4.1. Table 4.8 summarizes
the relevant fields of each TLB entry. In our system, the TLBs are software-managed. There-
fore, on a TLB miss a software trap handler walks the page tables and refills the TLB. This
is contrary to x86 systems where the TLBs are hardware-managed. Software-managed TLBs
allow for a more flexible page table organization, but at the cost of flushing the core’s pipeline
and potentially polluting hardware structures such as caches. In the simulated system, the trap
handler checks the Translation Storage Buffer (TSB) before walking the page tables. The TSB
is a direct-mapped, virtually-addressable data structure, which is faster to access than the page
tables. Most TLB misses hit in the TSB, requiring only 10-20 instructions in the TLB handler and a single quad load to access the TSB. All accesses are included in our trace. However, the traces should be representative even of systems with hardware page walkers, as the number of references due to the TSB is very small compared to the overall number of references and to those needed on page walks.
TLB Field (size in bits)   Description
VPN                        Virtual Page Number
Context (13)               The equivalent of the Address Space Identifier (ASI) in x86; prevents TLB flushing on a context-switch. The same VPN could map to different page frames based on its context3.
Global Bit (1)             Global translations are shared across all processes; the context field is ignored.
Page Size (2)              Specifies the page size in ascending order: 8KB, 64KB, 512KB and 4MB. Superpages are only allocated in the fully-associative TLB.
PPN                        Physical Page (Frame) Number
Table 4.8: TLB Entry Fields
Workloads: This chapter uses the set of eleven commercial, scale-out and scientific workloads
summarized in Table 3.3. These workloads were selected as they are sensitive to modern TLB
configurations.
3 The context that should be used for a given translation is extracted from a set of context MMU registers. The correct register is identified via the current address space identifier (i.e., ASI PRIMARY, ASI SECONDARY, or ASI NUCLEUS). For a given machine, the latter depends on the instruction type (i.e., fetch versus load/store) and the trap-level (SPARC supports nested traps).
4.7 Evaluation
This section presents the results of an experimental evaluation of various multi-grain designs.
Section 4.7.1 shows how accurate our superpage predictors are. Section 4.7.2 demonstrates
that TLBpred reduces TLB misses for the applications that access superpages and that it is
robust, not hurting TLB performance for the other applications. Section 4.7.3 contrasts the
energy of different TLB designs, including our TLBpred. Section 4.7.4 evaluates the TLBskew
and TLBpskew skewed TLB designs, while Section 4.7.5 models the resulting overall system
performance. Finally, Section 4.7.6 investigates how TLBpred performs under hypothetical,
worst case page usage scenarios.
4.7.1 Superpage Prediction Accuracy
To evaluate the effectiveness of the superpage predictor we use its misprediction rate, i.e., the
number of mispredictions over the total number of TLB accesses. The superpage predictor
described in Section 4.3.1 is used, with the transitions summarized in Figure 4.7.
[Figure: four predictor states P_P, P_SP (predict 8KB page) and SP_P, SP_SP (predict superpage); observing a superpage moves the state toward SP_SP, observing an 8KB page moves it toward P_P.]
Figure 4.7: PT Entry Transition Diagram
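Assuming the four states behave as a standard 2-bit saturating counter (P_P = 0 through SP_SP = 3, with the upper two states predicting a superpage), the predictor can be sketched as follows; the class and method names are ours:

```python
class SuperpagePredictor:
    """Binary superpage predictor: one 2-bit saturating counter per prediction
    table entry (states P_P=0, P_SP=1, SP_P=2, SP_SP=3)."""
    def __init__(self, entries: int = 128):   # 128 x 2 bits = 32B of state
        self.table = [0] * entries

    def predict(self, index: int) -> bool:
        """True => predict superpage (states SP_P and SP_SP)."""
        return self.table[index % len(self.table)] >= 2

    def update(self, index: int, was_superpage: bool) -> None:
        """Move toward SP_SP on a superpage, toward P_P on an 8KB page."""
        i = index % len(self.table)
        if was_superpage:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```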
Figure 4.8 shows how the misprediction rate varies over different PT indexing schemes
(x-axis labels) and different PT sizes (series). The misprediction rate is independent of the TLB
organization. A lower misprediction rate is better as it will reduce the number of secondary
TLB lookups. Three predictor indexing schemes are shown: (1) the PC, (2) the Base-Register (src1) Value, and
(3) the 4MB-page granularity of the actual virtual address. The last scheme is impractical since
it places the prediction table in the critical path between the address calculation and the TLB
access. However, it serves to demonstrate that the base register value scheme comes close to
what would be possible even if the actual address was known.
All prediction schemes perform well. The PC-based index is the worst due to aliasing and
information replication. These phenomena are less pronounced for the scientific workloads (e.g.,
canneal) that have smaller code size. Using the register-value based index performs consistently
better than the PC-index. The BRV-based predictor is almost as accurate as the exact address-
[Figure: page-size misprediction rate (%), 0-20, predicting 8KB vs. superpages; x-axis: PT index type (PC, 4MB VPN, BRV) per workload (Commercial: apache, TPC-C2, TPC-C1; PARSEC: canneal, ferret, x264; Cloud-Suite: cassandra, classification, cloud9, nutch, streaming); series: 32, 128, 512, and 1024 PT entries.]
Figure 4.8: Superpage-Prediction Misprediction Rate (%)
based predictor (4MB VPN ), which demonstrates that the source register src1 dominates the
address calculation outcome as expected.
The different series per PT index explore how the size of the prediction table influences
the misprediction rate. The bigger the table, the lower the risk of destructive aliasing. With
a minuscule 128-entry PT, which requires a meager 32B of storage, the average misprediction
rate across the workloads is 0.4% for the base register-value based PT index. Canneal exhibits
the worst misprediction rate of just 1.2%. Unless otherwise noted, the rest of this evaluation
uses this 32B superpage predictor.
4.7.2 TLBpred Misses Per Million Instructions and Capacity Distribution
Our goal was an elastic set-associative TLB design that would have the low MPMI of a Sparc-T4-like 128-entry FA TLB, the fast access time of Haswell-like split L1 TLBs, and the dynamic read-energy per access of an AMD-like 48-entry FA TLB, within a reasonable hardware budget. Figure 4.9
compares the MPMI of different TLBpred configurations to the MPMI of commercial-based TLB
designs. We vary the TLBpred associativity to ensure a power-of-two number of TLB sets. The results are
normalized over the AMD12h-like TLB. Numbers below one correspond to MPMI reduction;
the lower the better.
The 128-entry FA TLB (SPARC-T4-like), targeted for its low MPMI, is consistently better
[Figure: TLB MPMI relative to the AMD-like 48-entry FA TLB, per workload (Commercial: apache, TPC-C2, TPC-C1; PARSEC: canneal, ferret, x264; Cloud-Suite: cassandra, classific., cloud9, nutch, streaming); series: AMD12h-like, SPARC-T4-like, Haswell-like, Ultrasparc-III-like, and TLBpred at 128-entry 4-way SA, 160-entry 5-way SA, 256-entry 4-way, and 512-entry 4-way SA; y-axis clipped at 1.6, with out-of-range bars labeled 3.8 and 10.6.]
Figure 4.9: TLBpred MPMI relative to AMD-like 48-entry FA TLB
than the smaller 48-entry FA TLB (baseline); its MPMI ranges from 10.9% better for ferret to
97.5% better for classification. Our 256-entry set-associative TLB with its small 32B binary
predictor is the TLBpred configuration which meets that goal. Its MPMI ranges from 12.4%
to 82.5% better than the 48-entry FA baseline, and its AMEAN MPMI across all workloads is
7.7% better than the SPARC-T4-like. While this configuration uses twice as many entries as
the corresponding SPARC-T4-like configuration, it is set-associative and, as will be shown, faster and more energy efficient.
Compared to the Haswell-like TLB configuration, even the smallest 128-entry TLBpred is
considerably better. The 128-entry FA TLB has lower MPMI than the 256-entry TLBpred
for classification. This workload has the highest number of private per-core contexts of all the workloads, resulting in many pages (from different processes) with the same virtual address conflicting in the set-associative TLBpred. Even so, TLBpred still achieves a lower MPMI than the baseline, and is considerably better than even larger set-associative designs like the UltraSparc-III-like, whose relative MPMI is 10.6 for that workload.
4.7.2.1 TLBpred Capacity Distribution
TLBpred’s goal was to allow translations of multiple page-sizes to co-exist in a single set-
associative structure. Figure 4.10 shows a snapshot of the TLB capacity distribution for the
256-entry 4-way SA TLBpred, for all 16 cores, at the end of our simulations for a subset of our
workloads; the remaining workloads exhibit similar behaviour. Contrary to split-TLB designs
that have a fixed hardware distribution of the available L1-TLB capacity to different page sizes,
TLBpred’s capacity is dynamically shared across translations of different page sizes as needed.
Thus, for workloads like cassandra and classification, which heavily use superpages, 30-40%
of the available TLB capacity is occupied by translations for 4MB pages, whereas workloads
like canneal or cloud9 use almost 98-99% of their capacity for 8KB page translations. The
TLBpred capacity distribution also varies across CMP cores. For example, in TPC-C1, 53% of core #6's TLBpred capacity holds 4MB page translations, versus 17% on average for the other cores.
[Figure: per-core (cores 0-15) capacity distribution of the 256-entry 4-way SA TLBpred for apache, TPC-C2, TPC-C1, canneal, cassandra, classification, and cloud9; stacked segments: 8KB, 64KB, 512KB, and 4MB page translations, plus unoccupied/invalid entries; y-axis 0-100%.]
Figure 4.10: TLBpred per core capacity distribution over translations of different page sizes.
4.7.3 Energy
Figure 4.11 presents the total dynamic energy (in mJ) for a set of TLB designs. Using McPAT's CACTI [49], we collected the following three measurements for every TLB configuration for a 22nm
technology: (i) read energy per access (nJ), (ii) dynamic associative search energy per access
(nJ), added to the read energy (i) in case of fully-associative structures, and (iii) write energy
per access (nJ). For TLB organizations with multiple hardware structures (e.g., Haswell) these
measurements were per TLB structure. In all cases, in Cacti, we used the high performance
itrs-hp transistors and the cache configuration option that includes a tag array, and specified
the appropriate TLB configurations (e.g., number of sets, ways, etc.). The total dynamic energy
of the system was then computed based on Cacti’s measurements along with the measured, via
simulation, TLB accesses (hits/misses) of each structure and workload.
In principle, every TLB access (probe) uses read energy, whereas only TLB misses (alloca-
tions) consume write energy. For fully-associative structures (e.g., AMD12-like, Sparc-T4-like
designs), the read energy is the sum of components (i) and (ii). For TLB designs with distinct
TLBs per page-size (e.g., Haswell), the read energy per probe is the sum of each TLB’s read
energy as the page size is yet unknown. However, TLB misses only pay the write energy of
a single TLB structure, the one corresponding to the missing page’s size. The read energy of
TLBpred’s secondary TLB lookups was also accounted for, along with the read energy of the
128-entry superpage predictor.
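The accounting above can be summarized in a short sketch; the function and parameter names are ours, with the per-access energies coming from Cacti and the event counts from simulation:

```python
def dynamic_energy_mj(probes: int, misses: int, secondary_lookups: int,
                      read_nj: float, write_nj: float,
                      predictor_read_nj: float) -> float:
    """Every probe pays the TLB read energy plus a predictor read; each
    secondary lookup pays one extra TLB read; each miss pays one write."""
    total_nj = (probes * (read_nj + predictor_read_nj)
                + secondary_lookups * read_nj
                + misses * write_nj)
    return total_nj * 1e-6   # nJ -> mJ
```

For fully-associative designs, read_nj would be the sum of the read and associative-search components described above; for split-TLB designs, it would be the sum over all the probed structures.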
[Figure: dynamic TLB energy (mJ, 0-50) per workload; series (with access latency): AMD-like (0.19ns), Sparc-T4-like (0.25ns), Haswell-like (0.08ns), UltraSparc-III-like (0.18ns), TLBpred 256-entry 4-way SA (0.09ns), and TLBpred 512-entry 4-way SA (0.10ns).]
Figure 4.11: Dynamic Energy
As Figure 4.11 shows, the UltraSparc-III design has the highest energy. It probes its SA
and FA TLBs in parallel and the FA TLB access dominates. The Sparc-T4-like FA TLB, with
the lowest MPMI of all the designs, also has significantly high energy mostly due to its costly
fully-associative lookup. The Haswell-like TLB design incurs comparable dynamic energy costs
due to the multiple useless TLB probes of its distinct per-page-size structures. Our binary page-size
prediction mechanism could be employed to avoid this energy waste, serializing these lookups
on mispredictions and misses. Finally, the 256-entry TLBpred TLB is the nearest to the target
energy of the 48-entry FA TLB, which however has a significantly higher MPMI. The 256-entry
TLBpred is the smallest TLBpred design (with lower energy and latency) that meets our MPMI
target. Alternatively, the 512-entry TLBpred can yield lower MPMI but at a somewhat higher
energy/latency cost.
4.7.4 TLBskew and TLBpskew MPMI
This section evaluates different skewed TLB configurations. Figure 4.12 shows the MPMI
achieved by TLBskew and TLBpskew relative to the AMD-like baseline for a 256-entry 8-way
skewed-associative TLB. In the interest of space, we limit our attention to 256-entry TLB
designs. The first graph series shows the original TLBskew with a random-young replacement
policy. We use the hashing functions described in Section 4.5 where the effective associativity
for each page size is two [69]. “Random-Young” is a low-overhead replacement policy based
on “Not-Recently Used” [71]. A single (young) bit is set when an entry is accessed (on a hit or
on an allocation). All young bits are reset when half the translation entries are young. Upon
replacement, the policy randomly chooses among the non-young victim candidates. If no such
candidate exists, then it randomly selects among the young entries.
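A minimal sketch of this "Random-Young" policy follows; the class and method names are ours, and we assume the reset triggers once half the entries are young, which is one reading of the policy:

```python
import random

class RandomYoung:
    """NRU-style replacement: one 'young' bit per TLB entry [71]."""
    def __init__(self, num_entries: int):
        self.young = [False] * num_entries

    def touch(self, idx: int) -> None:
        """Called on a hit or an allocation."""
        self.young[idx] = True
        # Reset all young bits once half the entries are young (assumption).
        if sum(self.young) >= len(self.young) // 2:
            self.young = [False] * len(self.young)

    def pick_victim(self, candidates: list) -> int:
        """Randomly choose among non-young candidates, else among young ones."""
        old = [i for i in candidates if not self.young[i]]
        return random.choice(old if old else candidates)
```

Unlike LRU, this needs only one bit per entry and no per-set ordering, which matters for skewed designs where the candidates of one page size live in different sets.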
The second series in Figure 4.12 reports the relative MPMI of the predictor assisted TLBpskew
of Section 4.5.1. In any primary or secondary TLB lookup only two page sizes are possible.
Therefore the effective associativity for each page size is now four, which proves beneficial in
reducing conflict misses. By coupling the 8KB with the 4MB pages during prediction, the
predictor achieves a nearly zero misprediction rate (0.07% maximum).
Figure 4.12 also explores the impact of the replacement policy. The third and fourth columns
in the graph depict the TLBskew and TLBpskew design with LRU replacement. Due to the
deconstructed notion of a set, LRU would be expensive to implement in hardware [71] compared
to the more realistic “random-young”. The graph nevertheless reports it as a useful reference
point.
Finally, the last column in Figure 4.12 is our multigrain 256-entry TLBpred design. Our
TLBpred, with half the associativity of the skewed designs, reduces AMEAN MPMI (computed over all our workloads) by 45.7% over the AMD-like baseline, whereas TLBskew and TLBpskew
with the “random-young” replacement policy reduce it by 35.6% and 38.7% respectively. The
TLBpskew with the harder to implement LRU policy reduces MPMI by 48.2% on average.
[Figure: TLB MPMI relative to the AMD12h-like TLB (y-axis 0-1) per workload; TLBskew and TLBpskew are 256-entry 8-way SA, TLBpred is 256-entry 4-way SA; series: TLBskew w/ random-young replacement, TLBpskew w/ random-young, TLBskew w/ LRU, TLBpskew w/ LRU, and TLBpred.]
Figure 4.12: TLBskew, TLBpred, and TLBpskew: MPMI relative to AMD-like 48-entry FA TLB
4.7.5 Performance Model
This section uses an analytical model, as in prior work [19, 68], to gauge the performance impact of the best-performing design, TLBpred, as the use of software-managed TLBs, with the overhead of a software trap handler and the presence of the TSB, hindered detailed timing simulation. Saulsbury et al. were the first to use such a model [68], modeling performance speedup as:
Speedup = (CPIcore + CPITLBNoOptimization) / (CPIcore + CPITLBWithOptimization)    (4.1)
where
• CPIcore is the Cycles Per Instruction (CPI) of all architectural components but the TLB,
• CPITLBNoOptimization is the TLB CPI contribution of the baseline, and
• CPITLBWithOptimization is the TLB CPI contribution under their proposed TLB prefetch-
ing mechanism.
Bhattacharjee et al. quantify performance impact as “Cycles per Instruction (CPI) Saved” over
baseline [19]. This metric is valid irrespective of the application’s baseline CPI.
Following Equation 4.1, CPITLBNoOptimization can be computed as MPMI * 10^-6 * TLBMissPenalty. Compared to our baseline, the TLBpred has two additional CPI contributors:
1. All page size mispredictions that hit in the TLB pay an extra TLB lookup penalty.
2. All misses also pay an extra TLB lookup penalty to confirm they were not mispredictions.
Therefore:
CPIMultigrain = MPMI * 10^-6 * (TLBMissPenalty + TLBLookupTime)
                + (MispredictedTLBHits * TLBLookupTime) / TotalInstructions    (4.2)
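Equations 4.1 and 4.2 translate into a small model; a sketch with our own function names, where MPMI is misses per million instructions and times are in cycles:

```python
def cpi_tlb_baseline(mpmi: float, miss_penalty: float) -> float:
    """Baseline TLB CPI term: MPMI * 1e-6 * TLB miss penalty."""
    return mpmi * 1e-6 * miss_penalty

def cpi_tlb_multigrain(mpmi: float, miss_penalty: float, lookup_time: float,
                       mispredicted_hits: int, instructions: int) -> float:
    """Equation 4.2: every miss pays an extra lookup to confirm it was not a
    misprediction, and every mispredicted hit pays one extra lookup."""
    return (mpmi * 1e-6 * (miss_penalty + lookup_time)
            + mispredicted_hits * lookup_time / instructions)
```

The quantity plotted below is then the difference of the two TLB CPI terms between the compared designs.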
Figure 4.13 plots the cycles saved by the 256-entry 4-way SA TLBpred compared to the 128-entry FA Sparc-T4-like TLB, i.e., (CPISparcT4-like-TLB − CPITLBpred). We assume
a 2-cycle TLB lookup latency for both designs, even though the FA TLB has a much higher
access latency. The x-axis shows different TLB miss penalties. In our system that models a
software-managed TLB, most TLB misses hit in the TSB as discussed in Section 4.6. The TLB
miss penalties we have observed in our system range from 20 cycles assuming the TSB hits in
the local L1 cache, to 60 cycles when the TSB hits in a remote L2, to over 100 cycles when
the TSB is not cached. As the TLB miss penalty increases, so does the TLB CPI contribution
of each design. Because the two contributions increase at different rates, according to the
aforementioned equations, the plotted difference does not always change monotonically in the
same direction. As Figure 4.13 shows, the two designs are comparable in terms of the CPI
component due to the TLB. Canneal experiences minimal CPI increase, within acceptable error
margins, mostly due to its slightly higher misprediction rate, since TLBpred’s MPMI is less than
that of the Sparc-T4-like TLB. Classification also experiences minimal CPI increase compared
to the FA TLB due to TLBpred’s higher MPMI, as Figure 4.9 showed. This is a workload that
benefits a lot from a large FA TLB, a benefit most likely reduced if a replacement policy other
than full-LRU is used. Conversely, Classification performs extremely poorly for a split-TLB
baseline compared to TLBpred, which would thus reap significant CPI benefits. Overall, the
results in Figure 4.13 indicate that TLBpred with its highly accurate superpage predictor meets
the performance target of a fully-associative design, despite the additional TLB lookup in case
of a misprediction or a TLB miss.
[Figure: CPI saved by TLBpred relative to Sparc-T4-like (2-cycle TLB lookup), y-axis -0.04 to 0.1; x-axis: TLB miss penalties of 12, 25, 50, 100, and 150 cycles for each workload (apache, TPC-C2, TPC-C1, canneal, ferret, x264, cassandra, classific., cloud9, nutch, streaming).]
Figure 4.13: CPI saved with TLBpred
4.7.6 Sensitivity to the Page Size Access Distribution
In our experiments we have seen that there are two prominent page sizes that dominate all
TLB accesses. However, there are various factors that can influence the observed page size
distribution. For example, (a) the OS’s memory allocation algorithm, (b) how fragmented the
system is (i.e., there might not be sufficient memory contiguity to allocate large pages), and
(c) whether transparent superpage support is enabled or the user requested a specific page size.
For completeness, this section explores how TLBpred performs under hypothetical, worst case
scenarios.
We chose canneal, the workload with the largest memory footprint, to explore how our
proposed TLB design would behave under a different page-size distribution. We used the ppgsz
utility to set the desired page size for the heap and the stack. Most of the memory footprint is
due to the heap. First, we created four canneal spin-offs each with a different preferred heap
page size. Each of these configurations has a different prominent page size as Table 4.9 shows.
Secondly, we created a composite workload with a larger footprint by running two canneal
instances, each with a different heap page size (64KB and 4MB). We purposefully selected
64KB and not 8KB to put extra pressure on our TLBpred where consecutive 64KB pages map
to the same set potentially resulting in more conflict misses. For the last spin-off we dynamically
changed the heap page size throughout execution. This change resulted in the highest page size
diversity. In all cases, we set the page size for the stack to 64KB; the stack footprint is small.
Table 4.9 reports the resulting distribution of page sizes, while Figure 4.14 shows the TLB
miss contribution of each page-size for all our canneal spin-offs for the AMD-like baseline.
Canneal Spin-Offs      Avg. Per-Core   Avg. Per-Core   Avg. Per-Core   Avg. Per-Core
(Heap Page-Size)       8KB Pages       64KB Pages      512KB Pages     4MB Pages
8KB heap               68087           1               0               5
64KB heap              901             9272            0               6
512KB heap             658             1               1160            6
4MB heap               837             1               0               151
4MB and 64KB heap      843             9258            0               152
dynamic heap           39682           8962            62              153
Table 4.9: Canneal Spin-Offs: Footprint Characterization
Unlike the original canneal workload whose misses were solely to 8KB pages, we now observe
a different miss distribution. Most of the misses are due to the page size selected for the heap
via ppgsz as that memory dominates the workload’s footprint.
[Figure: TLB miss distribution (%) per page size (8KB, 64KB, 512KB, 4MB) for each heap page-size spin-off: 8KB, 64KB, 512KB, 4MB, 4MB-and-64KB (2 instances), and dynamic.]
Figure 4.14: Canneal Spin-Offs: Miss Distribution for 48-entry FA (AMD12h-like) TLB
Figure 4.15 compares the MPMI of our TLBpred against the AMD-like baseline. A lower relative MPMI value is better. TLBpred is not as good as the Sparc-T4 TLB for the spin-offs where the 64KB and 512KB page sizes dominate, but the differences are small. We also
modeled a precise page size predictor with larger saturating counters, similar to [24]. Values
0 to 1 correspond to strongly-predicted 8KB page and weakly-predicted 8KB pages, values
2-3 to strongly and weakly predicted 64KB pages, and so on. A correct prediction with an
even counter (i.e., strong prediction) results in no updates, while for an odd counter value
the state is decremented by one. On mispredictions the counters are incremented by one if
the page size is greater than the predicted one or decremented if it is smaller. The last bar in
Figure 4.15 corresponds to this predictor and uses the least significant bits of the predicted VPN
for the TLB set index. This precise TLBpred design is consistently better than the Sparc-T4-like
configuration. As expected, for workloads that use 8KB or 4MB page sizes it performs similarly to the superpage-prediction-based TLBpred in terms of MPMI. For these cases, however, which were the observed page size distributions, the precise TLBpred will yield worse latency/energy
than our superpage predictor based TLBpred design.
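A sketch of this precise predictor's counter update rule follows, using one 3-bit saturating counter per entry for our system's four page sizes; the class and method names are ours:

```python
SIZES = ["8KB", "64KB", "512KB", "4MB"]

class PrecisePredictor:
    """Precise page-size predictor: counter values 2k (strong) and 2k+1
    (weak) both predict SIZES[k], similar in spirit to [24]."""
    def __init__(self, entries: int = 128):
        self.ctr = [0] * entries

    def predict(self, index: int) -> str:
        return SIZES[self.ctr[index % len(self.ctr)] // 2]

    def update(self, index: int, actual: str) -> None:
        i = index % len(self.ctr)
        c = self.ctr[i]
        if SIZES[c // 2] == actual:
            if c % 2 == 1:                  # weak correct: strengthen
                self.ctr[i] = c - 1
        elif SIZES.index(actual) > c // 2:  # actual size larger: increment
            self.ctr[i] = min(7, c + 1)
        else:                               # actual size smaller: decrement
            self.ctr[i] = max(0, c - 1)
```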
[Figure: TLB MPMI relative to the AMD12h-like TLB (y-axis 0-1) per heap page-size spin-off (8KB, 64KB, 512KB, 4MB, 4MB-and-64KB (2 instances), dynamic); series: AMD12h-like, SPARC-T4-like, TLBpred (256-entry 4-way SA), and TLBpred Precise (256-entry 4-way SA).]
Figure 4.15: Canneal Spin-Offs: MPMI relative to AMD-like TLB. Includes TLBpred with precise page-size prediction.
4.8 Related Work
The work most closely related to ours is by Bradford et al., which proposes but does not evaluate a precise page-size prediction mechanism [24]. The patent lists a variety of potential page-size prediction indexing mechanisms, based on the PC, register values, and register names, and targets exact size prediction. In case of a misprediction, this approach would require as many sequential
TLB lookups as the number of supported page sizes. Our binary superpage prediction mecha-
nism balances prediction accuracy with misprediction overhead, taking advantage of observed
application behavior. Binary prediction and common indexing for different page sizes are the
key differences, yielding lower latency/energy. Our TLBpred design seamlessly supports mul-
tiple superpage sizes without having to predict their exact size. Moreover, we experimentally
evaluate the performance of prediction-guided TLB designs including one that is representative
of this design (precise TLBpred).
Talluri et al. were the first to research the “tradeoffs in supporting two page sizes” [79]. Even
though they target 4KB and 32KB pages in a uniprocessor setting, their design observations
remain relevant: a fully-associative TLB would be expensive; hosting all translations in a set-
associative TLB would require either parallel/serialized accesses with all possible indices or the
presence of split TLB structures. The latter is today’s design of choice. They also explored the
impact of always indexing the set-associative TLB with one of the two supported page numbers
showing that indexing with the 32KB page number is slightly worse but generally comparable
to “exact” indexing. Our work approximates exact indexing with the use of a binary page size
predictor.
An orthogonal approach to superpages is to pack translations for multiple pages within
the same TLB entry, as Section 2.3.1.3 more extensively reviewed. Talluri and Hill proposed
the “complete-subblock” and the “partial-subblock” TLB designs [78]. Pham et al. proposed
CoLT which takes advantage of relatively small scale page contiguity [58]. CoLT’s requirement
that contiguous virtual pages are mapped to contiguous physical frames is later relaxed [57],
allowing the clustering of a broader sample of translations. CoLT coalesces a small number of
small pages which cannot be promoted to superpages; it uses a separate fully-associative TLB
for superpages. Our TLBpred proposal is orthogonal as it can eliminate the superpage TLB.
Basu et al. revisited the use of paging for key large data structures of big-memory workloads,
introducing Direct Segments i.e., untranslated memory regions [12]. In their workloads’ analysis
they also observed inefficiency due to limited TLB capacity when large page sizes were used.
Our work addresses this inefficiency. TLBpred can (a) complement Direct Segments for those
regions that use paging and (b) do so without OS changes.
More recently, Karakostas et al. proposed RMMLite [43]. As we reviewed in Section 2.3.4,
they dynamically downsize TLB ways of split structures adapting to different page-size distri-
butions, while also including a small FA range TLB that holds translation mappings. Their
evaluation includes TLBpp, a TLBpred implementation with a perfect superpage predictor. Their
results for a set of TLB-intensive workloads show that TLBpp reduces “dynamic energy by 43%
and the cycles spent in TLB misses by 67%” compared to a system with split TLBs where Transparent
Huge Pages (THP) support is enabled. We believe that TLBpred is likely to capture a
significant portion of this idealized potential, given the fairly low misprediction rates of our
binary-superpage predictor demonstrated in Section 4.7.1.
Subsequent to our work [55], Cox and Bhattacharjee [28] proposed MIX TLBs targeting
energy-efficient address translation in the presence of multiple page sizes. Their MIX TLBs are
also set-associative designs that can host translations of different page sizes in a single struc-
ture. However, unlike our TLBpred designs that use either a small-page or superpage based
indexing scheme according to a binary superpage predictor, MIX TLBs use a single small-page
set-indexing scheme for all pages irrespective of their size. This design choice decouples the
TLB lookup from the burden of the unknown page size and eliminates the need for a page-size
predictor. MIX TLBs can retrieve a translation with a single lookup; they do not need the
secondary lookup which TLBpred designs require in the case of TLB misses and superpage
mispredictions. However, using the small-page size index for a virtual address belonging to a superpage
causes superpage translation “mirroring”. That is, translations for the same superpage are
replicated in multiple TLB sets because page-offset bits are part of the TLB set index. MIX
TLBs counterbalance this mirroring challenge by coalescing contiguous superpages in a single
TLB entry. These superpage coalescing candidates are identified during a page-walk; their
translations all exist in the same 64B cache line that holds eight translations on x86 systems.
The authors explain that as long as the number of superpages they coalesce closely matches the
number of superpage mirrors, no TLB capacity is wasted and they can achieve “energy-efficient
performance”.
Allocating mirror translations on multiple (or even all) of the TLB sets after a superpage TLB
miss does not appear trivial in terms of energy/latency and also raises scalability questions.
Furthermore, because replacement decisions across sets are independent, there is a high likeli-
hood of duplicates even within a set. For example, a superpage might miss in one set (because
it was previously evicted) while most of its mirrors are present in other sets. The proposed
design will allocate mirror entries of the missed superpage, after the page-walk completes, on all
sets regardless. Duplicates within a set will eventually be identified and eliminated during
subsequent set lookups. Unfortunately, many potentially useful TLB entries across sets might
have been unnecessarily evicted in this process. Despite the significant challenges of mirroring,
the high hit rates of MIX TLBs, and the resulting reduction in page walks, counteract the energy
overheads of mirroring. MIX TLBs achieve up to 80% energy improvement and 55% performance
improvement over area-equivalent split-TLB designs.
They also evaluate our prediction-enhanced TLBpred and TLBpskew designs, presumably
with our 128-entry BRV-based PT configuration, and show TLBpred can achieve up to ∼44%
energy improvement and ∼42% performance improvement over area-equivalent split-TLB de-
signs for various workloads on native and virtualized CPUs, as well as GPUs. The TLBpred
designs are consistently more energy-efficient than the TLBpskew designs. In a few cases, their
results show up to 5% performance degradation for our designs, likely an effect of either inaccu-
rate superpage prediction for some workloads or TLB thrashing. Overall, these results further
demonstrate that TLBpred designs can achieve significant energy and performance benefits when
coupled with an accurate superpage predictor. We hope that research in supporting multiple
page-sizes will continue, and be actively adopted in commercial TLB designs.
4.9 Concluding Remarks
In this work we proposed and evaluated two prediction-based superpage-friendly TLB designs.
Our analysis of the data TLB behavior of a set of commercial and scale-out workloads demon-
strated a significant use of superpages which is at odds with the limited superpage TLB capacity.
Thus, we considered elastic TLB designs where translations of all page sizes can coexist with-
out any a priori quota on the capacity they can use. We proposed the TLBpred, a multi-grain
set-associative TLB design which uses superpage prediction to determine the TLB set index of
a given access. Using only a meager 32B prediction table, TLBpred achieves better coverage
and energy efficiency compared to a slower 128-entry FA TLB. In addition, we evaluated the
previously proposed Skewed TLB, TLBskew, and augmented it with page size prediction to
increase the effective associativity of each page size. TLBpskew proved comparable to TLBpred.
Finally, we showed that TLBpred remains effective even when multiple page sizes are actively
used and also evaluated an exact page size predictor guided TLB.
Chapter 5
The Forget-Me-Not TLB
5.1 Overview
Even though TLB capacities have increased over the past decade, this capacity growth has not
been commensurate with the ever-increasing memory footprints of today’s “big-data” applications.
The need for a fast TLB access time, to avoid a negative impact on the processor’s critical
path, is the main inhibiting factor. Since increasing the L1-TLB hit-rate would conventionally
require increasing the L1-TLB capacity, thus hitting the critical latency barrier, an alternative
approach is to implement a secondary translation storage such as an L2-TLB. Some of today’s
systems have added an L2-TLB; e.g., Intel’s Haswell has a 1024-entry 8-way set-associative
L2-TLB. The access latency of an L2-TLB is not on the critical path of every memory access
because it is only accessed on an L1-TLB miss. L1-TLBs, like L1-D caches, have high tem-
poral locality that translates to a high hit-rate, usually well over 90%. However, even though
a longer L2-TLB access latency could be accommodated, the benefits gained from allocating
TLB capacity to such a design need to be scrutinized.
Our measurements indicate that adding an L2-TLB can yield negligible performance benefits
or even cause minor performance degradation, in some cases, when compared to a one-level
TLB hierarchy. Workloads that have very large footprints, and thus a low L2-TLB hit-rate,
or workloads heavily relying on superpages, are usually the culprits. In these cases, the extra
latency overhead of probing the L2-TLB before initiating a page walk is not counterbalanced
by the latency reduction achieved via L2-TLB hits.
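The trade-off can be illustrated with a simple latency model; the latencies below are assumed round numbers (in cycles), not measured values from our evaluation.

```python
# A back-of-the-envelope model of the serial L2-TLB trade-off discussed above.
# Latencies are illustrative assumptions, not measured values.
def avg_l1_miss_latency(l2_hit_rate, l2_latency, walk_latency):
    """Average L1-TLB miss handling latency when an L2-TLB is probed before
    the page walk: every L2-TLB miss pays the probe on top of the full walk."""
    hit_cost = l2_latency
    miss_cost = l2_latency + walk_latency  # wasted probe, then full walk
    return l2_hit_rate * hit_cost + (1 - l2_hit_rate) * miss_cost
```

With an assumed 10-cycle L2-TLB and a 50-cycle walk, the break-even hit rate is 10/50 = 20%: below that, the L2-TLB makes L1-TLB misses slower on average than walking immediately, matching the behaviour described above for low-hit-rate workloads.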
This chapter presents the FMN TLB, a cacheable TLB design that aims to reduce the TLB-
miss handling latency using the existing cache capacity, and not a dedicated hardware storage,
to store translation entries. The design choices, benefits and trade-offs of this virtualized TLB
design are explored. At its core, this work harnesses the observations that (a) the vast majority
of TLB misses are to previously seen translation-entries, and that (b) translation modifications
(e.g., invalidations) are rare. Therefore, if a CMP system had sufficient capacity to store, and
not to forget, all previously seen translations, page-walks would be a rarity.
FMN can be used to back up the traditional one- or two-level TLB hierarchy of current
systems, or can be used as an alternative to a dedicated L2-TLB. The proposed FMN does
not require any additional dedicated capacity to store translation entries but instead utilizes
the existing cache capacity to store translations transparently and on-demand. This cached
translation-storage is probed with regular memory requests (memory loads) on a D-TLB miss.
If the running application’s translations can seamlessly fit in the existing hardware TLBs, no
extra storage is wasted, as would be the case for a dedicated L2-TLB. Further, the cacheable
nature of the FMN is fertile ground for more flexible TLB organizations. For example, a shared
TLB across all cores, different TLB indexing schemes, or different TLB sizes are a few of the
optimizations that can be easily applied.
A per core private 1024-entry direct-mapped FMN reduces the average L1-TLB miss latency
across all simulated workloads by 31.4% over a baseline with only L1-TLBs, while a dedicated
1024-entry 8-way set-associative L2-TLB reduces it by 24.6%. FMN’s L1-TLB miss latency
reduction results in up to 1.97% overall execution-time reduction (i.e., performance improvement).
Overall, however, the L1-TLB miss latency reduction does not translate into commensurate
performance benefits. This behaviour is also observed with the dedicated L2-TLB, which can,
in some cases, cause performance degradation of up to 1.6%. This chapter also presents an L2-TLB bypassing
mechanism as a potential first-step solution to mitigate such cases.
The remainder of this chapter is organized as follows. Section 5.2 first describes the idea
behind FMN and its operating scenarios. Section 5.3 then introduces the FMN organization and
discusses different design choices from FMN indexing schemes to allocation and replacement
policies, and Section 5.4 describes how the FMN is cached. Section 5.5 details our simulation
methodology including the timing model used and its limitations. Section 5.6 presents an
analytical model to estimate FMN’s performance potential, followed by a description of a set of
synthetic traces (Section 5.7) and the baseline configuration (Section 5.8). Section 5.9 showcases
the results of a case study using synthetic traces, while Section 5.10 evaluates FMN using
commercial workloads. Section 5.11 presents our L2-TLB bypassing optimization. Finally,
Section 5.12 concludes this chapter.
5.2 FMN’s Goal and Operation
Imagine a system where on a TLB miss a virtualized TLB is accessed in parallel to the page
table. However, unlike the page walk which usually requires multiple memory requests (up
to four for x86 systems, potentially fewer if MMU caches are used [10, 15, 16]), only a single
cache request is now needed to retrieve the translation, if the latter exists in this new cacheable
structure. If this single cache access completes faster than the page walk, and the retrieved
translation is valid, you have just avoided a significant percentage (up to 75% if we assume four
memory accesses per page-walk, all with the same access latency as the FMN) of the TLB miss
penalty and have improved your system’s performance. Figure 5.1 illustrates this best case
scenario that the proposed hardware-managed Forget-Me-Not (FMN) TLB scheme aims for.
Figure 5.1: FMN’s Best Case Scenario. (Timeline: a TLB miss (A) triggers both a page walk and an FMN probe; the FMN probe returns (C) before the page walk completes (B).)
As Figure 5.1 illustrates, when a TLB miss A occurs, both a page walk and an FMN
probe are initiated. If the FMN probe C returns before the page walk B does, and with
a correct translation, then the processor can make forward progress and save execution time
between events C and B (dashed region), which would not be otherwise possible. In today’s
systems, the processor would execute any instructions dependent on the memory request that
triggered the TLB miss after the page walk completed (event B ). Instead, in the scenario
described above, the processor will be at that time executing instructions further ahead in the
instruction stream. The greater the timeframe between events B and C , the better.
FMN is a cacheable and speculative TLB which significantly reduces TLB miss handling
latency without requiring any changes to the operating system or large dedicated on-chip re-
sources. It leverages the observation from prior work that large on-chip memory caches can be
shared transparently and on demand with properly engineered virtualized structures [25, 26].
The proposed design also investigates the use of speculation in providing highly accurate address
translation without keeping the FMN coherent with the page tables. For example, if a page is
demapped, the FMN is not immediately updated. Because such translation modifications are
rare, the design decision not to update the FMN does not reduce the potential performance
improvement. The FMN can be configured either as a per-core private table or as a single table
shared across all cores, thus adapting to the different requirements and memory behavior of
applications.
The FMN TLB scheme extends the reach of conventional private TLBs and has the following
main characteristics:
• It provides the MMU with a fast yet speculative translation based on recent translation
history.
• It utilizes part of the cache hierarchy, transparently and on demand, to store its speculative
translations.
Section 5.2.1 next presents the common operating scenarios of an FMN probe.
5.2.1 FMN Operating Scenarios
This section explains how an FMN-capable system handles a TLB miss. Traditionally, on a
last-level TLB miss, the MMU initiates a hardware page walk (assuming hardware-managed
TLBs). In the proposed system, the MMU also initiates an FMN probe in parallel to the
page walk. Both the page walk and the FMN probe share the same objective: retrieving the
translation. As with any scenario where two operations proceed in parallel, the order in which
the two operations complete is important. Only two possibilities exist timeliness-wise: (a) the
page walk completes before the FMN probe, or (b) the FMN probe completes before the page
walk.
Figure 5.2 shows the timeline for the first scenario. Once the page walk completes, the MMU
observes its precedence over the still pending FMN probe. Program execution then continues
with the page-walk retrieved translation, which is guaranteed to be correct. The FMN probe
reply, which will arrive later in time, is treated as useless by the MMU. Whether the reply
had the correct translation or not is irrelevant in terms of performance.1
Figure 5.2: FMN Operation Timeline - Page Walk completes before FMN probe. (The page walk completes first; the later FMN probe reply is useless.)
Figure 5.3 depicts the three possible timelines for the second scenario in which the FMN
probe completes first. As with any tagged structure lookup, a miss or a hit are the two possible
outcomes. In case of a hit, the FMN retrieved translation needs to be eventually checked against
the one retrieved via the page walk to ensure correctness because the proposed FMN is not
kept coherent with the page-tables.
Figure 5.3a shows the timeline for an FMN miss. The FMN probe reply indicates to the
MMU that an FMN miss took place. Waiting for the page walk to complete, as if no FMN
support existed, is the only option. Figures 5.3b and 5.3c show the timeline for an FMN hit.
If an FMN hit occurs, the processor enters speculative execution while taking a checkpoint
of its architectural state. No stores are propagated to the memory hierarchy and no I/O
operations are permitted. Once the page walk completes, the MMU compares the speculative
FMN translation with the one retrieved from the page table.
If the two translations match (Figure 5.3b), any speculative changes are committed. The
functionality required to commit speculative state (i.e., make it architectural state) already
exists in today’s processors. A common example is branch prediction; speculative instructions
- potentially in the wrong code path - are allowed to execute but cannot retire until the predicted
branch has been resolved. After this commit, the program continues its execution from an
instruction that is further ahead in the dynamic instruction stream from the instruction which
1 Depending on the FMN allocation policy, discussed later, we could avoid sending an FMN allocation request if the retrieved translation was correct.
initially triggered the TLB miss. This scenario is the expected common case and results in a
reduced TLB miss handling latency.
If the two translations mismatch (Figure 5.3c), the speculative execution using the FMN-
retrieved translation was useless. Any changes made during speculative execution are discarded
and the dynamic instruction stream starting from the offending instruction gets re-executed.
This misspeculation scenario is expected to be rare as translation mappings are usually persis-
tent.
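The four outcomes of this race can be summarized in a small decision function; the names and tuple-based outcome encoding are illustrative, since real hardware implements this with checkpoints and a translation comparator rather than software.

```python
# A sketch of how the MMU resolves the race between the page walk and the
# FMN probe, summarizing the scenarios above (illustrative encoding only).
def resolve(fmn_replied_first, fmn_hit, fmn_translation, walk_translation):
    if not fmn_replied_first:
        # Figure 5.2: the walk won the race; the FMN reply is useless.
        return ("use_walk", walk_translation)
    if not fmn_hit:
        # Figure 5.3a: the probe was faster but missed; wait for the walk.
        return ("wait_for_walk", walk_translation)
    if fmn_translation == walk_translation:
        # Figure 5.3b: speculation was correct; commit speculative state.
        return ("commit_speculation", fmn_translation)
    # Figure 5.3c: mismatch; roll back to the checkpoint and re-execute.
    return ("rollback_and_reexecute", walk_translation)
```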
Figure 5.3: FMN Operation Timelines - FMN probe completes before page walk. (a) FMN miss: the probe was useless; wait for the page walk. (b) FMN hit, translation correct: commit the speculative state. (c) FMN hit, incorrect translation: roll back to the checkpoint and re-execute.
5.3 FMN Organization
The FMN design shares many traits of how TLBs are commonly organized. However, some
design requirements can be relaxed in the FMN due to its speculative nature. FMN’s cacheable
nature further influences some design choices. This section presents the FMN design require-
ments and discusses various design considerations. In this section, FMN is treated as a stan-
dalone structure; Section 5.4 presents how this structure is cached (i.e., virtualized).
FMN is a tagged structure, similar to a regular TLB; the presence of tags is necessary to
ensure we do not use translations from other virtual pages or processes, which - barring any
synonyms - would always lead to misspeculation. An FMN probe can only result in misspecula-
tion when it returns an older, but no longer valid, translation mapping for a given virtual page
and process. This scenario happens rarely and is due to the design choice to lazily propagate
any translation modifications to the FMN.
Each FMN entry can be thought of as a replica of a TLB entry. The following set of
conditions triggers an FMN hit:
1. The VPN of the missing page matches the VPN in a valid FMN entry.
2. The ASID of the process with the TLB miss, also referred to as a context in this work,
should also match the context in the FMN entry. The only exception occurs if the Global
bit in the FMN entry is set. The context comparison is then skipped and a VPN match
suffices.
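The two hit conditions above can be sketched as follows; the entry fields mirror a TLB entry, and the field names are illustrative.

```python
# A sketch of the FMN hit test; field names are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class FMNEntry:
    valid: bool
    vpn: int          # virtual page number tag
    context: int      # ASID of the owning process
    global_bit: bool  # translation valid for all contexts

def fmn_hit(entry, vpn, context):
    # Condition 1: the missing VPN matches the VPN of a valid entry.
    if not entry.valid or entry.vpn != vpn:
        return False
    # Condition 2: the contexts (ASIDs) match, unless the Global bit
    # is set, in which case a VPN match suffices.
    return entry.global_bit or entry.context == context
```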
FMN can be organized as either a private or a shared, potentially associative structure,
similar to a regular TLB. Different trade-offs exist for each design choice. A shared structure
can avoid translation replication, thus making a more efficient use of the overall capacity com-
pared to private structures, especially for workloads that either share data across cores or have
drastically different per core capacity requirements. However, as Section 2.3.1.2 discussed, these
potential benefits come at the cost of a slower access time. The trade-offs discussed earlier, and
in the literature review section in Chapter 2, apply to the FMN too. For example, replicas of
the same translations can exist in multiple private FMNs. However, FMN’s cacheability adds
another dimension to these trade-offs. Section 5.4.3 discusses the effect of this added dimension.
But before Section 5.4 presents how the FMN is cached, Sections 5.3.1 to 5.3.3 examine
important aspects of the FMN’s organization. Namely, (i) how the presence of multiple page
sizes can be handled, (ii) how the FMN can be indexed, and finally (iii) what are some possible
FMN allocation and replacement policies.
5.3.1 Page Size
The page size is not known during a TLB lookup, and by extension during an FMN lookup
as well. One design choice would be to have the FMN only support the most common page
size; the smallest and most prevalent page-size is 4KB in x86 and 8KB in SPARC. In the
few cases when other page sizes are used, this FMN lookup will be wasteful. Wasteful TLB
lookups happen in conventional systems too. For example, in systems with split L1-TLBs at
most one of the multiple parallel split lookups will result in a hit, while the other lookups will
waste energy, as Chapter 4 demonstrated. Unlike an L2-TLB - commonly accessed before the
page-tables - where any unnecessary access for an unsupported page-size adds to the TLB miss
handling latency, FMN requests, which always proceed in parallel with the page-walk, can affect
performance only indirectly as a result of increased memory pressure.
It is possible for a superpage FMN lookup to not be wasteful, if on a superpage FMN miss
the translation for the 8KB page of that superpage is allocated in the FMN. The challenge
with such a design is that the FMN capacity can be unnecessarily wasted when almost all 8KB
pages of a superpage are used. However, if entries are only allocated in the FMN in case of an
L1-TLB miss, and assuming the L1-TLB(s) are not thrashed by superpage accesses, then likely
only the first 8KB page of a superpage that triggers a miss will be allocated in the FMN; the
rest will hit on the superpage translation in the L1-TLB(s).
FMN could support multiple page-sizes but at the cost of multiple sequential FMN lookups,
one per page-size. However, this choice is anticipated to have diminishing benefits the later
in this sequence the successful probe happens. Page-size prediction, similar to the proposal in
Chapter 4, could be a compelling design choice, doing the lookup with the predicted page-size
first. On a miss, any subsequent sequential lookups could be dropped to avoid wasting energy.
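A prediction-guided multi-size lookup along these lines could be sketched as below; `probe_fn` is an assumed callback that performs one fixed-size FMN lookup, and the SPARC page sizes are assumptions for illustration.

```python
# Sketch of a page-size-prediction-guided multi-size FMN lookup: probe the
# predicted size first, then either fall back to the remaining sizes in
# order or drop them to avoid wasting energy. probe_fn(size) is an assumed
# callback returning a translation or None.
PAGE_SIZES = [8 << 10, 64 << 10, 512 << 10, 4 << 20]  # assumed SPARC sizes

def multi_size_lookup(probe_fn, predicted_size, fall_back=False):
    order = [predicted_size] + [s for s in PAGE_SIZES if s != predicted_size]
    for size in order:
        translation = probe_fn(size)
        if translation is not None:
            return translation
        if not fall_back:
            break  # misprediction: skip the remaining sequential lookups
    return None
```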
5.3.2 FMN’s Indexing Scheme
Like any set-associative structure, the FMN index requires log2(FMN sets) bits. One design
option, on par with conventional cache indexing schemes, is to use the bits of the virtual address
immediately after the page-offset bits. But solely using the log2(FMN sets) bits after the page-
offset could result in conflict-misses in a CMP environment with multiple running processes, as
different processes can have the same virtual pages contend for the same FMN set.
The aforementioned behavior stems from the fact that each process has its own address
space, and therefore the start of different address space segments (e.g., heap, kernel address
space) would coincide. These translations are differentiated in TLB-entries via an ASID (con-
text) field. Incorporating context information in the FMN indexing scheme, e.g., via xor-ing the
original cache-like index with the context bits, could reduce contention for VPNs shared across
different processes. Including context information along with the VPN in the FMN index will
map the same VPN to different sets when it is used by different processes.
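The xor-based index sketched above could look as follows; the 13-bit page offset (8KB base pages) and the use of the low-order context bits are assumptions for illustration.

```python
# Sketch of an FMN set-index computation: the cache-like index (VA bits just
# above the page offset) xor-ed with low-order context (ASID) bits.
PAGE_OFFSET_BITS = 13  # assumed 8KB base pages (SPARC)

def fmn_set_index(vaddr, context, num_sets):
    """num_sets must be a power of two; log2(num_sets) index bits are used."""
    mask = num_sets - 1
    base_index = (vaddr >> PAGE_OFFSET_BITS) & mask  # cache-like VPN index
    return base_index ^ (context & mask)             # fold in ASID bits
```

The same virtual page now maps to different sets under different contexts, reducing conflict misses between processes whose address-space layouts coincide.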
5.3.3 FMN’s Allocation and Replacement Policies
Different FMN allocation and replacement policies can be implemented. One possible policy is
to allocate a translation entry in the FMN upon its eviction from the TLB, thus having FMN
act as a victim TLB. Another policy - the one used in this work - would be to allocate in the
FMN the translation entry that just missed in the TLB. The latter would facilitate sharing
across cores in case of a shared FMN because as soon as a core misses in the FMN other cores
accessing the same data would benefit from that FMN entry. Choosing - on occasion during
runtime - not to allocate a translation entry in the FMN is another viable option, especially for
cases when a translation entry is predicted to get little to no reuse, or when there is contention
in the memory system.
FMN entries are not kept coherent with the page tables, reducing synchronization overheads.
Not propagating any page-table modifications to the FMN does not affect correctness because
the FMN-retrieved translation is always compared with the correct translation that is retrieved
via the page-walk. Inconsistent FMN entries will be eventually updated once the discrepancy
is identified.
Within an FMN set, an LRU replacement policy can be implemented. PTEs traditionally
have some unreserved bits, often used for diagnostics; these bits can be used to store replacement
information in case of an associative FMN. Other variations such as pseudo-LRU or random
replacement policies could also be used. Section 5.4 provides more details.
5.4 Caching the FMN
FMN is a hardware-managed cacheable and virtual2 structure that uses on-chip cache capacity
upon demand, without requiring its own hardware budget. Caches implicitly store address
translations under different scenarios. For example, SPARC’s Translation Storage Buffer (TSB)
is a per-process software data structure which holds recent translations. Page table entries are
also cacheable both in hardware and software-managed TLB schemes.
Figure 5.4 illustrates how FMN affects cache contents when compared with a system with
no FMN. The cache depicted in this example is set-associative, with each row corresponding to
a cache set. In the baseline system where no FMN exists, the cache contains only demand data
(e.g., data, instructions, etc.) and page-table data, whereas in the system where the FMN is
enabled some cache blocks are now occupied by FMN data. In this example, FMN has displaced
demand data from the cache, but it could have displaced page-table data or a combination of
both types, or even no data, if the FMN entries did not survive in the cache.
The key take-away is that the existing cache capacity is not partitioned in any way. Instead,
all types of cache blocks (demand, page-table, and FMN) freely contend for the entire cache
capacity via the existing cache replacement and allocation policies, similar to how demand cache
blocks compete with each other in regular caches. Choosing to treat FMN data differently, e.g.,
by employing a different replacement policy for the FMN, could be an interesting option, but
it is not explored in this work.
In order to access the FMN data in an FMN-capable system, an FMN probe (lookup) is
needed, which requires a load (read) request to be sent to the cache hierarchy, starting from the
L1 cache. Equation 5.1 shows the physical-address calculation for the FMN-probe address.
2 The term virtual does not refer to the type of addresses used to access the FMN.
Figure 5.4: FMN’s effect on cache contents. (a) With FMN disabled, the set-associative cache holds only demand data and page-table data. (b) With FMN enabled, some cache blocks also hold FMN data.
The FMN is probed with physical addresses, the same way the page tables are accessed in
memory. FMN_base is the starting physical address of each FMN structure, if private, while
FMN_set is computed via the current FMN indexing scheme. Both FMN_entry_size and
FMN_associativity are powers of two to avoid expensive multiplication costs.
FMN_probe_address = FMN_base + (FMN_set × FMN_associativity × FMN_entry_size)    (5.1)
The FMN_base address is page-aligned and fixed in each system.
physical address space occupied by the FMN should be reserved. As we will explain shortly,
the size of each FMN entry is 16 bytes, and hence, for the simulated FMN sizes, a single 4MB
superpage would be more than sufficient, supporting up to 16K FMN entries per core in a
16-core CMP. The address computation for the probe address can be performed fast enough
so that an FMN probe request can be issued the cycle immediately after a TLB miss. For
the direct-mapped FMN configuration modeled in this work, this address computation involves
only a left shift for the multiplication and an addition.
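Equation 5.1 and the shift-based computation can be sketched as follows; FMN_BASE is an assumed reserved physical address, while the 16B entry size and the direct-mapped configuration follow the text.

```python
# Sketch of the Equation 5.1 probe-address computation for the modeled
# direct-mapped FMN. FMN_BASE is an assumed, page-aligned reserved address.
FMN_BASE = 0x4000_0000  # assumption: reserved, page-aligned physical base
FMN_ENTRY_SIZE = 16     # 8B tag + 8B data, as modeled
FMN_ASSOCIATIVITY = 1   # direct-mapped configuration

# Equation 5.2 sanity check: an FMN set must not span 64B cache lines.
assert 64 % (FMN_ASSOCIATIVITY * FMN_ENTRY_SIZE) == 0

# With a power-of-two set size, the Equation 5.1 multiply is a left shift.
SET_SHIFT = (FMN_ASSOCIATIVITY * FMN_ENTRY_SIZE).bit_length() - 1

def fmn_probe_address(fmn_set):
    """One shift plus one add, as described for the direct-mapped case."""
    return FMN_BASE + (fmn_set << SET_SHIFT)
```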
To limit the FMN-lookup latency, an FMN-set should not span multiple cache lines. A cache
line can contain one or multiple FMN sets, with the cache line size (CLS) being a multiple of
the FMN set size as the equation below formally describes.
CLS mod (FMN_associativity × FMN_entry_size) = 0    (5.2)
In the modeled SPARC ISA (see Section 4.6 for details), the translation-entry tag and data fields
require 16B in total (8B each) without any compression. Thus, one could conservatively pack
four FMN entries in a single 64B cache line, the common cache line size for current processors,
and the one used for all cache levels in the simulated CMP system.
In a nutshell, the FMN’s associativity should be determined by the cache line size along
with the size of a page-table (translation) entry in the native machine. In some cases, satisfying
Equation 5.2 might require increasing FMN_entry_size by a few bits beyond what is
absolutely necessary. Otherwise, some FMN sets would span multiple lines, or extra padding
would be needed to avoid this, at the cost of a more complicated FMN indexing scheme and
wasted space.
Figure 5.5 illustrates how four sets of a 2-way SA FMN are mapped to the physical address
space and, by extension, to cache lines, in accordance with the above equation.
(a) Standalone FMN view: four 2-way FMN sets (FMNset 0-3), with each way holding a (tag, data) pair.
(b) The virtualized FMN: sets 0 and 1 packed into one cache line and sets 2 and 3 into the next, with FMNentry size spanning one tag and one data field.
Figure 5.5: Virtualizing a small 8-entry 2-way SA FMN.
Until now, this section has presented how, i.e., with what memory address, one can access the
FMN. Given a TLB miss, an FMN set is determined, as in any cache-like structure, and the
FMN probe address is formed. To complete the FMN lookup, two types of memory
requests are needed: (i) FMN probes, and (ii) FMN allocations. The former are memory reads
issued to the cache hierarchy to retrieve the translation, while the latter are memory writes
issued to the cache hierarchy to modify FMN’s contents. The two subsequent sections detail
both the functionality of these two request types and how they interact with the existing cache
controllers, thus concluding the necessary architectural support for caching the FMN.
5.4.1 FMN Probes
Effectively, two lookups take place on every FMN probe: a regular cache lookup and
a secondary lookup (search) within the data contents of the returned cache block. The first
lookup, like any cache access, uses the existing cache tags to determine if the current FMN set
is present and valid in the cache or not (i.e., FMN-set hit versus FMN-set miss). The cache is
not aware that this lookup is targeting the FMN. The second lookup takes place on the cache
block contents, once they have been returned to the TLB controller. It is this second lookup
that determines if the required translation is present in the cached FMN-set (FMN translation
hit). The timelines depicted in Figures 5.3b and 5.3c both occur only on an FMN translation
hit.
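The secondary lookup can be sketched as a linear search over the (tag, data) pairs of the returned FMN set; the field layout and names below are illustrative:

```python
# Sketch of the secondary lookup on an FMN probe: once the cache returns the
# block holding the FMN set, the TLB controller searches the set's (tag, data)
# pairs for the missing VPN. Layout and names are illustrative.
from typing import List, Optional, Tuple

def search_fmn_set(block: List[Tuple[int, int]], vpn_tag: int) -> Optional[int]:
    """Return the translation data on an FMN translation hit, else None."""
    for tag, data in block:        # at most `associativity` comparisons
        if tag == vpn_tag:
            return data            # FMN translation hit
    return None                    # set was cached, but the translation is absent

cached_set = [(0x111, 0xAAAA), (0x222, 0xBBBB)]   # a 2-way FMN set
print(search_fmn_set(cached_set, 0x222))
```

The cache itself stays oblivious to this search; only the TLB controller interprets the block's contents.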
On an FMN-set miss in the LLC, the probe request is currently dropped and an empty
cache block is returned to the requesting core. FMN contents evicted from the LLC do not
currently spill to off-chip memory because, given the long off-chip latency, it is unlikely
there would be any TLB miss latency reduction compared to the page-walk, and the additional
off-chip traffic would be wasteful. The LLC controller would need to know the FMN’s address
range, in some way, to properly handle FMN probes that miss, as well as LLC evicted FMN data.
LLC FMN evictions can simply be dropped; alternatively, a writeback cache could
never set the dirty bit for any LLC FMN block. With this design, when a cache line holding
FMN-data gets evicted from the LLC, the associated information is lost. This information will
be recreated whenever an FMN allocation takes place for the same FMN data. Deciding not
to spill to memory is a design choice; future work may evaluate an alternative.
5.4.2 FMN Allocation Requests
Unlike regular memory requests, which are filled from memory, the cached FMN retrieves its
data via FMN allocation requests. Depending on the employed FMN allocation policy, these
write requests occur either when a TLB entry is evicted or when the page walk for the missing
virtual address that triggered the FMN probe completes, at which point the correct translation
has been retrieved. In the second scenario, one could choose not to issue an allocation request on an
FMN-set-hit with a translation hit because the translation is already present in the FMN. In all
cases, FMN allocations are not in the critical path of a TLB miss and can thus proceed lazily.
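A minimal sketch of this allocation decision, with hypothetical policy names standing in for the two triggers described above:

```python
# Sketch of the allocation decision in Section 5.4.2. Policy names are
# hypothetical labels for the two allocation triggers described in the text.
def should_issue_allocation(policy: str, fmn_set_hit: bool, translation_hit: bool) -> bool:
    if policy == "on_tlb_evict":
        return True          # allocate whenever a TLB entry is evicted
    if policy == "on_walk_complete":
        # skip the redundant write: the translation is already in the cached set
        return not (fmn_set_hit and translation_hit)
    raise ValueError(f"unknown policy: {policy}")

print(should_issue_allocation("on_walk_complete", True, True))   # redundant write skipped
```

Because allocations are off the critical path, this decision can be made lazily, after the miss has been serviced.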
5.4.3 Discussion
As Section 5.3 mentioned, caching the FMN adds another dimension to some FMN design
decisions. In the shared versus private FMN domain, the latency trade-off in a shared FMN
is slightly different than with a shared TLB. The same FMN translation entry can now be
replicated across the private upper-level (e.g., L1 or L2) caches of different cores. Even though
this replication wastes no FMN capacity, it could waste some cache capacity displacing more
regular demand data. Latency-wise, the replicated data might be faster to access, compared to
a centralized FMN structure, but coherence among FMN entries could add additional latency.
FMN’s cacheability should also guide FMN’s associativity, even beyond the constraint of a
cache line size to be a multiple of the FMN set size (Equation 5.2). Given a 64B cache line and
a 16B FMN entry (accounting for both FMN tag and data), the possible associativity options
are one, two, or four. In caches, limited associativity enables faster access times and less
time/space spent on replacement selection, at the cost of additional conflict misses. A
direct-mapped FMN is an attractive choice because it allows multiple consecutive FMN sets
(four in the previous example) to map to the same cache line. This organization is effectively
equivalent to a next-line prefetcher. Here the translations for four consecutive VPNs (assuming
a cache-like indexing scheme) will map to the same cache line. Once one of these four pages
is accessed via an FMN hit, and the cache line with that FMN data is brought closer to the
processor (L1 cache), the remaining three pages will experience shorter FMN access times, and
thus reap more benefits, if the workload accesses consecutive pages (high spatial locality at the
page granularity).
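This effect can be illustrated with a quick sketch, under the assumed 16B-entry/64B-line geometry (the constants are illustrative): the FMN entries of four consecutive VPNs fall in the same cache line, so one demand fill covers all four.

```python
# Sketch: with a direct-mapped FMN, 16B entries, and 64B lines, four
# consecutive VPNs map to the same cache line (next-line-prefetcher effect).
# ENTRY_SIZE / CLS / FMN_BASE are illustrative example values.
ENTRY_SIZE, CLS, FMN_BASE = 16, 64, 0x4000_0000

def line_of(vpn: int, num_entries: int = 4096) -> int:
    """Cache-line index touched by the FMN entry for this VPN."""
    return (FMN_BASE + (vpn % num_entries) * ENTRY_SIZE) // CLS

print([line_of(v) for v in range(8)])   # two groups of four identical indices
```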
Other designs that also use caches to store translations are SPARC’s Translation Storage
Buffer (TSB) and the Part-Of-Memory TLB (POM-TLB) [67]. Contrary to SPARC’s TSB,
which is a software translation cache managed by the operating system, FMN is a hardware-
managed structure. The TSB is accessed as part of the TLB miss software trap handler before
the page walk commences, whereas FMN’s lookups are initiated by the hardware and occur in
parallel with the hardware page walk. FMN is also, by design, not kept coherent with the page
tables. Further, as discussed earlier, FMN is not a per-process structure, but instead can be
configured as a per-core private structure (the configuration evaluated in this work) or a shared
structure, with the potential for various in-between configurations.
Concurrently with this work, Ryoo et al. proposed their “Part-of-Memory TLB (POM-
TLB)” [67]. Their proposal targets virtualized environments where the page walk latency is
significantly longer due to the two-dimensional page walk. POM-TLB is a large structure,
acting as a shared L3-TLB, that is stored in DRAM. Because POM-TLB is part of memory, as
FMN is, POM-TLB’s translations can also be cached in on-chip data caches. POM TLB entries
are cached in the L2 and L3 caches, but not in the L1 cache (by design choice). Contrary to
FMN, POM-TLB is probed before the page-walk commences, as a TSB would be, and it is
reported to eliminate most page-walk accesses due to its large size (e.g., 16MB).
5.5 Simulation Methodology
This section presents the methodology used to evaluate the proposed Forget-Me-Not design.
Section 5.5.1 explains the simulation challenges we encountered, Section 5.5.2 details the timing
front-end model we developed to address them, and Section 5.5.3 describes how we simulate
page walks. Finally, Section 5.5.4 discusses the limitations and trade-offs of this methodology.
5.5.1 Simulation Challenges - Software-Managed TLBs in Simics
This work uses a full-system simulator based on Simics [52] that models the SPARC ISA and
boots Solaris. The TLBs in this system are software-managed. Unfortunately, the presence of
software-managed TLBs complicates the modeling of any architectural optimization that would
affect either the TLB hit ratio or how TLB misses and page walks are managed.
The TLB configurations present in the existing simulated system dictate whether a TLB miss
will be triggered or not. The employed Simics simulator models the Cheetah-MMU memory
management unit, which includes two per-core private TLBs: (i) a 512-entry 2-way SA TLB
for 8KB pages and (ii) a 16-entry FA TLB for superpages and locked translations.
If a memory access misses in the aforementioned Simics TLBs, which are different from
the TLB designs explored in this work, the operating system traps to the appropriate MMU
handler. Before walking the page-tables (in software), the MMU software trap handler probes
the Translation Storage Buffer (TSB). A single 128-bit atomic memory load is required for this
purpose. On a TSB hit, which is the anticipated fast common case, the trap handler updates
the TLB with the retrieved translation and retries the memory instruction which had triggered
the TLB miss in the first place.
Table 5.1 lists the D-MMU trap handler code for a TLB miss that resulted
in a TSB hit. This information was retrieved from Simics via the disassemble command. Added
comments explain the purpose of the various assembly instructions in that code snippet. The
key part of the TSB probe in the MMU trap handler is the 128-bit atomic load ldda which loads
the translation table entry from the TSB, both tags and data, into a set of global registers. See
instruction #9 in Table 5.1.
Unfortunately, in this simulation environment, it is challenging to evaluate FMN’s impact
on an x86-like baseline. Due to the presence of a TSB and the high frequency of TSB hits,
any results comparing the TLB miss handling cost achieved with the FMN design against that
baseline would be skewed. The objective is to compare the TLB miss handling overhead using
FMN with the overhead observed in an x86-like system, i.e., a system without a TSB that
has hardware-managed TLBs. Even disabling the TSB, e.g., by forcing TSB misses in Simics,3
would still compare FMN lookups with the latency overhead of a software TLB miss handler
walking the page tables, and would be directly influenced by how these page tables are organized
in this architecture.
To avoid the aforementioned challenges and negative side-effects of a system with software-
managed TLBs, we created a trace-driven timing simulator without Simics. In this new sim-
ulator, TLB misses do not probe the TSB, but they instead initiate a page-walk as in an x86
system. The page table walk now follows the x86 format and is thus not constrained by how the
page tables are organized in Solaris. The next section details our timing model and discusses
its trade-offs.
3 One could effectively disable the TSB by storing zeros to the destination registers of the ldda instruction (#9 in Table 5.1). Doing so would trigger a TSB miss by making the comparison (instruction #10) fail; a Solaris page-walk would thus ensue. In the sun4u architecture, to which UltraSPARC-III processors belong, the page tables are organized as Hashed Page Tables (HPTs): “HPTs use a hash of the virtual address to index into a page table. The resulting hash bucket points to the head of a list of data nodes containing table entries that are searched for a matching virtual address and context” [53].
Instr.  SPARC v9 Assembly                              Explanation

1   ldxa [%g0 + %g0] 0x58, %g2   # ASI_DMMU
    Read the contents of the D-TSB Tag Target Register. This MMU register holds information
    about the virtual address and context that missed in the D-TLB. SPARC uses the special
    "load extended word from alternate address space" (ldxa) instruction to access special
    MMU registers. Global register g2 is the destination register.

2   ldxa [%g0 + %g0] 0x59, %g1   # ASI_DMMU_TSB_8KB_PTR_REG
    Read the contents of the D-TLB 8KB Pointer MMU register. MMU hardware support forms this
    TSB pointer to speed up the TSB lookup.

3   srlx %g2, 48, %g3
    Global register g3 now holds the context in bits [12:0].

4   brz,pn %g3, 0x10000d38
    Branch if register g3 contains 0 (i.e., this is a global page and no context comparison
    should take place). The branch is predicted not-taken (pn), and indeed it is not taken.
    SPARC v9 has branch delay slots: the instruction after a branch is commonly executed
    unless annulled by the branch.

5   sll %g3, 4, %g5
    Destination register g5 holds the context in bits [16:4].

6   sra %g2, 11, %g6
    Register g6 contains virtual address bits [53:33] of the missing address in its least
    significant bits.

7   brgz,pt %g6, 0x10008840
    Branch if the contents of register g6 are greater than zero. This branch is incorrectly
    predicted as taken (pt).

8   xor %g5, %g1, %g1
    XOR the context with the TSB 8KB pointer to form the TSB address.

9   ldda [%g1 + %g0] 0x24, %g4   # ASI_NUCLEUS_QUAD_LDD
    This is the only memory request that goes to the cache hierarchy (quad load). It is a
    128-bit atomic load which loads the TTE (Translation Table Entry) tag into register g4
    and the TTE data into register g5.

10  cmp %g2, %g4
    Compare the TSB entry (retrieved by the previous load) with the TSB Tag Target (i.e.,
    virtual address and context comparison).

11  bne,pn %xcc, 0x100088c0
    Branch on a TSB miss. Predicted not-taken, as TSB hits are the common case.

12  sethi %hi(0xffff8000), %g4

13  stxa %g5, [%g0 + %g0] 0x5c   # ASI_DTLB_DATA_IN_REG
    Write the contents of g5 to the D-TLB Data In register. Register g5 holds the TTE data
    after the ldda instruction executed (i.e., the translation (physical address),
    protection bits, etc.).

14  retry
    Retry the offending instruction (i.e., the memory request that had missed in the D-TLB).

Table 5.1: TSB hit code in D-MMU Trap Handler (Solaris)
5.5.2 Timing Model
This work uses a trace-driven timing simulator that follows a blocking in-order core model for
all memory requests in a 16-core CMP. Figure 5.6 depicts a high-level model of our simulator’s
front-end. We do model a detailed memory system in our simulator (back-end), including
full-timing for caches, TLBs, on-chip network and memory (DRAMSim [66]). A trace parsing
component parses the collected memory TLB traces, or the synthetically generated ones, and
feeds them to per-core memory FIFO queues. There are also separate queues to keep track
of page-walks and FMN probes/allocations, as will be explained later. Please note that this
is the high-level software implementation for simulation purposes and not the architectural
implementation. For example, one could think of the Memory FIFO of Figure 5.6 as the Load-
Store Queue equivalent.
[Figure 5.6 depicts the front end: a trace-parsing component feeds the 16-core memory trace from Simics into per-core Trace-to-Timing engines (C1-C16, each with its TLBs and L1-D), which drive the Memory FIFO, the Page-Walk Request FIFO, and the FMN Probe/Allocation FIFOs in front of the remaining cache hierarchy, network, and memory.]
Figure 5.6: Timing Model - Front End
In our system, each FIFO entry is tagged with a state. Initially, all requests are inserted in
the Memory FIFO queue in the Unprocessed state, except for the request at the head of the
queue, which is TLB ready. From that point, the life cycle of a regular memory request is the
following:
1. The request at the head of the FIFO (TLB ready) is sent to the TLB hierarchy. The
entry then transitions to a TLB stalled state, whose duration depends on whether the
access was a TLB hit or a miss and on the associated TLB latencies. Once all TLB-associated
latencies have elapsed, the request is either ready to be sent to the memory hierarchy if its
translation is known, or a page-walk is in order.
2. TLB Miss: On a TLB miss both a page-walk and an FMN-probe, if the FMN is enabled,
are initiated in parallel for that request. Entries are allocated into the per-core Page-Walk
and FMN-Probe FIFOs. The page-walk involves multiple memory requests to walk the
multi-level page tables; the Page-Walk FIFO state describes which part of the page-walk
each request corresponds to. The page-table format is presented in Section 5.5.3. De-
pending on which returns first (i.e., page-walk or FMN-probe), the next steps are:
(2.a) Page-Walk returned first (Figure 5.2): The FMN probe will be useless once it
comes back as it failed to speed up the page-walk process. The corresponding FMN FIFO
entry is thus marked as useless. Now that we have retrieved the translation for the mem-
ory request (physical address is known), the request can be sent to the memory hierarchy.
Depending on the FMN allocation policy, we might also send the FMN allocation request
to the cache hierarchy to fill the FMN. Modeling a direct-mapped FMN allows us to send
the FMN allocation request in advance of the FMN probe reply. For a set-associative
FMN, the system should wait for the FMN probe reply, do the LRU stack update there,
and then send the FMN allocation request that will update the entire FMN set if cached.
(2.b) FMN-probe returned first (Figure 5.3): In case of an FMN miss (either
due to a miss in the set, or a non-cached FMN-set), we fall back to the previous case
and wait for the page walk to complete. On an FMN hit, we do not issue a memory
request after an FMN-probe reply unless we are certain this will not be a misspeculation.
In other words, in the scenario illustrated in Figure 5.3c, we pay the latency penalty of
waiting for the page walk to complete, as in an actual system, but we do not issue mem-
ory requests to the memory hierarchy. The only side-effect of using this oracle knowledge
is that fewer requests are sent to the memory hierarchy, which is negligible given that
translation modifications are rare and latency is always properly modeled.
3. Once the page-walk returns, irrespective of whether this was earlier or later in time than
the FMN probe, we know that we have the most up-to-date translation for the given
virtual address. An FMN allocation message, i.e., a write request, is sent to the memory
hierarchy with the correct translation. This serves a dual purpose: it keeps the FMN
information up-to-date and it helps the cached FMN blocks survive in the cache.
4. Once the memory reply from the back-end reaches the memory FIFO, the subsequent
queued request is ready to be sent to the TLB. This process repeats.
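The steps above can be condensed into a simple state-transition sketch; the state names are illustrative, not the simulator's actual identifiers:

```python
# Minimal sketch of the per-request life cycle from steps 1-4 above.
# State names are illustrative, not the simulator's actual identifiers.
from enum import Enum, auto

class ReqState(Enum):
    UNPROCESSED = auto()
    TLB_READY = auto()
    TLB_STALLED = auto()
    WALK_AND_PROBE = auto()   # page walk + FMN probe outstanding in parallel
    MEM_READY = auto()        # translation known; issue to memory hierarchy
    DONE = auto()

def next_state(state: ReqState, tlb_hit: bool = False) -> ReqState:
    if state is ReqState.TLB_READY:
        return ReqState.TLB_STALLED
    if state is ReqState.TLB_STALLED:
        return ReqState.MEM_READY if tlb_hit else ReqState.WALK_AND_PROBE
    if state is ReqState.WALK_AND_PROBE:
        return ReqState.MEM_READY      # walk or FMN hit supplied the translation
    if state is ReqState.MEM_READY:
        return ReqState.DONE           # reply returned; the next FIFO entry advances
    return state

s = ReqState.TLB_READY
for _ in range(4):                     # TLB-miss path, end to end
    s = next_state(s, tlb_hit=False)
print(s)
```

On the TLB-hit path, the request skips WALK_AND_PROBE and goes straight from TLB_STALLED to MEM_READY.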
5.5.3 Page Walk Modeling
To make the page-walk representative of a real, x86-like, system, the required multi-level page-
tables are modeled. To make them compatible with our SPARC ISA traces, some modifications
were needed. Each simulation starts with a known pool of free 8KB pages (physical frames), the
smallest page-size in SPARC. These are pages that are not accessed within our trace and can
thus be freely allocated to the page-tables or the FMN without interfering with the application’s
access pattern. We populated this per-application pool of free 8KB pages during a preprocessing
step. Figure 5.7 shows how we model the page-walk in our infrastructure.
[Figure 5.7 shows the modeled page walk: virtual-address bits [51:42], [41:32], and [31:22] index the 1024-entry L4, L3, and L2 page tables (PDEs); bits [21:13] index the 512-entry L1 page table, which holds the 8KB-page translation (PTE); and bits [12:0] form the 8KB page offset (bits [21:0] for a 4MB page). Bits [63:52], together with process information (e.g., context), select the CR3 (x86) equivalent.]
Figure 5.7: Page Walk Model
Usually in architectures with multi-level page tables, there is a specialized register that
points to the beginning of the first-level page-table for the currently running process. In x86,
the CR3 register contains this physical memory address. Since we do not have an x86 system,
we dynamically emulate this behavior by keeping a map of a (process ID, virtual address bits
[63:52]) tuple to a physical address that marks the beginning of the upper-level page-table
for the process in question. No latency or hardware cost is modeled for this lookup, as this
information is readily available to the hardware page-walkers of real systems.
From this point onwards we inject page-walk requests into the memory hierarchy, similar
to the hardware page-walker. We maintain 4-level page-tables. Bits from the TLB missing
virtual address are used to index into each page table level as shown in Figure 5.7. Each table
entry is 8B and is either a page directory entry, PDE, (i.e., a pointer to the beginning of a
lower page-table) or a page translation entry, PTE, (i.e., the final translation, present in the
lowest page table level or in a higher level if this address belongs to a superpage). Each of
the smaller tables has either 512 entries (the lowest level, which holds the page translation entry)
or 1024 entries (the upper three levels, which usually hold page directory entries). Page walks for
8KB pages require 4 page-table accesses to retrieve the translation, whereas 4MB pages require
3 page-table accesses. In all cases, each small table fits within an 8KB page. Any time the
contents of a page table entry are invalid, we grab a new 8KB page from the free list. We
also specify whether an entry needs to be a PTE rather than a PDE, as indicated by the
page-size information retrieved from Simics.
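The index extraction of Figure 5.7 amounts to plain bit slicing of the virtual address; a sketch (field names are ours):

```python
# Sketch of the index extraction for the 4-level walk of Figure 5.7
# (8KB pages: 13 offset bits; 512-entry lowest table: 9 bits;
# 1024-entry upper tables: 10 bits each; bits [63:52] select the root).
def walk_indices(vaddr: int) -> dict:
    return {
        "root_tag": (vaddr >> 52) & 0xFFF,  # combined with a process ID -> CR3 equivalent
        "l4": (vaddr >> 42) & 0x3FF,        # 1024-entry table -> 10 bits
        "l3": (vaddr >> 32) & 0x3FF,
        "l2": (vaddr >> 22) & 0x3FF,
        "l1": (vaddr >> 13) & 0x1FF,        # 512-entry table -> 9 bits
        "offset": vaddr & 0x1FFF,           # 13-bit offset within an 8KB page
    }

print(walk_indices((3 << 42) | (5 << 13) | 0x7FF))
```

A 4MB superpage would stop one level earlier, treating bits [21:0] as the page offset.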
5.5.4 Discussion of Limitations
This section discusses the limitations of our methodology.
• Blocking In-Order Core Model: This work uses a trace-driven timing simulator that
follows a blocking in-order core model for all memory requests in a 16-core CMP. In this
blocking in-order core model, a core C1 cannot issue another memory request to its L1
cache unless its previously pending memory request has completed. This constraint does
not apply to memory requests issued by the MMU: page-walks and FMN probes/allocations.
This simple front-end reflects recent trends toward simpler core microarchitectures [51].
POWER6 [48], ARM Cortex-A53 [7], and Intel Xeon Phi [63] are a few examples of commercial
in-order machines.
In an Out Of Order (OoO) core, part of the TLB miss latency will likely be hidden
by the extracted Instruction Level Parallelism (ILP). However, there is a limit on how
much ILP can hide; some of the TLB miss handling time (in systems with hardware-managed
TLBs) will be non-overlapping, similar to our in-order core model. Many of the
simulated commercial workloads tend to have low Memory-Level Parallelism (MLP) and
hence low Instructions Per Cycle (IPC), thus making an in-order core model a reasonable
approximation. For example, Ferdman et al. report that the MLP for the scale-out
workloads (Cloud-Suite) ranges from 1.4 to 2.3 even in the presence of an aggressive 4-wide
issue OoO core with a 48-entry load/store buffer and a 128-entry instruction window [30].
The MLP numbers are even lower for “traditional server workloads” like TPC-C and
SPECweb09 [30]. They also report application IPC numbers “in the range of 0.6 (Media
Streaming) to 1.1 (Web Frontend)” for the scale-out workloads, whereas workloads like
TPC-C exhibit even lower IPC.
• Memory-Only Traces: The traces contain only memory accesses, an approximation of
a memory-bound core which is expected for the commercial and Cloud-suite workloads
(Section 3.2) used in this work. Because the baseline system has software-managed TLBs
(as Section 5.5.1 discussed), the memory traces also include memory requests that cor-
respond to TSB probes and page walks from the original full-system simulation. These
requests are not distinguished from all the other memory requests, since neither the TSB
probes nor the page-walks would coincide with TLB misses in the simulated TLB base-
lines, and the page-table format in x86 is different. Further, because the TLB in the
original system was relatively large (512-entry SA TLB for 8KB pages and 16-way FA
TLB for superpages), when compared to split L1-TLBs, the MPMI was much lower.
Therefore, these requests are a very small portion of the memory traces, with page-walk
requests being even fewer as TSB hits are the common case. Our infrastructure treats
these requests as accesses to yet another software hash-based data-structure.
• Synchronization: No synchronization is modeled across the memory requests of multiple
cores, other than the “synchronization” happening implicitly due to coherence. It is
possible that a request X from core C2 completes before a request Y from core C1 that
was stored earlier in the trace. The timing order of requests in the trace represents
one possible memory ordering in functional mode. Different permutations/orderings are
possible. In all cases, we do not anticipate this lack of synchronization/ordering to affect
the observed trends. First and foremost, any such variation would equally affect a
system with an FMN as well as all the baselines. Second, the lack of synchronization could
underestimate the page-walk impact on performance, and thus the potential FMN benefit.
For example, if a memory access Y in one core should follow, time-wise, a memory access
X from another core due to a synchronization barrier, and X happens to miss in the TLB,
the performance benefit of reducing that page-walk overhead would be more significant
in reality than in the simulated system where access Y can proceed before access X. In
this context, the reported FMN benefits could be underestimated.
• Page-Walk Modeling: The physical addresses the page tables map to were determined
according to the process described in Section 5.5.3. These addresses and their spatial
vicinity in the memory address space would be OS and system dependent. Even for the
same OS and architecture, the system’s load would determine the free list from which page
frames will be allocated upon request. In our methodology, the four page-tables (one per
level) that map a virtual address for an 8KB page to its corresponding physical frame
would all be allocated contiguously in physical memory if they had no prior accesses.
This approach might favour the page-walk latency. It is thus possible that FMN’s benefit
might be greater had another scheme been followed. Regardless, the employed scheme
meets the following two requirements: (i) it reflects a multi-level page table walk, and (ii)
it is consistently used on all configurations including the baseline.
5.6 Reasoning about FMN’s Performance Potential
This section presents an analytical model to estimate the potential performance improvement
(i.e., execution time reduction) for the proposed FMN technique. Given such a model, measure-
ments from actual applications can be then plugged in to estimate what performance benefits,
if any, should be expected. FMN aims at reducing the TLB miss handling time. It will thus
achieve the following performance improvement:
Performance Improvement = (Tbaseline − TFMN) / Tbaseline        (5.3)
where Tbaseline is the execution time in cycles of a given workload on the baseline system, while
TFMN is the execution time of that same workload on a system with the proposed FMN design.
The execution time for the baseline system can be approximated as:
Tbaseline = TTLB Misses + TMemory + TOther (5.4)
which is the sum of the time spent servicing TLB misses (TTLB Misses), the time spent servicing
memory requests once their translation is known (TMemory), and the time spent on computation
(TOther).
FMN’s goal is to reduce the amount of time spent servicing TLB misses by trading the
latency of lengthy page-walks for hits in the proposed cached TLB. However, since FMN
introduces more memory requests compared to the baseline along with a new cached structure,
it could slightly increase the time spent servicing memory requests. FMN does not affect the
computation time (TOther) which will be the same both for the baseline and the FMN system.
The FMN execution time can thus be expressed as:
TFMN = (1 + ∆mem) ∗ TMemory + (1−∆TLB miss) ∗ TTLB Misses + TOther (5.5)
It is extremely unlikely for FMN to increase the TLB miss penalty (negative ∆TLB miss) as this
would imply a page walk latency increase not counterbalanced by any reduction due to FMN
hits. No such scenarios were encountered in simulation. Both delta values were measured to be
in a positive [0, 1] range.
For convenience, two ratios r and c are defined as:
r = TMemory / TTLB Misses        (5.6)

c = TOther / TTLB Misses        (5.7)
The execution times for the two systems can now be rewritten as:
Tbaseline = (1 + r + c) ∗ TTLB Misses (5.8)
and
TFMN = r ∗ (1 + ∆mem) ∗ TTLB Misses + (1−∆TLB miss) ∗ TTLB Misses + c ∗ TTLB Misses (5.9)
Therefore, the performance improvement (Equation 5.3) can be rewritten as:
Performance Improvement = (Tbaseline − TFMN) / Tbaseline = (∆TLB miss − r ∗ ∆mem) / (1 + r + c)        (5.10)
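Equation 5.10 is straightforward to evaluate directly; the sketch below plugs in illustrative pairings drawn from the measured ranges reported later in this section, not actual per-workload data:

```python
# Sketch evaluating Equation 5.10 from the measured deltas and ratios.
def perf_improvement(d_tlb: float, d_mem: float, r: float, c: float) -> float:
    """Fractional execution-time reduction; negative means a slowdown."""
    return (d_tlb - r * d_mem) / (1 + r + c)

# Illustrative pairings of the measured ranges (Section 5.6), c assumed 0:
print(100 * perf_improvement(0.33, 0.005, 6, 0))    # TLB-intensive, low r
print(100 * perf_improvement(0.12, 0.015, 27, 0))   # high r: can go negative
```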
Upper Bound Projection: Figure 5.8 plots a possible upper bound on the projected performance
improvement (%) achieved with FMN. The computations assume (i) no increase in memory
latency (∆mem = 0), and (ii) a 75% decrease in TLB miss handling latency (∆TLB miss = 0.75).
The latter assumes four memory accesses per page walk, all with the same latency as an FMN
probe, that are substituted by a single FMN memory request. Thus, the following figure plots
the equation:
Performance Improvement (%) = 100 ∗ (∆TLB miss − r ∗ ∆mem) / (1 + r + c) = 75 / (1 + r ∗ (1 + TOther/TMemory))        (5.11)
[Figure 5.8 plots execution time reduction (%) on the y-axis (0 to 40%) against r = TMemory/TTLB Misses on the x-axis (1 to 30), with one series per TOther/TMemory ratio (0, 0.25, 0.5, 1, 2, 4, 8).]
Figure 5.8: Projected ideal % performance improvement based on Equation (5.11) with ∆TLB miss = 0.75 and ∆mem = 0.
The x-axis lists different values of r, while each series corresponds to a different value of the
TOther/TMemory fraction. Memory-bound workloads, which this work targets, will have a ratio less
than one. As anticipated, the lower the value for this ratio, i.e., the more memory-bound a
workload is (top few series in this figure), the higher the potential performance improvement
for a given r value. Large values of r indicate that the proposed FMN scheme will have very
little, if any, performance benefit. Workloads with r in the [4, 16] range are projected to achieve
performance improvement in the [2%, 38%] range for ratios of one or lower. Even though the
projected performance benefits are negligible for higher values of r, the proposed FMN scheme
is still projected to not harm performance while not requiring any dedicated on-chip resources.
We believe this can be a compelling design choice for systems where chip real estate is at a
premium.
Figure 5.8’s projections assumed FMN’s data did not increase the memory latency for
demand requests. In this context, a non-zero ∆mem would shift this figure's series vertically
towards the x-axis, slightly reducing FMN's projected performance improvement. Next,
Figure 5.9 plots the projected performance improvement for various values of ∆TLB miss, ∆mem
and r based on Equation (5.10), assuming c = 0, i.e., TOther is negligible compared to
TTLB Misses. The ranges of values for the three other parameters of that equation reflect simulation
measurements. For the workload traces used in this work, ∆TLB miss was measured in the range
of 0.12 to 0.33, while ∆mem from 0.005 to 0.015. For ∆TLB miss the figure plots the entire
spectrum of possible valid values, starting from zero which stands for no reduction in TLB miss
cycles. It is highly unlikely for ∆TLB miss to be greater than 0.75, given that the four memory
requests of the page walk will be substituted with a single memory request in case of an FMN
hit. The figure also plots four ∆mem configurations. A zero ∆mem value means FMN has no
negative influence in the execution time of memory requests.
[Figure 5.9 has four panels, one per ∆mem value (0, 0.005, 0.01, 0.02), each plotting performance improvement (%) against ∆TLB misses (the reduction in time spent servicing TLB misses, 0.0 to 0.8) for r = 5, 10, 15, 20, 25, 30.]
Figure 5.9: Projected % performance improvement based on Equation (5.10) with c = 0.
The ratio r solely depends on the workload’s access pattern and the baseline configuration
(e.g., TLBs, cache hierarchy, etc.). The lower the value of r, the more page-walks’ latency
Chapter 5. The Forget-Me-Not TLB 112
dominates execution time. For the traces used in this work, r was measured in the [6, 27]
range. It was six for canneal (i.e., ∼14% of baseline execution time was spent servicing TLB
misses), while r was 27 for TPC-C1 (i.e., ∼4% of the baseline execution time was spent servicing
TLB misses). Figure 5.9 plots r values from 5 to 30 in increments of five. The smallest r value
of five represents a workload that spends 20% of its execution time servicing TLB misses.
As Figure 5.9 shows, for workloads that have high values of r, ∆TLB miss needs to be quite
high, e.g., more than 0.5, for the workload to experience even a small, one or two percent,
performance improvement. Otherwise, FMN can result in a slowdown. This behaviour is
accentuated as ∆mem increases. Disabling the FMN for workloads that have been
profiled to have a high r value could be one possible solution. For TLB intensive workloads,
characterized by low r values, a ∆TLB miss above 0.3 is beneficial. The lower the r is, the higher
the performance benefit of FMN will be.
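The profiling-based gating idea above can be sketched in a few lines. The counter names and the cut-off value of 15 below are illustrative assumptions, not part of the proposal; the only grounded fact is the definition of r as the ratio of total execution cycles to TLB-miss handling cycles (r = 5 corresponds to 20% of execution time spent servicing TLB misses).

```python
def measured_r(total_cycles: int, tlb_miss_cycles: int) -> float:
    """r = total execution cycles / cycles spent servicing TLB misses;
    r = 5 corresponds to 20% of execution time handling TLB misses."""
    return total_cycles / tlb_miss_cycles

def fmn_enabled(total_cycles: int, tlb_miss_cycles: int,
                r_threshold: float = 15.0) -> bool:
    """Keep FMN on only for TLB-intensive (low-r) workloads.
    The threshold value is illustrative."""
    return measured_r(total_cycles, tlb_miss_cycles) < r_threshold
```

Under this illustrative threshold, a canneal-like profile (r = 6) keeps FMN enabled, while a TPC-C1-like profile (r = 27) disables it.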
5.7 Synthetic Memory Access Patterns
As Section 5.6 explained, FMN’s potential relies on the memory access patterns of the simulated
workloads. We created a suite of synthetic memory traces to explore how the proposed FMN
design behaves under different loads. We believe such an exploration is valuable. Even though
we simulated commercial and cloud workloads, one could anticipate that future workloads will
stress TLBs and memory hierarchies even more or in different ways.
The following design knobs, detailed below, are of interest: memory footprint size, memory
access patterns, presence of data sharing, and number of processes.
(1) Memory Footprint: Memory footprint refers to the memory space, i.e., number of bytes,
a workload accesses. In memory hierarchy research, this metric often relates to the required
cache capacity. For the synthetic traces in this work, a workload’s memory footprint comprises
(a) the number of unique 8KB pages this workload accesses during its execution, and (b) the
number of 64B cache blocks accessed within each page. Both these configuration parameters
affect the ratio r presented in Equation (5.6).
The first parameter, number of unique pages, directly relates to the number of translation
entries required for the address translation structures (e.g., TLB and FMN) to avoid a page
walk for non-cold accesses. Scenarios where all unique pages fit in the TLB, thrash the TLB
but fit in the FMN, as well as scenarios where both the TLBs and FMN are thrashed, are all
modeled, as they directly affect the number of TLB misses and the time spent servicing them.
The second characteristic of the footprint, the number of cache blocks accessed, is also
important. Accessing only one cache block per page puts the TLB miss handling latency in
the forefront, whereas accessing many cache blocks from each page, blocks that might miss in
the caches, can reduce the performance impact of a potential TLB miss. Accessing more cache
blocks from each page can also result in more contention for the existing L1-D cache capacity,
as FMN and page tables might need to contend more with the workload’s memory footprint.
As Section 5.4 mentioned, both FMN probes and page-walk requests percolate through the
memory hierarchy starting from the L1 cache.
(2) Access Patterns: This work explores various memory access pattern combinations, both
at the granularity of pages as well as at the granularity of cache blocks. These two categories
of access patterns are orthogonal, and together they reflect how different algorithms access
memory. The memory footprint configuration option, mentioned earlier, controls the size of a
pool that contains only unique pages. The memory access pattern dictates how this pool is
populated.
The page-level access pattern controls the page number relationship among consec-
utively accessed pages. For example, contiguous page numbers reflect a streaming (i.e., se-
quential) pattern, while a fixed stride between consecutively accessed page numbers reflects a
striding pattern. A random permutation of page numbers is also modeled. All pages in the
pool are accessed in a round-robin fashion.
The different memory patterns influence the intensity of accesses each TLB or FMN set sees.
For example, a stride pattern of two which accesses only even pages (e.g., page numbers: 2, 4,
6, etc.) would cause contention for a few TLB sets (all sets with an even index), leaving half
of the TLB (or FMN) sets underutilized. On the other hand, a streaming pattern (e.g., page
numbers: 1, 2, 3, etc.) would uniformly stress all sets.
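The page-level patterns described above can be expressed as a small generator. This is a sketch, not the thesis's trace generator: the function and parameter names are illustrative, and the random pattern uses a fixed seed for reproducibility.

```python
import random

def page_sequence(pool_size, pattern="stream", stride=2, passes=1, seed=0):
    """Yield virtual page numbers for one or more round-robin pool passes.

    stream : contiguous page numbers (0, 1, 2, ...)
    stride : a fixed stride between consecutive page numbers
             (stride=2 touches only even pages, hence only even TLB sets)
    random : a fixed random permutation of the pool, replayed each pass
    """
    if pattern == "stream":
        order = list(range(pool_size))
    elif pattern == "stride":
        order = [i * stride for i in range(pool_size)]
    elif pattern == "random":
        order = list(range(pool_size))
        random.Random(seed).shuffle(order)
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    for _ in range(passes):
        yield from order
```

For example, a stride of two yields pages 0, 2, 4, 6, ..., all of which map to even-indexed TLB sets, while the stream pattern touches every set uniformly.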
The block access pattern, i.e., how cache blocks are accessed within a page, does not
influence TLB misses directly. However, it can affect the significance of TLB misses for the
baseline, as discussed earlier. We also explore how our system would behave if all cores followed
the same access pattern or different permutations of it.
(3) Data Sharing: This option controls the amount of data sharing present across cores.
Per-core pools of unique pages cover multiprogrammed workload scenarios where no sharing is
present. Prepending a unique identifier (e.g., core ID) to each unique page number achieves this
purpose without interfering with the TLB (or FMN) indexing or skewing the measurements
on different cores. For multi-threaded workloads, all per core pools contain the same unique
pages.
Different degrees of data sharing can influence whether a shared FMN would be a beneficial
design choice. Shared footprints could also reduce the page walk latency for the baseline,
compared to private per-core footprints, as the various page table entries will most likely already
be cached.
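A minimal sketch of this pool construction follows. Prepending the core ID in high-order bits keeps per-core pages globally unique without disturbing the low-order bits used for set indexing; the bit position (40) and function name are arbitrary illustrative choices.

```python
def pool_page_number(core_id: int, page_number: int, shared: bool = False) -> int:
    # Shared (multi-threaded) pools: all cores use the same page numbers.
    # Private (multiprogrammed) pools: prepend the core ID in high-order
    # bits so pages are globally unique while the low-order bits used for
    # TLB/FMN set indexing stay untouched. Bit position 40 is arbitrary.
    return page_number if shared else (core_id << 40) | page_number
```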
(4) Processes: TLB entries contain an ASID that identifies the process a given translation
belongs to. In SPARC terminology, different contexts denote different processes, while the
global context specifies data shared across all processes. Systems that have a single context
running on all cores could take advantage of a shared FMN, whereas systems where every core
has its own private context would not. This is similar to the effect of different degrees of data
sharing discussed earlier.
Scenarios with multiple processes running per core are also modeled. The lack of ASID-
aware TLB indexing schemes can result in increased contention and more TLB misses for the
baseline system. This also opens an interesting avenue for FMN indexing scheme exploration.
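To illustrate the indexing avenue mentioned above, the sketch below contrasts the conventional VPN-only index with a hypothetical ASID-aware index that folds the process identifier into set selection. Both functions and the XOR-fold hash are illustrative assumptions, not a proposal from this work.

```python
SETS = 16  # e.g., a 64-entry 4-way SA TLB

def vpn_index(vpn: int, asid: int = 0) -> int:
    # Baseline: the ASID is stored as a tag but ignored for indexing, so
    # two processes with identical VPNs contend for the same sets.
    return vpn % SETS

def asid_aware_index(vpn: int, asid: int) -> int:
    # Hypothetical: XOR-fold a scrambled ASID into the index so identical
    # VPNs from different processes spread across sets.
    return (vpn ^ (asid * 0x9E3779B1)) % SETS
```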
Section 5.9 will next present a case study with synthetic traces that follow the sequential page
access pattern, using the baseline CMP configuration described in Section 5.8. An evaluation
with commercial workloads will be presented later in this chapter.
5.8 Baseline CMP Configuration
All simulated configurations involve a 16-core CMP with a 4x4 mesh interconnect. There is a
3-level cache hierarchy with private L1 and L2 caches, and a distributed shared L3 cache. All
caches have a 64B cache block size and an LRU replacement policy. There are four memory
controllers. Table 5.2 presents the main parameters of the baseline configuration (i.e., caches
and TLBs) that are of interest. For the L1-TLBs, a split Haswell-like configuration is modeled.
An L2-TLB is not included in the baseline unless explicitly noted (e.g., B + L2).
Caches
L1-D Caches: private, 4-way SA, 32KB, 2-cycle latency
L2 Caches: private, 8-way SA, 256KB, 3-cycle tag / 9-cycle data latency
L3 Cache: shared, 16-way SA, 16MB (1MB per tile), 4-cycle tag / 10-cycle data latency
TLBs
L1-TLBs: 4-way SA, 64-entry (8KB pages); three 4-way SA 32-entry TLBs for 64KB, 512KB and 4MB pages respectively
L2-TLB: 8-way SA, 1024-entry, 8-cycle latency

Table 5.2: System Configuration Parameters
5.9 Sequential Page Access Patterns - A Case Study with Synthetic Traces
This section presents an analysis of the synthetic trace results for a sequential (i.e., streaming)
page access pattern and demonstrates how different design knob values affect performance via
their interaction with the baseline TLB hierarchy. With this specific pattern, a core accesses
a pool of PS (Pool Size) contiguous 8KB pages in a round-robin manner. The pool can be
either private (per core) or shared (replicated across cores). In total, 32 million memory accesses
(reads) are modeled (two million per core). The first half of the execution time warms up the
memory hierarchy; results are presented for the second half (i.e., the last 16 million requests).
Because of the pool sizes modeled and the round-robin manner in which the pools are accessed,
the simulated number of memory accesses completely captures the behaviour of this synthetic
trace.
In every pool pass, a fixed number of cache-block reads is performed for each 8KB page; we
refer to this count as BCPP (Block Count Per Page). These accesses are cache-block (64B)
aligned. A per-page offset, a multiple of 64 bytes, is applied to these accesses to avoid L1-D cache contention. Without
this offset the first cache-block of all 8KB pages would map to the first of the 128 cache sets of
the modeled 4-way SA 32KB L1-D cache. Applying this offset is similar to compiler padding
optimizations that reduce cache conflict misses. For example, assume a unique pool of 80 8KB
pages per core with a streaming page access pattern which accesses two cache blocks per page.
That is, PS is 80 and BCPP is 2. Core 0 accesses the first two cache blocks of the first 8KB
page (e.g., virtual addresses 0x0, 0x40), and then accesses two blocks from the subsequent 8KB
page (i.e., virtual addresses 0x2080, 0x20c0). The blocks accessed in that second 8KB page are
the third and fourth blocks of that page and not the first two. Having this padding prevents
these accesses from mapping to the same L1-D cache set.
The padding can be formally computed as:
((pool index ∗ BCPP ) + block index) ∗ 64 Modulo 8192
where 64 is the cache line size and 8192 is the page size. The pool index is in the range [0, PS-1]
and is incremented every BCPP number of accesses, while the block index, with values in the
range of [0, BCPP - 1], is incremented on every access and it is reset to zero once equal to
BCPP. BCPP’s value should never exceed (page size / cache line size).
The remaining section is organized as follows. First, Sections 5.9.1 to 5.9.4 demonstrate
the impact the various design knobs have on the baseline configuration. Then, the subsequent
sections measure FMN’s effectiveness via metrics such as performance and TLB miss handling
latency.
5.9.1 Impact of Workload’s Footprint on Baseline Configuration
Because of the sequential page access pattern, three distinct groups of pool sizes (PS) exist
with respect to TLB hit-rate for the baseline L1-TLB configuration (Table 5.2). The groups
shown below apply to any set-associative TLB with an LRU replacement policy; wherever
relevant, specific numbers are provided for the baseline 64-entry 4-way SA L1-TLB for 8KB
pages.
• PS <= # TLB entries (i.e., 64): For a pool with at most 64 contiguous pages, only
cold TLB misses occur in the baseline SA TLB that uses an LRU replacement policy.
Thus, as Figure 5.10 shows, the TLB hit-rate during the last 16M requests is consistently
at 100%, irrespective of the per-page accesses (i.e., BCPP, the number of 64B cache blocks
accessed within each page).
• PS >= (1 + 1/TLB associativity) ∗ # TLB entries (i.e., 80): Any pool size above, or equal
to, the 80-entry boundary will thrash the L1-TLB, with 0% TLB hit-rate when only one
cache block is accessed per page (BCPP = 1). As Figure 5.10 shows in its second series,
the 0% measured TLB hit-rate, for a block count of one, becomes 50% if two cache blocks
are accessed per page, and 99.2% if all 128 cache blocks in each page are accessed. All hits
for this series are due to multiple accesses (BCPP > 1) to the same page. The hit-rates
listed in the figure (data labels) apply to any PS >= 80. In all these cases, having a single
TLB-entry that keeps the most recently used translation would have the same behaviour
as the baseline TLB.
• 64 < PS < 80: For any pool size between the two aforementioned boundaries, the
hit-rate curve would lie in the area between the two plotted series. As PS grows towards
80, and fewer translations persist from the first 16 pages in the pool (L1 TLB’s LRU way),
the corresponding curve moves towards the PS>=80 series.
[Figure 5.10 plot: L1-TLB hit rate (%) versus BCPP (# 64B blocks accessed per page, 1 to 128). The PS <= 64 series stays at 100% throughout; the PS >= 80 series has hit rates of 0, 50, 75, 87.5, 93.75, 96.88, 98.44 and 99.22% for BCPP = 1, 2, 4, 8, 16, 32, 64 and 128 respectively.]
Figure 5.10: Effect of pool size on TLB hit rate.
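These three regimes can be reproduced with a few lines of simulation. The sketch below models the baseline 64-entry 4-way SA L1-TLB with true LRU replacement and measures the hit rate over the second half of the passes (i.e., after warm-up); class and function names are illustrative.

```python
from collections import OrderedDict

class SetAssocTLB:
    """Set-associative TLB with true LRU replacement."""
    def __init__(self, entries=64, ways=4):
        self.ways = ways
        self.sets = entries // ways
        self.lru = [OrderedDict() for _ in range(self.sets)]

    def access(self, page):
        """Return True on a hit; insert with LRU eviction on a miss."""
        s = self.lru[page % self.sets]
        if page in s:
            s.move_to_end(page)
            return True
        if len(s) >= self.ways:
            s.popitem(last=False)   # evict the least recently used entry
        s[page] = True
        return False

def hit_rate(ps, bcpp, passes=20):
    """Sequential pattern: pages 0..ps-1 round-robin, bcpp accesses each.
    The hit rate is measured over the second half of the passes."""
    tlb, hits, total = SetAssocTLB(), 0, 0
    for p in range(passes):
        for page in range(ps):
            for _ in range(bcpp):
                hit = tlb.access(page)
                if p >= passes // 2:
                    hits += hit
                    total += 1
    return hits / total
```

This reproduces the figure's data points: hit_rate(64, 1) is 100%, hit_rate(80, 1) is 0%, hit_rate(80, 2) is 50%, and hit_rate(80, 128) is 99.2%.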
For workloads with pool sizes that exceed the current TLB capacity and thrash the TLB,
the more spatial locality exists within a given page (i.e., the more cache blocks are accessed from
that page), the less significant the total TLB miss handling time becomes for the workload’s
performance. Figure 5.11 depicts the percentage of execution time spent servicing TLB misses
for different pool sizes (graph series) and different degrees of per page spatial locality (x-axis).
If 16 cache blocks are accessed per page, TLB miss handling will account for less than 1.4% of
execution time. This percentage falls below 1% if more than 25% of a page’s cache blocks are
accessed.
[Figure 5.11 plot: % execution time spent servicing TLB misses versus BCPP, with one series per pool size (PS-64 to PS-2048).]
Figure 5.11: TLB Miss Latency as percentage of execution time with varying PS and BCPP values. Figure 5.12 presents how the execution time changes.
As anticipated, the execution times for these synthetic traces vary as PS and BCPP change
(Figure 5.12). But one should not draw the simplistic conclusion that “TLB miss latency is a
larger fraction of execution time when the latter is shorter” as this would be misleading. It is
the memory footprint as determined by the pool sizes and the number of cache block accesses
within a page that affects both the execution time and the TLB miss latency. When more cache
blocks are accessed, either due to larger pool sizes (PS) or a combination of large PS and BCPP
values, the memory latency dominates: the requested cache blocks no longer fit in the L1-cache
but eventually spill in the L2 and L3 caches resulting in the drastic increases of the execution
time shown in Figure 5.12. The average memory request latency, depicted in Figure 5.13, also
reflects the same trends. For the PS-64 series, the average memory latency is 3-cycles for the
BCPP-1 to BCPP-8 configurations since all the memory data fit in the L1-D cache.
[Figure 5.12 plot: execution time (million cycles, 0 to 250) versus BCPP, one series per pool size (PS-64 to PS-2048).]
Figure 5.12: Execution time with varying PS and BCPP values.
[Figure 5.13 plot: average memory request latency (cycles, logarithmic scale from 1 to 1000) versus BCPP, one series per pool size (PS-64 to PS-2048).]
Figure 5.13: Average Memory Request Latency in cycles. Note the logarithmic y-axis scale.
For similar reasons, the average TLB miss handling latency (page-walk latency) also in-
creases alongside PS and BCPP (Figure 5.14). While all pools with more than 80 pages have
the same number of TLB misses, for a given BCPP, these misses become, on average, more
costly with larger pool sizes. The increased data footprint no longer allows the page-table
entries to survive in the upper-level caches (e.g., the L1 cache). These results illustrate how
interconnected the TLB miss latency is with the workload’s footprint and access pattern. Even
though anticipated, they remind us that we should not look at the TLB miss latency in a
vacuum.
[Figure 5.14 plot: average L1-TLB miss latency (cycles, 0 to 60) versus BCPP, one series per pool size (PS-64 to PS-2048).]
Figure 5.14: Average L1-TLB Miss Latency in cycles. No TLB misses exist for the PS-64 series in these last 16M references, as explained earlier.
5.9.2 Effect of Per-Page Access Pattern on Baseline
The block access pattern, i.e., how cache blocks are accessed within each page (e.g., sequential,
stride, random, etc.), can also affect execution time, even though it has no bearing on the
number of TLB misses. As page tables contend with the workload’s data for cache capacity,
the latency of a page walk as well as the latency of each memory request can vary.
5.9.3 Effect of Data Sharing on Baseline
Figures 5.15 and 5.16 demonstrate the impact of data sharing on the average TLB miss latency
and memory request latency respectively. Results are presented as the percentage cycle reduc-
tion achieved with a shared data pattern compared to a private pattern. For the shared pool
pattern the same pool contents are replicated for each core. For the 2K pool size, a shared
pattern has a significant positive impact both on the TLB miss penalty and on the average
latency of each memory request, up to 38% and 62% decrease respectively. This impact is more
pronounced for large BCPP values where the data footprint and the execution time are much
higher. The two main reasons are: (i) the overall CMP data footprint is now 1/16th of the
private pattern footprint and can thus fit in upper level caches, and (ii) all cores will access
the same page-table entries which will thus occupy fewer cache resources, and which can also
survive in upper caches due to the smaller memory footprint discussed earlier. For small pool
sizes (e.g., up to PS-128), a shared pattern can, in some instances, increase execution time
(negative percentages); coherence overheads and remote private cache accesses are the most
likely causes.
[Figure 5.15 plot: % L1-TLB miss latency reduction over the private pattern (-40 to 60) versus BCPP, one series per pool size (PS-64 to PS-2048).]
Figure 5.15: Shared versus Private: Effect of data sharing on L1-TLB miss latency.
[Figure 5.16 plot: % memory request latency reduction over the private pattern (-40 to 80) versus BCPP, one series per pool size (PS-64 to PS-2048).]
Figure 5.16: Shared versus Private: Effect of data sharing on average memory latency.
5.9.4 Effect of Process Mix on Baseline
The process mix running on the CMP can also impact the TLB miss handling latency. The
following three configurations were modeled: (i) global : a single process is running across all 16
CMP cores (this is the configuration for the previous sections), (ii) private: a separate process
is running on each core, and lastly (iii) ctxt 2 : two processes are running on each core with no
process overlap across cores.
For this experiment, when multiple processes are running per core (private, ctxt 2 configu-
rations), the number of unique virtual page numbers in the pool is the pool size divided by the
number of contexts. This ensures that for the same pool sizes the same number of TLB entries
is required irrespective of the process mix. A 64-page pool with the ctxt 2 configuration means
32 unique contiguous pages are first accessed under context a and then the same 32 unique
contiguous pages are accessed under context b.
TLB miss count remains the same for private and ctxt 2 configurations compared to global,
except for ctxt 2 with PS-80, which incurs 40% fewer misses than global or private for the same
BCPP. Because all processes share the same virtual page numbers, some translations will persist
in the TLB across pool passes. For a 4-way SA 64-entry TLB this behaviour is observed when
the pool size is in the [80, 94] range, i.e., when the pages of each context require between 2.5
and just under 3 TLB ways.
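The 40% reduction can be verified with a tiny model of the baseline 64-entry 4-way SA L1-TLB in which entries are tagged with a context but the index ignores it; this is a sketch consistent with the explanation above, with illustrative helper names.

```python
from collections import OrderedDict

SETS, WAYS = 16, 4   # the baseline 64-entry 4-way SA L1-TLB

def last_pass_misses(accesses, passes=10):
    """accesses: one pool pass as a list of (context, vpn) pairs.
    Indexing uses only the VPN (no ASID-aware indexing); entries are
    tagged with the context. Misses are counted in the final pass."""
    sets = [OrderedDict() for _ in range(SETS)]
    misses = 0
    for p in range(passes):
        for key in accesses:
            _ctx, vpn = key
            s = sets[vpn % SETS]
            if key in s:
                s.move_to_end(key)
            else:
                if p == passes - 1:
                    misses += 1
                if len(s) >= WAYS:
                    s.popitem(last=False)   # evict the LRU entry
                s[key] = True
    return misses

# PS-80, BCPP-1: global versus ctxt_2 (40 shared VPNs per context)
glob = [("g", v) for v in range(80)]
ctx2 = [("a", v) for v in range(40)] + [("b", v) for v in range(40)]
```

In steady state, the global pool misses on all 80 accesses per pass, while the ctxt_2 pool misses on only 48: the TLB sets that hold at most four (context, VPN) pairs retain their translations across passes.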
Figure 5.17 reports the number of cycles spent servicing a TLB miss on average for different
process mixes (graph series) and different PS and BCPP values (x-axis labels) for a private
sharing pattern. Figure 5.18 does the same but for a shared sharing pattern.
[Figure 5.17 plot, "Data Sharing: Private": L1-TLB miss latency (cycles, 0 to 60) for each BCPP (lower x-axis) and PS (upper x-axis) combination, with one series per process mix (global, private, ctxt_2).]
Figure 5.17: Private sharing pattern: Effect of process mix on baseline’s TLB miss latency.
For the ctxt 2 process mix, the pages in the pool no longer share the same first-level page-
table causing an increased page-walk latency. This increase becomes more pronounced for
larger footprints when the extra cached page-table entries push the workload’s footprint to a
lower cache level (e.g., L3). For BCPP >= 32, the footprint no longer fits in the L3 cache
due to conflicts even for PS-80 and PS-128, causing a drastic increase in the average memory
latency from 70 to 230 cycles. The private process mix also sees a TLB-miss latency increase
compared to global, but consistently less than ctxt 2. Even though there is no change from the
perspective of a core, having a separate per-core context rather than a global one means that
there can be no sharing of the PDEs for the upper page-walk levels across cores.
[Figure 5.18 plot, "Data Sharing: Shared": L1-TLB miss latency (cycles, 0 to 60) for each BCPP (lower x-axis) and PS (upper x-axis) combination, with one series per process mix (global, private, ctxt_2).]
Figure 5.18: Shared sharing pattern: Effect of process mix on baseline’s TLB miss latency.
Similar conclusions hold for the shared access pattern as well (Figure 5.18) but with a greater
percentage TLB miss latency increase compared to the private sharing pattern. This behaviour
is expected as the shared access pattern with the global process mix configuration had all cores
sharing not only all their page-translations but also all their memory accesses.
5.9.5 Private FMNs
Having explored the impact the various design knobs have on the baseline, this section will
now evaluate the proposed FMN design. This section’s results are for the private data sharing
pattern and the global process mix. Figure 5.19 depicts the percentage decrease in execution
time two FMN configurations - explained shortly - achieve over the baseline (denoted as B).
The higher the number the better; negative numbers signal performance degradation.
The ideal-FMN series is an unrealistic 1K-entry FMN configuration that assumes all FMN
probes always hit in the L1-D cache (2-cycle latency) without causing any interference with
the cache data. While neither of the two aforementioned conditions can hold in a real system,
the ideal FMN offers an absolute upper bound on FMN’s performance benefits, reflecting the
benefits of a standalone, non-cached FMN. This configuration is different from a similarly sized
L2-TLB because, contrary to FMN, L2-TLB probing precedes a page walk. The ideal-FMN
results show that good performance benefits can be achieved when PS <= FMNentries and
PS ∗ BCPP <= 1024, with higher speedup for smaller BCPP values. An explanation for
the significance of 1024 will follow later in this section. Although rare, it is possible for ideal-
FMN to cause negligible performance degradation compared to the baseline; the worst case
measured here is -1.16% for PS-256 and BCPP-32. The cause is the different time-ordering
and interaction of memory requests and page-walk requests in the caches. Since the ideal-FMN
has perfect hit-rate and a very fast constant access time, speculative memory requests can be
issued well before their corresponding page walk completes.
[Figure 5.19 plot: % execution time reduction (-80 to 80) for each PS (lower x-axis) and BCPP (upper x-axis) combination, for the B+FMN (private, 1K-entries) and B+ideal FMN configurations.]
Figure 5.19: Performance Impact: FMN versus Baseline.
The FMN series in Figure 5.19 achieves significant performance improvement for PS-80 and
PS-128 when one or two cache blocks are accessed per page. The measured execution time
reduction is around 58% for BCPP-1 and 49% for BCPP-2 for the two aforementioned pool
sizes that fit in the simulated private 1K-entry FMN. For BCPP values greater than eight,
FMN has negligible performance impact similar to the ideal-FMN, since TLB miss penalties do
not account for a significant portion of the execution time. However, Figure 5.19 shows a few
outliers where FMN causes significant performance degradation. For example, the PS-512 and
BCPP-2 run sees a 22.26% execution time increase under FMN, while the PS-1024 and BCPP-1
run sees a 77.77% increase. This behaviour at first appears to be counter-intuitive because in
both cases the required number of translation entries can fit in each per core FMN. Examining
FMN’s effect on L1-TLB miss latency and memory access time, the two components described
earlier in Section 5.6, will help us understand this behaviour.
Figure 5.20 illustrates the average TLB miss latency in cycles. Only memory accesses that
miss in the baseline TLB experience this latency. For the baseline configuration, this is the
average latency of the page walk, while when FMN is enabled it is min(FMN latency, page
walk latency). As anticipated, FMN significantly reduces TLB miss latency for BCPP <= 8
when PS <= FMNentries. For example, for PS-1024, FMN reduces TLB miss latency by
60.13% for BCPP-1 and by 56.88% for BCPP-8. FMN has no impact for PS-2048 that thrashes
the 1K-entry Direct-Mapped (DM) FMN. Therefore, FMN achieves its original goal of reducing
the TLB-miss latency without using any additional hardware resources.
[Figure 5.20 plot: average L1-TLB miss latency (cycles, 0 to 60) for each PS (lower x-axis) and BCPP (upper x-axis) combination, for Baseline (no FMN) and FMN.]
Figure 5.20: Average TLB Miss Latency in cycles.
Figure 5.21 depicts the average latency for each memory access, after its translation has been
retrieved either via the page walk or an FMN probe. As the pool size and accessed block count
increases, the memory latency also naturally increases because the required cache blocks start
spilling to lower level caches. The cases where the memory latency for an FMN-enabled system
is visibly greater than that of the baseline are the cases which suffer a performance degradation
in Figure 5.19. For example, for PS-512 and BCPP-2 the memory latency increases by 47.5%
from 24.98 to 36.85 cycles, while for PS-1024 and BCPP-1 the latency increases by 146.49%
from 30.92 to 76.21 cycles.
[Figure 5.21 plot: average memory latency (cycles, 0 to 250) for each PS (lower x-axis) and BCPP (upper x-axis) combination, for Baseline (no FMN) and FMN.]
Figure 5.21: Average Memory Latency in cycles; this is measured after translation is retrieved.
What causes such a drastic memory latency increase? To a large extent, it is a side-effect
of this sequential synthetic memory access pattern. For all PS and BCPP values, accesses that
are consecutive in time, from a core's perspective,
map to consecutive L1-D sets. However, these same accesses - assuming they exceed the L1
capacity of 512 cache blocks - will only occupy 25% of the L2 capacity, i.e., 1024 of the 4096 L2
cache blocks. This L2 mapping becomes relevant only when the L1 capacity of 512 cache
blocks is exceeded and lower-level caches are accessed. As a result, when data just barely fits
in the L2 cache in the baseline, adding an FMN will spill some of this data to the L3 cache,
causing a non-negligible performance degradation.
Because the FMN is a software structure that linearly maps to the physical address space,
it will map to contiguous cache sets. It will thus compete with, and displace, some regular
(demand) memory accesses even though a significant portion of the L2 cache is unoccupied. The
configurations “PS-1024 with BCPP-1” and “PS-512 with BCPP-2” are two such examples. In
these cases, FMN drastically increases the average memory latency because it causes a portion
of memory requests to access the L3 cache.
The aforementioned behavior might be one isolated pathological scenario but it is an inter-
esting exercise to identify the potential shortcomings of the proposed FMN. One can anticipate
such behaviour to manifest in boundary cases where the workload’s footprint nicely fits in a
subset of the cache hierarchy, and FMN causes data spills to lower cache levels. Note that
"PS-512 with BCPP-4", which is not such a boundary case since its data does not fit in the L2
cache in the first place, does not experience any performance degradation under FMN.
The following mechanisms, which warrant future exploration, could address these problematic
configurations. Compressing the size of each FMN entry would
allow more translations to fit in a cache line, thus reducing the contention between FMN en-
tries and memory requests for the same FMN size. This approach could take advantage of
FMN’s speculative nature to achieve this area reduction. A more ambitious approach would
be to depart from the sequential structure paradigm and have FMN steal cache blocks that are
invalid. Identifying the FMN entries which cause significant performance degradation by evict-
ing useful data could be another design option that would trade FMN hit rate to preserve the
baseline’s memory latency. Sections 5.9.7 and 5.9.8 explore two simple optimization examples.
The first uses probe filtering, while the second targets FMN allocation/replacement. But first,
Section 5.9.6 contrasts a design with FMN versus one with a L2-TLB.
5.9.6 Private FMNs versus Private L2-TLBs
Whereas Figure 5.19 compared the performance of FMN (B + FMN series) with a baseline
system with private L1-TLBs (B), Figure 5.22 also examines the performance of a system
where an L2-TLB has been added to the baseline (B + L2). The L2-TLB has the same
number of entries as the FMN (1K entries), but, unlike the direct-mapped FMN, the L2-
TLB is 8-way set-associative to match commercial state-of-the-art L2-TLB configurations (e.g.,
Haswell’s L2-TLB [38]). This section assumes an 8-cycle penalty for L1-TLB misses that hit in
the L2-TLB; Intel reports seven cycles for a TLB with half the entries and half the associativity
(SandyBridge’s 512-entry 4-way SA L2-TLB) [38]. The B+L2 series with half that penalty
is simulated strictly to demonstrate how sensitive performance can be to the L2-TLB access
latency, and it is not a realistic configuration.
[Figure 5.22 plot: % execution time reduction (-80 to 80) for each PS (lower x-axis) and BCPP (upper x-axis) combination, for B+L2-TLB (4 cycles), B+L2-TLB (8 cycles), and B+FMN (private, 1K-entries).]
Figure 5.22: Performance Impact: FMN compared to L2-TLB.
The FMN series significantly outperforms the “B + L2 (8 cycles)” configurations in all cases
where PS × BCPP < 1024. Latency is the sole reason: in all these cases, the needed translations
fit in both the FMN and the L2-TLB, so, assuming warmed-up structures, a page walk is
technically never needed. The FMN then provides a translation in 3 cycles (the L1 cache
latency), while the L2-TLB requires 8 cycles. The unrealistic 4-cycle L2-TLB closes the
performance gap, though most of the time the comparison still favours the FMN.
Chapter 5. The Forget-Me-Not TLB 126

When some of FMN's pathological scenarios discussed in earlier sections occur as the workload
footprint increases, the L2-TLB continues to reap performance benefits while the FMN suffers
performance degradation. “PS-1024 with BCPP-1” is the most characteristic
example. Once the needed translations exceed 1024 entries and no longer fit in the L2-TLB, the
two design options are comparable. In some of these cases, e.g., for “PS-2048 with BCPP-1”,
even the L2-TLB, which does not interfere with the on-chip cache data the way the FMN does,
can cause minor performance degradation, as all L2-TLB probes, which will invariably result in
a miss, precede the page walk.
To summarize, Figure 5.22 illustrated a number of cases where the FMN, which needs no
dedicated hardware storage for translation entries, can perform comparably to or better than a
dedicated hardware L2-TLB of the same size. Also, in cases where the L2-TLB yields no or
limited performance benefits, the FMN can offer comparable performance. Thus, if one can
tolerate the potential performance loss in the boundary cases, which is determined by the
workload's access pattern in conjunction with the underlying cache hierarchy, the FMN can be
a compelling design choice. The FMN can also be a compelling alternative for systems where
chip real-estate is at a premium and large dedicated second-level TLB structures cannot be
accommodated; sacrificing performance for area and power is a common design choice for such
architectures. Optimizing the FMN to further limit the extent of this potential performance loss
can make FMN designs even more appealing. Two simple optimizations are presented next.
5.9.7 Private FMNs: Filtering Optimization
Figure 5.23 illustrates the effect of FMN-probe filtering on the problematic scenarios discussed
earlier in this chapter. Since the average memory latency of a request, with and without FMN,
cannot be easily measured at runtime (offline profiling would be an option), FMN-filtering
targets FMN probes that are either too slow or useless. By avoiding FMN probes and allocation
requests that provide no TLB-miss latency reduction, we reduce the unnecessary destructive
interference of FMN with demand memory accesses in the cache hierarchy, thus minimizing
data spills to lower cache levels that would otherwise incur a hefty memory-latency increase.
The proposed FMN-filtering mechanism operates as follows. A 4-bit saturating counter
is used for every four FMN entries (here also four FMN sets, since the FMN is direct-mapped);
these four FMN entries fit in the same 64B cache line. Tracking FMN's usefulness at this
granularity is natural, as it is the granularity at which FMN data is allocated in the caches.
For a 1K-entry FMN, only 128B of additional storage are needed (1024 entries / 4 entries per
counter = 256 counters × 4 bits = 128B), the equivalent of just two extra cache lines. This
extra storage is not virtualized but resides at the TLB-miss controller.
Initially, all saturating counters are set to 15. The counter values define two operating
regions [15, 8] and [7, 0] as follows:
• Values [15, 8]: FMN-probes are always issued on a TLB miss. If the FMN-probe returns
after the page-walk or returns before but without valid information, the corresponding
saturating counter is decremented by one. If the FMN-probe returns first and is useful,
the counter is incremented by two. Timely FMN-hits should be more beneficial than
FMN-misses.
• Values [7, 0]: When the saturating counter reaches 7 (half of its initial value), no FMN
probes are issued. This continues while the counter value is in the [7, 0] range. The counter
is decremented by one on every TLB miss that would have triggered an FMN probe in the
four FMN sets represented by this counter. Once zero is reached, the counter is reset to 12.
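The two operating regions can be summarized as a small state machine. The sketch below is an illustrative software model of the mechanism (the class and method names are hypothetical), not the hardware implementation:

```python
class FMNFilterCounter:
    """4-bit saturating counter guarding FMN probes for four FMN sets
    (the four entries that share one 64B cache line)."""

    def __init__(self):
        self.value = 15  # initial value; [15, 8] = probing, [7, 0] = filtering

    def probes_enabled(self):
        return self.value >= 8

    def on_tlb_miss(self, probe_timely_and_useful=False):
        """Update the counter on an L1-TLB miss mapping to these four sets."""
        if self.probes_enabled():
            if probe_timely_and_useful:
                # Timely FMN hit: reward it twice as much as a miss costs.
                self.value = min(15, self.value + 2)
            else:
                # Probe returned after the page walk or without a translation.
                self.value -= 1
        else:
            # Filtering region: no probe issued, but keep counting misses.
            self.value -= 1
            if self.value == 0:
                self.value = 12  # reset: probes are re-enabled
```

Eight consecutive unhelpful probes move the counter from 15 into the filtering region; seven further filtered misses then reset it to 12, re-enabling probes.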
Even with the saturating counters properly reset, the benefits are minimal, if any. For the
“PS-1024 with BCPP-1” configuration, filtering decreases FMN's execution time by 3%, slightly
limiting the original performance degradation; 1.93% fewer FMN probes are issued in this case.
The results indicate that a more aggressive filtering mechanism is needed. One shortcoming
of this filtering approach is how it decides which probes to filter. For example, in the PS-128
with BCPP-4 case, no probes are filtered since the vast majority of probes are both useful and
timely; FMN still degrades performance because FMN probes, and their corresponding allocation
requests, push useful demand data down the cache hierarchy. The filtering mechanism proposed
here fails to capture and act upon this behaviour.
[Figure: % Execution Time Reduction (y-axis, -80 to 80) over PS (lower axis: 80 to 2048) and BCPP (upper axis: 1 to 128); series: FMN, FMN-filtered.]
Figure 5.23: FMN Filtering and FMN vs. Baseline
5.9.8 Private FMNs: Replacement Optimization
The PS-2048 with BCPP-1 configuration does not benefit from the 1K-entry FMN because the
sequential access pattern thrashes the cached FMN. Filtering FMN probes would prevent many
probes from being issued, since none of them would be FMN hits; however, the FMN structure
would remain, in principle, useless. This section explores a replacement optimization that
strives to keep part of the required translations in the FMN. The approach is motivated by
adaptive insertion and replacement policies previously proposed for caches [41, 61]. The idea is
to withhold allocation requests that would replace FMN entries, based on the value of a
saturating counter. This optimization does not affect FMN probes, only FMN allocation
requests and the data they might replace.
Chapter 5. The Forget-Me-Not TLB 128
Similar to the filtering optimization, 4-bit saturating counters are used, one for every four
FMN-sets. The counters, located at the TLB miss controller, are initialized to 15 and they
operate as follows:
• Values [15, 8]: Allocation and replacement requests are always sent for counter values in
that range (inclusive). If the FMN-probe returns after the page-walk, or returns before
but without valid information, the corresponding saturating counter is decremented by
one.
• Values [7, 0]: No allocation requests are sent. The counters are decremented by one, as
above, for useless or slow FMN-probe replies. However, when the FMN probe returns first
and with correct information, the saturating counter is incremented by 2, capped at 8.
The rationale is that if hits take place while we have chosen not to issue allocation
requests, then this design choice is working and part of the working footprint survives in
the FMN; it is therefore preferable to keep the saturating counter in that range and
perpetuate the decision. However, if the counter reaches 0, it is reset to its initial value
of 15. Reaching zero means there were 8 FMN accesses to these four FMN-sets without
any hits, so no benefit arises from withholding allocations and replacements of existing
FMN data. It is possible that such clusters of misses are due to a working-set change, or
that the access pattern is so irregular that no part of it survives in the FMN for the
current saturating-counter value range.
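As an illustration, the allocation-gating counter can be modeled analogously to the filtering counter. This is a hypothetical software sketch with the increment capped at 8 as described above; the text does not specify increments while allocations are enabled, so none are modeled there:

```python
class FMNAllocCounter:
    """4-bit saturating counter gating FMN allocation/replacement requests
    for four FMN sets: [15, 8] = allocate, [7, 0] = do not allocate."""

    def __init__(self):
        self.value = 15

    def allocations_enabled(self):
        return self.value >= 8

    def on_probe_reply(self, timely_hit):
        if timely_hit:
            if not self.allocations_enabled():
                # Hits while not allocating: the choice is working,
                # so nudge the counter up, capped at 8.
                self.value = min(8, self.value + 2)
        else:
            # Useless or slow probe reply: decrement in either region.
            self.value -= 1
            if self.value == 0:
                # 8 accesses without a hit: re-enable allocations.
                self.value = 15
```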
Using this replacement optimization for the PS-2048 with BCPP-1 configuration raised the
original 0% FMN hit-rate to 43%. It also yielded a small 1.39% performance improvement.
5.10 FMN’s Evaluation for Commercial Workloads
This section evaluates FMN using the commercial and cloud workloads described in Section 3.3,
plus canneal from PARSEC. All timing runs in this section executed until 1600 million memory
references were committed across the entire CMP (excluding page walks and FMN requests).
The first 400 million requests were used to warm up the TLBs and cache structures; the
measurements presented here are for the remaining 1200 million. This is not a complete
execution of the traces used in the previous two chapters; the time overheads of timing
simulation made that impossible within a reasonable timeframe (a week per simulation).
The remainder of this section is organized as follows. Section 5.10.1 first quantifies the
impact of address translation on the baselines, including the performance impact of adding
an L2-TLB. Section 5.10.2 reports the L1-TLB miss latency reduction FMN achieves, and
Section 5.10.3 presents FMN’s impact on the average memory request latency. Finally, Sec-
tion 5.10.4 presents the performance measurements.
5.10.1 Impact of Address Translation on Baseline’s Performance
Figure 5.24 depicts the percentage of execution time spent servicing L1-TLB misses in a system
without FMN support with and without the presence of an L2-TLB. The higher this value
is, the more potential for improvement from address translation optimizations exists. Two
baseline systems are listed with respect to their L1-TLB configurations: (i) B denotes the
baseline L1-TLBs from Section 5.8, while (ii) HB denotes a Half-Baseline where the L1-TLBs
are half in size (same associativity as B). The HB configuration is included as a potential proxy
for future systems. Specifically, the intention is to study what would be the effect of growing
data footprints. Since scaling the data footprint of existing applications is not practical due to
excessive simulation times, we instead scale down the size of the TLB to half. While this is only
an approximation, we believe that in lieu of an actual future application with a large footprint,
this is a relevant and thus valuable measurement that is well defined and feasible today.
[Figure: % execution time servicing L1-TLB misses (y-axis, 0 to 20%), per workload (apache, TPC-C2, TPC-C1, canneal, cassandra, classification, cloud9, nutch, streaming); series: B, B+L2, HB, HB+L2.]
Figure 5.24: Percentage of execution time spent in L1-TLB miss handling.
Figure 5.25 depicts the execution time reduction achieved by adding the aforementioned
L2-TLB to the respective B and HB baselines. These results complement those in Figure 5.24,
translating them into performance; negative values signify performance degradation. This
L2-TLB only hosts translations for 8KB pages. Table 5.3 reports the L2-TLB hit-rates. The
performance measurements show that adding a per-core private L2-TLB can in some cases
yield negligible performance benefits (cloud9, cassandra) or even cause performance degradation
(canneal, classification).
[Figure: % execution time reduction from adding the L2-TLB to B and HB respectively, per workload; series: B+L2, HB+L2.]
Figure 5.25: Percentage of execution time reduction due to L2-TLB.

Workload         B + L2   HB + L2
apache           67.9     79.2
TPC-C2           74.4     80.9
TPC-C1           87.2     80.6
canneal          31.0     54.1
cassandra        63.8     31.1
classification   4.2      1.8
cloud9           59.8     80.6
nutch            88.2     87.3
streaming        93.7     95.7

Table 5.3: L2-TLB Hit-Rate (%)

These observations were surprising given that many systems employ such a structure and it is
usually considerable in size (e.g., 1K entries or more). Unfortunately, to the best of our
knowledge, there are no published results on the performance benefit of an L2-TLB other than
the MPMI reduction it can achieve. There are two factors - pertinent to the L2-TLB design -
that set the stage for its poor performance:
• The L2-TLB access latency following an L1-TLB miss is considerable. Intel reports seven
cycles for a 512-entry 4-way SA L2-TLB (SandyBridge) [38]. We model an eight-cycle
L2-TLB access latency because our TLB has double the associativity and size (i.e., it is
1024-entry 8-way SA). This latency is comparable to a page walk in which all four accesses
hit in a 2-cycle L1 cache.
• The L2-TLB is probed in series with the page tables. Thus, on an L2-TLB miss, an extra
8 cycles of latency are added to the page-walk latency.
Section 5.11 discusses this issue and proposes an L2-TLB bypassing mechanism.
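The latency comparison above amounts to simple arithmetic: a four-level walk whose page-table accesses all hit in a 2-cycle L1 cache costs 4 × 2 = 8 cycles, matching the modeled L2-TLB lookup latency. A minimal sketch follows; only the 2-cycle L1 figure comes from the text, while the L2 and L3 latencies are placeholder assumptions:

```python
# Best-case page-walk latency when every page-table access hits in the
# same cache level. Only the 2-cycle L1 latency comes from the text;
# the other levels are illustrative placeholders.
CACHE_LATENCY = {"L1": 2, "L2": 8, "L3": 30}
WALK_LEVELS = 4  # four sequential page-table accesses per walk

def page_walk_latency(hit_level="L1"):
    return WALK_LEVELS * CACHE_LATENCY[hit_level]
```

`page_walk_latency("L1")` evaluates to 8 cycles, the same as the modeled L2-TLB lookup, which is why a serialized L2-TLB miss can double the cost of such short walks.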
5.10.2 FMN’s Impact on L1-TLB Miss Latency
FMN’s goal was to reduce the time spent servicing L1-TLB misses. Figure 5.26 shows the
average L1-TLB miss latency reduction over HB. The first column corresponds to a dedicated
L2-TLB. The remaining columns report several FMN configurations labeled as (FMN indexing
scheme, FMN entries). FMN is configured as either a 1K-entry or 8K-entry direct-mapped per
core virtualized structure. The precise indexing scheme (precise) uses the least significant bits
of the virtual page number assuming oracle page size knowledge. The 8KB_VPN scheme assumes
accesses are to 8KB pages. If during allocation the page is found to be a superpage, then only
the 8KB page that triggered the miss is allocated in the FMN.
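The difference between the two indexing schemes can be sketched as follows (a hypothetical model assuming SPARC-style page sizes with 8KB base pages; the function names are illustrative):

```python
FMN_ENTRIES = 1024  # 1K-entry direct-mapped FMN
# Assumed SPARC-style page sizes; only 8KB and 4MB are named in the text.
PAGE_SHIFT = {"8KB": 13, "64KB": 16, "512KB": 19, "4MB": 22}

def fmn_index_precise(vaddr, page_size):
    """Precise scheme: index with the low bits of the actual VPN
    (assumes oracle knowledge of the page size)."""
    return (vaddr >> PAGE_SHIFT[page_size]) % FMN_ENTRIES

def fmn_index_8kb_vpn(vaddr):
    """8KB_VPN scheme: always index as if the access targets an 8KB page."""
    return (vaddr >> PAGE_SHIFT["8KB"]) % FMN_ENTRIES
```

Under 8KB_VPN, two addresses inside the same 4MB superpage map to different FMN entries, splintering the superpage's translation, which is why this scheme trails precise indexing for superpage-heavy workloads such as classification.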
As anticipated, the two schemes perform the same for workloads that mostly rely on 8KB
pages. The L2-TLB hurts the average L1-TLB miss latency for classification, and to a lesser
extent for canneal, as explained earlier. For workloads like cassandra or classification, the
8KB_VPN scheme does not reduce the TLB-miss latency by as much as the precise indexing.
A 1K-entry FMN with 8KB_VPN indexing improves the average L1-TLB miss latency across
all workloads by 31.4%, while a dedicated 1K-entry 8-way SA L2-TLB reduces it by 24.6%.
Figure 5.27 shows how this L1-TLB miss latency reduction is reflected in the percentage of
execution time spent servicing L1-TLB misses under FMN.
[Figure: % average L1-TLB miss latency reduction over HB (y-axis, -20 to 80; one bar is off-scale at -73.96), per workload plus AMEAN; series: HB+L2, HB+FMN(precise, 1K), HB+FMN(8KB_VPN, 1K), HB+FMN(precise, 8K), HB+FMN(8KB_VPN, 8K).]
Figure 5.26: FMN or L2-TLB: Percentage L1-TLB Miss latency reduction over HB.
The average L1-TLB miss latency is 28.2 cycles for HB, 21.3 cycles for HB+L2 and 19.3
cycles for HB + FMN(8KB VPN, 1K). This latency greatly depends on where in the cache
hierarchy (assuming an L2-TLB miss) the required page-table entries are cached. If we were to
look only at the page-walk latency, we would see that adding an L2-TLB (HB+L2 configuration)
increases it. There might be fewer TLB-Misses in the HB+L2 configuration, but they are
more costly because now the required page-table entries are cached in lower-levels of the cache
hierarchy as they are accessed further apart in time. For instance, streaming has a 162.4-cycle
L2-TLB miss latency under HB+L2, whereas its average L1-TLB miss latency is only 15 cycles due to
its high L2-TLB hit-rate.

[Figure: % execution time spent handling L1-TLB misses (y-axis, 0 to 16%), per workload; series: HB, HB+L2, HB+FMN(precise, 1K), HB+FMN(8KB_VPN, 1K), HB+FMN(precise, 8K), HB+FMN(8KB_VPN, 8K).]
Figure 5.27: FMN or L2-TLB: Percentage of execution time spent handling L1-TLB misses.
FMN’s benefit greatly relies on its operating scenarios, described in Section 5.2.1. Not
only should an FMN probe return before a page walk, but it should also (a) find its FMN
set cached and (b) find a translation within that set. Figure 5.28 presents a breakdown of
FMN probes for the FMN(8KB VPN, 1K) configuration. The lower portion of the stacked
column represents the FMN probes that not only returned before the page-walk but also had
a correct translation. The second stacked part corresponds to FMN probes that returned first
but without a translation, due to FMN’s size and associativity constraints, as is the case with
any tagged structure. The “FMN cache misses” portion of each bar, which is too small to be
visible, covers cases where the FMN set a probe mapped to was not cached anywhere in the
three-level on-chip cache hierarchy. Finally, the upper portion of each column corresponds to
FMN probes that were useless because the page walk returned first. In this latter case, FMN
data do not remain as hot in the caches as the corresponding page-table entries. Because FMN
is configured here as a private per-core structure (an evaluation of a shared FMN structure is
left for future work), there is no potential for translation sharing across cores. The page-table
entries, on the other hand, can be transparently shared across cores. Entries for the upper
page-table levels are more likely to remain hot in the caches, so superpage accesses see more
benefits.
5.10.3 FMN’s Effect on Average Memory Latency
As discussed earlier in this chapter, FMN targets a TLB-miss-handling latency reduction but
at the potential cost of higher memory latency. In this context, memory latency represents
the time needed to retrieve the data for a load instruction, or to perform a store, once the
virtual address translation is known. Since the FMN injects more memory references to the
cache hierarchy, the FMN data will now compete both with (a) the demand data needed by
the application and (b) the page-table entries.

[Figure: breakdown of FMN probes (0 to 100%), per workload; categories: FMN Hits (Useful), FMN Misses in cached FMN set, FMN Cache Misses, Useless (page walk returned first).]
Figure 5.28: Characterization of FMN probes for a 1K-entry per core FMN with the 8KB_VPN indexing scheme.

Figure 5.29 depicts the average memory
latency increase for the HB, HB+L2, and the HB+FMN configurations.
[Figure: % average memory latency increase over HB (y-axis, -2 to 2), per workload; series: HB+L2, HB+FMN(precise, 1K), HB+FMN(8KB_VPN, 1K).]
Figure 5.29: FMN or L2-TLB: Percentage memory latency increase over HB.
The 1K-FMN configuration incurs an increase in the average memory request latency in the
range of 0.3% to 1.72% for the simulated workloads. This increase is due to increased contention
for the same cache blocks, as Figure 5.4 showed. As our analytical model showed (Figure 5.9),
this increase can be amortized by a high TLB miss latency reduction. Alternatively, filtering
techniques could be used to throttle the number of issued FMN probes. Figure 5.29 also shows
that, contrary to FMN, the L2-TLB can slightly reduce memory latency, by up to 1.62%, most
likely due to reduced contention between demand requests and page-table entries in the private
L1 and L2 caches, a positive side-effect of the TLB-miss reduction.
5.10.4 FMN’s Effect on Performance
Figure 5.30 contrasts the performance benefit of a dedicated private L2-TLB with that of an
FMN. FMN is on average less effective than an L2-TLB, achieving at most a 1.9% performance
improvement over HB (for the streaming workload). However, FMN achieves this without any
dedicated hardware for translation storage. In a couple of cases, FMN performs better than the
dedicated L2-TLB; for instance, for canneal, FMN does not degrade performance. The Ideal-
FMN series in this graph is a utopian upper bound for FMN's performance: it assumes an
FMN with a fixed 2-cycle latency (the ideal L1-cache hit case) and no memory contention or
allocation requests. Its purpose is to show that if we could somehow reduce how often certain
less-than-ideal scenarios occur (e.g., contention with data requests), then FMN could approach
this performance. There is significant upside motivating further work on optimizing the FMN;
FMN filtering/replacement is one such optimization and is examined next.
[Figure: % execution time reduction over HB (y-axis, -4 to 14), per workload; series: HB+L2, HB+FMN(precise, 1K), HB+FMN(8KB_VPN, 1K), HB+Ideal-FMN.]
Figure 5.30: FMN or L2-TLB: Performance over HB.
FMN-Filtering Mechanism: Since FMN can in some cases cause negative interference with
other requests, we considered the FMN filtering and replacement optimizations discussed in
Sections 5.9.7 and 5.9.8, respectively. Unfortunately, neither results in a significant performance
speedup compared to FMN. The only exception is the application of filtering to TPC-C2: it
yields a 1.18% performance improvement over the FMN(precise, 1K) configuration. A more
robust mechanism is thus needed.
5.11 L2-TLB Bypassing
Our results in Figure 5.25 indicate that adding a per-core private L2-TLB can in some cases
yield negligible performance benefits (cloud9, cassandra) or even cause performance degradation
(canneal, classification). These observations were surprising given that many systems employ
such a structure and it is usually considerable in size (e.g., ∼1K entries). Unfortunately, to the
best of our knowledge, there are no published results on the performance benefit of an L2-TLB
other than the MPMI reduction it can achieve.
We identified two factors - pertinent to the L2-TLB design - that are responsible for its poor
performance: (i) the high L2-TLB access latency and (ii) the serialization of L2-TLB lookups
with page walks. The workload's access pattern, and thus its TLB behaviour, is the determining
factor. If page walks are rare because the L2-TLB hit-rate is high, then the extra L2-TLB
lookup latency added to the infrequent page walks is negligible. The page-walk latency matters
too: adding seven or eight cycles on top of a 50-cycle page walk is a small penalty by comparison.
However, if most page-walk requests hit in the upper two cache levels and the page walk takes
around 15 cycles, the overhead is considerable.
Two scenarios in which the L2-TLB is likely to hurt performance are: (i) when the workload's
translation footprint is too large and no working set fits in the L2-TLB. For example, in canneal
almost half the L2-TLB accesses are misses, even after warm-up. In these cases, the extra over-
head of the L2-TLB lookup is not counterbalanced by the page-walk latency it saves. (ii) When
most of a workload’s L1-TLB misses are to superpages, and the set-associative L2-TLB only
supports the smallest page size. For example, in classification most TLB misses are to 4MB
pages, and the L2-TLB only supports translations for 8KB pages. Thus, the L2-TLB hit-rate
is less than 10% (see Figure 3.6.4).
Redesigning the L2-TLB to support multiple page-sizes (e.g., by employing our superpage-
friendly TLB design TLBpred or splitting superpage translations into their smaller 8KB pages)
could potentially mitigate the second scenario’s overhead to some extent. However, L2-TLBs
with low hit-rates due to large workload footprints and/or poor TLB capacity utilization from
translations of different page sizes would still be an issue. Furthermore, the proposed bypassing
mechanism is a low-cost alternative to a drastic L2-TLB redesign.
Probing the L2-TLB in parallel with the page walk would, in theory, resolve this issue. The
L2-TLB access would no longer be serialized with the page walk, and the translation would be
retrieved with either the L2-TLB lookup latency (on an L2-TLB hit) or the page-walk latency
(on an L2-TLB miss). Unfortunately, this approach has several shortcomings. It unnecessarily
initiates page walks even on L2-TLB hits; since most of the simulated workloads have L2-TLB
hit-rates well above 60%, more than half of these page walks would be wasteful. Such page
walks would waste energy on extra cache lookups, increase on-chip memory traffic, and likely
displace useful demand data from the upper-level caches in the process. The approach would
also require significant modifications to the existing hardware page walkers. For example, on
an L2-TLB hit, the in-progress page walk should be canceled to limit some of the negative
side-effects; the most appropriate point to do so would likely be once the current page-walk
request, for one of the multiple page-table levels, returns. This cancellation would also increase
energy.
5.11.1 Proposed Solution: Bypassing the L2-TLB
We propose using an interval-based predictor to decide when to commence the page walk:
(a) immediately after an L1-TLB miss (L2-TLB bypassing) or (b) on an L2-TLB miss (the
system’s default option). A bypass condition triggers the transition from the default to the
L2-TLB bypassing option when:
(# L2-TLB hits <= M * # L2-TLB misses) and (# L2-TLB misses > threshold_misses)
The decision is made on a per-core basis. The first part of the expression ensures we start a
page walk on an L1-TLB miss only when the L2-TLB hit-rate is low; 50% was the empirical
hit-rate threshold value used for our workloads (M = 1). The second part ensures there is a
non-negligible number of L2-TLB misses within the selected interval; otherwise, the hit-rate
value would be meaningless. The threshold_misses value depends on the interval length: a
disproportionately large threshold_misses value for a small interval would never enable
L2-TLB bypassing. We found 1024 misses to be a good threshold for the evaluated intervals.
Table 5.4 lists all the possible scenarios for consecutive intervals. But first, the
following terms are defined:
• Interval: An interval determines the granularity at which bypassing decisions are made.
This evaluation defines an interval as a fixed number of memory replies, excluding page-walk
replies, received per core. Using a timestamp-based interval would be an alternative.
• Bypass Interval: An interval during which a page walk is initiated on an L1-TLB miss.
• Fallback Interval: An interval during which the system falls back to no L2-TLB bypassing.
A fallback interval occurs trigger intervals after a positive bypassing decision was
made. Having fallback intervals ensures that bypassing decisions adapt to the workloads'
changing behaviour and that potentially harmful decisions are rectified.
As Table 5.4 indicates, once a decision to bypass the L2-TLB has been made, all subsequent
intervals conform to it until a fallback interval is reached. At that time, bypassing is disabled
for one interval and the decision is then re-evaluated. Translations are allocated in the L2-TLB
even when it is bypassed to ensure it remains warmed-up for the next fallback interval.
Interval n      Interval n+1   Trigger / Explanation
No Bypassing    No Bypassing   The bypassing condition was not met.
No Bypassing    Bypassing      The bypass condition was met; starting the countdown for the next fallback interval.
Bypassing       No Bypassing   Interval n+1 is a fallback interval.
Bypassing       Bypassing      Default, unless a fallback interval was reached.

Table 5.4: L2-TLB Bypassing Scenarios
Fallback intervals allow the technique to adapt to the workloads' changing patterns and
also safeguard against poor decisions. However, when there are no such changes, the fallback
intervals can be unnecessary. We propose to further adapt the frequency of these fallback
intervals based on how stable (i.e., repetitious) the application's behaviour proves to be. That
is, if two consecutive fallback intervals4 reach the same decision, we double the trigger value to
make fallbacks less frequent. If two consecutive fallback intervals contradict each other, we
halve the trigger value, unless the minimum trigger value has been reached. We use an
empirically determined initial trigger value of 10 for this evaluation.
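Putting the bypass condition, the fallback intervals, and the adaptive trigger together, the per-core decision logic can be sketched as below. This is an illustrative model: M = 1, threshold_misses = 1024, and the initial trigger of 10 are the values used in this evaluation, while the class and method names are hypothetical:

```python
class L2TLBBypassPredictor:
    """Per-core interval-based L2-TLB bypass predictor (illustrative sketch)."""

    M = 1                    # bypass when hits <= M * misses
    THRESHOLD_MISSES = 1024  # minimum misses for the hit-rate to be meaningful
    MIN_TRIGGER = 10         # initial and minimum fallback trigger

    def __init__(self):
        self.state = "normal"  # "normal" | "bypass" | "fallback"
        self.countdown = 0
        self.trigger = self.MIN_TRIGGER
        self.prev_fallback_decision = None

    def _condition(self, hits, misses):
        return hits <= self.M * misses and misses > self.THRESHOLD_MISSES

    def end_of_interval(self, l2_hits, l2_misses):
        """Advance one interval; returns True if the next interval bypasses."""
        if self.state == "bypass":
            self.countdown -= 1
            if self.countdown == 0:
                self.state = "fallback"  # next interval probes the L2-TLB again
                return False
            return True
        decision = self._condition(l2_hits, l2_misses)
        if self.state == "fallback":
            # Consecutive fallback intervals agreeing -> rarer fallbacks;
            # contradicting -> more frequent fallbacks (floor at MIN_TRIGGER).
            if decision == self.prev_fallback_decision:
                self.trigger *= 2
            else:
                self.trigger = max(self.MIN_TRIGGER, self.trigger // 2)
            self.prev_fallback_decision = decision
        if decision:
            self.state = "bypass"
            self.countdown = self.trigger
            return True
        self.state = "normal"
        return False
```

Note that translations are still allocated into the L2-TLB during bypass intervals, as described above, so the structure stays warm for the next fallback interval; the model only captures the probe/bypass decision.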
The proposed technique strikes a balance between alleviating the latency overhead of an
L2-TLB lookup that is likely to miss and blindly initiating all page walks in parallel with
L2-TLB lookups. It does not save L2-TLB lookup energy, as translations are allocated in the
second-level TLB at all times to keep the structure warm. No changes to the hardware page
walker are required: no page walks need to be aborted because an L2-TLB lookup returned
earlier, as in the scenario above; page walks simply start earlier during bypassing intervals.
The proposed scheme treats the L2-TLB as a black box and decides whether it is likely
that a translation, any translation, will hit in it. But one can also envision different types of
L2-TLB bypassing techniques. For example, predicting whether a specific translation would
hit in the L2-TLB might be an interesting future direction, albeit at a higher area cost. The
L2-TLB bypassing predictor design could also be reminiscent of predictors for large stacked
DRAM caches, which determine whether the tag lookup should proceed in parallel with the
data lookup, or whether the DRAM cache should be bypassed and main memory accessed
directly when a miss is predicted [50, 62].
5.11.2 L2-TLB Bypassing: Evaluation
This section evaluates the proposed L2-TLB bypassing predictor. Figure 5.31 presents the
percentage of execution time reduction over HB when using L2-TLB bypassing with different
(interval size, misses threshold) configurations. L2-TLB bypassing benefits classification and
cloud9, reducing to close to zero the performance degradation the L2-TLB causes for
classification. For the remaining workloads, it either has no impact, as expected, or slightly
decreases the L2-TLB's benefits. Canneal, the other workload targeted, sees no benefit due to
the selected bypassing condition; it needs a slightly higher hit-rate threshold.

4Consecutive fallback intervals are not consecutive intervals.
[Figure: % execution time reduction over HB via bypassing (interval, misses threshold) (y-axis, -2 to 6), per workload; series: HB+L2, Bypassing(100K, 0), Bypassing(100K, 1024), Bypassing(500K, 0), Bypassing(500K, 1024).]
Figure 5.31: Percentage of execution time reduction with L2-TLB bypassing.
5.12 Concluding Remarks
This chapter presented our Forget-Me-Not TLB, a cacheable TLB design. FMN reduces
TLB-miss latency without using any dedicated on-chip translation storage; instead, it uses the
existing on-chip cache capacity to transparently store translation entries on demand. A private
per-core 1K-entry FMN configuration reduces L1-TLB miss latency by up to 45% on a set of
commercial and cloud workloads. However, it also increases the average memory request latency
by up to 1.72%, yielding at most a 1.9% performance improvement. This chapter also presented
dynamic selective L2-TLB bypassing, a technique that results in more robust performance when
using an L2-TLB. We were motivated by the observation that an L2-TLB, which contrary to
the FMN has a fixed access latency, can hurt performance in some cases. Our technique
dynamically determines when the page walk should commence immediately after an L1-TLB
miss, thus bypassing the L2-TLB; it reduces the performance degradation in classification over
the baseline from 1.61% to almost zero (0.02%).
Chapter 6
Concluding Remarks
Address translation is and will continue to be an intrinsic facility of computer systems, at
least in the foreseeable future. However, as both the software (applications, programming
paradigms) and hardware architectures continue to evolve, we need to continuously revisit the
hardware and/or software facilities that support it. As is often the case in computer architecture
research, computer architects opt for making the common case fast. But the current diversity
of workloads makes identifying this common case more nuanced than in the past, challenging
rigid hardware designs that are biased by designing towards this common case. This thesis
advocates for TLB designs and policies that dynamically adapt to the workloads’ behaviour for
a judicious use of the available on-chip resources.
To understand which aspects of workload behaviour and system architecture influence TLB
usage in a chip multiprocessor system, this thesis analyzed the TLB-related
behaviour for a set of commercial and cloud workloads (Chapter 3). These workloads stress
the memory subsystem, and thus the existing address translation infrastructure. We classify
our measurements according to the following taxonomy: (i) characteristics inherent to the
workloads, that is, characteristics or metrics not influenced by the architecture of translation
caching structures like the TLBs, and (ii) other metrics that are influenced by these structures’
architecture. The former helps us relate the application requirements to system design choices and
identify opportunities for optimization. The latter helps us identify shortcomings of existing
state-of-the-art TLB hierarchies. The analysis covered a broad spectrum of questions: from
the sizing requirements of translation caching structures and their per-core variations, to the
reach and lifetime of different contexts and the frequency of translation modifications, to
more nuanced observations about the compressibility of translation entries and the
predictability of the cache block address within a page that triggers a TLB miss. A key
result of our analysis was quantifying the drastically different page size usage distributions across
workloads, with a bias for one superpage size when superpages are used. Our TLB capacity
sensitivity study shows this characteristic is at odds with current split L1-TLB designs that
make rigid sizing decisions about the translation capacity allocated to each page size.
The Prediction-Based Superpage-Friendly TLB designs proposed in Chapter 4 target this
discrepancy. Their key ingredient is a highly accurate superpage predictor that predicts, ahead
of time, whether the next access is to a small page or to a superpage. A small 128-entry predictor
table with a meager 32B of storage has an average misprediction rate of 0.4% across all our
workloads. This predictor enables TLBpred, a set-associative TLB where translations of multiple
page sizes can co-exist as needed. A 256-entry 4-way SA TLBpred has comparable energy to a
much smaller 48-entry FA TLB, which has significantly higher MPMI and cannot scale
as well. Chapter 4 also presented an evaluation of the previously proposed Skewed-TLB [69]
and augmented it with a predictor that extends its per-page-size effective associativity.
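As an illustrative sketch of the mechanism, the predictor can be a small table of two-bit saturating counters whose prediction selects which address bits index the set-associative TLB. The table size matches the 32B predictor above, but the PC-based hash and the page sizes shown are assumptions made for this example, not the exact design of Chapter 4.

```python
SMALL_PAGE_BITS = 13   # assumed 8KB base pages
SUPERPAGE_BITS = 22    # assumed 4MB superpages
NUM_SETS = 64          # e.g., a 256-entry, 4-way set-associative TLB


class SuperpagePredictor:
    """128 two-bit saturating counters: 32B of predictor state in total."""

    def __init__(self, entries=128):
        self.entries = entries
        self.table = [0] * entries  # counters start biased to "small page"

    def predict(self, pc):
        # True => predict the access touches a superpage.
        return self.table[pc % self.entries] >= 2

    def train(self, pc, was_superpage):
        # Update the counter with the actual page size after translation.
        i = pc % self.entries
        if was_superpage:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def tlb_set_index(vaddr, superpage):
    # Index with bits above the predicted page offset, so translations of
    # different page sizes can coexist in one set-associative structure.
    page_bits = SUPERPAGE_BITS if superpage else SMALL_PAGE_BITS
    return (vaddr >> page_bits) % NUM_SETS
```

On a misprediction the lookup is simply retried with the other page size's index bits, which is why a low misprediction rate is essential to the design.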
Finally, Chapter 5 presented our Forget-Me-Not (FMN) design, a cacheable TLB that uses
the on-chip caches to host translations as needed. A per core private 1K-entry direct-mapped
FMN with 8KB VPN indexing improves the arithmetic mean of the L1-TLB miss latency across
all workloads by 31.4%, over a baseline with only L1-TLBs, while a dedicated private 1K-entry
8-way SA L2-TLB improves it by 24.6%. Nevertheless, the overall performance impact
was relatively small. Even with a dedicated L2-TLB, our evaluation shows small
performance benefits of at most 5.6%, and in some cases performance degradation of up to 1.6%.
We also proposed an L2-TLB bypassing mechanism as a potential first-step solution to limit
the latter. One of the key takeaways of this work is the observation that TLBs, memory
accesses, and cached translation entries, be it in page-tables or the FMN, are all parts of a
highly interrelated ecosystem. Optimizing one aspect of this ecosystem (e.g., by reducing the
frequency of TLB misses, or reducing the latency of a TLB miss) has ramifications, often
unwelcome, for another aspect of the ecosystem. For example, reducing the percentage of
execution time spent servicing TLB misses does not necessarily imply a performance speedup.
Thus, it is imperative that we do not think of these address translation components in a vacuum,
even though it is significantly easier to do so.
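To make the FMN's operation concrete, the following sketch shows how a direct-mapped FMN could map a virtual page number to the physical address of the cache-resident entry that may hold its translation. The base address of the reserved region and the per-entry size are assumptions for illustration.

```python
FMN_BASE = 0x8000_0000   # assumed base of a reserved physical region
FMN_ENTRIES = 1024       # 1K-entry, direct-mapped
ENTRY_SIZE = 16          # assumed bytes per cached translation entry
PAGE_BITS = 13           # 8KB pages: the VPN is vaddr >> 13


def fmn_entry_address(vaddr):
    """Physical address of the FMN slot that may hold vaddr's translation."""
    vpn = vaddr >> PAGE_BITS
    index = vpn % FMN_ENTRIES   # direct-mapped: the VPN selects one slot
    return FMN_BASE + index * ENTRY_SIZE

# On an L1-TLB miss, this address is fetched through the ordinary cache
# hierarchy: a cache hit services the miss quickly, while a miss falls back
# to the page walk. Because distinct VPNs can map to the same slot, the
# fetched entry's tag (full VPN plus context) must be verified before use.
```

The modular indexing also shows why the FMN competes with application data for specific cache sets, motivating the more flexible indexing schemes discussed as future work below.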
6.1 Future Research Directions
There are a multitude of ways to extend not only this research, but also
address translation support in general. For example, one of the FMN's design challenges is to ensure
the TLB-miss latency reduction does not come at a significant expense of the memory latency
of regular application data. This work relies on the default allocation and replacement policies
of the caches for this purpose without distinguishing between FMN data and other cached data.
Ideally, the FMN would occupy cache space that is not utilized, in the short-term, by other
data. A dead block inspired predictor could predict which cache blocks, at each cache-level, are
least likely to be used in the near future. However, the current FMN implementation might not
be able to take advantage of these cache blocks because of how the FMN maps to the various
cache sets. Reserving a larger physical memory region for FMN, similar to the ballooning
memory allocators used in virtualization, along with a more flexible indexing scheme could
potentially take advantage of this additional space. Dynamic FMN resizing, compressing more
FMN entries within a cache line, or prefetching could further extend the FMN's coverage.
This work focused solely on data TLBs in terms of analysis and evaluation. However,
instruction TLBs can also suffer from growing instruction footprints, especially for
OLTP-style workloads such as TPC-C. They could thus be an interesting research avenue, especially since
front-end processor stalls due to I-TLB misses would likely make any such optimizations quite
impactful. Address translation optimizations in virtualized systems may also be much more
impactful, given the higher overhead of two-dimensional page-walks.
Having the OS or the programmer provide hints to the underlying architecture about the
criticality of different memory regions (e.g., via ISA extensions) or the Quality-of-Service re-
quirements of a process could help the dynamic hardware policies make more informed deci-
sions. For example, when two processes stress a TLB but only one of them is critical to the
user, the TLB controller could either filter out translations for the low-importance process or
use a context-aware TLB indexing scheme to limit its negative interference with the translation
caching structure. Instead of relying on the hardware to dynamically relearn what the pro-
grammer already knows about a workload’s important data structures, this information could
be explicitly communicated to the hardware.
Bibliography
[1] D. H. Albonesi, “Selective cache ways: On-demand cache resource allocation,”
in Proceedings of the 32nd Annual ACM/IEEE International Symposium on
Microarchitecture, ser. MICRO 32. Washington, DC, USA: IEEE Computer Society,
1999, pp. 248–259. [Online]. Available: http://dl.acm.org/citation.cfm?id=320080.320119
[2] AMD, “AMD I/O Virtualization Technology (IOMMU) Specification.” [Online].
Available: http://developer.amd.com/wordpress/media/2012/10/34434-IOMMU-Rev_1.26_2-11-09.pdf
[3] AMD, “AMD-V nested paging,” 2008, [White Paper; accessed May-2017].
[Online]. Available: http://developer.amd.com/wordpress/media/2012/10/NPT-WP-1%201-final-TM.pdf
[4] AMD, “Software Optimization Guide for AMD Family 10h and 12h Processors,” 2011,
[Online; accessed August-2014]. [Online]. Available: http://support.amd.com/TechDocs/
40546.pdf
[5] AMD, “Software Optimization Guide for AMD Family 15h Processors,” 2014, [Online;
accessed February-2017]. [Online]. Available: https://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf
[6] N. Amit, M. Ben-Yehuda, and B.-A. Yassour, “IOMMU: Strategies for mitigating the
IOTLB bottleneck,” in WIOSCA 2010: Sixth Annual Workshop on the Interaction between
Operating Systems and Computer Architecture, 2010.
[7] ARM, “ARM Cortex-A53 MPCore Processor, Technical Reference Manual,” [PDF accessed
June-2017]. [Online]. Available: https://static.docs.arm.com/ddi0500/f/DDI0500.pdf
[8] ARM, “ARM Cortex-A72 MPCore Processor, Technical Reference Manual,” [PDF
accessed February-2017]. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.100095_0003_06_en/cortex_a72_mpcore_trm_100095_0003_06_en.pdf
[9] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau, Operating Systems: Three Easy Pieces,
0th ed. Arpaci-Dusseau Books, May 2015.
[10] T. W. Barr, A. L. Cox, and S. Rixner, “Translation caching: Skip, don’t walk (the
page table),” in Proceedings of the 37th Annual International Symposium on Computer
Architecture, ser. ISCA ’10. New York, NY, USA: ACM, 2010, pp. 48–59. [Online].
Available: http://doi.acm.org/10.1145/1815961.1815970
[11] T. W. Barr, A. L. Cox, and S. Rixner, “SpecTLB: A mechanism for speculative address
translation,” in Proceedings of the 38th Annual International Symposium on Computer
Architecture, ser. ISCA ’11. New York, NY, USA: ACM, 2011, pp. 307–318. [Online].
Available: http://doi.acm.org/10.1145/2000064.2000101
[12] A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift, “Efficient virtual memory
for big memory servers,” in Proceedings of the 40th Annual International Symposium on
Computer Architecture, ser. ISCA ’13. New York, NY, USA: ACM, 2013, pp. 237–248.
[Online]. Available: http://doi.acm.org/10.1145/2485922.2485943
[13] A. Basu, M. D. Hill, and M. M. Swift, “Reducing memory reference energy with
opportunistic virtual caching,” in Proceedings of the 39th Annual International Symposium
on Computer Architecture, ser. ISCA ’12. Washington, DC, USA: IEEE Computer
Society, 2012, pp. 297–308. [Online]. Available: http://dl.acm.org/citation.cfm?id=
2337159.2337194
[14] M. Ben-Yehuda, J. Xenidis, M. Ostrowski, K. Rister, A. Bruemmer, and L. van Doorn,
“The price of safety: Evaluating IOMMU performance,” in OLS ’07: The 2007 Ottawa
Linux Symposium, July 2007, pp. 9–20.
[15] R. Bhargava, B. Serebrin, F. Spadini, and S. Manne, “Accelerating two-dimensional
page walks for virtualized systems,” in Proceedings of the 13th International Conference
on Architectural Support for Programming Languages and Operating Systems, ser.
ASPLOS XIII. New York, NY, USA: ACM, 2008, pp. 26–35. [Online]. Available:
http://doi.acm.org/10.1145/1346281.1346286
[16] A. Bhattacharjee, “Large-reach memory management unit caches,” in Proceedings
of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, ser.
MICRO-46. New York, NY, USA: ACM, 2013, pp. 383–394. [Online]. Available:
http://doi.acm.org/10.1145/2540708.2540741
[17] A. Bhattacharjee, “Preserving the Virtual Memory Abstraction,” May 2017, [ACM
Sigarch Blog; accessed September-2017]. [Online]. Available: https://www.sigarch.org/
preserving-the-virtual-memory-abstraction/
[18] A. Bhattacharjee, “Translation-triggered prefetching,” in Proceedings of the Twenty-
Second International Conference on Architectural Support for Programming Languages
and Operating Systems, ser. ASPLOS ’17. New York, NY, USA: ACM, 2017, pp. 63–76.
[Online]. Available: http://doi.acm.org/10.1145/3037697.3037705
[19] A. Bhattacharjee, D. Lustig, and M. Martonosi, “Shared last-level TLBs for chip
multiprocessors,” in Proceedings of the 2011 IEEE 17th International Symposium
on High Performance Computer Architecture, ser. HPCA ’11. Washington, DC,
USA: IEEE Computer Society, 2011, pp. 62–63. [Online]. Available: https:
//doi.org/10.1109/HPCA.2011.5749717
[20] A. Bhattacharjee and M. Martonosi, “Characterizing the TLB behavior of emerging
parallel workloads on chip multiprocessors,” in Proceedings of the 2009 18th International
Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’09.
Washington, DC, USA: IEEE Computer Society, 2009, pp. 29–40. [Online]. Available:
http://dx.doi.org/10.1109/PACT.2009.26
[21] A. Bhattacharjee and M. Martonosi, “Inter-core cooperative TLB for chip multiprocessors,”
in Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for
Programming Languages and Operating Systems, ser. ASPLOS XV. New York, NY,
USA: ACM, 2010, pp. 359–370. [Online]. Available: http://doi.acm.org/10.1145/1736020.
1736060
[22] C. Bienia, “Benchmarking modern multiprocessors,” Ph.D. dissertation, Princeton Uni-
versity, January 2011.
[23] D. L. Black, R. F. Rashid, D. B. Golub, and C. R. Hill, “Translation lookaside
buffer consistency: A software approach,” in Proceedings of the Third International
Conference on Architectural Support for Programming Languages and Operating Systems,
ser. ASPLOS III. New York, NY, USA: ACM, 1989, pp. 113–122. [Online]. Available:
http://doi.acm.org/10.1145/70082.68193
[24] J. Bradford, J. Dale, K. Fernsler, T. Heil, and J. Rose, “Multiple page size address
translation incorporating page size prediction,” Jun. 15, 2010, U.S. Patent 7,739,477.
[Online]. Available: https://www.google.com/patents/US7739477
[25] I. Burcea and A. Moshovos, “Phantom-BTB: a virtualized branch target buffer design,” in
Proceedings of the 14th International Conference on Architectural Support for Programming
Languages and Operating Systems, ser. ASPLOS ’09. New York, NY, USA: ACM, 2009,
pp. 313–324. [Online]. Available: http://doi.acm.org/10.1145/1508244.1508281
[26] I. Burcea, S. Somogyi, A. Moshovos, and B. Falsafi, “Predictor virtualization,” in
Proceedings of the 13th International Conference on Architectural Support for Programming
Languages and Operating Systems, ser. ASPLOS XIII. New York, NY, USA: ACM, 2008,
pp. 157–167. [Online]. Available: http://doi.acm.org/10.1145/1346281.1346301
[27] M. Clark, “A new X86 core architecture for the next generation of computing,”
August 2016, [Presentation in Hot Chips Symposium, accessed September-2017].
[Online]. Available: https://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.23-Tuesday-Epub/HC28.23.90-High-Perform-Epub/HC28.23.930-X86-core-MikeClark-AMD-final_v2-28.pdf
[28] G. Cox and A. Bhattacharjee, “Efficient address translation for architectures with
multiple page sizes,” in Proceedings of the Twenty-Second International Conference
on Architectural Support for Programming Languages and Operating Systems, ser.
ASPLOS ’17. New York, NY, USA: ACM, 2017, pp. 435–448. [Online]. Available:
http://doi.acm.org/10.1145/3037697.3037704
[29] P. J. Denning, “Virtual memory,” ACM Comput. Surv., vol. 2, no. 3, pp. 153–189, Sep.
1970. [Online]. Available: http://doi.acm.org/10.1145/356571.356573
[30] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak,
A. D. Popescu, A. Ailamaki, and B. Falsafi, “Clearing the clouds: a study of emerging
scale-out workloads on modern hardware,” in Proceedings of the seventeenth International
Conference on Architectural Support for Programming Languages and Operating Systems,
ser. ASPLOS ’12. New York, NY, USA: ACM, 2012, pp. 37–48. [Online]. Available:
http://doi.acm.org/10.1145/2150976.2150982
[31] J. Fotheringham, “Dynamic storage allocation in the Atlas computer, including an
automatic use of a backing store,” Commun. ACM, vol. 4, no. 10, pp. 435–436, Oct. 1961.
[Online]. Available: http://doi.acm.org/10.1145/366786.366800
[32] J. Gandhi, A. Basu, M. D. Hill, and M. M. Swift, “BadgerTrap: A tool to instrument
x86-64 TLB misses,” SIGARCH Comput. Archit. News, vol. 42, no. 2, pp. 20–23, Sep.
2014. [Online]. Available: http://doi.acm.org/10.1145/2669594.2669599
[33] B. Hall, P. Bergner, A. Housfater, M. Kandasamy, T. Magno, A. Mericas, S. Munroe,
M. Oliveira, B. Schmidt, W. Schmidt et al., Performance Optimization and Tuning
Techniques for IBM Power Systems Processors Including IBM POWER8. IBM Redbooks,
2017. [Online]. Available: https://books.google.ca/books?id=7ph0CgAAQBAJ
[34] P. Hammarlund, “4th Generation Intel Core Processor, codenamed Haswell,”
August 2013, [Presentation in Hot Chips Symposium, accessed August-2014].
[Online]. Available: http://www.hotchips.org/wp-content/uploads/hc_archives/hc25/HC25.80-Processors2-epub/HC25.27.820-Haswell-Hammarlund-Intel.pdf
[35] N. Hardavellas, S. Somogyi, T. F. Wenisch, R. E. Wunderlich, S. Chen, J. Kim,
B. Falsafi, J. C. Hoe, and A. G. Nowatzyk, “SimFlex: a fast, accurate, flexible
full-system simulation framework for performance evaluation of server architecture,”
SIGMETRICS Perform. Eval. Rev., vol. 31, no. 4, pp. 31–34, Mar. 2004. [Online].
Available: http://doi.acm.org/10.1145/1054907.1054914
[36] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, “Reactive NUCA:
Near-optimal block placement and replication in distributed caches,” in Proceedings
of the 36th Annual International Symposium on Computer Architecture, ser.
ISCA ’09. New York, NY, USA: ACM, 2009, pp. 184–195. [Online]. Available:
http://doi.acm.org/10.1145/1555754.1555779
[37] Intel, “Intel Virtualization Technology for Directed I/O, Architecture Specification.”
[Online]. Available: https://www.intel.com/content/dam/www/public/us/en/documents/
product-specifications/vt-directed-io-spec.pdf
[38] Intel, “Intel 64 and IA-32 Architectures Optimization Reference Manual,” June 2016,
[PDF accessed February-2017]. [Online]. Available: http://www.intel.com/content/dam/
www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
[39] Intel, “Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A:
System Programming Guide, Part 1,” April 2016, [PDF accessed June-2016]. [On-
line]. Available: http://www.intel.com/content/www/us/en/architecture-and-technology/
64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
[40] Intel, “5-Level Paging and 5-Level EPT,” 2017, [White Paper; revision 1.1;
accessed September-2017]. [Online]. Available: https://software.intel.com/sites/default/files/managed/2b/80/5-level_paging_white_paper.pdf
[41] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer, “High performance
cache replacement using re-reference interval prediction (RRIP),” in Proceedings
of the 37th Annual International Symposium on Computer Architecture, ser.
ISCA ’10. New York, NY, USA: ACM, 2010, pp. 60–71. [Online]. Available:
http://doi.acm.org/10.1145/1815961.1815971
[42] G. B. Kandiraju and A. Sivasubramaniam, “Going the distance for TLB prefetching: An
application-driven study,” in Proceedings of the 29th Annual International Symposium on
Computer Architecture, ser. ISCA ’02. Washington, DC, USA: IEEE Computer Society,
2002, pp. 195–206. [Online]. Available: http://dl.acm.org/citation.cfm?id=545215.545237
[43] V. Karakostas, J. Gandhi, A. Cristal, M. D. Hill, K. S. McKinley, M. Nemirovsky,
M. M. Swift, and O. S. Unsal, “Energy-efficient address translation,” in 2016 IEEE
International Symposium on High Performance Computer Architecture (HPCA), March
2016, pp. 631–643. [Online]. Available: https://doi.org/10.1109/HPCA.2016.7446100
[44] V. Karakostas, O. S. Unsal, M. Nemirovsky, A. Cristal, and M. Swift, “Performance
analysis of the memory management unit under scale-out workloads,” in 2014 IEEE
International Symposium on Workload Characterization (IISWC), Oct 2014, pp. 1–12.
[Online]. Available: https://doi.org/10.1109/IISWC.2014.6983034
[45] V. Karakostas, J. Gandhi, F. Ayar, A. Cristal, M. D. Hill, K. S. McKinley, M. Nemirovsky,
M. M. Swift, and O. Unsal, “Redundant memory mappings for fast access to large
memories,” in Proceedings of the 42nd Annual International Symposium on Computer
Architecture, ser. ISCA ’15. New York, NY, USA: ACM, 2015, pp. 66–78. [Online].
Available: http://doi.acm.org/10.1145/2749469.2749471
[46] S. Kaxiras and A. Ros, “A new perspective for efficient virtual-cache coherence,” in
Proceedings of the 40th Annual International Symposium on Computer Architecture,
ser. ISCA ’13. New York, NY, USA: ACM, 2013, pp. 535–546. [Online]. Available:
http://doi.acm.org/10.1145/2485922.2485968
[47] D. Kim, H. Kim, and J. Huh, “Virtual snooping: Filtering snoops in virtualized
multi-cores,” in Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium
on Microarchitecture, ser. MICRO ’43. Washington, DC, USA: IEEE Computer Society,
2010, pp. 459–470. [Online]. Available: http://dx.doi.org/10.1109/MICRO.2010.16
[48] H. Q. Le, W. J. Starke, J. S. Fields, F. P. O’Connell, D. Q. Nguyen, B. J. Ronchetti,
W. M. Sauer, E. M. Schwarz, and M. T. Vaden, “IBM POWER6 microarchitecture,” IBM
Journal of Research and Development, vol. 51, no. 6, pp. 639–662, Nov 2007.
[49] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT:
An integrated power, area, and timing modeling framework for multicore and manycore
architectures,” in Proceedings of the 42nd Annual IEEE/ACM International Symposium
on Microarchitecture, ser. MICRO 42. New York, NY, USA: ACM, 2009, pp. 469–480.
[Online]. Available: http://doi.acm.org/10.1145/1669112.1669172
[50] G. H. Loh and M. D. Hill, “Efficiently enabling conventional block sizes for very large
die-stacked DRAM caches,” in Proceedings of the 44th Annual IEEE/ACM International
Symposium on Microarchitecture, ser. MICRO-44. New York, NY, USA: ACM, 2011, pp.
454–464. [Online]. Available: http://doi.acm.org/10.1145/2155620.2155673
[51] P. Lotfi-Kamran, B. Grot, M. Ferdman, S. Volos, O. Kocberber, J. Picorel, A. Adileh,
D. Jevdjic, S. Idgunji, E. Ozer, and B. Falsafi, “Scale-out processors,” in Proceedings
of the 39th Annual International Symposium on Computer Architecture, ser. ISCA ’12.
Washington, DC, USA: IEEE Computer Society, 2012, pp. 500–511. [Online]. Available:
http://dl.acm.org/citation.cfm?id=2337159.2337217
[52] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg,
F. Larsson, A. Moestedt, and B. Werner, “Simics: A full system simulation
platform,” Computer, vol. 35, no. 2, pp. 50–58, Feb. 2002. [Online]. Available:
http://dx.doi.org/10.1109/2.982916
[53] R. McDougall and J. Mauro, Solaris Internals: Solaris 10 and OpenSolaris Kernel Archi-
tecture (Second Edition). Upper Saddle River, NJ, USA: Prentice Hall PTR, 2007.
[54] J. Navarro, S. Iyer, P. Druschel, and A. Cox, “Practical, transparent operating system
support for superpages,” SIGOPS Oper. Syst. Rev., vol. 36, no. SI, pp. 89–104, Dec. 2002.
[Online]. Available: http://doi.acm.org/10.1145/844128.844138
[55] M. Papadopoulou, X. Tong, A. Seznec, and A. Moshovos, “Prediction-based
superpage-friendly TLB designs,” in 2015 IEEE 21st International Symposium on High
Performance Computer Architecture (HPCA), Feb 2015, pp. 210–222. [Online]. Available:
https://doi.org/10.1109/HPCA.2015.7056034
[56] C. H. Park, T. Heo, and J. Huh, “Efficient synonym filtering and scalable delayed
translation for hybrid virtual caching,” in 2016 ACM/IEEE 43rd Annual International
Symposium on Computer Architecture (ISCA), June 2016, pp. 90–102. [Online]. Available:
https://doi.org/10.1109/ISCA.2016.18
[57] B. Pham, A. Bhattacharjee, Y. Eckert, and G. H. Loh, “Increasing TLB reach by
exploiting clustering in page translations,” in Proceedings of the 2014 IEEE 20th
International Symposium on High Performance Computer Architecture, ser. HPCA ’14,
February 2014. [Online]. Available: https://doi.org/10.1109/HPCA.2014.6835964
[58] B. Pham, V. Vaidyanathan, A. Jaleel, and A. Bhattacharjee, “CoLT: Coalesced large-reach
TLBs,” in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium
on Microarchitecture, ser. MICRO-45. Washington, DC, USA: IEEE Computer Society,
2012, pp. 258–269. [Online]. Available: http://dx.doi.org/10.1109/MICRO.2012.32
[59] S. Phillips, “M7: Next Generation SPARC,” August 2014, [Presenta-
tion in Hot Chips Symposium, accessed February-2017]. [Online]. Avail-
able: http://www.oracle.com/us/products/servers-storage/servers/sparc-enterprise/
migration/m7-next-gen-sparc-presentation-2326292.html
[60] B. Pichai, L. Hsu, and A. Bhattacharjee, “Architectural support for address
translation on GPUs: Designing memory management units for CPU/GPUs with
unified address spaces,” in Proceedings of the 19th International Conference on
Architectural Support for Programming Languages and Operating Systems, ser.
ASPLOS ’14. New York, NY, USA: ACM, 2014, pp. 743–758. [Online]. Available:
http://doi.acm.org/10.1145/2541940.2541942
[61] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, “Adaptive insertion
policies for high performance caching,” in Proceedings of the 34th Annual International
Symposium on Computer Architecture, ser. ISCA ’07. New York, NY, USA: ACM, 2007,
pp. 381–391. [Online]. Available: http://doi.acm.org/10.1145/1250662.1250709
[62] M. K. Qureshi and G. H. Loh, “Fundamental latency trade-off in architecting DRAM
caches: Outperforming impractical SRAM-tags with a simple and practical design,”
in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on
Microarchitecture, ser. MICRO-45. Washington, DC, USA: IEEE Computer Society,
2012, pp. 235–246. [Online]. Available: http://dx.doi.org/10.1109/MICRO.2012.30
[63] R. Rahman, “Intel Xeon Phi Core Micro-architecture,” May 2013, [accessed August-
2014]. [Online]. Available: https://software.intel.com/sites/default/files/article/393195/
intel-xeon-phi-core-micro-architecture.pdf
[64] B. Romanescu, A. Lebeck, D. Sorin, and A. Bracy, “UNified Instruction/Translation/Data
(UNITD) coherence: One protocol to rule them all,” in High Performance Computer
Architecture (HPCA), 2010 IEEE 16th International Symposium on, Jan 2010, pp. 1–12.
[Online]. Available: https://doi.org/10.1109/HPCA.2010.5416643
[65] A. Ros and S. Kaxiras, “Complexity-effective multicore coherence,” in Proceedings of
the 21st International Conference on Parallel Architectures and Compilation Techniques,
ser. PACT ’12. New York, NY, USA: ACM, 2012, pp. 241–252. [Online]. Available:
http://doi.acm.org/10.1145/2370816.2370853
[66] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “DRAMSim2: A cycle accurate memory
system simulator,” IEEE Comput. Archit. Lett., vol. 10, no. 1, pp. 16–19, Jan. 2011.
[Online]. Available: http://dx.doi.org/10.1109/L-CA.2011.4
[67] J. H. Ryoo, N. Gulur, S. Song, and L. K. John, “Rethinking TLB designs
in virtualized environments: A very large part-of-memory TLB,” in Proceedings
of the 44th Annual International Symposium on Computer Architecture, ser.
ISCA ’17. New York, NY, USA: ACM, 2017, pp. 469–480. [Online]. Available:
http://doi.acm.org/10.1145/3079856.3080210
[68] A. Saulsbury, F. Dahlgren, and P. Stenstrom, “Recency-based TLB preloading,” in
Proceedings of the 27th Annual International Symposium on Computer Architecture,
ser. ISCA ’00. New York, NY, USA: ACM, 2000, pp. 117–127. [Online]. Available:
http://doi.acm.org/10.1145/339647.339666
[69] A. Seznec, “Concurrent support of multiple page sizes on a skewed associative TLB,”
IEEE Trans. Comput., vol. 53, no. 7, pp. 924–927, Jul. 2004. [Online]. Available:
http://dx.doi.org/10.1109/TC.2004.21
[70] A. Seznec, “A case for two-way skewed-associative caches,” in Proceedings of
the 20th Annual International Symposium on Computer Architecture, ser. ISCA
’93. New York, NY, USA: ACM, 1993, pp. 169–178. [Online]. Available: http:
//doi.acm.org/10.1145/165123.165152
[71] A. Seznec, “A new case for skewed-associativity,” Internal Publication No 1114, IRISA-
INRIA, Tech. Rep., 1997.
[72] M. Shah, R. Golla, G. Grohoski, P. Jordan, J. Barreh, J. Brooks, M. Greenberg,
G. Levinsky, M. Luttrell, C. Olson, Z. Samoail, M. Smittle, and T. Ziaja, “Sparc T4: A
dynamically threaded server-on-a-chip,” IEEE Micro, vol. 32, no. 2, pp. 8–19, Mar. 2012.
[Online]. Available: http://dx.doi.org/10.1109/MM.2012.1
[73] B. Sinharoy, J. A. V. Norstrand, R. J. Eickemeyer, H. Q. Le, J. Leenstra, D. Q. Nguyen,
B. Konigsburg, K. Ward, M. D. Brown, J. E. Moreira, D. Levitan, S. Tung, D. Hrusecky,
J. W. Bishop, M. Gschwind, M. Boersma, M. Kroener, M. Kaltenbach, T. Karkhanis,
and K. M. Fernsler, “IBM POWER8 processor core microarchitecture,” IBM Journal of
Research and Development, vol. 59, no. 1, pp. 2:1–2:21, Jan 2015.
[74] SPARC International, Inc., The SPARC Architecture Manual (Version 9).
Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1994.
[75] S. Srikantaiah and M. Kandemir, “Synergistic TLBs for high performance address
translation in chip multiprocessors,” in Proceedings of the 2010 43rd Annual IEEE/ACM
International Symposium on Microarchitecture, ser. MICRO ’43. Washington, DC,
USA: IEEE Computer Society, 2010, pp. 313–324. [Online]. Available: http:
//dx.doi.org/10.1109/MICRO.2010.26
[76] Sun Microsystems, “SPARC Joint Programming Specification 1 Implementation
Supplement: Sun UltraSPARC III,” 2002, [Online; accessed May-2017]. [Online].
Available: http://www.oracle.com/technetwork/server-storage/sun-sparc-enterprise/
documentation/sparc-3-usersmanual-2516678.pdf
[77] Sun Microsystems, “Schizo Programmer’s Reference Manual,” 2007.
[78] M. Talluri and M. D. Hill, “Surpassing the TLB performance of superpages with
less operating system support,” in Proceedings of the Sixth International Conference
on Architectural Support for Programming Languages and Operating Systems, ser.
ASPLOS VI. New York, NY, USA: ACM, 1994, pp. 171–182. [Online]. Available:
http://doi.acm.org/10.1145/195473.195531
[79] M. Talluri, S. Kong, M. D. Hill, and D. A. Patterson, “Tradeoffs in supporting two
page sizes,” in Proceedings of the 19th Annual International Symposium on Computer
Architecture, ser. ISCA ’92. New York, NY, USA: ACM, 1992, pp. 415–424. [Online].
Available: http://doi.acm.org/10.1145/139669.140406
[80] A. S. Tanenbaum, Modern Operating Systems, 2nd ed. Prentice Hall Press, 2002.
[81] P. J. Teller, “Translation-lookaside buffer consistency,” Computer, vol. 23, no. 6, pp.
26–36, Jun. 1990. [Online]. Available: http://dx.doi.org/10.1109/2.55498
[82] J. Vesely, A. Basu, M. Oskin, G. H. Loh, and A. Bhattacharjee, “Observations
and opportunities in architecting shared virtual memory for heterogeneous systems,”
in 2016 IEEE International Symposium on Performance Analysis of Systems
and Software (ISPASS), April 2016, pp. 161–171. [Online]. Available: https:
//doi.org/10.1109/ISPASS.2016.7482091
[83] C. Villavieja, V. Karakostas, L. Vilanova, Y. Etsion, A. Ramirez, A. Mendelson,
N. Navarro, A. Cristal, and O. S. Unsal, “DiDi: Mitigating the performance impact of
TLB shootdowns using a shared TLB directory,” in Proceedings of the 2011 International
Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’11.
Washington, DC, USA: IEEE Computer Society, 2011, pp. 340–349. [Online]. Available:
http://dx.doi.org/10.1109/PACT.2011.65
[84] Virtutech, “Simics Reference Manual (Simics Version 3.0),” 2007.
[85] H. Yoon and G. S. Sohi, “Revisiting virtual L1 caches: A practical design using
dynamic synonym remapping,” in 2016 IEEE International Symposium on High
Performance Computer Architecture (HPCA), March 2016, pp. 212–224. [Online].
Available: https://doi.org/10.1109/HPCA.2016.7446066