
Exploring Virtualization Techniques for Branch Outcome Prediction

by

Maryam Sadooghi-Alvandi

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science

Graduate Department of Electrical and Computer Engineering

UNIVERSITY OF TORONTO

© Copyright by Maryam Sadooghi-Alvandi 2011

Abstract

Exploring Virtualization Techniques for Branch Outcome Prediction

Maryam Sadooghi-Alvandi

Master of Applied Science

Graduate Department of Electrical and Computer Engineering

University of Toronto

2011

Modern processors predict branch outcomes in order to fetch

ahead in the instruction stream, increasing concurrency and performance. Larger predic-

tor tables can improve prediction accuracy, but come at the cost of larger area and longer

access delay.

This work introduces a new branch predictor design that increases the perceived pre-

dictor capacity without increasing its delay, by using a large virtual second-level table

allocated in the second-level cache. Virtualization is applied to a state-of-the-art multi-

table branch predictor. We evaluate the design using instruction count as a proxy for timing

on a set of commercial workloads. For a predictor whose size is determined by access delay

constraints rather than area, accuracy can be improved by 8.7%. Alternatively, the design

can be used to achieve the same accuracy as a non-virtualized design while using 25% less

dedicated storage.


Acknowledgements

First and foremost, I would like to thank my supervisor, Professor Andreas Moshovos,

for his guidance, inspiration, and support. His dedication and expertise have helped me

immensely during this work.

I am grateful to my thesis defense committee members, Professors Natalie Enright-

Jerger, Greg Steffan, and Dimitrios Hatzinakos for their time and invaluable feedback.

My sincerest appreciation goes to Henry, Kaveh, Ioana, and Jason for all the assistance

and suggestions that they provided me in completing my thesis. I would like to thank all

the members of our group and the many wonderful people I have met in graduate school,

especially Elias, Myrto, Goran, Ian, Patrick, Vitaly, Islam, Nahi, Andrew, Diego, Xander,

Danyao, Pam, and Meric.

Finally, I would like to thank my parents, Mohammad and Mahrokh, and my brother,

Milad, for their love, support, and encouragement throughout my life. I wouldn’t be here

without you.


Table of Contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1 Background on Branch Prediction . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1.1 Branch Prediction Mechanisms . . . . . . . . . . . . . . . . . . . . . . 8

2.2 Background on Predictor Virtualization . . . . . . . . . . . . . . . . . . . . . 11

3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.1 Related Work on PV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.2 Related Work on Hierarchical Branch Predictors . . . . . . . . . . . . . . . . 14

3.3 Related Work on Improving TAGE Prediction Accuracy . . . . . . . . . . . . 16

4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.1 Simulation Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17


4.1.1 Timing Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.1.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.1.3 Branch Prediction Accuracy Metric . . . . . . . . . . . . . . . . . . . . 19

4.1.4 Simulated Microprocessor . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.2 TAGE Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.2.1 Number of Tagged Tables . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.2.2 Entry Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.2.3 History Lengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.2.4 Table Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

5 Virtualization Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

6 Paging Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

6.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

6.3 Design Space Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

6.3.1 TAGE Baseline and Potential for Virtualization . . . . . . . . . . . . . 33

6.3.2 Choosing Ltable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

6.3.3 Maximum Potential Improvement with the Paging Scheme . . . . . . 36

6.3.4 Page Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

6.3.5 Instruction Block Size . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

6.3.6 L2 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

6.3.7 Sensitivity to Initial Table Sizes . . . . . . . . . . . . . . . . . . . . . . 59

6.4 Virtualized Design Accuracy and Improvement Over Baseline for Delay-

Constrained Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

7 Packaging Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

7.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

7.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66


7.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

7.3.1 Choosing Ptable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

7.3.2 Potential Improvement with the Packaging Scheme . . . . . . . . . . 71

7.3.3 Package Swap Frequency . . . . . . . . . . . . . . . . . . . . . . . . . 73

7.3.4 Instruction Block Size . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

7.3.5 Improvement for Packaging Scheme Before Considering L2 Delay . . 77

7.3.6 Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

8.1.1 Paging Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

8.1.2 Packaging Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

8.1.3 Comparison of the Two Schemes . . . . . . . . . . . . . . . . . . . . . 84

8.2 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

Appendices

A Branch Prediction Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

A.1 Bimodal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

A.2 Two-Level Adaptive Predictors . . . . . . . . . . . . . . . . . . . . . . . . . . 86

A.3 Gshare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

A.4 2BC-Gshare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

A.5 The Skewed Predictor and 2BC-Gskew . . . . . . . . . . . . . . . . . . . . . . 90

A.6 YAGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

A.7 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

A.8 O-GEHL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96


List of Tables

4.1 Workload Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.2 Tag Width for Predictor Table Entries . . . . . . . . . . . . . . . . . . . . . . . 20

4.3 Geometric Functions Used for History Lengths . . . . . . . . . . . . . . . . . 20

4.4 Delay-Constrained Table Size Configurations Established Using CACTI . . . 24

6.1 Default Ltable Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6.2 Ltable History Lengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

6.3 Paging Scheme Virtualized Design Configuration . . . . . . . . . . . . . . . . 52

7.1 Default Ptable Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

7.2 Packaging Scheme Configuration . . . . . . . . . . . . . . . . . . . . . . . . . 71

7.3 Final Packaging Scheme Configuration . . . . . . . . . . . . . . . . . . . . . . 77


List of Figures

2.1 TAGE Predictor with N Tagged Tables . . . . . . . . . . . . . . . . . . . . . . . 10

2.2 Phantom-BTB Architecture [10] . . . . . . . . . . . . . . . . . . . . . . . . . . 11

4.1 Organization of Tag and Data Arrays in CACTI [35] . . . . . . . . . . . . . . . 22

4.2 Access Latency for TAGE Modeled as an N-Way Set-Associative Cache (Tag

Width: 5 bits, Data Width: 8 bits) . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.3 Access Latency for the Bimodal Table Modeled as an SRAM Array of Bytes . 23

5.1 Paging scheme increases the perceived length of a predictor table . . . . . . . 26

5.2 Packaging scheme increases the perceived width of a predictor table . . . . . 26

6.1 Paging Scheme Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

6.2 MPKI for Baseline TAGE at Three Different Sizes . . . . . . . . . . . . . . . . 34

6.3 MPKI for TAGE with a Single 32K-Entry Table . . . . . . . . . . . . . . . . . . 35

6.4 Maximum potential improvement achievable by adding a single 32K-entry

table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6.5 Effect of page size on MPKI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

6.6 Maximum potential improvement with a page size of 32 entries . . . . . . . . 39

6.7 Effect of page size on page swap frequency . . . . . . . . . . . . . . . . . . . . 41

6.8 Effect of page size on predictor tables’ MPKI contribution (N=1) . . . . . . . 45

6.9 Effect of page size on predictor tables’ MPKI contribution (N=8) . . . . . . . 45

6.10 Effect of instruction block size on page swap frequency (instruction block

size = 2^X × page size) . . . . . . . . . . . . . . . . . . . . . . . . 47


6.11 Effect of instruction block size on MPKI (instruction block size = 2^X × page

size, N=12) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.12 Effect of instruction block size on the fraction of pages never accessed (in-

struction block size = 2^X × page size) . . . . . . . . . . . . . . . . . . . . . . . 49

6.13 Maximum potential improvement with an instruction block size of 256 B

(X = 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

6.14 Effect of latency on percentage improvement in MPKI . . . . . . . . . . . . . 52

6.15 Potential improvement for virtualized design with 2K-entry dedicated tables 53

6.16 Effect of latency on L1 miss rate . . . . . . . . . . . . . . . . . . . . . . . . . . 55

6.17 Effect of latency on page swap frequency . . . . . . . . . . . . . . . . . . . . . 56

6.18 Effect of latency on predictor table’s contribution to MPKI and final predic-

tion (a) MPKI contribution (b) contribution to final prediction . . . . . . . . 57

6.19 Maximum potential improvement achievable by adding a single 32K-entry

table – 4K-entry dedicated tables . . . . . . . . . . . . . . . . . . . . . . . . . 59

6.20 Improvement loss due to page size, instruction block size, and latency con-

straints – 4K-entry dedicated tables . . . . . . . . . . . . . . . . . . . . . . . . 60

6.21 Accuracy improvement for virtualization of delay-constrained configurations 61

6.22 Effect of virtualization on L2 traffic for delay-constrained configurations . . 62

7.1 Packaging scheme architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

7.2 Package number determined by n bits of branch PC . . . . . . . . . . . . . . . 68

7.3 MPKI for TAGE with a table of 1K packages added to a single predictor table 70

7.4 Maximum potential improvement with packaging scheme using 64 B pack-

ages and instruction block size . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

7.5 Package swap frequency for prediction and packaging buffers . . . . . . . . . 73

7.6 Effect of instruction block size on prediction buffer package swap frequency 75

7.7 Effect of instruction block size on the fraction of packages never accessed . . 76

7.8 Effect of instruction block size on MPKI . . . . . . . . . . . . . . . . . . . . . 77


7.9 Potential improvement for packaging scheme using an instruction block

size of 1 KB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

7.10 Maximum potential improvement before considering L2 delay for paging

and packaging schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

7.11 Accuracy of a 1K-entry PTB on the SPEC CPU 2000 benchmark set . . . . . . 80

A.1 Bimodal Predictor [3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

A.2 Two-Level Adaptive Predictor [14] . . . . . . . . . . . . . . . . . . . . . . . . 88

A.3 Gshare Predictor [3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

A.4 2bc-gshare Predictor [3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

A.5 2bc-gskew Predictor [4] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

A.6 YAGS Predictor [7] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

A.7 Perceptron Predictor [24] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

A.8 GEHL Predictor [25] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94


Chapter 1

Introduction

1.1 Motivation

Modern microprocessors are continuously challenged to execute more instructions per

second. To achieve this goal, microprocessors execute multiple instructions in parallel,

exploiting Instruction Level Parallelism, the inherent potential for multiple instructions to

be processed simultaneously. This Instruction Level Parallelism can be limited by the

various dependencies between instructions. An important class of such dependencies is

control flow dependencies, which determine the order of instruction execution. This class

of dependencies is introduced by branch instructions, instructions that can alter the flow

of execution, usually based on some conditions. When a conditional branch instruction

is fetched, it can take several cycles before the outcome of its condition is evaluated, and

until then, it is unclear which path the processor should follow. To avoid stalling for

branches, microprocessors rely on speculation. Branch prediction is a form of speculation

that allows the processor to speculatively fetch and execute instructions along a predicted

path. The branch predictor uses statistical information based on past branch behaviour to

determine the likely outcome of each branch. This technique enhances performance as it

keeps the pipeline full, even in the presence of control hazards.

Branch prediction accuracy impacts processor performance as branch mispredictions

represent a significant opportunity loss. If a branch is incorrectly predicted, a considerable

number of cycles is wasted executing instructions on the wrong path, as well as recovering

from such incorrect execution. Current microprocessors have large branch misprediction

penalties. For instance, the newest Intel microprocessors, the Intel Sandy Bridge proces-


sors, are reported to have a misprediction penalty of 14 cycles and a fetch width of four

instructions per cycle [1]. Since younger instructions in the pipeline must be flushed on a

misprediction, for these processors, each mispredicted branch introduces 56 bubbles into

the pipeline. If 10 out of every 1000 instructions are mispredicted branches, on average

560 out of every 1560 fetched instructions are wasted on the wrong path, costing

more than a third of the fetch bandwidth. Although based on many simplified as-

sumptions, this example illustrates that due to the large penalties associated with branch

mispredictions, small improvements in accuracy can have a large impact on performance.

To predict a branch outcome, the branch address and some additional information are

used to look up predictions in one or more tables of counters. These tables store infor-

mation about the past behaviour of branches. Since the tables have a limited number of

entries, some branches will inadvertently be forced to use the same entries. This phe-

nomenon is known as aliasing, which can, in principle, be constructive or destructive.

However, aliasing has been shown to be predominantly destructive, significantly limiting

the accuracy of branch predictors [2]. A variety of techniques have been proposed to limit

aliasing and reduce its negative effects [3, 4, 5, 6, 7].
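
To make aliasing concrete, the following minimal sketch (the table size and index hash are illustrative, not taken from any particular predictor) shows two distinct branch addresses sharing one counter:

    # Two different branches collide in a finite predictor table.
    TABLE_ENTRIES = 4096

    def table_index(pc):
        return (pc >> 2) % TABLE_ENTRIES        # illustrative index hash

    pc_a = 0x00401000
    pc_b = pc_a + TABLE_ENTRIES * 4             # a different branch, 16 KB away
    assert table_index(pc_a) == table_index(pc_b)   # both branches share one entry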

Increasing the branch predictor table size(s) is a simple way of reducing aliasing. How-

ever, larger tables require more area and have longer access latencies. As the branch pre-

dictor resides on the critical path of instruction fetch, it is desirable that it provides a

prediction within a single cycle. Therefore, branch prediction designs are constrained not

only by their chip area budget, but also by the delay to access their structures. It has been

shown that increasing delay to improve prediction accuracy is almost never a good tradeoff

[8]. This effect can be explained with the following equation, which roughly approximates

the cost C (in cycles) of executing a branch instruction:

C = d + (r × p) (1.1)

where d is the delay of the branch predictor, r is the misprediction rate, and p is the mis-

prediction penalty. Because misprediction rates tend to be close to a few percent, changes


in delay have a larger impact than small changes in the misprediction rate. For this reason,

we are motivated to find methods that allow branch predictors to use larger tables without

sacrificing the predictor access delay.
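
To illustrate the argument, the following sketch evaluates Equation 1.1 with purely illustrative numbers (they are not measurements from this work):

    # Per-branch cost model of Equation 1.1: C = d + (r * p).
    def branch_cost(d, r, p):
        return d + r * p

    fast = branch_cost(1, 0.05, 14)               # 1-cycle predictor, 5% misses -> 1.70 cycles
    slow_but_accurate = branch_cost(2, 0.04, 14)  # 2-cycle, slightly more accurate -> 2.56 cycles
    # The extra cycle of predictor delay outweighs the modest accuracy gain.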

One such potential method is Predictor Virtualization (PV) [9]. PV can be applied to

hardware optimization engines (known as predictors in PV terminology) that collect in-

formation about program behaviour at runtime and store it in on-chip tables. PV uses ded-

icated on-chip resources to record the active predictor entries. Whenever this dedicated

storage is not enough, PV spills entry contents to the memory hierarchy. PV attempts to

emulate large predictor tables without requiring large amounts of dedicated on-chip re-

sources to be set aside at design time. PV offers the ability to dynamically adapt predictor

table sizes based on the observed application behaviour, as well as allowing the use of a

feedback mechanism to turn the optimization on and off without wasting storage.

Applying Predictor Virtualization to branch outcome predictors could provide the best

balance between predictor accuracy, area overhead, and latency.

Augmenting the dedicated branch predictor tables with a large virtual second-level table

is beneficial, not only because the second-level table does not come at the cost of larger

dedicated storage, but also because, if orchestrated properly, the delay of the predictor can

remain constant at the latency of accessing the smaller dedicated tables.

The main challenge with Predictor Virtualization is tolerating the variable and rela-

tively long access latency introduced by placing the virtual storage in the memory hierar-

chy. Some predictors naturally benefit from virtualization as they can inherently tolerate

higher latencies, because they themselves perform high-latency operations [9]. However,

a straightforward application of PV techniques is not feasible for branch predictors, since

branch predictors are delay sensitive. PV has also been successfully applied to branch tar-

get buffers, a predictor class that is not inherently tolerant of higher prediction latencies.

For these predictors, the inherent spatial and temporal locality of their access stream is

used to prefetch predictor entries, hiding the latency of virtual table accesses [10]. These

predictors use branch addresses to access their structures, which inherit the spatial and


temporal locality present in program instructions. However, branch predictors’ access

streams do not exhibit spatial or temporal locality since they use a hash of the branch

address and other information to index the predictor tables. These hash functions are de-

signed to randomize the predictor access stream as much as possible, in order to reduce

aliasing. To take advantage of PV, the design of branch predictors will have to be revisited

with virtualization in mind.

This thesis explores virtualization techniques as applied to branch outcome predictors.

As most branch predictors were designed with area constraints in mind, it is useful to

rethink branch predictor design in the absence of area constraints and in the context of

virtualization, which makes area constraints a secondary factor by providing the

predictor with a large virtual space. In this work, virtualization is applied to a state-of-the-

art multi-table branch predictor, the TAgged GEometric History Length predictor (TAGE)

[11]. Our designs augment a single table in TAGE with a large virtual second-level table.

We propose two schemes for the design of such a two-level predictor, a paging scheme and

a packaging scheme.

1.2 Thesis Contributions

The contributions of this work are as follows:

• We propose a novel paging scheme for branch outcome prediction virtualization

that introduces coarse-grain locality in the branch prediction access stream. In this

scheme, the entries in a single predictor table are divided into pages shared by a

group of branches in close proximity in the program code space. The first-level table

caches the most recently used subset of the second-level pages (Chapter 6).

• We evaluate the paging scheme using instruction count as a proxy for timing on a set

of commercial workloads. For a predictor with moderate table sizes (2K entries), pre-

diction accuracy can be improved by 10.6% to 30.5%, depending on the number of

dedicated tables, while adding an overhead of only 704 bits of extra dedicated stor-


age. Alternatively, the virtualized design can be used to reduce the dedicated budget

and complexity of the predictor without affecting accuracy. In this case, depending

on the number of dedicated predictor tables, 20% to 50% of the tagged tables can

be removed, while achieving the same or better accuracy than a non-virtualized design.

For a predictor whose size is determined not by area constraints but by the delay

to access its contents (delay-constrained), the accuracy of the best configuration of

the baseline predictor can be improved by 8.7%. Alternatively, virtualization can be

used to achieve the same accuracy as the baseline predictor while removing approx-

imately 25% of the dedicated storage.

• As an alternative to paging, we propose a novel packaging scheme for branch out-

come prediction virtualization. In this scheme, the evicted entries from the first-level

predictor table are grouped into packages that are placed in the second-level table,

and fetched in time to be used for prediction (Chapter 7).

• Before considering the effect of L2 delay on the predictor, we demonstrate that the

packaging scheme achieves better accuracy (up to 9% more) than the paging scheme,

especially for a predictor with fewer dedicated tables. However, the packaging scheme

exhibits higher traffic to the L2 compared with the paging scheme. We leave investi-

gation into the effect of L2 latency on the packaging scheme for future work.

The rest of this thesis is structured as follows. We present a background on branch

prediction and Predictor Virtualization in Chapter 2. We review related work in Chapter

3. We describe the methodology in Chapter 4. Chapter 5 provides an overview of the

two virtualization schemes. In Chapter 6, we present and evaluate the paging scheme. In

Chapter 7, we present the packaging scheme and show the potential improvement that

can be achieved with this scheme before considering the effects of latency. We conclude in

Chapter 8.


Chapter 2

Background

2.1 Background on Branch Prediction

Virtually all modern processors contain a hardware branch prediction unit. The unit is

generally organized as a branch outcome predictor and a branch target buffer (BTB). It is

placed in the instruction fetch stages and is in charge of recognizing an incoming branch

instruction, providing a prediction about the outcome of the branch (taken or not taken),

and providing the target instruction address (in case of a predicted taken branch). The

target of the branch is predicted by the BTB. The BTB is a cache-like structure, indexed

by a portion of the branch address, that keeps branch target addresses as its entries. Each

entry also typically includes a tag field. The branch outcome predictor can be coupled with

the BTB or implemented as a separate structure. In the former case, only branches that

hit in the BTB are predicted by the outcome predictor, while a static prediction algorithm

is used on a BTB miss. Static prediction can be as simple as an always-predict-taken (or -

not-taken) method or could rely on more sophisticated compiler techniques that are based

on the behaviour of branches predictable at compile time [12]. As an alternative to static

prediction, dynamic branch prediction can be used to predict branch outcomes by the

hardware at execution time.

In this thesis, we focus on dynamic branch outcome prediction and use the term branch

prediction to refer to this type of prediction from this point on. Although branch prediction

encompasses the prediction of both the direction and target of branches, we will explicitly

use the term branch target prediction when referring to the latter. To help put our work

into context, this section reviews background on branch prediction schemes. We present


a general overview of the evolution of branch predictors, while focusing on the branch

prediction schemes that directly relate to our work.

Simple dynamic branch predictors were first proposed by Smith [13]. The main idea

behind these schemes is to use the past behaviour of the branch to speculate about its

future outcome. A hash of the branch address is used as an index into a table of counters

that are incremented when a branch is taken, and decremented otherwise. Prediction

comes from the high bit of the corresponding counter; 1 means predict taken, 0 means

predict not taken.
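
A minimal sketch of such a predictor follows; the table size, the address hash, and the use of 2-bit saturating counters are illustrative choices rather than details taken from [13]:

    # Sketch of a simple bimodal predictor: one table of saturating counters
    # indexed by a hash of the branch address.
    class BimodalPredictor:
        def __init__(self, entries=4096):
            self.counters = [1] * entries          # 2-bit counters, start weakly not-taken

        def _index(self, pc):
            return (pc >> 2) % len(self.counters)  # illustrative hash of the branch address

        def predict(self, pc):
            return self.counters[self._index(pc)] >= 2   # high bit set means predict taken

        def update(self, pc, taken):
            i = self._index(pc)
            if taken:
                self.counters[i] = min(3, self.counters[i] + 1)
            else:
                self.counters[i] = max(0, self.counters[i] - 1)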

In 1991, Yeh and Patt observed that the outcome of a given branch is often highly

correlated with the outcomes of other recent branches [14]. They use the pattern formed

by the history of branch outcomes to provide a dynamic context for prediction. As each

branch outcome becomes known, one bit (1 for taken, 0 for not taken) is shifted into a

pattern history register. A pattern history register is accessed by a hash of the branch

address, and is in turn used to access a pattern history table of 2-bit saturating counters,

resulting in a two-level predictor. In a later work, the authors provide a taxonomy of

two-level adaptive branch prediction schemes [15].
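
A minimal sketch of this two-level organization (per-branch history registers indexing a shared pattern history table of 2-bit counters) is shown below; the sizes and the address hash are illustrative:

    # Sketch of a two-level adaptive predictor in the style of Yeh and Patt.
    class TwoLevelPredictor:
        def __init__(self, history_bits=10, bhr_entries=1024):
            self.history_bits = history_bits
            self.bhr = [0] * bhr_entries            # level 1: per-branch history registers
            self.pht = [1] * (1 << history_bits)    # level 2: pattern history table

        def _bhr_index(self, pc):
            return (pc >> 2) % len(self.bhr)

        def predict(self, pc):
            pattern = self.bhr[self._bhr_index(pc)]
            return self.pht[pattern] >= 2

        def update(self, pc, taken):
            i = self._bhr_index(pc)
            pattern = self.bhr[i]
            if taken:
                self.pht[pattern] = min(3, self.pht[pattern] + 1)
            else:
                self.pht[pattern] = max(0, self.pht[pattern] - 1)
            self.bhr[i] = ((pattern << 1) | int(taken)) & ((1 << self.history_bits) - 1)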

Subsequent work focused on refining the schemes of Yeh and Patt. Several predictors

were proposed that used de-interference techniques to address the issue of destructive

aliasing, which represented a significant cause of mispredictions [5, 6, 16]. Hybrid predic-

tors that combined two branch prediction schemes were also proposed, taking advantage

of the strengths of each prediction scheme to increase the overall accuracy [3]. These pre-

dictors form the basis of all publicly disclosed implementations of branch predictors on

real processors [17, 18, 19, 20].

In 2001, Jimenez et al. introduced the Perceptron predictor, which uses simple neural

elements, known as perceptrons, to perform branch prediction [21]. The predictor main-

tains the organization of previous two-level schemes, but replaces the table of counters

with a table of perceptrons. While the Perceptron predictor achieved better accuracy than

previous two-level adaptive methods by allowing the use of longer history lengths at the


cost of linear resource growth (traditional methods required an exponential growth of re-

sources), the predictor suffered from high latency. Later research would improve both the

latency of the predictor, as well as its prediction accuracy [22, 23, 24].

The Championship Branch Prediction competitions, first held in December 2004, re-

sulted in a new wave of innovative branch predictors. Combining the summation and

training approach of neural predictors with the idea of hashing into several tables of coun-

ters, the Optimized GEometric History Length (O-GEHL) branch predictor won second

place at the first Championship Branch Prediction (CBP-1) [25]. The O-GEHL predictor

uses a set of history lengths that form a geometric series to index a set of predictor tables.

The PPM-Like Predictor, a multi-table tag-based scheme based on prediction by partial

matching (PPM) was introduced at the same competition, and placed fifth [26]. At CBP-

2, the L-TAGE branch predictor, composed of a loop predictor and a TAgged GEometric

history length (TAGE) predictor, placed first [11]. The main component of the predictor,

TAGE, was built upon the PPM-Like predictor, while using geometric history lengths as in

the O-GEHL predictor to access the different tables. The winner at CBP-3, held recently

in June 2011, was the ISL-TAGE predictor [27], which augments the previously proposed

L-TAGE predictor with a Statistical Corrector predictor. At the time of this writing, TAGE-

based predictors represent the state of the art in branch outcome prediction.

2.1.1 Branch Prediction Mechanisms

This section reviews the L-TAGE predictor. A brief description of several other predictors

is provided in Appendix A.

L-TAGE

The L-TAGE predictor is composed of a loop predictor and a TAGE predictor. The loop

predictor simply tries to identify regular loops with constant iteration counts. The loop

predictor provides the final prediction when a loop has been executed three times with

the same number of iterations. Each entry in the loop predictor consists of the maximum


iteration count determined in previous runs of the loop, the current iteration count, a

partial tag used to identify the loop branch, a confidence counter, and an age counter. The

loop predictor is trained to recognize the maximum loop count. When the current iteration

counter reaches the maximum loop count, it predicts the branch as not-taken.

The rest of this section focuses on the main predictor component, the TAGE predic-

tor. The virtualization techniques described in this work are applied only to the TAGE

component of the predictor.

Organization Figure 2.1 shows the organization of the TAGE predictor. The predictor

consists of a base bimodal predictor and a set of (partially) tagged predictor tables. Each

tagged table is indexed through an independent function of global and path histories and

the branch address. Similar to the GEHL predictor, the global history lengths used by

the tagged tables form a geometric series. The base predictor is in charge of providing

a default prediction and is implemented as a simple PC-indexed 2-bit counter bimodal

table, which shares a hysteresis bit among several counters to reduce storage [19]. An

entry in the tagged components consists of a 3-bit counter that provides the prediction, an

unsigned 2-bit useful counter u, and a tag. The width of the tag varies for each table.

Index and Tag Hash Functions The index to each tagged table is a hash of the branch

address, global and path histories, and the table number. The tag is a function of the global

history and branch address only.

To allow the use of long history lengths, a folding mechanism is used to compress the

history from its original length down to the number of bits used to index the table. For

example, for a table of 1024 entries (which uses a 10-bit index), 30 bits of global history are

folded as follows: h[9:0] ⊕ h[19:10] ⊕ h[29:20] down to 10 bits. The folding mechanism

is implemented as a circular shift register and a few XOR gates [26].
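
The folding operation can be sketched as follows; for clarity the fold is recomputed from scratch here, whereas, as noted above, hardware maintains it incrementally with a circular shift register and XOR gates:

    # XOR-fold a long global history down to the table's index width.
    def fold_history(history, history_len, index_bits):
        mask = (1 << index_bits) - 1
        folded = 0
        for shift in range(0, history_len, index_bits):
            folded ^= (history >> shift) & mask
        return folded

    # Example from the text: 30 bits of history folded to a 10-bit index,
    # i.e. h[9:0] ^ h[19:10] ^ h[29:20].
    # index = fold_history(global_history, 30, 10)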

Prediction At prediction time, all tables are accessed simultaneously to provide a pre-

diction. The base predictor always provides a default prediction, while the tagged com-


Figure 2.1: TAGE Predictor with N Tagged Tables

ponents provide a prediction only on a tag match. If there is no matching tag on any of

the tables, the default prediction is used. Otherwise, two predictions from the two longest

history length tables with a tag match (or bimodal, if there is only one tag match) are

considered. The second prediction is called the alternate prediction. In general, the first

prediction is used unless the entry is deemed as newly allocated. An entry is considered

newly allocated, if its prediction counter is weak, or if a global 4-bit counter called use_alt,

used to monitor the effectiveness of the alternative prediction, indicates so. This counter

is updated whenever the alternate prediction differs from the main prediction: if the al-

ternate prediction is correct, the counter is incremented; otherwise, it is decremented.
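
The selection logic described above can be summarized with the following sketch; the table interface, field names, and the threshold applied to use_alt are illustrative simplifications, not the reference TAGE code:

    # tables are ordered from shortest to longest history; lookup() returns None on
    # a tag miss, or an entry whose 3-bit prediction counter is in the range 0..7.
    def tage_predict(pc, tables, bimodal_prediction, use_alt):
        matches = [e for e in (t.lookup(pc) for t in tables) if e is not None]
        if not matches:
            return bimodal_prediction                   # no tag match: default prediction
        provider = matches[-1]                          # longest matching history length
        provider_pred = provider.counter >= 4           # high bit of the 3-bit counter
        if len(matches) >= 2:
            alternate = matches[-2].counter >= 4        # next-longest matching table
        else:
            alternate = bimodal_prediction
        newly_allocated = provider.counter in (3, 4)    # weak counter
        if newly_allocated and use_alt >= 8:            # 4-bit use_alt favours the alternate
            return alternate
        return provider_pred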

Update At update time, several steps are taken as follows:

• The prediction counter of the component providing the final prediction is updated.

When the useful counter of the provider component is 0, the alternate prediction is

also updated.

• The u counter of the provider is updated when the alternate prediction is different


Figure 2.2: Phantom-BTB Architecture [10]

from the final prediction.

• The use_alt counter is updated whenever the alternate prediction differs from the

main prediction. If the alternate prediction is correct, the counter is incremented;

otherwise, it is decremented.

• On a misprediction, a new entry is allocated on a longer history length table. Us-

ing a pseudo-random number generator, one of the three next tables is chosen with

probabilities 1/2, 1/4, and 1/4, respectively. This randomness in the allocation pol-

icy is used to reduce the impact of aliasing. An allocated entry is initialized with the

prediction counter set to weakly taken or not-taken as determined by the outcome

of the branch. Counter u is initialized to 0 (i.e., strongly not useful). A sketch of this allocation step follows.
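
The sketch below illustrates the allocation step, assuming the final prediction came from table provider_idx (counting from the shortest history); the table interface and argument names are illustrative:

    import random

    def allocate_on_mispredict(tables, provider_idx, pc, taken):
        # Up to three tables with longer history than the provider are candidates.
        candidates = list(range(provider_idx + 1, min(provider_idx + 4, len(tables))))
        if not candidates:
            return
        r = random.random()
        if r < 0.5 or len(candidates) == 1:             # chosen with probability 1/2
            target = candidates[0]
        elif r < 0.75 or len(candidates) == 2:          # probability 1/4
            target = candidates[1]
        else:                                           # probability 1/4
            target = candidates[2]
        # New entry: weak counter matching the branch outcome, useful counter u = 0.
        tables[target].insert(pc, counter=4 if taken else 3, useful=0)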

2.2 Background on Predictor Virtualization

Predictor Virtualization (PV) is a technique that can be applied to hardware optimiza-

tions that gather information about program behaviour at runtime and store it in on-chip

buffers or lookup tables for future use. PV uses the memory hierarchy to transparently

store predictor metadata, while using the dedicated on-chip resources to record the ac-

tive entries of the predictor. Conceptually, PV allows for emulating large predictor tables

without the cost of large amounts of dedicated on-chip resources.

This section reviews PV in the context of a virtualized BTB design, Phantom-BTB [10].

Similar to branch outcome prediction, BTB prediction is latency-sensitive; therefore, a


straightforward application of PV techniques is ineffective in both cases. The original PV

study demonstrated its applicability to a data prefetcher, which proved inherently tolerant

of the high prediction latency introduced by virtualization. Instead, Phantom-BTB takes

a different approach by using a prefetch-based virtualized design as described next.

In Phantom-BTB, a first-level conventional BTB is augmented with a large second-level

virtual table. While the first level uses dedicated storage and operates the same way as a

conventional BTB, there is no dedicated storage for the second-level table. Instead, the

second-level table entries are transparently allocated and stored on demand, at cache line

granularity, in the L2 cache. Figure 2.2 shows the design of the Phantom-BTB predictor.

To compensate for the high latency of the virtual table, the virtual table is designed

to facilitate the timely prefetching of the BTB entries. The virtual table stores groups

of temporally correlated branches that have missed in the first-level BTB. As branches

miss in the dedicated BTB, they are grouped into temporal groups. Each temporal group

represents a virtual table entry that maps to a single L2 cache line. A temporal group is

associated with a prefetch trigger: a region of code surrounding the branch previous to the

group that also missed in the BTB. Any subsequent miss for a branch within this region

triggers the prefetch of the temporal group.

BTB misses trigger prefetching of their associated temporal blocks into a prefetch

buffer co-located with the dedicated BTB. The prefetch buffer and the BTB are accessed

in parallel on each branch access. On a BTB miss and prefetch buffer hit, the correspond-

ing branch is installed in the dedicated BTB, thus increasing its perceived capacity.
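
The lookup flow just described can be sketched as follows; the component interfaces (lookup, install, prefetch_group_for) are illustrative names rather than those used in [10]:

    # Dedicated BTB and prefetch buffer are probed in parallel; a BTB miss that
    # hits in the prefetch buffer installs the entry, and a miss in both triggers
    # a temporal-group prefetch from the virtual table in the L2.
    def pbtb_lookup(pc, btb, prefetch_buffer, virtual_table):
        target = btb.lookup(pc)
        if target is not None:
            return target                              # dedicated BTB hit
        buffered = prefetch_buffer.lookup(pc)
        if buffered is not None:
            btb.install(pc, buffered)                  # promote into the dedicated BTB
            return buffered
        virtual_table.prefetch_group_for(pc, prefetch_buffer)
        return None                                    # fall back to the usual miss handling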


Chapter 3

Related Work

3.1 Related Work on PV

Predictor Virtualization was first introduced as a generic framework that allows the exist-

ing memory hierarchy to be used to emulate large predictor tables [9]. The main goal of

PV in this context was to reduce the amount of dedicated on-chip resources required for

hardware predictor-based optimizations, while preserving their original effectiveness. In

the same paper, the authors describe an application of PV techniques to a state-of-the-art

data prefetcher, which is inherently tolerant of the higher prediction latency introduced by

virtualization. The virtualized prefetcher matches the performance of the original scheme,

while reducing the dedicated on-chip resources from 60 KB down to less than 1 KB.

A later work by the same authors applies virtualization techniques to a Branch Target

Buffer, as described in Section 2.2 [10]. Although the design shares the same goals as PV,

the inherent latency sensitivity of the BTB necessitates a prefetch-based virtualized design

that relies on the temporal correlation in the BTB miss stream. The virtualized design

improves IPC performance by an average of 6.9% over a conventional 1K-entry BTB, while

adding only 8% extra storage overhead. The new design performs within 1% of a 4K-entry

conventional BTB, while requiring 3.6 times less dedicated storage.

Finally, virtualization has been applied to the EXplicit dynamic branch predictor with

ACTive updates (EXACT) [28]. EXACT employs a different approach to improving branch

outcome prediction, based on the observation that global history alone can fail to dis-

tinguish between dynamic instances of branches. Instead, a combination of the branch

address and the address of the load instruction on which the branch depends is used to


uniquely identify the branch. A store to an address on which a dynamic branch depends

may also change its outcome the next time it is encountered. As a result, such store in-

structions are allowed to actively update the prediction counter in the predictor. The

branch predictor is augmented with two structures in charge of keeping track of the effect

of load and store instructions. Virtualization is applied to the latter structure, the Active

Update unit, which takes the address of a store instruction as an index and outputs the branch

address that is affected by the store, as well as the effects of the change. Since active up-

dates are tolerant of 400+ cycles of latency, a straightforward application of PV is used

on the Active Update unit. In the virtualized design, a 10KB dedicated first-level table is

augmented with a 512KB virtual level-two table.

3.2 Related Work on Hierarchical Branch Predictors

Closely related to our work are delay-sensitive hierarchical branch predictors [8]. These

predictors share the same goal as our work: to allow the use of a larger, more accurate

predictor without increasing prediction delay to more than a single cycle. These predictors

were introduced in the context of addressing the problem of scaling branch predictors to

future technologies. Although the aggressive clock rates assumed in the original study

(~8GHz at 35nm) have not been realized to date, the conservative clock rates (~5GHz

at 35nm) are close to what can be achieved at the time of this writing. The problem is

only significant at the aggressive clock rate, at which point a table of only 512 entries

can be accessed within one cycle. At the conservative clock rate, 16 K entries are still

accessible within a cycle. Although the proposed designs are geared toward addressing

the aforementioned problem, they represent valid design options for hiding the latency of

large predictor tables. The paper introduces three schemes for overcoming delay: a caching

approach, an overriding approach, and a cascading lookahead approach applied to the

gshare branch predictor.

In the caching scheme, a small cache of branch prediction table entries (called PHTC)

that is accessible in a single cycle is backed by a larger second-level prediction table (PHT).


At prediction time, both tables are accessed in parallel. If the correct entry is found in the

PHTC, a prediction can be made immediately. Otherwise, a third table called the Auxiliary

Branch Predictor (ABP) is used to provide a prediction, so that the delay of the second-

level table does not stall instruction fetch. Once the prediction is retrieved from the PHT,

it replaces an entry in the PHTC.

The second approach combines the concepts of lookahead branch prediction [29] and

cascading branch predictors [30]. In lookahead branch prediction, the natural spacing

between branches is used to perform prediction for the next branch that is likely to arrive.

For example, if branches are spaced so that the predictor is accessed only every other cycle,

then the predictor can have a two-cycle latency without introducing additional delay. A

cascading branch predictor implements a series of tables of ascending size and accuracy.

In their scheme, the gshare predictor is arranged in a two-level cascading organization and

is adapted to look one branch ahead. Prediction is begun simultaneously on both levels

of the predictor. The second-level table is used when the next branch to be predicted is

several cycles away, so that the access latency of the table fits in the space between the

current prediction and the next. If the next branch arrives before the second level table

can complete its access, then the prediction from the first table is used.

Finally, an overriding branch predictor provides two predictions. The first prediction

comes from a fast PHT (PHT1) and the second prediction comes from a slower, but more

accurate PHT (PHT2). To predict a branch, the first prediction is used and acted upon

while the second prediction is still being made. If the second prediction differs from the

first, the actions based on the first prediction are reversed and instructions are fetched

using the second prediction; thus, the second predictor overrides the first predictor. A

similar technique is used in the Alpha 21264 branch predictor [18], in which a branch

predictor with a latency of 2 cycles overrides a less accurate instruction cache line predic-

tor at the cost of a single stall cycle.
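
A sketch of the overriding control flow is shown below; the interfaces of the two pattern history tables and the pipeline are illustrative, and the real mechanism is spread over several pipeline cycles rather than a single call:

    # A fast first-level prediction is acted on immediately and later checked
    # against the slower but more accurate second-level prediction.
    def overriding_predict(pc, fast_pht, slow_pht, pipeline):
        fast = fast_pht.predict(pc)         # available in one cycle, drives fetch now
        pipeline.fetch_along(pc, fast)
        slow = slow_pht.predict(pc)         # arrives a few cycles later
        if slow != fast:
            pipeline.squash_after(pc)       # undo fetch along the fast prediction
            pipeline.fetch_along(pc, slow)  # the second predictor overrides the first
        return slow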

The last two schemes could be considered as an alternative to virtualization. However,

in both cases the accuracy of the predictor depends on the accuracy of both the fast and


less accurate predictor and the larger more accurate one. Theoretically, these schemes

could benefit from using a virtualized predictor (which can provide a prediction in a single

cycle) as the first-level predictor. More importantly, unlike a virtualized design, these two

schemes require large dedicated budgets for implementing the two predictor levels.

These predictors could also be considered as candidates for virtualization. However,

the design of both the cascading lookahead predictor and the overriding scheme assume a

latency of only a few cycles for the second-level table (2-4 cycles). With the latency of the

L2 cache at more than 10 cycles, implementing their second-level table as part of the L2

cache would force the predictors to rely almost exclusively on the first-level predictor. Of

the three schemes, the caching approach is closest to our virtualized design. The crucial

difference is that in our design, an entire block of correlated entries is retrieved on a first-

level table miss. In comparison, in the caching approach, only a single predictor entry is

retrieved on a miss. The reported results indicate that fetching only a single entry from

the second-level cache is insufficient, as the table is used for 7.5% of all branches, and

this access results in a prediction different from the auxiliary predictor in 0.013% of all

branches [8]. Therefore, this approach almost always relies on the auxiliary predictor

rather than the second-level table, and when it does not the two predictions almost always

agree.

3.3 Related Work on Improving TAGE Prediction Accuracy

Other predictors, such as ISL-TAGE [27] and EXACT [28], have been proposed to improve

the accuracy of the TAGE predictor. These predictors augment TAGE with new structures

that are geared toward predicting branches that cannot easily be predicted with the TAGE

predictor. The TAGE predictor is adept at predicting branches that are highly correlated with

the behaviour of other branches. By using a geometric set of history lengths, it efficiently

captures correlation on recent branch outcomes as well as on very old branches. Our work

is orthogonal to these works, since virtualization is applied to the core branch predictor

structure, in order to increase its perceived capacity and reduce destructive aliasing.


Chapter 4

Methodology

This chapter explains the experimental methodology used in this work. The simulation

infrastructure is described, followed by a summary of the parameters of the baseline pre-

dictor that we compare our virtualized designs to.

4.1 Simulation Infrastructure

To evaluate the designs explored in this work, we run simulations on the functional simula-

tion infrastructure provided for the second Championship Branch Prediction Competition

(CBP-2) [31]. The results are measured using functional simulation of the first two billion

instructions of a set of three commercial workloads. For each of the workloads, a trace of

branch addresses and outcomes is collected for a run of the first two billion instructions

on Flexus, a full-system simulator based on Simics [32]. The simulator is configured as a

functional in-order uniprocessor running the UltraSPARC III ISA.

4.1.1 Timing Model

In Section 6.3.6, we investigate the effect of the second-level predictor table latency on the

overall prediction accuracy using a simple timing model. In this model, we use instruction

count to approximate processor cycles and to emulate the second-level table delay. This

model uses a different trace format that includes the number of instructions in the basic

blocks encountered in the program, followed by the address and outcome of the branch at

the end of the basic block. Although both sets of traces were collected for the same phase

of the benchmarks, due to the move from Simics 2 to Simics 3, the traces differ slightly, as

do the related measurements. The presented results use these timing traces


Online Transaction Processing (OLTP): TPC-C, 100 warehouses (10 GB), 16 clients, 1.4 GB SGA

Decision Support (DSS): TPC-H, throughput oriented, 4 queries, 450 MB buffer pool

Web Server (WEB): Apache, 16K connections, Fast CGI, worker thread modeling

Table 4.1: Workload Descriptions

starting from Section 6.3.5.

To emulate the second-level predictor table delay, we employ a request queue model.

We model the second-level table as having separate read and write ports, and manage a

queue of requests initiated by the branch predictor for each of these ports (we do not model

regular requests to the L2 cache). We service at most one read and one write request per

cycle. We assume that the second-level table is pinned in the L2 cache. Therefore, each

request contains the requested cache line index, in addition to being stamped with the

cycle number in which the request is made. For write requests, a snapshot of the data at

the time the request is made is also passed along with the request. To simulate a latency of

N cycles, a request made at time t is serviced at the beginning of cycle t+N. The simulator

fetches four instructions per cycle (fetch width of four). We place a constraint that only

a single conditional branch can exist in the fetch width, as we assume that the dedicated

predictor tables have a single read port. Since each branch has a single delay slot for the

SPARC ISA, there can be a maximum of two branches in a fetch width of four.
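
A minimal sketch of this request-queue model follows; one instance would be kept for the read port and one for the write port of the second-level table, and the names and payloads are illustrative:

    from collections import deque

    class PortQueue:
        # At most one request is serviced per cycle; a request issued at cycle t
        # becomes serviceable at the beginning of cycle t + latency.
        def __init__(self, latency):
            self.latency = latency
            self.requests = deque()                 # (ready_cycle, line_index, payload)

        def issue(self, cycle, line_index, payload=None):
            self.requests.append((cycle + self.latency, line_index, payload))

        def service(self, cycle):
            if self.requests and self.requests[0][0] <= cycle:
                return self.requests.popleft()      # oldest request whose latency elapsed
            return None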

4.1.2 Benchmarks

Table 4.1 lists the simulated workloads, which include: the TPC-C v.3.0 online transaction

processing (OLTP) workload running on IBM DB2 v8 ESE, the TPC-H workload running

on IBM DB2 v8 ESE representing a decision support system (DSS), and finally represent-

ing a conventional web-server (WEB) workload, the SPECweb99 benchmark running over

Apache HTTP Server v2.0. The web server is driven with separate simulated client sys-

tems; the results present the server activity.


We originally intended to use the Simplescalar simulation tool for our work, which has

both functional and timing models [33]. However, since these commercial workloads could

not be simulated on Simplescalar, we switched to the CBP infrastructure (Section 4.1). In

Section 7.3.6, we present some results gathered prior to our move to the new infrastructure

using the SPEC CPU 2000 benchmarks [34]. These benchmarks were simulated for a run

of 1 billion instructions on the functional sim-bpred simulator of Simplescalar.

4.1.3 Branch Prediction Accuracy Metric

The primary metric used in this work to measure prediction accuracy is mispredictions per

kilo instructions (MPKI). This metric is a better measure of prediction accuracy than
misprediction rate, as it captures both the frequency of branches in the benchmark and
the predictor's ability to correctly predict those branches. MPKI is

also more representative of the performance impact of the predictor accuracy. For exam-

ple, a poor misprediction rate on a program with few branches may not have a significant

impact on performance.
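Written out, the metric is simply (this restates the definition above in formula form; the per-table breakdown appears later as Equation 6.1):

$MPKI = \dfrac{\text{number of mispredicted conditional branches}}{\text{total number of instructions}} \times 1000$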

4.1.4 Simulated Microprocessor

The only two processor parameters outside the branch prediction unit considered in this

work are the L2 cache line size and latency, which we assume to be 64 bytes and 20 cy-

cles respectively. All virtualized designs described in this work use a 64 KB second-level

predictor, equivalent to 1K L2 cache lines.

4.2 TAGE Parameters

In this section, we summarize the parameters used for the TAGE predictor used in our

experiments. We have used the parameters of the original TAGE implementation [11],

unless otherwise specified.


Table              1   2   3   4   5   6   7   8   9   10  11  12
Tag Width (bits)   7   7   8   8   9   10  11  12  12  13  14  15

Table 4.2: Tag Width for Predictor Table Entries

N      Geometric Function                                              Min/Max History Length (bits)
1-5    $L_i = L_1 \times 2^{i-1}$                                      $L_1 = 8$
6-12   $L_i = h_{min} \times (h_{max}/h_{min})^{(i-1)/(N-1)} + 0.5$    $h_{min} = 4$, $h_{max} = 640$

Table 4.3: Geometric Functions Used for History Lengths

4.2.1 Number of Tagged Tables

The original TAGE contains 12 tagged tables. In our experiments, we vary the number of

tagged tables (N ), from 1 to 12. We do this since one of our goals is to use virtualization

to reduce the dedicated budget and complexity of the predictor. In this way, we can deter-

mine if a non-virtualized design can be replaced by a virtualized design with fewer tables

but equivalent accuracy.

We use the notation TAGEN,M to refer to a TAGE with N tagged tables each with M

entries. In the rest of this thesis, we use the terms table and component interchangeably

when referring to TAGE’s tagged tables.

4.2.2 Entry Sizes

The entry sizes remain unchanged from the original predictor. The prediction and useful

counters are 3 and 2 bits respectively. Tag width varies for each table, and is summarized

in Table 4.2.

4.2.3 History Lengths

The original geometric function used for calculating history lengths takes a minimum and

maximum history length, and computes a history length for each predictor table depend-

ing on the total number of tables. The minimum and maximum history lengths used in the

original TAGE are 4 and 640 bits respectively. This allows a wide range of history lengths

for a predictor with a reasonable number of tables. For a predictor with fewer tables, how-


ever, this results in very short and very long history lengths with few moderate values in

between. To allow a better range of history lengths for smaller N , we tested the predictor

with a simple geometric function: $L_i = L_1 \times 2^{i-1}$. Better accuracy was achieved with the
new geometric function for up to five tagged components. Therefore, we use two sets of
geometric functions: the new function for N <= 5, and the original for N > 5 (Table 4.3).
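The sketch below computes both series from Table 4.3; it is an illustrative implementation of those formulas, and the function name is ours.

```python
# Illustrative implementation (names are ours) of the two history-length
# series from Table 4.3.
def history_lengths(n_tables, l1=8, h_min=4, h_max=640):
    if n_tables <= 5:
        # Simple doubling series used for predictors with few tagged tables.
        return [l1 * 2 ** (i - 1) for i in range(1, n_tables + 1)]
    # Original TAGE geometric series between h_min and h_max.
    return [int(h_min * (h_max / h_min) ** ((i - 1) / (n_tables - 1)) + 0.5)
            for i in range(1, n_tables + 1)]

print(history_lengths(4))    # [8, 16, 32, 64]
print(history_lengths(12))   # ranges from 4 up to 640 bits
```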

4.2.4 Table Sizes

For our baseline predictor, we allow the predictor capacity to be set by constraints on the

predictor access time rather than any chip area limitations. The predictor is allowed as

much capacity as it can access to make a prediction in a single cycle. We use the CACTI

cache modeling tool [35] to estimate the latency of the predictor structures and choose the

largest configuration that can provide a prediction within a single cycle. Specifically, we

use CACTI to determine the number of entries on the predictor tables.

For a predictor with N tagged components, the prediction time roughly consists of the

time to compute the index/tag hash functions, access the predictor table, and perform a

tag match (performed in parallel for all the tables). This is followed by the time to select

the longest matching history length prediction and the alternate prediction, taking into

account the confidence of each prediction and the global use_alt counter, as described in

Section 2.1.1. To estimate the branch predictor access time, we optimistically assume that

delay is dominated by the latency to access the tagged components, perform tag matching,

and choose the longest history table with a tag match. In this way, we can model a predic-

tor with N tagged components as an N-way set-associative cache and estimate its access

delay using CACTI. We use CACTI’s pure RAM interface (with no tags) to determine the

maximum size of the bimodal table separately.

Figure 4.1 shows the basic logical structure of a uniform cache access organization in

CACTI. CACTI uses separate arrays to model the tag and data in a cache. The address re-

quest to the cache is first provided as input to the decoder, which then activates a wordline

in each of the tag and data arrays. The contents of an entire row are placed on the bitlines,

which are then sensed. The multiple tags thus read out of the tag array are compared
against the input address to detect if one of the ways of the set does contain the requested
data. This comparator logic drives the multiplexor that finally forwards at most one of the
ways read out of the data array back to the requesting component.

Figure 4.1: Organization of Tag and Data Arrays in CACTI [35]

This model of cache delay parallels our simplified model of the predictor access time.

The delay for both models consists of the time to access the tag array, perform tag match-

ing, and choose the final data among N different options. There are two differences be-

tween an N-way cache and the simplified model of the predictor. First, the comparison

logic for the two differs slightly. For a cache, a single tag hit must be determined, whereas

for TAGE, the tables have associated priorities. As a result, even in the presence of multiple

tag hits the longest history length table must be chosen. Second, in a set associative cache,

the delay of the structure consists of the time to drive an entire row of N tags. Since TAGE

is organized as separate tables, each table is accessed separately, reading out a single tag.

Figure 4.2 shows the access latency of an N-way set-associative cache with up to 16

ways (in powers of 2 as allowed by CACTI), while varying table sizes from 1K up to 32K.

Each entry is composed of a 5-bit tag (the minimum tag width considered in our exper-

iments) and 8 bits of data. The actual size of the data portion in the predictor is 5 bits.

However, CACTI places a limit on the minimum size of the data. Although not shown, we

ensure that the reported access time is dominated by the tag array access time.

Figure 4.2: Access Latency for TAGE Modeled as an N-Way Set-Associative Cache (Tag Width: 5 bits, Data Width: 8 bits)

Figure 4.3: Access Latency for the Bimodal Table Modeled as an SRAM Array of Bytes


Number of Tagged Tables (N)   Number of Entries
1                             16K
2                             8K
3-4                           4K
5-8                           2K
9-12                          1K

Table 4.4: Delay-Constrained Table Size Configurations Established Using CACTI

We assume a processor with a 4 GHz clock rate in a process technology of 32 nm,

resulting in a 250 ps clock cycle. Current high-end processors use approximately the same

clock rate [36, 37]. To mitigate the effects of discrepancies introduced by our simplified

model for the predictor access time, we allow delays within a 20% margin of the 250 ps

cycle time, up to 300 ps. Based on Figure 4.2, Table 4.4 summarizes the table sizes that

can be accessed in a single cycle for each N . We refer to these baseline configurations as

delay-limited or delay-constrained, since the size of the predictor is determined not by area

constraints, but by the delay to access the tables.

In Figure 4.3, the access latency of a pure SRAM array of 1 B entries is used to determine
the maximum size of a bimodal predictor that can be accessed within a single cycle. The
figure shows that an array with a maximum of 16 K entries (for a total storage of 16 KB)
can be accessed in a single cycle. We use this as the storage capacity of the bimodal table,
allowing a maximum of 64 K entries on this table.


Chapter 5

Virtualization Overview

This section overviews the presented virtualized designs. In our work, we abstract the
design of a branch predictor that can be readily virtualized as a two-level predictor, with
constraints imposed so that the second level can be stored directly in
the L2 cache. The second-level predictor table is not accessed directly by the processor's

prediction mechanism. Instead, a separate mechanism is used to transfer data between

the two levels. Data is transferred at the granularity of an L2 cache line. The latency of

accessing data in the second level is equal to the delay of an L2 cache access. Without these

constraints, the designs could be used to build traditional hierarchical branch predictors.

We propose two schemes for the design of such a two-level predictor: a paging scheme

and a packaging scheme. For both schemes, virtualization is applied to a single predictor

table. In Chapter 6, we describe the paging scheme. We present an experimental analysis

of the different design parameters and evaluate the design in the presence of a large-delay

second-level table. In Chapter 7, we present the packaging scheme, we describe the de-

sign and provide data to show the method’s potential before considering the effects of

the second-level table delay. Figures 5.1 and 5.2 show the high-level designs of the two

schemes.

In the first scheme, the predictor table entries are grouped into pages shared by a group

of branch instructions in close proximity in the program code space. Based on the current

portion of the code, the corresponding page is fetched from the second level and is placed

in the first-level table. The first-level table caches the most recently used subset of the

second-level table pages. Effectively, this scheme creates multiple, virtual, mini prediction

tables, where each block of instructions maps onto one of these tables.

Figure 5.1: Paging scheme increases the perceived length of a predictor table

Figure 5.2: Packaging scheme increases the perceived width of a predictor table

In the second scheme, evicted entries from the first-level predictor table are grouped

into packages that are placed in the second-level table. Similar to the paging method, a

group of branch instructions spaced closely in the program share a package. Based on the

current portion of the code, the correct package is placed in a buffer co-located with the

first-level predictor table, and its data is used as part of the prediction mechanism.

The two methods can be viewed as increasing the perceived length (size) and width

(associativity) of one of the predictor tables respectively. In the first scheme, the predictor


table’s perceived size is increased. The table is divided into chunks, a subset of which

is cached in the dedicated predictor table. In the second scheme, packages augment the

dedicated predictor table with extra positions for evicted entries. In this way, the table can

be thought of as having multiple ways, and the entries in packages can be thought of as

having been stored in virtual predictor table ways. In the paging scheme, the entries in

the first level table are a subset of the second-level table, while the package entries in the

packaging scheme supplement the dedicated table entries.


Chapter 6

Paging Scheme

6.1 Design

In the paging scheme, one of the tagged predictor tables is backed by a larger second-

level table. The first-level table is accessible within a single cycle, while the second-level

table has a larger latency. The two levels are divided into pages, while the other predictor

components remain unchanged. The organization of entries in both levels remains the

same as the original predictor. The first level table serves as a cache for the most recently

accessed pages of the second level.

Branches are assigned specific pages on the predictor table. The instruction address

space is conceptually divided into instruction blocks, and each instruction block corre-

sponds to a page on the second-level predictor table. Branches in one instruction block

use the corresponding page for making predictions and allocating new entries. The page

number is the upper portion of the branch PC, and therefore, maintains the locality of the

instruction stream at a coarser grain.

Prediction is performed as usual. The prediction mechanism is unaware of the exis-

tence of pages and is unaffected by the introduction of the second-level table. The only

change to the prediction mechanism is the change in the index hash function for this com-

ponent, so that each branch, based on its PC, accesses entries within a specific page. To

make a prediction, only the first-level table is accessed directly. Updates are also made

only on the first-level table. The second-level table is not accessed directly by the predic-

tion mechanism. Instead, a virtualization engine swaps pages between the two levels as

necessary.


The virtualization engine is responsible for maintaining the correct pages on the first-

level table and swapping pages between the two levels. For each branch, the engine de-

termines the page number, and on a page miss, fetches the corresponding page from the

second-level table. At the same time, the residing page, if dirty, is written back to the

second level.

Since the prediction mechanism is unaware of the existence of pages, while waiting

for the correct page to be fetched from the second-level table, a branch may inadvertently
use data on a page different from its own. We will show in Section 6.3.4 that the entry's
tag provides a second layer of filtering, so that the effect of aliasing is insignificant in such

cases. Similarly, a branch may allocate entries on a page different than its own. Conceptu-

ally, this allows repeating patterns to take advantage of entries allocated on other pages to

make a prediction while waiting for the correct page.

6.2 Architecture

This section describes the architecture of the paged design. Our design adds a second-level

table to the predictor and logic for swapping pages between the two levels of the predictor

tables as necessary. We assume that the first level predictor table has ports to facilitate

page swaps, separate from the original ports used for reading and updating it. Figure 6.1

shows the architecture of the paged design. We describe the components of the design

next.

L1 and L2 tables A single predictor component is augmented with a large second-level

table. We refer to the two levels of this predictor component as the L1 and L2 tables. We

also refer to this component as a whole as long table, or ltable for short, since it appears vir-

tually longer than the other predictor tables. The two levels are divided into pages. A page

is the granularity at which data is swapped between the two tables, and in a virtualized

design fits in a single cache block. The L1 table acts as a cache for the pages of the L2 that

were most recently accessed. In caching terminology, the L2 is inclusive and writeback;

the L1 table is a subset of the L2, and data is written back to the second level at the time a
new page is replaced in the L1.

Figure 6.1: Paging Scheme Architecture

Virtualization Engine The virtualization engine is responsible for maintaining pages in

the two levels. The address of the branch is used to determine the page number, and

the corresponding page is requested from the second-level table in case of a miss. On a

page miss, the prediction is not stalled, as the prediction mechanism is unaware of the

existence of pages. As a result, the paging mechanism is not on the critical path of making

a prediction. In addition, pages are swapped to keep the first-level predictor table in

sync with the current instruction block, regardless of whether ltable provides the final

prediction or not. Consequently, the sequence of pages swapped is a characteristic of the

application alone, and is independent of the configuration of the predictor.
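A minimal sketch of this page-miss handling is shown below; it is our own illustration (not the thesis implementation) and, for brevity, it records the new resident page immediately, whereas in the timing model the fetched data only becomes visible once the read request is serviced.

```python
# Illustrative sketch (assumed, not the thesis code) of the virtualization
# engine: on a page miss the dirty resident page is written back and the
# correct page is requested from the L2 table; predictions are never stalled.
class VirtualizationEngine:
    def __init__(self, l1_pages):
        self.resident = [None] * l1_pages   # page number currently in each L1 slot
        self.dirty = [False] * l1_pages
        self.read_requests = []             # outstanding fetches from the L2 table
        self.write_requests = []            # dirty pages being written back

    def on_branch(self, cycle, page_number, l1_slot, l1_page_data):
        if self.resident[l1_slot] == page_number:
            return                                           # page hit: nothing to do
        if self.dirty[l1_slot]:                              # write back the old page
            self.write_requests.append(
                (cycle, self.resident[l1_slot], list(l1_page_data)))
        self.read_requests.append((cycle, page_number))      # fetch the correct page
        self.resident[l1_slot] = page_number                 # simplification: install now
        self.dirty[l1_slot] = False
```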

Index Hash Function The index hash function is modified so that each branch allocates

and uses entries on a specific page. Figure 6.1 shows how the bits of the branch PC are

used to compute the index hash function. In the figure, p is defined as log2 of the page

size, b is defined as log2 of the instruction block size, a is log2 of the number of pages that


fit in the L1, and n is log2 of the number of pages that fit in the L2 and determines the

page number. The parameter X is used to vary the instruction block size as a function of

the page size, such that b = p+X. This variable is explained in more detail in Section 6.3.5.

For each branch, bits [b + n − 1 : b] determine the corresponding page number. The page

number is a function of the branch address only and preserves the locality of the branch

stream at a coarser grain. The position of a page in the L1 table is determined by the lower

a bits of the page number, bits [b + a − 1 : b] of the branch address.

The index to the table is the a bits of the branch PC that determine the location of the

corresponding page in the L1 table, concatenated with the page index (index within the

page). The page index hash function parallels the original predictor hash function [11],

with minor modifications. The hash function uses the same information bits (PC, global

and path histories) as the original hash function. If log2 of the number of entries on the

table is t, then the original hash function folds these bits to t bits. Instead in the new hash

function, the information is compressed to p bits, the log2 of the page size.
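To make the bit manipulation concrete, the sketch below computes the page number, the L1 slot, and the ltable index using the configuration adopted later in this thesis (32-entry pages, a 2K-entry L1 table, a 32K-entry L2 table, and X = 3); fold() is a trivial stand-in for the actual TAGE index hash, and all names are ours.

```python
# Illustrative bit slicing for the paged ltable index (names and fold() are
# our own; the real hash combines PC, global history, and path history).
P = 5           # log2(page size) = 5 for 32-entry pages
X = 3           # instruction block size = 2**X times the page size
B = P + X       # log2(instruction block size)
A = 6           # log2(pages resident in a 2K-entry L1 table)
N_BITS = 10     # log2(pages in a 32K-entry L2 table)

def fold(value, width):
    # Fold an arbitrary-width value down to `width` bits by XORing chunks.
    out = 0
    while value:
        out ^= value & ((1 << width) - 1)
        value >>= width
    return out

def ltable_lookup(pc, history):
    page_number = (pc >> B) & ((1 << N_BITS) - 1)  # bits [B+N_BITS-1 : B] of the PC
    l1_slot = page_number & ((1 << A) - 1)         # lower A bits place the page in L1
    page_index = fold(pc ^ history, P)             # hash compressed to P bits
    return page_number, (l1_slot << P) | page_index
```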

Auxiliary Table The auxiliary table is in charge of keeping tags for the pages currently

residing in the L1 table. The number of entries in this table is equal to the number of pages

that fit in the ltable first-level table. Each entry consists of the page number of the page

currently residing in the L1 table, as well as a single dirty bit, which indicates whether the

page has been modified since it was fetched. Only dirty pages are written back to the L2.

The auxiliary table has $2^a$ entries, where each entry consists of n bits for the page number
tag and 1 bit for the dirty bit. Therefore, the auxiliary table represents an additional
dedicated storage of $2^a(n + 1)$ bits. However, since we do not use the page numbers when

making a prediction (we use them only to fetch pages from L2), the auxiliary table is not

on the critical path of making a prediction.
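As a concrete instance, for the configuration used in our experiments (a 2K-entry first-level table with 32-entry pages, so a = 6, and a 32K-entry L2 table, so n = 10), the auxiliary table holds $2^6 = 64$ entries of 11 bits each, i.e., 704 bits (88 B) of additional dedicated storage.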


6.3 Design Space Exploration

In this section, we describe the experiments used to explore the design space of the paged

TAGE predictor and analyze the effects of each parameter. In order to allow these effects to

be shown more clearly, we use a fixed size for all of the predictor tables (regardless of the

number of tagged components), as opposed to using the delay-constrained configurations

established using CACTI in Section 4.2.4. In Section 6.4, we apply the parameters deter-

mined in the design exploration phase to the delay-constrained configurations in order to

show the benefits of virtualization.

In this section, the number of entries on the tagged components is fixed at 2K entries.

This is the size determined by CACTI for the range of 5 <= N <= 8. The reason for using

this number is that the baseline TAGE8,2K predictor achieves the best MPKI of all the

delay-constrained TAGEN configurations, as will be shown in Section 6.4. The size of the

bimodal table is 16K entries, unchanged from the original TAGE predictor.

A summary of the rest of the section is presented here:

• In Section 6.3.1, we show that the accuracy of TAGE can be improved by increasing

either the number of tables or the size of each table. This motivates the need for

virtualization.

• In Section 6.3.2, we show that increasing the size of a single predictor component

to 32 K entries can provide improved accuracy equivalent to doubling the size of all

predictor components. We also determine the table that provides the best benefit

from an increased capacity. This justifies our choice to virtualize one table only.

• In Section 6.3.3, we show the maximum potential improvement that can be achieved

through virtualization of a single predictor table by implementing an idealized model

of the predictor. This experiment establishes an upper bound on the accuracy im-

provement that we can expect.

• In Section 6.3.4, we evaluate the effect of restricting branches to a single mini-predictor


table page for various page sizes. We show that smaller page sizes have higher MPKI

compared with larger pages, because of a decrease in the frequency of tag matches.

• In Section 6.3.5, we show that increasing the instruction block size can be used to re-

duce the page swap rate (traffic to and from the L2 table), but at the cost of increasing

the MPKI.

• In Section 6.3.6, we examine the impact of the L2 table’s delay. We show that the

paging scheme improves accuracy even with high latencies, such as 20 cycles. We

present the accuracy of the final virtualized design, as well as the improvement it

represents over the baseline predictor.

• Finally, in Section 6.3.7, we virtualize a predictor with double the initial table size

as that used up to this point, in order to explore the sensitivity of the results to the

CACTI estimates used for the initial table sizes. We show that although the potential

for improvement is smaller in this case, virtualization is still beneficial.

6.3.1 TAGE Baseline and Potential for Virtualization

Figure 6.2 shows the average MPKI of the baseline predictor (TAGEN,2K) as a function of

N , the number of tagged tables. To show the potential for improvement as more capacity

is allotted to the predictor tables, the figure also shows the average MPKI of TAGEN,4K

and TAGEN,8K .

The figure shows that the accuracy of the predictor can be improved both by increasing

the table count and the table sizes. As more tables are added to the predictor, MPKI im-

proves, but at a decreasing rate. Similarly, increasing the number of entries on the tagged

components results in a better MPKI, but at a diminishing rate. The improvement achieved

going from TAGEN,2K to TAGEN,4K is larger than going from TAGEN,4K to TAGEN,8K , al-

though the latter case requires a larger extra capacity (extra 4N K entries as opposed to an

extra 2N K entries for the first case).

Figure 6.2: MPKI for Baseline TAGE at Three Different Sizes

6.3.2 Choosing Ltable

Since the paging mechanism is applied only to a single predictor table, we need to ensure

that increasing the perceived capacity of a single table provides enough improvement to

justify a paged design. We also need to determine the table that would utilize the increased

capacity most efficiently to improve accuracy. Intuitively, we expect one of the first few

tables (that use shorter history lengths) to benefit most from an increased capacity. This is

because the behaviour of most branches depends on the outcome of a few recent branches,

while a smaller portion of branches depend on very long history lengths. In this section,

we use an idealized model of the paged predictor to determine which table benefits most

from an increased capacity. In the next section, we use these results to show the maximum

potential gain of the paging scheme.

We model an idealized version of the predictor by replacing one of the predictor tables,

ltable, with a large table equal in size to the L2 table. In this model, ltable is not broken

into pages. This represents an idealized version of the paged predictor, since a branch

can directly access any ltable entry as necessary, without any delay. The accuracy of this
predictor represents an upper bound on the improvement that can be achieved with the
paging scheme.

Figure 6.3: MPKI for TAGE with a Single 32K-Entry Table

We set ltable size to 32 K entries to match the L2 size used in the rest of the experi-

ments. In all of our experiments, we use an L2 table with 32 K entries, for a total storage

of ~64 KB. The L2 size was chosen so that it is reasonably large compared to the L1.

Figure 6.3 shows the average MPKI of the idealized predictor for different values of

N . The x-axis shows which table is the ltable, while each line represents a different N .

The average MPKI of the baseline predictor (with no ltable) is also shown as the leftmost point for

comparison. In all cases, the first few tables benefit the most from an increased capacity.

The best MPKI is roughly achieved with ltable = 2 for 2 <= N <= 6 (except for N = 5), and

ltable = 3 for N > 6. These results, which we refer to as the default ltable configuration, are

summarized in Table 6.1. We use the default ltable configuration in the rest of this chapter.


Number of Tagged Tables (N)   ltable
1                             1
2 to 6                        2
7 to 12                       3

Table 6.1: Default Ltable Configuration

Figure 6.4: Maximum potential improvement achievable by adding a single 32K-entry table

6.3.3 Maximum Potential Improvement with the Paging Scheme

In this section, we use the idealized predictor with the default ltable configuration (Table

6.1) to show the upper bound on the potential improvement that can be achieved through

the paging scheme. To assess the efficiency of using a single large predictor table, we

compare this ideal improvement with the improvement that can be achieved by doubling

the size of all predictor tables.

In Figure 6.4, two sets of plots are shown: The solid plots show the average MPKI of

the baseline predictor (TAGEN,2K), the idealized predictor, and the TAGEN,4K predictor.


The dashed plots show the percentage improvement in MPKI for the latter two predictors

over the baseline (second y-axis).

The figure shows that the maximum potential improvement for the paging scheme is

in the range of ~16% to ~44% for different N . As N increases and the dedicated capacity

of the predictor grows larger, the potential improvement decreases.

Next, we evaluate the efficiency of using a single large predictor table. The idealized

predictor adds a fixed extra storage of 62 KB to a single predictor table. In comparison,

the TAGEN,4K predictor adds an extra 2 K entries (~4 KB) to each tagged table, for a total

of 4N KB. It can be observed from the figure that a single large table does not represent

the most efficient use of storage. Although the idealized predictor achieves much better

accuracy for smaller N , the accuracy of the two predictors starts to grow closer. At N = 12,

the extra ~48 KB of TAGEN,4K spread across all the tables results in better accuracy than

the extra ~62 KB added to a single predictor table for the idealized paged predictor.

Although the extra capacity is not used most efficiently, the margin of improvement is

large enough to justify the choice of exploring paging mechanisms with a single L2 table.

Using a single L2 table is less complex than designs that split the extra storage into
multiple L2 tables. We leave the exploration of such designs for future work.

6.3.4 Page Size

In the previous section, we showed the maximum improvement that can be achieved with

the paging scheme in the absence of page size and delay constraints. In this section, we

add page size constraints to the idealized predictor model by dividing the table into pages

and restricting branches to access entries within a single page. The entries in the table can

still be accessed directly without any delay. We evaluate the effect of page size on accuracy,

as well as on the frequency of page swaps. We conclude the section by providing a detailed

analysis of the behaviour of the predictor in the presence of page size constraints. In this

section, the instruction block size is equal to the page size.

Figure 6.5: Effect of page size on MPKI

Effect of Page Size on MPKI

In order to show the impact of page size on the predictor accuracy, we run experiments

for ltable having a single page of 32 K entries up to 32 K pages of 1 entry each. Figure 6.5

shows the predictor’s MPKI as a function of the page size. Each line in the figure shows

the MPKI for a different N . The figure shows that smaller page sizes have a higher MPKI

than larger page sizes. However, the effect of page size on MPKI is minor for pages of 256

entries and higher. For smaller page sizes, the MPKI starts to grow at an increasing rate,

and the effect is generally more pronounced for smaller N .

One reason that smaller page sizes adversely affect MPKI is because history bits are

folded into fewer bits to form the page index. History serves two purposes: First, it carries

information about the past behaviour of branches, and the fewer bits it is compressed to,

the higher are the chances of information loss. Second, XORing the branch address with

history to index a predictor table helps to allow even distribution of entries on the table.

The fewer bits of history that are XORed with the branch address to form the index, the


higher the chances of uneven distribution of entries on the table. Between the two effects,

the second one is especially important, and is illustrated in the figure by comparing the

predictor accuracy for 1-entry and 2-entry pages. In the case of 1-entry pages, a branch

always maps to the same entry on the predictor table, based solely on its PC. For 2-entry

pages, a branch is allowed to access one of two entries on the table. Since the tag width

remains constant for all page sizes, the same amount of history information is embedded

in each entry’s tag for both cases. It can be seen in the figure that the MPKI for 1-entry

pages is much worse than if only a single history bit was used in the index hash function,

as in the case of 2-entry pages. For this reason, using too small a page size can have a

detrimental effect on accuracy, even if the aggregate capacity of the predictor is large.

Figure 6.6: Maximum potential improvement with a page size of 32 entries

Number of Tagged Tables (N)   Ltable   Global History Length
1                             1        8
2 to 5                        2        16
6                             2        11
7                             3        22
8                             3        17
9                             3        14
10                            3        12
11                            3        11
12                            3        10

Table 6.2: Ltable History Lengths

Choosing Page Size

To simplify the design of the virtualized predictor, we choose the page size such that it fits

in a single 64 B cache line, the typical line size of L2 caches today. Since each entry is ~2 B,

this corresponds to a 32-entry page size.

Figure 6.6 shows the average MPKI of the predictor with 32-entry pages. Also shown

for comparison are the average MPKI of the baseline predictor (TAGEN,2K) and the ide-

alized predictor (with no page constraints, Section 6.3.3). The two dashed plots show the

percentage improvement in the accuracy of the idealized predictor and the paged predic-

tor over the baseline. The difference between the two lines represents the improvement

loss due to page size constraints.

The figure shows that the accuracy of the paged predictor is close to the idealized

predictor accuracy. The largest difference between the two designs is at N = 7. One reason

N = 7 suffers most from paging constraints is because N = 7 uses the longest history for

ltable (Table 6.2).

In conclusion, index limitations imposed by page size constraints can have a large nega-

tive impact on accuracy, even for a predictor with tagged entries and with a large aggregate

capacity. Tables with smaller history lengths make better candidates for virtualization, as

the adverse effects of paging are smaller for shorter history lengths. Finally, the impact of

dividing a single table into pages is less noticeable the more tables the predictor contains.

Figure 6.7: Effect of page size on page swap frequency

Effect of Page Size on Page Swap Frequency

In this section, we examine the effect of page size on the page swap frequency. The page

swap frequency determines the impact of virtualization on the traffic to the L2 cache. In-

creasing L2 traffic is undesirable for performance and power efficiency reasons. However,

past work has shown that today’s L2 caches can absorb considerably higher traffic than

that induced by typical applications [10]. We calculate the page swap frequency as the

total number of page-ins to the L1 over the total number of branch predictions made (the

page swap frequency has a unit of swaps per conditional branch).

Figure 6.7 shows the page swap frequency as a function of the page size for each bench-

mark, as well as on average. The page size is varied from 2 K entries to 16 entries per page.

Larger page sizes are not shown since the L1 table is 2K entries, while smaller page sizes

are not considered due to their high MPKI. The page swap rate is independent of N , since

it is determined by the stream of branch addresses (and, consequently, pages) encountered

in the program. The page swap rate is a property of the benchmark itself and is a function


of the entropy present in the branch address bits.

The figure shows that as page sizes become smaller, the swap rate first decreases and

then starts to grow. Two opposing effects are at play: with smaller page sizes, more pages

can fit in the L1 table, decreasing the need for a swap. This is the dominant effect up

to a page size of 512 entries. On the other hand, as pages become smaller, more bits are

used to determine the page number. These bits come from the lower portions of the PC,

which generally have higher entropy, resulting in a higher entropy in the sequence of pages

encountered.

At the page size of interest for our virtualized design (32-entry pages), the average page

swap rate is 32.8%. This means that a page would have to be swapped approximately every

three conditional branches. Lowering the swap rate is beneficial for two reasons: First, it

reduces the impact of virtualization on the traffic to the L2 cache. Second, a design with a

lower swap rate is potentially less likely to be affected by the L2 table’s large latency. We

address the issue of lowering the page swap frequency in Section 6.3.5.

Analysis of Effect of Page Size Constraints

In this section, we provide an analysis of the impact of paging on the predictor by examin-

ing the specific behaviour of ltable and its contribution to final prediction in the presence

of page size constraints. The overall MPKI results presented in this section are the same

as those in Figure 6.5, shown for each benchmark and at specific N to aid our analysis.

Figure 6.8 shows the breakdown of the MPKI contributed from each of the bimodal

and the single tagged table at different page sizes, for N = 1. The graph also shows the

percentage of branches for which the tagged component provides the final prediction,

with the remainder of the predictions coming from the bimodal table.

The MPKI for a predictor is the sum of the MPKI contribution from each of the predic-

tor tables. In addition, the MPKI of each predictor table and the table’s misprediction rate

are related by the following equation,

$MPKI_i = \frac{B}{10 \times I} (C_i \times R_i)$    (6.1)

where $MPKI_i$ is the MPKI contribution of table i, $C_i$ is the contribution rate of the
table (in percent), $R_i$ is the misprediction rate of the table (in percent), and B and I are

the total number of conditional branches and instructions in the benchmark, respectively.

The contribution rate of a table is calculated as the number of times the table provides the

final prediction (regardless of the outcome) over the total number of predictions made. In

addition, for a single benchmark, this relation holds:

$\frac{MPKI_1}{MPKI_2} = \frac{C_1}{C_2} \times \frac{R_1}{R_2}$    (6.2)

Therefore, the relationship between the misprediction rate of a table at two different

page sizes can be inferred from the graph, based on the relationship between the MPKIs

and the contribution rates.
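As a purely illustrative instance of Equation 6.1 (the numbers are hypothetical, not taken from our measurements): for a trace with I = 2 billion instructions and B = 200 million conditional branches, a table with a contribution rate of $C_i$ = 25% and a misprediction rate of $R_i$ = 4% contributes $MPKI_i = \frac{200 \times 10^6}{10 \times 2 \times 10^9} \times (25 \times 4) = 0.01 \times 100 = 1.0$ mispredictions per kilo-instruction.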

The figure shows that as pages get smaller, the overall MPKI starts growing compared

with a non-paged design. However, the MPKI contribution of the tagged table starts to

decrease, while the bimodal table’s MPKI contribution increases. At the same time, the

contribution rate of the tagged component drops. This means that the decrease in the

MPKI of ltable is due to the drop in the contribution of the table in providing the final

prediction (and not because of an increased prediction accuracy of this table, for example).

Although not shown, the prediction accuracy of the table, which can be inferred from the

graph according to Equation 6.2, stays relatively constant. Therefore, limiting a branch

to the entries on a single page does not result in an increased level of aliasing on the

tagged table. Instead, the predictor finds a tag match less frequently and, consequently,

the contribution of the table to the final prediction drops. As this happens, the burden of

providing the prediction shifts to the bimodal table, which often fails to predict these
branches correctly.

This finding indicates that the index and tag hash functions in TAGE can identify


branches well, since the increase in MPKI with smaller page sizes is not due to a poorer

accuracy of the table itself. Other predictors, especially those with tagless entries, would

not be as readily compatible with a paging scheme.

Figure 6.9 shows the same breakdown of MPKI contribution for N = 8. Similar to

the previous graph, a breakdown of the MPKI contribution from each of the ltable and

bimodal tables is shown. Also shown is the aggregate MPKI contribution from the 7 re-

maining tagged components. Again, a similar pattern is present: as the page size shrinks,

the percentage of branches for which the ltable provides the final prediction decreases,

resulting in a lower contribution to the overall MPKI. As this table fails to provide pre-

dictions, the responsibility of making a prediction falls on the other tagged tables, as well

as the bimodal table. The increase in the overall MPKI is a result of the increase in the

MPKI contribution of both the bimodal table and the remaining tagged tables. However,

the combined increase coming from the remaining 7 tagged components is less than that

coming from the bimodal table. In addition, the increase in the MPKI due to paging is not

as severe as in the previous case (N = 1), because the other tagged components are better able
to compensate for the decline in the contribution of ltable to the final prediction.

As a sidenote, an observation can be made regarding the contribution rates of ltable at

different values of N . With a single tagged component (N = 1), the percentage of branches

for which ltable provides the final prediction starts at more than 50%. The contribution

rate decreases with more than a single tagged component, down to approximately 22%

for N = 8. The contribution rate of ltable is not only a function of the number of tagged

components on the predictor, but also depends on the history length used for the table

(Table 6.2). The contribution rate of each table depends on the number of branches that

can be satisfactorily predicted with the history length assigned to the table.

The predictor tables do not contribute equally to the final prediction. A minimum of
40% of the predictions come from the bimodal table, as will be seen in Section 6.3.6. With
the contribution rate of ltable at around 25%, every fourth branch is predicted by ltable
on average. When considering timing effects, this means that there is on average a three-
branch gap between two predictions provided by ltable, allowing time for the correct page
to be fetched from the second-level table. Section 6.3.6 provides more detail on how the
contribution rate of ltable factors into the sensitivity of the predictor to L2 delay.

Figure 6.8: Effect of page size on predictor tables’ MPKI contribution (N=1)

Figure 6.9: Effect of page size on predictor tables’ MPKI contribution (N=8)

6.3.5 Instruction Block Size

In the previous section, we assumed that the instruction block size and the page size are

equal. For our virtualized design, we chose a page size of 32 entries, and showed that

for this page size, a page would have to be swapped into the L1 table once every three

conditional branches.

In this section, we vary the instruction block size in order to lower the page swap

frequency. We use the parameter X (Figure 6.1) to vary the instruction block size as a

function of the page size. We evaluate the effect of varying X on page swap frequency as

well as on the predictor accuracy. We comment on the L2 table utilization rate for different

instruction block sizes, by looking at the fraction of pages that are never swapped into the

L1. We will show that using too large an instruction block can lead to wasting a large

portion of the L2.

The predictor model remains unchanged from the previous section.

Effect of Instruction Block Size on Page Swap Frequency

Figure 6.10 shows the average page swap frequency as a function of X, for three different

page sizes. For a virtualized design, we are interested in 32-entry pages, but we also show

the effect of X on a larger (128-entry) and smaller (16-entry) page size for completeness.

For all three page sizes, a larger X leads to a lower page swap frequency. As X is

increased, and more branches share a page, the page swap frequency drops, but at a slower

rate. Therefore, a larger instruction block size can be used to reduce page swap frequency.

Next, we evaluate the impact of using larger instruction blocks on the predictor’s MPKI.

Figure 6.10: Effect of instruction block size on page swap frequency (instruction block size = 2^X × page size)

Effect of Instruction Block Size on MPKI

Figure 6.11 shows the effect of varying X on the predictor MPKI, for the three page sizes

used in the previous experiment. The effect is shown on a predictor with N = 12 as a

representative, although the same effect is present for other values of N .

The figure shows that for all three page sizes, a larger X leads to a worse MPKI. As

X is increased, and more branches share a page, the MPKI grows at an increasing rate.

Therefore, using a larger instruction block size to reduce page swap frequency comes at

the cost of an increased MPKI.

Based on the two figures (Figures 6.10 and 6.11), smaller values of X represent a better

tradeoff in increasing MPKI for a lower page swap frequency, as they result in a larger drop

in page swap frequency for a smaller increase in MPKI.

Figure 6.11: Effect of instruction block size on MPKI (instruction block size = 2^X × page size, N=12)

Effect of Instruction Block Size on L2 Table Utilization

In this section, we examine the effect of the instruction block size on the utilization of the

L2 table pages. As X is increased and higher portions of the PC are used for the page

number, some page numbers may not be encountered in the program run. Figure 6.12

shows the percentage of such pages as a function of X, for the three page sizes used in the

previous experiments. Similar to the page swap frequency, this number is a property of

the benchmark and is independent of N .

The figure shows that up to X = 3, almost all pages are accessed for all three page sizes.

As X is increased, the fraction of pages that are never accessed grows larger, especially for

smaller page sizes. This decrease in page utilization for larger X values is one of the reasons

for the MPKI increase observed in Figure 6.11. For a certain L2 size, larger instruction

block sizes lead to a lower page swap rate, but a portion of the pages remain unused.

Figure 6.12: Effect of instruction block size on the fraction of pages never accessed (instruction block size = 2^X × page size)

An alternative method to reduce the page swap frequency with less impact on the page
utilization rate is to incorporate branch history in the page fetch mechanism. For example,

two bits of history can be XORed with the upper bits of the page number (still determined

solely by the upper bits of the PC) to allow each page four dynamic instances. Page swaps

are triggered by page number changes, but at fetch time, one of the four instances of the

page is fetched based on the current history. This scheme has the same page swap rate as

a scheme with no history, but a lower percentage of pages remain untouched. We leave

investigation into this method for future work.
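Purely as an illustration of this proposed variant (which we do not evaluate), the selection of a dynamic page instance could look as follows; the function name and the exact placement of the two history bits are our own choices.

```python
# Illustrative sketch of the proposed (future-work) variant: two global
# history bits are XORed into the upper bits of the page number so that each
# static page has four dynamic instances in the L2 table. Page swaps are
# still triggered only by changes in the PC-derived page number.
def page_instance(page_number, global_history, n_bits=10):
    hist2 = global_history & 0b11                  # two most recent outcome bits
    return page_number ^ (hist2 << (n_bits - 2))   # pick one of four instances

# The same static page maps to different instances under different histories.
print(page_instance(0x3A, 0b01), page_instance(0x3A, 0b10))
```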

Choosing Instruction Block Size

For our virtualized design, we choose X = 3, resulting in an instruction block size eight

times larger than the page size. This means that branches in a 256 B block of code share

a single 32-entry page on the predictor table. Figures 6.10 and 6.11 showed that the first

few values of X represent the best tradeoff in MPKI versus page swap frequency. For X = 3,

almost all pages are utilized, while the page swap frequency is cut down to a third of its

original value for X = 0 (from ~33% to ~11%).

Figure 6.13: Maximum potential improvement with an instruction block size of 256 B (X = 3)

Figure 6.13 shows the average MPKI of this predictor for different N . Also shown for

comparison are the average MPKI of the baseline predictor (TAGEN,2K) and the predic-

tor with equal instruction block and page sizes (X = 0). The two dashed plots show the

percentage improvement in the accuracy of the predictor with the two different instruc-

tion block sizes over the baseline. The difference between the two lines represents the

improvement loss due to using a larger instruction block size.

The figure shows that using a 256 B instruction block size removes ~1%-3% of the

improvement that can be achieved using a 32 B instruction block size. The improvement

loss decreases with larger values of N .

In the remainder of this chapter, we use a page size of 32 entries and an instruction

block size of 256 B, unless otherwise stated.


6.3.6 L2 Latency

Up to this point, we have used a predictor model that allows access to the L2 table without

any delay. In this section, we evaluate the effect of L2 latency on the accuracy of the paged

predictor. The timing model described in Section 4.1.1 is used to emulate the effect of L2

delay.

In this section, we first demonstrate the effect of latency on the predictor accuracy. In

order to measure the impact of the delay magnitude, we vary latency from 0 (no latency) to

20 cycles in increments of 5 cycles. We find that delay has a negative impact on accuracy;
however, the impact is not proportional to the magnitude of the delay. Next, we show that

the virtualized design improves accuracy even with the high latencies used to represent

the L2 cache latency. We assume a latency of 20 cycles for the L2 cache (the access delay

of a dedicated L2 table of the same size is estimated by CACTI as 2 cycles). In addition,

we look at the L1 miss rate and page swap frequency for different latencies. We show that

larger delay increases the L1 miss rate, but not the page swap frequency. We conclude the

section by providing a detailed analysis of the behaviour of the predictor in the presence

of delay constraints.

Effect of L2 Latency on MPKI

Figure 6.14 shows the paged predictor’s percentage improvement in MPKI over the base-

line predictor (TAGEN,2K) for different values of L2 latency, and N . For this figure, we

opted to show improvement values rather than MPKI, since the MPKI values were very

close. The percentage improvement is a direct function of the MPKI.

The figure shows that increasing latency has a negative impact on accuracy improve-

ment. The largest drop in improvement is between no latency and a latency of 5 cycles.

Larger latencies result in worse improvement, but the drop is smaller than that of going

from no latency to a 5-cycle latency. The results for the L2 cache latency (20 cycles) are

summarized next.

Figure 6.14: Effect of latency on percentage improvement in MPKI

Page Size                32 Entries
Instruction Block Size   256 Bytes
L2 Size                  1K Pages
L2 Latency               20 Cycles

Table 6.3: Paging Scheme Virtualized Design Configuration

Virtualized Design Accuracy and Improvement Over Baseline

In this section, we present MPKI results for the virtualized design and provide a compari-

son with the MPKI of the baseline predictor (TAGE_N,2K). In addition, in order to measure

the improvement loss due to the L2 table latency, we compare the accuracy of the virtual-

ized predictor with that of a predictor with a zero-latency L2 table. The parameters of the

virtualized design are summarized in Table 6.3.

Figure 6.15 shows the average MPKI of the virtualized predictor for different N . Also

shown for comparison are the average MPKI of the baseline predictor and the predictor

with a zero-latency L2 table. The two dashed plots show the percentage improvement in

MPKI for the two different values of L2 latency over the baseline. The difference between

[Plot: average MPKI (left axis) for the baseline TAGE_N,2K and for the paged predictor with a zero-latency and a 20-cycle L2 table, plus average % improvement over the baseline (right axis), versus N.]

Figure 6.15: Potential improvement for virtualized design with 2K-entry dedicated tables

the two lines represents the improvement loss due to latency constraints. The figure shows

that latency constraints negatively impact accuracy, in the worst case removing up to a

sixth of the improvement with a zero-latency L2, at N = 1.

Next we compare the accuracy of the virtualized design with that of the baseline pre-

dictor. We described in Chapter 4 that virtualization can be used either to improve accu-

racy while keeping dedicated storage constant, or to reduce dedicated storage costs while

maintaining the same prediction accuracy. These two goals are evaluated next.

The figure shows the accuracy improvement over the baseline for each N. The improvement is larger for smaller N, where the extra capacity of the L2 table is comparatively larger than the aggregate dedicated L1 capacity (the aggregate L1 capacity at each N is ~2N KB). The improvement ranges from 30.5% for N = 1 to 10.6% for N = 12.

It can be observed from the figure that the virtualized design can also be used to reduce

the number of predictor tables, while maintaining the same accuracy as a non-virtualized


design. Comparing the MPKI values for the virtualized and baseline predictors, we see

that for N = 2 and N = 3, one table can be eliminated in the virtualized design, while

achieving better accuracy. For larger N , at least two tables can be removed while main-

taining accuracy. Depending on the number of dedicated predictor tables, 20% to 50% of the tagged tables can be removed while achieving the same or better accuracy than a non-virtualized design.

Effect of L2 Latency on L1 Miss Rate and Page Swap Frequency

In this section, we evaluate the effect of latency on the L1 miss rate, as well as on the

page swap frequency. The L1 miss rate is calculated as the number of instances where a

wrong page is accessed on the L1 table over the total number of predictions made. The

page swap frequency is measured as the total number of page swaps into the L1 over the

total number of branch predictions made, and determines the traffic to the L2 cache. For

a predictor with no latency, the two metrics are equivalent. For larger latencies, we expect

the L1 miss rate to be higher than the page swap frequency, as a branch may miss multiple

times before the correct page is fetched. Both metrics are properties of the benchmark and

are independent of N .
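Stated compactly (these restate the definitions above; no new quantities are introduced):

$$\text{L1 miss rate} = \frac{\#\{\text{L1 accesses that find the wrong page}\}}{\#\{\text{predictions}\}}\,, \qquad \text{page swap frequency} = \frac{\#\{\text{pages swapped into the L1}\}}{\#\{\text{predictions}\}}$$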

Figure 6.16 shows the L1 miss rate for each benchmark, as well as on average, for

different values of latency. The figure shows that the L1 miss rate is higher for larger

latencies. At latencies of 5 and 20 cycles respectively, the L1 miss rate is approximately

double and triple the miss rate for no latency.

For the L1 miss rate, a miss is counted regardless of whether the access is used for the

final prediction or not. As a result, although the miss rate is high, the percentage of times

a miss affects prediction is smaller. If 20% of the final predictions are provided by ltable,

then at a latency of 20 cycles (with a 31% L1 miss rate), the fraction of predictions in which a miss affects the final prediction is 0.20 × 0.31 = 0.062, approximately 6%.

Figure 6.17 shows the page swap frequency for each benchmark, as well as on aver-

age, for different values of latency. It can be observed from the figure that the page swap

[Plot: L1 miss rate per conditional branch versus L2 latency (0, 5, 10, 15, 20 cycles) for Apache, TPC-C, TPC-H, and the average; the average rises from 11.2% at zero latency to 31.1% at 20 cycles.]

Figure 6.16: Effect of latency on L1 miss rate

frequency is equal to the L1 miss rate for a zero-latency L2 table. For other latencies, the

page swap frequency is smaller than the L1 miss rate, and is approximately equal to the

ideal page swap frequency. Although for higher latencies, the predictor incurs multiple

misses before a page is fetched (higher L1 miss rate), the number of page swaps remains

relatively constant.

The figure shows that the page swap frequency decreases marginally with longer la-

tencies. This effect can be explained with the following example: If the L1 table has a

capacity of a single page, and currently holds page A, for a sequence of page accesses

{B,A}, the sequence of page-ins for the zero-latency predictor is also {B,A}. For a predictor

with a non-zero latency, the sequence of page-ins is {B}: the fetch of page B has not completed by the time of the second access, so page A still resides in the L1 and does not need to be paged in again.

Analysis of Effect of L2 Latency Constraints

In this section, we discuss different factors that influence the sensitivity of the paged de-

sign to the L2 table delay. We examine how latency affects the breakdown of MPKI contri-

[Plot: page swap frequency per conditional branch versus L2 latency for Apache, TPC-C, TPC-H, and the average; the average stays near 11% (11.2% down to 10.8%) across latencies.]

Figure 6.17: Effect of latency on page swap frequency

bution from each of the predictor tables by comparing a zero-latency L2 with a 20-cycle

L2 delay. The results are presented for a paged design with a 32 B instruction block size.

Figure 6.18 (a) shows the breakdown of the MPKI contributed from the bimodal table,

ltable, and the other tagged tables for different N at two L2 latencies, 0 and 20. The figure

shows that for all N , the MPKI contribution of ltable is smaller at a latency of 20, while

the MPKI contribution of bimodal is larger. The aggregate MPKI contribution of the other

tagged tables is also generally smaller compared with zero-latency L2.

The difference in MPKI at different latencies depends on several factors. One impor-

tant factor is the original ltable contribution (at latency 0), which affects the sensitivity of

the predictor to latency for two reasons: First, the higher the number of predictions made

by this table, the higher the probability that predictions will have to be delegated to other

tables in case of adverse effects of latency. Second, ltable’s original contribution rate de-

termines the spacing between the time two predictions are required from the table. The

higher the contribution rate, the smaller the number of cycles between two such predic-

tions, and the less delay that the predictor can tolerate in fetching pages. Figure 6.18 (b)

[Plot, one panel pair per benchmark (Apache, TPC-C, TPC-H), versus N: (a) MPKI contribution and (b) contribution rate to the final prediction, broken down into ltable, the other tagged tables, and the bimodal table, at L2 latencies of 0 and 20 cycles.]

Figure 6.18: Effect of latency on each predictor table's contribution to MPKI and to the final prediction. (a) MPKI contribution; (b) contribution to the final prediction


shows the contribution of each of the ltable, bimodal table, and other tagged tables to the

final prediction at the two latencies.

The figure shows that the smaller the original contribution of ltable to final prediction

(at 0 latency), generally the less the difference there is between ltable’s MPKI contribution

at the two latencies in Figure 6.18 (a). The number of tagged tables also influences this

difference, since it determines how the new misses are offloaded to the other predictor

tables. The effect of latency can be explained most clearly at N = 1, where any misses in

the tagged table are offloaded to the bimodal predictor. The figure shows that the effect of

latency is largest at N = 1. The original contribution of ltable is also largest at this point.

The smallest difference in MPKI contribution of ltable (and the overall MPKI) for the two

latencies is at N = 7, and Figure 6.18 (b) shows that the contribution of ltable is smallest

at this point.

The number of cycles between the time a page is fetched and when the first prediction

from that page is needed is another factor that impacts the results. In a page fault scheme, a

page is normally fetched at the time when its data is needed. However, since the predictor

has multiple tables, the data in a requested page is not necessarily needed at the time

the request is made. If ltable is not the component providing the final prediction for the

branch that caused the page fetch, then the predictor is not affected by the latency to fetch

the page. A simple example of this case is if we consider page fetch and the contribution

of ltable to final prediction as periodic functions of time with the same period but a phase

difference of 20 cycles. In this case, the introduced latency would have no effect on the

predictor MPKI.

Finally, our choice to allow updates and allocates on a wrong page also plays a role

in the impact of latency on prediction accuracy. This effect can be explained with the

following example: If the L1 table has a capacity of a single page, and the L2 table contains

two pages, then an L1 miss rate of 100% does not change the accuracy of the predictor. In

this case, each branch always uses the other page to make predictions and allocate entries

on the table. Therefore, the accuracy of the predictor remains unaffected, since the only

[Plot: average % improvement over the baseline versus N for TAGE_N,8K and for the idealized predictor.]

Figure 6.19: Maximum potential improvement achievable by adding a single 32K-entry table – 4K-entry dedicated tables

difference is the new mapping of branches to pages.

6.3.7 Sensitivity to Initial Table Sizes

In this section, we evaluate the sensitivity of the results to the initial table sizes by virtualizing, as an example, a predictor whose tables are twice the size of those used up to this point.

For the baseline predictor, we fix the number of entries on the tagged components at 4K

entries (TAGE_N,4K). The same parameters as the previous section (Table 6.3) are used for

the virtualized design.

Figure 6.19 shows the upper bound on the potential improvement that can be achieved

through the paging scheme for this table size, using the idealized predictor of Section

6.3.3. Also shown for comparison is the improvement that can be achieved by doubling

the size of all predictor tables (TAGE_N,8K). This figure is similar to Figure 6.4 for the

previous table size.

Comparing Figures 6.19 and 6.4 for the two table sizes, we see that the improvement

[Plot: stacked bars versus N showing (top to bottom) the improvement lost to page size, instruction block size, and L2 latency constraints, with the remaining segment being the improvement available from the final virtualized design; each bar sums to the maximum potential improvement of the idealized predictor.]

Figure 6.20: Improvement loss due to page size, instruction block size, and latency constraints – 4K-entry dedicated tables

achieved by doubling the table sizes is smaller in this case, although the extra capacity

added to the predictor is twice as much for each N .

Similarly, the maximum potential improvement for the paging scheme is smaller for

this table size. For example, at N = 6, the maximum potential improvement is 16%, com-

pared to 28% for the previous table size. At N = 12, the maximum potential improvement

for this table size is 8%, half of that for the previous table size. Therefore, a larger size

for the first level reduces the maximum potential for virtualization even in the absence of

constraints.

Figure 6.20 summarizes the effect of page size, instruction block size, and latency con-

straints on the accuracy of the predictor. The segments show (from top to bottom) the

percentage improvement lost due to page size, instruction block size, and latency con-

straints respectively. The remainder shows the percentage improvement of the final virtu-

alized design over the baseline. The sum of the bars adds up to the maximum potential

[Plot: average MPKI (left axis) of the non-virtualized and virtualized delay-constrained predictors, and average % improvement of the virtualized over the non-virtualized design (right axis), versus N; the improvement is negative at N = 1 and N = 2.]

Figure 6.21: Accuracy improvement for virtualization of delay-constrained configurations

improvement.

Comparing the figure with Figure 6.15 for the previous table size, we observe that

the improvement achieved through virtualization is much smaller in this case. Although

small, virtualization is still beneficial, providing an improvement in the range of 19.2%

for N = 1 to 4.1% for N = 12.

6.4 Virtualized Design Accuracy and Improvement Over

Baseline for Delay-Constrained Configurations

In this section, we present MPKI results for the virtualization of the delay-constrained

configurations (Table 4.4), and provide a comparison with the non-virtualized predictor.

For the delay-constrained configurations, the size of the first-level tagged tables starts at

16 K entries at N = 1, and is successively cut in half at N = 2, N = 4, and N = 8. The

[Plot: average L2 read rate and L2 write-back rate (dirty pages) per conditional branch versus N for the delay-constrained configurations.]

Figure 6.22: Effect of virtualization on L2 traffic for delay-constrained configurations

bimodal table is 64 K entries (its maximum delay-constrained size). The same parameters

as the previous section (Table 6.3) are used for the virtualized design.

Figure 6.21 shows the average MPKI of the virtualized and non-virtualized predictors

for different N . The bars show the percentage improvement in MPKI over the baseline.

For each group of fixed L1 size, the improvement is larger for smaller N . This is because

the L2 capacity is comparatively larger than the aggregate dedicated L1 capacity.

For N = 1, virtualizing the predictor results in a worse MPKI. At this point, the L2

size is only twice the size of the L1 (16 K entries). We also saw in Section 6.3.7 that larger

L1 sizes are affected more by page size constraints. The same reasoning applies to N = 2,

where the L2 is four times larger than the L1. Starting at N = 3, we see positive improve-

ment in accuracy. At this point, the L2 is eight times larger than the L1 table. This suggests

that in our scheme, the L2 table has to be much larger than the L1 to compensate for losses

due to page size constraints.

The best accuracy for the non-virtualized design is achieved at N = 8. For this point,

the accuracy can be improved by 8.7% using the virtualized design. Alternatively, a vir-


tualized design with 5 tagged tables can be used to achieve the same accuracy as the best

baseline predictor. This reduces the dedicated storage of the predictor by ~12 KB, 25% of

the original dedicated predictor size.

To show the effect of virtualization on the L2 traffic, Figure 6.22 shows the L2 read

and write rates for each N , measured as the total number of L2 reads or writes over the

total number of conditional branches. The measured rates have a unit of read or write per

conditional branch. The L2 read rate is the same as the page swap frequency in previous

sections. L2 writes occur only for dirty pages, and as a result, the L2 write rate is smaller.

The figure shows that the maximum read and write rates are ~13.4% and ~5.2% respectively, both at

N = 12. For the virtualized design at N = 8, the L2 read rate is ~10.8%, while the write

rate is ~3.7%.

6.5 Summary

In this chapter, we presented the paging scheme, where the entries in the predictor table

are divided into pages shared by a group of branch instructions in close proximity in the

program code space. Exploring the design space of this scheme, we found that virtualizing

shorter history length tables provides the best benefit in improving prediction accuracy.

We found that smaller page sizes have higher MPKI compared with larger pages. We

demonstrated that by sharing a page among branches in a larger instruction block, the

traffic to and from the L2 can be decreased almost arbitrarily, but at the cost of an increased

MPKI. Virtualizing only a single predictor component allowed more tolerance to the L2

delay, as the table does not contribute to every prediction. Finally, we showed that the

virtualized design remains effective even for larger dedicated table sizes, although page

size limitations are worse in this case.


Chapter 7

Packaging Scheme

In this chapter, we present a different virtualization scheme for the TAGE predictor, the

packaging scheme. We present preliminary data demonstrating the potential of the scheme

by showing the improvement margin available before considering second-level predictor

table delay. We leave investigation into the effects of delay and methods for mitigating

these effects for future work.

7.1 Design

In the packaging scheme, the second-level predictor table stores packages, similar to the

Phantom-BTB design presented in Section 2.2. These packages contain predictor entries

that can be used as part of the prediction mechanism. For a packaging scheme, two is-

sues need to be addressed: the grouping of data in the second-level table and the fetch

trigger. The grouping refers to the organization of data in packages that are fetched from

the second-level table according to a fetch trigger. Phantom-BTB is an example of such a

packaging scheme. In Phantom-BTB, packaging is done on BTB table misses. As branches

miss in the dedicated BTB, they are packaged into a temporal group. For the fetch trigger,

each temporal group is associated with a region of code surrounding the branch previous

to the group that also missed in the BTB. Any subsequent misses for a branch within this

region triggers the prefetch of the temporal group.

In our packaging scheme, we borrow the idea of using the upper bits of the branch

address to associate branches with pages in the paging scheme. Similarly, we use the upper

bits of the branch PC to associate a branch with a specific package in the second-level

table. Whenever an entry is evicted from the dedicated predictor table, it is grouped into its


corresponding package. In order to determine which package an evicted entry should be

placed in, we need to know the address of the branch to which the entry belongs. However,

in TAGE and in branch outcome predictors in general, both the index to the predictor table

and the entries’ tag fields are a hash of the branch address and other information such as

global and path histories. Therefore, the address of the branch can not be inferred from

the entry’s tag or its position in the predictor table. As a result, we augment the entries

in the predictor table with a field containing the package number of the branch that first

allocated the entry. This package number is used to determine the package to which the

entry should be placed, in the event of eviction.

Based on the address of the current branch, a corresponding package is fetched from

the L2 table, so that its entries can be used as part of the prediction mechanism. In general,

the two packages used for storing evicted entries (determined based on the address of the

branch that initially allocated the entry) and for contributing to prediction (determined

based on the current branch address) are two different packages.

The packaging scheme takes advantage of the spatial and temporal locality present in

the branch address stream. The entries in a package are evicted entries belonging to a set

of branches that are close together in the instruction address space (have the same upper

PC bits). Based on the current portion of the code, the corresponding package is fetched

and its entries are used to contribute to the final prediction.

Similar to the paging scheme, packaging is applied to a single tagged component in

the TAGE predictor, which we call the ptable. The package number field can be stored in

a separate array so as not to increase the access delay of the predictor table. Since this

field is only used to group evicted entries into their corresponding packages, it is not on

the critical path of making a prediction. Alternatively, it may be possible to merge the

package number with the entry’s tag field by limiting the XOR portion of the tag, similar

to the way we limited the XOR portion of the table index in the paging scheme. It is necessary

to evaluate the sensitivity of the predictor to the tag hash function in order to determine

the feasibility of this option, which we leave for future work.


An alternative to packaging evicted entries is to place newly allocated entries into the

packages, whenever the predictor table does not have enough space. The advantage of this

scheme is that it would not be necessary to keep track of the package number of the branch

that initially allocated the entry. We based our decision to package entries upon eviction

on the assumption that newly allocated entries have a higher probability of being useful

in the near future, especially as evicted entries are generally chosen because they have a

useful bit of 0, as explained in Section 2.1.1. With this scheme, we allow the dedicated

predictor table to be kept up-to-date and adapt to the different phases of the program.

Instead, the second-level storage is used to keep track of evicted entries that may be useful

in the future. With a “package on allocate” scheme, the accuracy of the predictor would

be more sensitive to the delay of the second-level table, since it would rely on the second-

level table to keep track of the most recent updates to the predictor table. In the eviction

scheme, in the worst case that no packages can be fetched in time to contribute to the

final prediction, the accuracy of the predictor will be no worse than without packages.

Therefore, although it may be feasible to explore packaging entries upon allocation, the

packaging of evicted entries represented a better initial design point to us.

7.2 Architecture

The architecture of the packaging design is depicted in Figure 7.1. In the rest of this

section, we use the terms ptable and L1 table interchangeably to refer to the dedicated

predictor table to which the packaging scheme is applied. We describe the components of

the design next.

L2 Table Unlike the paging mechanism, the organization of entries in the second-level

table in the packaging scheme differs from that of the first-level table. The L2 table is orga-

nized into packages that are equal in size to the L2 cache line, and each package contains

as many entries as can fit within this amount of storage. Each package entry contains the

L1 table entry fields and the L1 table index of the entry. In TAGE, since the index and tag

[Diagram: the PC and history index the L1 table and the prediction buffer (tag match and chooser logic select the ptable prediction); the fetch engine brings packages from the L2 table into the prediction buffer, and the package generation unit places evicted entries into the packaging buffer.]

Figure 7.1: Packaging scheme architecture

hash functions are computed differently, two entries in different positions on the predictor

table may have the same tag. Therefore, the combination of each entry’s index and tag

is used to uniquely identify each dynamic instance of a branch. The entries in the pack-

ages do not include the package number field from the first-level table, since the package

number is implicit in the package in which an entry resides.

Fetch Engine The fetch engine is used to fetch packages from the second-level table for

the purpose of being used in the prediction mechanism. These packages are placed in the

prediction buffer. For each branch, the dedicated predictor table and the prediction buffer

are accessed in parallel to provide a prediction.

To determine a hit in the prediction buffer we require that both the index and the tag of

the entry match the computed index and tag respectively. In case of a miss in the dedicated

predictor table and a hit in the prediction buffer, the entry in the buffer is used to provide


[Diagram: a branch PC whose low b bits give the offset within an instruction block and whose next n bits give the package number.]

Figure 7.2: Package number determined by n bits of branch PC

the prediction for the ptable component. The final prediction is selected among all the

TAGE tables as before. If the final prediction was provided by the prediction buffer, then

the entry is updated in-place in the prediction buffer.
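The lookup rule above can be sketched as follows; this is an illustrative model rather than the simulator's code, and the structure names and field widths (PackageEntry, PredictionBuffer) are our own assumptions.

    // Illustrative sketch of a prediction-buffer lookup; field widths and names
    // are assumptions, not the thesis's implementation.
    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <vector>

    struct PackageEntry {
        uint16_t l1_index;  // L1-table position the entry occupied when it was allocated
        uint16_t tag;       // partial tag (hash of PC and history, as in TAGE)
        uint8_t  ctr;       // prediction counter
        uint8_t  useful;    // TAGE "useful" field
    };

    struct PredictionBuffer {
        std::vector<PackageEntry> entries;  // the package currently resident in the buffer

        // A hit requires BOTH the computed index and the computed tag to match,
        // because entries at different L1 positions may carry the same tag.
        std::optional<std::size_t> lookup(uint16_t index, uint16_t tag) const {
            for (std::size_t i = 0; i < entries.size(); ++i)
                if (entries[i].l1_index == index && entries[i].tag == tag)
                    return i;
            return std::nullopt;
        }
    };
    // On a ptable miss, a prediction-buffer hit supplies ptable's prediction; if that
    // prediction is the one TAGE finally selects, the matching entry is updated in place.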

Package Generation Unit The package generation unit is in charge of placing evicted

entries in their respective packages. The packaging buffer is used to store the package in

which an evicted entry is placed for the current branch.

As described before, the packages in the packaging and prediction buffers are generally

different. The package in the prediction buffer is determined by the package number of

the current branch, while an evicted entry is placed into a package belonging to the branch

that initially allocated it. We will see in Section 7.3.3 that the swap rate for packages in

the packaging buffer is very small, so that the major portion of the traffic to and from the

L2 comes from the prediction buffer.

Package Number Figure 7.2 shows how the branch address is used to determine the

package number for the branch. Similar to the paging scheme, the instruction address

space is conceptually divided into instruction blocks, where each instruction block is as-

signed its own package on the second-level predictor table. Branches in one instruction

block share a package to store their evicted entries. In the figure, b is log2 of the instruc-

tion block size and n is log2 of the number of packages that reside in the L2. The package

number is determined by bits [b+n− 1 : b] of the branch address.

Since each entry in the L1 is augmented with the package number, this scheme has

an overhead of M × log2(T ) bits of extra dedicated storage, where M is the number of L1

entries, and T is the number of packages in the L2.
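As a concrete sketch of the indexing and the storage overhead just described (the constants below are example values from this chapter, and the helper names are hypothetical):

    // Package number = PC bits [b+n-1 : b]; dedicated overhead = M * log2(T) bits.
    // Constants are illustrative values from this chapter, not fixed design choices.
    #include <cstdint>

    constexpr unsigned kBlockBits   = 5;   // b: log2(32 B instruction block, the chapter's initial value)
    constexpr unsigned kPackageBits = 10;  // n: log2(1K packages in the L2 table)

    constexpr uint32_t package_number(uint64_t branch_pc) {
        return static_cast<uint32_t>((branch_pc >> kBlockBits) &
                                     ((1u << kPackageBits) - 1u));
    }

    // Each of the M L1 entries keeps the package number of the branch that allocated it.
    constexpr unsigned kL1Entries    = 2048;                       // M: 2K-entry ptable
    constexpr unsigned kOverheadBits = kL1Entries * kPackageBits;  // 2048 * 10 = 20480 bits (~20 Kbit)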


7.3 Experimental Results

In this section, we experimentally evaluate the potential benefit of the packaging scheme,

ignoring the effect of the L2 table latency. The experiments are organized to parallel those

for the paging scheme (Chapter 6).

Similar to the experiments presented in Section 6.3 for the paging scheme, we use a

fixed size of 2K entries for all tagged tables. The size of the bimodal table is 16K entries.

In all the experiments presented in this chapter, we use an L2 table with 1K 64 B

packages, for a total storage of 64 KB. The package size is chosen such that it fits in a

single L2 cache line. The L2 size is chosen equal to the L2 size for the paging scheme, to

allow for a fair comparison of the two schemes. The packaging scheme has the advantage

that the L2 represents storage additional to the L1 table. As a result, any L2 size can be

used to provide potential gains. Unlike this scheme, the L2 table has to be larger than the

L1 in the paging scheme, since the L2 is a superset of the L1.

A summary of the rest of the section is as follows:

• In Section 7.3.1, we determine the table that provides the best benefit from the extra

capacity in the form of packages.

• In Section 7.3.2, we show the maximum potential improvement that can be achieved

through virtualization with the packaging scheme in the absence of L2 delay con-

straints.

• In Section 7.3.4, we increase the instruction block size in order to reduce the package

swap rate (traffic to and from the L2 table).

• Finally, in Section 7.3.6, we discuss a prefetching mechanism that can be used to

prefetch packages in order to potentially reduce the effects of L2 delay.

[Plot: average MPKI versus the tagged table chosen as ptable (baseline with no ptable shown first), one curve per N from 1 to 12.]

Figure 7.3: MPKI for TAGE with a table of 1K packages added to a single predictor table

7.3.1 Choosing Ptable

In this section, we determine which table benefits most from the extra capacity provided in

the form of packages. We use a predictor model that allows access to the packages without

any delay. In the next section, we use these results to show the maximum potential gain of

the packaging scheme.

In this model of the predictor, one of the tables is augmented with an L2 table orga-

nized into packages. The packages are placed in the prediction and packaging buffers

without any delay. We refer to this model as the zero-delay packaged design, and use this

model in the rest of this chapter. The accuracy of this predictor represents an upper bound

on the improvement that can be achieved with the packaging scheme.

In this section, we use an instruction block size of 32 B, the same as the initial instruction

block size used in the paging scheme, to allow for a direct comparison of the two designs.

Figure 7.3 shows the average MPKI of the zero-delay packaged design for different val-

ues of N . The x-axis shows which table was augmented with an L2 table, while each line


Number of Tagged Tables (N)    Ptable
1 to 5                         1
6 to 12                        2

Table 7.1: Default Ptable Configuration

Package Size: 22 Entries (~64 B)
Instruction Block Size: 32 Bytes
L2 Size: 1K Packages

Table 7.2: Packaging Scheme Configuration

represents a different N . The average MPKI of the baseline predictor (with no ptable) is

also shown at N = 0 for comparison. This figure is similar to Figure 6.3 used for choos-

ing ltable in the paging scheme. We expect the two figures to show similar results, since

both experiments effectively increase the capacity of a single predictor table, only in two

different ways.

It can be observed from the figure that in all cases, the first few tables benefit the

most from the additional capacity. The best MPKI is roughly achieved with ptable = 1 for

N <= 5, and ptable = 2 for N > 5. These results, which we refer to as the default ptable

configuration, are summarized in Table 7.1. We use the default ptable configuration in the

rest of this chapter.

For the default ptable configuration, each package entry is 23 bits (12 bits for the pre-

dictor table entry, plus 11 bits to store the entry’s index). As a result, each package (64 B)

contains 22 such entries.
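As a quick check on the packing arithmetic:

$$22 \times 23\ \text{bits} = 506\ \text{bits} \approx 63.3\ \text{B} \le 64\ \text{B}, \qquad 23 \times 23\ \text{bits} = 529\ \text{bits} > 512\ \text{bits},$$

so 22 entries are the most that fit in a single 64 B (512-bit) L2 cache line.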

7.3.2 Potential Improvement with the Packaging Scheme

In this section, we use the zero-delay packaged predictor with the default ptable config-

uration (Table 7.1) to show the potential improvement that can be achieved through the

packaging scheme. Table 7.2 summarizes the parameters of the packaged design. We com-

pare the results in this section with the results for the zero-delay paging scheme with the

same page size (64 B) and instruction block size (32 B) (Section 6.3.4).

Figure 7.4 shows the average MPKI of the zero-delay packaged predictor. Also shown

[Plot: average MPKI (left axis) for the baseline TAGE_N,2K and the zero-delay packaged predictor, plus average % improvement over the baseline (right axis), versus N.]

Figure 7.4: Maximum potential improvement with packaging scheme using 64 B packages and a 32 B instruction block size

for comparison is the average MPKI of the baseline predictor (TAGE_N,2K). The dashed plot

shows the percentage improvement in the predictor accuracy over the baseline. The figure

shows that the potential improvement for this scheme is in the range of ~15% to ~48% for

different N . The potential is smaller for larger N . The discontinuity at N = 6 marks the

point where ptable changes from the first tagged table to the second table.

Next, we compare the potential improvement of the packaging scheme (Figure 7.4)

with the potential improvement of the paging scheme in Figure 6.6. The improvement

range for this scheme is between ~15% to ~48%, compared with the range of ~15% to

~41% for the paging scheme. The two schemes achieve similar results for N = 12. How-

ever, the packaging scheme is always better than the paging scheme for other values of N .

The difference between the potential improvement of the two schemes is larger for smaller

N . For example at N = 8, this scheme results in a potential improvement of 20.3%, while

the paging scheme results in an 18.3% potential improvement.

[Plot: package swap frequency per conditional branch versus N for the prediction buffer (constant at 56.79%) and the packaging buffer (0.19% at N = 1, falling to 0.03% at N = 12).]

Figure 7.5: Package swap frequency for prediction and packaging buffers

In the next section, we measure the frequency of package swaps and show that the

swap frequency is larger than the page swap frequency of the paging scheme.

7.3.3 Package Swap Frequency

In this section, we measure the package swap frequency for both the prediction and pack-

aging buffers. The package swap frequency determines the impact of virtualization on the

traffic to the L2 cache. For each structure, we calculate the package swap frequency as the

total number of package fetches over the total number of branch predictions made (the

package swap frequency has a unit of swaps per conditional branch). We assume that the

two structures always fetch different packages. As a result, the total package swap rate is

the sum of the two buffers' swap rates.

In our experiments, we use a victim buffer to store a single evicted package from the

prediction buffer. This method is employed to reduce the prediction buffer package swap

frequency. Package swaps between the victim and prediction buffers are not counted as

part of the structure’s package swap frequency.
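The counting policy can be sketched as follows; the model and names are ours, and it abstracts away timing (one package access per prediction, as in the zero-delay model).

    // Sketch of prediction-buffer swap counting with a single-package victim buffer.
    // Swaps with the victim buffer are not counted toward the L2 traffic.
    #include <cstdint>
    #include <optional>
    #include <utility>

    struct SwapCounter {
        std::optional<uint32_t> resident;  // package in the prediction buffer
        std::optional<uint32_t> victim;    // last package evicted from it
        uint64_t l2_fetches  = 0;          // counted toward the package swap frequency
        uint64_t predictions = 0;

        void access(uint32_t pkg) {
            ++predictions;
            if (resident == pkg) return;            // already resident: no swap
            if (victim == pkg) {                    // present in the victim buffer:
                std::swap(resident, victim);        // swap back without touching the L2
                return;
            }
            victim = std::exchange(resident, pkg);  // miss: fetch the package from the L2
            ++l2_fetches;
        }

        // Package swap frequency, in swaps per conditional branch.
        double frequency() const {
            return predictions ? static_cast<double>(l2_fetches) / predictions : 0.0;
        }
    };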


Figure 7.5 shows the package swap frequency for the packaging and prediction buffers,

as a function ofN . The prediction buffer package swap rate is independent ofN , since it is

determined by the stream of branch addresses (and, consequently, packages) encountered

in the program. Similar to page swap frequency, the prediction buffer package swap fre-

quency is a property of the benchmark itself and is a function of the entropy present in

the branch address bits. In contrast, the packaging buffer swap rate differs for each N , as

it is determined by the pattern of evictions from the L1 table.

The figure shows that the prediction buffer package swap frequency is 56.8%. For an

equivalent configuration, the paging scheme swap frequency is 32.8% (Figure 6.7). The

page swap frequency is lower, since the paging scheme allows multiple pages (64 pages in

our case) to simultaneously reside in the L1 table.

Interestingly, the package swap frequency for the packaging buffer is rather small. For

this buffer, the swap frequency decreases with larger N , with a maximum of 0.19% at

N = 1. Since this number is much smaller relative to the prediction buffer swap rate

(56.8%), we ignore its effects on the L2 traffic. From this point on, we use the prediction

buffer swap rate to measure the impact of virtualization on the L2 traffic, and use the term

package swap frequency to refer to the prediction buffer package swap frequency.

We address the issue of lowering the package swap frequency next.

7.3.4 Instruction Block Size

In the previous section, we showed that for a 32 B instruction block size, a package would

have to be swapped into the prediction buffer more often than every other conditional

branch. In this section, we vary the instruction block size in order to lower the package

swap frequency. We evaluate the effect of varying the instruction block size on package

swap frequency as well as on the predictor accuracy.

The instruction block size is varied between 32 B and 16 KB, in effect having branches

in that region of code share the same package for their evicted entries. The experiments in

this section are similar to those presented in Section 6.3.5 to lower the page swap frequency

[Plot: prediction buffer package swap frequency versus instruction block size (32 B to 16 KB) for Apache, TPC-C, TPC-H, and the average; the average falls from 56.79% at 32 B to 11.6% at 16 KB.]

Figure 7.6: Effect of instruction block size on prediction buffer package swap frequency

for the paging scheme.

Effect of Instruction Block Size on Package Swap Frequency

Figure 7.6 shows the package swap frequency as a function of the instruction block size,

for each benchmark and on average. The figure shows that as the instruction block size

is increased (and more branches share a package), the package swap frequency drops,

but at a slower rate. However, even with a 16 KB instruction block size, the package

swap frequency does not drop below 11%. The paging scheme achieves the same swap

frequency with an instruction block size of 256 B (Figure 6.10).

Next, we examine the effect of the instruction block size on the utilization of the L2

table packages. As the instruction block size is increased and higher portions of the PC

are used for the package number, some package numbers may not be encountered in the

program run. Figure 7.7 shows the percentage of such packages as a function of the in-

struction block size. For the two schemes respectively, the percentage of package and page

numbers not encountered in the program run are the same.

[Plot: fraction of packages never accessed (%) versus instruction block size for Apache, TPC-C, TPC-H, and the average; the average grows from 0% at 32-128 B to 52.08% at 16 KB.]

Figure 7.7: Effect of instruction block size on the fraction of packages never accessed

The figure shows that as instruction block size is increased, the percentage of packages

that are never accessed grows larger. For an instruction block size of 16 KB, more than

half the packages are never accessed in the entire program run.

In conclusion, a larger instruction block size can be used to reduce package swap fre-

quency, but not beyond a certain value. At the same time, larger instruction block sizes

lead to an underutilized L2 table. In Section 6.3.5, we described a method for decreasing

the swap rate with less impact on L2 utilization by incorporating branch history informa-

tion in the page fetch mechanism. The same method can be used for fetching packages.

Effect of Instruction Block Size on MPKI

Figure 7.8 shows the effect of varying the instruction block size on the average MPKI, for

different N . The figure shows that a larger instruction block size leads to a worse MPKI,

and the effect is more pronounced for smaller N . Up to an instruction block size of 256 B,

the effect on MPKI is minor. One reason for this may be that up to this point, all packages

are still in use (Figure 7.7).

[Plot: average MPKI versus instruction block size (32 B to 16 KB), one curve per N from 1 to 12.]

Figure 7.8: Effect of instruction block size on MPKI

Package Size: 22 Entries (~64 B)
Instruction Block Size: 1 KB
L2 Size: 1K Packages

Table 7.3: Final Packaging Scheme Configuration

Based on Figures 7.6 and 7.8, it can be seen that lowering the package swap rate comes

at the cost of an increased MPKI. The first few values of the instruction block size represent

the best tradeoff between the two.

7.3.5 Improvement for Packaging Scheme Before Considering L2 Delay

In this section, we summarize the MPKI results for the packaging scheme before consid-

ering the L2 delay effects. Based on the previous section, we choose an instruction block

size of 1 KB. For this size, the package swap frequency is cut down to a third of its original

value with a 32 B instruction block size (from ~56.8% to ~17.8%). For this instruction

block size, an average of ~6.5% of the 1 K packages are never utilized. Table 7.3 summa-

rizes the parameters for this design.

[Plot: average MPKI (left axis) for the baseline TAGE_N,2K and the packaged predictor with 32 B and 1 KB instruction block sizes, plus average % improvement over the baseline for the two block sizes (right axis), versus N.]

Figure 7.9: Potential improvement for packaging scheme using an instruction block size of 1 KB

Figure 7.9 shows the average MPKI of this predictor for different N . Also shown for

comparison are the average MPKI of the baseline predictor (TAGE_N,2K) and of the predictor with a

32 B instruction block size (Section 7.3.2). The two dashed plots show the percentage

improvement in the accuracy of the predictor with the two different instruction block sizes

over baseline. The difference between the two lines represents the improvement loss due

to using an instruction block size 32 times larger than the original. This instruction block

size removes ~2%-3% of the improvement that can be achieved using a 32 B instruction

block size.

Comparing the packaged design with the baseline predictor, the figure shows the accuracy improvement over the baseline for each N. The improvement is larger for smaller N, where the L2 capacity is comparatively larger than the aggregate dedicated L1 capacity (the aggregate L1 capacity at each N is ~2N KB). The improvement ranges from 45.7%

for N = 1 to 13.9% for N = 12.

[Plot: average % improvement over the baseline versus N for the paged predictor with a 32 B instruction block size and the packaged predictor with a 1 KB instruction block size.]

Figure 7.10: Maximum potential improvement before considering L2 delay for paging and packaging schemes

Figure 7.10 compares the percentage improvement in accuracy for the packaging and

paging schemes for different N . The percentage improvement for the paging scheme has been

reproduced from Figure 6.13. It can be seen in the figure that packaging always achieves

better accuracy improvement than paging. The difference between the two schemes is

generally larger for smaller N . The two schemes perform almost identically for N = 12.

In conclusion, with no L2 delay, the packaging scheme can achieve better accuracy

than the paging scheme, especially for smaller N . We expect the difference between the

two schemes to be greater for larger L1 sizes, since losses in the paging scheme due to

page size constraints are more significant for larger L1 sizes. Compared with the paging

scheme, however, this scheme adds a larger dedicated storage overhead. In addition, it has

a larger swap frequency, causing larger traffic to and from the L2. Due to the higher swap

frequency, it may also be more sensitive to L2 delay. The impact of L2 delay on the scheme

will also depend on the contribution of ptable to final prediction, and specifically the

[Plot: PTB accuracy rate (%) for each SPEC CPU 2000 benchmark (164.gzip through 300.twolf) and the average.]

Figure 7.11: Accuracy of a 1K-entry PTB on the SPEC CPU 2000 benchmark set

contribution of the prediction buffer. The high package swap frequency may be relieved

by allowing more packages in the prediction buffer. Prefetching of packages may also

be helpful to lower the impact of L2 delay. In the next section, we introduce a simple

structure, the Package Target Buffer, which can be used to prefetch packages, and which

may help reduce the sensitivity of the scheme to the L2 delay.

7.3.6 Prefetching

In this section, we discuss a prefetching mechanism that may be useful in decreasing the

sensitivity of the scheme to the L2 delay. We propose a new structure called the Package

Target Buffer (PTB), which keeps track of the targets of packages. It is indexed with the

current package number in order to predict the next package number. The PTB is similar

to the BTB, which is used to predict targets of branches.
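A minimal sketch of such a PTB, matching the 1K-entry, 10-bits-per-entry configuration evaluated below (the class and method names are our own):

    // Package Target Buffer: indexed by the current package number, it records the
    // package number that followed it last time, so the next package can be prefetched.
    #include <array>
    #include <cstdint>

    class PackageTargetBuffer {
        std::array<uint16_t, 1024> next_pkg{};  // 1K entries, 10 significant bits each (1.25 KB)
    public:
        uint16_t predict(uint16_t current_pkg) const {        // guess the next package number
            return next_pkg[current_pkg & 1023u];
        }
        void update(uint16_t current_pkg, uint16_t actual_next) {  // train on the observed transition
            next_pkg[current_pkg & 1023u] = actual_next;
        }
    };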

For our virtualized design (with a total of 1K packages), we evaluate a 1K-entry PTB.

Each PTB entry is 10 bits (log2(1K)) for a total storage of 1.25 KB. Figure 7.11 shows the

accuracy of this PTB on the SPEC CPU 2000 benchmarks. The figure shows that the PTB


has an average accuracy of 70%, achieving almost perfect accuracy for three of the bench-

marks. This motivates the use of the PTB for prefetching of packages.

Since page and package numbers are essentially the same (upper bits of the branch

PC), this prefetching mechanism can also be applied to the paging scheme.

7.4 Summary

In this chapter we presented the packaging scheme, where evicted entries from the first-

level predictor table are grouped into packages that are placed in the second-level table.

Unlike the paging scheme, the packaging scheme does not impose limitations on the num-

ber of L1 predictor table entries that can be accessed by each branch. Instead, it allows

branches to access any entry in the L1 table, while using the second level as additional

storage to retain evicted entries. Before considering the effects of L2 delay, we showed that

the packaging scheme achieves better accuracy than the paging scheme. However, since

only a single package is allowed to be active in the L1 at one time, the packaging scheme

has a higher package swap rate. In addition, for the same L2 size, the packaging scheme

requires a larger dedicated storage overhead than the paging scheme, since it extends each

entry with the package number of the branch that allocated the entry.


Chapter 8

Conclusion

In this chapter we summarize the preceding chapters, comment on limitations, and discuss

future work.

8.1 Summary

Applications with large instruction footprints, such as commercial workloads, expose the

tradeoff among accuracy, latency and hardware cost in branch predictor design. To address

this tradeoff, this work introduces novel virtualization techniques for branch outcome pre-

dictors. Virtualization is applied to a state-of-the-art multi-table branch predictor, the

TAgged GEometric History Length (TAGE) predictor. One of the dedicated predictor tables

is augmented with a large virtual second-level table allocated in the second-level caches,

increasing its perceived capacity. Chapter 5 presented an overview of the two virtualiza-

tion methods proposed in this work: the paging scheme (Chapter 6) and the packaging

scheme (Chapter 7).

8.1.1 Paging Scheme

In the paging scheme, the entries in the predictor table are divided into pages shared by a

group of branch instructions in close proximity in the program code space. Based on the

current portion of the code, the corresponding page is fetched from the second level and

is placed in the first-level table. The first-level table caches the most recently used subset

of the second-level table pages.

Exploring the design space of this scheme, we found that virtualizing shorter history

length tables provides the best benefit in improving prediction accuracy. We evaluated the


effect of restricting branches to a single predictor table page for various page sizes. We

found that smaller page sizes have higher MPKI compared with larger pages. We showed

that the higher MPKI of smaller pages is not due to an increase in aliasing, but because of

a decrease in the frequency of tag matches. In addition, the effect of page size constraints

are smaller on a predictor with more tagged tables. We demonstrated that by sharing a

page among branches in a larger instruction block, the traffic to and from the L2 can be

decreased almost arbitrarily, but at the cost of an increased MPKI. Finally, we found that

virtualizing only a single predictor component allowed more tolerance to the L2 delay, as

the table does not contribute to every prediction.

Experimental results showed that the virtualized design remains effective even for

larger dedicated table sizes, although page size limitations are worse for larger tables.

We found that the relative size of the L2 to L1 has to be large in order to compensate for

losses due to page size constraints.

In our final virtualized design, branches in a 256 B block of code shared a single page

of 32 entries, which fits in a single L2 cache line. We used a second-level virtual table

containing 1K 64 B cache lines, for a total storage of 64 KB. The virtualized design has an

overhead of 704 bits of extra dedicated storage.

For a predictor with moderate table sizes (2K entries on the tagged components and

a bimodal table of 16K entries), virtualization improves accuracy in the range of 10.6%

to 30.5%, depending on the number of dedicated tables. Alternatively, the virtualized

design can be used to reduce the dedicated budget and complexity of the predictor without

affecting accuracy. In this case, depending on the number of dedicated predictor tables,

20% to 50% of the tagged tables can be removed, while achieving the same or better accuracy

than a non-virtualized design.

For a predictor whose size is determined not by area constraints but by the delay to

access its contents, the best accuracy of the baseline predictor is achieved with 8 tagged

tables and can be improved by 8.7% using virtualization. Alternatively, the same accuracy

as the baseline predictor can be achieved using a virtualized design with 5 tagged tables,


removing approximately 25% of the dedicated storage.

8.1.2 Packaging Scheme

In the packaging scheme, evicted entries from the first-level predictor table are grouped

into packages that are placed in the second-level table. Similar to the paging method, a

group of branch instructions spaced closely in the program share a package. Based on the

current portion of the code, the correct package is placed in a buffer co-located with the

first-level predictor table, and its data is used as part of the prediction mechanism.

For the packaging scheme, we found that shorter history length tables benefit the most

from the extra capacity. We showed that increasing the instruction block size can be used

to reduce traffic to and from the L2 table, but only to a certain point.

In our final design, branches in a 1 KB block of code shared a single package of 22

entries, which fits in a single L2 cache line. The design adds an overhead of 20 K bits of

extra dedicated storage. We used a second-level virtual table containing 1K 64 B cache

lines, for a total storage of 64 KB. Before considering the effects of the L2 table delay, the

packaging scheme improves accuracy in the range of 13.9% to 45.7%, depending on the

number of dedicated tables.

8.1.3 Comparison of the Two Schemes

The paging and packaging schemes can be viewed as increasing the perceived length (size)

and width (associativity) of one of the predictor tables, respectively. In both schemes, the

upper bits of the branch address are used to determine pages or packages shared by a

group of branch instructions in close proximity in the program code space.

The paging scheme imposes limitations on the number of L1 predictor table entries

that can be accessed by each branch to within a single page. The packaging scheme re-

moves this limitation by allowing branches to access any entry in the L1 table, while using

the second level as additional storage to retain evicted entries. Before considering the

effects of L2 delay, the packaging scheme can achieve better accuracy than the paging


scheme, especially for a predictor with fewer dedicated tables. However, since only a sin-

gle package is allowed to be active in the L1 at one time, the packaging scheme has a higher package swap rate. Not only does this result in more traffic to the L2, but it may also

increase the sensitivity of the scheme to the L2 delay. Evaluation of the effect of latency

on the packaging scheme was left for future work. In addition, for the same L2 size, the

packaging scheme requires a larger dedicated storage overhead than the paging scheme,

since it extends each entry with the package number of the branch that allocated the entry.

8.2 Limitations and Future Work

In this work, we have used functional simulation to evaluate the accuracy of the proposed

branch predictor designs. This is typical of branch prediction studies. For the paging

scheme, we used instruction count to approximate processor cycles and to emulate delay of

the predictor L2 table. Cycle-accurate simulation would allow us to determine the impact

of our design on the L2 cache traffic, as well as the effect of the improved accuracy on the

overall processor performance. Furthermore, it would allow us to determine predictor L2

table sizes which benefit prediction accuracy without contending for the L2 cache data.

For the packaging scheme, evaluation of the effect of predictor L2 table delay is left for

future work.

To enhance the accuracy of the proposed designs, several optimizations can be ex-

plored. Extending the schemes to multiple predictor tables, incorporating history bits

into page/package numbers (Section 6.3.5), and prefetching of pages/packages (Section

7.3.6) are examples of such enhancements.

Our schemes can also be used to implement traditional hierarchical branch predictors.

The design of these predictors can be explored without the L2 cache line size and delay

constraints. For the paging scheme, we have shown the effects of all parameters without imposing the L2 cache constraints; these results can serve as a basis for evaluating such predictors.


Appendix A

Branch Prediction Schemes

A.1 Bimodal

The behavior of typical branches is far from random. Most branches are either usually

taken or usually not taken. Bimodal branch prediction takes advantage of this bimodal

distribution of branch behaviour and attempts to distinguish usually taken from usually

not-taken branches [3]. Figure A.1 shows a typical organization of the bimodal predictor.

In this scheme, a table of counters is indexed by the low order bits of the branch ad-

dress. For each taken branch, the appropriate counter is incremented. Likewise for each

not-taken branch, the appropriate counter is decremented. Each counter is two bits long

and is saturating. In other words, the counter is not decremented past zero, nor is it incre-

mented beyond three. The most significant bit determines the final prediction. Repeatedly

taken branches will be predicted to be taken, and repeatedly not-taken branches will be

predicted to be not-taken. By using a 2-bit counter, the predictor can tolerate a branch

going an unusual direction one time and keep predicting the usual branch direction. This

is especially useful for loop branches which are taken multiple times and not-taken only

once when exiting the loop. With a 1-bit counter, each loop branch would incur two mispredictions: one on the loop exit and one on the first iteration of the next execution of the loop, whereas with a 2-bit counter only the loop exit is mispredicted.
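
The behaviour described above can be summarized with the following sketch (a simple software model; the table size is an illustrative assumption):

#include <stdint.h>

#define NUM_CTRS 4096u                   /* predictor table size (illustrative)  */
static uint8_t ctr[NUM_CTRS];            /* 2-bit saturating counters, 0..3      */

/* Predict: the most significant bit of the counter gives the direction. */
int bimodal_predict(uint32_t pc) {
    return ctr[pc % NUM_CTRS] >= 2;      /* 2 or 3 -> predict taken */
}

/* Update: saturate at 0 and 3, so a single unusual outcome does not flip a
   strongly biased prediction.                                              */
void bimodal_update(uint32_t pc, int taken) {
    uint8_t *c = &ctr[pc % NUM_CTRS];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}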

A.2 Two-Level Adaptive Predictors

Two-Level Adaptive Branch Prediction [14] uses two levels of branch history information

to make predictions. The first level is the history of the outcome of recently encountered



Figure A.1: Bimodal Predictor [3]

branches. The second level is the branch behavior for the occurrences of the specific pat-

tern of branch outcomes. The two levels of information are maintained in two major data

structures: the branch history register (HR) and the pattern history table (PHT). The his-

tory register is a shift register which shifts in bits representing the outcome of the most

recent branches. The branch predictor may use an array of history registers each tracking

the local history of a specific branch. This array of history registers is known as the branch

history register table (HRT). The history register is used as an index into the PHT to select

a 2-bit saturating counter that will provide the final prediction.

Depending on the number of history registers, three main classes of two-level adaptive predictors are defined. First, in the Per-Address Schemes (PA), an array of history registers, each tracking the outcome of a single branch (local history), is indexed by the branch address. Second, in the Per-Set Schemes (SA), an array of history registers, each tracking the outcomes of a set of branches, is indexed by a portion of the branch address; this scheme uses fewer history registers than the PA schemes. Third, in the Global Schemes (GA), a single global history register tracks the outcomes of all branches encountered.

Each of the three classes can be further divided into sub-schemes according to the

number of pattern history tables used. For example, for the PA scheme three sub-schemes

exist: In the PAg scheme, the HR determined by the address of the branch is used to access


Figure A.2: Two-Level Adaptive Predictor [14]

a single global PHT. In the PAp scheme, each branch has its own branch history register

and PHT. The HR determined by the address of the branch is used to access a PHT also

determined by the address of the branch, specifically keeping track of the patterns of the

branch. Finally, in the PAs scheme, the HR determined by the address of the branch is

used to access a PHT shared by a set of branches, determined by a portion of the branch

address. Figure A.2 shows the organization of the GAg, PAg, and PAp predictors.
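
To make the two levels concrete, a PAg-style lookup and update can be sketched as follows (a software model only; the table sizes and history length are illustrative assumptions):

#include <stdint.h>

#define HRT_SIZE 1024u            /* number of per-branch history registers      */
#define HIST_LEN 10u              /* bits of local history kept per branch       */
#define PHT_SIZE (1u << HIST_LEN)

static uint16_t hrt[HRT_SIZE];    /* first level: branch history register table  */
static uint8_t  pht[PHT_SIZE];    /* second level: shared PHT of 2-bit counters  */

int pag_predict(uint32_t pc) {
    uint16_t hist = hrt[pc % HRT_SIZE] & (PHT_SIZE - 1u);  /* local history pattern */
    return pht[hist] >= 2;                                  /* 2-bit counter decides */
}

void pag_update(uint32_t pc, int taken) {
    uint16_t *hr = &hrt[pc % HRT_SIZE];
    uint8_t  *c  = &pht[*hr & (PHT_SIZE - 1u)];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    *hr = (uint16_t)((*hr << 1) | (taken ? 1u : 0u));       /* shift in the outcome */
}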

The GAg scheme, in which a single global history register indexes a single pattern history table, forms the basis of the Gshare predictor described next.

A.3 Gshare

In the GAg scheme, the index into the PHT is based solely on the outcomes of the most recently encountered branches. In the Gshare predictor, instead, the branch history is XORed with


Figure A.3: Gshare Predictor [3]

a portion of the branch address to form the index used to access the PHT [3]. The XOR operation spreads accesses more evenly across the PHT; without it, accesses would be unevenly distributed because branch histories are not uniformly distributed. In this way, the Gshare predictor increases accuracy by reducing the probability that two different branches interfere with each other in the PHT. Figure A.3 shows the Gshare predictor.
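
A minimal sketch of the Gshare lookup and update follows (the table size and history handling are illustrative assumptions):

#include <stdint.h>

#define GSH_BITS 14u
#define GSH_SIZE (1u << GSH_BITS)

static uint8_t  gsh_pht[GSH_SIZE];   /* 2-bit saturating counters */
static uint32_t ghr;                 /* global history register   */

/* Gshare index: XOR the global history with the low-order PC bits. */
static uint32_t gshare_index(uint32_t pc) {
    return (pc ^ ghr) & (GSH_SIZE - 1u);
}

int gshare_predict(uint32_t pc) {
    return gsh_pht[gshare_index(pc)] >= 2;
}

void gshare_update(uint32_t pc, int taken) {
    uint8_t *c = &gsh_pht[gshare_index(pc)];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    ghr = (ghr << 1) | (taken ? 1u : 0u);   /* fold the new outcome into the history */
}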

A.4 2BC-Gshare

The advantages of different branch prediction schemes can be combined in a hybrid branch

predictor, resulting in better prediction accuracy. One such combination of branch predic-

tors is a bimodal predictor and a gshare predictor, known as 2bc-gshare. In this com-

bination, global information can be used if it is worthwhile, otherwise the usual branch

direction as predicted by the bimodal scheme can be used. This predictor contains the two

bimodal and gshare predictor tables, plus an additional table, known as the meta predic-

tor, which serves to select the best predictor to use. The meta predictor also uses simple

2-bit up/down saturating counters to keep track of which predictor is more accurate for

the branches that share that counter. The final prediction is chosen between the bimodal

and gshare predictions based on the decision of the meta predictor. Figure A.4 shows the

high-level structure of the 2bc-gshare predictor.
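
Reusing the bimodal and Gshare sketches above, the chooser logic can be illustrated as follows (the meta table size and indexing are assumptions; the chooser counter moves toward whichever component was correct, and only when the two components disagree):

#include <stdint.h>

/* Component predictors from the bimodal and Gshare sketches above. */
int  bimodal_predict(uint32_t pc);
int  gshare_predict(uint32_t pc);
void bimodal_update(uint32_t pc, int taken);
void gshare_update(uint32_t pc, int taken);

#define META_SIZE 4096u
static uint8_t meta[META_SIZE];     /* 2-bit chooser counters: >= 2 means "use gshare" */

int combined_predict(uint32_t pc) {
    int use_gshare = meta[pc % META_SIZE] >= 2;
    return use_gshare ? gshare_predict(pc) : bimodal_predict(pc);
}

void combined_update(uint32_t pc, int taken) {   /* taken is 0 or 1 */
    int bim_ok = (bimodal_predict(pc) == taken);
    int gsh_ok = (gshare_predict(pc)  == taken);
    uint8_t *m = &meta[pc % META_SIZE];
    if (gsh_ok && !bim_ok && *m < 3) (*m)++;     /* chooser moves only on disagreement */
    if (bim_ok && !gsh_ok && *m > 0) (*m)--;
    bimodal_update(pc, taken);                   /* both components updated in this sketch */
    gshare_update(pc, taken);
}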


Figure A.4: 2bc-gshare Predictor [3]

A.5 The Skewed Predictor and 2BC-Gskew

The basic principle of the skewed branch predictor [4] is to use three branch predictor

tables each indexed by a different and independent hash function of the same vector of

information, the branch address and global history, in order to reduce the impact of de-

structive aliasing. A prediction is read from each of the predictor tables and a majority

vote is used to select the final prediction. The rationale for using different hashing func-

tions for each table is that two branches that are aliased with each other on one table are

unlikely to be aliased on the other tables. This means that destructive aliasing between the two

branches may occur in one table, but the overall prediction is still likely to be correct.

The 2bc-gskew predictor is a hybrid predictor that combines the skewed predictor with

a bimodal predictor, as shown in Figure A.5. The predictor consists of four tables of 2-bit

saturating counters. The first table is the bimodal predictor, which is also used as the first

table of the skewed predictor. Two more tables are used in the skewed predictor. The

fourth table is the meta predictor, which chooses whether the final prediction is provided

by the bimodal predictor or the skewed predictor. The bimodal predictor’s prediction is

one of the votes in the skewed predictor even if the final prediction is not provided by the

bimodal table.

The predictor uses a partial update method to update the table entries. On a wrong


Figure A.5: 2bc-gskew Predictor [4]

prediction, all predictor tables are updated. On a correct prediction, however, only the

tables that participated in the correct prediction are updated. The reasoning behind the

partial update policy is that a table giving an incorrect prediction probably holds informa-

tion which belongs to a different branch. As a result, this table is not updated. The meta

predictor is updated when the two predictors disagree.
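
The prediction and partial-update flow can be sketched as follows (table sizes are assumptions and the index functions are simple placeholders; the actual design uses carefully constructed skewing functions):

#include <stdint.h>

#define SKEW_SIZE 4096u
static uint8_t bim[SKEW_SIZE], g0[SKEW_SIZE], g1[SKEW_SIZE], met[SKEW_SIZE];  /* 2-bit counters */

/* Placeholder index functions standing in for the skewing hashes. */
static uint32_t h_bim(uint32_t pc)              { return pc % SKEW_SIZE; }
static uint32_t h_g0 (uint32_t pc, uint32_t gh) { return (pc ^ gh) % SKEW_SIZE; }
static uint32_t h_g1 (uint32_t pc, uint32_t gh) { return (pc ^ (gh << 3) ^ (pc >> 5)) % SKEW_SIZE; }

static void bump(uint8_t *c, int taken) {       /* 2-bit saturating update */
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}

int gskew_predict(uint32_t pc, uint32_t gh) {
    int pb  = bim[h_bim(pc)]    >= 2;           /* bimodal vote (also a skewed-predictor vote) */
    int p0  = g0[h_g0(pc, gh)]  >= 2;
    int p1  = g1[h_g1(pc, gh)]  >= 2;
    int maj = (pb + p0 + p1) >= 2;              /* majority vote of the skewed predictor */
    return (met[h_bim(pc)] >= 2) ? maj : pb;    /* meta predictor selects the source     */
}

void gskew_update(uint32_t pc, uint32_t gh, int taken) {   /* taken is 0 or 1 */
    uint8_t *cb = &bim[h_bim(pc)], *c0 = &g0[h_g0(pc, gh)], *c1 = &g1[h_g1(pc, gh)];
    int pb = *cb >= 2, p0 = *c0 >= 2, p1 = *c1 >= 2;
    int maj   = (pb + p0 + p1) >= 2;
    int final = (met[h_bim(pc)] >= 2) ? maj : pb;
    if (final != taken) {                       /* misprediction: update all tables          */
        bump(cb, taken); bump(c0, taken); bump(c1, taken);
    } else {                                    /* correct: strengthen only agreeing tables  */
        if (pb == taken) bump(cb, taken);
        if (p0 == taken) bump(c0, taken);
        if (p1 == taken) bump(c1, taken);
    }
    if (maj != pb)                              /* meta updated only when the two disagree   */
        bump(&met[h_bim(pc)], maj == taken);
}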

A.6 YAGS

The YAGS predictor is shown in Figure A.6. The YAGS predictor uses filtering and de-

interference techniques in order to reduce the impact of destructive aliasing on the pre-

dictor [7]. A bimodal predictor is used to store the bias of a branch, the direction usually

taken by the branch (the filtering mechanism). Instances of the branch that do not comply with

this bias are recorded in two tagged tables called taken and not-taken direction caches.


Figure A.6: YAGS Predictor [7]

The use of small tags (6-8 bits) allows multiple instances of branches to be differentiated

from one another (de-interference). The tags store the least significant bits of the branch

address.

To make a prediction, the bimodal predictor is accessed to provide the default be-

haviour of the branch. If it indicates taken, then the not-taken cache is accessed to check

if this is a special case where the prediction does not agree with the bias. If there is a miss in the direction cache, the default prediction from the bimodal table is used. If there is a hit, then the not-taken cache supplies the final prediction. A similar set of actions is

taken if the bimodal predictor indicates not-taken, but in this case the check is done in the

taken cache.

If the outcome of a branch differs from the prediction of the bimodal predictor, the

corresponding direction cache is updated. A direction cache is also updated if a prediction

from it is used. The bimodal predictor is updated if it correctly predicts a branch or the

direction caches were incorrect.
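
A simplified sketch of the YAGS lookup follows (the tag width, table sizes, and index functions are illustrative assumptions; the update policy is described above):

#include <stdint.h>

#define CHOICE_SIZE 4096u
#define DC_SIZE     1024u
#define TAG_BITS    6u

typedef struct { uint8_t valid, tag, ctr; } dc_entry;      /* direction-cache entry */

static uint8_t  choice[CHOICE_SIZE];                       /* bimodal bias table, 2-bit counters */
static dc_entry t_cache[DC_SIZE], nt_cache[DC_SIZE];       /* taken / not-taken direction caches */

int yags_predict(uint32_t pc, uint32_t gh) {
    uint32_t idx = (pc ^ gh) % DC_SIZE;                    /* direction caches see global history */
    uint8_t  tag = (uint8_t)(pc & ((1u << TAG_BITS) - 1u));/* small partial tag from the PC       */
    int bias_taken = choice[pc % CHOICE_SIZE] >= 2;

    /* Only the cache recording exceptions to the bias is consulted. */
    dc_entry *e = bias_taken ? &nt_cache[idx] : &t_cache[idx];
    if (e->valid && e->tag == tag)
        return e->ctr >= 2;                                /* hit: the exception entry decides */
    return bias_taken;                                     /* miss: fall back to the bias      */
}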


Figure A.7: Perceptron Predictor [24]

A.7 Perceptron

Figure A.7 shows the perceptron predictor [21]. A perceptron is a vector of h + 1 small

integer weights, where h is the history length of the predictor. The global history is kept in

a shift register and the perceptrons are kept in a table analogous to the shift register and

table of counters in traditional global two-level predictors.

To predict a branch, the lower bits of the branch address are used to select a percep-

tron from the table. The output of the perceptron is computed as the dot product of the

perceptron and the history shift register, with the 0 (not-taken) values in the shift regis-

ter interpreted as -1. Added to the dot product is an extra bias weight in the perceptron,

which takes into account the bias of the branch. If the result of the dot product is greater

than or equal to 0, then the branch is predicted taken; otherwise, it is predicted not-taken.

Negative weight values denote inverse correlation between the outcome of the branch and

the corresponding history bit. The magnitude of the weight indicates the strength of the

positive or negative correlation.

At update time, the perceptron is trained on a misprediction or when the magnitude

of the perceptron output is below a specified threshold value. Upon training, both the

bias weight and the h correlating weights are updated. The bias weight is incremented or

decremented if the branch is taken or not taken, respectively. Each correlating weight in

the perceptron is incremented if the predicted branch has the same outcome as the


Figure A.8: GEHL Predictor [25]

corresponding bit in the history register, and decremented otherwise, using saturating arithmetic.

If there is no correlation between the predicted branch and a branch in the history register,

the corresponding weight will tend toward 0. If there is high positive or negative correla-

tion, the weight will have a large magnitude.
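
These steps correspond to the following sketch (the history length, table size, weight range, and training threshold are illustrative choices):

#include <stdint.h>
#include <stdlib.h>

#define HIST   16                        /* history length h                       */
#define N_PERC 512                       /* number of perceptrons in the table     */
#define THETA  (int)(1.93 * HIST + 14)   /* training threshold (a common choice)   */

static int8_t w[N_PERC][HIST + 1];       /* w[i][0] is the bias weight             */
static int    ghist[HIST];               /* recent outcomes as +1 / -1             */

static void sat_add(int8_t *wp, int d) { /* keep weights within [-64, 63]          */
    int v = *wp + d;
    if (v > 63)  v = 63;
    if (v < -64) v = -64;
    *wp = (int8_t)v;
}

static int perceptron_output(uint32_t pc) {
    const int8_t *p = w[pc % N_PERC];
    int y = p[0];                                    /* bias weight                   */
    for (int i = 0; i < HIST; i++)
        y += p[i + 1] * ghist[i];                    /* dot product with the history  */
    return y;
}

int perceptron_predict(uint32_t pc) { return perceptron_output(pc) >= 0; }

void perceptron_train(uint32_t pc, int taken) {      /* taken is 0 or 1 */
    int y = perceptron_output(pc);
    int t = taken ? 1 : -1;
    if ((y >= 0) != taken || abs(y) <= THETA) {      /* mispredicted or not confident   */
        int8_t *p = w[pc % N_PERC];
        sat_add(&p[0], t);                           /* bias follows the branch outcome */
        for (int i = 0; i < HIST; i++)
            sat_add(&p[i + 1], t * ghist[i]);        /* reinforce agreeing history bits */
    }
    for (int i = HIST - 1; i > 0; i--)               /* shift the new outcome in        */
        ghist[i] = ghist[i - 1];
    ghist[0] = t;
}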

A.8 O-GEHL

The main component of the O-GEHL predictor [25] is the GEHL predictor which is com-

posed of several predictor tables indexed through independent functions of branch history

and branch address as illustrated in Figure A.8. The set of global history lengths used for

the predictor tables forms a geometric series, i.e., L(i) = a^(i-1) × L(1). This allows the predic-

tor to capture correlation on very old branches by using long history lengths, while still

devoting most of the predictor resources to short history lengths. Similar to the percep-

tron predictor, the prediction is computed through the addition of the predictions read

from each of the predictor tables. The O-GEHL predictor further optimizes the predictor

through the addition of dynamic history length fitting [38] and dynamic threshold fitting.

In this section, we focus on the structure of the GEHL predictor.

The predictor features M distinct predictor tables, indexed with a hash of global and


path histories and the branch address. The predictor tables store predictions as signed

saturated counters. To compute a prediction, a single counter is read from each predictor

table. The final prediction is computed as the sign of the sum S of the M counters. The branch is predicted taken when S >= 0 and not-taken otherwise.

Distinct history lengths are used for computing the index of the distinct tables. The

first table is indexed using the branch address only (similar to the 2bc-gskew predictor).

The history lengths used for computing the index functions to the other predictor tables

form a geometric series.

The predictor update policy is derived from the perceptron predictor update policy.

The predictor entries are only updated on mispredictions or when the absolute value of

the computed sum S is smaller than a threshold, using saturated arithmetic.
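
The prediction and update logic can be sketched as follows (the number of tables, counter range, history lengths, and index hashes are illustrative placeholders, not the configuration of [25]):

#include <stdint.h>
#include <stdlib.h>

#define M_TABLES  8
#define GEHL_SIZE 2048u
#define THRESH    10                       /* update threshold (dynamically fitted in O-GEHL)  */

static int8_t gehl[M_TABLES][GEHL_SIZE];   /* signed saturating counters                        */
static const int L[M_TABLES] = {0, 2, 4, 6, 10, 16, 25, 40};   /* roughly geometric history lengths */

/* Placeholder hash of the branch address with the L[i] most recent history bits. */
static uint32_t gehl_index(int i, uint32_t pc, uint64_t gh) {
    uint64_t h = (L[i] >= 64) ? gh : (gh & ((1ull << L[i]) - 1ull));
    return (uint32_t)((pc ^ h ^ (h >> 13)) % GEHL_SIZE);
}

int gehl_predict(uint32_t pc, uint64_t gh) {
    int S = 0;
    for (int i = 0; i < M_TABLES; i++)
        S += gehl[i][gehl_index(i, pc, gh)];          /* adder tree over the M counters */
    return S >= 0;                                    /* sign of S is the prediction    */
}

void gehl_update(uint32_t pc, uint64_t gh, int taken) {   /* taken is 0 or 1 */
    int S = 0;
    uint32_t idx[M_TABLES];
    for (int i = 0; i < M_TABLES; i++) {
        idx[i] = gehl_index(i, pc, gh);
        S += gehl[i][idx[i]];
    }
    if (((S >= 0) != taken) || abs(S) < THRESH) {     /* mispredicted or low confidence */
        for (int i = 0; i < M_TABLES; i++) {
            int8_t *c = &gehl[i][idx[i]];
            if (taken && *c < 15)   (*c)++;           /* saturating signed counters     */
            if (!taken && *c > -16) (*c)--;
        }
    }
}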


References

[1] Agner Fog. The microarchitecture of Intel, AMD, and VIA CPUs. http://www.agner.org/optimize/microarchitecture.pdf.

[2] Stuart Sechrest, Chih-Chieh Lee, and Trevor Mudge. Correlation and aliasing in dy-

namic branch predictors. In Proceedings of the 23rd annual international symposium on

Computer architecture, ISCA ’96, pages 22–32, New York, NY, USA, 1996. ACM.

[3] Scott McFarling. Combining branch predictors. Technical report, TN-36m, Digital

Western Research Laboratory, June 1993.

[4] Andre Seznec and Pierre Michaud. De-aliased hybrid branch predictors. Technical

report, Technical Report RR-3618, February 1999.

[5] Chih-Chieh Lee, I-Cheng K. Chen, and Trevor N. Mudge. The bi-mode branch pre-

dictor. In Proceedings of the 30th annual ACM/IEEE international symposium on Mi-

croarchitecture, MICRO 30, pages 4–13, Washington, DC, USA, 1997. IEEE Computer

Society.

[6] Eric Sprangle, Robert S. Chappell, Mitch Alsup, and Yale N. Patt. The agree predictor:

a mechanism for reducing negative branch history interference. In Proceedings of the

24th annual international symposium on Computer architecture, ISCA ’97, pages 284–

291, New York, NY, USA, 1997. ACM.

[7] A. N. Eden and T. Mudge. The yags branch prediction scheme. In Proceedings of

the 31st annual ACM/IEEE international symposium on Microarchitecture, MICRO 31,

pages 69–77, Los Alamitos, CA, USA, 1998. IEEE Computer Society Press.


[8] Daniel A. Jiménez, Stephen W. Keckler, and Calvin Lin. The impact of delay on the

design of branch predictors. In Proceedings of the 33rd annual ACM/IEEE international

symposium on Microarchitecture, MICRO 33, pages 67–76, New York, NY, USA, 2000.

ACM.

[9] Ioana Burcea, Stephen Somogyi, Andreas Moshovos, and Babak Falsafi. Predictor vir-

tualization. In Proceedings of the 13th international conference on Architectural support

for programming languages and operating systems, ASPLOS XIII, pages 157–167, New

York, NY, USA, 2008. ACM.

[10] Ioana Burcea and Andreas Moshovos. Phantom-btb: a virtualized branch target

buffer design. In Proceeding of the 14th international conference on Architectural sup-

port for programming languages and operating systems, ASPLOS ’09, pages 313–324,

New York, NY, USA, 2009. ACM.

[11] Andre Seznec. The l-tage branch predictor. Journal of Instruction Level Parallelism, 9,

April 2007.

[12] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative

Approach. Morgan Kaufmann, fourth edition, 2007.

[13] James E. Smith. A study of branch prediction strategies. In Proceedings of the 8th

annual symposium on Computer Architecture, ISCA ’81, pages 135–148, Los Alamitos,

CA, USA, 1981. IEEE Computer Society Press.

[14] Tse-Yu Yeh and Yale N. Patt. Two-level adaptive training branch prediction. In Pro-

ceedings of the 24th annual international symposium on Microarchitecture, MICRO 24,

pages 51–61, New York, NY, USA, 1991. ACM.

[15] Tse-Yu Yeh and Yale N. Patt. Alternative implementations of two-level adaptive

branch prediction. In Proceedings of the 19th annual international symposium on Com-

puter architecture, ISCA ’92, pages 124–134, New York, NY, USA, 1992. ACM.


[16] Pierre Michaud, André Seznec, and Richard Uhlig. Trading conflict and capacity

aliasing in conditional branch predictors. In Proceedings of the 24th annual interna-

tional symposium on Computer architecture, ISCA ’97, pages 292–303, New York, NY,

USA, 1997. ACM.

[17] K. Diefendorff. K7 challenges Intel. Microprocessor Report, 12, October 1998.

[18] R. E. Kessler. The alpha 21264 microprocessor. IEEE Micro, 19:24–36, March 1999.

[19] André Seznec, Stephen Felix, Venkata Krishnan, and Yiannakis Sazeides. Design

tradeoffs for the alpha ev8 conditional branch predictor. In Proceedings of the 29th

annual international symposium on Computer architecture, ISCA ’02, pages 295–306,

Washington, DC, USA, 2002. IEEE Computer Society.

[20] V. Uzelac and A. Milenkovic. Experiment flows and microbenchmarks for reverse

engineering of branch predictor structures. In Performance Analysis of Systems and

Software, 2009. ISPASS 2009. IEEE International Symposium on, pages 207–217, April 2009.

[21] D.A. Jimenez and C. Lin. Dynamic branch prediction with perceptrons. In High-

Performance Computer Architecture, 2001. HPCA. The Seventh International Symposium

on, pages 197–206, 2001.

[22] Daniel A. Jiménez. Fast path-based neural branch prediction. In Proceedings of

the 36th annual IEEE/ACM International Symposium on Microarchitecture, MICRO 36,

pages 243–, Washington, DC, USA, 2003. IEEE Computer Society.

[23] David Tarjan and Kevin Skadron. Merging path and gshare indexing in perceptron

branch prediction. ACM Trans. Archit. Code Optim., 2:280–300, September 2005.

[24] Renee St. Amant, Daniel A. Jimenez, and Doug Burger. Low-power, high-

performance analog neural branch prediction. In Proceedings of the 41st annual

IEEE/ACM International Symposium on Microarchitecture, MICRO 41, pages 447–458,

Washington, DC, USA, 2008. IEEE Computer Society.


[25] Andre Seznec. Analysis of the o-geometric history length branch predictor. In Pro-

ceedings of the 32nd annual international symposium on Computer Architecture, ISCA

’05, pages 394–405, Washington, DC, USA, 2005. IEEE Computer Society.

[26] Pierre Michaud. A ppm-like tag-based predictor. Journal of Instruction Level Paral-

lelism, 7, April 2005.

[27] Andre Seznec. A 64-kb ISL-TAGE branch predictor. http://www.jilp.org/jwac-2/program/cbp3_03_seznec.pdf.

[28] Muawya Al-Otoom, Elliott Forbes, and Eric Rotenberg. Exact: explicit dynamic-

branch prediction with active updates. In Proceedings of the 7th ACM international

conference on Computing frontiers, CF ’10, pages 165–176, New York, NY, USA, 2010.

ACM.

[29] André Seznec, Stéphan Jourdan, Pascal Sainrat, and Pierre Michaud. Multiple-block

ahead branch predictors. In Proceedings of the seventh international conference on Archi-

tectural support for programming languages and operating systems, ASPLOS-VII, pages

116–127, New York, NY, USA, 1996. ACM.

[30] Karel Driesen and Urs Holzle. The cascaded predictor: economical and adaptive

branch target prediction. In Proceedings of the 31st annual ACM/IEEE international

symposium on Microarchitecture, MICRO 31, pages 249–258, Los Alamitos, CA, USA,

1998. IEEE Computer Society Press.

[31] The 2nd JILP championship branch prediction competition (CBP-2).

http://cava.cs.utsa.edu/camino/cbp2/.

[32] Nikolaos Hardavellas, Stephen Somogyi, Thomas F. Wenisch, Roland E. Wunderlich,

Shelley Chen, Jangwoo Kim, Babak Falsafi, James C. Hoe, and Andreas G. Nowatzyk.

Simflex: a fast, accurate, flexible full-system simulation framework for performance

evaluation of server architecture. SIGMETRICS Perform. Eval. Rev., 31:31–34, March

2004.


[33] Doug Burger and Todd M. Austin. The simplescalar tool set, version 2.0. SIGARCH

Comput. Archit. News, 25:13–25, June 1997.

[34] The SPEC CPU 2000 benchmark suite. http://www.spec.org/cpu2000/.

[35] Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. Cacti 6.0:

A tool to model large caches. Technical report, HP Laboratories, April 2009.

[36] Intel Core i7-2600K processor. http://ark.intel.com/Product.aspx?id=52214processor=i7-2600Kspec-codes=SR00C.

[37] The IBM POWER7 processor. http://en.wikipedia.org/wiki/POWER7.

[38] Toni Juan, Sanji Sanjeevan, and Juan J. Navarro. Dynamic history-length fitting: a

third level of adaptivity for branch prediction. In Proceedings of the 25th annual inter-

national symposium on Computer architecture, ISCA ’98, pages 155–166, Washington,

DC, USA, 1998. IEEE Computer Society.
