1
ECE 587 Advanced Computer Architecture I
Chapter 7: Branch Prediction
Herbert G. Mayer, PSU
Status 7/21/2015
2
Motivation
If only we could predict the future, computation would be swift and accurate
In such an imaginary world with clairvoyance, the effect of stalls created by branches in a pipelined architecture could be eliminated: as soon as a branch is decoded, the pipeline could be primed again with the right instruction stream from the new location, the destination of the predicted branch
Unfortunately, we generally don’t know whether a conditional branch is taken until the condition is completely evaluated
We also don’t know the destination of a taken branch, conditional or not, until that address has been computed. The same holds for other flow-control instructions, such as calls, returns, exceptions, etc.
3
Motivation
We can guess the outcome of a conditional branch and the destination address of any branch; we can also guess wrong
We cannot predict with certainty! To help us guess the right branch destination, we could remember the branch destination from the last time this branch transferred control, and then predict that this time the destination just might be the same
If this would help us guess right most of the time, we would get some advantage. Guessing always right would be nicer, but we are, after all, mere mortals
Branch prediction strategies intend to learn from the past to guess future behavior correctly most of the time. In practice one can reach > 97% accurate prediction, which helps in a pipelined architecture
4
Motivation
In fact, to take advantage of pipelined execution, very high prediction accuracy is mandatory
Each time a prediction is wrong, the pipe has to be flushed: multiple arithmetic units hold invalid operands that are not needed, and thus all HW speed-up methods are in vain
The deeper the pipes on a pipelined architecture, the more stringent are the accuracy requirements for a branch prediction scheme
On an Intel Cedar Mill (2006) processor, like Prescott, the pipeline is over 2 dozen stages deep; on average about 5-6 branches are in flight inside that complex, so without branch prediction the pipeline would practically never reach the steady state
5
Syllabus
Definitions
Introduction
What’s Bad About Branches?
Static Branch Prediction
Dynamic Branch Prediction
A Two-Level Dynamic Prediction Scheme
Yeh and Patt Nomenclature
Prediction Accuracies for SPECint92
Bibliography
6
Some Definitions
7
Definitions: BHT, Acronym for Branch History Table
The Branch History Table (BHT) is the collection of Branch History Registers (HR), used in Single-Level or Two-Level dynamic branch prediction
There could be a.) one HR per conditional branch, b.) one HR each for the last n > 1 branches, or c.) just a single HR for all conditional branches
Choice a.) is the most accurate, yet its HW cost can be excessive. Choice c.), while being the least accurate, also costs the least in terms of HW
Often architects must select a compromise
This trade-off of resource cost vs. accuracy is akin to the mapping policy employed in cache design
8
Definitions: BHT, Cont’d
On actual branch prediction HW, just the last few branches executed have their associated HR, otherwise too much HW –silicon space– for the BHT would be consumed
Each HR records for the last k executions of its associated conditional branch whether that branch was taken
In a Two-Level dynamic branch prediction scheme, the HR has an associated Pattern Table (PT), indexed by the HR
The entry in the PT guesses, whether the next branch will be taken. The cost in bits can be contained, because not all branches need to have an associated HR
9
Definitions: Branch Prediction
Heuristic that guesses –based on past branching history– the destination of the current branch, the Boolean outcome of the next condition for a branch, or both, as soon as a branch instruction is being decoded
100% accurate prediction of a branch is, of course, not possible; neither for the condition nor for the target
Heuristics aim at guessing right most of the time
For highly pipelined and superscalar architectures “most of the time” has to mean 97% or more
10
Definitions: Branch Profiling
Compile a program with a special compiler directive.
Then measure at run-time, for each conditional branch, how many times each branch was taken
Next time this same program is compiled, the measured results of the prior run are available to the compiler. That info enables a compiler to bias conditional branches according to past behavior
Underlying this scheme is the assumption that past behavior is a reflection of the future. Branch profiling is one of the static branch prediction schemes. It costs one additional profiling execution, plus instruction bits in the opcode for the compiler to set the branch bias one way or the other
Generally, static prediction, even with the benefit of a profiling run, is not sufficiently effective
11
Definitions: BTAC, Branch Target Address Cache
For very fast performance, it is not sufficient to know (i.e. guess) ahead of time, whether a conditional branch will be taken
For any branch –including unconditional– the branch destination should be known a priori
For this reason, each branch in a BTAC implementation has an associated target address, used by the instruction fetch unit to continue filling the pipeline from places other than the next one
After complete decoding of an instruction, the target is also computed. But knowing the target earlier speeds up filling a potentially stalled pipeline
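The BTAC idea above can be modeled in a few lines of software. The following Python sketch is purely illustrative: all names, sizes, and addresses are invented, and a real BTAC is hardware with tags and limited associativity.

```python
# Toy model of a Branch Target Address Cache (BTAC): a small direct-mapped
# table, indexed by low-order bits of the branch address, that remembers
# the last known target of each branch.

class BTAC:
    def __init__(self, entries=8):
        self.entries = entries
        self.table = [None] * entries  # each entry: (tag, target) or None

    def lookup(self, branch_pc):
        """Return the predicted target address, or None on a miss."""
        entry = self.table[branch_pc % self.entries]
        if entry is not None and entry[0] == branch_pc:
            return entry[1]
        return None

    def update(self, branch_pc, target):
        """After the real target is computed, remember it for next time."""
        self.table[branch_pc % self.entries] = (branch_pc, target)

btac = BTAC()
assert btac.lookup(0x40) is None    # first encounter: miss, must wait
btac.update(0x40, 0x100)
assert btac.lookup(0x40) == 0x100   # next time: target known at fetch time
```

The fetch unit would consult `lookup` while the branch is still being decoded; on a hit it can keep filling the pipeline from the remembered target.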
12
Definitions: BTB, Branch Target Buffer
For very fast performance, it is best to know ahead of time whether or not a conditional branch will be taken, and where such a branch leads
The former can be implemented using a BHT with Pattern Table; the latter can be implemented using a BTAC
The combination of these two is called the BTB
This scheme is implemented on Intel Pentium Pro® and newer Intel architectures
13
Definitions: BTFN, Backwards Taken Forward Not
A static prediction heuristic assuming that program execution time is dominated by loops, especially While Loops
While loops are characterized by an unconditional branch at the end of the loop body back to the condition, and a conditional branch if false at the start, leading to the successor of the loop body
The backward branch is always taken, and to the same destination; the forward branch, if the condition is false, is taken just once
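The entire BTFN heuristic amounts to one address comparison. A minimal Python sketch (function and parameter names are illustrative, not from the slides):

```python
# BTFN (Backwards Taken, Forward Not) static prediction heuristic:
# backward branches close loops and are predicted taken; forward
# branches skip past code and are predicted not taken.

def btfn_predict(branch_pc, target_pc):
    """Predict taken for backward branches, not taken for forward ones."""
    return target_pc < branch_pc

# Backward branch at the end of a loop body: predicted taken
assert btfn_predict(branch_pc=0x200, target_pc=0x180) is True
# Forward branch past the loop body: predicted not taken
assert btfn_predict(branch_pc=0x180, target_pc=0x210) is False
```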
14
Definitions: BTFN, Backwards Taken Forward Not
Since While Statements are often executed repeatedly, the BTFN heuristic guesses correctly the majority of the time
This method has the inherent limitations of static schemes
Also, many optimizers re-arrange the object code for While Statements in a way that the condition is moved to the end, obscuring this whole scheme
Exercise for students: how can while-loop code with a conditional branch at the top plus an unconditional branch back at the end be converted to a single conditional branch at the end? Hint: there will be an initial, fixed-cost overhead!
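One possible shape of the answer, sketched here in Python rather than in branch instructions (a hedged illustration, not the only solution):

```python
# Original shape: conditional forward branch at the top of the loop,
# unconditional branch back at the end -- two branches per iteration.
def while_top_test(cond, body, state):
    while cond(state):          # conditional branch at the top
        state = body(state)     # unconditional branch back at the end
    return state

# Rotated shape: one initial test (the fixed-cost overhead from the hint),
# then a single conditional backward branch at the end of the loop body.
def while_bottom_test(cond, body, state):
    if cond(state):             # paid once, before entering the loop
        while True:
            state = body(state)
            if not cond(state):  # the single conditional branch at the end
                break
    return state

cond = lambda n: n < 5
body = lambda n: n + 1
assert while_top_test(cond, body, 0) == while_bottom_test(cond, body, 0) == 5
assert while_top_test(cond, body, 9) == while_bottom_test(cond, body, 9) == 9
```

The rotated form executes only one (backward, usually taken) branch per iteration, which is exactly what BTFN predicts well.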
15
Definitions: Delay of Transfer, Delay Transfer Slot
Certain pipelined CPUs execute another instruction before the current unconditional branch
The instruction executed in that slot is the one physically located at the target of the branch
The reason is to greedily recover some of the time lost to the pipeline stall. Thus, compilers or programmers can place the target instruction of the branch physically after the branch: placed after the branch, executed before the branch completes, never reached normally
Since it is supposed to be executed anyway, as soon as a branch has reached its target, and since the HW already executes it before completing the branch, time is saved
16
Definitions: Delay of Transfer, Delay Transfer Slot
Note that at the target of such an unconditional branch the relocated instruction must be skipped; that enables the time saving!
Example: Intel i860 architecture: When a suitable candidate cannot be found, a NOP instruction is placed physically after the branch, i.e. into the delay slot
Done also on Sun SPARC architecture
There are restrictions: for example, branch instructions and other control-transfer instructions cannot be placed into the delay slot. If that would happen, a phenomenon called code visiting would occur, with unpredictable side-effects at times; hence the restriction
17
Definitions: Dynamic Branch Prediction
Branch prediction policy that changes dynamically with the execution of the program
Dynamic branch prediction is architecture transparent, i.e. no bits are visible in the opcode
Different from some static branch prediction methods, which have suitable bits in their opcode
Antonym: Static Branch Prediction
We focus on dynamic branch prediction here
18
Definitions: History Register (HR)
k-bit shift register, associated with a conditional branch
The bits indicate for each of the last k executions of that associated conditional branch, whether it was taken, 1 saying yes
The newest bit shifts out the oldest, since a HR has only some limited, fixed length k of bits available
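The HR behavior just described can be modeled in a few lines of Python; this is a sketch only (a real HR is k flip-flops, and the class name is invented):

```python
class HistoryRegister:
    """k-bit shift register: 1 = branch taken, 0 = not taken."""
    def __init__(self, k=4):
        self.k = k
        self.bits = 0

    def record(self, taken):
        # Shift in the newest outcome; the oldest bit falls off the top.
        self.bits = ((self.bits << 1) | int(taken)) & ((1 << self.k) - 1)

    def value(self):
        return self.bits  # usable directly as an index into a Pattern Table

hr = HistoryRegister(k=4)
for outcome in [True, True, False, True]:
    hr.record(outcome)
assert hr.value() == 0b1101   # last four outcomes: taken, taken, not, taken
hr.record(True)
assert hr.value() == 0b1011   # the oldest bit (the first 'taken') shifted out
```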
19
Definitions: Interference, Branch Interference
When multiple branches are associated with one HW data structure (such as an HR or PT) the behavior of each branch will influence the data structure’s state
However, the stored data will be used to predict the next branch, even if that branch is not the one that most recently modified the state
The reason for doing this is limited HW availability, i.e. saving HW cost (silicon space)
The effect is diminished precision
20
Definitions
IPC
Instructions per cycle: A measure for Instruction Level Parallelism
IPC quantifies how many different instructions are being executed –not necessarily all to completion– during one single cycle
It is desirable to have an IPC rate > 1
Given sufficient parallelism, IPC can be >> 1
On conventional UP CISC architectures it is typical to have IPC << 1
21
Definitions
Mispredicted Branch, AKA Miss
The branch condition or branch destination was predicted incorrectly
As a consequence, the control of execution took a different flow than predicted
This requires dynamic correction at run time and costs time
The cost often is a stalled pipeline that has to be flushed and re-loaded
22
Definitions: Mispredicted Branch Penalty
Number of cycles lost, due to having incorrectly guessed the change in flow of control, caused by a branch instruction
Since prediction accuracy is never 100%, there will always be some Mispredicted Branch Penalty
Goal is to keep the number of mispredictions well below 3% of all branches executed
23
Definitions: Pattern Table (PT)
A HW table of entries, each specifying whether its associated conditional branch will be taken
An entry in the PT is selected by using the history bits of a branch History Register (HR)
This can be done by indexing, in which case the number of entries in the PT is 2^k, with k being the number of bits stored in the History Register
Otherwise, if the number of entries is < 2^k, a hashing scheme is applied, causing interference!
Each PT entry holds boolean information about the next conditional branch: will it be taken or not?
24
Definitions
Pipelining
Mode of execution, in which one instruction is initiated every cycle and ideally one retires every cycle, even though each requires multiple (possibly many) cycles to complete
Highly pipelined Xeon processors, for example, have a > 20-stage pipeline
25
Definitions: Saturating Counter
HW n-bit unsigned integer counter, n typically being 2 .. 16 for branch prediction HW
When all bits are on and counting up continues, a saturating counter simply stays at the maximum value
Similarly, when all bits are off and counting down continues, the saturating counter stays at 0
Creates a limited hysteresis effect on the behavior of the specific event that depends on this counter
Architecture challenge: select a history length (n bits) such that the cost is low and the accuracy sufficient to support the overall goal of > 97%
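A minimal Python model of the saturating counter described above (illustrative only; HW implements this as n flip-flops plus increment/decrement logic, and the class name is invented):

```python
class SaturatingCounter:
    """n-bit unsigned saturating counter; n=2 is typical for branch prediction."""
    def __init__(self, n=2):
        self.max = (1 << n) - 1
        self.value = 0

    def up(self):
        self.value = min(self.value + 1, self.max)   # sticks at the maximum

    def down(self):
        self.value = max(self.value - 1, 0)          # sticks at zero

    def predict_taken(self):
        # Upper half of the range predicts taken. This is the hysteresis:
        # one wrong outcome does not flip a strongly-held prediction.
        return self.value > self.max // 2

c = SaturatingCounter(n=2)
c.up(); c.up(); c.up(); c.up()
assert c.value == 3                  # saturates at 3, does not wrap to 0
assert c.predict_taken() is True
c.down()
assert c.predict_taken() is True     # one not-taken outcome: still predicts taken
c.down()
assert c.predict_taken() is False
```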
26
Definitions: Shift Register
HW register with small number of bits, tracking a binary event
If the event did occur, a 1 bit is shifted into the register at one end. This will be the newest bit
The oldest bit is shifted out at the opposite end
Conversely, if the event did NOT occur, a 0 bit is shifted in, and the oldest bit is shifted out
All other bits shift their bit position by one place
At any moment the shift register holds a history of the associated event’s last n occurrences
27
Definitions
Static Branch Prediction
A branch prediction policy that is embedded in the binary code –ISA visible
Or implemented in the hardware executing the branches –not ISA visible
The policy does not change during execution of the program, even if known to be wrong all the time
In the latter case, execution would be better off without branch prediction
28
Definitions: Static Branch Prediction
BTFN heuristic is a static branch prediction policy
Requires no opcode bits, hence is NOT ISA visible
HW compares the destination of a branch with the conditional branch’s own address. Destinations smaller lead backwards and are assumed taken
Destination addresses larger than the branch address are assumed not taken, and the next instruction predicted is the successor of the conditional branch
Typical industry benchmarks (SPECint89) achieve almost 65% correct prediction with this simple scheme
29
Definitions: Two-Level Branch Prediction
Instead of solely associating a local branch history register with a conditional branch, a two-level branch prediction scheme associates history bits (pattern table) with branch execution history
Thus, each pattern of past branch behaviors has its own future prediction, costing more HW, but yielding better accuracy
For example, each conditional branch may have a k-bit Branch History Register, which records for each of the last k executions, whether or not the condition was satisfied
And each history pattern has an associated prediction of the future in another data structure; typically implemented as 2-bit saturating counter
30
Definitions: Wide Issue
Older architectures issue (i.e. fetch, decode, etc.) one instruction at a time; for example, 1 instruction per clock cycle on a RISC architecture
Computers after 1980 issue more than 1 instruction at a time; this is called a wide issue
Synonym: super-scalar architecture
More precisely, superscalar architectures require wide issue I-fetches
Antonym: Single-issue
31
Introduction
Execution on a highly pipelined and wide-issue architecture suffers severe degradation whenever an instruction disrupts the prefetched flow of operations, which are in various stages of partial completion
Typically, control-transfer instructions cause pipeline hazards
The higher the degree of pipelining, the more partially executed (fetched, decoded, operand-fetched, etc.) instructions must be discarded
The pipeline must be flushed and then primed again, i.e. be filled again with other, soon partially executed instructions
32
Introduction
However, more than one in five operations is a control-flow instruction –e.g. branch, call, return, conditional branch, exit, abort, exception, etc.
This almost invalidates the architectural advantage of pipelining
If it were possible to predict a condition, and if the machine could predict the destination of a branch before computing it from the instruction stream, then as soon as any branch is fetched, the pipe could be filled correctly; stalls would be avoided
Next follow statistics about branch prediction accuracy for a common benchmark on widely used processors
33
Introduction
Intel Core Duo two-level dynamic branch prediction vs. AMD K8 shows the benefit of Intel’s branch prediction investment; see [13]
34
What’s Bad About Branches?
Performance Penalties, Delay, Disturbance:
Disruption of sequential control flow, hence the anticipated flow of the pipeline is disturbed
The higher the number of pipeline stages, the greater the penalty. Another case in point: deep pipelining is a liability, not pure goodness!
Branches cause I-cache disturbance due to some new address range
Conditional branch must determine the future direction: fall-through or to the new target?
Unconditional branches must determine the new instruction’s target
35
What’s Bad About Branches?
Determine Branch Direction:
Cannot immediately fetch subsequent instruction, since it is not known
Remedy: if possible, move the instructions that compute the branch condition away from the branch, so that waiting for the condition is minimized
Or make use of penalty, see Branch Delay Slot
Bias the case toward NOT taken, or vice versa; done in some static prediction schemes
36
What’s Bad About Branches?
Determine Branch Direction, Cont’d:
Fill delay slot with useful instruction (Intel 860 processor)
This HW trick is being used less and less in the 2000s; often ends up being a noop anyway
Execute both paths speculatively. Once the condition is known, kill the superfluous path. Requires more HW, and can cause an explosion of HW when branching on to further branches; done successfully on the Itanium Processor Family (IPF), at high HW cost
Or predict branch direction, discussed here!
Determine Branch Target: Must know target address, to fetch next; for that, use prediction
37
What’s Bad About Branches?
A Saturating Counter prediction algorithm with just 2 bits reaches 80% accuracy. An awesome policy: 2 data bits plus logic suffice for remarkable accuracy! Even a single global counter for all branches works, despite interference!
Two-Bit Saturating Counter, Taken vs. Not Taken
38
Static Branch Prediction
Common to static branch prediction: small cost in extra hardware and cache
Achieves ~70% accuracy in prediction, though cheap!
Is generally insufficient for highly pipelined or for multi-way, superscalar architectures
Typical static prediction schemes are:
condition not taken: assumes the conditional branch is not taken; the pipeline continues to be filled with instructions physically after the conditional branch; example: early Intel® 486; but this proved to be correct only a little over 40% of the time; hence it would have been better to abstain from this prediction
39
Static Branch Prediction
condition taken: assumes conditional branches are taken; pipeline continues to be filled with instructions at destination of conditional branch
correct about 60% of the time; can be advantageous for low degrees of pipelining
BTFN: assumes execution is dominated by while loops; true in some code
un-optimized while-loops have a conditional branch around the loop body, directed forward to the first instruction after the loop body
un-optimized while-loops then use an unconditional branch back to the beginning of the loop body; hence BTFN prediction; accurate up to ~65%
40
Static Branch Prediction
Single-bit bias, no profile: provide conditional instructions with a bit in the opcode, indicating whether the condition is likely true; a clue for the HW
The compiler can analyze source code and make reasonable guesses about a condition’s outcome; this is encoded in the extra bit; reaches ~70% accuracy
For example, exceptions and assertions are almost never taken; compiler generates clue in object code
Single-bit bias, with profiling: run the program, initially compiled without a profile in the bias bit. Then, for all conditional branches, count the number of times the condition was true during the run; use the count to set the bias bit; achieves ~75% accuracy
Note: Trace Scheduling similarity! There the penalty for wrong prediction is correction code
41
Dynamic Branch Prediction
One prediction bit per I-cache line: this scheme encodes no information in the instruction stream, i.e. no information is assembled into the conditional branch instruction
Instead, each cache line holding a set of x instructions in the I-cache has an associated prediction bit
If set, bit predicts that next executed conditional branch in this I-cache line will be taken
Problem: There may be no conditional branch in the line at all, thus wasting the bit in the cache
More serious for performance, there may be multiple conditional branches, causing interference among the predictions of their respective conditions
42
Dynamic Branch Prediction
Advantage: low cost, on the order of 1% of cache area; amazingly, reaches up to 80% accuracy
2 prediction bits per I-cache line: similar to above, but uses 2-bit saturating counter to predict next branch; can achieve additional accuracy; a single wrong guess does not disrupt the scheme; yet suffers similarly from waste & interference
Branch History Table (BHT): use a history bit, a saturating 2-bit counter, or a longer shift register for each represented branch
Contain cost of history cache area: allot entries only for the last k different branch instructions executed; advantage: increases accuracy to 85%; implemented in Pentium ®. Total cache size is significantly smaller than possible # of branches in program; so evictions will occur, like in regular data cache
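The limited-size BHT and its aliasing effect can be illustrated with a toy direct-mapped table of 2-bit counters; every size, name, and address below is invented for the example:

```python
# Toy direct-mapped BHT: one 2-bit saturating counter per entry, indexed
# by low-order bits of the branch address. With far fewer entries than
# branches, distinct branches alias to the same counter -- the
# eviction/interference effect described above.

SIZE = 16  # far smaller than the number of branches in a real program
bht = [0] * SIZE  # each entry: 2-bit counter, 0..3

def predict(branch_pc):
    return bht[branch_pc % SIZE] >= 2        # 2,3 = taken; 0,1 = not taken

def train(branch_pc, taken):
    i = branch_pc % SIZE
    if taken:
        bht[i] = min(bht[i] + 1, 3)
    else:
        bht[i] = max(bht[i] - 1, 0)

for _ in range(3):
    train(0x104, taken=True)
assert predict(0x104) is True
# A different branch aliasing to the same entry disturbs the prediction:
assert 0x204 % SIZE == 0x104 % SIZE
train(0x204, taken=False); train(0x204, taken=False); train(0x204, taken=False)
assert predict(0x104) is False               # interference flipped the guess
```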
43
Dynamic Branch Prediction
Two-Level Dynamic Branch Prediction
The direction of the last k conditional branches in a special-purpose prediction cache, implemented as a shift-register, is named History Register (HR)
It can be global or local; global is one HR for all branches; local is one HR per branch
Target addresses of the last branches can also reside in a special purpose prediction cache, called the branch target address cache (BTAC)
44
Dynamic Branch Prediction
Two-Level Dynamic Branch Prediction
Use the HR as an index into an array of patterns, called the pattern table (PT)
Each pattern typically implemented as a 2-bit counter predicting the future condition for this situation
Once the current branch has been completely computed, update the HR by shifting in the current condition –shifting out and losing the oldest bit– and update PT[HR] as indexed by the prior history register state
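Putting the pieces together, here is a sketch of one local two-level predictor slice, assuming k = 4 history bits and a PT of 2^k two-bit saturating counters; variable names are invented and real HW would implement this in flip-flops and SRAM:

```python
K = 4
hr = 0                      # history register for one conditional branch
pt = [0] * (2 ** K)         # pattern table: one 2-bit counter per pattern

def predict():
    return pt[hr] >= 2      # upper half of the counter range: predict taken

def update(taken):
    global hr
    # Update PT[HR] as indexed by the history state used for the prediction,
    # then shift the actual outcome into the HR.
    if taken:
        pt[hr] = min(pt[hr] + 1, 3)
    else:
        pt[hr] = max(pt[hr] - 1, 0)
    hr = ((hr << 1) | int(taken)) & ((1 << K) - 1)

# Train on a strictly alternating taken/not-taken branch; the predictor
# learns the pattern because each 4-bit history selects its own counter.
for i in range(32):
    update(taken=(i % 2 == 0))
# History is now ...1010; in this pattern the next outcome is 'taken':
assert predict() is True
```

A single 1-bit or 2-bit predictor cannot learn this alternating branch at all, which is exactly the benefit of the second level.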
45
Dynamic Branch Prediction
Local Branch Prediction
Local means that each conditional branch has its own, private branch prediction history cache
For example, each conditional branch may have its own two-level, adaptive branch predictor, with a unique history buffer, and either a local pattern history table, or a global one, shared between conditional branches
For example, the Intel Pentium MMX, Pentium II, and Pentium III used local branch predictors, with a local 4-bit branch history and a local pattern history table of 16 entries (2^4 entries per conditional branch); see [12]
46
Dynamic Branch Prediction
Two-Level Dynamic Prediction by Yeh and Patt
Uses the by now familiar History Register (HR)
Whether this is one local register per conditional branch, or a single global register, we differentiate later
A HR has an associated Pattern Table (PT)
The HR is a k-bit shift register that stores the history of the last k outcomes of its associated conditional branch --or possibly of all k branches
The PT is accessed (indexed) by this history pattern, so the identified entry can predict the next condition’s outcome
47
Dynamic Branch Prediction: Two-Level Dynamic Prediction by Yeh and Patt
That prediction is performed by an FSA (finite state automaton), using the stored bits of the PT to make a guess
The new state of the PT is derived from 2 inputs: previous state and real outcome of the branch, once the condition has actually been computed --or corrected, if required
Also the HR is updated by left-shifting the new branch bit (1 if taken, else 0) in, and the oldest bit out of the HR
Usually each PT entry is a 2-bit saturating counter
Reaches accuracy of ~97%. Yeh and Patt argue that for super-pipelined, high-issue architectures 97% is still poor!
48
Dynamic Branch Prediction
Two-Level Dynamic Prediction by Yeh and Patt
Figure below shows the scheme for conditional branch instruction C0
HR can exist once, in which case it applies globally to all branch instructions, and then interferes with the prediction of any other branch
Or architecture may dedicate one local HR per branch, replicating n HRs, one for each of the last n distinct branch instructions
Also, PT may exist once globally for all HR, or a private PT may exist for each HR, provided HRs are replicated per branch
49
Dynamic Branch Prediction
Two-Level Dynamic Prediction by Yeh and Patt
[Figure: for conditional branch C0, the History Register (HR) value is used as an index into the Pattern Table (PT)]
50
Dynamic Branch Prediction: Yeh and Patt Nomenclature
Prediction scheme by Yeh and Patt (ref. [4] - [7]) can be effective, but consumes ample cache space
For each branch instruction it consumes a Branch History Register of k bits, an address tag, and a PT of 2^k entries, each holding a 2-bit prediction pattern
Could this same space be used better?
Yeh and Patt measured varying accuracies for the same program, the same number of cache bits, varying the scheme as follows:
Instead of always using one BHR per branch and one PT per branch, their experiments associate varying numbers of PT entries with one BHR
51
Dynamic Branch Prediction: Yeh and Patt Nomenclature
Unintuitive as this may sound, Yeh and Patt observed good prediction accuracy for one global PT, and measured this variation as well
Since the total number of bits consumed for the cache was held constant, a larger number of history bits and/or a larger number of last executed branches could be used
Varying the number of BH registers and the number of PTs led to the following nomenclature:
Varying BH: P = one BH register per branch; G = one global BH register for all branches
A = Adaptive
Varying PT: p = one PT per branch; g = one global PT for all branches
52
Dynamic Branch Prediction: Yeh and Patt Nomenclature
Theoretically there are 4, but practically just 3 meaningful choices: GAg, PAg, and PAp
Complete measurements were conducted for a growing budget of bits, from 8 k to 128 k bits of total cache space
Interestingly, for sufficiently large cache storage, Yeh and Patt found that the best scheme, constrained by 128 k bits, is not PAp but the PAg scheme
Also unintuitively, PAg is most cost-effective. This delivered the highest accuracy for a fixed HW budget, despite interference
For other HW budgets, Yeh and Patt found different optimal schemes
53
Prediction Accuracy For SPECint92
The vertical axis below shows prediction accuracy in percent for branches in SPECint92. The horizontal axis lists prediction schemes in improving order, left to right
[Chart: Approximate Prediction Accuracies in %, for the schemes (left to right): always taken, never taken, BTFN, 1-bit bias no profile, 1-bit bias with profiling, 1-bit dynamic history, 2-bit dynamic history, 2-level branch prediction]
54
Summary
Without good branch prediction, pipelined architectures would not be useful
The number of transfer-of-control instructions dynamically executed is too large to even reach the steady state
Static branch predictions are cost effective, but inadequate for deep pipes
Dynamic branch prediction is needed to achieve the > 97% prediction accuracy required for reaching the steady state
55
Bibliography
1. Gwennap, L. [1995]. “New Algorithm Improves Branch Prediction,” Microprocessor Report, March 1995, pp. 17-21
2. Gwennap, L. [1995]. “New Algorithm Improves Branch Prediction,” MicroDesign Resources, Vol. 9, No. 4, March 27, 1995, on the web at: https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15213-f00/docs/mpr-branchpredict.pdf
3. Smith, J. [1981]. “A Study of Branch Prediction Strategies,” 8th International Symposium on Computer Architecture, May 1981, pp. 135-148
4. Yeh, T. and Y. Patt [1991]. “Two-Level Adaptive Branch Prediction,” 24th Annual International Symposium on Microarchitecture, November 1991, pp. 51-61
5. Yeh, T. and Y. Patt [1992]. “Alternative Implementations of Two-Level Adaptive Branch Prediction,” 19th International Symposium on Computer Architecture, May 1992, pp. 124-134
6. Yeh, T. and Y. Patt [1993]. “A Comparison of Dynamic Branch Predictors That Use Two Levels of Branch History,” 20th International Symposium on Computer Architecture, May 1993, pp. 257-266
7. Yeh, Tse-Yu, and Yale N. Patt [1992]. “Alternative Implementation of Two-Level Adaptive Branch Prediction”, 19th Annual International Symposium on Computer Architecture, pp 124-134. Can be located on web pages of University of Michigan
56
Bibliography
8. McFarling, Scott [1993]. “Combining Branch Predictors”, WRL Technical Note TN 36, Digital Western Research Lab, June 1993
9. Hilgendorf, R. B., et al. [1999]. “Evaluation of branch-prediction methods on traces from commercial applications.” www.research.ibm.com/journal/rd/434/hilgendorf.html IBM Journal of Research & Development
10. Hsien-Hsin Sean Lee: “Branch Prediction”, http://users.ece.gatech.edu/~sudha/academic/class/ece4100-6100/Lectures/Module3-BranchPrediction/branch.prediction.pdf
11. Daniel A. Jiménez, Calvin Lin, [6.2000]. “Dynamic Branch Prediction with Perceptrons.” Proceedings of the 7th International Symposium on High Performance Computer Architecture
12. Wikipedia, 2011, http://en.wikipedia.org/wiki/Branch_predictor
13. Real World Technologies: http://www.realworldtech.com/cpu-perf-analysis/5/