1
ECE 587 Advanced Computer Architecture I
Chapter 7: Branch Prediction
Herbert G. Mayer, PSU
Status 7/21/2015
2
Motivation
If only we could predict the future, computation would be swift and accurate
In such an imaginary world with clairvoyance, the effect of stalls created by branches in a pipelined architecture could be eliminated: as soon as a branch is decoded, the pipeline could be primed again with the right instruction stream from the new location, the destination of the predicted branch
Unfortunately, we generally don’t know whether a conditional branch is taken until the condition is completely evaluated
We also don’t know the destination of a taken branch, conditional or not, until that address has been computed. The same holds for other flow-control instructions, such as calls, returns, exceptions, etc.
3
Motivation
We can guess the outcome of a conditional branch and the destination address of any branch; we can also guess wrong
We cannot predict with certainty! To help us guess the right branch destination, we could remember the branch destination from the last time this branch transferred control, and then predict that this time the destination just might be the same
If this would help us guess right most of the time, we would get some advantage. Guessing always right would be nicer, but we are, after all, mere mortals
Branch prediction strategies intend to learn from the past to guess future behavior correctly most of the time. In practice one can reach > 97% accurate prediction, which helps in a pipelined architecture
4
Motivation
In fact, to take advantage of pipelined execution, very high prediction accuracy is mandatory
Each time a prediction is wrong, the pipe has to be flushed: multiple arithmetic units hold invalid operands that are not needed, and thus all HW speed-up methods are in vain
The deeper the pipes on a pipelined architecture, the more stringent are the accuracy requirements for a branch prediction scheme
On an Intel Cedar Mill (2006) processor, like Prescott, the pipeline is over 2 dozen stages deep; on average about 5-6 branches are in flight inside that complex, so without branch prediction the pipeline would practically never reach the steady state
5
Syllabus
Definitions
Introduction
What’s Bad About Branches?
Static Branch Prediction
Dynamic Branch Prediction
A Two-Level Dynamic Prediction Scheme
Yeh and Patt Nomenclature
Prediction Accuracies for SPECint92
Bibliography
6
Some Definitions
7
Definitions: BHT, Acronym for Branch History Table
The Branch History Table (BHT) is the collection of Branch History Registers (HR), used in Single-Level or Two-Level dynamic branch prediction
There could be a.) one HR per conditional branch, b.) one HR each for the last n > 1 branches, or c.) just a single HR for all conditional branches
Choice a.) is the most accurate, yet its HW cost can be excessive. Choice c.), while being the least accurate, also costs the least in terms of HW
Often architects must select a compromise
This trade-off of resource cost vs. accuracy is akin to the mapping policy employed in cache design
8
Definitions: BHT, Cont’d
On actual branch prediction HW, just the last few branches executed have their associated HR, otherwise too much HW –silicon space– for the BHT would be consumed
Each HR records for the last k executions of its associated conditional branch whether that branch was taken
In a Two-Level dynamic branch prediction scheme, the HR has an associated Pattern Table (PT), indexed by the HR
The entry in the PT guesses, whether the next branch will be taken. The cost in bits can be contained, because not all branches need to have an associated HR
9
Definitions: Branch Prediction
Heuristic that guesses –based on past branching history– the destination of the current branch, the Boolean outcome of the next condition for a branch, or both, as soon as a branch instruction is being decoded
100% accurate prediction of a branch is, of course, not possible; neither for the condition nor for the target
Heuristics aim at guessing right most of the time
For highly pipelined and superscalar architectures “most of the time” has to mean 97% or more
10
Definitions: Branch Profiling
Compile a program with a special compiler directive.
Then measure at run-time, for each conditional branch, how many times each branch was taken
Next time this same program is compiled, the measured results of the prior run are available to the compiler. That info enables a compiler to bias conditional branches according to past behavior
Underlying this scheme is the assumption that past behavior is a reflection of the future. Branch profiling is one of the static branch prediction schemes. It costs one additional profiling execution, plus instruction bits in the opcode for the compiler to set the branch bias one way or the other
Generally, static prediction, even with the benefit of a profiling run, is not sufficiently effective
11
Definitions: BTAC, Branch Target Address Cache
For very fast performance, it is not sufficient to know (i.e. guess) ahead of time, whether a conditional branch will be taken
For any branch –including unconditional– the branch destination should be known a priori
For this reason, each branch in a BTAC implementation has an associated target address, used by the instruction fetch unit to continue filling the pipeline from places other than the next one
After complete decoding of an instruction, the target is also computed. But knowing the target earlier speeds up filling a potentially stalled pipeline
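The BTAC idea above can be modeled in a few lines of software. The following Python sketch is purely illustrative: all names, sizes, and addresses are invented, and a real BTAC is hardware with tags and limited associativity.

```python
# Toy model of a Branch Target Address Cache (BTAC): a small direct-mapped
# table, indexed by low-order bits of the branch address, that remembers
# the last known target of each branch.

class BTAC:
    def __init__(self, entries=8):
        self.entries = entries
        self.table = [None] * entries  # each entry: (tag, target) or None

    def lookup(self, branch_pc):
        """Return the predicted target address, or None on a miss."""
        entry = self.table[branch_pc % self.entries]
        if entry is not None and entry[0] == branch_pc:
            return entry[1]
        return None

    def update(self, branch_pc, target):
        """After the real target is computed, remember it for next time."""
        self.table[branch_pc % self.entries] = (branch_pc, target)

btac = BTAC()
assert btac.lookup(0x40) is None    # first encounter: miss, must wait
btac.update(0x40, 0x100)
assert btac.lookup(0x40) == 0x100   # next time: target known at fetch time
```

The fetch unit would consult `lookup` while the branch is still being decoded; on a hit it can keep filling the pipeline from the remembered target.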
12
Definitions: BTB, Branch Target Buffer
For very fast performance, it is best to know ahead of time whether or not a conditional branch will be taken, and where such a branch leads
The former can be implemented using a BHT with Pattern Table; the latter can be implemented using a BTAC
The combination of these two is called the BTB
This scheme is implemented on Intel Pentium Pro® and newer Intel architectures
13
Definitions: BTFN, Backwards Taken Forward Not
A static prediction heuristic assuming that program execution time is dominated by loops, especially While Loops
While loops are characterized by an unconditional branch at the end of the loop body back to the condition, and a conditional branch if false at the start, leading to the successor of the loop body
The backward branch is always taken, and to the same destination; the forward branch, if the condition is false, is taken just once
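The entire BTFN heuristic amounts to one address comparison. A minimal Python sketch (function and parameter names are illustrative, not from the slides):

```python
# BTFN (Backwards Taken, Forward Not) static prediction heuristic:
# backward branches close loops and are predicted taken; forward
# branches skip past code and are predicted not taken.

def btfn_predict(branch_pc, target_pc):
    """Predict taken for backward branches, not taken for forward ones."""
    return target_pc < branch_pc

# Backward branch at the end of a loop body: predicted taken
assert btfn_predict(branch_pc=0x200, target_pc=0x180) is True
# Forward branch past the loop body: predicted not taken
assert btfn_predict(branch_pc=0x180, target_pc=0x210) is False
```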
14
Definitions: BTFN, Backwards Taken Forward Not
Since While Statements are often executed repeatedly, the BTFN heuristic guesses correctly the majority of the time
This method has the inherent limitations of static schemes
Also, many optimizers re-arrange the object code for While Statements in a way that the condition is moved to the end, obscuring this whole scheme
Exercise for students: how can while-loop code with a conditional branch at the top plus an unconditional branch back at the end be converted to a single conditional branch at the end? Hint: there will be an initial, fixed-cost overhead!
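One possible shape of the answer, sketched here in Python rather than in branch instructions (a hedged illustration, not the only solution):

```python
# Original shape: conditional forward branch at the top of the loop,
# unconditional branch back at the end -- two branches per iteration.
def while_top_test(cond, body, state):
    while cond(state):          # conditional branch at the top
        state = body(state)     # unconditional branch back at the end
    return state

# Rotated shape: one initial test (the fixed-cost overhead from the hint),
# then a single conditional backward branch at the end of the loop body.
def while_bottom_test(cond, body, state):
    if cond(state):             # paid once, before entering the loop
        while True:
            state = body(state)
            if not cond(state):  # the single conditional branch at the end
                break
    return state

cond = lambda n: n < 5
body = lambda n: n + 1
assert while_top_test(cond, body, 0) == while_bottom_test(cond, body, 0) == 5
assert while_top_test(cond, body, 9) == while_bottom_test(cond, body, 9) == 9
```

The rotated form executes only one (backward, usually taken) branch per iteration, which is exactly what BTFN predicts well.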
15
Definitions: Delay of Transfer, Delay Transfer Slot
Certain pipelined CPUs execute another instruction before the current unconditional branch
The instruction executed in that slot is the one physically located at the target of the branch
The reason is to greedily recover some of the time lost to the pipeline stall. Thus, compilers or programmers can place the target instruction of the branch physically after the branch: placed after the branch, executed before the branch completes, never reached normally
Since it is supposed to be executed anyway, as soon as a branch has reached its target, and since the HW already executes it before completing the branch, time is saved
16
Definitions: Delay of Transfer, Delay Transfer Slot
Note that at the target of such an unconditional branch the relocated instruction must be skipped; that enables the time saving!
Example: Intel i860 architecture: When a suitable candidate cannot be found, a NOP instruction is placed physically after the branch, i.e. into the delay slot
Done also on Sun SPARC architecture
There are restrictions: for example, branch instructions and other control-transfer instructions cannot be placed into the delay slot. If that would happen, a phenomenon called code visiting would occur, with unpredictable side-effects at times; hence the restriction
17
Definitions: Dynamic Branch Prediction
Branch prediction policy that changes dynamically with the execution of the program
Dynamic branch prediction is architecture transparent, i.e. no bits are visible in the opcode
Different from some static branch prediction methods, which have suitable bits in their opcode
Antonym: Static Branch Prediction
We focus on dynamic branch prediction here
18
Definitions: History Register (HR)
k-bit shift register, associated with a conditional branch
The bits indicate for each of the last k executions of that associated conditional branch, whether it was taken, 1 saying yes
The newest bit shifts out the oldest, since a HR has only some limited, fixed length k of bits available
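The HR behavior just described can be modeled in a few lines of Python; this is a sketch only (a real HR is k flip-flops, and the class name is invented):

```python
class HistoryRegister:
    """k-bit shift register: 1 = branch taken, 0 = not taken."""
    def __init__(self, k=4):
        self.k = k
        self.bits = 0

    def record(self, taken):
        # Shift in the newest outcome; the oldest bit falls off the top.
        self.bits = ((self.bits << 1) | int(taken)) & ((1 << self.k) - 1)

    def value(self):
        return self.bits  # usable directly as an index into a Pattern Table

hr = HistoryRegister(k=4)
for outcome in [True, True, False, True]:
    hr.record(outcome)
assert hr.value() == 0b1101   # last four outcomes: taken, taken, not, taken
hr.record(True)
assert hr.value() == 0b1011   # the oldest bit (the first 'taken') shifted out
```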
19
Definitions: Interference, Branch Interference
When multiple branches are associated with one HW data structure (such as an HR or PT) the behavior of each branch will influence the data structure’s state
However, the stored data will be used to predict the next branch, even if that branch is not the one that most recently modified the state
The reason for doing this is limited HW availability, i.e. saving HW cost (silicon space)
The effect is diminished precision
20
Definitions
IPC
Instructions per cycle: A measure for Instruction Level Parallelism
IPC quantifies how many different instructions are being executed –not necessarily all to completion– during one single cycle
It is desirable to have an IPC rate > 1
Given sufficient parallelism, IPC can be >> 1
On conventional UP CISC architectures it is typical to have IPC << 1
21
Definitions
Mispredicted Branch, AKA Miss
The branch condition or branch destination was predicted incorrectly
As a consequence, the control of execution took a different flow than predicted
This requires dynamic correction at run time and costs time
The cost often is a stalled pipeline that has to be flushed and re-loaded
22
Definitions: Mispredicted Branch Penalty
Number of cycles lost, due to having incorrectly guessed the change in flow of control, caused by a branch instruction
Since prediction accuracy is never 100%, there will always be some Mispredicted Branch Penalty
Goal is to keep the number of mispredictions well below 3% of all branches executed
23
Definitions: Pattern Table (PT)
A HW table of entries, each specifying whether its associated conditional branch will be taken
An entry in the PT is selected by using the history bits of a branch History Register (HR)
This can be done by indexing, in which case the number of entries in the PT is 2^k, with k being the number of bits stored in the History Register
Otherwise, if the number of entries is < 2^k, a hashing scheme is applied, causing interference!
Each PT entry holds boolean information about the next conditional branch: will it be taken or not?
24
Definitions
Pipelining
Mode of execution, in which one instruction is initiated every cycle and ideally one retires every cycle, even though each requires multiple (possibly many) cycles to complete
Highly pipelined Xeon processors, for example, have a > 20-stage pipeline
25
Definitions: Saturating Counter
HW n-bit unsigned integer counter, n typically being 2 .. 16 for branch prediction HW
When all bits are on and counting up continues, a saturating counter simply stays at the maximum value
Similarly, when all bits are off and counting down continues, the saturating counter stays at 0
Creates a limited hysteresis effect on the behavior of the specific event that depends on this counter
Architecture challenge: select a history length (n bits) such that the cost is low and the accuracy sufficient to support the overall goal of > 97%
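A minimal Python model of the saturating counter described above (illustrative only; HW implements this as n flip-flops plus increment/decrement logic, and the class name is invented):

```python
class SaturatingCounter:
    """n-bit unsigned saturating counter; n=2 is typical for branch prediction."""
    def __init__(self, n=2):
        self.max = (1 << n) - 1
        self.value = 0

    def up(self):
        self.value = min(self.value + 1, self.max)   # sticks at the maximum

    def down(self):
        self.value = max(self.value - 1, 0)          # sticks at zero

    def predict_taken(self):
        # Upper half of the range predicts taken. This is the hysteresis:
        # one wrong outcome does not flip a strongly-held prediction.
        return self.value > self.max // 2

c = SaturatingCounter(n=2)
c.up(); c.up(); c.up(); c.up()
assert c.value == 3                  # saturates at 3, does not wrap to 0
assert c.predict_taken() is True
c.down()
assert c.predict_taken() is True     # one not-taken outcome: still predicts taken
c.down()
assert c.predict_taken() is False
```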
26
Definitions: Shift Register
HW register with small number of bits, tracking a binary event
If the event did occur, a 1 bit is shifted into the register at one end. This will be the newest bit
The oldest bit is shifted out at the opposite end
Conversely, if the event did NOT occur, a 0 bit is shifted in, and the oldest bit is shifted out
All other bits shift their bit position by one place
At any moment the shift register holds a history of the associated event’s last n occurrences
27
Definitions
Static Branch Prediction
A branch prediction policy that is embedded in the binary code –ISA visible
Or implemented in the hardware executing the branches –not ISA visible
The policy does not change during execution of the program, even if known to be wrong all the time
In the latter case, execution would be better off without branch prediction
28
Definitions: Static Branch Prediction
BTFN heuristic is a static branch prediction policy
Requires no opcode bits, hence is NOT ISA visible
HW compares the destination of a branch with the conditional branch’s own address. Destinations smaller lead backwards and are assumed taken
Destination addresses larger than the branch address are assumed not taken, and the next instruction predicted is the successor of the conditional branch
Typical industry benchmarks (SPECint89) achieve almost 65% correct prediction with this simple scheme
29
Definitions: Two-Level Branch Prediction
Instead of solely associating a local branch history register with a conditional branch, a two-level branch prediction scheme associates history bits (pattern table) with branch execution history
Thus, each pattern of past branch behaviors has its own future prediction, costing more HW, but yielding better accuracy
For example, each conditional branch may have a k-bit Branch History Register, which records for each of the last k executions, whether or not the condition was satisfied
And each history pattern has an associated prediction of the future in another data structure; typically implemented as 2-bit saturating counter
30
Definitions: Wide Issue
Older architectures issue (i.e. fetch, decode, etc.) one instruction at a time; for example, 1 instruction per clock cycle on a RISC architecture
Computers after 1980 issue more than 1 instruction at a time; this is called a wide issue
Synonym: super-scalar architecture
More precisely, superscalar architectures require wide issue I-fetches
Antonym: Single-issue
31
Introduction
Execution on a highly pipelined and wide-issue architecture suffers severe degradation whenever an instruction disrupts the prefetched flow of operations, which are in various stages of partial completion
Typically, control-transfer instructions cause pipeline hazards
The higher the degree of pipelining, the more partially executed (fetched, decoded, operand-fetched, etc.) instructions must be discarded
The pipeline must be flushed and then primed again, i.e. be filled again with other, soon partially executed instructions
32
Introduction
However, more than one in five operations is a control-flow instruction –e.g. branch, call, return, conditional branch, exit, abort, exception, etc.
This almost invalidates the architectural advantage of pipelining
If it were possible to predict a condition, and if the machine could predict the destination of a branch before computing it from the instruction stream, then as soon as any branch is fetched, the pipe could be filled correctly; stalls would be avoided
Next follow statistics about branch prediction accuracy for a common benchmark on widely used processors
33
Introduction
Intel Core Duo two-level dynamic branch prediction vs. AMD K8 shows the benefit of Intel’s branch prediction investment; see [13]
34
What’s Bad About Branches?
Performance Penalties, Delay, Disturbance:
Disruption of sequential control flow, hence the anticipated flow of the pipeline is disturbed
The higher the number of pipeline stages, the greater the penalty. Another case in point: deep pipelining is a liability, not pure goodness!
Branches cause I-cache disturbance due to some new address range
Conditional branch must determine the future direction: fall-through or to the new target?
Unconditional branches must determine the new instruction’s target
35
What’s Bad About Branches?
Determine Branch Direction:
Cannot immediately fetch subsequent instruction, since it is not known
Remedy: if possible, move the instructions that compute the branch condition away from the branch, so that waiting for the condition is minimized
Or make use of penalty, see Branch Delay Slot
Bias the case toward NOT taken, or vice versa; done in some static prediction schemes
36
What’s Bad About Branches?
Determine Branch Direction, Cont’d:
Fill delay slot with useful instruction (Intel 860 processor)
This HW trick is being used less and less in the 2000s; often ends up being a noop anyway
Execute both paths speculatively. Once the condition is known, kill the superfluous path. Requires more HW, and can cause an explosion of HW when branching on to further branches; done successfully on the Itanium Processor Family (IPF), at high HW cost
Or predict branch direction, discussed here!
Determine Branch Target: Must know target address, to fetch next; for that, use prediction
37
What’s Bad About Branches?
A Saturating Counter prediction algorithm with just 2 bits reaches 80% accuracy. An awesome policy: 2 data bits plus logic suffice for remarkable accuracy! Even a single global counter for all branches works, despite interference!
Two-Bit Saturating Counter, Taken vs. Not Taken
38
Static Branch Prediction
Common to static branch prediction: small cost in extra hardware and cache
Achieves ~70% accuracy in prediction, though cheap!
Is generally insufficient for highly pipelined or for multi-way, superscalar architectures
Typical static prediction schemes are:
condition not taken: assumes the conditional branch is not taken; the pipeline continues to be filled with instructions physically after the conditional branch; example: early Intel® 486; but this proved to be correct only a little over 40% of the time; hence it would have been better to abstain from this prediction
39
Static Branch Prediction
condition taken: assumes conditional branches are taken; pipeline continues to be filled with instructions at destination of conditional branch
correct about 60% of the time; can be advantageous for low degrees of pipelining
BTFN: assumes execution is dominated by while loops; true in some code
un-optimized while-loops have a conditional branch around the loop body, directed forward to the first instruction after the loop body
un-optimized while-loops then use an unconditional branch back to the beginning of the loop body; hence BTFN prediction; accurate up to ~65%
40
Static Branch Prediction
Single-bit bias, no profile: provide conditional instructions with a bit in the opcode, indicating whether the condition is likely true; a clue for the HW
The compiler can analyze source code and make reasonable guesses about a condition’s outcome; this is encoded in the extra bit; reaches ~70% accuracy
For example, exceptions and assertions are almost never taken; compiler generates clue in object code
Single-bit bias, with profiling: run the program, initially compiled without a profile in the bias bit. Then, for all conditional branches, count the number of times the condition was true during the run; use the count to set the bias bit; achieves ~75% accuracy
Note: Trace Scheduling similarity! There the penalty for wrong prediction is correction code
41
Dynamic Branch Prediction
One prediction bit per I-cache line: this scheme encodes no information in the instruction stream, i.e. no information is assembled into the conditional branch instruction
Instead, each cache line holding a set of x instructions in the I-cache has an associated prediction bit
If set, bit predicts that next executed conditional branch in this I-cache line will be taken
Problem: There may be no conditional branch in the line at all, thus wasting the bit in the cache
More serious for performance, there may be multiple conditional branches, causing interference among the predictions of their respective conditions
42
Dynamic Branch Prediction
Advantage: low cost, on the order of 1% of cache area; amazingly, reaches up to 80% accuracy
2 prediction bits per I-cache line: similar to above, but uses 2-bit saturating counter to predict next branch; can achieve additional accuracy; a single wrong guess does not disrupt the scheme; yet suffers similarly from waste & interference
Branch History Table (BHT): use a history bit, a saturating 2-bit counter, or a longer shift register for each represented branch
Contain cost of history cache area: allot entries only for the last k different branch instructions executed; advantage: increases accuracy to 85%; implemented in Pentium ®. Total cache size is significantly smaller than possible # of branches in program; so evictions will occur, like in regular data cache
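The limited-size BHT and its aliasing effect can be illustrated with a toy direct-mapped table of 2-bit counters; every size, name, and address below is invented for the example:

```python
# Toy direct-mapped BHT: one 2-bit saturating counter per entry, indexed
# by low-order bits of the branch address. With far fewer entries than
# branches, distinct branches alias to the same counter -- the
# eviction/interference effect described above.

SIZE = 16  # far smaller than the number of branches in a real program
bht = [0] * SIZE  # each entry: 2-bit counter, 0..3

def predict(branch_pc):
    return bht[branch_pc % SIZE] >= 2        # 2,3 = taken; 0,1 = not taken

def train(branch_pc, taken):
    i = branch_pc % SIZE
    if taken:
        bht[i] = min(bht[i] + 1, 3)
    else:
        bht[i] = max(bht[i] - 1, 0)

for _ in range(3):
    train(0x104, taken=True)
assert predict(0x104) is True
# A different branch aliasing to the same entry disturbs the prediction:
assert 0x204 % SIZE == 0x104 % SIZE
train(0x204, taken=False); train(0x204, taken=False); train(0x204, taken=False)
assert predict(0x104) is False               # interference flipped the guess
```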
43
Dynamic Branch Prediction
Two-Level Dynamic Branch Prediction
The direction of the last k conditional branches in a special-purpose prediction cache, implemented as a shift-register, is named History Register (HR)
It can be global or local; global is one HR for all branches; local is one HR per branch
Target addresses of the last branches can also reside in a special purpose prediction cache, called the branch target address cache (BTAC)
44
Dynamic Branch Prediction
Two-Level Dynamic Branch Prediction
Use the HR as an index into an array of patterns, called the pattern table (PT)
Each pattern typically implemented as a 2-bit counter predicting the future condition for this situation
Once the current branch has been completely computed, update the HR by shifting in the current condition –shifting out and losing the oldest bit– and update PT[HR] as indexed by the prior history register state
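Putting the pieces together, here is a sketch of one local two-level predictor slice, assuming k = 4 history bits and a PT of 2^k two-bit saturating counters; variable names are invented and real HW would implement this in flip-flops and SRAM:

```python
K = 4
hr = 0                      # history register for one conditional branch
pt = [0] * (2 ** K)         # pattern table: one 2-bit counter per pattern

def predict():
    return pt[hr] >= 2      # upper half of the counter range: predict taken

def update(taken):
    global hr
    # Update PT[HR] as indexed by the history state used for the prediction,
    # then shift the actual outcome into the HR.
    if taken:
        pt[hr] = min(pt[hr] + 1, 3)
    else:
        pt[hr] = max(pt[hr] - 1, 0)
    hr = ((hr << 1) | int(taken)) & ((1 << K) - 1)

# Train on a strictly alternating taken/not-taken branch; the predictor
# learns the pattern because each 4-bit history selects its own counter.
for i in range(32):
    update(taken=(i % 2 == 0))
# History is now ...1010; in this pattern the next outcome is 'taken':
assert predict() is True
```

A single 1-bit or 2-bit predictor cannot learn this alternating branch at all, which is exactly the benefit of the second level.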
45
Dynamic Branch Prediction
Local Branch Prediction
Local means that each conditional branch has its own, private branch prediction history cache
For example, each conditional branch may have its own two-level, adaptive branch predictor, with a unique history buffer, and either a local pattern history table, or a global one, shared between conditional branches
For example, the Intel Pentium MMX, Pentium II, and Pentium III used local branch predictors, with a local 4-bit branch history and a local pattern history table of 16 entries (2^4 entries per conditional branch); see [12]
46
Dynamic Branch Prediction
Two-Level Dynamic Prediction by Yeh and Patt
Uses the by now familiar History Register (HR)
Whether this is one local register per conditional branch, or a single global register, we differentiate later
A HR has an associated Pattern Table (PT)
The HR is a k-bit shift register that stores the history of the last k outcomes of its associated conditional branch --or possibly of all k branches
The PT is accessed (indexed) by this history pattern, so the identified entry can predict the next condition’s outcome
47
Dynamic Branch Prediction: Two-Level Dynamic Prediction by Yeh and Patt
That prediction is performed by an FSA (finite state automaton), using the stored bits of the PT to make a guess
The new state of the PT is derived from 2 inputs: previous state and real outcome of the branch, once the condition has actually been computed --or corrected, if required
Also the HR is updated by left-shifting the new branch bit (1 if taken, else 0) in, and the oldest bit out of the HR
Usually each PT entry is a 2-bit saturating counter
Reaches accuracy of ~97%. Yeh and Patt argue that for super-pipelined, high-issue architectures 97% is still poor!
48
Dynamic Branch Prediction
Two-Level Dynamic Prediction by Yeh and Patt
Figure below shows the scheme for conditional branch instruction C0
HR can exist once, in which case it applies globally to all branch instructions, and then interferes with the prediction of any other branch
Or architecture may dedicate one local HR per branch, replicating n HRs, one for each of the last n distinct branch instructions
Also, PT may exist once globally for all HR, or a private PT may exist for each HR, provided HRs are replicated per branch
49
Dynamic Branch Prediction
Two-Level Dynamic Prediction by Yeh and Patt
[Figure: for conditional branch C0, the History Register (HR) value is used as an index into the Pattern Table (PT)]
50
Dynamic Branch Prediction: Yeh and Patt Nomenclature
Prediction scheme by Yeh and Patt (ref. [4] - [7]) can be effective, but consumes ample cache space
For each branch instruction it consumes a Branch History Register of k bits, an address tag, and a PT of 2^k entries, each holding a 2-bit prediction pattern
Could this same space be used better?
Yeh and Patt measured varying accuracies for the same program, the same number of cache bits, varying the scheme as follows:
Instead of always using one BHR per branch and one PT per branch, their experiments associate varying numbers of PT entries with one BHR
51
Dynamic Branch Prediction: Yeh and Patt Nomenclature
Unintuitive as this may sound, Yeh and Patt observed good prediction accuracy for one global PT, and measured this variation as well
Since the total number of bits consumed for the cache was held constant, a larger number of history bits and/or a larger number of last executed branches could be used
Varying the number of BH registers and the number of PTs led to the following nomenclature:
Varying BH: P = one BH register per branch; G = one global BH register for all branches
A = Adaptive
Varying PT: p = one PT per branch; g = one global PT for all branches
52
Dynamic Branch Prediction: Yeh and Patt Nomenclature
Theoretically there are 4, but practically just 3 meaningful choices: GAg, PAg, and PAp
Complete measurements were conducted for a growing budget of bits, from 8 k to 128 k bits of total cache space
Interestingly, for sufficiently large cache storage, Yeh and Patt found that the best scheme, constrained by 128 k bits, is not PAp but the PAg scheme
Also unintuitively, PAg is most cost-effective. This delivered the highest accuracy for a fixed HW budget, despite interference
For other HW budgets, Yeh and Patt found different optimal schemes
53
Prediction Accuracy For SPECint92
The vertical axis below shows prediction accuracy in percent for branches in SPECint92. The horizontal axis lists prediction schemes in improving order, left to right
[Chart: Approximate Prediction Accuracies in %, for the schemes (left to right): always taken, never taken, BTFN, 1-bit bias no profile, 1-bit bias with profiling, 1-bit dynamic history, 2-bit dynamic history, 2-level branch prediction]
54
Summary
Without good branch prediction, pipelined architectures would not be useful
The number of transfer-of-control instructions dynamically executed is too large to even reach the steady state
Static branch predictions are cost effective, but inadequate for deep pipes
Dynamic branch prediction is needed to achieve the > 97% prediction accuracy required for reaching the steady state
55
Bibliography
1. Gwennap, L. [1995]. “New Algorithm Improves Branch Prediction,” Microprocessor Report, March 1995, pp. 17-21
2. Gwennap, L. [1995]. “New Algorithm Improves Branch Prediction,” MicroDesign Resources, Vol. 9, No. 4, March 27, 1995, on the web at: https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15213-f00/docs/mpr-branchpredict.pdf
3. Smith, J. [1981]. “A Study of Branch Prediction Strategies,” 8th International Symposium on Computer Architecture, May 1981, pp. 135-148
4. Yeh, T. and Y. Patt [1991]. “Two-Level Adaptive Branch Prediction,” 24th Annual International Symposium on Microarchitecture, November 1991, pp. 51-61
5. Yeh, T. and Y. Patt [1992]. “Alternative Implementations of Two-Level Adaptive Branch Prediction,” 19th International Symposium on Computer Architecture, May 1992, pp. 124-134
6. Yeh, T. and Y. Patt [1993]. “A Comparison of Dynamic Branch Predictors That Use Two Levels of Branch History,” 20th International Symposium on Computer Architecture, May 1993, pp. 257-266
7. Yeh, Tse-Yu, and Yale N. Patt [1992]. “Alternative Implementation of Two-Level Adaptive Branch Prediction”, 19th Annual International Symposium on Computer Architecture, pp 124-134. Can be located on web pages of University of Michigan
56
Bibliography
8. McFarling, Scott [1993]. “Combining Branch Predictors”, WRL Technical Note TN 36, Digital Western Research Lab, June 1993
9. Hilgendorf, R. B., et al. [1999]. “Evaluation of branch-prediction methods on traces from commercial applications.” www.research.ibm.com/journal/rd/434/hilgendorf.html IBM Journal of Research & Development
10. Hsien-Hsin Sean Lee: “Branch Prediction”, http://users.ece.gatech.edu/~sudha/academic/class/ece4100-6100/Lectures/Module3-BranchPrediction/branch.prediction.pdf
11. Daniel A. Jiménez, Calvin Lin, [6.2000]. “Dynamic Branch Prediction with Perceptrons.” Proceedings of the 7th International Symposium on High Performance Computer Architecture
12. Wikipedia, 2011, http://en.wikipedia.org/wiki/Branch_predictor
13. Real World Technologies: http://www.realworldtech.com/cpu-perf-analysis/5/