EE382A Lecture 4:
Instruction Fetch Techniques
Department of Electrical Engineering
Stanford University
EE382A – Autumn 2009, John P. Shen
http://eeclass.stanford.edu/ee382a
Announcements
• Start thinking about the research project
– Team: 2-3 students
– Topic: an open problem in computer architecture
• See our suggestions or propose your own
• Don’t limit yourselves to the scope of the lectures
– Project proposal due by 10/14
• No new required paper
– But you have one from the previous lecture due today
– One optional paper on-line; don’t forget the textbook readings
• Class photos
– Upload your picture on axess
A Dynamic Superscalar Processor
[Pipeline diagram]
In order:     Fetch → Instruction/decode buffer → Decode → Dispatch buffer → Dispatch
Out of order: Reservation stations → Issue → Execute → Finish → Reorder/completion buffer
In order:     Complete → Store buffer → Retire
Lecture 4 Outline
1. Superscalar Pipeline Implementation
   a. Instruction Fetch & Decode
   b. Instruction Dispatch & Issue
   c. Instruction Execute
   d. Instruction Complete & Retire
2. Instruction Flow Techniques

1. Superscalar Pipeline Implementation
a. Instruction Fetch & Decode
• Goal of instruction fetch & decode
– Supply the pipeline with the maximum number of useful instructions per cycle
– The fetch logic sets the maximum possible performance (IPCmax)
• Impediments
– Instruction cache misses
– Instruction alignment
– Complex instruction sets
• CISC (x86, 390, etc)
– Branches and jumps
• Determining the instruction address (branch direction & targets)
• Will start in this lecture & finish the discussion in the next one…
[Figure: instruction memory indexed by the PC; 3 instructions fetched per cycle]
Instruction Cache Organization
[Figure: two organizations, 1 cache line = 1 physical row vs. 1 cache line = 2 physical rows]
The Fetch Alignment Problem
[Figure: the PC (e.g. xx...01) selects an I-cache row through the row decoder; because the fetch group starts in the middle of the row, a full row-width group cannot be supplied from a single row access]
Solving the Alignment Problem
• Software solution: align taken branch targets to I-cache row starts
– Effect on code size?
– Effect on the I-cache miss rate?
– What happens when we go to the next chip?
• Hardware solution
– Detect the (mis)alignment case
– Allow access of multiple rows (current and sequential next)
• True multi-ported cache, over-clocked cache, multi-banked cache, …
• Or keep around the cache line from the previous access
– Assuming a large basic block
– Collapse the two fetched rows into one instruction group
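The hardware solution above can be sketched in a few lines. This is an illustrative model only (the row width, data layout, and function name are my own assumptions, not from the slides): when the fetch group straddles a row boundary, read the current row and the sequential next row, then collapse them into one full-width instruction group.

```python
ROW_WIDTH = 4  # instructions per physical I-cache row (assumed for illustration)

def fetch_group(icache_rows, pc):
    """Return ROW_WIDTH consecutive instructions starting at pc,
    even when the group straddles a row boundary."""
    row, offset = divmod(pc, ROW_WIDTH)
    current_row = icache_rows[row]
    next_row = icache_rows[row + 1]           # second row: banked access, or the
                                              # line kept from the previous cycle
    merged = current_row + next_row           # collapse the two fetched rows...
    return merged[offset:offset + ROW_WIDTH]  # ...into one aligned group

rows = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(fetch_group(rows, 2))  # group starts mid-row, spans rows 0 and 1: [2, 3, 4, 5]
```

In hardware the "merge" is a shift/collapse network after the two row reads; a real design avoids a second serial cache access by multi-banking or by keeping the previous line, as the bullets note.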
IBM RS/6000 2-way I-Cache and Fetch Logic
[Figure: two-way interleaved I-cache with separate tag arrays]
Instruction Decoding Issues
• Primary decoding tasks
– Identify individual instructions for CISC ISAs
– Determine instruction types (especially branches)
• Drop instructions after (predicted) taken branches and jumps
• Potentially restart the fetch pipeline from the target address
– Determine dependences between instructions
• Two important factors
– Instruction set architecture (CISC vs RISC)
• Determines decoding difficulty
– Pipeline width
• Determines the number of comparators needed for dependence detection
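The pipeline-width point can be made concrete with a quick count (my own back-of-the-envelope model, not a figure from the slides): in a W-wide decode group, each instruction's source registers must be compared against the destination registers of all earlier instructions in the same group, so the comparator count grows quadratically with width.

```python
def dependence_comparators(width, srcs_per_inst=2):
    """Comparators for intra-group dependence detection in a width-wide
    decode group: each of the 2 sources of instruction i is compared
    against the destinations of instructions 0..i-1."""
    return srcs_per_inst * width * (width - 1) // 2

for w in (2, 4, 8):
    print(w, dependence_comparators(w))  # 2:2, 4:12, 8:56 (quadratic growth)
```

Going from 2-wide to 8-wide multiplies the comparator count by 28x, one reason wide decode is expensive.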
Pentium Pro Fetch/Decode
I-Cache Pre-decoding in the AMD K5
[Figure: from memory, 8 instruction bytes (64 bits) pass through the predecode logic, which attaches 5 predecode bits per byte; the I-cache stores 8 instruction bytes + predecode bits (64 + 40 bits); fetch delivers 16 instruction bytes + predecode bits (128 + 80 bits) to decode, translate, and dispatch, producing up to 4 ROPs]
• Overheads of pre-decoding?
• Any uses for RISC processors?
b. Instruction Dispatch & Issue
[Figure: instruction fetching → instruction decoding → instruction dispatching → FU 1, FU 2, FU 3, … FU n → instruction execution]
Centralized Reservation Station
[Figure: Dispatch (issue) → centralized reservation station (dispatch buffer) → Execute → Completion buffer]
Distributed Reservation Station
[Figure: Dispatch buffer → Dispatch → distributed reservation stations → Issue → Execute → Finish → Completion buffer → Complete]
Instruction Fetch Buffer
[Figure: Fetch Unit → fetch buffer → Out-of-order Core]
• Smoothes out the rate mismatch between fetch and execution
– Neither the fetch bandwidth nor the execution bandwidth is consistent
• Fetch bandwidth should be higher than execution bandwidth
– We prefer to have a stockpile of instructions in the buffer to hide cache miss latencies. This requires both raw cache bandwidth and control flow speculation
c. Instruction Execute
• Current trends
– More parallelism ⇒ bypassing very challenging
– Deeper pipelines
– More diversity
• Functional unit types
– Integer
– Floating point
– Load/store (most difficult to make parallel)
– Branch
– Specialized units (media)
Bypass Networks
[Figure: POWER-style pipeline; PC/I-cache with branch scan and branch predict, fetch queue, decode, reorder buffer; BR/CR, FX/LD 1, FX/LD 2, and FP issue queues feeding CR, BR, FX1/LD1, FX2/LD2, FP1, and FP2 units; store queue and D-cache]
• O(n²) interconnect from/to FU inputs and outputs
• Associative tag-match to find operands
• Solutions (hurt IPC, help cycle time)
– Use RF only (IBM Power4) with no bypass network
– Decompose into clusters (Alpha 21264)
Specialized units
• Intel Pentium 4 staggered adders (“Fireball”)
– Run at 2x clock frequency
– Two 16-bit bitslices, with the carry passed between them
– Dependent ops execute on half-cycle boundaries
– Full result not available until a full cycle later
Specialized units
• FP multiply-accumulate: R = (A × B) + C
– Doubles FLOP/instruction
– Loses RISC instruction format symmetry: 3 source operands
– Widely used
[Figure: (a) ALUs and shifter; (b) fused datapath computing A × B, then (A × B) + C, then round/normalize]
d. Instruction Complete & Retire
• Out-of-order execution
– ALU instructions
– Load/store instructions
• In-order completion/retirement
– Precise exceptions
– Memory coherence and consistency
• Solutions
– Reorder buffer
– Store buffer
– Load queue snooping (later)
Exceptional Limitations
• Precise exceptions: when an exception occurs on instruction i
– Instruction i has not modified the processor state (regs, memory)
– All older instructions have fully completed
– No younger instruction has modified the processor state (regs, memory)
• How do we maintain precise exceptions in a simple pipelined processor?
• What makes precise exceptions difficult on an
– In-order diversified pipeline?
– Out-of-order pipeline?
Precise Exceptions & OOO Processors
• Solution: force instructions to update processor state in order
– Register and memory updates done in order
– Usually called instruction retirement or graduation or commitment
• Implementation: Re-order Buffer (ROB)
– A FIFO for instruction tracking
– 1 entry per instruction
• PC, register/memory address, new value, ready, exception
• ROB algorithm
– Allocate ROB entries at dispatch time in order (FIFO)
– Instructions update their ROB entry when they complete
– Examine head of ROB for in-order retirement
• If done, retire; otherwise wait (this forces order)
• If exception, flush contents of ROB, restart from instruction PC after handler
Re-order Buffer
[Figure: circular FIFO; in-order allocate, out-of-order complete, in-order retire]
Entry fields: PC | Dst Value | Dst Address | Ready | Exception Info
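The ROB algorithm on the previous slide can be sketched as a small simulation (field and method names are my own; real ROBs are fixed-size circular buffers rather than unbounded queues): allocate in order at dispatch, mark entries ready out of order at completion, and retire only from the head.

```python
from collections import deque

class ReorderBuffer:
    """In-order allocate, out-of-order complete, in-order retire."""
    def __init__(self):
        self.fifo = deque()   # head = oldest in-flight instruction
        self.entries = {}     # pc -> {value, ready, exception}

    def dispatch(self, pc):   # allocate in order (FIFO)
        self.fifo.append(pc)
        self.entries[pc] = {"value": None, "ready": False, "exception": False}

    def complete(self, pc, value, exception=False):   # out of order
        self.entries[pc].update(value=value, ready=True, exception=exception)

    def retire(self):
        """Examine the head; retire only ready instructions, in order."""
        retired = []
        while self.fifo and self.entries[self.fifo[0]]["ready"]:
            pc = self.fifo.popleft()
            if self.entries.pop(pc)["exception"]:
                self.fifo.clear()      # flush ROB contents; restarting fetch
                self.entries.clear()   # at the handler is not modeled here
                break
            retired.append(pc)
        return retired

rob = ReorderBuffer()
for pc in (0, 4, 8):
    rob.dispatch(pc)
rob.complete(4, value=7)
print(rob.retire())   # []: pc 4 is done but pc 0 (the head) is not; order is forced
rob.complete(0, value=1)
print(rob.retire())   # [0, 4]: in-order retirement; pc 8 still waits
```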
Re-order Buffer Issues
• Problem
– Already-computed results must wait in the reorder buffer when they may be needed by other instructions which could otherwise execute
– Data-dependent instructions must wait until the result has been committed to the register
• Solution
– Forwarding from the re-order buffer
• Allows data in the ROB to be used in place of data in registers
– Forwarding implementation 1: search the ROB for values when registers are read
• Only the latest entry in the ROB can be used
• Many comparators, but the logic is conceptually simple
– Forwarding implementation 2: use a scoreboard to track results in the ROB
• Register scoreboard notes if the latest value is in the ROB and the # of the ROB entry
• Don’t need to track the FU any more; an instruction is fully identified by its ROB entry
ROB Alternatives: History File
• A FIFO buffer operated similarly to the ROB, but it logs old register values
– Entry format just like the ROB
• Algorithm
– Entries allocated in order at dispatch
– Entries updated out of order at completion time
• Destination register updated immediately
• Old value of the register noted in the history file
– Examine head of history file in order
• If no exception, just de-allocate
• If exception, reverse the history file and undo all register updates before flushing
• Advantage: no need for separate forwarding from the ROB
• Disadvantage: slower recovery from exceptions
ROB Alternatives: Future File
• Use two separate register files:
– Architectural file ⇒ represents sequential execution
– Future file ⇒ updated immediately upon instruction execution and used as the working file
• Algorithm
– When an instruction reaches the head of the ROB, it is committed to the architectural file
– On an exception, changes are brought over from the architectural file to the future file based on which instructions are still represented in the ROB
• Advantage: no need for separate forwarding from the ROB
• Disadvantage: slower recovery from exceptions
Impediments to High IPC
[Figure: FETCH (I-cache, branch predictor, instruction buffer) ⇒ instruction flow; DECODE (buffer); EXECUTE (integer, floating-point, media, memory units; D-cache, store queue) ⇒ memory data flow; COMMIT (reorder buffer, ROB) ⇒ register data flow]
2. Instruction Flow Techniques
Control Flow Graph
• Shows possible paths of control flow through basic blocks
[Figure: CFG with BB 1 at the top, BB 2 below it, BB 3 and BB 4 as alternatives, joining at BB 5, which loops back to BB 2]

main:  addi r2, r0, A        (BB 1)
       addi r3, r0, B
       addi r4, r0, C
       addi r5, r0, N
       add  r10, r0, r0
       bge  r10, r5, end
loop:  lw   r20, 0(r2)       (BB 2)
       lw   r21, 0(r3)
       bge  r20, r21, T1
       sw   r21, 0(r4)       (BB 3)
       b    T2
T1:    sw   r20, 0(r4)       (BB 4)
T2:    addi r10, r10, 1      (BB 5)
       addi r2, r2, 4
       addi r3, r3, 4
       addi r4, r4, 4
       blt  r10, r5, loop
end:

• Control Dependence
– Node X is control dependent on Node Y if the computation in Y determines whether X executes
Mapping CFG to Linear Instruction Sequence
[Figure: the basic blocks A, B, C, D of a CFG laid out in two different linear memory orders (e.g. A-B-C-D vs. A-B-D-C); the chosen layout determines which control-flow edges become taken branches]
Disruption of Sequential Control Flow
[Figure: the superscalar pipeline (Fetch → instruction/decode buffer → Decode → dispatch buffer → Dispatch → reservation stations → Issue → Execute → Finish → reorder/completion buffer → Complete → store buffer → Retire) with the branch unit feeding back to Fetch]
Condition Resolution
[Figure: the same pipeline; the branch condition is resolved late, either from the condition-code register or by comparing general-purpose register values in Execute, and the outcome is fed back to Fetch]
Target Address Generation
[Figure: the same pipeline; the branch target can be generated at different stages depending on addressing mode (PC-relative early, register indirect, or register indirect with offset later) and is fed back to Fetch]
What’s So Bad About Branches?
• Performance penalties
– Use up execution resources
– Fragmentation of I-cache lines
– Disruption of sequential control flow
• Need to determine branch direction (conditional branches)
• Need to determine branch target
– Robs instruction fetch bandwidth and ILP
• Example:
– We have an N-way superscalar processor (N is very large)
– A branch every 5 instructions, and each branch takes 3 cycles to resolve
– What is the effective fetch bandwidth?
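Working the example (my arithmetic; the slides only pose the question): with N effectively unbounded, fetch can supply the whole 5-instruction run in one shot, but then stalls until the branch resolves 3 cycles later, so the branch, not N, caps throughput.

```python
insts_per_branch = 5    # one branch every 5 instructions
resolve_cycles = 3      # cycles until the branch outcome is known

# At best, one 5-instruction run is fetched per 3-cycle resolution window:
effective_bandwidth = insts_per_branch / resolve_cycles
print(round(effective_bandwidth, 2))  # ~1.67 instructions/cycle, however large N is
```

This is why the lecture turns to branch prediction next: predicting direction and target lets fetch run ahead instead of idling for the resolution cycles.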
Riseman and Foster’s Study
• 7 benchmark programs on CDC-3600
• Assume infinite machine:
– Infinite memory and instruction stack, register file, functional units
– Consider only true dependency at data-flow limit
• If bounded to a single basic block, i.e. no bypassing of branches ⇒ maximum speedup is 1.72
• Suppose one can bypass conditional branches and jumps (i.e. assume the actual branch path is always known such that branches do not impede instruction execution)

Br. Bypassed:  0     1     2     8     32    128
Max Speedup:   1.72  2.72  3.62  7.21  24.4  51.2
Determining Branch Direction
Problem: cannot fetch subsequent instructions until the branch direction is determined
• Minimize penalty
– Move the instruction that computes the branch condition away from the branch (ISA & compiler)
– 3 branch components can be separated
• Specify end of BB, specify condition, specify target
• Make use of penalty
– Bias for not-taken
– Fill delay slots with useful/safe instructions (ISA & compiler)
– Follow both paths of execution (hardware)
– Predict branch direction (hardware)
Determining Branch Target
Problem: cannot fetch subsequent instructions until the branch target is determined
• Minimize delay
– Generate the branch target early in the pipeline (ISA & compiler)
• Make use of delay
– Bias for not taken
– Predict branch target (hardware)
• PC-relative vs register indirect targets
Branch Condition Speculation Options
• Software Prediction
– Encode an extra bit in the branch instruction
• Predict not taken: set bit to 0
• Predict taken: set bit to 1
– Bit set by compiler or user; can use profiling
– Static prediction, same behavior every time
• Biased For Not Taken
– Does not affect the instruction set architecture
– Not effective for loops
• Prediction Based on Branch Offsets
– Positive offset: predict not taken
– Negative offset: predict taken
• Prediction Based on History
Branch Instruction Speculation
[Figure: in Fetch, an FA-mux selects the nPC sent to the I-cache, either the sequential nPC = PC+4 or a speculative target from a branch predictor using a BTB; the predictor attaches the speculative condition to instructions in the decode/dispatch buffers and is updated with the target address and history when the branch resolves in Execute]
Branch Target Buffer (BTB)
• A small “cache-like” memory in the instruction fetch stage
[Table: entries looked up by the current PC, with fields Branch Inst. Address (tag) | Branch History | Branch Target (Most Recent)]
• Remembers previously executed branches, their addresses, information to aid prediction, and most recent target addresses
• I-fetch stage compares the current PC against those in the BTB to “guess” nPC
– If matched, then a prediction is made; else nPC = PC+4
– If predicted taken, then nPC = target address in BTB; else nPC = PC+4
• When the branch is actually resolved, the BTB is updated
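The lookup/update protocol in the bullets can be sketched as follows (a toy, fully associative model with invented names; real BTBs are set-associative and fetch-width aware):

```python
class BTB:
    def __init__(self):
        self.table = {}  # branch PC (tag) -> (target, predict_taken)

    def next_pc(self, pc):
        """I-fetch stage: guess nPC from the current PC."""
        entry = self.table.get(pc)
        if entry is None:            # no match: nPC = PC + 4
            return pc + 4
        target, taken = entry
        return target if taken else pc + 4

    def update(self, pc, target, taken):
        """Called when the branch actually resolves."""
        self.table[pc] = (target, taken)

btb = BTB()
print(hex(btb.next_pc(0x100)))   # miss: fall through to 0x104
btb.update(0x100, target=0x200, taken=True)
print(hex(btb.next_pc(0x100)))   # hit, predicted taken: 0x200
```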
More on BTB (aka BTAC)
• Typically a large associative structure
– Pentium 3: 512 entries, 4-way set-associative
– Opteron: 2K entries, 4-way set-associative
• Entry format
– Valid bit, address tag (PC), target address, fall-through BB address (length of BB), branch type info, branch direction prediction
• BTB provides both target and direction prediction
– More on direction prediction later on
• Multi-cycle BTB access?
– The case in most modern processors (2-cycle BTB)
– Start BTB access along with the I-cache in cycle 0
– In cycle 1, fetch from PC+N (predict not-taken)
– In cycle 2, use the BTB output to verify
• 1-cycle fetch bubble if the branch was taken
I-Cache & BTB Integration
• Why does this make sense?
• Place a BTB entry in each cache line
– Each cache line tells you which line to address next
– Do not need the full PC, just an index into the cache + a way select
– This is called way & line prediction
• Implemented in the Alpha 21264 processor
– On refills, the prediction value points to the next sequential fetch line
– Prediction is trained later on as the program executes…
• When correct targets are known…
– Line prediction is verified in the next cycle. If the line/way prediction is incorrect, the slot stage is flushed and the PC is generated using instruction info and direction prediction
Alpha 21264 Line & Way Prediction
Source: IEEE Micro, March-April 1999
Branch Direction: UCB Study [Lee and Smith, 1984]
• Benchmarks
– 26 programs (traces on IBM 370, DEC PDP-11, CDC 6400)
– Use trace-driven simulation with parameterized machine models
• Branch types
– Unconditional: always taken or always not taken
– Subroutine call: always taken
– Loop control: usually taken (loop back)
– Decision: either way, e.g. IF-THEN-ELSE
– Computed GOTO: always taken, with changing target
– Supervisor call: always taken
– “Execute”: always taken (IBM 370)
• Branch behavior: Taken vs Not Taken

       IBM1   IBM2   IBM3   IBM4   DEC    CDC    Average
T      0.640  0.657  0.704  0.540  0.738  0.778  0.676
NT     0.360  0.343  0.296  0.460  0.262  0.222  0.324
Branch Prediction Function
• Based on opcode only (%)

       IBM1   IBM2   IBM3   IBM4   DEC    CDC
       66     69     71     55     80     78

• Based on history of branch
– Branch prediction function F (X1, X2, ....)
– Use up to 5 previous branches for history (%)

       IBM1   IBM2   IBM3   IBM4   DEC    CDC
0      64.1   64.4   70.4   54.0   73.8   77.8
1      91.9   95.2   86.6   79.7   96.5   82.3
2      93.3   96.5   90.8   83.4   97.5   90.6
3      93.7   96.7   91.2   83.5   97.7   93.5
4      94.5   97.0   92.0   83.7   98.1   95.3
5      94.7   97.1   92.2   83.9   98.2   95.7
Example Prediction Algorithm based on History
• Prediction accuracy approaches its maximum with as few as 2 preceding branch occurrences used as history
[Figure: FSM whose states are the last two branch outcomes (TT, TN, NT, NN), each state storing the next prediction]
• Results (%)

       IBM1   IBM2   IBM3   IBM4   DEC    CDC
       93.3   96.5   90.8   83.4   97.5   90.6
Other Prediction Algorithms Based on History
[Figure: 2-bit FSM variants, a saturation counter and a counter with hysteresis, with states T, t?, n?, N transitioning on taken/not-taken outcomes]
• Combining prediction accuracy with BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide a net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement.
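The saturating-counter scheme in the figure can be written out directly (a generic sketch; the state encoding and the loop-branch example are my own):

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0,1 predict not taken; 2,3 predict
    taken. Saturation means one anomalous outcome cannot flip a strong state."""
    def __init__(self, state=2):           # start weakly taken (assumed)
        self.state = state

    def predict(self):
        return self.state >= 2             # True = predict taken

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A loop branch: taken 9 times, then not taken once on loop exit.
ctr = TwoBitCounter()
mis = 0
for taken in [True] * 9 + [False]:
    if ctr.predict() != taken:
        mis += 1
    ctr.update(taken)
print(mis)  # 1: only the final loop-exit iteration mispredicts
```

The hysteresis is what makes this good for loops: a 1-bit last-outcome predictor would mispredict twice per loop visit (on exit and on re-entry); the 2-bit counter mispredicts only once.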
Designing History-based FSMs for Prediction
• IBM Study [Nair, 1992]
– 2-bit FSM as good as n-bit FSM
– Saturating counter as good as any FSM
• 2^20 possible FSMs, 5248 actually interesting

Benchmark   Optimal  “Counter”
spice2g6    97.2     97.0
doduc       94.3     94.3 *
gcc         89.1     89.1
espresso    89.1     89.1 *
li          87.1     86.8
eqntott     87.9     87.2

[Figure: the optimal FSMs found per benchmark, with initial states marked and states labeled Predict NT / Predict T; * marks cases where the optimal FSM is the saturating counter]
Simple Branch History Table (BHT)
• Can be integrated with the BTB
• See any shortcoming?
PowerPC 604 Branch Predictor
[Figure: fetch-address (FA) mux feeding the I-cache; the Branch Target Address Cache (BTAC) and Branch History Table (BHT) are looked up in parallel with fetch, against the +4 sequential path; the BHT prediction is applied at decode and the BHT and BTAC are updated when the branch executes; dispatch feeds reservation stations for SFX, SFX, CFX, FPU, LS, and BRN units]
• BTAC:
– 64 entries, fully associative
– hit ⇒ predict taken
• BHT:
– 512 entries, direct mapped
– 2-bit saturating counter, history-based prediction
– overrides the BTAC prediction
Branch Correlation
• So far, the prediction of each static branch instruction is based solely on its own past behavior, and not the behaviors of other neighboring static branch instructions
• How about this one?
  x = 0;
  if (someCondition) x = 3;          /* Branch A */
  if (someOtherCondition) y += 19;   /* Branch B */
  if (x <= 0) dosomething();         /* Branch C */
– If Branch A is taken, then x is 3 and Branch C cannot be taken: C’s outcome is correlated with A’s
• Other correlation examples?
Global History Branch Predictor
• BHR: a shift register for global history (shift in the latest branch outcome each cycle)
• Advantages & shortcomings?
Local History Branch Predictor
• What could be the patterns detected by such a predictor?
• Advantages & shortcomings?
2-Level Adaptive Prediction [Yeh & Patt]
• Nomenclature: {G,P}A{g,p,s}
[Figure: the PC selects a Branch History Shift Register (BHSR, shifted left on update); the history pattern indexes a Pattern History Table (PHT) of 2-bit FSMs; the selected FSM supplies the prediction and is updated with the branch result]
• To achieve 97% average prediction accuracy:
– G: (1) BHR of 18 bits; g: (1) PHT of 2^18 × 2 bits; total = 524 kbits
– P: (512×4) BHRs of 12 bits; g: (1) PHT of 2^12 × 2 bits; total = 33 kbits
– P: (512×4) BHRs of 6 bits; s: (512) PHTs of 2^6 × 2 bits; total = 78 kbits
Example: Global BHSR Scheme (GAs)
[Figure: j bits of the branch address concatenated with the k bits of a single global Branch History Shift Register (BHSR) index a BHT of 2^(j+k) entries × 2 bits each, producing the prediction]
Example: Per-Branch BHSR Scheme (PAs)
[Figure: i bits of the branch address select one of 2^i per-branch Branch History Shift Registers (BHSRs, k bits each); j bits of the branch address concatenated with the selected k history bits index a standard BHT of 2^(j+k) entries × 2 bits each, producing the prediction]
Gshare Branch Prediction [McFarling]
[Figure: j bits of the branch address XORed with the k bits of the global Branch History Shift Register (BHSR) index a BHT of 2^max(j,k) entries × 2 bits each, producing the prediction]
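A compact sketch of gshare as drawn above (the table size, history length, and training example are my own choices): XORing the global history into the index lets the same branch occupy different counters under different histories, which is how correlation is captured.

```python
HIST_BITS = 8
SIZE = 1 << HIST_BITS

class Gshare:
    def __init__(self):
        self.ghr = 0               # global branch history shift register
        self.pht = [2] * SIZE      # 2-bit saturating counters, weakly taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & (SIZE - 1)   # PC bits XOR history

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & (SIZE - 1)

# A branch at 0x40 that strictly alternates taken/not-taken: hopeless for a
# per-branch counter alone, easy once history is folded into the index.
g = Gshare()
hits = 0
for n in range(100):
    taken = (n % 2 == 0)
    hits += (g.predict(0x40) == taken)
    g.update(0x40, taken)
print(hits)  # high 90s: after a short warm-up, every prediction is correct
```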
Next Lecture
• More Branch Prediction